
Very Fast Data Loading with {vroom}

The vroom package works much like readr, except that it reads data considerably faster.

1. Loading CSV Files with vroom()

readr is pretty good at reading relatively large data frames. Consider the following example: a 7.3 MB CSV file containing genomic and genetic information spread across 22,083 rows and 45 columns. You can find the source of the data here.

Let’s use readr to load the file:

start <- Sys.time()

genomic <- readr::read_csv(file = "genomic.csv")

end <- Sys.time()

end - start

We can see from the above that it took approximately 300 milliseconds to read the 7.3 MB file (not bad). Now let’s try with a larger file. The following data frame weighs 631 MB and contains 3,717,964 tweets about top companies from 2015 to 2020. You can find the data frame here.

Let’s use readr at first:

start <- Sys.time()

tweets_readr <- readr::read_csv(file = "Tweet.csv")

end <- Sys.time()

end - start

As expected, it took longer to read the CSV file, approximately 16 seconds, which is not alarming, but in a production environment it can be annoying. Let’s use vroom and see how it handles the situation:

start <- Sys.time()

tweets_vroom <- vroom::vroom(
  file = "Tweet.csv",
  show_col_types = FALSE
)
end <- Sys.time()

end - start

Yes! With vroom it’s considerably faster.

We can get a more robust comparison using the microbenchmark package, which allows us to run code expressions several times and compare the minimum, average, or median execution time.

library(microbenchmark)

microbenchmark(
  readr::read_csv(file = "genomic.csv"), 
  vroom::vroom(file = "genomic.csv"), 
  times = 3
)

Just as with readr and readxl, it is possible to speed up data loading by selecting a subset of columns instead of the whole data frame. In our example, let’s say we’re only interested in the tweet message itself; we can load the corresponding column as follows:

start <- Sys.time()

tweet_sm <- vroom::vroom(
  file = "Tweet.csv",
  col_select = body
)

end <- Sys.time()

end - start

Exercise 1:

🧠 🧠 🧠 🧠 🧠

> Q1: Using the microbenchmark package, compare readr::read_csv() with base R’s read.csv(). Use the genomic.csv file and then the Tweet.csv file. What do you observe?
>
> Q2: Still with microbenchmark, compare readr::read_csv() with vroom::vroom() using the genomic.csv file. Is there a difference?
>
> Q3: data.table is considered one of the most performant packages in terms of speed; it has a function called fread() that reads CSV files. Compare data.table::fread() and vroom::vroom(). Which package is the fastest?
>
> Q4: Using Sys.time(), write a function that returns the execution time of an expression (see the sketch just below).

🧠 🧠 🧠 🧠 🧠
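As a hint for Q4, here is a minimal sketch of such a timing helper (the function name time_it is just an illustrative choice):

time_it <- function(expr) {
  start <- Sys.time()
  force(expr)  # evaluate the expression supplied by the caller
  end <- Sys.time()
  end - start  # returns a difftime object
}

# Example usage:
# time_it(readr::read_csv(file = "genomic.csv"))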

2. Reading Multiple Files at Once with vroom()

If you have multiple CSV files that share the same column names and you want to concatenate them into one big data frame, the vroom package, used in conjunction with another small package called fs, makes it a breeze.

Let’s copy our Tweet.csv file 5 times and put the copies inside a separate folder that we will call many_tweets.
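If you prefer to script this step rather than copy the files by hand, a minimal sketch using fs could look like the following (the numbered file names Tweet1.csv to Tweet5.csv are my own assumption):

# Create the folder and copy Tweet.csv into it five times.
# The numbered file names are an assumption; adapt them as needed.
fs::dir_create("many_tweets")

for (i in 1:5) {
  fs::file_copy(
    path = "Tweet.csv",
    new_path = paste0("many_tweets/Tweet", i, ".csv")
  )
}

With the copies in place, the fs package lets us collect the paths of all the CSV files inside the directory (note that here I’m not providing the full path within the path parameter as I’m working inside a project, and you should too):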

tweet_paths <- fs::dir_ls(path = "many_tweets/")

We can even get the sizes of our CSV files:

file_info <- fs::dir_info(path = "many_tweets/")

Of course, we can also compute the total size of the directory:

sum(file_info$size)

So, if you run the above code, you’ll see that we have a directory with more than 3 GB of CSV files. Let’s read them all with vroom():


conc_data <- vroom::vroom(file = tweet_paths)

As expected, we end up with a data frame that contains all the rows from the 5 individual CSV files. If you inspect conc_data, you’ll see that it has exactly 18,589,820 rows, which corresponds to 3,717,964 (the number of rows in each individual file) * 5 (the number of files). What’s really cool is that it took only about 5 seconds to read 3 GB of files.
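A quick sanity check confirms the row count:

nrow(conc_data)  # 18589820, i.e. 3717964 rows per file * 5 files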

It is also possible to add an ID column to the concatenated data frame so that you can track the origin of each row. This is extremely useful, especially in situations where the name of the file conveys some information.


conc_data_id <- vroom::vroom(file = tweet_paths, 
                             id = "origin")

Exercise 2:

🧠 🧠 🧠 🧠 🧠

> Q1: Without modifying the folder of Tweet CSV files, concatenate Tweet3, Tweet4 and Tweet5 into one big data frame. Don’t forget to identify the source of each file.
>
> Q2: Do the same as in Q1, but now make sure that the tweet_id column is the first column on the left. Also, during the loading process, rename the writer column to poster.
>
> Q3: Read the same data again; this time, make sure that the comment_num, retweet_num and like_num columns are of integer type. All the remaining columns should be of character type.
>
> Q4: Read the same data, this time without the comment_num column (a hint sketch of the relevant vroom() arguments follows below).

🧠 🧠 🧠 🧠 🧠
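As a hint, most of these questions revolve around two vroom() arguments: col_select and col_types. Here is a minimal sketch, assuming the copies are named Tweet1.csv through Tweet5.csv and the columns are named as in the questions:

# Q1: read only some of the files by subsetting the vector of paths
# (combine with the id argument shown above to track each file).
subset_paths <- tweet_paths[3:5]  # assumed to be Tweet3.csv, Tweet4.csv and Tweet5.csv

# Q2/Q3: col_select reorders and renames columns; col_types fixes their types.
hint <- vroom::vroom(
  file = subset_paths,
  col_select = c(tweet_id, poster = writer, everything()),
  col_types = vroom::cols(
    comment_num = vroom::col_integer(),
    retweet_num = vroom::col_integer(),
    like_num = vroom::col_integer(),
    .default = vroom::col_character()
  )
)

# Q4: col_select also accepts negative selection to drop a column,
# e.g. col_select = -comment_num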
