1. Loading CSV Files with vroom()
readr is pretty good at reading relatively large data frames. Consider the following dataset: a 7.3 MB CSV file containing genomic and genetic information across 22,083 rows and 45 columns. You can find the source of the data frame here.
Let’s use readr to load the file:
start <- Sys.time()
genomic <- readr::read_csv(file = "genomic.csv")
end <- Sys.time()
end - start
We can see from the above that it took approximately 300 milliseconds to read the 7 MB file (not bad). Now let’s try with a larger file. The following data frame weighs 631 MB and contains 3,717,964 tweets about top companies from 2015 to 2020. You can find the data frame here.
Let’s use readr first:
start <- Sys.time()
tweets_readr <- readr::read_csv(file = "Tweet.csv")
end <- Sys.time()
end - start
As expected, it took longer to read the CSV file, approximately 16 seconds, which is not alarming but can be annoying in a production environment. Let’s use vroom and see how it handles the situation:
start <- Sys.time()
tweets_vroom <- vroom::vroom(
file = "Tweet.csv",
show_col_types = FALSE
)
end <- Sys.time()
end - start
Yes! With vroom it’s considerably faster.
We can get a more robust comparison using the microbenchmark package, which allows us to run code expressions several times and compare the minimum, average, or median execution time.
library(microbenchmark)
microbenchmark(
readr::read_csv(file = "genomic.csv"),
vroom::vroom(file = "genomic.csv"),
times = 3
)
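If you store the result, calling summary() on it returns the minimum, mean, median (and more) for each expression as a data frame. A minimal sketch, assuming the same two read calls as above:

bench <- microbenchmark(
  readr::read_csv(file = "genomic.csv"),
  vroom::vroom(file = "genomic.csv"),
  times = 3
)
# Columns include min, lq, mean, median, uq and max, in the displayed time unit
summary(bench)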
Just as with readr and readxl, it is possible to speed up data loading by selecting a subset of columns instead of the whole data frame. In our example, let’s say we’re only interested in the tweet message itself; we can load the corresponding column as follows:
start <- Sys.time()
tweet_sm <- vroom::vroom(
file = "Tweet.csv",
col_select = body
)
end <- Sys.time()
end - start
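Note that col_select uses tidyselect syntax, so you can also pick several columns at once. A minimal sketch, assuming the Tweet.csv file has a tweet_id column (as used in the exercises below):

tweet_sm2 <- vroom::vroom(
  file = "Tweet.csv",
  col_select = c(tweet_id, body)
)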
Exercise 1:
🧠🧠🧠🧠🧠
> Q1: Using the microbenchmark package, compare readr::read_csv() with base R’s read.csv(). Use the genomic.csv file and then the Tweet.csv file. What do you observe?
> Q2: Still with microbenchmark, compare readr::read_csv() with vroom::vroom() using the genomic.csv file. Is there a difference?
> Q3: data.table is considered one of the most performant packages in terms of speed; it has a function called fread() that reads CSV files. Compare data.table::fread() and vroom::vroom(). Which package is the fastest?
> Q4: Using Sys.time(), write a function that returns the execution time of an expression.
🧠🧠🧠🧠🧠
- Reading multiple files at once with vroom()
If you have multiple CSV files that share the same column names and you want to concatenate them into one big data frame, it’s a breeze to achieve with the vroom package in conjunction with another small package called fs.
Let’s copy our Tweet.csv file 5 times and put the copies inside a separate folder called many_tweets. The fs package allows us to get the path of each of those CSV files inside the directory (note that I’m not providing the full path in the path parameter because I’m working inside a project, and you should too):
tweet_paths <- fs::dir_ls(path = "many_tweets/")
We can even get the sizes of our CSV files:
file_info <- fs::dir_info(path = "many_tweets/")
Of course, we can get the sum of the overall weight of the directory:
sum(file_info$size)
So, if you run the above code, you’ll see that we have a directory with more than 3 GB of CSV files. Let’s read them all with vroom():
conc_data <- vroom::vroom(file = tweet_paths)
As expected, we end up with a single data frame that contains all the rows from the 5 individual CSV files. If you inspect conc_data, you’ll see that it has exactly 18,589,820 rows, which corresponds to 3,717,964 (the number of rows in each individual file) * 5 (the number of files). What’s really cool is that it took only about 5 seconds to read 3 GB of files.
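As a quick sanity check, here is a minimal sketch assuming conc_data was loaded as above:

# The concatenated data frame should contain 5 times the rows of a single file
nrow(conc_data)    # 18589820
3717964 * 5        # 18589820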
It is also possible to add an ID column to the concatenated data frame so that you can track the origin of each row. This is extremely useful, especially in situations where the name of the file conveys some information.
conc_data_id <- vroom::vroom(file = tweet_paths,
id = "origin")
Exercise 2:
🧠🧠🧠🧠🧠
> Q1: Without modifying the folder of Tweet CSV files, concatenate Tweet3, Tweet4 and Tweet5 into one big file. Don’t forget to identify the source of each file.
> Q2: Do the same as in Q1, but now make sure that the tweet_id column is the first column on the left. Also, during the loading process, rename the writer column to poster.
> Q3: Read the same data, this time making sure that the comment_num, retweet_num and like_num columns are of integer type. All the remaining columns should be of character type.
> Q4: Read the same data, but without the comment_num column.
🧠🧠🧠🧠🧠