1. Reading a CSV File
A Comma-Separated Values file (with a .csv extension) is a text file in which the values are (usually) separated by a comma (,) and each row represents an observation. It is used to store data in tabular format and is used extensively in production thanks to its flexibility and light weight.
Take a look at the following CSV file, which is hosted as a GitHub Gist. The CSV file is a representation of the famous mpg data frame, which is part of the ggplot2 package.
First, notice that the first row is special: it is used to identify/name the columns. Second, each value is separated by a comma. Finally, each row represents an individual observation.
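For illustration, the first three lines of such a file (taken from the mpg CSV used below) look like this: a header row, then one observation per line:
manufacturer,model,displ,year,cyl,trans,drv,cty,hwy,fl,class
audi,a4,1.8,1999,4,auto(l5),f,18,29,p,compact
audi,a4,1.8,1999,4,manual(m5),f,21,29,p,compact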
So, how do we read a comma-separated values file? We will use the read_csv() function. Its first argument is the most important: it indicates the path to the file to import. Since our CSV file is stored remotely (not locally), we can simply use the URL of the file to read it:
library(tidyverse)
url <- "https://gist.githubusercontent.com/feddelegrand7/b366864aabf9653361f461cbf972d97c/raw/a62c4672f2f5824b2634a66c948e6258d7c65323/mpg.csv"
mpg_data <- read_csv(file = url)
head(mpg_data)
## # A tibble: 6 x 11
## manufacturer model displ year cyl trans drv cty hwy fl class
## <chr> <chr> <dbl> <dbl> <dbl> <chr> <chr> <dbl> <dbl> <chr> <chr>
## 1 audi a4 1.8 1999 4 auto(l5) f 18 29 p compa~
## 2 audi a4 1.8 1999 4 manual(m5) f 21 29 p compa~
## 3 audi a4 2 2008 4 manual(m6) f 20 31 p compa~
## 4 audi a4 2 2008 4 auto(av) f 21 30 p compa~
## 5 audi a4 2.8 1999 6 auto(l5) f 16 26 p compa~
## 6 audi a4 2.8 1999 6 manual(m5) f 18 26 p compa~
When reading the data, readr prints a set of information. It specifies the dimensions of the data frame, 234 rows and 11 columns. It also specifies the delimiter used to parse the data, in this case the comma (,). Finally, you get useful insight into the data structure, namely the type of each column. In our example, we can see that 6 columns are of type character while 5 are of type double.
You can always retrieve the column types using the spec() function:
readr::spec(x = mpg_data)
## cols(
## manufacturer = col_character(),
## model = col_character(),
## displ = col_double(),
## year = col_double(),
## cyl = col_double(),
## trans = col_character(),
## drv = col_character(),
## cty = col_double(),
## hwy = col_double(),
## fl = col_character(),
## class = col_character()
## )
If you want a more compact representation of the types, you can use:
as.character(spec(mpg_data))
## [1] "ccdddccddcc"
readr detects the column types using a subset of the data. By default, it uses the first 1000 rows to infer the data structure. If your data frame has fewer rows, it will use all the rows that are read. For more info, see the guess_max and n_max parameters. You can override this behaviour and set the column types manually using the col_types parameter:
url <- "https://gist.githubusercontent.com/feddelegrand7/b366864aabf9653361f461cbf972d97c/raw/a62c4672f2f5824b2634a66c948e6258d7c65323/mpg.csv"
mpg_data <- read_csv(file = url, col_types = "ccccccccccc", guess_max = 1)
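The related n_max parameter caps how many rows are read at all, which is handy for a quick preview of a large file. A minimal sketch:
# Read only the first 10 observations (the header row is not counted)
mpg_preview <- read_csv(file = url, n_max = 10, show_col_types = FALSE)
nrow(mpg_preview)
## [1] 10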
Exercise 1:
🧠🧠🧠🧠🧠
Q1: Read the above CSV data frame as a local file.
You will have to:
Create an R Project;
Download the CSV file into your R Project directory (ideally into a data directory);
Create a new script and load the CSV file into your R session;
Print the first six (06) observations.
🧠🧠🧠🧠🧠
2. Cleaning on the Fly
Consider the following data frame. It’s a CSV file from Kaggle that contains some information about World Cup games from 1930 until 2018. If you check the file, you can see that it has 5 commented lines at the top. Therefore, if we read it directly using read_csv(), we get parsing failures:
url <- "https://gist.githubusercontent.com/feddelegrand7/e2f5e302c45f48ed9dcc69e8539f9710/raw/9273fe41dd7f6bf3cebd5c415ffac20be0343d24/world_cup.csv"
data_world_cup <- read_csv(url)
## Warning: One or more parsing issues, see `problems()` for details
## Rows: 8241 Columns: 1
## ── Column specification ────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): # The data frame comes from Kaggle:
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(data_world_cup, 10)
## # A tibble: 10 x 1
## `# The data frame comes from Kaggle:`
## <chr>
## 1 # Description: -------------------------------------------------------------~
## 2 # Scores and half-time scores of all euro cup games from 1930 until and incl~
## 3 # Includes all regular matches fully played.
## 4 # Excludes forfeits or otherwise no fully played matches.
## 5 edition,year,venue,round,team_1,team_2,score,score_halftime
## 6 1930-URUGUAY,1930,Montevideo.,GROUP_STAGE,Yugoslavia (<U+0408><U+0443><U+0433><U+043E><U+0441><U+043B><U+0430><U+0432><U+0438><U+0458><U+0430>),Bolivia,4~
## 7 1930-URUGUAY,1930,Montevideo.,1/2_FINAL,Uruguay,Yugoslavia (<U+0408><U+0443><U+0433><U+043E><U+0441><U+043B><U+0430><U+0432><U+0438><U+0458><U+0430>),6-1~
## 8 1930-URUGUAY,1930,Montevideo.,1/2_FINAL,Argentina,USA,6-1,1-0
## 9 1930-URUGUAY,1930,Montevideo.,GROUP_STAGE,Argentina,Chile,3-1,2-1
## 10 1930-URUGUAY,1930,Montevideo.,GROUP_STAGE,Uruguay,Romania (România),4-0,4-0
You can see that read_csv() gives pretty informative messages concerning the source of the failure and the number of parsing failures, in this example 8235, which corresponds to the number of rows (comments not included). You can use the problems() function to get a tibble of the parsing failures.
parsing_failures <- problems(data_world_cup)
head(parsing_failures, 10)
## # A tibble: 10 x 5
## row col expected actual file
## <int> <int> <chr> <chr> <chr>
## 1 6 8 1 columns 8 columns ""
## 2 7 8 1 columns 8 columns ""
## 3 8 8 1 columns 8 columns ""
## 4 9 8 1 columns 8 columns ""
## 5 10 8 1 columns 8 columns ""
## 6 11 8 1 columns 8 columns ""
## 7 12 8 1 columns 8 columns ""
## 8 13 8 1 columns 8 columns ""
## 9 14 8 1 columns 8 columns ""
## 10 15 8 1 columns 8 columns ""
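With thousands of failures, a quick aggregation helps to see whether they all share the same cause. A minimal sketch using dplyr (loaded as part of the tidyverse):
# Count the failures by expected vs. actual shape
dplyr::count(parsing_failures, expected, actual)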
Most of the time, readr will read the data even when there is a parsing issue. You can always inspect the issues using the problems() function; however, you might want to stop the execution of your script whenever there is a parsing problem (especially if your script takes a long time to execute). For this purpose, readr provides the stop_for_problems() function, which throws an error when the parsing process fails:
world_cup_url <- "https://gist.githubusercontent.com/feddelegrand7/e2f5e302c45f48ed9dcc69e8539f9710/raw/9273fe41dd7f6bf3cebd5c415ffac20be0343d24/world_cup.csv"
world_cup_data <- read_csv(file = world_cup_url, show_col_types = FALSE)
## Warning: One or more parsing issues, see `problems()` for details
stop_for_problems(world_cup_data)
## Error: 8235 parsing failures
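If halting the whole script is too drastic, one alternative is to catch the error yourself and decide how to proceed. A minimal sketch using base R’s tryCatch():
world_cup_checked <- tryCatch(
  {
    stop_for_problems(world_cup_data)
    world_cup_data # returned only when parsing was clean
  },
  error = function(e) {
    message("Parsing failed: ", conditionMessage(e))
    NULL # downstream code can test for NULL and react
  }
)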
Fortunately, this issue is relatively easy to fix. We can set the skip parameter of read_csv() to 5, indicating that we want to skip the first 5 rows, which correspond to the commented rows in our CSV file:
read_csv(url, skip = 5)
## Warning: One or more parsing issues, see `problems()` for details
## Rows: 8235 Columns: 8
## ── Column specification ────────────────────────────────────────────────────
## Delimiter: ","
## chr (7): edition, venue, round, team_1, team_2, score, score_halftime
## dbl (1): year
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.
## # A tibble: 8,235 x 8
## edition year venue round team_1 team_2 score score_halftime
## <chr> <dbl> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 1930-URUGUAY 1930 Montevideo. GROUP_STAGE Yugos~ Boliv~ 4-0 0-0
## 2 1930-URUGUAY 1930 Montevideo. 1/2_FINAL Urugu~ Yugos~ 6-1 3-1
## 3 1930-URUGUAY 1930 Montevideo. 1/2_FINAL Argen~ USA 6-1 1-0
## 4 1930-URUGUAY 1930 Montevideo. GROUP_STAGE Argen~ Chile 3-1 2-1
## 5 1930-URUGUAY 1930 Montevideo. GROUP_STAGE Urugu~ Roman~ 4-0 4-0
## 6 1930-URUGUAY 1930 Montevideo. GROUP_STAGE Parag~ Belgi~ 1-0 1-0
## 7 1930-URUGUAY 1930 Montevideo. GROUP_STAGE Brazi~ Boliv~ 4-0 1-0
## 8 1930-URUGUAY 1930 Montevideo. _FINAL Urugu~ Argen~ 4-2 1-2
## 9 1930-URUGUAY 1930 Montevideo. GROUP_STAGE Argen~ Mexic~ 6-3 3-1
## 10 1930-URUGUAY 1930 Montevideo. GROUP_STAGE Urugu~ Peru ~ 1-0 0-0
## # ... with 8,225 more rows
It works! But wait, we still get two parsing failures, at rows 8235 and 8236. If you check the CSV again, you’ll see that there are two comments at the bottom of the file. Rows 8235 and 8236 correspond to the position of the failures, ignoring the first 5 skipped rows and the header row.
Now, how do we deal with the last two comments? Remember that, in the CSV file, all the comments begin with #. We can tell read_csv() to get rid of every line that starts with #:
read_csv(url, comment = "#")
## Rows: 8234 Columns: 8
## ── Column specification ────────────────────────────────────────────────────
## Delimiter: ","
## chr (7): edition, venue, round, team_1, team_2, score, score_halftime
## dbl (1): year
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.
## # A tibble: 8,234 x 8
## edition year venue round team_1 team_2 score score_halftime
## <chr> <dbl> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 1930-URUGUAY 1930 Montevideo. GROUP_STAGE Yugos~ Boliv~ 4-0 0-0
## 2 1930-URUGUAY 1930 Montevideo. 1/2_FINAL Urugu~ Yugos~ 6-1 3-1
## 3 1930-URUGUAY 1930 Montevideo. 1/2_FINAL Argen~ USA 6-1 1-0
## 4 1930-URUGUAY 1930 Montevideo. GROUP_STAGE Argen~ Chile 3-1 2-1
## 5 1930-URUGUAY 1930 Montevideo. GROUP_STAGE Urugu~ Roman~ 4-0 4-0
## 6 1930-URUGUAY 1930 Montevideo. GROUP_STAGE Parag~ Belgi~ 1-0 1-0
## 7 1930-URUGUAY 1930 Montevideo. GROUP_STAGE Brazi~ Boliv~ 4-0 1-0
## 8 1930-URUGUAY 1930 Montevideo. _FINAL Urugu~ Argen~ 4-2 1-2
## 9 1930-URUGUAY 1930 Montevideo. GROUP_STAGE Argen~ Mexic~ 6-3 3-1
## 10 1930-URUGUAY 1930 Montevideo. GROUP_STAGE Urugu~ Peru ~ 1-0 0-0
## # ... with 8,224 more rows
Exercise 2:
🧠🧠🧠🧠🧠
Q1: Consider the following CSV file. It has commented rows at the beginning and at the end. Get rid of the first commented rows.
Q2: Without looking at the file, guess the character that marks the commented rows at the end of the file.
Q3: Get rid of all the commented lines.
Q4: The data is a bit voluminous; using the n_max parameter, import the first 100 rows only.
🧠🧠🧠🧠🧠
3. What if the Separator is not a Comma?
Many European firms, when producing their CSV data, use a semi-colon (;) to separate the values. This is mainly because the numerical records use the comma (,) as the decimal mark instead of the dot (.). So, for example, you’ll see 0,5 instead of 0.5 for a half. You’ll also see a million written in the following numeric format: 1.000.000,00. Here the comma (,) denotes the start of the decimal part, and the dot (.) is what is called a grouping mark, used to display large values in a convenient way (by grouping the thousands).
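You can see these conventions in action with readr’s parse_number() function, which takes the same locale() helper that we will meet again below. A minimal sketch:
# Parse a European-style number: "." groups thousands, "," marks decimals
parse_number(
  "1.000.000,00",
  locale = locale(decimal_mark = ",", grouping_mark = ".")
)
## [1] 1e+06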
As an example, consider the following CSV file, which represents the different train ticket subscription schemes in Paris, France.
For this particular case, where the separator is a semi-colon (;), readr provides the read_csv2() function, which parses a semi-colon separated values file properly:
url <- "https://gist.githubusercontent.com/feddelegrand7/2c3e0c2f1fe76497ad7c8ce1b6d88853/raw/c1dfcb731dd95a31a0f8db5821fe71a6dfd7c139/train_abo.csv"
paris_train_abo <- read_csv2(url)
## i Using "','" as decimal and "'.'" as grouping mark. Use `read_delim()` for more control.
## Rows: 79 Columns: 14
## ── Column specification ────────────────────────────────────────────────────
## Delimiter: ";"
## chr (2): titre_de_transport, commentaire
## dbl (11): toutes_zones, zone_1_a_2, zone_1_a_3, zone_1_a_4, zone_1_a_5, zon...
## date (1): date
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.
paris_train_abo
## # A tibble: 79 x 14
## titre_de_transport date toutes_zones zone_1_a_2 zone_1_a_3 zone_1_a_4
## <chr> <date> <dbl> <dbl> <dbl> <dbl>
## 1 Mobilis 1J 2014-01-01 NA 6.8 9.05 11.2
## 2 Paris Visite 2J Adu~ 2014-01-01 NA NA 17.6 NA
## 3 Paris Visite 5J Adu~ 2014-01-01 NA NA 34.7 NA
## 4 Forfait Solidarité ~ 2015-01-01 NA 17.5 22.3 27.0
## 5 Paris Visite 1J Enf~ 2017-08-01 NA NA 6 NA
## 6 Paris Visite 1J Adu~ 2013-01-01 NA NA 10.6 NA
## 7 Paris Visite 5J Adu~ 2013-01-01 NA NA 33.7 NA
## 8 Forfait Solidarité ~ 2017-08-01 18.8 NA NA NA
## 9 Imagine R étudiant ~ 2016-08-01 334. NA NA NA
## 10 Paris Visite 5J Adu~ 2017-08-01 NA NA 38.4 NA
## # ... with 69 more rows, and 8 more variables: zone_1_a_5 <dbl>,
## # zone_2_a_3 <dbl>, zone_2_a_4 <dbl>, zone_2_a_5 <dbl>, zone_3_a_4 <dbl>,
## # zone_3_a_5 <dbl>, zone_4_a_5 <dbl>, commentaire <chr>
Exercise 3:
🧠🧠🧠🧠🧠
Import the following CSV file.
Q1: After reading the data, you’ll find that all the columns (except the date) are of character type. Why?
Q2: Using one of the read_csv2() parameters, correct Q1 so that you get the same column types as in the previous example (see above).
🧠🧠🧠🧠🧠
So, let’s summarize: we should use read_csv() when the delimiter is a comma (,) and read_csv2() when the delimiter is a semi-colon (;). But what if the delimiter is a |, a -, a ###, or any other symbol? For those cases, readr provides a flexible function called read_delim(), which allows us to customize our CSV properties (delimiter, grouping mark and others).
As an example, consider the following CSV file. It’s a real data frame from the French open data portal that showcases the annual electricity production from natural gas in France.
Looking at the data, we can see that the delimiter is a semi-colon (;), so we might be tempted to use the read_csv2() function. However, if we look closely, the decimal mark is not a comma (,) but a dot (.). Remember that read_csv2() considers a . to be a grouping mark and a , to be a decimal mark. As a consequence, a value of 485.167 will be read as 485167 by read_csv2(), which is obviously inaccurate:
url <- "https://gist.githubusercontent.com/feddelegrand7/5031ffd852acaef2e797056471db3a1c/raw/ea5e46dd7cb7c27c9fda1206f4e82e2403ec33c2/electricity_prod.csv"
read_csv2(url)
## i Using "','" as decimal and "'.'" as grouping mark. Use `read_delim()` for more control.
## Rows: 28 Columns: 9
## ── Column specification ────────────────────────────────────────────────────
## Delimiter: ";"
## chr (4): site, statut, operateur, point_geo
## dbl (2): annee_de_reference, date_de_mise_en_service
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.
## # A tibble: 28 x 9
## annee_de_reference site statut puissance_insta~ longitude_site latitude_site
## <dbl> <chr> <chr> <dbl> <dbl> <dbl>
## 1 2019 Land~ En pr~ 4220 -40667 485167
## 2 2019 SPEM~ En se~ 4350 -2137728 4730861
## 3 2019 Pont~ En se~ 4400 38692 502268
## 4 2020 Toul En se~ 4130 5913879 48694169
## 5 2020 Blén~ En se~ 4300 6079634 48867195
## 6 2020 Sain~ En se~ 8600 66999 491488
## 7 2019 Mont~ En se~ 3700 28511 483771
## 8 2020 Mont~ En se~ 3700 28511 483771
## 9 2020 Land~ En pr~ 4220 -40667 485167
## 10 2020 Mart~ En se~ 9300 5021728 43359135
## # ... with 18 more rows, and 3 more variables: operateur <chr>,
## # date_de_mise_en_service <dbl>, point_geo <chr>
Using read_delim(), we can overcome this situation and import our CSV file correctly:
read_delim(
file = url,
delim = ";",
locale = locale(decimal_mark = ".")
)
## Rows: 28 Columns: 9
## ── Column specification ────────────────────────────────────────────────────
## Delimiter: ";"
## chr (4): site, statut, operateur, point_geo
## dbl (5): annee_de_reference, puissance_installee_mw, longitude_site, latitud...
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.
## # A tibble: 28 x 9
## annee_de_reference site statut puissance_insta~ longitude_site latitude_site
## <dbl> <chr> <chr> <dbl> <dbl> <dbl>
## 1 2019 Land~ En pr~ 422 -4.07 48.5
## 2 2019 SPEM~ En se~ 435 -2.14 47.3
## 3 2019 Pont~ En se~ 440 3.87 50.2
## 4 2020 Toul En se~ 413 5.91 48.7
## 5 2020 Blén~ En se~ 430 6.08 48.9
## 6 2020 Sain~ En se~ 860 6.70 49.1
## 7 2019 Mont~ En se~ 370 2.85 48.4
## 8 2020 Mont~ En se~ 370 2.85 48.4
## 9 2020 Land~ En pr~ 422 -4.07 48.5
## 10 2020 Mart~ En se~ 930 5.02 43.4
## # ... with 18 more rows, and 3 more variables: operateur <chr>,
## # date_de_mise_en_service <dbl>, point_geo <chr>
Here, we used the locale parameter, which takes the locale() function as its argument. Inside locale(), we indicated that our decimal mark is a dot: locale(decimal_mark = ".").
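Note that locale() controls more than the decimal mark. A quick sketch of some of its other arguments (the values below are its illustrative defaults, not tied to this file):
locale(
  decimal_mark = ".",  # how decimals are written
  grouping_mark = ",", # how thousands are grouped
  tz = "UTC",          # time zone used for date-time parsing
  encoding = "UTF-8"   # character encoding of the file
)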
Exercise 4:
🧠🧠🧠🧠🧠
Q1: Read the following CSV file.
🧠🧠🧠🧠🧠
4. Column Specification in readr
When reading data, readr uses heuristics to guess the type of each column. By default, it looks at the first 1000 rows in order to make an accurate guess. Using the guess_max parameter, you can modify the number of rows used to determine the type of each column. This choice is a trade-off: setting guess_max to a lower value will increase speed but decrease accuracy, while a higher value will increase accuracy but slow down the loading process. There is one type of data that will never be guessed though: the factor type. Fortunately, readr allows us to set each column type manually, which is a more robust way to load data. Therefore, in order to set a particular variable to the factor type during loading, we need to specify it manually using the col_types parameter.
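Besides naming columns one by one, the cols() specification accepts a .default entry, which sets a fallback type for every column you do not mention explicitly. A minimal sketch (reusing the mpg URL from section 1):
mpg_url <- "https://gist.githubusercontent.com/feddelegrand7/b366864aabf9653361f461cbf972d97c/raw/a62c4672f2f5824b2634a66c948e6258d7c65323/mpg.csv"
mpg_chr <- read_csv(
  file = mpg_url,
  col_types = cols(
    .default = col_character(), # everything as character...
    displ = col_double()        # ...except displ, kept numeric
  )
)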
Let’s go back to our first data example, the mpg csv file:
url <- "https://gist.githubusercontent.com/feddelegrand7/b366864aabf9653361f461cbf972d97c/raw/a62c4672f2f5824b2634a66c948e6258d7c65323/mpg.csv"
mpg <- read_csv(url)
## Rows: 234 Columns: 11
## ── Column specification ────────────────────────────────────────────────────
## Delimiter: ","
## chr (6): manufacturer, model, trans, drv, fl, class
## dbl (5): displ, year, cyl, cty, hwy
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.
mpg
## # A tibble: 234 x 11
## manufacturer model displ year cyl trans drv cty hwy fl class
## <chr> <chr> <dbl> <dbl> <dbl> <chr> <chr> <dbl> <dbl> <chr> <chr>
## 1 audi a4 1.8 1999 4 auto~ f 18 29 p comp~
## 2 audi a4 1.8 1999 4 manu~ f 21 29 p comp~
## 3 audi a4 2 2008 4 manu~ f 20 31 p comp~
## 4 audi a4 2 2008 4 auto~ f 21 30 p comp~
## 5 audi a4 2.8 1999 6 auto~ f 16 26 p comp~
## 6 audi a4 2.8 1999 6 manu~ f 18 26 p comp~
## 7 audi a4 3.1 2008 6 auto~ f 18 27 p comp~
## 8 audi a4 quattro 1.8 1999 4 manu~ 4 18 26 p comp~
## 9 audi a4 quattro 1.8 1999 4 auto~ 4 16 25 p comp~
## 10 audi a4 quattro 2 2008 4 manu~ 4 20 28 p comp~
## # ... with 224 more rows
Let’s say that, for whatever reason, we wanted to set the type of the year variable to factor. We can operate on the col_types parameter as seen previously, but this time we’ll use a list to specify the name of the column to act on:
mpg_fct <- read_csv(
file = url,
col_types = list(
year = "f"
)
)
class(mpg_fct$year)
## [1] "factor"
levels(mpg_fct$year)
## [1] "1999" "2008"
readr was smart enough to detect the levels of the factor variable. Now, what if our goal were to extend the levels of the year variable? In our data set, it takes only two distinct values, 1999 and 2008, but what if in the future we get data about other years, maybe 2022? For that kind of flexibility, we need to use the col_* family of functions:
mpg_fct <- read_csv(
file = url,
col_types = list(
year = col_factor(
levels = c("1999", "2008", "2022")
)
)
)
class(mpg_fct$year)
## [1] "factor"
levels(mpg_fct$year)
## [1] "1999" "2008" "2022"
Exercise 5:
🧠🧠🧠🧠🧠
Consider the following TidyTuesday CSV file that exhibits a survey about managers’ salaries, among other information. You can find more info here.
Q1: Set the guess_max parameter to 0. What do you observe?
Q2: With guess_max = 1000 and using the col_types parameter, make sure that the timestamp column is of type date-time, the annual_salary and other_monetary_comp columns are of type double, while all the other variables are of type character. The following page might help for the date-time parsing.
🧠🧠🧠🧠🧠