1. Reading a CSV File
A Comma-Separated Values file (with a .csv extension) is a text file in which the values are (usually) separated by a comma (,) and each row represents an observation. It is used to store data in tabular format and is used extensively in production thanks to its flexibility and light weight.
Take a look at the following CSV file, which is hosted as a GitHub Gist. The CSV file is a representation of the famous mpg data frame, which is part of the ggplot2 package.
First, notice that the first row is special: it is used to identify/name the columns. Second, each value is separated by a comma. Finally, each row represents an individual observation.
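For illustration, the first three lines of such a file (taken from the mpg CSV used below) look like this: a header row, then one observation per line:
manufacturer,model,displ,year,cyl,trans,drv,cty,hwy,fl,class
audi,a4,1.8,1999,4,auto(l5),f,18,29,p,compact
audi,a4,1.8,1999,4,manual(m5),f,21,29,p,compact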
So, how do we read a comma-separated values file? We will use the read_csv() function. Its first argument is the most important: it indicates the path to the file to import. Since our CSV file is stored remotely (not locally), we can simply use the URL of the file to read it:
library(tidyverse)
url <- "https://gist.githubusercontent.com/feddelegrand7/b366864aabf9653361f461cbf972d97c/raw/a62c4672f2f5824b2634a66c948e6258d7c65323/mpg.csv"
mpg_data <- read_csv(file = url)
head(mpg_data)
## # A tibble: 6 x 11
## manufacturer model displ year cyl trans drv cty hwy fl class
## <chr> <chr> <dbl> <dbl> <dbl> <chr> <chr> <dbl> <dbl> <chr> <chr>
## 1 audi a4 1.8 1999 4 auto(l5) f 18 29 p compa~
## 2 audi a4 1.8 1999 4 manual(m5) f 21 29 p compa~
## 3 audi a4 2 2008 4 manual(m6) f 20 31 p compa~
## 4 audi a4 2 2008 4 auto(av) f 21 30 p compa~
## 5 audi a4 2.8 1999 6 auto(l5) f 16 26 p compa~
## 6 audi a4 2.8 1999 6 manual(m5) f 18 26 p compa~
When reading the data, readr prints a set of information. It specifies the dimensions of the data frame, 234 rows and 11 columns. It also specifies the delimiter used to parse the data, in this case the comma (,). Finally, you get useful insight into the data structure, namely the type of each column. In our example, we can see that 6 columns are of type character while 5 are of type double.
You can always retrieve the column types using the spec() function:
readr::spec(x = mpg_data)
## cols(
## manufacturer = col_character(),
## model = col_character(),
## displ = col_double(),
## year = col_double(),
## cyl = col_double(),
## trans = col_character(),
## drv = col_character(),
## cty = col_double(),
## hwy = col_double(),
## fl = col_character(),
## class = col_character()
## )
If you want a more compact representation of the types, you can use:
as.character(spec(mpg_data))
## [1] "ccdddccddcc"
readr detects the column types using a subset of the data. By default, it uses the first 1000 rows to infer the data structure. If your data frame has fewer rows, it will use all the rows that are read. For more info, see the guess_max and n_max parameters. You can override this behaviour and set the column types manually using the col_types parameter:
url <- "https://gist.githubusercontent.com/feddelegrand7/b366864aabf9653361f461cbf972d97c/raw/a62c4672f2f5824b2634a66c948e6258d7c65323/mpg.csv"
mpg_data <- read_csv(file = url, col_types = "ccccccccccc", guess_max = 1)
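The related n_max parameter caps how many rows are read at all, which is handy for a quick preview of a large file. A minimal sketch:
# Read only the first 10 observations (the header row is not counted)
mpg_preview <- read_csv(file = url, n_max = 10, show_col_types = FALSE)
nrow(mpg_preview)
## [1] 10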
Exercise 1:
🧠🧠🧠🧠🧠
Q1: Read the above CSV data frame as a local file.
You will have to:
Create an R Project;
Download the CSV file into your R Project directory (ideally into a data directory);
Create a new script and load the CSV file into your R session;
Print the first six (06) observations.
🧠🧠🧠🧠🧠
2. Cleaning on the Fly
Consider the following data frame. It’s a CSV file from Kaggle that contains some information about World Cup games from 1930 until 2018. If you check the file, you can see that it has 5 commented lines at the top. Therefore, if we read it directly using read_csv(), we get parsing failures:
url <- "https://gist.githubusercontent.com/feddelegrand7/e2f5e302c45f48ed9dcc69e8539f9710/raw/9273fe41dd7f6bf3cebd5c415ffac20be0343d24/world_cup.csv"
data_world_cup <- read_csv(url)
## Warning: One or more parsing issues, see `problems()` for details
## Rows: 8241 Columns: 1
## ── Column specification ────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): # The data frame comes from Kaggle:
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(data_world_cup, 10)
## # A tibble: 10 x 1
## `# The data frame comes from Kaggle:`
## <chr>
## 1 # Description: -------------------------------------------------------------~
## 2 # Scores and half-time scores of all euro cup games from 1930 until and incl~
## 3 # Includes all regular matches fully played.
## 4 # Excludes forfeits or otherwise no fully played matches.
## 5 edition,year,venue,round,team_1,team_2,score,score_halftime
## 6 1930-URUGUAY,1930,Montevideo.,GROUP_STAGE,Yugoslavia (<U+0408><U+0443><U+0433><U+043E><U+0441><U+043B><U+0430><U+0432><U+0438><U+0458><U+0430>),Bolivia,4~
## 7 1930-URUGUAY,1930,Montevideo.,1/2_FINAL,Uruguay,Yugoslavia (<U+0408><U+0443><U+0433><U+043E><U+0441><U+043B><U+0430><U+0432><U+0438><U+0458><U+0430>),6-1~
## 8 1930-URUGUAY,1930,Montevideo.,1/2_FINAL,Argentina,USA,6-1,1-0
## 9 1930-URUGUAY,1930,Montevideo.,GROUP_STAGE,Argentina,Chile,3-1,2-1
## 10 1930-URUGUAY,1930,Montevideo.,GROUP_STAGE,Uruguay,Romania (România),4-0,4-0
You can see that read_csv() gives pretty informative messages concerning the source of the failure and the number of parsing failures, in this example 8235, which corresponds to the number of rows (comments not included). You can use the problems() function to get a tibble of the parsing failures.
parsing_failures <- problems(data_world_cup)
head(parsing_failures, 10)
## # A tibble: 10 x 5
## row col expected actual file
## <int> <int> <chr> <chr> <chr>
## 1 6 8 1 columns 8 columns ""
## 2 7 8 1 columns 8 columns ""
## 3 8 8 1 columns 8 columns ""
## 4 9 8 1 columns 8 columns ""
## 5 10 8 1 columns 8 columns ""
## 6 11 8 1 columns 8 columns ""
## 7 12 8 1 columns 8 columns ""
## 8 13 8 1 columns 8 columns ""
## 9 14 8 1 columns 8 columns ""
## 10 15 8 1 columns 8 columns ""
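With thousands of failures, a quick aggregation helps to see whether they all share the same cause. A minimal sketch using dplyr (loaded as part of the tidyverse):
# Count the failures by expected vs. actual shape
dplyr::count(parsing_failures, expected, actual)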
Most of the time, readr will read the data even when there is a parsing issue. You can always inspect the issues using the problems() function; however, you might want to stop the execution of your script whenever there is a parsing problem (especially if your script takes a long time to execute). For this purpose, readr provides the stop_for_problems() function, which throws an error when the parsing process fails:
world_cup_url <- "https://gist.githubusercontent.com/feddelegrand7/e2f5e302c45f48ed9dcc69e8539f9710/raw/9273fe41dd7f6bf3cebd5c415ffac20be0343d24/world_cup.csv"
world_cup_data <- read_csv(file = world_cup_url, show_col_types = FALSE)
## Warning: One or more parsing issues, see `problems()` for details
stop_for_problems(world_cup_data)
## Error: 8235 parsing failures
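If halting the whole script is too drastic, one alternative is to catch the error yourself and decide how to proceed. A minimal sketch using base R’s tryCatch():
world_cup_checked <- tryCatch(
  {
    stop_for_problems(world_cup_data)
    world_cup_data # returned only when parsing was clean
  },
  error = function(e) {
    message("Parsing failed: ", conditionMessage(e))
    NULL # downstream code can test for NULL and react
  }
)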
Fortunately, this issue is relatively easy to fix. We can set the skip parameter of read_csv() to 5, indicating that we want to skip the first 5 rows, which correspond to the commented rows in our CSV file:
read_csv(url, skip = 5)
## Warning: One or more parsing issues, see `problems()` for details
## Rows: 8235 Columns: 8
## ── Column specification ────────────────────────────────────────────────────
## Delimiter: ","
## chr (7): edition, venue, round, team_1, team_2, score, score_halftime
## dbl (1): year
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.
## # A tibble: 8,235 x 8
## edition year venue round team_1 team_2 score score_halftime
## <chr> <dbl> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 1930-URUGUAY 1930 Montevideo. GROUP_STAGE Yugos~ Boliv~ 4-0 0-0
## 2 1930-URUGUAY 1930 Montevideo. 1/2_FINAL Urugu~ Yugos~ 6-1 3-1
## 3 1930-URUGUAY 1930 Montevideo. 1/2_FINAL Argen~ USA 6-1 1-0
## 4 1930-URUGUAY 1930 Montevideo. GROUP_STAGE Argen~ Chile 3-1 2-1
## 5 1930-URUGUAY 1930 Montevideo. GROUP_STAGE Urugu~ Roman~ 4-0 4-0
## 6 1930-URUGUAY 1930 Montevideo. GROUP_STAGE Parag~ Belgi~ 1-0 1-0
## 7 1930-URUGUAY 1930 Montevideo. GROUP_STAGE Brazi~ Boliv~ 4-0 1-0
## 8 1930-URUGUAY 1930 Montevideo. _FINAL Urugu~ Argen~ 4-2 1-2
## 9 1930-URUGUAY 1930 Montevideo. GROUP_STAGE Argen~ Mexic~ 6-3 3-1
## 10 1930-URUGUAY 1930 Montevideo. GROUP_STAGE Urugu~ Peru ~ 1-0 0-0
## # ... with 8,225 more rows
It works! But wait, we still get two parsing failures, at rows 8235 and 8236. If you check the CSV again, you’ll see that there are two comments at the bottom of the file. Rows 8235 and 8236 correspond to the position of the failures, ignoring the first 5 skipped rows and the header row.
Now, how do we deal with the last two comments? Remember that, in the CSV file, all the comments begin with #. We can tell read_csv() to get rid of every line that starts with #:
read_csv(url, comment = "#")
## Rows: 8234 Columns: 8
## ── Column specification ────────────────────────────────────────────────────
## Delimiter: ","
## chr (7): edition, venue, round, team_1, team_2, score, score_halftime
## dbl (1): year
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.
## # A tibble: 8,234 x 8
## edition year venue round team_1 team_2 score score_halftime
## <chr> <dbl> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 1930-URUGUAY 1930 Montevideo. GROUP_STAGE Yugos~ Boliv~ 4-0 0-0
## 2 1930-URUGUAY 1930 Montevideo. 1/2_FINAL Urugu~ Yugos~ 6-1 3-1
## 3 1930-URUGUAY 1930 Montevideo. 1/2_FINAL Argen~ USA 6-1 1-0
## 4 1930-URUGUAY 1930 Montevideo. GROUP_STAGE Argen~ Chile 3-1 2-1
## 5 1930-URUGUAY 1930 Montevideo. GROUP_STAGE Urugu~ Roman~ 4-0 4-0
## 6 1930-URUGUAY 1930 Montevideo. GROUP_STAGE Parag~ Belgi~ 1-0 1-0
## 7 1930-URUGUAY 1930 Montevideo. GROUP_STAGE Brazi~ Boliv~ 4-0 1-0
## 8 1930-URUGUAY 1930 Montevideo. _FINAL Urugu~ Argen~ 4-2 1-2
## 9 1930-URUGUAY 1930 Montevideo. GROUP_STAGE Argen~ Mexic~ 6-3 3-1
## 10 1930-URUGUAY 1930 Montevideo. GROUP_STAGE Urugu~ Peru ~ 1-0 0-0
## # ... with 8,224 more rows
Exercise 2:
🧠🧠🧠🧠🧠
Q1: Consider the following CSV file. It has commented rows at the beginning and at the end. Get rid of the first commented rows.
Q2: Without looking at the file, guess the character that marks the commented rows at the end of the file.
Q3: Get rid of all the commented lines.
Q4: The data is a bit voluminous; using the n_max parameter, import the first 100 rows only.
🧠🧠🧠🧠🧠
3. What if the Separator is not a Comma?
Many European firms, when producing their CSV data, use a semi-colon (;) to separate the values. This is mainly because the numerical records use the comma (,) as the decimal mark instead of the dot (.). So, for example, you’ll see 0,5 instead of 0.5 for a half. You’ll also see a million written in the following numeric format: 1.000.000,00. Here the comma (,) denotes the start of the decimal part, and the dot (.) is what is called a grouping mark, used to display large values in a convenient way (by grouping the thousands).
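You can see these conventions in action with readr’s parse_number() function, which takes the same locale() helper that we will meet again below. A minimal sketch:
# Parse a European-style number: "." groups thousands, "," marks decimals
parse_number(
  "1.000.000,00",
  locale = locale(decimal_mark = ",", grouping_mark = ".")
)
## [1] 1e+06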
As an example, consider the following CSV file, which represents the different train ticket subscription schemes in Paris, France.
For this particular case, where the separator is a semi-colon (;), readr provides the read_csv2() function, which parses a semi-colon separated values file properly:
url <- "https://gist.githubusercontent.com/feddelegrand7/2c3e0c2f1fe76497ad7c8ce1b6d88853/raw/c1dfcb731dd95a31a0f8db5821fe71a6dfd7c139/train_abo.csv"
paris_train_abo <- read_csv2(url)
## i Using "','" as decimal and "'.'" as grouping mark. Use `read_delim()` for more control.
## Rows: 79 Columns: 14
## ── Column specification ────────────────────────────────────────────────────
## Delimiter: ";"
## chr (2): titre_de_transport, commentaire
## dbl (11): toutes_zones, zone_1_a_2, zone_1_a_3, zone_1_a_4, zone_1_a_5, zon...
## date (1): date
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.
paris_train_abo
## # A tibble: 79 x 14
## titre_de_transport date toutes_zones zone_1_a_2 zone_1_a_3 zone_1_a_4
## <chr> <date> <dbl> <dbl> <dbl> <dbl>
## 1 Mobilis 1J 2014-01-01 NA 6.8 9.05 11.2
## 2 Paris Visite 2J Adu~ 2014-01-01 NA NA 17.6 NA
## 3 Paris Visite 5J Adu~ 2014-01-01 NA NA 34.7 NA
## 4 Forfait Solidarité ~ 2015-01-01 NA 17.5 22.3 27.0
## 5 Paris Visite 1J Enf~ 2017-08-01 NA NA 6 NA
## 6 Paris Visite 1J Adu~ 2013-01-01 NA NA 10.6 NA
## 7 Paris Visite 5J Adu~ 2013-01-01 NA NA 33.7 NA
## 8 Forfait Solidarité ~ 2017-08-01 18.8 NA NA NA
## 9 Imagine R étudiant ~ 2016-08-01 334. NA NA NA
## 10 Paris Visite 5J Adu~ 2017-08-01 NA NA 38.4 NA
## # ... with 69 more rows, and 8 more variables: zone_1_a_5 <dbl>,
## # zone_2_a_3 <dbl>, zone_2_a_4 <dbl>, zone_2_a_5 <dbl>, zone_3_a_4 <dbl>,
## # zone_3_a_5 <dbl>, zone_4_a_5 <dbl>, commentaire <chr>
Exercise 3:
🧠🧠🧠🧠🧠
Import the following CSV file.
Q1: After reading the data, you’ll find that all the columns (except the date) are of character type. Why?
Q2: Using one of the read_csv2() parameters, correct Q1 so that you get the same column types as in the previous example (see above).
🧠🧠🧠🧠🧠
So, let’s summarize: we should use read_csv() when the delimiter is a comma (,) and read_csv2() when the delimiter is a semi-colon (;). But what if the delimiter is a |, a -, a ###, or any other symbol? For those cases, readr provides a flexible function called read_delim(), which allows us to customize our CSV properties (delimiter, grouping mark and others).
As an example, consider the following CSV file. It’s a real data frame from the French open data portal that showcases the annual electricity production from natural gas in France.
Looking at the data, we can see that the delimiter is a semi-colon (;), so we might be tempted to use the read_csv2() function. However, if we look closely, the decimal mark is not a comma (,) but a dot (.). Remember that read_csv2() considers a . to be a grouping mark and a , to be a decimal mark. As a consequence, a value of 485.167 will be read as 485167 by read_csv2(), which is obviously inaccurate:
url <- "https://gist.githubusercontent.com/feddelegrand7/5031ffd852acaef2e797056471db3a1c/raw/ea5e46dd7cb7c27c9fda1206f4e82e2403ec33c2/electricity_prod.csv"
read_csv2(url)
## i Using "','" as decimal and "'.'" as grouping mark. Use `read_delim()` for more control.
## Rows: 28 Columns: 9
## ── Column specification ────────────────────────────────────────────────────
## Delimiter: ";"
## chr (4): site, statut, operateur, point_geo
## dbl (2): annee_de_reference, date_de_mise_en_service
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.
## # A tibble: 28 x 9
## annee_de_reference site statut puissance_insta~ longitude_site latitude_site
## <dbl> <chr> <chr> <dbl> <dbl> <dbl>
## 1 2019 Land~ En pr~ 4220 -40667 485167
## 2 2019 SPEM~ En se~ 4350 -2137728 4730861
## 3 2019 Pont~ En se~ 4400 38692 502268
## 4 2020 Toul En se~ 4130 5913879 48694169
## 5 2020 Blén~ En se~ 4300 6079634 48867195
## 6 2020 Sain~ En se~ 8600 66999 491488
## 7 2019 Mont~ En se~ 3700 28511 483771
## 8 2020 Mont~ En se~ 3700 28511 483771
## 9 2020 Land~ En pr~ 4220 -40667 485167
## 10 2020 Mart~ En se~ 9300 5021728 43359135
## # ... with 18 more rows, and 3 more variables: operateur <chr>,
## # date_de_mise_en_service <dbl>, point_geo <chr>
Using read_delim(), we can overcome this situation and import our CSV file correctly:
read_delim(
file = url,
delim = ";",
locale = locale(decimal_mark = ".")
)
## Rows: 28 Columns: 9
## ── Column specification ────────────────────────────────────────────────────
## Delimiter: ";"
## chr (4): site, statut, operateur, point_geo
## dbl (5): annee_de_reference, puissance_installee_mw, longitude_site, latitud...
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.
## # A tibble: 28 x 9
## annee_de_reference site statut puissance_insta~ longitude_site latitude_site
## <dbl> <chr> <chr> <dbl> <dbl> <dbl>
## 1 2019 Land~ En pr~ 422 -4.07 48.5
## 2 2019 SPEM~ En se~ 435 -2.14 47.3
## 3 2019 Pont~ En se~ 440 3.87 50.2
## 4 2020 Toul En se~ 413 5.91 48.7
## 5 2020 Blén~ En se~ 430 6.08 48.9
## 6 2020 Sain~ En se~ 860 6.70 49.1
## 7 2019 Mont~ En se~ 370 2.85 48.4
## 8 2020 Mont~ En se~ 370 2.85 48.4
## 9 2020 Land~ En pr~ 422 -4.07 48.5
## 10 2020 Mart~ En se~ 930 5.02 43.4
## # ... with 18 more rows, and 3 more variables: operateur <chr>,
## # date_de_mise_en_service <dbl>, point_geo <chr>
Here, we used the locale parameter, which takes the locale() function as its argument. Inside locale(), we indicated that our decimal mark is a dot: locale(decimal_mark = ".").
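Note that locale() controls more than the decimal mark. A quick sketch of some of its other arguments (the values below are its illustrative defaults, not tied to this file):
locale(
  decimal_mark = ".",  # how decimals are written
  grouping_mark = ",", # how thousands are grouped
  tz = "UTC",          # time zone used for date-time parsing
  encoding = "UTF-8"   # character encoding of the file
)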
Exercise 4:
🧠🧠🧠🧠🧠
Q1: Read the following CSV file.
🧠🧠🧠🧠🧠
4. Column Specification in readr
When reading data, readr uses heuristics to guess the type of each column. By default, it looks at the first 1000 rows in order to make an accurate guess. Using the guess_max parameter, you can modify the number of rows used to determine the type of each column. This choice is a trade-off: setting guess_max to a lower value will increase speed but decrease accuracy, while a higher value will increase accuracy but slow down the loading process. There is one type of data that will never be guessed though: the factor type. Fortunately, readr allows us to set each column type manually, which is a more robust way to load data. Therefore, in order to set a particular variable to the factor type during loading, we need to specify it manually using the col_types parameter.
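Besides naming columns one by one, the cols() specification accepts a .default entry, which sets a fallback type for every column you do not mention explicitly. A minimal sketch (reusing the mpg URL from section 1):
mpg_url <- "https://gist.githubusercontent.com/feddelegrand7/b366864aabf9653361f461cbf972d97c/raw/a62c4672f2f5824b2634a66c948e6258d7c65323/mpg.csv"
mpg_chr <- read_csv(
  file = mpg_url,
  col_types = cols(
    .default = col_character(), # everything as character...
    displ = col_double()        # ...except displ, kept numeric
  )
)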
Let’s go back to our first data example, the mpg csv file:
url <- "https://gist.githubusercontent.com/feddelegrand7/b366864aabf9653361f461cbf972d97c/raw/a62c4672f2f5824b2634a66c948e6258d7c65323/mpg.csv"
mpg <- read_csv(url)
## Rows: 234 Columns: 11
## ── Column specification ────────────────────────────────────────────────────
## Delimiter: ","
## chr (6): manufacturer, model, trans, drv, fl, class
## dbl (5): displ, year, cyl, cty, hwy
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.
mpg
## # A tibble: 234 x 11
## manufacturer model displ year cyl trans drv cty hwy fl class
## <chr> <chr> <dbl> <dbl> <dbl> <chr> <chr> <dbl> <dbl> <chr> <chr>
## 1 audi a4 1.8 1999 4 auto~ f 18 29 p comp~
## 2 audi a4 1.8 1999 4 manu~ f 21 29 p comp~
## 3 audi a4 2 2008 4 manu~ f 20 31 p comp~
## 4 audi a4 2 2008 4 auto~ f 21 30 p comp~
## 5 audi a4 2.8 1999 6 auto~ f 16 26 p comp~
## 6 audi a4 2.8 1999 6 manu~ f 18 26 p comp~
## 7 audi a4 3.1 2008 6 auto~ f 18 27 p comp~
## 8 audi a4 quattro 1.8 1999 4 manu~ 4 18 26 p comp~
## 9 audi a4 quattro 1.8 1999 4 auto~ 4 16 25 p comp~
## 10 audi a4 quattro 2 2008 4 manu~ 4 20 28 p comp~
## # ... with 224 more rows
Let’s say that, for whatever reason, we wanted to set the type of the year variable to factor. We can operate on the col_types parameter as seen previously, but this time we’ll use a list to specify the name of the column to act on:
mpg_fct <- read_csv(
file = url,
col_types = list(
year = "f"
)
)
class(mpg_fct$year)
## [1] "factor"
levels(mpg_fct$year)
## [1] "1999" "2008"
readr was smart enough to detect the levels of the factor variable. Now, what if our goal were to extend the levels of the year variable? In our data set, it takes only two distinct values, 1999 and 2008, but what if in the future we get data about other years, maybe 2022? For that kind of flexibility, we need to use the col_* family of functions:
mpg_fct <- read_csv(
file = url,
col_types = list(
year = col_factor(
levels = c("1999", "2008", "2022")
)
)
)
class(mpg_fct$year)
## [1] "factor"
levels(mpg_fct$year)
## [1] "1999" "2008" "2022"
Exercise 5:
🧠🧠🧠🧠🧠
Consider the following TidyTuesday CSV file that exhibits a survey about managers’ salaries, among other information. You can find more info here.
Q1: Set the guess_max parameter to 0. What do you observe?
Q2: With guess_max = 1000 and using the col_types parameter, make sure that the timestamp column is of type date-time, the annual_salary and other_monetary_comp columns are of type double, while all the other variables are of type character. The following page might help for the date-time parsing.
🧠🧠🧠🧠🧠