Content
1. Loading Excel data using read_excel()
In order to deal with Excel files, the Tidyverse
provides the readxl
package which facilitates the interaction between R
and Excel
. It supports both the .xls
and the .xlsx
format.
We will work with the famous gapminder
data frame which displays the evolution of some economic and demographic indicators for 142 countries from 1952 to 2007. You can download the Excel (.xlsx)
version of gapminder
from here. The file has 6 different sheets: World, Europe, Africa, Americas, Asia and Oceania
.
Let’s read the file using the read_xlsx()
function:
library(readxl)
gapminder_data <- read_excel(path = "gapminder.xlsx")
head(gapminder_data)
## # A tibble: 6 x 6
## country continent year lifeExp pop gdpPercap
## <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Afghanistan Asia 1952 28.8 8425333 779.
## 2 Afghanistan Asia 1957 30.3 9240934 821.
## 3 Afghanistan Asia 1962 32.0 10267083 853.
## 4 Afghanistan Asia 1967 34.0 11537966 836.
## 5 Afghanistan Asia 1972 36.1 13079460 740.
## 6 Afghanistan Asia 1977 38.4 14880372 786.
You can see from the above that by default read_xlsx()
reads the first sheet of the Excel
file however what if we wanted to select a specific sheet? Hopefully the function provides a parameter called sheet
which allows to target a sheet by names or by position. Let’s say we’re interested in African countries, we can proceed as follows:
Africa_data <- read_excel(path = "gapminder.xlsx",
sheet = "Africa")
head(Africa_data)
## # A tibble: 6 x 6
## country continent year lifeExp pop gdpPercap
## <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Algeria Africa 1952 43.1 9279525 2449.
## 2 Algeria Africa 1957 45.7 10270856 3014.
## 3 Algeria Africa 1962 48.3 11000948 2551.
## 4 Algeria Africa 1967 51.4 12760499 3247.
## 5 Algeria Africa 1972 54.5 14760787 4183.
## 6 Algeria Africa 1977 58.0 17152804 4910.
🌵🌵🌵🌵🌵🌵🌵🌵🌵
Important: The
sheet
parameter is case sensitive.
read_excel(path = "gapminder.xlsx",
sheet = "africa")
## Error: Sheet 'africa' not found
🌵🌵🌵🌵🌵🌵🌵🌵🌵
As mentioned previously, we can read by sheet position:
Africa_data <- read_excel(path = "gapminder.xlsx",
sheet = 3)
head(Africa_data)
## # A tibble: 6 x 6
## country continent year lifeExp pop gdpPercap
## <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Algeria Africa 1952 43.1 9279525 2449.
## 2 Algeria Africa 1957 45.7 10270856 3014.
## 3 Algeria Africa 1962 48.3 11000948 2551.
## 4 Algeria Africa 1967 51.4 12760499 3247.
## 5 Algeria Africa 1972 54.5 14760787 4183.
## 6 Algeria Africa 1977 58.0 17152804 4910.
Exercise 1:
🧠 🧠 🧠 🧠 🧠 > Q1: Using the same file, read the data from Europe and from Asia, for the first one use the name of the sheet and for the second use its position. 🧠 🧠 🧠 🧠 🧠
2. Interacting with the spreadsheet’s structure
Using read_excel()
, it is also possible to interact with the inner structure of the Excel file.
Suppose we’re interested in a specific RANGE of columns, from country
until lifeExp
. We can select this RANGE of columns by providing the first Excel
column name (or position) and the last one.
The following code reads all the data from the Excel
column A
until the column D
:
sm_data <- read_excel(
path = "gapminder.xlsx",
range = cell_cols("A:D")
)
head(sm_data)
## # A tibble: 6 x 4
## country continent year lifeExp
## <chr> <chr> <dbl> <dbl>
## 1 Afghanistan Asia 1952 28.8
## 2 Afghanistan Asia 1957 30.3
## 3 Afghanistan Asia 1962 32.0
## 4 Afghanistan Asia 1967 34.0
## 5 Afghanistan Asia 1972 36.1
## 6 Afghanistan Asia 1977 38.4
Using the columns’ position:
sm_data_pos <- read_excel(
path = "gapminder.xlsx",
range = cell_cols(1:4)
)
head(sm_data_pos)
## # A tibble: 6 x 4
## country continent year lifeExp
## <chr> <chr> <dbl> <dbl>
## 1 Afghanistan Asia 1952 28.8
## 2 Afghanistan Asia 1957 30.3
## 3 Afghanistan Asia 1962 32.0
## 4 Afghanistan Asia 1967 34.0
## 5 Afghanistan Asia 1972 36.1
## 6 Afghanistan Asia 1977 38.4
⛔️⛔️⛔️⛔️⛔️⛔️⛔️⛔️⛔️⛔️⛔️⛔️ > Note that for internal implementation reasons, it is not possible to select columns individually (e.g. cell_cols(c(1, 3, 6))). See here. ⛔️⛔️⛔️⛔️⛔️⛔️⛔️⛔️⛔️⛔️⛔️⛔️
Nevertheless, there is a trick. We can use the col_types
parameter to skip
specific columns. This parameter can be used to set a type
to each column of the data frame or to skip some columns. It can take the following values for each column: "skip", "guess", "logical", "numeric", "date", "text" or "list"
, by default, it is set to NULL
and tries to guess the type
of each column in the spreadsheet.
Let’s say we want to load the column “A” (country) and “D” (lifeExp).
data_trick <- read_excel(
path = "gapminder.xlsx",
range = cell_cols("A:D"),
col_types = c("guess", "skip", "skip", "guess")
)
head(data_trick)
## # A tibble: 6 x 2
## country lifeExp
## <chr> <dbl>
## 1 Afghanistan 28.8
## 2 Afghanistan 30.3
## 3 Afghanistan 32.0
## 4 Afghanistan 34.0
## 5 Afghanistan 36.1
## 6 Afghanistan 38.4
Exercise 2:
🧠 🧠 🧠 🧠 🧠
> Q1: From the gapminder spreadsheet, load the Europe
data frame however only with the following columns country
, year
and pop
.
> Q3: Using the .name_repair
parameter, make the columns’s names in upper case.
🧠 🧠 🧠 🧠 🧠
We can also load a specific set of rows, maybe you just want to check the validity of the data and just want to check the first 10 rows:
read_excel(path = "gapminder.xlsx",
range = cell_rows(1:10))
## # A tibble: 9 x 6
## country continent year lifeExp pop gdpPercap
## <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Afghanistan Asia 1952 28.8 8425333 779.
## 2 Afghanistan Asia 1957 30.3 9240934 821.
## 3 Afghanistan Asia 1962 32.0 10267083 853.
## 4 Afghanistan Asia 1967 34.0 11537966 836.
## 5 Afghanistan Asia 1972 36.1 13079460 740.
## 6 Afghanistan Asia 1977 38.4 14880372 786.
## 7 Afghanistan Asia 1982 39.9 12881816 978.
## 8 Afghanistan Asia 1987 40.8 13867957 852.
## 9 Afghanistan Asia 1992 41.7 16317921 649.
Similarly to cell_col()
, it is not possible to select individual rows with cell_rows()
. If you provide a vector of number read_excel()
will consider the minimum and the maximum of the vector as a range. For example, in the following we use, cell_rows(c(1, 4, 10))
. In this case, read_excel
will understand cell_rows(1:10)
:
read_excel(path = "gapminder.xlsx",
range = cell_rows(c(1, 4, 10)))
## # A tibble: 9 x 6
## country continent year lifeExp pop gdpPercap
## <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Afghanistan Asia 1952 28.8 8425333 779.
## 2 Afghanistan Asia 1957 30.3 9240934 821.
## 3 Afghanistan Asia 1962 32.0 10267083 853.
## 4 Afghanistan Asia 1967 34.0 11537966 836.
## 5 Afghanistan Asia 1972 36.1 13079460 740.
## 6 Afghanistan Asia 1977 38.4 14880372 786.
## 7 Afghanistan Asia 1982 39.9 12881816 978.
## 8 Afghanistan Asia 1987 40.8 13867957 852.
## 9 Afghanistan Asia 1992 41.7 16317921 649.
Finally, it is possible to select a range of cells. In the example below, we select all the cells from A1 to C8 within the Ociania sheet:
read_excel(path = "gapminder.xlsx",
sheet = "Oceania",
range = "A1:C8")
## # A tibble: 7 x 3
## country continent year
## <chr> <chr> <dbl>
## 1 Australia Oceania 1952
## 2 Australia Oceania 1957
## 3 Australia Oceania 1962
## 4 Australia Oceania 1967
## 5 Australia Oceania 1972
## 6 Australia Oceania 1977
## 7 Australia Oceania 1982
Exercise 3:
🧠 🧠 🧠 🧠 🧠
> Q1: Using the range parameter select the first 10 rows of the columns year, lifeExp and pop in the Asia sheet.
> Q2: Repeat the above operation however without using the sheet
parameter.
🧠 🧠 🧠 🧠 🧠
3. Messy Excel Files
Now we’ll work with a messy Excel files that’s a bit closer to the reality. You can download here.
If we read it naively with read_excel()
, we got the following:
read_excel(path = "gapminder_mess.xlsx")
## New names:
## # A tibble: 1,707 x 6
## `#### The data comes from the gapminder packa~` ...2 ...3 ...4 ...5 ...6
## <chr> <chr> <chr> <chr> <chr> <chr>
## 1 to know more about it, visit: <NA> <NA> <NA> <NA> <NA>
## 2 https://cran.r-project.org/web/packages/gapmin~ <NA> <NA> <NA> <NA> <NA>
## 3 country cont~ year life~ pop gdp ~
## 4 Afghanistan Asia 1952 not_~ <NA> not_~
## 5 Afghanistan Asia 1957 not_~ <NA> 820.~
## 6 Afghanistan Asia 1962 not_~ <NA> not_~
## 7 Afghanistan Asia 1967 34.0~ <NA> 836.~
## 8 Afghanistan Asia 1972 36.0~ not_~ 739.~
## 9 Afghanistan Asia 1977 not_~ 1488~ 786.~
## 10 Afghanistan Asia 1982 39.8~ not_~ not_~
## # ... with 1,697 more rows
Hopefully, there are parameters in read_excel()
that allows us to skip a specific number of rows:
messy_data <- read_excel(
path = "gapminder_mess.xlsx",
skip = 4
)
messy_data
## # A tibble: 1,704 x 6
## country continent year `life Exp` pop `gdp Percap`
## <chr> <chr> <chr> <chr> <chr> <chr>
## 1 Afghanistan Asia 1952 not_avail <NA> not_avail
## 2 Afghanistan Asia 1957 not_avail <NA> 820.85302960000001
## 3 Afghanistan Asia 1962 not_avail <NA> not_sure
## 4 Afghanistan Asia 1967 34.020000000000003 <NA> 836.19713820000004
## 5 Afghanistan Asia 1972 36.088000000000001 not_avail 739.98110580000002
## 6 Afghanistan Asia 1977 not_avail 14880372 786.11335999999994
## 7 Afghanistan Asia 1982 39.853999999999999 not_avail not_avail
## 8 Afghanistan Asia 1987 40.822000000000003 13867957 852.39594480000005
## 9 Afghanistan Asia 1992 not_avail not_avail 649.34139519999997
## 10 Afghanistan Asia 1997 41.762999999999998 22227415 635.34135100000003
## # ... with 1,694 more rows
Now, we still need to deal with blank cells, the not_avail
and the not_sure
notation. By default, read_excel()
considers blank cells as NA
values.
head(messy_data$pop)
## [1] NA NA NA NA "not_avail" "14880372"
Nevertheless, if you want to specify other values as NA
, you can do it with the na
parameter:
cleaned_data <- read_excel(
path = "gapminder_mess.xlsx",
skip = 4,
na = c("not_avail", "not_sure")
)
cleaned_data
## # A tibble: 1,704 x 6
## country continent year `life Exp` pop `gdp Percap`
## <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Afghanistan Asia 1952 NA NA NA
## 2 Afghanistan Asia 1957 NA NA 821.
## 3 Afghanistan Asia 1962 NA NA NA
## 4 Afghanistan Asia 1967 34.0 NA 836.
## 5 Afghanistan Asia 1972 36.1 NA 740.
## 6 Afghanistan Asia 1977 NA 14880372 786.
## 7 Afghanistan Asia 1982 39.9 NA NA
## 8 Afghanistan Asia 1987 40.8 13867957 852.
## 9 Afghanistan Asia 1992 NA NA 649.
## 10 Afghanistan Asia 1997 41.8 22227415 635.
## # ... with 1,694 more rows
Exercise 4:
🧠 🧠 🧠 🧠 🧠
> Q1: Spoiler Alert, our data frame is not 100% clean. Look at the column names. There is a blank space in life Exp
and gdp Percap
, this is not optimal names. We want snake_case naming!
Load the gapminder_mess.xlsx
data frame but with column names cleaned.
> Hint: use the janitor package.
Q2: What happens when the excel values use the comma
,
as a decimal separator. 🧠 🧠 🧠 🧠 🧠