Back
Featured image of post Data Wrangling with {dplyr}

Data Wrangling with {dplyr}

{dplyr} is a powerful package that allows you to operate complex operations on data farmes in order to get a maximum amount of insights

0. The magrittr pipe %>%

Before digging into dplyr, there is a sort topic yet important to cover, and it’s the tidyverse piping process. Piping is used in programming to chain several operations in order to yield a final output. In the tidyverse this is done using the magrittr pipe operator %>%. The pipe takes the result of the Left Handed Side (L.H.S) code and pipe it into the first parameter of the Right Handed Side (R.H.S) function.

Consider the following example. We use the summary() function to display some statistical metrics in the mtcars data frame. Note that we’re keeping only the rows where mpg is greater or equal than 30 miles per gallon:

summary(subset(mtcars, mpg >= 30))
##       mpg             cyl         disp             hp              drat      
##  Min.   :30.40   Min.   :4   Min.   :71.10   Min.   : 52.00   Min.   :3.770  
##  1st Qu.:30.40   1st Qu.:4   1st Qu.:74.55   1st Qu.: 61.75   1st Qu.:4.003  
##  Median :31.40   Median :4   Median :77.20   Median : 65.50   Median :4.150  
##  Mean   :31.77   Mean   :4   Mean   :80.15   Mean   : 74.00   Mean   :4.250  
##  3rd Qu.:32.77   3rd Qu.:4   3rd Qu.:82.80   3rd Qu.: 77.75   3rd Qu.:4.397  
##  Max.   :33.90   Max.   :4   Max.   :95.10   Max.   :113.00   Max.   :4.930  
##        wt             qsec             vs          am         gear     
##  Min.   :1.513   Min.   :16.90   Min.   :1   Min.   :1   Min.   :4.00  
##  1st Qu.:1.589   1st Qu.:18.11   1st Qu.:1   1st Qu.:1   1st Qu.:4.00  
##  Median :1.725   Median :19.00   Median :1   Median :1   Median :4.00  
##  Mean   :1.791   Mean   :18.70   Mean   :1   Mean   :1   Mean   :4.25  
##  3rd Qu.:1.926   3rd Qu.:19.58   3rd Qu.:1   3rd Qu.:1   3rd Qu.:4.25  
##  Max.   :2.200   Max.   :19.90   Max.   :1   Max.   :1   Max.   :5.00  
##       carb    
##  Min.   :1.0  
##  1st Qu.:1.0  
##  Median :1.5  
##  Mean   :1.5  
##  3rd Qu.:2.0  
##  Max.   :2.0

The above code is relatively readable but not very intuitive and not that pleasant to read. Let’s rewrite it using the pipe operator:

library(magrittr)
## Warning: package 'magrittr' was built under R version 4.1.3

mtcars %>% 
  subset(mpg >= 30) %>% 
  summary()
##       mpg             cyl         disp             hp              drat      
##  Min.   :30.40   Min.   :4   Min.   :71.10   Min.   : 52.00   Min.   :3.770  
##  1st Qu.:30.40   1st Qu.:4   1st Qu.:74.55   1st Qu.: 61.75   1st Qu.:4.003  
##  Median :31.40   Median :4   Median :77.20   Median : 65.50   Median :4.150  
##  Mean   :31.77   Mean   :4   Mean   :80.15   Mean   : 74.00   Mean   :4.250  
##  3rd Qu.:32.77   3rd Qu.:4   3rd Qu.:82.80   3rd Qu.: 77.75   3rd Qu.:4.397  
##  Max.   :33.90   Max.   :4   Max.   :95.10   Max.   :113.00   Max.   :4.930  
##        wt             qsec             vs          am         gear     
##  Min.   :1.513   Min.   :16.90   Min.   :1   Min.   :1   Min.   :4.00  
##  1st Qu.:1.589   1st Qu.:18.11   1st Qu.:1   1st Qu.:1   1st Qu.:4.00  
##  Median :1.725   Median :19.00   Median :1   Median :1   Median :4.00  
##  Mean   :1.791   Mean   :18.70   Mean   :1   Mean   :1   Mean   :4.25  
##  3rd Qu.:1.926   3rd Qu.:19.58   3rd Qu.:1   3rd Qu.:1   3rd Qu.:4.25  
##  Max.   :2.200   Max.   :19.90   Max.   :1   Max.   :1   Max.   :5.00  
##       carb    
##  Min.   :1.0  
##  1st Qu.:1.0  
##  Median :1.5  
##  Mean   :1.5  
##  3rd Qu.:2.0  
##  Max.   :2.0

Using the pipe, it feels like we’re reading a paragraph in which each operation is a sentence. The order of the operations are clearly showcased, much like a Workflow. We can read the above code as follows: Take the mtcars data frame, keep only the observations where mpg greater or equal 30, finally make a summary of the data frame.

Ok, that’s nice but what if we wanted to pipe the L.H.S into the second or the third parameter (not only the first)? Well there’s an easy solution provided by magrittr: Use a . wherever you want to incorporate the result of the L.H.S Consider the following linear model:

summary(lm(formula = mpg ~ wt, 
   data = mtcars, 
   subset = wt >= 2))
## 
## Call:
## lm(formula = mpg ~ wt, data = mtcars, subset = wt >= 2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.9773 -1.8415 -0.1206  0.9795  8.2520 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  34.0607     2.3083   14.76 3.78e-14 ***
## wt           -4.5058     0.6539   -6.89 2.58e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.879 on 26 degrees of freedom
## Multiple R-squared:  0.6462, Adjusted R-squared:  0.6325 
## F-statistic: 47.48 on 1 and 26 DF,  p-value: 2.58e-07

Insted of using both the data and the subset parameter, we can just subset the data frame beforehand and incorporate the result within the data parameter.

Using the pipe, we can write:

mtcars %>% 
  subset(wt >= 2) %>% 
  lm(formula = mpg ~ wt, data = .) %>% 
  summary()
## 
## Call:
## lm(formula = mpg ~ wt, data = .)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.9773 -1.8415 -0.1206  0.9795  8.2520 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  34.0607     2.3083   14.76 3.78e-14 ***
## wt           -4.5058     0.6539   -6.89 2.58e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.879 on 26 degrees of freedom
## Multiple R-squared:  0.6462, Adjusted R-squared:  0.6325 
## F-statistic: 47.48 on 1 and 26 DF,  p-value: 2.58e-07

Exercise 1:

🧠 🧠 🧠 🧠 🧠 > Q1: We want to display a message to the console showcasing the structure (str) of a subset of the mtcars data frame (where wt is greater or equal than 2)

message(str(subset(mtcars, wt >= 2)))
## 'data.frame':    28 obs. of  11 variables:
##  $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
##  $ cyl : num  6 6 4 6 8 6 8 4 4 6 ...
##  $ disp: num  160 160 108 258 360 ...
##  $ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
##  $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
##  $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
##  $ qsec: num  16.5 17 18.6 19.4 17 ...
##  $ vs  : num  0 0 1 1 0 1 0 1 1 1 ...
##  $ am  : num  1 1 1 0 0 0 0 0 0 0 ...
##  $ gear: num  4 4 4 3 3 3 3 4 4 4 ...
##  $ carb: num  4 4 1 1 2 1 4 2 2 4 ...
## 

Rewrite the code chunk using the piping phylosophy.

Q2: Express in some sentences what’s happening in the piping process.

Q3: Repeat Q1 and Q2 with the following code chunk:

hist(subset(mtcars, am == "1")$mpg, 
     main = "distribution of the mpg variable", 
     sub = "All the vehicules are automatic", 
     xlab = "")

🧠 🧠 🧠 🧠 🧠

1. Column selection with select()

You might see it as trivial but column selection is an important toolkit to have. In many situations, you’ll have to deal with data frames that have more than 50 columns and trust me in those cases, selecting the needed columns is not easy, unless you use dplyr, then it becomes quite straightforward.

Let’s work with the penguins data frame from the palmerpenguins package. The data contains information about 344 penguins. There are 3 different species of penguins in this dataset, collected from 3 islands in the Palmer Archipelago, Antarctica. More on the data frame here.

library(palmerpenguins)
library(dplyr)
## Warning: package 'dplyr' was built under R version 4.1.3
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

penguins
## # A tibble: 344 x 8
##    species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
##    <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
##  1 Adelie  Torgersen           39.1          18.7               181        3750
##  2 Adelie  Torgersen           39.5          17.4               186        3800
##  3 Adelie  Torgersen           40.3          18                 195        3250
##  4 Adelie  Torgersen           NA            NA                  NA          NA
##  5 Adelie  Torgersen           36.7          19.3               193        3450
##  6 Adelie  Torgersen           39.3          20.6               190        3650
##  7 Adelie  Torgersen           38.9          17.8               181        3625
##  8 Adelie  Torgersen           39.2          19.6               195        4675
##  9 Adelie  Torgersen           34.1          18.1               193        3475
## 10 Adelie  Torgersen           42            20.2               190        4250
## # ... with 334 more rows, and 2 more variables: sex <fct>, year <int>

When loading a data frame, it’s always a good practice to inspect its content. In base R, we can use the str() function:

str(penguins)
## tibble [344 x 8] (S3: tbl_df/tbl/data.frame)
##  $ species          : Factor w/ 3 levels "Adelie","Chinstrap",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ island           : Factor w/ 3 levels "Biscoe","Dream",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ bill_length_mm   : num [1:344] 39.1 39.5 40.3 NA 36.7 39.3 38.9 39.2 34.1 42 ...
##  $ bill_depth_mm    : num [1:344] 18.7 17.4 18 NA 19.3 20.6 17.8 19.6 18.1 20.2 ...
##  $ flipper_length_mm: int [1:344] 181 186 195 NA 193 190 181 195 193 190 ...
##  $ body_mass_g      : int [1:344] 3750 3800 3250 NA 3450 3650 3625 4675 3475 4250 ...
##  $ sex              : Factor w/ 2 levels "female","male": 2 1 1 NA 1 2 1 2 NA NA ...
##  $ year             : int [1:344] 2007 2007 2007 2007 2007 2007 2007 2007 2007 2007 ...

In dplyr there is a more aesthetic function that achieves the same purpose, glimpse():

glimpse(penguins)
## Rows: 344
## Columns: 8
## $ species           <fct> Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Adel~
## $ island            <fct> Torgersen, Torgersen, Torgersen, Torgersen, Torgerse~
## $ bill_length_mm    <dbl> 39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9, 39.2, 34.1, ~
## $ bill_depth_mm     <dbl> 18.7, 17.4, 18.0, NA, 19.3, 20.6, 17.8, 19.6, 18.1, ~
## $ flipper_length_mm <int> 181, 186, 195, NA, 193, 190, 181, 195, 193, 190, 186~
## $ body_mass_g       <int> 3750, 3800, 3250, NA, 3450, 3650, 3625, 4675, 3475, ~
## $ sex               <fct> male, female, female, NA, female, male, female, male~
## $ year              <int> 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007~

Or in a dplyr friendly way:

penguins %>% glimpse()
## Rows: 344
## Columns: 8
## $ species           <fct> Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Adel~
## $ island            <fct> Torgersen, Torgersen, Torgersen, Torgersen, Torgerse~
## $ bill_length_mm    <dbl> 39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9, 39.2, 34.1, ~
## $ bill_depth_mm     <dbl> 18.7, 17.4, 18.0, NA, 19.3, 20.6, 17.8, 19.6, 18.1, ~
## $ flipper_length_mm <int> 181, 186, 195, NA, 193, 190, 181, 195, 193, 190, 186~
## $ body_mass_g       <int> 3750, 3800, 3250, NA, 3450, 3650, 3625, 4675, 3475, ~
## $ sex               <fct> male, female, female, NA, female, male, female, male~
## $ year              <int> 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007~

Now, let’s talk about the selection features in dplyr which provides a unique function to select columns, the select() function. Let’s say that we’re interested only in the species column. We can select as follows:

penguins %>% select(species)
## # A tibble: 344 x 1
##    species
##    <fct>  
##  1 Adelie 
##  2 Adelie 
##  3 Adelie 
##  4 Adelie 
##  5 Adelie 
##  6 Adelie 
##  7 Adelie 
##  8 Adelie 
##  9 Adelie 
## 10 Adelie 
## # ... with 334 more rows

Note that dplyr and the tidyverse in general uses the so called Non Standard Evaluation. As such, you don’t need to provide the name of the column as a character string, but if needed you can:

penguins %>% select("species")
## # A tibble: 344 x 1
##    species
##    <fct>  
##  1 Adelie 
##  2 Adelie 
##  3 Adelie 
##  4 Adelie 
##  5 Adelie 
##  6 Adelie 
##  7 Adelie 
##  8 Adelie 
##  9 Adelie 
## 10 Adelie 
## # ... with 334 more rows

You can also select a column by its position:

penguins %>% select(1)
## # A tibble: 344 x 1
##    species
##    <fct>  
##  1 Adelie 
##  2 Adelie 
##  3 Adelie 
##  4 Adelie 
##  5 Adelie 
##  6 Adelie 
##  7 Adelie 
##  8 Adelie 
##  9 Adelie 
## 10 Adelie 
## # ... with 334 more rows

Now, let’s talk about how to deselect a column, meaning drop it and keep all other columns in the data frame. This is easiy achievable by providing the minus sign - or an exclamation point !at the beginning of the column name to drop. To illustrate it, let’s say we’re interested in all the variables except the island. We can drop it with:

penguins %>% select(-island)
## # A tibble: 344 x 7
##    species bill_length_mm bill_depth_mm flipper_length_~ body_mass_g sex    year
##    <fct>            <dbl>         <dbl>            <int>       <int> <fct> <int>
##  1 Adelie            39.1          18.7              181        3750 male   2007
##  2 Adelie            39.5          17.4              186        3800 fema~  2007
##  3 Adelie            40.3          18                195        3250 fema~  2007
##  4 Adelie            NA            NA                 NA          NA <NA>   2007
##  5 Adelie            36.7          19.3              193        3450 fema~  2007
##  6 Adelie            39.3          20.6              190        3650 male   2007
##  7 Adelie            38.9          17.8              181        3625 fema~  2007
##  8 Adelie            39.2          19.6              195        4675 male   2007
##  9 Adelie            34.1          18.1              193        3475 <NA>   2007
## 10 Adelie            42            20.2              190        4250 <NA>   2007
## # ... with 334 more rows

or with

penguins %>% select(!island)
## # A tibble: 344 x 7
##    species bill_length_mm bill_depth_mm flipper_length_~ body_mass_g sex    year
##    <fct>            <dbl>         <dbl>            <int>       <int> <fct> <int>
##  1 Adelie            39.1          18.7              181        3750 male   2007
##  2 Adelie            39.5          17.4              186        3800 fema~  2007
##  3 Adelie            40.3          18                195        3250 fema~  2007
##  4 Adelie            NA            NA                 NA          NA <NA>   2007
##  5 Adelie            36.7          19.3              193        3450 fema~  2007
##  6 Adelie            39.3          20.6              190        3650 male   2007
##  7 Adelie            38.9          17.8              181        3625 fema~  2007
##  8 Adelie            39.2          19.6              195        4675 male   2007
##  9 Adelie            34.1          18.1              193        3475 <NA>   2007
## 10 Adelie            42            20.2              190        4250 <NA>   2007
## # ... with 334 more rows

We can also drop by position:

penguins %>% select(-2)
## # A tibble: 344 x 7
##    species bill_length_mm bill_depth_mm flipper_length_~ body_mass_g sex    year
##    <fct>            <dbl>         <dbl>            <int>       <int> <fct> <int>
##  1 Adelie            39.1          18.7              181        3750 male   2007
##  2 Adelie            39.5          17.4              186        3800 fema~  2007
##  3 Adelie            40.3          18                195        3250 fema~  2007
##  4 Adelie            NA            NA                 NA          NA <NA>   2007
##  5 Adelie            36.7          19.3              193        3450 fema~  2007
##  6 Adelie            39.3          20.6              190        3650 male   2007
##  7 Adelie            38.9          17.8              181        3625 fema~  2007
##  8 Adelie            39.2          19.6              195        4675 male   2007
##  9 Adelie            34.1          18.1              193        3475 <NA>   2007
## 10 Adelie            42            20.2              190        4250 <NA>   2007
## # ... with 334 more rows

It is easy to select a set of columns, just put the selection inside the vector function c(). Suppose, we want to keep the species, the body_mass_g, the sex and the year variables:

penguins %>% 
  select(c(species, body_mass_g, sex, year))
## # A tibble: 344 x 4
##    species body_mass_g sex     year
##    <fct>         <int> <fct>  <int>
##  1 Adelie         3750 male    2007
##  2 Adelie         3800 female  2007
##  3 Adelie         3250 female  2007
##  4 Adelie           NA <NA>    2007
##  5 Adelie         3450 female  2007
##  6 Adelie         3650 male    2007
##  7 Adelie         3625 female  2007
##  8 Adelie         4675 male    2007
##  9 Adelie         3475 <NA>    2007
## 10 Adelie         4250 <NA>    2007
## # ... with 334 more rows

We can also select a range of columns as follows:

penguins %>% select(species:bill_depth_mm)
## # A tibble: 344 x 4
##    species island    bill_length_mm bill_depth_mm
##    <fct>   <fct>              <dbl>         <dbl>
##  1 Adelie  Torgersen           39.1          18.7
##  2 Adelie  Torgersen           39.5          17.4
##  3 Adelie  Torgersen           40.3          18  
##  4 Adelie  Torgersen           NA            NA  
##  5 Adelie  Torgersen           36.7          19.3
##  6 Adelie  Torgersen           39.3          20.6
##  7 Adelie  Torgersen           38.9          17.8
##  8 Adelie  Torgersen           39.2          19.6
##  9 Adelie  Torgersen           34.1          18.1
## 10 Adelie  Torgersen           42            20.2
## # ... with 334 more rows

It’s also possible to mix individual with range selections:

penguins %>% select(c(species:bill_depth_mm, year))
## # A tibble: 344 x 5
##    species island    bill_length_mm bill_depth_mm  year
##    <fct>   <fct>              <dbl>         <dbl> <int>
##  1 Adelie  Torgersen           39.1          18.7  2007
##  2 Adelie  Torgersen           39.5          17.4  2007
##  3 Adelie  Torgersen           40.3          18    2007
##  4 Adelie  Torgersen           NA            NA    2007
##  5 Adelie  Torgersen           36.7          19.3  2007
##  6 Adelie  Torgersen           39.3          20.6  2007
##  7 Adelie  Torgersen           38.9          17.8  2007
##  8 Adelie  Torgersen           39.2          19.6  2007
##  9 Adelie  Torgersen           34.1          18.1  2007
## 10 Adelie  Torgersen           42            20.2  2007
## # ... with 334 more rows

In the same way, it is possible to drop a range of columns:

penguins %>% select(!(bill_length_mm:year))
## # A tibble: 344 x 2
##    species island   
##    <fct>   <fct>    
##  1 Adelie  Torgersen
##  2 Adelie  Torgersen
##  3 Adelie  Torgersen
##  4 Adelie  Torgersen
##  5 Adelie  Torgersen
##  6 Adelie  Torgersen
##  7 Adelie  Torgersen
##  8 Adelie  Torgersen
##  9 Adelie  Torgersen
## 10 Adelie  Torgersen
## # ... with 334 more rows

The same principles apply when using columns position. In the following, we select the range from column 1 to column3:

penguins %>% select(1:3)
## # A tibble: 344 x 3
##    species island    bill_length_mm
##    <fct>   <fct>              <dbl>
##  1 Adelie  Torgersen           39.1
##  2 Adelie  Torgersen           39.5
##  3 Adelie  Torgersen           40.3
##  4 Adelie  Torgersen           NA  
##  5 Adelie  Torgersen           36.7
##  6 Adelie  Torgersen           39.3
##  7 Adelie  Torgersen           38.9
##  8 Adelie  Torgersen           39.2
##  9 Adelie  Torgersen           34.1
## 10 Adelie  Torgersen           42  
## # ... with 334 more rows

Exercise 2:

🧠 🧠 🧠 🧠 🧠 > The ggplot2 package provides a data frame called txhousing.

Q1: Select the following columns: city, sales, volume, median, listings, inventory

Q2: Select the city, year, sales and median variables. Make sure that the year variable is the first column from the left and sales is the last.

Q3: It is possible to rename colums on the fly during the selection process. Repeat Q2 but this time, rename sales to number_of_sales

Q4: Select all the columns. Put the month and year variables at the beginning of the data frame, dont’t change the order of the remaining columns. 🧠 🧠 🧠 🧠 🧠

dplyr also provides some helper functions that allow us to use a string pattern to select a specific set of columns. Suppose that we want to select the columns that contain the word “bill”. We can use the contains() function as follows:

penguins %>% select(contains("bill"))
## # A tibble: 344 x 2
##    bill_length_mm bill_depth_mm
##             <dbl>         <dbl>
##  1           39.1          18.7
##  2           39.5          17.4
##  3           40.3          18  
##  4           NA            NA  
##  5           36.7          19.3
##  6           39.3          20.6
##  7           38.9          17.8
##  8           39.2          19.6
##  9           34.1          18.1
## 10           42            20.2
## # ... with 334 more rows

We can also use the starts_with() and ends_with() functions. Let’s say we want to select the columns that end with the mm character:

penguins %>% select(ends_with("mm"))
## # A tibble: 344 x 3
##    bill_length_mm bill_depth_mm flipper_length_mm
##             <dbl>         <dbl>             <int>
##  1           39.1          18.7               181
##  2           39.5          17.4               186
##  3           40.3          18                 195
##  4           NA            NA                  NA
##  5           36.7          19.3               193
##  6           39.3          20.6               190
##  7           38.9          17.8               181
##  8           39.2          19.6               195
##  9           34.1          18.1               193
## 10           42            20.2               190
## # ... with 334 more rows

We can mix the these helper functions with simple selection, so if we want to get the species variables and all the columns that start with the character “flipper”, we can do the following:

penguins %>% select(species, starts_with("flipper"))
## # A tibble: 344 x 2
##    species flipper_length_mm
##    <fct>               <int>
##  1 Adelie                181
##  2 Adelie                186
##  3 Adelie                195
##  4 Adelie                 NA
##  5 Adelie                193
##  6 Adelie                190
##  7 Adelie                181
##  8 Adelie                195
##  9 Adelie                193
## 10 Adelie                190
## # ... with 334 more rows

Exercise 3:

🧠 🧠 🧠 🧠 🧠

The radous package allows you to get a data frame about fake personal information. In order to get the data we use the following function:

library(radous)

rad_data <- get_data(n = 200)
## Rows: 200 Columns: 34-- Column specification --------------------------------------------------------
## Delimiter: ","
## chr  (27): gender, name.title, name.first, name.last, location.street.name, ...
## dbl   (5): location.street.number, location.coordinates.latitude, location.c...
## dttm  (2): dob.date, registered.date
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.

rad_data
## # A tibble: 200 x 34
##    gender name_title   name_first name_last location_street_nu~ location_street~
##    <chr>  <chr>        <chr>      <chr>                   <dbl> <chr>           
##  1 female Mademoiselle Luisa      Lecomte                  8496 Rue Abel-Ferry  
##  2 female Miss         Toligniva  Govera                    677 Chervona kalina 
##  3 male   Monsieur     Mike       Morin                    3773 Rue de L'Abbé-P~
##  4 female Mrs          Lauren     Morrison                 2082 Henry Street    
##  5 male   Mr           Viljami    Rintala                  6094 Verkatehtaankatu
##  6 female Ms           Christine  Chavez                   6176 Marsh Ln        
##  7 female Mrs          Carolyn    Matthews                 8472 Church Lane     
##  8 male   Mr           Demid      Zyablyuk                 1484 Zaporizke shose 
##  9 male   Mr           Roberto    Vázquez                   926 Calle Nebrija   
## 10 male   Mr           Eilif      Sandtorv                 2479 Husebybakken    
## # ... with 190 more rows, and 28 more variables: location_city <chr>,
## #   location_state <chr>, location_country <chr>, location_postcode <chr>,
## #   location_coordinates_latitude <dbl>, location_coordinates_longitude <dbl>,
## #   location_timezone_offset <chr>, location_timezone_description <chr>,
## #   email <chr>, login_uuid <chr>, login_username <chr>, login_password <chr>,
## #   login_salt <chr>, login_md5 <chr>, login_sha1 <chr>, login_sha256 <chr>,
## #   dob_date <dttm>, dob_age <dbl>, registered_date <dttm>, ...

We can get a list of the column names with:

names(rad_data)
##  [1] "gender"                         "name_title"                    
##  [3] "name_first"                     "name_last"                     
##  [5] "location_street_number"         "location_street_name"          
##  [7] "location_city"                  "location_state"                
##  [9] "location_country"               "location_postcode"             
## [11] "location_coordinates_latitude"  "location_coordinates_longitude"
## [13] "location_timezone_offset"       "location_timezone_description" 
## [15] "email"                          "login_uuid"                    
## [17] "login_username"                 "login_password"                
## [19] "login_salt"                     "login_md5"                     
## [21] "login_sha1"                     "login_sha256"                  
## [23] "dob_date"                       "dob_age"                       
## [25] "registered_date"                "registered_age"                
## [27] "phone"                          "cell"                          
## [29] "id_name"                        "id_value"                      
## [31] "picture_large"                  "picture_medium"                
## [33] "picture_thumbnail"              "nat"

Q1: Select all the columns that contain information about the date of birth of the person (see dob). Q2: Select all the columns that contain an id. Q3: Select all the columns that contain information about the name of the person and its location. Q4: Select all the columns except those that relate to the picture of the person. 🧠 🧠 🧠 🧠 🧠

We achieve powerful selections when using select() in conjunction with the where() function. Let’s work withe the mpg data frame from the ggplot2 package:

library(ggplot2)
## Warning: package 'ggplot2' was built under R version 4.1.3

mpg
## # A tibble: 234 x 11
##    manufacturer model      displ  year   cyl trans drv     cty   hwy fl    class
##    <chr>        <chr>      <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr>
##  1 audi         a4           1.8  1999     4 auto~ f        18    29 p     comp~
##  2 audi         a4           1.8  1999     4 manu~ f        21    29 p     comp~
##  3 audi         a4           2    2008     4 manu~ f        20    31 p     comp~
##  4 audi         a4           2    2008     4 auto~ f        21    30 p     comp~
##  5 audi         a4           2.8  1999     6 auto~ f        16    26 p     comp~
##  6 audi         a4           2.8  1999     6 manu~ f        18    26 p     comp~
##  7 audi         a4           3.1  2008     6 auto~ f        18    27 p     comp~
##  8 audi         a4 quattro   1.8  1999     4 manu~ 4        18    26 p     comp~
##  9 audi         a4 quattro   1.8  1999     4 auto~ 4        16    25 p     comp~
## 10 audi         a4 quattro   2    2008     4 manu~ 4        20    28 p     comp~
## # ... with 224 more rows

Suppose that we want to select the character string columns only, we can do:

mpg %>% select(where(is.character))
## # A tibble: 234 x 6
##    manufacturer model      trans      drv   fl    class  
##    <chr>        <chr>      <chr>      <chr> <chr> <chr>  
##  1 audi         a4         auto(l5)   f     p     compact
##  2 audi         a4         manual(m5) f     p     compact
##  3 audi         a4         manual(m6) f     p     compact
##  4 audi         a4         auto(av)   f     p     compact
##  5 audi         a4         auto(l5)   f     p     compact
##  6 audi         a4         manual(m5) f     p     compact
##  7 audi         a4         auto(av)   f     p     compact
##  8 audi         a4 quattro manual(m5) 4     p     compact
##  9 audi         a4 quattro auto(l5)   4     p     compact
## 10 audi         a4 quattro manual(m6) 4     p     compact
## # ... with 224 more rows

Of course, you can mix the where() selection with other types of selection:

mpg %>% select(year, 
               where(is.character) & starts_with("m"))
## # A tibble: 234 x 3
##     year manufacturer model     
##    <int> <chr>        <chr>     
##  1  1999 audi         a4        
##  2  1999 audi         a4        
##  3  2008 audi         a4        
##  4  2008 audi         a4        
##  5  1999 audi         a4        
##  6  1999 audi         a4        
##  7  2008 audi         a4        
##  8  1999 audi         a4 quattro
##  9  1999 audi         a4 quattro
## 10  2008 audi         a4 quattro
## # ... with 224 more rows

Finally, it is possible to select columns based on an expression:

In the mpg data frame, suppose we want to keep only the columns in which the median is below 20:

mpg %>% 
  select(where(is.numeric)) %>% 
  select(where(~median(.) < 20))
## # A tibble: 234 x 3
##    displ   cyl   cty
##    <dbl> <int> <int>
##  1   1.8     4    18
##  2   1.8     4    21
##  3   2       4    20
##  4   2       4    21
##  5   2.8     6    16
##  6   2.8     6    18
##  7   3.1     6    18
##  8   1.8     4    18
##  9   1.8     4    16
## 10   2       4    20
## # ... with 224 more rows

Exercise 4:

🧠 🧠 🧠 🧠 🧠 > Use again the radous package to generate some data. > Q1: Select all the columns except those of type character. > Q2: Select only the columns that contain name of location information about a person. > Q3: Select the columns that contain the name information and the numeric columns. > Q4: Above, we have used the following code chunk:

mpg %>% 
  select(where(is.numeric)) %>% 
  select(where(~median(.) < 20))

Try to rewrite it in a more compact way.

🧠 🧠 🧠 🧠 🧠

2. Filtering data with filter()

The filter function in dplyr allows us to keep a specific set of the data according a certain condition. Working with the mpg data frame, we can for example filter (keep) the vehicles that have a cty greater than or equal to 20:

mpg %>% filter(cty >= 20)
## # A tibble: 56 x 11
##    manufacturer model      displ  year   cyl trans drv     cty   hwy fl    class
##    <chr>        <chr>      <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr>
##  1 audi         a4           1.8  1999     4 manu~ f        21    29 p     comp~
##  2 audi         a4           2    2008     4 manu~ f        20    31 p     comp~
##  3 audi         a4           2    2008     4 auto~ f        21    30 p     comp~
##  4 audi         a4 quattro   2    2008     4 manu~ 4        20    28 p     comp~
##  5 chevrolet    malibu       2.4  2008     4 auto~ f        22    30 r     mids~
##  6 honda        civic        1.6  1999     4 manu~ f        28    33 r     subc~
##  7 honda        civic        1.6  1999     4 auto~ f        24    32 r     subc~
##  8 honda        civic        1.6  1999     4 manu~ f        25    32 r     subc~
##  9 honda        civic        1.6  1999     4 manu~ f        23    29 p     subc~
## 10 honda        civic        1.6  1999     4 auto~ f        24    32 r     subc~
## # ... with 46 more rows

We can put several conditions inside the filter() function. For example, we can subset the vehicles that have a cty greater than or equal to 20 and which have been produced in 2008:


mpg %>% filter(cty >= 20 & year == 2008)
## # A tibble: 34 x 11
##    manufacturer model      displ  year   cyl trans drv     cty   hwy fl    class
##    <chr>        <chr>      <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr>
##  1 audi         a4           2    2008     4 manu~ f        20    31 p     comp~
##  2 audi         a4           2    2008     4 auto~ f        21    30 p     comp~
##  3 audi         a4 quattro   2    2008     4 manu~ 4        20    28 p     comp~
##  4 chevrolet    malibu       2.4  2008     4 auto~ f        22    30 r     mids~
##  5 honda        civic        1.8  2008     4 manu~ f        26    34 r     subc~
##  6 honda        civic        1.8  2008     4 auto~ f        25    36 r     subc~
##  7 honda        civic        1.8  2008     4 auto~ f        24    36 c     subc~
##  8 honda        civic        2    2008     4 manu~ f        21    29 p     subc~
##  9 hyundai      sonata       2.4  2008     4 auto~ f        21    30 r     mids~
## 10 hyundai      sonata       2.4  2008     4 manu~ f        21    31 r     mids~
## # ... with 24 more rows

We can also use union selection, for example, we might be interested to keep only German vehicles:

mpg %>% filter(
  manufacturer == "audi" | manufacturer == "volkswagen"
)
## # A tibble: 45 x 11
##    manufacturer model      displ  year   cyl trans drv     cty   hwy fl    class
##    <chr>        <chr>      <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr>
##  1 audi         a4           1.8  1999     4 auto~ f        18    29 p     comp~
##  2 audi         a4           1.8  1999     4 manu~ f        21    29 p     comp~
##  3 audi         a4           2    2008     4 manu~ f        20    31 p     comp~
##  4 audi         a4           2    2008     4 auto~ f        21    30 p     comp~
##  5 audi         a4           2.8  1999     6 auto~ f        16    26 p     comp~
##  6 audi         a4           2.8  1999     6 manu~ f        18    26 p     comp~
##  7 audi         a4           3.1  2008     6 auto~ f        18    27 p     comp~
##  8 audi         a4 quattro   1.8  1999     4 manu~ 4        18    26 p     comp~
##  9 audi         a4 quattro   1.8  1999     4 auto~ 4        16    25 p     comp~
## 10 audi         a4 quattro   2    2008     4 manu~ 4        20    28 p     comp~
## # ... with 35 more rows

Above, we’re asking filter() to keep the observations where the manufacturer is either audi or volkswagen.

Sometimes, we’re interested in filtering a range of values, so let’s we want to keep the cars with a cty between 20 and 30 (both not included). We can write:

mpg %>% 
  filter(cty > 20 & cty < 30)
## # A tibble: 43 x 11
##    manufacturer model  displ  year   cyl trans     drv     cty   hwy fl    class
##    <chr>        <chr>  <dbl> <int> <int> <chr>     <chr> <int> <int> <chr> <chr>
##  1 audi         a4       1.8  1999     4 manual(m~ f        21    29 p     comp~
##  2 audi         a4       2    2008     4 auto(av)  f        21    30 p     comp~
##  3 chevrolet    malibu   2.4  2008     4 auto(l4)  f        22    30 r     mids~
##  4 honda        civic    1.6  1999     4 manual(m~ f        28    33 r     subc~
##  5 honda        civic    1.6  1999     4 auto(l4)  f        24    32 r     subc~
##  6 honda        civic    1.6  1999     4 manual(m~ f        25    32 r     subc~
##  7 honda        civic    1.6  1999     4 manual(m~ f        23    29 p     subc~
##  8 honda        civic    1.6  1999     4 auto(l4)  f        24    32 r     subc~
##  9 honda        civic    1.8  2008     4 manual(m~ f        26    34 r     subc~
## 10 honda        civic    1.8  2008     4 auto(l5)  f        25    36 r     subc~
## # ... with 33 more rows

We can also use the convenient between() function however be aware the between() will include the lower and the upper limit. So, if we want to select cars that have a cty between 20 and 30 both included:

mpg %>% 
  filter(between(cty, left = 20, right = 30))
## # A tibble: 54 x 11
##    manufacturer model      displ  year   cyl trans drv     cty   hwy fl    class
##    <chr>        <chr>      <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr>
##  1 audi         a4           1.8  1999     4 manu~ f        21    29 p     comp~
##  2 audi         a4           2    2008     4 manu~ f        20    31 p     comp~
##  3 audi         a4           2    2008     4 auto~ f        21    30 p     comp~
##  4 audi         a4 quattro   2    2008     4 manu~ 4        20    28 p     comp~
##  5 chevrolet    malibu       2.4  2008     4 auto~ f        22    30 r     mids~
##  6 honda        civic        1.6  1999     4 manu~ f        28    33 r     subc~
##  7 honda        civic        1.6  1999     4 auto~ f        24    32 r     subc~
##  8 honda        civic        1.6  1999     4 manu~ f        25    32 r     subc~
##  9 honda        civic        1.6  1999     4 manu~ f        23    29 p     subc~
## 10 honda        civic        1.6  1999     4 auto~ f        24    32 r     subc~
## # ... with 44 more rows

Sometimes, you want to filter around a specific value. For example, we might want to keep only the vehicles with an hwy near 30 with a tolerance of +2/-2. We can achieve that using the near() function:

mpg %>% 
  filter(near(x = hwy, y = 30, tol = 2))
## # A tibble: 33 x 11
##    manufacturer model  displ  year   cyl trans     drv     cty   hwy fl    class
##    <chr>        <chr>  <dbl> <int> <int> <chr>     <chr> <int> <int> <chr> <chr>
##  1 audi         a4       1.8  1999     4 auto(l5)  f        18    29 p     comp~
##  2 audi         a4       1.8  1999     4 manual(m~ f        21    29 p     comp~
##  3 audi         a4       2    2008     4 manual(m~ f        20    31 p     comp~
##  4 audi         a4       2    2008     4 auto(av)  f        21    30 p     comp~
##  5 chevrolet    malibu   2.4  2008     4 auto(l4)  f        22    30 r     mids~
##  6 chevrolet    malibu   3.5  2008     6 auto(l4)  f        18    29 r     mids~
##  7 honda        civic    1.6  1999     4 manual(m~ f        23    29 p     subc~
##  8 honda        civic    2    2008     4 manual(m~ f        21    29 p     subc~
##  9 hyundai      sonata   2.4  2008     4 auto(l4)  f        21    30 r     mids~
## 10 hyundai      sonata   2.4  2008     4 manual(m~ f        21    31 r     mids~
## # ... with 23 more rows

Exercise 5:

🧠 🧠 🧠 🧠 🧠 > Q1: From the mpg data frame, keep only the German cars that are produced in 2008.

Q2: From the penguins data frame, keep the penguins recorded in the Torgersen island that have bill lingth greater or equal than 40mm. Then select the column species, island, body_mass_g and sex.

Q3: Filter (keep) the female penguins, then select only the numeric columns.

Q4: Now, keep only the female Chinstrap penguins with a body mass inferior than 3100 gramme and with a bill length of at least 40mm. Finally, keep only the factors columns and those that are measured in mm.

Q5: Filter the Chinstrap penguins with a body_mass_g between 4000 and 4500 (both included). Finally select the body_mass_g column. (Bonus point if you can use the between() function)

Q6: Filter the penguins that have a body_mass_g near 4000. Give it a tolerance of 2 standard deviation. 🧠 🧠 🧠 🧠 🧠

3. Renaming columns with rename()

We’ve seen that we can rename columns in the selection process. Nonetheless, rename() allows to rename columns in a more convenient way:

penguins %>% 
  rename(penguin_spec = species, location = island)
## # A tibble: 344 x 8
##    penguin_spec location  bill_length_mm bill_depth_mm flipper_length_mm
##    <fct>        <fct>              <dbl>         <dbl>             <int>
##  1 Adelie       Torgersen           39.1          18.7               181
##  2 Adelie       Torgersen           39.5          17.4               186
##  3 Adelie       Torgersen           40.3          18                 195
##  4 Adelie       Torgersen           NA            NA                  NA
##  5 Adelie       Torgersen           36.7          19.3               193
##  6 Adelie       Torgersen           39.3          20.6               190
##  7 Adelie       Torgersen           38.9          17.8               181
##  8 Adelie       Torgersen           39.2          19.6               195
##  9 Adelie       Torgersen           34.1          18.1               193
## 10 Adelie       Torgersen           42            20.2               190
## # ... with 334 more rows, and 3 more variables: body_mass_g <int>, sex <fct>,
## #   year <int>

The pattern is clear, the new name is on the left and the old name on the right.

Exercise 6:

🧠 🧠 🧠 🧠 🧠 > Q1: Choose 3 columns from the Penguins data frame of your choice and rename them using the rename() function. 🧠 🧠 🧠 🧠 🧠

4. Getting distinct values with distinct() and n_distinct()

distinct() returns all the values available inside a variable nonetheless it drops the duplicates, you get distinct values. Consider the following example, we want to know which year observations are available within the penguins data frame:

penguins %>% distinct(year)
## # A tibble: 3 x 1
##    year
##   <int>
## 1  2007
## 2  2008
## 3  2009

n_distinct() on the other hand counts the number of unique values available within a column. So if we use it in our previous example, we get:

penguins %>% pull(year) %>% n_distinct()
## [1] 3

Suppose we want to grab all the columns that take a limited number of values, let’s say 4 or below. We can mix distict() with where() as follows:

penguins %>% 
  select(where(~n_distinct(.) < 4))
## # A tibble: 344 x 4
##    species island    sex     year
##    <fct>   <fct>     <fct>  <int>
##  1 Adelie  Torgersen male    2007
##  2 Adelie  Torgersen female  2007
##  3 Adelie  Torgersen female  2007
##  4 Adelie  Torgersen <NA>    2007
##  5 Adelie  Torgersen female  2007
##  6 Adelie  Torgersen male    2007
##  7 Adelie  Torgersen female  2007
##  8 Adelie  Torgersen male    2007
##  9 Adelie  Torgersen <NA>    2007
## 10 Adelie  Torgersen <NA>    2007
## # ... with 334 more rows

Exercise 7:

🧠 🧠 🧠 🧠 🧠 > Q1: From the mpg data frame choose only the columns that can take only 10 unique values or below.

Q2: Using distinct() list the unique values that can be taken by the manufacturer column, do the same with model. 🧠 🧠 🧠 🧠 🧠

5. Column sorting with arrange()

It is possible to order a data frame according to the values of a specific column. Using the arrange() function we can for example sort the mpg vehicles according the displ variable:

mpg %>% 
  arrange(displ)
## # A tibble: 234 x 11
##    manufacturer model      displ  year   cyl trans drv     cty   hwy fl    class
##    <chr>        <chr>      <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr>
##  1 honda        civic        1.6  1999     4 manu~ f        28    33 r     subc~
##  2 honda        civic        1.6  1999     4 auto~ f        24    32 r     subc~
##  3 honda        civic        1.6  1999     4 manu~ f        25    32 r     subc~
##  4 honda        civic        1.6  1999     4 manu~ f        23    29 p     subc~
##  5 honda        civic        1.6  1999     4 auto~ f        24    32 r     subc~
##  6 audi         a4           1.8  1999     4 auto~ f        18    29 p     comp~
##  7 audi         a4           1.8  1999     4 manu~ f        21    29 p     comp~
##  8 audi         a4 quattro   1.8  1999     4 manu~ 4        18    26 p     comp~
##  9 audi         a4 quattro   1.8  1999     4 auto~ 4        16    25 p     comp~
## 10 honda        civic        1.8  2008     4 manu~ f        26    34 r     subc~
## # ... with 224 more rows

As you can see from above the ordering is ascending order (from the lowest to the highest value), nonetheless, using desc() we can specify than we want a descending ordering:

mpg %>% 
  arrange(desc(displ))
## # A tibble: 234 x 11
##    manufacturer model      displ  year   cyl trans drv     cty   hwy fl    class
##    <chr>        <chr>      <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr>
##  1 chevrolet    corvette     7    2008     8 manu~ r        15    24 p     2sea~
##  2 chevrolet    k1500 tah~   6.5  1999     8 auto~ 4        14    17 d     suv  
##  3 chevrolet    corvette     6.2  2008     8 manu~ r        16    26 p     2sea~
##  4 chevrolet    corvette     6.2  2008     8 auto~ r        15    25 p     2sea~
##  5 jeep         grand che~   6.1  2008     8 auto~ 4        11    14 p     suv  
##  6 chevrolet    c1500 sub~   6    2008     8 auto~ r        12    17 r     suv  
##  7 dodge        durango 4~   5.9  1999     8 auto~ 4        11    15 r     suv  
##  8 dodge        ram 1500 ~   5.9  1999     8 auto~ 4        11    15 r     pick~
##  9 chevrolet    c1500 sub~   5.7  1999     8 auto~ r        13    17 r     suv  
## 10 chevrolet    corvette     5.7  1999     8 manu~ r        16    26 p     2sea~
## # ... with 224 more rows

We can also arrange character columns by alphabetical order:

mpg %>% 
  arrange(manufacturer)
## # A tibble: 234 x 11
##    manufacturer model      displ  year   cyl trans drv     cty   hwy fl    class
##    <chr>        <chr>      <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr>
##  1 audi         a4           1.8  1999     4 auto~ f        18    29 p     comp~
##  2 audi         a4           1.8  1999     4 manu~ f        21    29 p     comp~
##  3 audi         a4           2    2008     4 manu~ f        20    31 p     comp~
##  4 audi         a4           2    2008     4 auto~ f        21    30 p     comp~
##  5 audi         a4           2.8  1999     6 auto~ f        16    26 p     comp~
##  6 audi         a4           2.8  1999     6 manu~ f        18    26 p     comp~
##  7 audi         a4           3.1  2008     6 auto~ f        18    27 p     comp~
##  8 audi         a4 quattro   1.8  1999     4 manu~ 4        18    26 p     comp~
##  9 audi         a4 quattro   1.8  1999     4 auto~ 4        16    25 p     comp~
## 10 audi         a4 quattro   2    2008     4 manu~ 4        20    28 p     comp~
## # ... with 224 more rows

We can arrange several columns at the sime time:

mpg %>%  
  arrange(manufacturer, year)
## # A tibble: 234 x 11
##    manufacturer model      displ  year   cyl trans drv     cty   hwy fl    class
##    <chr>        <chr>      <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr>
##  1 audi         a4           1.8  1999     4 auto~ f        18    29 p     comp~
##  2 audi         a4           1.8  1999     4 manu~ f        21    29 p     comp~
##  3 audi         a4           2.8  1999     6 auto~ f        16    26 p     comp~
##  4 audi         a4           2.8  1999     6 manu~ f        18    26 p     comp~
##  5 audi         a4 quattro   1.8  1999     4 manu~ 4        18    26 p     comp~
##  6 audi         a4 quattro   1.8  1999     4 auto~ 4        16    25 p     comp~
##  7 audi         a4 quattro   2.8  1999     6 auto~ 4        15    25 p     comp~
##  8 audi         a4 quattro   2.8  1999     6 manu~ 4        17    25 p     comp~
##  9 audi         a6 quattro   2.8  1999     6 auto~ 4        15    24 p     mids~
## 10 audi         a4           2    2008     4 manu~ f        20    31 p     comp~
## # ... with 224 more rows

As you can see from above, the manufacturer column is first ordered, then the year column.

Exercise 8:

🧠 🧠 🧠 🧠 🧠 > Q1: From the penguins data frame, select all the columns that contains the character mm, then keep only the column where the mean is below 50. Finally arrange the values in descending order by the bill_length_mm variable. > Q2: Now, the repeat Q1 however this time, sort the final data frame according to its first column no matter its content. 🧠 🧠 🧠 🧠 🧠

  1. Dropping NA values with drop_na() from tidyr

NA is an abbreviation for Non Available and they are used in R to exhibit the fact that a value is a missing. Let’s take a look at the penguins data frame again:

penguins
## # A tibble: 344 x 8
##    species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
##    <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
##  1 Adelie  Torgersen           39.1          18.7               181        3750
##  2 Adelie  Torgersen           39.5          17.4               186        3800
##  3 Adelie  Torgersen           40.3          18                 195        3250
##  4 Adelie  Torgersen           NA            NA                  NA          NA
##  5 Adelie  Torgersen           36.7          19.3               193        3450
##  6 Adelie  Torgersen           39.3          20.6               190        3650
##  7 Adelie  Torgersen           38.9          17.8               181        3625
##  8 Adelie  Torgersen           39.2          19.6               195        4675
##  9 Adelie  Torgersen           34.1          18.1               193        3475
## 10 Adelie  Torgersen           42            20.2               190        4250
## # ... with 334 more rows, and 2 more variables: sex <fct>, year <int>

We can see that there are some NA values in it. In order to get the number of NA values in each column, the simplest way is to use the base R colSums() function that calculate the sum of each numeric column within a data frame:

colSums(is.na(penguins))
##           species            island    bill_length_mm     bill_depth_mm 
##                 0                 0                 2                 2 
## flipper_length_mm       body_mass_g               sex              year 
##                 2                 2                11                 0

From above, we can see that we have exactly 19 NAs:

na_val <- colSums(is.na(penguins))

sum(na_val)
## [1] 19

We can drop all the rows that contains NA values using the drop_na() function:

library(tidyr)
## Warning: package 'tidyr' was built under R version 4.1.3
## 
## Attaching package: 'tidyr'
## The following object is masked from 'package:magrittr':
## 
##     extract
peng_no_na <- penguins %>% drop_na()

peng_no_na
## # A tibble: 333 x 8
##    species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
##    <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
##  1 Adelie  Torgersen           39.1          18.7               181        3750
##  2 Adelie  Torgersen           39.5          17.4               186        3800
##  3 Adelie  Torgersen           40.3          18                 195        3250
##  4 Adelie  Torgersen           36.7          19.3               193        3450
##  5 Adelie  Torgersen           39.3          20.6               190        3650
##  6 Adelie  Torgersen           38.9          17.8               181        3625
##  7 Adelie  Torgersen           39.2          19.6               195        4675
##  8 Adelie  Torgersen           41.1          17.6               182        3200
##  9 Adelie  Torgersen           38.6          21.2               191        3800
## 10 Adelie  Torgersen           34.6          21.1               198        4400
## # ... with 323 more rows, and 2 more variables: sex <fct>, year <int>

Now if we repeat the previous colSums command, we get the following:

colSums(is.na(peng_no_na))
##           species            island    bill_length_mm     bill_depth_mm 
##                 0                 0                 0                 0 
## flipper_length_mm       body_mass_g               sex              year 
##                 0                 0                 0                 0

So all the NA values and their corresponding rows have been dropped.

In practice however, we just want to drop the NA values that are available within a specific column without impacting the other NAs in our data frame. Suppose, that we need to do some computations on our body_mass_g function and those computations don’t accept NAs. We can drop the missing values available within a particular column by specifying the column inside the drop_na() function:

peng_bm_no_na <- penguins %>% 
  drop_na(body_mass_g)

Now, we can see that the body_mass_g has no NAs anymore:

colSums(is.na(peng_bm_no_na))
##           species            island    bill_length_mm     bill_depth_mm 
##                 0                 0                 0                 0 
## flipper_length_mm       body_mass_g               sex              year 
##                 0                 0                 9                 0

Nonetheless, some NAs in the sex variable hasn’t been impacted simply because they are on different rows. Remember than removing an NA will remove the entire row.

  1. Summary values with group_by() and summarise()

The summarise() function allows us to proceed calculations on a specific column or a set of columns.

Suppose, we want to get the mean of the bill_length_mm variable in the penguins data frame:

penguins %>% 
  drop_na(bill_length_mm) %>% 
  summarise(mean(bill_length_mm))
## # A tibble: 1 x 1
##   `mean(bill_length_mm)`
##                    <dbl>
## 1                   43.9

It is possible to give a name to our new summary metric:

penguins %>% 
  drop_na(bill_length_mm) %>% 
  summarise(avg = mean(bill_length_mm))
## # A tibble: 1 x 1
##     avg
##   <dbl>
## 1  43.9

It is possible to calculate a metric on several columns:

penguins %>% 
  drop_na(bill_length_mm, body_mass_g) %>% 
  summarise(
    avg_bill_length = mean(bill_length_mm), 
    avg_body_mass = mean(body_mass_g) 
  )
## # A tibble: 1 x 2
##   avg_bill_length avg_body_mass
##             <dbl>         <dbl>
## 1            43.9         4202.

Exercise 9:

🧠 🧠 🧠 🧠 🧠 > Q1: Using summarise() calculate the median of the flipper_length_mm and the bill_depth_mm.

Q2: Same as Q1, but now calculate the sum of the values.

Q3: We’ve seen the following code above:

penguins %>% 
  drop_na(bill_length_mm, body_mass_g) %>% 
  summarise(
    avg_bill_length = mean(bill_length_mm), 
    avg_body_mass = mean(body_mass_g) 
  )

Try to get the same results however this time without using the drop_na() function. 🧠 🧠 🧠 🧠 🧠

Now, what if we wanted to calculate a metric on several columns but without having the need to write each column name. Let’s say we’re interested in the following columns: bill_length_mm, bill_depth_mm, flipper_length_mm and body_mass_g

penguins %>% 
  summarise(
    across(.cols = bill_length_mm:body_mass_g, 
           .fns = ~mean(., na.rm = TRUE))
  )
## # A tibble: 1 x 4
##   bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
##            <dbl>         <dbl>             <dbl>       <dbl>
## 1           43.9          17.2              201.       4202.

Now, what if we want to calculate the average on all the numeric columns. We can use across and where:

penguins %>% 
  summarise(
    across(
      where(is.numeric), 
      ~mean(., na.rm = TRUE)
    )
  )
## # A tibble: 1 x 5
##   bill_length_mm bill_depth_mm flipper_length_mm body_mass_g  year
##            <dbl>         <dbl>             <dbl>       <dbl> <dbl>
## 1           43.9          17.2              201.       4202. 2008.

We can drop the year column if we want to:

penguins %>% 
  summarise(
    across(
      .cols = c(where(is.numeric), -year), 
      ~mean(., na.rm = TRUE)
    )
  )
## # A tibble: 1 x 4
##   bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
##            <dbl>         <dbl>             <dbl>       <dbl>
## 1           43.9          17.2              201.       4202.

Exercise 10:

🧠 🧠 🧠 🧠 🧠 > * Q1: In the mpg data frame, calculate the median of all the numeric variables. > * Q2: Do the same as Q1, but now drop the year variable. > * Q3: Calculate the number of distinct values for each character column in the mpg data frame. > * Q4: In the penguins data frame, calculate the number of distinct values for the columns of type factor. 🧠 🧠 🧠 🧠 🧠

summarise() is a powerful function, especially when you combine it with where() and across() however it becomes way more powerful when you add a grouping layer using the group_by() function. The combination of group_by() and summarise() allows us to calculate metrics per group. Let’s take an example. In our mpg data frame, suppose we’re interested in the average miles per gallon in highway per manufacturer. We can write:

mpg %>% 
  group_by(manufacturer) %>% 
  summarise(avg_hwy = mean(hwy))
## # A tibble: 15 x 2
##    manufacturer avg_hwy
##    <chr>          <dbl>
##  1 audi            26.4
##  2 chevrolet       21.9
##  3 dodge           17.9
##  4 ford            19.4
##  5 honda           32.6
##  6 hyundai         26.9
##  7 jeep            17.6
##  8 land rover      16.5
##  9 lincoln         17  
## 10 mercury         18  
## 11 nissan          24.6
## 12 pontiac         26.4
## 13 subaru          25.6
## 14 toyota          24.9
## 15 volkswagen      29.2

We can also arrange the results in descending order using the arrange() and desc() functions:

mpg %>% 
  group_by(manufacturer) %>% 
  summarise(avg_hwy = mean(hwy)) %>% 
  arrange(desc(avg_hwy))
## # A tibble: 15 x 2
##    manufacturer avg_hwy
##    <chr>          <dbl>
##  1 honda           32.6
##  2 volkswagen      29.2
##  3 hyundai         26.9
##  4 audi            26.4
##  5 pontiac         26.4
##  6 subaru          25.6
##  7 toyota          24.9
##  8 nissan          24.6
##  9 chevrolet       21.9
## 10 ford            19.4
## 11 mercury         18  
## 12 dodge           17.9
## 13 jeep            17.6
## 14 lincoln         17  
## 15 land rover      16.5

We can also group several variables at the same time. Suppose, I’m interested in the average hwy for each manufacturer and model:

mpg %>% 
  group_by(manufacturer, model) %>% 
  summarise(avg_hwy = mean(hwy)) %>% 
  arrange(desc(avg_hwy))
## `summarise()` has grouped output by 'manufacturer'. You can override using the
## `.groups` argument.
## # A tibble: 38 x 3
## # Groups:   manufacturer [15]
##    manufacturer model        avg_hwy
##    <chr>        <chr>          <dbl>
##  1 toyota       corolla         34  
##  2 volkswagen   new beetle      32.8
##  3 honda        civic           32.6
##  4 volkswagen   jetta           29.1
##  5 nissan       altima          28.7
##  6 audi         a4              28.3
##  7 toyota       camry           28.3
##  8 toyota       camry solara    28.1
##  9 hyundai      sonata          27.7
## 10 chevrolet    malibu          27.6
## # ... with 28 more rows

I can also calculate several measures:

mpg %>% 
  group_by(manufacturer, model) %>% 
  summarise(avg_hwy = mean(hwy), median_hwy = median(hwy)) %>% 
  arrange(desc(avg_hwy))
## `summarise()` has grouped output by 'manufacturer'. You can override using the
## `.groups` argument.
## # A tibble: 38 x 4
## # Groups:   manufacturer [15]
##    manufacturer model        avg_hwy median_hwy
##    <chr>        <chr>          <dbl>      <dbl>
##  1 toyota       corolla         34           35
##  2 volkswagen   new beetle      32.8         29
##  3 honda        civic           32.6         32
##  4 volkswagen   jetta           29.1         29
##  5 nissan       altima          28.7         28
##  6 audi         a4              28.3         29
##  7 toyota       camry           28.3         28
##  8 toyota       camry solara    28.1         27
##  9 hyundai      sonata          27.7         27
## 10 chevrolet    malibu          27.6         27
## # ... with 28 more rows

We can also group by a variable and then get a metric for all numerical variables (for example):

mpg %>% 
  group_by(manufacturer) %>% 
  summarise(
    across(
      .cols = where(is.numeric), 
      .fns = median
    )
  )
## # A tibble: 15 x 6
##    manufacturer displ  year   cyl   cty   hwy
##    <chr>        <dbl> <dbl> <dbl> <dbl> <dbl>
##  1 audi           2.8 2004.     6  17.5  26  
##  2 chevrolet      5.3 2008      8  15    23  
##  3 dodge          4.7 2008      8  13    17  
##  4 ford           4.6 1999      8  14    18  
##  5 honda          1.6 1999      4  24    32  
##  6 hyundai        2.4 2008      4  18.5  26.5
##  7 jeep           4.7 2008      8  14    18.5
##  8 land rover     4.3 2004.     8  11.5  16.5
##  9 lincoln        5.4 1999      8  11    17  
## 10 mercury        4.3 2004.     7  13    18  
## 11 nissan         3.3 2008      6  19    26  
## 12 pontiac        3.8 1999      6  17    26  
## 13 subaru         2.5 2008      4  19    26  
## 14 toyota         2.7 1999      4  18    26  
## 15 volkswagen     2   1999      4  21    29

group_by() and summarise() are especially useful to check how many times a specific value of a variable appear within the column, or in other word, the count of the variable. For example, we might be interesting in counting the values of the manufacturer column:

mpg %>% 
  group_by(manufacturer) %>% 
  summarise(amazing_count = n()) %>% 
  arrange(desc(amazing_count))
## # A tibble: 15 x 2
##    manufacturer amazing_count
##    <chr>                <int>
##  1 dodge                   37
##  2 toyota                  34
##  3 volkswagen              27
##  4 ford                    25
##  5 chevrolet               19
##  6 audi                    18
##  7 hyundai                 14
##  8 subaru                  14
##  9 nissan                  13
## 10 honda                    9
## 11 jeep                     8
## 12 pontiac                  5
## 13 land rover               4
## 14 mercury                  4
## 15 lincoln                  3

The n() function is part of dplyr and is used to count the observations available within a particular column. Counting variable’s observations is a recurrent process and is undertaken very often, so ofter than dply provides a shortcut, so instead of writing the above code, we can use the count() function:

mpg %>% 
  count(manufacturer, sort = TRUE)
## # A tibble: 15 x 2
##    manufacturer     n
##    <chr>        <int>
##  1 dodge           37
##  2 toyota          34
##  3 volkswagen      27
##  4 ford            25
##  5 chevrolet       19
##  6 audi            18
##  7 hyundai         14
##  8 subaru          14
##  9 nissan          13
## 10 honda            9
## 11 jeep             8
## 12 pontiac          5
## 13 land rover       4
## 14 mercury          4
## 15 lincoln          3

Exercise 10:

🧠 🧠 🧠 🧠 🧠 > *Q1: Using the starwars data frame in dplyr, calculate the average mass according to each homeworld.

*Q2: Calculate the median mass for each species, which one has the highest mass and which one has the lowest.

*Q3: Calculate the median and the mean height for each eye_color and homeworld (both at the same time)

*Q4: How many characteres are from Tatooine?

*Q5: How many feminine and how many masculin (use the gender variable) human characters are from Naboo?

*Q6: How many Human characters with blue eyes are from Tatooine?

*Q7: How many characters have a mass superior than 100KG?

*Q8: How many characters have a height inferior than 70cm?

*Q9: Which homeland has the highest average mass?

*Q10: Are characters in Blong hair in average taller than the others?

starwars
## # A tibble: 87 x 14
##    name     height  mass hair_color skin_color eye_color birth_year sex   gender
##    <chr>     <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> <chr> 
##  1 Luke Sk~    172    77 blond      fair       blue            19   male  mascu~
##  2 C-3PO       167    75 <NA>       gold       yellow         112   none  mascu~
##  3 R2-D2        96    32 <NA>       white, bl~ red             33   none  mascu~
##  4 Darth V~    202   136 none       white      yellow          41.9 male  mascu~
##  5 Leia Or~    150    49 brown      light      brown           19   fema~ femin~
##  6 Owen La~    178   120 brown, gr~ light      blue            52   male  mascu~
##  7 Beru Wh~    165    75 brown      light      blue            47   fema~ femin~
##  8 R5-D4        97    32 <NA>       white, red red             NA   none  mascu~
##  9 Biggs D~    183    84 black      light      brown           24   male  mascu~
## 10 Obi-Wan~    182    77 auburn, w~ fair       blue-gray       57   male  mascu~
## # ... with 77 more rows, and 5 more variables: homeworld <chr>, species <chr>,
## #   films <list>, vehicles <list>, starships <list>

8. Adding a new column or modifying an existing one with mutate()

In dplyr, it is possible to add a new column or to modify an existing one using the mutate() function. Let’s have an example and work with the chickwts built in data frame. It exhibits several information about the feeding types and the weight of 71 chicks.

chickwts
##    weight      feed
## 1     179 horsebean
## 2     160 horsebean
## 3     136 horsebean
## 4     227 horsebean
## 5     217 horsebean
## 6     168 horsebean
## 7     108 horsebean
## 8     124 horsebean
## 9     143 horsebean
## 10    140 horsebean
## 11    309   linseed
## 12    229   linseed
## 13    181   linseed
## 14    141   linseed
## 15    260   linseed
## 16    203   linseed
## 17    148   linseed
## 18    169   linseed
## 19    213   linseed
## 20    257   linseed
## 21    244   linseed
## 22    271   linseed
## 23    243   soybean
## 24    230   soybean
## 25    248   soybean
## 26    327   soybean
## 27    329   soybean
## 28    250   soybean
## 29    193   soybean
## 30    271   soybean
## 31    316   soybean
## 32    267   soybean
## 33    199   soybean
## 34    171   soybean
## 35    158   soybean
## 36    248   soybean
## 37    423 sunflower
## 38    340 sunflower
## 39    392 sunflower
## 40    339 sunflower
## 41    341 sunflower
## 42    226 sunflower
## 43    320 sunflower
## 44    295 sunflower
## 45    334 sunflower
## 46    322 sunflower
## 47    297 sunflower
## 48    318 sunflower
## 49    325  meatmeal
## 50    257  meatmeal
## 51    303  meatmeal
## 52    315  meatmeal
## 53    380  meatmeal
## 54    153  meatmeal
## 55    263  meatmeal
## 56    242  meatmeal
## 57    206  meatmeal
## 58    344  meatmeal
## 59    258  meatmeal
## 60    368    casein
## 61    390    casein
## 62    379    casein
## 63    260    casein
## 64    404    casein
## 65    318    casein
## 66    352    casein
## 67    359    casein
## 68    216    casein
## 69    222    casein
## 70    283    casein
## 71    332    casein

As you can see from above the weight column is expressed in grams. What if for whatever reason, we wanted to express the weight in Kg. We will use the mutate() function to create the needed column as follow:

chickwts %>% 
  mutate(weight / 1000)
##    weight      feed weight/1000
## 1     179 horsebean       0.179
## 2     160 horsebean       0.160
## 3     136 horsebean       0.136
## 4     227 horsebean       0.227
## 5     217 horsebean       0.217
## 6     168 horsebean       0.168
## 7     108 horsebean       0.108
## 8     124 horsebean       0.124
## 9     143 horsebean       0.143
## 10    140 horsebean       0.140
## 11    309   linseed       0.309
## 12    229   linseed       0.229
## 13    181   linseed       0.181
## 14    141   linseed       0.141
## 15    260   linseed       0.260
## 16    203   linseed       0.203
## 17    148   linseed       0.148
## 18    169   linseed       0.169
## 19    213   linseed       0.213
## 20    257   linseed       0.257
## 21    244   linseed       0.244
## 22    271   linseed       0.271
## 23    243   soybean       0.243
## 24    230   soybean       0.230
## 25    248   soybean       0.248
## 26    327   soybean       0.327
## 27    329   soybean       0.329
## 28    250   soybean       0.250
## 29    193   soybean       0.193
## 30    271   soybean       0.271
## 31    316   soybean       0.316
## 32    267   soybean       0.267
## 33    199   soybean       0.199
## 34    171   soybean       0.171
## 35    158   soybean       0.158
## 36    248   soybean       0.248
## 37    423 sunflower       0.423
## 38    340 sunflower       0.340
## 39    392 sunflower       0.392
## 40    339 sunflower       0.339
## 41    341 sunflower       0.341
## 42    226 sunflower       0.226
## 43    320 sunflower       0.320
## 44    295 sunflower       0.295
## 45    334 sunflower       0.334
## 46    322 sunflower       0.322
## 47    297 sunflower       0.297
## 48    318 sunflower       0.318
## 49    325  meatmeal       0.325
## 50    257  meatmeal       0.257
## 51    303  meatmeal       0.303
## 52    315  meatmeal       0.315
## 53    380  meatmeal       0.380
## 54    153  meatmeal       0.153
## 55    263  meatmeal       0.263
## 56    242  meatmeal       0.242
## 57    206  meatmeal       0.206
## 58    344  meatmeal       0.344
## 59    258  meatmeal       0.258
## 60    368    casein       0.368
## 61    390    casein       0.390
## 62    379    casein       0.379
## 63    260    casein       0.260
## 64    404    casein       0.404
## 65    318    casein       0.318
## 66    352    casein       0.352
## 67    359    casein       0.359
## 68    216    casein       0.216
## 69    222    casein       0.222
## 70    283    casein       0.283
## 71    332    casein       0.332

So, above we used the initial weight column and divided it by 1000 to the get the new weight/1000 column. We could easily give a nice name to our new column:

chickwts %>% 
  mutate(weight_kg = weight / 1000)
##    weight      feed weight_kg
## 1     179 horsebean     0.179
## 2     160 horsebean     0.160
## 3     136 horsebean     0.136
## 4     227 horsebean     0.227
## 5     217 horsebean     0.217
## 6     168 horsebean     0.168
## 7     108 horsebean     0.108
## 8     124 horsebean     0.124
## 9     143 horsebean     0.143
## 10    140 horsebean     0.140
## 11    309   linseed     0.309
## 12    229   linseed     0.229
## 13    181   linseed     0.181
## 14    141   linseed     0.141
## 15    260   linseed     0.260
## 16    203   linseed     0.203
## 17    148   linseed     0.148
## 18    169   linseed     0.169
## 19    213   linseed     0.213
## 20    257   linseed     0.257
## 21    244   linseed     0.244
## 22    271   linseed     0.271
## 23    243   soybean     0.243
## 24    230   soybean     0.230
## 25    248   soybean     0.248
## 26    327   soybean     0.327
## 27    329   soybean     0.329
## 28    250   soybean     0.250
## 29    193   soybean     0.193
## 30    271   soybean     0.271
## 31    316   soybean     0.316
## 32    267   soybean     0.267
## 33    199   soybean     0.199
## 34    171   soybean     0.171
## 35    158   soybean     0.158
## 36    248   soybean     0.248
## 37    423 sunflower     0.423
## 38    340 sunflower     0.340
## 39    392 sunflower     0.392
## 40    339 sunflower     0.339
## 41    341 sunflower     0.341
## 42    226 sunflower     0.226
## 43    320 sunflower     0.320
## 44    295 sunflower     0.295
## 45    334 sunflower     0.334
## 46    322 sunflower     0.322
## 47    297 sunflower     0.297
## 48    318 sunflower     0.318
## 49    325  meatmeal     0.325
## 50    257  meatmeal     0.257
## 51    303  meatmeal     0.303
## 52    315  meatmeal     0.315
## 53    380  meatmeal     0.380
## 54    153  meatmeal     0.153
## 55    263  meatmeal     0.263
## 56    242  meatmeal     0.242
## 57    206  meatmeal     0.206
## 58    344  meatmeal     0.344
## 59    258  meatmeal     0.258
## 60    368    casein     0.368
## 61    390    casein     0.390
## 62    379    casein     0.379
## 63    260    casein     0.260
## 64    404    casein     0.404
## 65    318    casein     0.318
## 66    352    casein     0.352
## 67    359    casein     0.359
## 68    216    casein     0.216
## 69    222    casein     0.222
## 70    283    casein     0.283
## 71    332    casein     0.332

It is possible to position the new column created according to our wish using the .after and the .before parameters of the mutate() function. By default, the new variable will be set at the last position (from the last) of the data frame. Suppose, we want to put the new created column weight_kg at the beginning:

chickwts %>% 
  mutate(
    weight_kg = weight / 1000, 
    .before = weight 
  )
##    weight_kg weight      feed
## 1      0.179    179 horsebean
## 2      0.160    160 horsebean
## 3      0.136    136 horsebean
## 4      0.227    227 horsebean
## 5      0.217    217 horsebean
## 6      0.168    168 horsebean
## 7      0.108    108 horsebean
## 8      0.124    124 horsebean
## 9      0.143    143 horsebean
## 10     0.140    140 horsebean
## 11     0.309    309   linseed
## 12     0.229    229   linseed
## 13     0.181    181   linseed
## 14     0.141    141   linseed
## 15     0.260    260   linseed
## 16     0.203    203   linseed
## 17     0.148    148   linseed
## 18     0.169    169   linseed
## 19     0.213    213   linseed
## 20     0.257    257   linseed
## 21     0.244    244   linseed
## 22     0.271    271   linseed
## 23     0.243    243   soybean
## 24     0.230    230   soybean
## 25     0.248    248   soybean
## 26     0.327    327   soybean
## 27     0.329    329   soybean
## 28     0.250    250   soybean
## 29     0.193    193   soybean
## 30     0.271    271   soybean
## 31     0.316    316   soybean
## 32     0.267    267   soybean
## 33     0.199    199   soybean
## 34     0.171    171   soybean
## 35     0.158    158   soybean
## 36     0.248    248   soybean
## 37     0.423    423 sunflower
## 38     0.340    340 sunflower
## 39     0.392    392 sunflower
## 40     0.339    339 sunflower
## 41     0.341    341 sunflower
## 42     0.226    226 sunflower
## 43     0.320    320 sunflower
## 44     0.295    295 sunflower
## 45     0.334    334 sunflower
## 46     0.322    322 sunflower
## 47     0.297    297 sunflower
## 48     0.318    318 sunflower
## 49     0.325    325  meatmeal
## 50     0.257    257  meatmeal
## 51     0.303    303  meatmeal
## 52     0.315    315  meatmeal
## 53     0.380    380  meatmeal
## 54     0.153    153  meatmeal
## 55     0.263    263  meatmeal
## 56     0.242    242  meatmeal
## 57     0.206    206  meatmeal
## 58     0.344    344  meatmeal
## 59     0.258    258  meatmeal
## 60     0.368    368    casein
## 61     0.390    390    casein
## 62     0.379    379    casein
## 63     0.260    260    casein
## 64     0.404    404    casein
## 65     0.318    318    casein
## 66     0.352    352    casein
## 67     0.359    359    casein
## 68     0.216    216    casein
## 69     0.222    222    casein
## 70     0.283    283    casein
## 71     0.332    332    casein

Instead of creating a new column, we can override the initial column if we provide inside the mutate() function the same column name:

chickwts %>% 
  mutate(weight = weight / 1000)
##    weight      feed
## 1   0.179 horsebean
## 2   0.160 horsebean
## 3   0.136 horsebean
## 4   0.227 horsebean
## 5   0.217 horsebean
## 6   0.168 horsebean
## 7   0.108 horsebean
## 8   0.124 horsebean
## 9   0.143 horsebean
## 10  0.140 horsebean
## 11  0.309   linseed
## 12  0.229   linseed
## 13  0.181   linseed
## 14  0.141   linseed
## 15  0.260   linseed
## 16  0.203   linseed
## 17  0.148   linseed
## 18  0.169   linseed
## 19  0.213   linseed
## 20  0.257   linseed
## 21  0.244   linseed
## 22  0.271   linseed
## 23  0.243   soybean
## 24  0.230   soybean
## 25  0.248   soybean
## 26  0.327   soybean
## 27  0.329   soybean
## 28  0.250   soybean
## 29  0.193   soybean
## 30  0.271   soybean
## 31  0.316   soybean
## 32  0.267   soybean
## 33  0.199   soybean
## 34  0.171   soybean
## 35  0.158   soybean
## 36  0.248   soybean
## 37  0.423 sunflower
## 38  0.340 sunflower
## 39  0.392 sunflower
## 40  0.339 sunflower
## 41  0.341 sunflower
## 42  0.226 sunflower
## 43  0.320 sunflower
## 44  0.295 sunflower
## 45  0.334 sunflower
## 46  0.322 sunflower
## 47  0.297 sunflower
## 48  0.318 sunflower
## 49  0.325  meatmeal
## 50  0.257  meatmeal
## 51  0.303  meatmeal
## 52  0.315  meatmeal
## 53  0.380  meatmeal
## 54  0.153  meatmeal
## 55  0.263  meatmeal
## 56  0.242  meatmeal
## 57  0.206  meatmeal
## 58  0.344  meatmeal
## 59  0.258  meatmeal
## 60  0.368    casein
## 61  0.390    casein
## 62  0.379    casein
## 63  0.260    casein
## 64  0.404    casein
## 65  0.318    casein
## 66  0.352    casein
## 67  0.359    casein
## 68  0.216    casein
## 69  0.222    casein
## 70  0.283    casein
## 71  0.332    casein

Let’s take another example, in the penguins data frame we’ve the following variables:

penguins
## # A tibble: 344 x 8
##    species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
##    <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
##  1 Adelie  Torgersen           39.1          18.7               181        3750
##  2 Adelie  Torgersen           39.5          17.4               186        3800
##  3 Adelie  Torgersen           40.3          18                 195        3250
##  4 Adelie  Torgersen           NA            NA                  NA          NA
##  5 Adelie  Torgersen           36.7          19.3               193        3450
##  6 Adelie  Torgersen           39.3          20.6               190        3650
##  7 Adelie  Torgersen           38.9          17.8               181        3625
##  8 Adelie  Torgersen           39.2          19.6               195        4675
##  9 Adelie  Torgersen           34.1          18.1               193        3475
## 10 Adelie  Torgersen           42            20.2               190        4250
## # ... with 334 more rows, and 2 more variables: sex <fct>, year <int>

Let’s say, we want to create a ratio of bill_length_mm on bill_depth_mm, call it bill_ratio and put just after the island column. We can write:

penguins %>%  
  mutate(bill_ratio = bill_length_mm / bill_depth_mm, 
         .after = island)
## # A tibble: 344 x 9
##    species island    bill_ratio bill_length_mm bill_depth_mm flipper_length_mm
##    <fct>   <fct>          <dbl>          <dbl>         <dbl>             <int>
##  1 Adelie  Torgersen       2.09           39.1          18.7               181
##  2 Adelie  Torgersen       2.27           39.5          17.4               186
##  3 Adelie  Torgersen       2.24           40.3          18                 195
##  4 Adelie  Torgersen      NA              NA            NA                  NA
##  5 Adelie  Torgersen       1.90           36.7          19.3               193
##  6 Adelie  Torgersen       1.91           39.3          20.6               190
##  7 Adelie  Torgersen       2.19           38.9          17.8               181
##  8 Adelie  Torgersen       2              39.2          19.6               195
##  9 Adelie  Torgersen       1.88           34.1          18.1               193
## 10 Adelie  Torgersen       2.08           42            20.2               190
## # ... with 334 more rows, and 3 more variables: body_mass_g <int>, sex <fct>,
## #   year <int>

Exercise 11:

*Q1: In the mpg data frame, calculate the ratio hwy on cty, name it whatever you want and put just before the cty variable.

*Q2: Using mutate(), convert the year variable (which is of double type) into a factor.

*Q3: Use mutate() to convert all the manufacturer values into upper case.

*Q4: Use the radous package and the get_date() function to get 100 observations. Then, use mutate() to create an new column that will hold the full name of the person (first name and last name). Put the new column name at the start of the data frame.

Built with Hugo
Theme Stack designed by Jimmy