Content
1. Column selection with select()
2. Filtering data with filter()
3. Renaming columns with rename()
distinct()
and n_distinct()
5. Column sorting with arrange()
6. Dropping NA
values with drop_na()
from tidyr
7. Summary values with group_by()
and summarise()
mutate()
0. The magrittr
pipe %>%
Before digging into dplyr
, there is a sort topic yet important to cover, and it’s the tidyverse
piping process. Piping is used in programming to chain several operations in order to yield a final output. In the tidyverse
this is done using the magrittr
pipe operator %>%
. The pipe takes the result of the Left Handed Side (L.H.S) code and pipe it into the first parameter of the Right Handed Side (R.H.S) function.
Consider the following example. We use the summary()
function to display some statistical metrics in the mtcars
data frame. Note that we’re keeping only the rows where mpg
is greater or equal than 30 miles per gallon:
summary(subset(mtcars, mpg >= 30))
## mpg cyl disp hp drat
## Min. :30.40 Min. :4 Min. :71.10 Min. : 52.00 Min. :3.770
## 1st Qu.:30.40 1st Qu.:4 1st Qu.:74.55 1st Qu.: 61.75 1st Qu.:4.003
## Median :31.40 Median :4 Median :77.20 Median : 65.50 Median :4.150
## Mean :31.77 Mean :4 Mean :80.15 Mean : 74.00 Mean :4.250
## 3rd Qu.:32.77 3rd Qu.:4 3rd Qu.:82.80 3rd Qu.: 77.75 3rd Qu.:4.397
## Max. :33.90 Max. :4 Max. :95.10 Max. :113.00 Max. :4.930
## wt qsec vs am gear
## Min. :1.513 Min. :16.90 Min. :1 Min. :1 Min. :4.00
## 1st Qu.:1.589 1st Qu.:18.11 1st Qu.:1 1st Qu.:1 1st Qu.:4.00
## Median :1.725 Median :19.00 Median :1 Median :1 Median :4.00
## Mean :1.791 Mean :18.70 Mean :1 Mean :1 Mean :4.25
## 3rd Qu.:1.926 3rd Qu.:19.58 3rd Qu.:1 3rd Qu.:1 3rd Qu.:4.25
## Max. :2.200 Max. :19.90 Max. :1 Max. :1 Max. :5.00
## carb
## Min. :1.0
## 1st Qu.:1.0
## Median :1.5
## Mean :1.5
## 3rd Qu.:2.0
## Max. :2.0
The above code is relatively readable but not very intuitive and not that pleasant to read. Let’s rewrite it using the pipe operator:
library(magrittr)
## Warning: package 'magrittr' was built under R version 4.1.3
mtcars %>%
subset(mpg >= 30) %>%
summary()
## mpg cyl disp hp drat
## Min. :30.40 Min. :4 Min. :71.10 Min. : 52.00 Min. :3.770
## 1st Qu.:30.40 1st Qu.:4 1st Qu.:74.55 1st Qu.: 61.75 1st Qu.:4.003
## Median :31.40 Median :4 Median :77.20 Median : 65.50 Median :4.150
## Mean :31.77 Mean :4 Mean :80.15 Mean : 74.00 Mean :4.250
## 3rd Qu.:32.77 3rd Qu.:4 3rd Qu.:82.80 3rd Qu.: 77.75 3rd Qu.:4.397
## Max. :33.90 Max. :4 Max. :95.10 Max. :113.00 Max. :4.930
## wt qsec vs am gear
## Min. :1.513 Min. :16.90 Min. :1 Min. :1 Min. :4.00
## 1st Qu.:1.589 1st Qu.:18.11 1st Qu.:1 1st Qu.:1 1st Qu.:4.00
## Median :1.725 Median :19.00 Median :1 Median :1 Median :4.00
## Mean :1.791 Mean :18.70 Mean :1 Mean :1 Mean :4.25
## 3rd Qu.:1.926 3rd Qu.:19.58 3rd Qu.:1 3rd Qu.:1 3rd Qu.:4.25
## Max. :2.200 Max. :19.90 Max. :1 Max. :1 Max. :5.00
## carb
## Min. :1.0
## 1st Qu.:1.0
## Median :1.5
## Mean :1.5
## 3rd Qu.:2.0
## Max. :2.0
Using the pipe, it feels like we’re reading a paragraph in which each operation is a sentence. The order of the operations are clearly showcased, much like a Workflow. We can read the above code as follows: Take the mtcars data frame, keep only the observations where mpg greater or equal 30, finally make a summary of the data frame.
Ok, that’s nice but what if we wanted to pipe the L.H.S into the second or the third parameter (not only the first)? Well there’s an easy solution provided by magrittr
: Use a .
wherever you want to incorporate the result of the L.H.S
Consider the following linear model:
summary(lm(formula = mpg ~ wt,
data = mtcars,
subset = wt >= 2))
##
## Call:
## lm(formula = mpg ~ wt, data = mtcars, subset = wt >= 2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.9773 -1.8415 -0.1206 0.9795 8.2520
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 34.0607 2.3083 14.76 3.78e-14 ***
## wt -4.5058 0.6539 -6.89 2.58e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.879 on 26 degrees of freedom
## Multiple R-squared: 0.6462, Adjusted R-squared: 0.6325
## F-statistic: 47.48 on 1 and 26 DF, p-value: 2.58e-07
Insted of using both the data
and the subset
parameter, we can just subset the data frame beforehand and incorporate the result within the data
parameter.
Using the pipe, we can write:
mtcars %>%
subset(wt >= 2) %>%
lm(formula = mpg ~ wt, data = .) %>%
summary()
##
## Call:
## lm(formula = mpg ~ wt, data = .)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.9773 -1.8415 -0.1206 0.9795 8.2520
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 34.0607 2.3083 14.76 3.78e-14 ***
## wt -4.5058 0.6539 -6.89 2.58e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.879 on 26 degrees of freedom
## Multiple R-squared: 0.6462, Adjusted R-squared: 0.6325
## F-statistic: 47.48 on 1 and 26 DF, p-value: 2.58e-07
Exercise 1:
🧠 🧠 🧠 🧠 🧠 > Q1: We want to display a message to the console showcasing the structure (str) of a subset of the mtcars data frame (where wt is greater or equal than 2)
message(str(subset(mtcars, wt >= 2)))
## 'data.frame': 28 obs. of 11 variables:
## $ mpg : num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
## $ cyl : num 6 6 4 6 8 6 8 4 4 6 ...
## $ disp: num 160 160 108 258 360 ...
## $ hp : num 110 110 93 110 175 105 245 62 95 123 ...
## $ drat: num 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
## $ wt : num 2.62 2.88 2.32 3.21 3.44 ...
## $ qsec: num 16.5 17 18.6 19.4 17 ...
## $ vs : num 0 0 1 1 0 1 0 1 1 1 ...
## $ am : num 1 1 1 0 0 0 0 0 0 0 ...
## $ gear: num 4 4 4 3 3 3 3 4 4 4 ...
## $ carb: num 4 4 1 1 2 1 4 2 2 4 ...
##
Rewrite the code chunk using the piping phylosophy.
Q2: Express in some sentences what’s happening in the piping process.
Q3: Repeat Q1 and Q2 with the following code chunk:
hist(subset(mtcars, am == "1")$mpg,
main = "distribution of the mpg variable",
sub = "All the vehicules are automatic",
xlab = "")
🧠 🧠 🧠 🧠 🧠
1. Column selection with select()
You might see it as trivial but column selection is an important toolkit to have. In many situations, you’ll have to deal with data frames that have more than 50 columns and trust me in those cases, selecting the needed columns is not easy, unless you use dplyr
, then it becomes quite straightforward.
Let’s work with the penguins
data frame from the palmerpenguins
package. The data contains information about 344 penguins. There are 3 different species of penguins in this dataset, collected from 3 islands in the Palmer Archipelago, Antarctica. More on the data frame here.
library(palmerpenguins)
library(dplyr)
## Warning: package 'dplyr' was built under R version 4.1.3
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
penguins
## # A tibble: 344 x 8
## species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
## <fct> <fct> <dbl> <dbl> <int> <int>
## 1 Adelie Torgersen 39.1 18.7 181 3750
## 2 Adelie Torgersen 39.5 17.4 186 3800
## 3 Adelie Torgersen 40.3 18 195 3250
## 4 Adelie Torgersen NA NA NA NA
## 5 Adelie Torgersen 36.7 19.3 193 3450
## 6 Adelie Torgersen 39.3 20.6 190 3650
## 7 Adelie Torgersen 38.9 17.8 181 3625
## 8 Adelie Torgersen 39.2 19.6 195 4675
## 9 Adelie Torgersen 34.1 18.1 193 3475
## 10 Adelie Torgersen 42 20.2 190 4250
## # ... with 334 more rows, and 2 more variables: sex <fct>, year <int>
When loading a data frame, it’s always a good practice to inspect its content. In base R, we can use the str()
function:
str(penguins)
## tibble [344 x 8] (S3: tbl_df/tbl/data.frame)
## $ species : Factor w/ 3 levels "Adelie","Chinstrap",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ island : Factor w/ 3 levels "Biscoe","Dream",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ bill_length_mm : num [1:344] 39.1 39.5 40.3 NA 36.7 39.3 38.9 39.2 34.1 42 ...
## $ bill_depth_mm : num [1:344] 18.7 17.4 18 NA 19.3 20.6 17.8 19.6 18.1 20.2 ...
## $ flipper_length_mm: int [1:344] 181 186 195 NA 193 190 181 195 193 190 ...
## $ body_mass_g : int [1:344] 3750 3800 3250 NA 3450 3650 3625 4675 3475 4250 ...
## $ sex : Factor w/ 2 levels "female","male": 2 1 1 NA 1 2 1 2 NA NA ...
## $ year : int [1:344] 2007 2007 2007 2007 2007 2007 2007 2007 2007 2007 ...
In dplyr
there is a more aesthetic function that achieves the same purpose, glimpse()
:
glimpse(penguins)
## Rows: 344
## Columns: 8
## $ species <fct> Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Adel~
## $ island <fct> Torgersen, Torgersen, Torgersen, Torgersen, Torgerse~
## $ bill_length_mm <dbl> 39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9, 39.2, 34.1, ~
## $ bill_depth_mm <dbl> 18.7, 17.4, 18.0, NA, 19.3, 20.6, 17.8, 19.6, 18.1, ~
## $ flipper_length_mm <int> 181, 186, 195, NA, 193, 190, 181, 195, 193, 190, 186~
## $ body_mass_g <int> 3750, 3800, 3250, NA, 3450, 3650, 3625, 4675, 3475, ~
## $ sex <fct> male, female, female, NA, female, male, female, male~
## $ year <int> 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007~
Or in a dplyr
friendly way:
penguins %>% glimpse()
## Rows: 344
## Columns: 8
## $ species <fct> Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Adel~
## $ island <fct> Torgersen, Torgersen, Torgersen, Torgersen, Torgerse~
## $ bill_length_mm <dbl> 39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9, 39.2, 34.1, ~
## $ bill_depth_mm <dbl> 18.7, 17.4, 18.0, NA, 19.3, 20.6, 17.8, 19.6, 18.1, ~
## $ flipper_length_mm <int> 181, 186, 195, NA, 193, 190, 181, 195, 193, 190, 186~
## $ body_mass_g <int> 3750, 3800, 3250, NA, 3450, 3650, 3625, 4675, 3475, ~
## $ sex <fct> male, female, female, NA, female, male, female, male~
## $ year <int> 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007~
Now, let’s talk about the selection features in dplyr
which provides a unique function to select columns, the select()
function. Let’s say that we’re interested only in the species
column. We can select as follows:
penguins %>% select(species)
## # A tibble: 344 x 1
## species
## <fct>
## 1 Adelie
## 2 Adelie
## 3 Adelie
## 4 Adelie
## 5 Adelie
## 6 Adelie
## 7 Adelie
## 8 Adelie
## 9 Adelie
## 10 Adelie
## # ... with 334 more rows
Note that dplyr
and the tidyverse
in general uses the so called Non Standard Evaluation. As such, you don’t need to provide the name of the column as a character string, but if needed you can:
penguins %>% select("species")
## # A tibble: 344 x 1
## species
## <fct>
## 1 Adelie
## 2 Adelie
## 3 Adelie
## 4 Adelie
## 5 Adelie
## 6 Adelie
## 7 Adelie
## 8 Adelie
## 9 Adelie
## 10 Adelie
## # ... with 334 more rows
You can also select a column by its position:
penguins %>% select(1)
## # A tibble: 344 x 1
## species
## <fct>
## 1 Adelie
## 2 Adelie
## 3 Adelie
## 4 Adelie
## 5 Adelie
## 6 Adelie
## 7 Adelie
## 8 Adelie
## 9 Adelie
## 10 Adelie
## # ... with 334 more rows
Now, let’s talk about how to deselect
a column, meaning drop it and keep all other columns in the data frame. This is easiy achievable by providing the minus sign -
or an exclamation point !
at the beginning of the column name to drop. To illustrate it, let’s say we’re interested in all the variables except the island
. We can drop it with:
penguins %>% select(-island)
## # A tibble: 344 x 7
## species bill_length_mm bill_depth_mm flipper_length_~ body_mass_g sex year
## <fct> <dbl> <dbl> <int> <int> <fct> <int>
## 1 Adelie 39.1 18.7 181 3750 male 2007
## 2 Adelie 39.5 17.4 186 3800 fema~ 2007
## 3 Adelie 40.3 18 195 3250 fema~ 2007
## 4 Adelie NA NA NA NA <NA> 2007
## 5 Adelie 36.7 19.3 193 3450 fema~ 2007
## 6 Adelie 39.3 20.6 190 3650 male 2007
## 7 Adelie 38.9 17.8 181 3625 fema~ 2007
## 8 Adelie 39.2 19.6 195 4675 male 2007
## 9 Adelie 34.1 18.1 193 3475 <NA> 2007
## 10 Adelie 42 20.2 190 4250 <NA> 2007
## # ... with 334 more rows
or with
penguins %>% select(!island)
## # A tibble: 344 x 7
## species bill_length_mm bill_depth_mm flipper_length_~ body_mass_g sex year
## <fct> <dbl> <dbl> <int> <int> <fct> <int>
## 1 Adelie 39.1 18.7 181 3750 male 2007
## 2 Adelie 39.5 17.4 186 3800 fema~ 2007
## 3 Adelie 40.3 18 195 3250 fema~ 2007
## 4 Adelie NA NA NA NA <NA> 2007
## 5 Adelie 36.7 19.3 193 3450 fema~ 2007
## 6 Adelie 39.3 20.6 190 3650 male 2007
## 7 Adelie 38.9 17.8 181 3625 fema~ 2007
## 8 Adelie 39.2 19.6 195 4675 male 2007
## 9 Adelie 34.1 18.1 193 3475 <NA> 2007
## 10 Adelie 42 20.2 190 4250 <NA> 2007
## # ... with 334 more rows
We can also drop by position:
penguins %>% select(-2)
## # A tibble: 344 x 7
## species bill_length_mm bill_depth_mm flipper_length_~ body_mass_g sex year
## <fct> <dbl> <dbl> <int> <int> <fct> <int>
## 1 Adelie 39.1 18.7 181 3750 male 2007
## 2 Adelie 39.5 17.4 186 3800 fema~ 2007
## 3 Adelie 40.3 18 195 3250 fema~ 2007
## 4 Adelie NA NA NA NA <NA> 2007
## 5 Adelie 36.7 19.3 193 3450 fema~ 2007
## 6 Adelie 39.3 20.6 190 3650 male 2007
## 7 Adelie 38.9 17.8 181 3625 fema~ 2007
## 8 Adelie 39.2 19.6 195 4675 male 2007
## 9 Adelie 34.1 18.1 193 3475 <NA> 2007
## 10 Adelie 42 20.2 190 4250 <NA> 2007
## # ... with 334 more rows
It is easy to select a set of columns, just put the selection inside the vector function c()
. Suppose, we want to keep the species
, the body_mass_g
, the sex
and the year
variables:
penguins %>%
select(c(species, body_mass_g, sex, year))
## # A tibble: 344 x 4
## species body_mass_g sex year
## <fct> <int> <fct> <int>
## 1 Adelie 3750 male 2007
## 2 Adelie 3800 female 2007
## 3 Adelie 3250 female 2007
## 4 Adelie NA <NA> 2007
## 5 Adelie 3450 female 2007
## 6 Adelie 3650 male 2007
## 7 Adelie 3625 female 2007
## 8 Adelie 4675 male 2007
## 9 Adelie 3475 <NA> 2007
## 10 Adelie 4250 <NA> 2007
## # ... with 334 more rows
We can also select a range of columns as follows:
penguins %>% select(species:bill_depth_mm)
## # A tibble: 344 x 4
## species island bill_length_mm bill_depth_mm
## <fct> <fct> <dbl> <dbl>
## 1 Adelie Torgersen 39.1 18.7
## 2 Adelie Torgersen 39.5 17.4
## 3 Adelie Torgersen 40.3 18
## 4 Adelie Torgersen NA NA
## 5 Adelie Torgersen 36.7 19.3
## 6 Adelie Torgersen 39.3 20.6
## 7 Adelie Torgersen 38.9 17.8
## 8 Adelie Torgersen 39.2 19.6
## 9 Adelie Torgersen 34.1 18.1
## 10 Adelie Torgersen 42 20.2
## # ... with 334 more rows
It’s also possible to mix individual with range selections:
penguins %>% select(c(species:bill_depth_mm, year))
## # A tibble: 344 x 5
## species island bill_length_mm bill_depth_mm year
## <fct> <fct> <dbl> <dbl> <int>
## 1 Adelie Torgersen 39.1 18.7 2007
## 2 Adelie Torgersen 39.5 17.4 2007
## 3 Adelie Torgersen 40.3 18 2007
## 4 Adelie Torgersen NA NA 2007
## 5 Adelie Torgersen 36.7 19.3 2007
## 6 Adelie Torgersen 39.3 20.6 2007
## 7 Adelie Torgersen 38.9 17.8 2007
## 8 Adelie Torgersen 39.2 19.6 2007
## 9 Adelie Torgersen 34.1 18.1 2007
## 10 Adelie Torgersen 42 20.2 2007
## # ... with 334 more rows
In the same way, it is possible to drop a range of columns:
penguins %>% select(!(bill_length_mm:year))
## # A tibble: 344 x 2
## species island
## <fct> <fct>
## 1 Adelie Torgersen
## 2 Adelie Torgersen
## 3 Adelie Torgersen
## 4 Adelie Torgersen
## 5 Adelie Torgersen
## 6 Adelie Torgersen
## 7 Adelie Torgersen
## 8 Adelie Torgersen
## 9 Adelie Torgersen
## 10 Adelie Torgersen
## # ... with 334 more rows
The same principles apply when using columns position. In the following, we select the range from column 1 to column3:
penguins %>% select(1:3)
## # A tibble: 344 x 3
## species island bill_length_mm
## <fct> <fct> <dbl>
## 1 Adelie Torgersen 39.1
## 2 Adelie Torgersen 39.5
## 3 Adelie Torgersen 40.3
## 4 Adelie Torgersen NA
## 5 Adelie Torgersen 36.7
## 6 Adelie Torgersen 39.3
## 7 Adelie Torgersen 38.9
## 8 Adelie Torgersen 39.2
## 9 Adelie Torgersen 34.1
## 10 Adelie Torgersen 42
## # ... with 334 more rows
Exercise 2:
🧠 🧠 🧠 🧠 🧠
> The ggplot2 package provides a data frame called txhousing
.
Q1: Select the following columns: city, sales, volume, median, listings, inventory
Q2: Select the city, year, sales and median variables. Make sure that the year variable is the first column from the left and sales is the last.
Q3: It is possible to rename colums on the fly during the selection process. Repeat Q2 but this time, rename sales to number_of_sales
Q4: Select all the columns. Put the month and year variables at the beginning of the data frame, dont’t change the order of the remaining columns. 🧠 🧠 🧠 🧠 🧠
dplyr
also provides some helper functions that allow us to use a string pattern to select a specific set of columns. Suppose that we want to select the columns that contain the word “bill”. We can use the contains()
function as follows:
penguins %>% select(contains("bill"))
## # A tibble: 344 x 2
## bill_length_mm bill_depth_mm
## <dbl> <dbl>
## 1 39.1 18.7
## 2 39.5 17.4
## 3 40.3 18
## 4 NA NA
## 5 36.7 19.3
## 6 39.3 20.6
## 7 38.9 17.8
## 8 39.2 19.6
## 9 34.1 18.1
## 10 42 20.2
## # ... with 334 more rows
We can also use the starts_with()
and ends_with()
functions. Let’s say we want to select the columns that end with the mm
character:
penguins %>% select(ends_with("mm"))
## # A tibble: 344 x 3
## bill_length_mm bill_depth_mm flipper_length_mm
## <dbl> <dbl> <int>
## 1 39.1 18.7 181
## 2 39.5 17.4 186
## 3 40.3 18 195
## 4 NA NA NA
## 5 36.7 19.3 193
## 6 39.3 20.6 190
## 7 38.9 17.8 181
## 8 39.2 19.6 195
## 9 34.1 18.1 193
## 10 42 20.2 190
## # ... with 334 more rows
We can mix the these helper functions with simple selection, so if we want to get the species variables and all the columns that start with the character “flipper”, we can do the following:
penguins %>% select(species, starts_with("flipper"))
## # A tibble: 344 x 2
## species flipper_length_mm
## <fct> <int>
## 1 Adelie 181
## 2 Adelie 186
## 3 Adelie 195
## 4 Adelie NA
## 5 Adelie 193
## 6 Adelie 190
## 7 Adelie 181
## 8 Adelie 195
## 9 Adelie 193
## 10 Adelie 190
## # ... with 334 more rows
Exercise 3:
🧠 🧠 🧠 🧠 🧠
The radous package allows you to get a data frame about fake personal information. In order to get the data we use the following function:
library(radous)
rad_data <- get_data(n = 200)
## Rows: 200 Columns: 34-- Column specification --------------------------------------------------------
## Delimiter: ","
## chr (27): gender, name.title, name.first, name.last, location.street.name, ...
## dbl (5): location.street.number, location.coordinates.latitude, location.c...
## dttm (2): dob.date, registered.date
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.
rad_data
## # A tibble: 200 x 34
## gender name_title name_first name_last location_street_nu~ location_street~
## <chr> <chr> <chr> <chr> <dbl> <chr>
## 1 female Mademoiselle Luisa Lecomte 8496 Rue Abel-Ferry
## 2 female Miss Toligniva Govera 677 Chervona kalina
## 3 male Monsieur Mike Morin 3773 Rue de L'Abbé-P~
## 4 female Mrs Lauren Morrison 2082 Henry Street
## 5 male Mr Viljami Rintala 6094 Verkatehtaankatu
## 6 female Ms Christine Chavez 6176 Marsh Ln
## 7 female Mrs Carolyn Matthews 8472 Church Lane
## 8 male Mr Demid Zyablyuk 1484 Zaporizke shose
## 9 male Mr Roberto Vázquez 926 Calle Nebrija
## 10 male Mr Eilif Sandtorv 2479 Husebybakken
## # ... with 190 more rows, and 28 more variables: location_city <chr>,
## # location_state <chr>, location_country <chr>, location_postcode <chr>,
## # location_coordinates_latitude <dbl>, location_coordinates_longitude <dbl>,
## # location_timezone_offset <chr>, location_timezone_description <chr>,
## # email <chr>, login_uuid <chr>, login_username <chr>, login_password <chr>,
## # login_salt <chr>, login_md5 <chr>, login_sha1 <chr>, login_sha256 <chr>,
## # dob_date <dttm>, dob_age <dbl>, registered_date <dttm>, ...
We can get a list of the column names with:
names(rad_data)
## [1] "gender" "name_title"
## [3] "name_first" "name_last"
## [5] "location_street_number" "location_street_name"
## [7] "location_city" "location_state"
## [9] "location_country" "location_postcode"
## [11] "location_coordinates_latitude" "location_coordinates_longitude"
## [13] "location_timezone_offset" "location_timezone_description"
## [15] "email" "login_uuid"
## [17] "login_username" "login_password"
## [19] "login_salt" "login_md5"
## [21] "login_sha1" "login_sha256"
## [23] "dob_date" "dob_age"
## [25] "registered_date" "registered_age"
## [27] "phone" "cell"
## [29] "id_name" "id_value"
## [31] "picture_large" "picture_medium"
## [33] "picture_thumbnail" "nat"
Q1: Select all the columns that contain information about the date of birth of the person (see dob). Q2: Select all the columns that contain an
id
. Q3: Select all the columns that contain information about the name of the person and its location. Q4: Select all the columns except those that relate to the picture of the person. 🧠 🧠 🧠 🧠 🧠
We achieve powerful selections when using select()
in conjunction with the where()
function. Let’s work withe the mpg
data frame from the ggplot2
package:
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 4.1.3
mpg
## # A tibble: 234 x 11
## manufacturer model displ year cyl trans drv cty hwy fl class
## <chr> <chr> <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr>
## 1 audi a4 1.8 1999 4 auto~ f 18 29 p comp~
## 2 audi a4 1.8 1999 4 manu~ f 21 29 p comp~
## 3 audi a4 2 2008 4 manu~ f 20 31 p comp~
## 4 audi a4 2 2008 4 auto~ f 21 30 p comp~
## 5 audi a4 2.8 1999 6 auto~ f 16 26 p comp~
## 6 audi a4 2.8 1999 6 manu~ f 18 26 p comp~
## 7 audi a4 3.1 2008 6 auto~ f 18 27 p comp~
## 8 audi a4 quattro 1.8 1999 4 manu~ 4 18 26 p comp~
## 9 audi a4 quattro 1.8 1999 4 auto~ 4 16 25 p comp~
## 10 audi a4 quattro 2 2008 4 manu~ 4 20 28 p comp~
## # ... with 224 more rows
Suppose that we want to select the character string columns only, we can do:
mpg %>% select(where(is.character))
## # A tibble: 234 x 6
## manufacturer model trans drv fl class
## <chr> <chr> <chr> <chr> <chr> <chr>
## 1 audi a4 auto(l5) f p compact
## 2 audi a4 manual(m5) f p compact
## 3 audi a4 manual(m6) f p compact
## 4 audi a4 auto(av) f p compact
## 5 audi a4 auto(l5) f p compact
## 6 audi a4 manual(m5) f p compact
## 7 audi a4 auto(av) f p compact
## 8 audi a4 quattro manual(m5) 4 p compact
## 9 audi a4 quattro auto(l5) 4 p compact
## 10 audi a4 quattro manual(m6) 4 p compact
## # ... with 224 more rows
Of course, you can mix the where()
selection with other types of selection:
mpg %>% select(year,
where(is.character) & starts_with("m"))
## # A tibble: 234 x 3
## year manufacturer model
## <int> <chr> <chr>
## 1 1999 audi a4
## 2 1999 audi a4
## 3 2008 audi a4
## 4 2008 audi a4
## 5 1999 audi a4
## 6 1999 audi a4
## 7 2008 audi a4
## 8 1999 audi a4 quattro
## 9 1999 audi a4 quattro
## 10 2008 audi a4 quattro
## # ... with 224 more rows
Finally, it is possible to select columns based on an expression:
In the mpg
data frame, suppose we want to keep only the columns in which the median is below 20:
mpg %>%
select(where(is.numeric)) %>%
select(where(~median(.) < 20))
## # A tibble: 234 x 3
## displ cyl cty
## <dbl> <int> <int>
## 1 1.8 4 18
## 2 1.8 4 21
## 3 2 4 20
## 4 2 4 21
## 5 2.8 6 16
## 6 2.8 6 18
## 7 3.1 6 18
## 8 1.8 4 18
## 9 1.8 4 16
## 10 2 4 20
## # ... with 224 more rows
Exercise 4:
🧠 🧠 🧠 🧠 🧠 > Use again the radous package to generate some data. > Q1: Select all the columns except those of type character. > Q2: Select only the columns that contain name of location information about a person. > Q3: Select the columns that contain the name information and the numeric columns. > Q4: Above, we have used the following code chunk:
mpg %>%
select(where(is.numeric)) %>%
select(where(~median(.) < 20))
Try to rewrite it in a more compact way.
🧠 🧠 🧠 🧠 🧠
2. Filtering data with filter()
The filter
function in dplyr allows us to keep a specific set of the data according a certain condition. Working with the mpg
data frame, we can for example filter (keep) the vehicles that have a cty greater than or equal to 20:
mpg %>% filter(cty >= 20)
## # A tibble: 56 x 11
## manufacturer model displ year cyl trans drv cty hwy fl class
## <chr> <chr> <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr>
## 1 audi a4 1.8 1999 4 manu~ f 21 29 p comp~
## 2 audi a4 2 2008 4 manu~ f 20 31 p comp~
## 3 audi a4 2 2008 4 auto~ f 21 30 p comp~
## 4 audi a4 quattro 2 2008 4 manu~ 4 20 28 p comp~
## 5 chevrolet malibu 2.4 2008 4 auto~ f 22 30 r mids~
## 6 honda civic 1.6 1999 4 manu~ f 28 33 r subc~
## 7 honda civic 1.6 1999 4 auto~ f 24 32 r subc~
## 8 honda civic 1.6 1999 4 manu~ f 25 32 r subc~
## 9 honda civic 1.6 1999 4 manu~ f 23 29 p subc~
## 10 honda civic 1.6 1999 4 auto~ f 24 32 r subc~
## # ... with 46 more rows
We can put several conditions inside the filter()
function. For example, we can subset the vehicles that have a cty greater than or equal to 20 and which have been produced in 2008:
mpg %>% filter(cty >= 20 & year == 2008)
## # A tibble: 34 x 11
## manufacturer model displ year cyl trans drv cty hwy fl class
## <chr> <chr> <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr>
## 1 audi a4 2 2008 4 manu~ f 20 31 p comp~
## 2 audi a4 2 2008 4 auto~ f 21 30 p comp~
## 3 audi a4 quattro 2 2008 4 manu~ 4 20 28 p comp~
## 4 chevrolet malibu 2.4 2008 4 auto~ f 22 30 r mids~
## 5 honda civic 1.8 2008 4 manu~ f 26 34 r subc~
## 6 honda civic 1.8 2008 4 auto~ f 25 36 r subc~
## 7 honda civic 1.8 2008 4 auto~ f 24 36 c subc~
## 8 honda civic 2 2008 4 manu~ f 21 29 p subc~
## 9 hyundai sonata 2.4 2008 4 auto~ f 21 30 r mids~
## 10 hyundai sonata 2.4 2008 4 manu~ f 21 31 r mids~
## # ... with 24 more rows
We can also use union selection, for example, we might be interested to keep only German vehicles:
mpg %>% filter(
manufacturer == "audi" | manufacturer == "volkswagen"
)
## # A tibble: 45 x 11
## manufacturer model displ year cyl trans drv cty hwy fl class
## <chr> <chr> <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr>
## 1 audi a4 1.8 1999 4 auto~ f 18 29 p comp~
## 2 audi a4 1.8 1999 4 manu~ f 21 29 p comp~
## 3 audi a4 2 2008 4 manu~ f 20 31 p comp~
## 4 audi a4 2 2008 4 auto~ f 21 30 p comp~
## 5 audi a4 2.8 1999 6 auto~ f 16 26 p comp~
## 6 audi a4 2.8 1999 6 manu~ f 18 26 p comp~
## 7 audi a4 3.1 2008 6 auto~ f 18 27 p comp~
## 8 audi a4 quattro 1.8 1999 4 manu~ 4 18 26 p comp~
## 9 audi a4 quattro 1.8 1999 4 auto~ 4 16 25 p comp~
## 10 audi a4 quattro 2 2008 4 manu~ 4 20 28 p comp~
## # ... with 35 more rows
Above, we’re asking filter()
to keep the observations where the manufacturer is either audi
or volkswagen
.
Sometimes, we’re interested in filtering a range of values, so let’s we want to keep the cars with a cty
between 20 and 30 (both not included). We can write:
mpg %>%
filter(cty > 20 & cty < 30)
## # A tibble: 43 x 11
## manufacturer model displ year cyl trans drv cty hwy fl class
## <chr> <chr> <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr>
## 1 audi a4 1.8 1999 4 manual(m~ f 21 29 p comp~
## 2 audi a4 2 2008 4 auto(av) f 21 30 p comp~
## 3 chevrolet malibu 2.4 2008 4 auto(l4) f 22 30 r mids~
## 4 honda civic 1.6 1999 4 manual(m~ f 28 33 r subc~
## 5 honda civic 1.6 1999 4 auto(l4) f 24 32 r subc~
## 6 honda civic 1.6 1999 4 manual(m~ f 25 32 r subc~
## 7 honda civic 1.6 1999 4 manual(m~ f 23 29 p subc~
## 8 honda civic 1.6 1999 4 auto(l4) f 24 32 r subc~
## 9 honda civic 1.8 2008 4 manual(m~ f 26 34 r subc~
## 10 honda civic 1.8 2008 4 auto(l5) f 25 36 r subc~
## # ... with 33 more rows
We can also use the convenient between()
function however be aware the between()
will include the lower and the upper limit. So, if we want to select cars that have a cty
between 20 and 30 both included:
mpg %>%
filter(between(cty, left = 20, right = 30))
## # A tibble: 54 x 11
## manufacturer model displ year cyl trans drv cty hwy fl class
## <chr> <chr> <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr>
## 1 audi a4 1.8 1999 4 manu~ f 21 29 p comp~
## 2 audi a4 2 2008 4 manu~ f 20 31 p comp~
## 3 audi a4 2 2008 4 auto~ f 21 30 p comp~
## 4 audi a4 quattro 2 2008 4 manu~ 4 20 28 p comp~
## 5 chevrolet malibu 2.4 2008 4 auto~ f 22 30 r mids~
## 6 honda civic 1.6 1999 4 manu~ f 28 33 r subc~
## 7 honda civic 1.6 1999 4 auto~ f 24 32 r subc~
## 8 honda civic 1.6 1999 4 manu~ f 25 32 r subc~
## 9 honda civic 1.6 1999 4 manu~ f 23 29 p subc~
## 10 honda civic 1.6 1999 4 auto~ f 24 32 r subc~
## # ... with 44 more rows
Sometimes, you want to filter around a specific value. For example, we might want to keep only the vehicles with an hwy
near 30 with a tolerance of +2/-2. We can achieve that using the near()
function:
mpg %>%
filter(near(x = hwy, y = 30, tol = 2))
## # A tibble: 33 x 11
## manufacturer model displ year cyl trans drv cty hwy fl class
## <chr> <chr> <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr>
## 1 audi a4 1.8 1999 4 auto(l5) f 18 29 p comp~
## 2 audi a4 1.8 1999 4 manual(m~ f 21 29 p comp~
## 3 audi a4 2 2008 4 manual(m~ f 20 31 p comp~
## 4 audi a4 2 2008 4 auto(av) f 21 30 p comp~
## 5 chevrolet malibu 2.4 2008 4 auto(l4) f 22 30 r mids~
## 6 chevrolet malibu 3.5 2008 6 auto(l4) f 18 29 r mids~
## 7 honda civic 1.6 1999 4 manual(m~ f 23 29 p subc~
## 8 honda civic 2 2008 4 manual(m~ f 21 29 p subc~
## 9 hyundai sonata 2.4 2008 4 auto(l4) f 21 30 r mids~
## 10 hyundai sonata 2.4 2008 4 manual(m~ f 21 31 r mids~
## # ... with 23 more rows
Exercise 5:
🧠 🧠 🧠 🧠 🧠 > Q1: From the mpg data frame, keep only the German cars that are produced in 2008.
Q2: From the penguins data frame, keep the penguins recorded in the
Torgersen
island that have bill lingth greater or equal than 40mm. Then select the columnspecies
,island
,body_mass_g
andsex
.
Q3: Filter (keep) the female penguins, then select only the numeric columns.
Q4: Now, keep only the female
Chinstrap
penguins with a body mass inferior than 3100 gramme and with a bill length of at least 40mm. Finally, keep only thefactors
columns and those that are measured inmm
.
Q5: Filter the
Chinstrap
penguins with abody_mass_g
between 4000 and 4500 (both included). Finally select thebody_mass_g
column. (Bonus point if you can use thebetween()
function)
Q6: Filter the penguins that have a
body_mass_g
near 4000. Give it a tolerance of 2 standard deviation. 🧠 🧠 🧠 🧠 🧠
3. Renaming columns with rename()
We’ve seen that we can rename columns in the selection process. Nonetheless, rename()
allows to rename columns in a more convenient way:
penguins %>%
rename(penguin_spec = species, location = island)
## # A tibble: 344 x 8
## penguin_spec location bill_length_mm bill_depth_mm flipper_length_mm
## <fct> <fct> <dbl> <dbl> <int>
## 1 Adelie Torgersen 39.1 18.7 181
## 2 Adelie Torgersen 39.5 17.4 186
## 3 Adelie Torgersen 40.3 18 195
## 4 Adelie Torgersen NA NA NA
## 5 Adelie Torgersen 36.7 19.3 193
## 6 Adelie Torgersen 39.3 20.6 190
## 7 Adelie Torgersen 38.9 17.8 181
## 8 Adelie Torgersen 39.2 19.6 195
## 9 Adelie Torgersen 34.1 18.1 193
## 10 Adelie Torgersen 42 20.2 190
## # ... with 334 more rows, and 3 more variables: body_mass_g <int>, sex <fct>,
## # year <int>
The pattern is clear, the new name is on the left and the old name on the right.
Exercise 6:
🧠 🧠 🧠 🧠 🧠
> Q1: Choose 3 columns from the Penguins
data frame of your choice and rename them using the rename()
function.
🧠 🧠 🧠 🧠 🧠
4. Getting distinct values with distinct()
and n_distinct()
distinct()
returns all the values available inside a variable nonetheless it drops the duplicates, you get distinct values. Consider the following example, we want to know which year
observations are available within the penguins
data frame:
penguins %>% distinct(year)
## # A tibble: 3 x 1
## year
## <int>
## 1 2007
## 2 2008
## 3 2009
n_distinct()
on the other hand counts the number of unique values available within a column. So if we use it in our previous example, we get:
penguins %>% pull(year) %>% n_distinct()
## [1] 3
Suppose we want to grab all the columns that take a limited number of values, let’s say 4 or below. We can mix distict()
with where()
as follows:
penguins %>%
select(where(~n_distinct(.) < 4))
## # A tibble: 344 x 4
## species island sex year
## <fct> <fct> <fct> <int>
## 1 Adelie Torgersen male 2007
## 2 Adelie Torgersen female 2007
## 3 Adelie Torgersen female 2007
## 4 Adelie Torgersen <NA> 2007
## 5 Adelie Torgersen female 2007
## 6 Adelie Torgersen male 2007
## 7 Adelie Torgersen female 2007
## 8 Adelie Torgersen male 2007
## 9 Adelie Torgersen <NA> 2007
## 10 Adelie Torgersen <NA> 2007
## # ... with 334 more rows
Exercise 7:
🧠 🧠 🧠 🧠 🧠
> Q1: From the mpg
data frame choose only the columns that can take only 10 unique values or below.
Q2: Using
distinct()
list the unique values that can be taken by themanufacturer
column, do the same withmodel
. 🧠 🧠 🧠 🧠 🧠
5. Column sorting with arrange()
It is possible to order a data frame according to the values of a specific column. Using the arrange()
function we can for example sort the mpg
vehicles according the displ
variable:
mpg %>%
arrange(displ)
## # A tibble: 234 x 11
## manufacturer model displ year cyl trans drv cty hwy fl class
## <chr> <chr> <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr>
## 1 honda civic 1.6 1999 4 manu~ f 28 33 r subc~
## 2 honda civic 1.6 1999 4 auto~ f 24 32 r subc~
## 3 honda civic 1.6 1999 4 manu~ f 25 32 r subc~
## 4 honda civic 1.6 1999 4 manu~ f 23 29 p subc~
## 5 honda civic 1.6 1999 4 auto~ f 24 32 r subc~
## 6 audi a4 1.8 1999 4 auto~ f 18 29 p comp~
## 7 audi a4 1.8 1999 4 manu~ f 21 29 p comp~
## 8 audi a4 quattro 1.8 1999 4 manu~ 4 18 26 p comp~
## 9 audi a4 quattro 1.8 1999 4 auto~ 4 16 25 p comp~
## 10 honda civic 1.8 2008 4 manu~ f 26 34 r subc~
## # ... with 224 more rows
As you can see from above the ordering is ascending order (from the lowest to the highest value), nonetheless, using desc()
we can specify than we want a descending ordering:
mpg %>%
arrange(desc(displ))
## # A tibble: 234 x 11
## manufacturer model displ year cyl trans drv cty hwy fl class
## <chr> <chr> <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr>
## 1 chevrolet corvette 7 2008 8 manu~ r 15 24 p 2sea~
## 2 chevrolet k1500 tah~ 6.5 1999 8 auto~ 4 14 17 d suv
## 3 chevrolet corvette 6.2 2008 8 manu~ r 16 26 p 2sea~
## 4 chevrolet corvette 6.2 2008 8 auto~ r 15 25 p 2sea~
## 5 jeep grand che~ 6.1 2008 8 auto~ 4 11 14 p suv
## 6 chevrolet c1500 sub~ 6 2008 8 auto~ r 12 17 r suv
## 7 dodge durango 4~ 5.9 1999 8 auto~ 4 11 15 r suv
## 8 dodge ram 1500 ~ 5.9 1999 8 auto~ 4 11 15 r pick~
## 9 chevrolet c1500 sub~ 5.7 1999 8 auto~ r 13 17 r suv
## 10 chevrolet corvette 5.7 1999 8 manu~ r 16 26 p 2sea~
## # ... with 224 more rows
We can also arrange character columns by alphabetical order:
mpg %>%
arrange(manufacturer)
## # A tibble: 234 x 11
## manufacturer model displ year cyl trans drv cty hwy fl class
## <chr> <chr> <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr>
## 1 audi a4 1.8 1999 4 auto~ f 18 29 p comp~
## 2 audi a4 1.8 1999 4 manu~ f 21 29 p comp~
## 3 audi a4 2 2008 4 manu~ f 20 31 p comp~
## 4 audi a4 2 2008 4 auto~ f 21 30 p comp~
## 5 audi a4 2.8 1999 6 auto~ f 16 26 p comp~
## 6 audi a4 2.8 1999 6 manu~ f 18 26 p comp~
## 7 audi a4 3.1 2008 6 auto~ f 18 27 p comp~
## 8 audi a4 quattro 1.8 1999 4 manu~ 4 18 26 p comp~
## 9 audi a4 quattro 1.8 1999 4 auto~ 4 16 25 p comp~
## 10 audi a4 quattro 2 2008 4 manu~ 4 20 28 p comp~
## # ... with 224 more rows
We can arrange several columns at the sime time:
mpg %>%
arrange(manufacturer, year)
## # A tibble: 234 x 11
## manufacturer model displ year cyl trans drv cty hwy fl class
## <chr> <chr> <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr>
## 1 audi a4 1.8 1999 4 auto~ f 18 29 p comp~
## 2 audi a4 1.8 1999 4 manu~ f 21 29 p comp~
## 3 audi a4 2.8 1999 6 auto~ f 16 26 p comp~
## 4 audi a4 2.8 1999 6 manu~ f 18 26 p comp~
## 5 audi a4 quattro 1.8 1999 4 manu~ 4 18 26 p comp~
## 6 audi a4 quattro 1.8 1999 4 auto~ 4 16 25 p comp~
## 7 audi a4 quattro 2.8 1999 6 auto~ 4 15 25 p comp~
## 8 audi a4 quattro 2.8 1999 6 manu~ 4 17 25 p comp~
## 9 audi a6 quattro 2.8 1999 6 auto~ 4 15 24 p mids~
## 10 audi a4 2 2008 4 manu~ f 20 31 p comp~
## # ... with 224 more rows
As you can see from above, the manufacturer
column is first ordered, then the year
column.
Exercise 8:
🧠 🧠 🧠 🧠 🧠
> Q1: From the penguins
data frame, select all the columns that contains the character mm
, then keep only the column where the mean is below 50. Finally arrange the values in descending order by the bill_length_mm
variable.
> Q2: Now, the repeat Q1 however this time, sort the final data frame according to its first column no matter its content.
🧠 🧠 🧠 🧠 🧠
- Dropping
NA
values withdrop_na()
fromtidyr
NA
is an abbreviation for Non Available
and they are used in R
to exhibit the fact that a value is a missing. Let’s take a look at the penguins
data frame again:
penguins
## # A tibble: 344 x 8
## species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
## <fct> <fct> <dbl> <dbl> <int> <int>
## 1 Adelie Torgersen 39.1 18.7 181 3750
## 2 Adelie Torgersen 39.5 17.4 186 3800
## 3 Adelie Torgersen 40.3 18 195 3250
## 4 Adelie Torgersen NA NA NA NA
## 5 Adelie Torgersen 36.7 19.3 193 3450
## 6 Adelie Torgersen 39.3 20.6 190 3650
## 7 Adelie Torgersen 38.9 17.8 181 3625
## 8 Adelie Torgersen 39.2 19.6 195 4675
## 9 Adelie Torgersen 34.1 18.1 193 3475
## 10 Adelie Torgersen 42 20.2 190 4250
## # ... with 334 more rows, and 2 more variables: sex <fct>, year <int>
We can see that there are some NA
values in it. In order to get the number of NA
values in each column, the simplest way is to use the base R
colSums()
function that calculate the sum of each numeric column within a data frame:
colSums(is.na(penguins))
## species island bill_length_mm bill_depth_mm
## 0 0 2 2
## flipper_length_mm body_mass_g sex year
## 2 2 11 0
From above, we can see that we have exactly 19 NAs
:
na_val <- colSums(is.na(penguins))
sum(na_val)
## [1] 19
We can drop all the rows that contains NA
values using the drop_na()
function:
library(tidyr)
## Warning: package 'tidyr' was built under R version 4.1.3
##
## Attaching package: 'tidyr'
## The following object is masked from 'package:magrittr':
##
## extract
peng_no_na <- penguins %>% drop_na()
peng_no_na
## # A tibble: 333 x 8
## species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
## <fct> <fct> <dbl> <dbl> <int> <int>
## 1 Adelie Torgersen 39.1 18.7 181 3750
## 2 Adelie Torgersen 39.5 17.4 186 3800
## 3 Adelie Torgersen 40.3 18 195 3250
## 4 Adelie Torgersen 36.7 19.3 193 3450
## 5 Adelie Torgersen 39.3 20.6 190 3650
## 6 Adelie Torgersen 38.9 17.8 181 3625
## 7 Adelie Torgersen 39.2 19.6 195 4675
## 8 Adelie Torgersen 41.1 17.6 182 3200
## 9 Adelie Torgersen 38.6 21.2 191 3800
## 10 Adelie Torgersen 34.6 21.1 198 4400
## # ... with 323 more rows, and 2 more variables: sex <fct>, year <int>
Now if we repeat the previous colSums
command, we get the following:
colSums(is.na(peng_no_na))
## species island bill_length_mm bill_depth_mm
## 0 0 0 0
## flipper_length_mm body_mass_g sex year
## 0 0 0 0
So all the NA
values and their corresponding rows have been dropped.
In practice however, we just want to drop the NA
values that are available within a specific column without impacting the other NAs
in our data frame. Suppose, that we need to do some computations on our body_mass_g
function and those computations don’t accept NAs
. We can drop the missing values available within a particular column by specifying the column inside the drop_na()
function:
peng_bm_no_na <- penguins %>%
drop_na(body_mass_g)
Now, we can see that the body_mass_g
has no NAs
anymore:
colSums(is.na(peng_bm_no_na))
## species island bill_length_mm bill_depth_mm
## 0 0 0 0
## flipper_length_mm body_mass_g sex year
## 0 0 9 0
Nonetheless, some NAs
in the sex
variable hasn’t been impacted simply because they are on different rows. Remember than removing an NA
will remove the entire row.
- Summary values with
group_by()
andsummarise()
The summarise()
function allows us to proceed calculations on a specific column or a set of columns.
Suppose, we want to get the mean of the bill_length_mm
variable in the penguins
data frame:
penguins %>%
drop_na(bill_length_mm) %>%
summarise(mean(bill_length_mm))
## # A tibble: 1 x 1
## `mean(bill_length_mm)`
## <dbl>
## 1 43.9
It is possible to give a name to our new summary metric:
penguins %>%
drop_na(bill_length_mm) %>%
summarise(avg = mean(bill_length_mm))
## # A tibble: 1 x 1
## avg
## <dbl>
## 1 43.9
It is possible to calculate a metric on several columns:
penguins %>%
drop_na(bill_length_mm, body_mass_g) %>%
summarise(
avg_bill_length = mean(bill_length_mm),
avg_body_mass = mean(body_mass_g)
)
## # A tibble: 1 x 2
## avg_bill_length avg_body_mass
## <dbl> <dbl>
## 1 43.9 4202.
Exercise 9:
🧠 🧠 🧠 🧠 🧠
> Q1: Using summarise()
calculate the median of the flipper_length_mm
and the bill_depth_mm
.
Q2: Same as Q1, but now calculate the sum of the values.
Q3: We’ve seen the following code above:
penguins %>%
drop_na(bill_length_mm, body_mass_g) %>%
summarise(
avg_bill_length = mean(bill_length_mm),
avg_body_mass = mean(body_mass_g)
)
Try to get the same results however this time without using the
drop_na()
function. 🧠 🧠 🧠 🧠 🧠
Now, what if we wanted to calculate a metric on several columns but without having the need to write each column name. Let’s say we’re interested in the following columns: bill_length_mm, bill_depth_mm, flipper_length_mm and body_mass_g
penguins %>%
summarise(
across(.cols = bill_length_mm:body_mass_g,
.fns = ~mean(., na.rm = TRUE))
)
## # A tibble: 1 x 4
## bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
## <dbl> <dbl> <dbl> <dbl>
## 1 43.9 17.2 201. 4202.
Now, what if we want to calculate the average on all the numeric columns. We can use across
and where
:
penguins %>%
summarise(
across(
where(is.numeric),
~mean(., na.rm = TRUE)
)
)
## # A tibble: 1 x 5
## bill_length_mm bill_depth_mm flipper_length_mm body_mass_g year
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 43.9 17.2 201. 4202. 2008.
We can drop the year
column if we want to:
penguins %>%
summarise(
across(
.cols = c(where(is.numeric), -year),
~mean(., na.rm = TRUE)
)
)
## # A tibble: 1 x 4
## bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
## <dbl> <dbl> <dbl> <dbl>
## 1 43.9 17.2 201. 4202.
Exercise 10:
🧠 🧠 🧠 🧠 🧠
> * Q1: In the mpg data frame, calculate the median of all the numeric variables.
> * Q2: Do the same as Q1, but now drop the year
variable.
> * Q3: Calculate the number of distinct values for each character column in the mpg
data frame.
> * Q4: In the penguins
data frame, calculate the number of distinct values for the columns of type factor
.
🧠 🧠 🧠 🧠 🧠
summarise()
is a powerful function, especially when you combine it with where()
and across()
however it becomes way more powerful when you add a grouping layer using the group_by()
function. The combination of group_by()
and summarise()
allows us to calculate metrics per group. Let’s take an example. In our mpg
data frame, suppose we’re interested in the average miles per gallon in highway per manufacturer
. We can write:
mpg %>%
group_by(manufacturer) %>%
summarise(avg_hwy = mean(hwy))
## # A tibble: 15 x 2
## manufacturer avg_hwy
## <chr> <dbl>
## 1 audi 26.4
## 2 chevrolet 21.9
## 3 dodge 17.9
## 4 ford 19.4
## 5 honda 32.6
## 6 hyundai 26.9
## 7 jeep 17.6
## 8 land rover 16.5
## 9 lincoln 17
## 10 mercury 18
## 11 nissan 24.6
## 12 pontiac 26.4
## 13 subaru 25.6
## 14 toyota 24.9
## 15 volkswagen 29.2
We can also arrange the results in descending order using the arrange()
and desc()
functions:
mpg %>%
group_by(manufacturer) %>%
summarise(avg_hwy = mean(hwy)) %>%
arrange(desc(avg_hwy))
## # A tibble: 15 x 2
## manufacturer avg_hwy
## <chr> <dbl>
## 1 honda 32.6
## 2 volkswagen 29.2
## 3 hyundai 26.9
## 4 audi 26.4
## 5 pontiac 26.4
## 6 subaru 25.6
## 7 toyota 24.9
## 8 nissan 24.6
## 9 chevrolet 21.9
## 10 ford 19.4
## 11 mercury 18
## 12 dodge 17.9
## 13 jeep 17.6
## 14 lincoln 17
## 15 land rover 16.5
We can also group several variables at the same time. Suppose, I’m interested in the average hwy
for each manufacturer and model:
mpg %>%
group_by(manufacturer, model) %>%
summarise(avg_hwy = mean(hwy)) %>%
arrange(desc(avg_hwy))
## `summarise()` has grouped output by 'manufacturer'. You can override using the
## `.groups` argument.
## # A tibble: 38 x 3
## # Groups: manufacturer [15]
## manufacturer model avg_hwy
## <chr> <chr> <dbl>
## 1 toyota corolla 34
## 2 volkswagen new beetle 32.8
## 3 honda civic 32.6
## 4 volkswagen jetta 29.1
## 5 nissan altima 28.7
## 6 audi a4 28.3
## 7 toyota camry 28.3
## 8 toyota camry solara 28.1
## 9 hyundai sonata 27.7
## 10 chevrolet malibu 27.6
## # ... with 28 more rows
I can also calculate several measures:
mpg %>%
group_by(manufacturer, model) %>%
summarise(avg_hwy = mean(hwy), median_hwy = median(hwy)) %>%
arrange(desc(avg_hwy))
## `summarise()` has grouped output by 'manufacturer'. You can override using the
## `.groups` argument.
## # A tibble: 38 x 4
## # Groups: manufacturer [15]
## manufacturer model avg_hwy median_hwy
## <chr> <chr> <dbl> <dbl>
## 1 toyota corolla 34 35
## 2 volkswagen new beetle 32.8 29
## 3 honda civic 32.6 32
## 4 volkswagen jetta 29.1 29
## 5 nissan altima 28.7 28
## 6 audi a4 28.3 29
## 7 toyota camry 28.3 28
## 8 toyota camry solara 28.1 27
## 9 hyundai sonata 27.7 27
## 10 chevrolet malibu 27.6 27
## # ... with 28 more rows
We can also group by a variable and then get a metric for all numerical variables (for example):
mpg %>%
group_by(manufacturer) %>%
summarise(
across(
.cols = where(is.numeric),
.fns = median
)
)
## # A tibble: 15 x 6
## manufacturer displ year cyl cty hwy
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 audi 2.8 2004. 6 17.5 26
## 2 chevrolet 5.3 2008 8 15 23
## 3 dodge 4.7 2008 8 13 17
## 4 ford 4.6 1999 8 14 18
## 5 honda 1.6 1999 4 24 32
## 6 hyundai 2.4 2008 4 18.5 26.5
## 7 jeep 4.7 2008 8 14 18.5
## 8 land rover 4.3 2004. 8 11.5 16.5
## 9 lincoln 5.4 1999 8 11 17
## 10 mercury 4.3 2004. 7 13 18
## 11 nissan 3.3 2008 6 19 26
## 12 pontiac 3.8 1999 6 17 26
## 13 subaru 2.5 2008 4 19 26
## 14 toyota 2.7 1999 4 18 26
## 15 volkswagen 2 1999 4 21 29
group_by()
and summarise()
are especially useful to check how many times a specific value of a variable appear within the column, or in other word, the count of the variable. For example, we might be interesting in counting the values of the manufacturer
column:
mpg %>%
group_by(manufacturer) %>%
summarise(amazing_count = n()) %>%
arrange(desc(amazing_count))
## # A tibble: 15 x 2
## manufacturer amazing_count
## <chr> <int>
## 1 dodge 37
## 2 toyota 34
## 3 volkswagen 27
## 4 ford 25
## 5 chevrolet 19
## 6 audi 18
## 7 hyundai 14
## 8 subaru 14
## 9 nissan 13
## 10 honda 9
## 11 jeep 8
## 12 pontiac 5
## 13 land rover 4
## 14 mercury 4
## 15 lincoln 3
The n()
function is part of dplyr
and is used to count the observations available within a particular column.
Counting variable’s observations is a recurrent process and is undertaken very often, so ofter than dply
provides a shortcut, so instead of writing the above code, we can use the count()
function:
mpg %>%
count(manufacturer, sort = TRUE)
## # A tibble: 15 x 2
## manufacturer n
## <chr> <int>
## 1 dodge 37
## 2 toyota 34
## 3 volkswagen 27
## 4 ford 25
## 5 chevrolet 19
## 6 audi 18
## 7 hyundai 14
## 8 subaru 14
## 9 nissan 13
## 10 honda 9
## 11 jeep 8
## 12 pontiac 5
## 13 land rover 4
## 14 mercury 4
## 15 lincoln 3
Exercise 10:
🧠 🧠 🧠 🧠 🧠
> *Q1: Using the starwars
data frame in dplyr
, calculate the average mass
according to each homeworld
.
*Q2: Calculate the median
mass
for each species, which one has the highestmass
and which one has the lowest.
*Q3: Calculate the median and the mean
height
for each eye_color and homeworld (both at the same time)
*Q4: How many characteres are from Tatooine?
*Q5: How many feminine and how many masculin (use the
gender
variable) human characters are from Naboo?
*Q6: How many Human characters with blue eyes are from Tatooine?
*Q7: How many characters have a mass superior than 100KG?
*Q8: How many characters have a height inferior than 70cm?
*Q9: Which homeland has the highest average mass?
*Q10: Are characters in Blong hair in average taller than the others?
starwars
## # A tibble: 87 x 14
## name height mass hair_color skin_color eye_color birth_year sex gender
## <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr>
## 1 Luke Sk~ 172 77 blond fair blue 19 male mascu~
## 2 C-3PO 167 75 <NA> gold yellow 112 none mascu~
## 3 R2-D2 96 32 <NA> white, bl~ red 33 none mascu~
## 4 Darth V~ 202 136 none white yellow 41.9 male mascu~
## 5 Leia Or~ 150 49 brown light brown 19 fema~ femin~
## 6 Owen La~ 178 120 brown, gr~ light blue 52 male mascu~
## 7 Beru Wh~ 165 75 brown light blue 47 fema~ femin~
## 8 R5-D4 97 32 <NA> white, red red NA none mascu~
## 9 Biggs D~ 183 84 black light brown 24 male mascu~
## 10 Obi-Wan~ 182 77 auburn, w~ fair blue-gray 57 male mascu~
## # ... with 77 more rows, and 5 more variables: homeworld <chr>, species <chr>,
## # films <list>, vehicles <list>, starships <list>
8. Adding a new column or modifying an existing one with mutate()
In dplyr
, it is possible to add a new column or to modify an existing one using the mutate()
function. Let’s have an example and work with the chickwts
built in data frame. It exhibits several information about the feeding types and the weight of 71 chicks.
chickwts
## weight feed
## 1 179 horsebean
## 2 160 horsebean
## 3 136 horsebean
## 4 227 horsebean
## 5 217 horsebean
## 6 168 horsebean
## 7 108 horsebean
## 8 124 horsebean
## 9 143 horsebean
## 10 140 horsebean
## 11 309 linseed
## 12 229 linseed
## 13 181 linseed
## 14 141 linseed
## 15 260 linseed
## 16 203 linseed
## 17 148 linseed
## 18 169 linseed
## 19 213 linseed
## 20 257 linseed
## 21 244 linseed
## 22 271 linseed
## 23 243 soybean
## 24 230 soybean
## 25 248 soybean
## 26 327 soybean
## 27 329 soybean
## 28 250 soybean
## 29 193 soybean
## 30 271 soybean
## 31 316 soybean
## 32 267 soybean
## 33 199 soybean
## 34 171 soybean
## 35 158 soybean
## 36 248 soybean
## 37 423 sunflower
## 38 340 sunflower
## 39 392 sunflower
## 40 339 sunflower
## 41 341 sunflower
## 42 226 sunflower
## 43 320 sunflower
## 44 295 sunflower
## 45 334 sunflower
## 46 322 sunflower
## 47 297 sunflower
## 48 318 sunflower
## 49 325 meatmeal
## 50 257 meatmeal
## 51 303 meatmeal
## 52 315 meatmeal
## 53 380 meatmeal
## 54 153 meatmeal
## 55 263 meatmeal
## 56 242 meatmeal
## 57 206 meatmeal
## 58 344 meatmeal
## 59 258 meatmeal
## 60 368 casein
## 61 390 casein
## 62 379 casein
## 63 260 casein
## 64 404 casein
## 65 318 casein
## 66 352 casein
## 67 359 casein
## 68 216 casein
## 69 222 casein
## 70 283 casein
## 71 332 casein
As you can see from above the weight
column is expressed in grams. What if for whatever reason, we wanted to express the weight
in Kg
. We will use the mutate()
function to create the needed column as follow:
chickwts %>%
mutate(weight / 1000)
## weight feed weight/1000
## 1 179 horsebean 0.179
## 2 160 horsebean 0.160
## 3 136 horsebean 0.136
## 4 227 horsebean 0.227
## 5 217 horsebean 0.217
## 6 168 horsebean 0.168
## 7 108 horsebean 0.108
## 8 124 horsebean 0.124
## 9 143 horsebean 0.143
## 10 140 horsebean 0.140
## 11 309 linseed 0.309
## 12 229 linseed 0.229
## 13 181 linseed 0.181
## 14 141 linseed 0.141
## 15 260 linseed 0.260
## 16 203 linseed 0.203
## 17 148 linseed 0.148
## 18 169 linseed 0.169
## 19 213 linseed 0.213
## 20 257 linseed 0.257
## 21 244 linseed 0.244
## 22 271 linseed 0.271
## 23 243 soybean 0.243
## 24 230 soybean 0.230
## 25 248 soybean 0.248
## 26 327 soybean 0.327
## 27 329 soybean 0.329
## 28 250 soybean 0.250
## 29 193 soybean 0.193
## 30 271 soybean 0.271
## 31 316 soybean 0.316
## 32 267 soybean 0.267
## 33 199 soybean 0.199
## 34 171 soybean 0.171
## 35 158 soybean 0.158
## 36 248 soybean 0.248
## 37 423 sunflower 0.423
## 38 340 sunflower 0.340
## 39 392 sunflower 0.392
## 40 339 sunflower 0.339
## 41 341 sunflower 0.341
## 42 226 sunflower 0.226
## 43 320 sunflower 0.320
## 44 295 sunflower 0.295
## 45 334 sunflower 0.334
## 46 322 sunflower 0.322
## 47 297 sunflower 0.297
## 48 318 sunflower 0.318
## 49 325 meatmeal 0.325
## 50 257 meatmeal 0.257
## 51 303 meatmeal 0.303
## 52 315 meatmeal 0.315
## 53 380 meatmeal 0.380
## 54 153 meatmeal 0.153
## 55 263 meatmeal 0.263
## 56 242 meatmeal 0.242
## 57 206 meatmeal 0.206
## 58 344 meatmeal 0.344
## 59 258 meatmeal 0.258
## 60 368 casein 0.368
## 61 390 casein 0.390
## 62 379 casein 0.379
## 63 260 casein 0.260
## 64 404 casein 0.404
## 65 318 casein 0.318
## 66 352 casein 0.352
## 67 359 casein 0.359
## 68 216 casein 0.216
## 69 222 casein 0.222
## 70 283 casein 0.283
## 71 332 casein 0.332
So, above we used the initial weight
column and divided it by 1000 to the get the new weight/1000
column. We could easily give a nice name to our new column:
chickwts %>%
mutate(weight_kg = weight / 1000)
## weight feed weight_kg
## 1 179 horsebean 0.179
## 2 160 horsebean 0.160
## 3 136 horsebean 0.136
## 4 227 horsebean 0.227
## 5 217 horsebean 0.217
## 6 168 horsebean 0.168
## 7 108 horsebean 0.108
## 8 124 horsebean 0.124
## 9 143 horsebean 0.143
## 10 140 horsebean 0.140
## 11 309 linseed 0.309
## 12 229 linseed 0.229
## 13 181 linseed 0.181
## 14 141 linseed 0.141
## 15 260 linseed 0.260
## 16 203 linseed 0.203
## 17 148 linseed 0.148
## 18 169 linseed 0.169
## 19 213 linseed 0.213
## 20 257 linseed 0.257
## 21 244 linseed 0.244
## 22 271 linseed 0.271
## 23 243 soybean 0.243
## 24 230 soybean 0.230
## 25 248 soybean 0.248
## 26 327 soybean 0.327
## 27 329 soybean 0.329
## 28 250 soybean 0.250
## 29 193 soybean 0.193
## 30 271 soybean 0.271
## 31 316 soybean 0.316
## 32 267 soybean 0.267
## 33 199 soybean 0.199
## 34 171 soybean 0.171
## 35 158 soybean 0.158
## 36 248 soybean 0.248
## 37 423 sunflower 0.423
## 38 340 sunflower 0.340
## 39 392 sunflower 0.392
## 40 339 sunflower 0.339
## 41 341 sunflower 0.341
## 42 226 sunflower 0.226
## 43 320 sunflower 0.320
## 44 295 sunflower 0.295
## 45 334 sunflower 0.334
## 46 322 sunflower 0.322
## 47 297 sunflower 0.297
## 48 318 sunflower 0.318
## 49 325 meatmeal 0.325
## 50 257 meatmeal 0.257
## 51 303 meatmeal 0.303
## 52 315 meatmeal 0.315
## 53 380 meatmeal 0.380
## 54 153 meatmeal 0.153
## 55 263 meatmeal 0.263
## 56 242 meatmeal 0.242
## 57 206 meatmeal 0.206
## 58 344 meatmeal 0.344
## 59 258 meatmeal 0.258
## 60 368 casein 0.368
## 61 390 casein 0.390
## 62 379 casein 0.379
## 63 260 casein 0.260
## 64 404 casein 0.404
## 65 318 casein 0.318
## 66 352 casein 0.352
## 67 359 casein 0.359
## 68 216 casein 0.216
## 69 222 casein 0.222
## 70 283 casein 0.283
## 71 332 casein 0.332
It is possible to position the new column created according to our wish using the .after
and the .before
parameters of the mutate()
function. By default, the new variable will be set at the last position (from the last) of the data frame.
Suppose, we want to put the new created column weight_kg
at the beginning:
chickwts %>%
mutate(
weight_kg = weight / 1000,
.before = weight
)
## weight_kg weight feed
## 1 0.179 179 horsebean
## 2 0.160 160 horsebean
## 3 0.136 136 horsebean
## 4 0.227 227 horsebean
## 5 0.217 217 horsebean
## 6 0.168 168 horsebean
## 7 0.108 108 horsebean
## 8 0.124 124 horsebean
## 9 0.143 143 horsebean
## 10 0.140 140 horsebean
## 11 0.309 309 linseed
## 12 0.229 229 linseed
## 13 0.181 181 linseed
## 14 0.141 141 linseed
## 15 0.260 260 linseed
## 16 0.203 203 linseed
## 17 0.148 148 linseed
## 18 0.169 169 linseed
## 19 0.213 213 linseed
## 20 0.257 257 linseed
## 21 0.244 244 linseed
## 22 0.271 271 linseed
## 23 0.243 243 soybean
## 24 0.230 230 soybean
## 25 0.248 248 soybean
## 26 0.327 327 soybean
## 27 0.329 329 soybean
## 28 0.250 250 soybean
## 29 0.193 193 soybean
## 30 0.271 271 soybean
## 31 0.316 316 soybean
## 32 0.267 267 soybean
## 33 0.199 199 soybean
## 34 0.171 171 soybean
## 35 0.158 158 soybean
## 36 0.248 248 soybean
## 37 0.423 423 sunflower
## 38 0.340 340 sunflower
## 39 0.392 392 sunflower
## 40 0.339 339 sunflower
## 41 0.341 341 sunflower
## 42 0.226 226 sunflower
## 43 0.320 320 sunflower
## 44 0.295 295 sunflower
## 45 0.334 334 sunflower
## 46 0.322 322 sunflower
## 47 0.297 297 sunflower
## 48 0.318 318 sunflower
## 49 0.325 325 meatmeal
## 50 0.257 257 meatmeal
## 51 0.303 303 meatmeal
## 52 0.315 315 meatmeal
## 53 0.380 380 meatmeal
## 54 0.153 153 meatmeal
## 55 0.263 263 meatmeal
## 56 0.242 242 meatmeal
## 57 0.206 206 meatmeal
## 58 0.344 344 meatmeal
## 59 0.258 258 meatmeal
## 60 0.368 368 casein
## 61 0.390 390 casein
## 62 0.379 379 casein
## 63 0.260 260 casein
## 64 0.404 404 casein
## 65 0.318 318 casein
## 66 0.352 352 casein
## 67 0.359 359 casein
## 68 0.216 216 casein
## 69 0.222 222 casein
## 70 0.283 283 casein
## 71 0.332 332 casein
Instead of creating a new column, we can override the initial column if we provide inside the mutate()
function the same column name:
chickwts %>%
mutate(weight = weight / 1000)
## weight feed
## 1 0.179 horsebean
## 2 0.160 horsebean
## 3 0.136 horsebean
## 4 0.227 horsebean
## 5 0.217 horsebean
## 6 0.168 horsebean
## 7 0.108 horsebean
## 8 0.124 horsebean
## 9 0.143 horsebean
## 10 0.140 horsebean
## 11 0.309 linseed
## 12 0.229 linseed
## 13 0.181 linseed
## 14 0.141 linseed
## 15 0.260 linseed
## 16 0.203 linseed
## 17 0.148 linseed
## 18 0.169 linseed
## 19 0.213 linseed
## 20 0.257 linseed
## 21 0.244 linseed
## 22 0.271 linseed
## 23 0.243 soybean
## 24 0.230 soybean
## 25 0.248 soybean
## 26 0.327 soybean
## 27 0.329 soybean
## 28 0.250 soybean
## 29 0.193 soybean
## 30 0.271 soybean
## 31 0.316 soybean
## 32 0.267 soybean
## 33 0.199 soybean
## 34 0.171 soybean
## 35 0.158 soybean
## 36 0.248 soybean
## 37 0.423 sunflower
## 38 0.340 sunflower
## 39 0.392 sunflower
## 40 0.339 sunflower
## 41 0.341 sunflower
## 42 0.226 sunflower
## 43 0.320 sunflower
## 44 0.295 sunflower
## 45 0.334 sunflower
## 46 0.322 sunflower
## 47 0.297 sunflower
## 48 0.318 sunflower
## 49 0.325 meatmeal
## 50 0.257 meatmeal
## 51 0.303 meatmeal
## 52 0.315 meatmeal
## 53 0.380 meatmeal
## 54 0.153 meatmeal
## 55 0.263 meatmeal
## 56 0.242 meatmeal
## 57 0.206 meatmeal
## 58 0.344 meatmeal
## 59 0.258 meatmeal
## 60 0.368 casein
## 61 0.390 casein
## 62 0.379 casein
## 63 0.260 casein
## 64 0.404 casein
## 65 0.318 casein
## 66 0.352 casein
## 67 0.359 casein
## 68 0.216 casein
## 69 0.222 casein
## 70 0.283 casein
## 71 0.332 casein
Let’s take another example, in the penguins
data frame we’ve the following variables:
penguins
## # A tibble: 344 x 8
## species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
## <fct> <fct> <dbl> <dbl> <int> <int>
## 1 Adelie Torgersen 39.1 18.7 181 3750
## 2 Adelie Torgersen 39.5 17.4 186 3800
## 3 Adelie Torgersen 40.3 18 195 3250
## 4 Adelie Torgersen NA NA NA NA
## 5 Adelie Torgersen 36.7 19.3 193 3450
## 6 Adelie Torgersen 39.3 20.6 190 3650
## 7 Adelie Torgersen 38.9 17.8 181 3625
## 8 Adelie Torgersen 39.2 19.6 195 4675
## 9 Adelie Torgersen 34.1 18.1 193 3475
## 10 Adelie Torgersen 42 20.2 190 4250
## # ... with 334 more rows, and 2 more variables: sex <fct>, year <int>
Let’s say, we want to create a ratio of bill_length_mm
on bill_depth_mm
, call it bill_ratio
and put just after the island
column. We can write:
penguins %>%
mutate(bill_ratio = bill_length_mm / bill_depth_mm,
.after = island)
## # A tibble: 344 x 9
## species island bill_ratio bill_length_mm bill_depth_mm flipper_length_mm
## <fct> <fct> <dbl> <dbl> <dbl> <int>
## 1 Adelie Torgersen 2.09 39.1 18.7 181
## 2 Adelie Torgersen 2.27 39.5 17.4 186
## 3 Adelie Torgersen 2.24 40.3 18 195
## 4 Adelie Torgersen NA NA NA NA
## 5 Adelie Torgersen 1.90 36.7 19.3 193
## 6 Adelie Torgersen 1.91 39.3 20.6 190
## 7 Adelie Torgersen 2.19 38.9 17.8 181
## 8 Adelie Torgersen 2 39.2 19.6 195
## 9 Adelie Torgersen 1.88 34.1 18.1 193
## 10 Adelie Torgersen 2.08 42 20.2 190
## # ... with 334 more rows, and 3 more variables: body_mass_g <int>, sex <fct>,
## # year <int>
Exercise 11:
*Q1: In the
mpg
data frame, calculate the ratiohwy
oncty
, name it whatever you want and put just before thecty
variable.
*Q2: Using
mutate()
, convert theyear
variable (which is ofdouble
type) into afactor
.
*Q3: Use
mutate()
to convert all themanufacturer
values into upper case.
*Q4: Use the
radous
package and theget_date()
function to get 100 observations. Then, usemutate()
to create an new column that will hold the full name of the person (first name and last name). Put the new column name at the start of the data frame.