Data wrangling sometimes could become very tedious. No matter what language do you prefer: R, python, or even SQL, the process of preprocessing your dataset is generally very time consuming. But, this is not the case if you know how to properly use packages included in tidyverse{target="_blank"}.
The term of tidyverse is actually referring to a set of packages that you’ll find very helpful in any data analysis tasks; many of them are already popular among R users, like dplyr{target="_blank"}, ggplot2{target="_blank"}, and lubridate{target="_blank"}. These packages are intended to work side-by-side; they share the same API style, and sharing a common data representation format. But another things that often missed is that the tidyverse itself is also a package! The tidyverse package is intended to make your environment set-up process faster; you only need to install tidyverse library to install the bundled packages, and just load the tidyverse library to load all the core packages.
In this articles (or maybe I should say, in this episode), I will cover the basic usage of core packages in tidyverse, using (of course) starwars{target="_blank"} dataset which included along with dplyr library.
There will be several sequels, and even prequels, for data wrangling using R and tidyverse tutorial–or to put it simple, data wars :grin:–so stay tuned!
Enter the tidyverse
As I mentioned before, you can install all of the included packages just by installing tidyverse. So let’s start by installing the package first:
1
|
install.packages("tidyverse")
|
If you are in an interactive session, you could see which packages are installed along with tidyverse. Just to make sure, let’s load the library:
1
2
3
4
5
6
7
8
9
|
library(tidyverse)
#> ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
#> ✔ ggplot2 3.4.0 ✔ purrr 1.0.1
#> ✔ tibble 3.1.8 ✔ dplyr 1.1.0
#> ✔ tidyr 1.3.0 ✔ stringr 1.5.0
#> ✔ readr 2.1.3 ✔ forcats 0.5.2
#> ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
#> ✖ dplyr::filter() masks stats::filter()
#> ✖ dplyr::lag() masks stats::lag()
|
When importing the library, it give you messages about which packages that are attached to your session. But it only give you information about the core packages that fully loaded to your session, while actually tidyverse also attaching some great libraries in the background. You can see them all through sessionInfo():
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
|
sessionInfo()
#> R version 4.2.2 (2022-10-31 ucrt)
#> Platform: x86_64-w64-mingw32/x64 (64-bit)
#> Running under: Windows 10 x64 (build 19045)
#>
#> Matrix products: default
#>
#> locale:
#> [1] LC_COLLATE=English_Indonesia.utf8 LC_CTYPE=English_Indonesia.utf8
#> [3] LC_MONETARY=English_Indonesia.utf8 LC_NUMERIC=C
#> [5] LC_TIME=English_Indonesia.utf8
#>
#> attached base packages:
#> [1] stats graphics grDevices utils datasets methods base
#>
#> other attached packages:
#> [1] forcats_0.5.2 stringr_1.5.0 dplyr_1.1.0 purrr_1.0.1
#> [5] readr_2.1.3 tidyr_1.3.0 tibble_3.1.8 ggplot2_3.4.0
#> [9] tidyverse_1.3.2
#>
#> loaded via a namespace (and not attached):
#> [1] tidyselect_1.2.0 xfun_0.35 bslib_0.4.2
#> [4] haven_2.5.1 gargle_1.3.0 colorspace_2.0-3
#> [7] vctrs_0.5.2 generics_0.1.3 htmltools_0.5.4
#> [10] yaml_2.3.6 utf8_1.2.3 rlang_1.0.6
#> [13] jquerylib_0.1.4 pillar_1.8.1 withr_2.5.0
#> [16] glue_1.6.2 DBI_1.1.3 dbplyr_2.2.1
#> [19] readxl_1.4.1 modelr_0.1.10 lifecycle_1.0.3
#> [22] munsell_0.5.0 blogdown_1.16 gtable_0.3.1
#> [25] cellranger_1.1.0 rvest_1.0.3 evaluate_0.19
#> [28] knitr_1.41 tzdb_0.3.0 fastmap_1.1.0
#> [31] fansi_1.0.4 broom_1.0.2 backports_1.4.1
#> [34] scales_1.2.1 googlesheets4_1.0.1 cachem_1.0.6
#> [37] jsonlite_1.8.4 fs_1.5.2 hms_1.1.2
#> [40] digest_0.6.30 stringi_1.7.8 bookdown_0.32
#> [43] grid_4.2.2 cli_3.4.1 tools_4.2.2
#> [46] magrittr_2.0.3 sass_0.4.4 crayon_1.5.2
#> [49] pkgconfig_2.0.3 ellipsis_0.3.2 xml2_1.3.3
#> [52] reprex_2.0.2 timechange_0.2.0 googledrive_2.0.0
#> [55] lubridate_1.9.1 assertthat_0.2.1 rmarkdown_2.19
#> [58] httr_1.4.4 rstudioapi_0.14 R6_2.5.1
#> [61] compiler_4.2.2
|
As you can see, it actually loads many packages in the background. Don’t worry, I will not cover all of them in this article; But I will cover them for sure in later sequels and prequels :grin:. If you really curious about those packages, I suggest you to checkout the detailed list at the tidyverse official documentations{target="_blank"}.
Use the standard data representation: tibble
The first step in learning the tidy data wrangling is to be familiar with its standard data representation. Before I explain it in detail, let me give you a quick example by printing the iris dataset:
1
2
3
4
5
6
7
8
|
head(iris)
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> 1 5.1 3.5 1.4 0.2 setosa
#> 2 4.9 3.0 1.4 0.2 setosa
#> 3 4.7 3.2 1.3 0.2 setosa
#> 4 4.6 3.1 1.5 0.2 setosa
#> 5 5.0 3.6 1.4 0.2 setosa
#> 6 5.4 3.9 1.7 0.4 setosa
|
and compare the print output with starwars dataset from dplyr:
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
|
starwars
#> # A tibble: 87 × 14
#> name height mass hair_…¹ skin_…² eye_c…³ birth…⁴ sex gender homew…⁵
#> <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr> <chr>
#> 1 Luke Skywa… 172 77 blond fair blue 19 male mascu… Tatooi…
#> 2 C-3PO 167 75 <NA> gold yellow 112 none mascu… Tatooi…
#> 3 R2-D2 96 32 <NA> white,… red 33 none mascu… Naboo
#> 4 Darth Vader 202 136 none white yellow 41.9 male mascu… Tatooi…
#> 5 Leia Organa 150 49 brown light brown 19 fema… femin… Aldera…
#> 6 Owen Lars 178 120 brown,… light blue 52 male mascu… Tatooi…
#> 7 Beru White… 165 75 brown light blue 47 fema… femin… Tatooi…
#> 8 R5-D4 97 32 <NA> white,… red NA none mascu… Tatooi…
#> 9 Biggs Dark… 183 84 black light brown 24 male mascu… Tatooi…
#> 10 Obi-Wan Ke… 182 77 auburn… fair blue-g… 57 male mascu… Stewjon
#> # … with 77 more rows, 4 more variables: species <chr>, films <list>,
#> # vehicles <list>, starships <list>, and abbreviated variable names
#> # ¹hair_color, ²skin_color, ³eye_color, ⁴birth_year, ⁵homeworld
|
There are sooo many differences between the two outputs that I could elaborate more, but let’s focus on several important points:
- When printing
iris dataset, I had to call it through head() to avoid printing the whole dataset. This is not the case for starwars dataset
- Base print output is just as-is, not giving some useful information
- If you see the details in
starwars’ print output, it is showing that it can store a more complex object, like list in this case
Why they behave differently? The answer lies in their class:
1
2
3
4
5
|
class(iris)
#> [1] "data.frame"
class(starwars)
#> [1] "tbl_df" "tbl" "data.frame"
|
The starwars dataset class is showing that it’s actually a tibble; which often referred as tbl or tbl_* when related to a specific data class. A tibble is a little bit different than your usual data.frame. I already outlined the main differences above, but in additition to that, it should be noted that tibble is the main data representation across the tidyverse; and also some extension that adopting the tidy principle, e.g., tbl_ts from tsibble{target="_blank"} for tidy time series data representation.
There are so many other advantages if you use tibble. For example, if you want to subset just one column, a simple data.frame will failed you again:
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
|
iris[, "Species"]
#> [1] setosa setosa setosa setosa setosa setosa
#> [7] setosa setosa setosa setosa setosa setosa
#> [13] setosa setosa setosa setosa setosa setosa
#> [19] setosa setosa setosa setosa setosa setosa
#> [25] setosa setosa setosa setosa setosa setosa
#> [31] setosa setosa setosa setosa setosa setosa
#> [37] setosa setosa setosa setosa setosa setosa
#> [43] setosa setosa setosa setosa setosa setosa
#> [49] setosa setosa versicolor versicolor versicolor versicolor
#> [55] versicolor versicolor versicolor versicolor versicolor versicolor
#> [61] versicolor versicolor versicolor versicolor versicolor versicolor
#> [67] versicolor versicolor versicolor versicolor versicolor versicolor
#> [73] versicolor versicolor versicolor versicolor versicolor versicolor
#> [79] versicolor versicolor versicolor versicolor versicolor versicolor
#> [85] versicolor versicolor versicolor versicolor versicolor versicolor
#> [91] versicolor versicolor versicolor versicolor versicolor versicolor
#> [97] versicolor versicolor versicolor versicolor virginica virginica
#> [103] virginica virginica virginica virginica virginica virginica
#> [109] virginica virginica virginica virginica virginica virginica
#> [115] virginica virginica virginica virginica virginica virginica
#> [121] virginica virginica virginica virginica virginica virginica
#> [127] virginica virginica virginica virginica virginica virginica
#> [133] virginica virginica virginica virginica virginica virginica
#> [139] virginica virginica virginica virginica virginica virginica
#> [145] virginica virginica virginica virginica virginica virginica
#> Levels: setosa versicolor virginica
|
It will drop its “table” representation and convert the data into a vector, instead of a data.frame with one column. But this is not the case with tibble:
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
|
iris_tbl <- as_tibble(iris)
iris_tbl[, "Species"]
#> # A tibble: 150 × 1
#> Species
#> <fct>
#> 1 setosa
#> 2 setosa
#> 3 setosa
#> 4 setosa
#> 5 setosa
#> 6 setosa
#> 7 setosa
#> 8 setosa
#> 9 setosa
#> 10 setosa
#> # … with 140 more rows
|
Why this is justified as a more tidy way? because if you recall the R’s standard way to access a specific part from an object is actually using $:
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
|
iris_tbl$Species
#> [1] setosa setosa setosa setosa setosa setosa
#> [7] setosa setosa setosa setosa setosa setosa
#> [13] setosa setosa setosa setosa setosa setosa
#> [19] setosa setosa setosa setosa setosa setosa
#> [25] setosa setosa setosa setosa setosa setosa
#> [31] setosa setosa setosa setosa setosa setosa
#> [37] setosa setosa setosa setosa setosa setosa
#> [43] setosa setosa setosa setosa setosa setosa
#> [49] setosa setosa versicolor versicolor versicolor versicolor
#> [55] versicolor versicolor versicolor versicolor versicolor versicolor
#> [61] versicolor versicolor versicolor versicolor versicolor versicolor
#> [67] versicolor versicolor versicolor versicolor versicolor versicolor
#> [73] versicolor versicolor versicolor versicolor versicolor versicolor
#> [79] versicolor versicolor versicolor versicolor versicolor versicolor
#> [85] versicolor versicolor versicolor versicolor versicolor versicolor
#> [91] versicolor versicolor versicolor versicolor versicolor versicolor
#> [97] versicolor versicolor versicolor versicolor virginica virginica
#> [103] virginica virginica virginica virginica virginica virginica
#> [109] virginica virginica virginica virginica virginica virginica
#> [115] virginica virginica virginica virginica virginica virginica
#> [121] virginica virginica virginica virginica virginica virginica
#> [127] virginica virginica virginica virginica virginica virginica
#> [133] virginica virginica virginica virginica virginica virginica
#> [139] virginica virginica virginica virginica virginica virginica
#> [145] virginica virginica virginica virginica virginica virginica
#> Levels: setosa versicolor virginica
|
which gives you the exact results.
So my concluding notes are:
- Use
tibble format whenever it’s possible. I rarely found any situation that using tbl or tbl_* object is not acceptable, because it is still inheriting the standard data.frame class.
- When working a specific case, e.g., time series dataset, or big data, always look if there are any tidy representation for the data; see
tsibble{target="_blank"} and sparklyr{target="_blank"} for example.
Easier data manipulation with dplyr
The dplyr package is probably the most important part of tidyverse. It brings so many convenient utilities for data manipulation, particularly for some basic tasks; yet sometimes, could be troublesome if you only use base R functionalities.
Among many advanced features inside dplyr, I suggest you to start with the cores data manipulation functions:
select(){target="_blank"}: for selecting a subset of data by column
filter(){target="_blank"}: for filtering a subset of data by row using condition
arrange(){target="_blank"}: for re-arranging the rows order
mutate(){target="_blank"}: for mutating (:confused:) a new or existing variables
As implied by my description, most of dplyr functions could be described by its names; except mutate, which is very likely to be confusing at the first time.
Those function can be naturally combined with the core aggregation functions:
group_by(){target="_blank"}: for specifying that following operations should be done by group(s)
summarise(){target="_blank"}: for obtaining the summarise of the data by the specified group(s).
These functions probably would cover most of your data wrangling tasks; and if done in a proper tidy way, then you’ll find that data wrangling process will be easier. Generally, most of these workflow could be combined through a pipeline using %>% (pipe) operator{target="_blank"} to make it tidier.
Let’s try out these features using starwars dataset. The first step–and I think the most important step–that we should do is setting our output goal. For this tutorial, let’s try to find out how the height and mass differ across homeworld and gender for human species. So we will transform the dataset from:
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
|
starwars
#> # A tibble: 87 × 14
#> name height mass hair_…¹ skin_…² eye_c…³ birth…⁴ sex gender homew…⁵
#> <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr> <chr>
#> 1 Luke Skywa… 172 77 blond fair blue 19 male mascu… Tatooi…
#> 2 C-3PO 167 75 <NA> gold yellow 112 none mascu… Tatooi…
#> 3 R2-D2 96 32 <NA> white,… red 33 none mascu… Naboo
#> 4 Darth Vader 202 136 none white yellow 41.9 male mascu… Tatooi…
#> 5 Leia Organa 150 49 brown light brown 19 fema… femin… Aldera…
#> 6 Owen Lars 178 120 brown,… light blue 52 male mascu… Tatooi…
#> 7 Beru White… 165 75 brown light blue 47 fema… femin… Tatooi…
#> 8 R5-D4 97 32 <NA> white,… red NA none mascu… Tatooi…
#> 9 Biggs Dark… 183 84 black light brown 24 male mascu… Tatooi…
#> 10 Obi-Wan Ke… 182 77 auburn… fair blue-g… 57 male mascu… Stewjon
#> # … with 77 more rows, 4 more variables: species <chr>, films <list>,
#> # vehicles <list>, starships <list>, and abbreviated variable names
#> # ¹hair_color, ²skin_color, ³eye_color, ⁴birth_year, ⁵homeworld
|
into this format:
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
|
# the output goal:
#> `summarise()` has grouped output by 'homeworld'. You can override using the
#> `.groups` argument.
#> # A tibble: 21 × 5
#> homeworld gender mass height bmi
#> <chr> <chr> <dbl> <dbl> <dbl>
#> 1 Bestine IV masculine 110 180 61.1
#> 2 Tatooine masculine 100. 184. 54.0
#> 3 <NA> masculine 89 193 46.1
#> 4 Tatooine feminine 75 164 45.5
#> 5 Bespin masculine 79 175 45.1
#> 6 Naboo masculine 80 178. 45.0
#> 7 Corellia masculine 78.5 175 44.9
#> 8 Haruun Kal masculine 84 188 44.7
#> 9 Socorro masculine 79 177 44.6
#> 10 Concord Dawn masculine 79 183 43.2
#> # … with 11 more rows
|
Column subsetting using select()
select() is probably the first function you will call in every data manipulation tasks. By using select(), you can narrow down the number variables to fewer variables–which are what you truly need.
The select() usage (and also every core functions in dplyr) is very straighforward. You just need to pass the dataset and select the column name you want to select:
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
|
select(starwars, name)
#> # A tibble: 87 × 1
#> name
#> <chr>
#> 1 Luke Skywalker
#> 2 C-3PO
#> 3 R2-D2
#> 4 Darth Vader
#> 5 Leia Organa
#> 6 Owen Lars
#> 7 Beru Whitesun lars
#> 8 R5-D4
#> 9 Biggs Darklighter
#> 10 Obi-Wan Kenobi
#> # … with 77 more rows
|
It’s very easy right? Before we continue to our objective, there are some tips you need to consider.
Chaining operation using %>%
When we doing a complex data manipulation using dplyr, I suggest to use %>% operator to make our code tidier:
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
|
starwars %>%
select(name)
#> # A tibble: 87 × 1
#> name
#> <chr>
#> 1 Luke Skywalker
#> 2 C-3PO
#> 3 R2-D2
#> 4 Darth Vader
#> 5 Leia Organa
#> 6 Owen Lars
#> 7 Beru Whitesun lars
#> 8 R5-D4
#> 9 Biggs Darklighter
#> 10 Obi-Wan Kenobi
#> # … with 77 more rows
|
This is because %>% operator could help you to make a chaining operation. The pipe operator will pass the results from its left-hand side, to the first parameter in the function of its right-hand side. So you could make a chaining operation like this:
1
2
3
4
5
6
|
starwars %>%
select(name) %>%
anotherfuns(<arguments>) %>%
anotherfuns(<arguments>) %>%
<continue the chain as long as necessary> %>%
anotherfuns(<arguments>)
|
So as long as anotherfuns you chained could accept the previous result as its first argument, you can chaining the operation as long as you want.
Checking the variables using glimpse()
Now back to our objective. Remember that we want to make a summary of height and mass of human species across homeworld and gender. So we actually need only those few variables to go on. Still, remembering the name of those variables could be frustating sometimes, so I suggest to use glimpse() function before you jump into variable selection:
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
|
glimpse(starwars)
#> Rows: 87
#> Columns: 14
#> $ name <chr> "Luke Skywalker", "C-3PO", "R2-D2", "Darth Vader", "Leia Or…
#> $ height <int> 172, 167, 96, 202, 150, 178, 165, 97, 183, 182, 188, 180, 2…
#> $ mass <dbl> 77.0, 75.0, 32.0, 136.0, 49.0, 120.0, 75.0, 32.0, 84.0, 77.…
#> $ hair_color <chr> "blond", NA, NA, "none", "brown", "brown, grey", "brown", N…
#> $ skin_color <chr> "fair", "gold", "white, blue", "white", "light", "light", "…
#> $ eye_color <chr> "blue", "yellow", "red", "yellow", "brown", "blue", "blue",…
#> $ birth_year <dbl> 19.0, 112.0, 33.0, 41.9, 19.0, 52.0, 47.0, NA, 24.0, 57.0, …
#> $ sex <chr> "male", "none", "none", "male", "female", "male", "female",…
#> $ gender <chr> "masculine", "masculine", "masculine", "masculine", "femini…
#> $ homeworld <chr> "Tatooine", "Tatooine", "Naboo", "Tatooine", "Alderaan", "T…
#> $ species <chr> "Human", "Droid", "Droid", "Human", "Human", "Human", "Huma…
#> $ films <list> <"The Empire Strikes Back", "Revenge of the Sith", "Return…
#> $ vehicles <list> <"Snowspeeder", "Imperial Speeder Bike">, <>, <>, <>, "Imp…
#> $ starships <list> <"X-wing", "Imperial shuttle">, <>, <>, "TIE Advanced x1",…
|
The glimpse() function is practically the simplified version of str(); but it only print the most general information like the variable names, class, etc. Another alternative is to use the base colnames(), but it doesn’t provide additional information like glimpse(), so I suggest to use the later.
From glimpse() output, now we have the list of variable names and can be sure that we have to select species, homeworld, gender, mass, height:
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
|
starwars %>%
select(name, species, homeworld, gender, mass, height)
#> # A tibble: 87 × 6
#> name species homeworld gender mass height
#> <chr> <chr> <chr> <chr> <dbl> <int>
#> 1 Luke Skywalker Human Tatooine masculine 77 172
#> 2 C-3PO Droid Tatooine masculine 75 167
#> 3 R2-D2 Droid Naboo masculine 32 96
#> 4 Darth Vader Human Tatooine masculine 136 202
#> 5 Leia Organa Human Alderaan feminine 49 150
#> 6 Owen Lars Human Tatooine masculine 120 178
#> 7 Beru Whitesun lars Human Tatooine feminine 75 165
#> 8 R5-D4 Droid Tatooine masculine 32 97
#> 9 Biggs Darklighter Human Tatooine masculine 84 183
#> 10 Obi-Wan Kenobi Human Stewjon masculine 77 182
#> # … with 77 more rows
|
Now the dataset looks a lot simpler :grin:
Filtering specific observation using filter()
You already see that a simple column selection could help you in starting the data wrangling process. But row/observation selection is sometimes a compulsory step rather than just a helper step. For example, you want to exclude or include specific observations which inheriting some specific traits. If you recall our objective, and to be more specific at “for human species”, then we are absolutely need to filter our dataset to only containing human species.
As aforementioned, filter() function is also very easy to use: just pass the dataset object as the first argument, then you can pass the condition(s) as the following argument(s).
To declare the condition, the most common way is to use the base condition testing:
== and != for testing equal, and not equal condition
> or >= for testing greater than, and greater than or equal
< or <= for testing less than, and less than or equal
Let’s try to filter observation with mass less than or equal to 50 from starwars dataset:
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
|
filter(starwars, mass <= 50)
#> # A tibble: 13 × 14
#> name height mass hair_…¹ skin_…² eye_c…³ birth…⁴ sex gender homew…⁵
#> <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr> <chr>
#> 1 R2-D2 96 32 <NA> white,… red 33 none mascu… Naboo
#> 2 Leia Organa 150 49 brown light brown 19 fema… femin… Aldera…
#> 3 R5-D4 97 32 <NA> white,… red NA none mascu… Tatooi…
#> 4 Yoda 66 17 white green brown 896 male mascu… <NA>
#> 5 Wicket Sys… 88 20 brown brown brown 8 male mascu… Endor
#> 6 Sebulba 112 40 none grey, … orange NA male mascu… Malast…
#> 7 Dud Bolt 94 45 none blue, … yellow NA male mascu… Vulpter
#> 8 Adi Gallia 184 50 none dark blue NA fema… femin… Corusc…
#> 9 Barriss Of… 166 50 black yellow blue 40 fema… femin… Mirial
#> 10 Ratts Tyer… 79 15 none grey, … unknown NA male mascu… Aleen …
#> 11 Wat Tambor 193 48 none green,… unknown NA male mascu… Skako
#> 12 Sly Moore 178 48 none pale white NA <NA> <NA> Umbara
#> 13 Padmé Amid… 165 45 brown light brown 46 fema… femin… Naboo
#> # … with 4 more variables: species <chr>, films <list>, vehicles <list>,
#> # starships <list>, and abbreviated variable names ¹hair_color, ²skin_color,
#> # ³eye_color, ⁴birth_year, ⁵homeworld
|
It might be looks very simple, but let’s dive deeper on how condition testing works to see its full potential.
Behind the scene of condition testing
Before I explain the details, let’s take a look at the first 5 mass value from starwars:
1
2
|
head(starwars$mass)
#> [1] 77 75 32 136 49 120
|
If we test the condition of <= 50 to above data, we will actually get this output:
1
2
|
head(starwars$mass) <= 50
#> [1] FALSE FALSE TRUE FALSE TRUE FALSE
|
So basically filtering condition is picking up rows with the TRUE value; and it will discard the rest. Based on this mechanism, we could also use other helper functions to test a condition, e.g., when we want to see rows with NA values, we can use is.na() function:
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
|
filter(starwars, is.na(mass))
#> # A tibble: 28 × 14
#> name height mass hair_…¹ skin_…² eye_c…³ birth…⁴ sex gender homew…⁵
#> <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr> <chr>
#> 1 Wilhuff Ta… 180 NA auburn… fair blue 64 male mascu… Eriadu
#> 2 Mon Mothma 150 NA auburn fair blue 48 fema… femin… Chandr…
#> 3 Arvel Cryn… NA NA brown fair brown NA male mascu… <NA>
#> 4 Finis Valo… 170 NA blond fair blue 91 male mascu… Corusc…
#> 5 Rugor Nass 206 NA none green orange NA male mascu… Naboo
#> 6 Ric Olié 183 NA brown fair blue NA <NA> <NA> Naboo
#> 7 Watto 137 NA black blue, … yellow NA male mascu… Toydar…
#> 8 Quarsh Pan… 183 NA black dark brown 62 <NA> <NA> Naboo
#> 9 Shmi Skywa… 163 NA black fair brown 72 fema… femin… Tatooi…
#> 10 Bib Fortuna 180 NA none pale pink NA male mascu… Ryloth
#> # … with 18 more rows, 4 more variables: species <chr>, films <list>,
#> # vehicles <list>, starships <list>, and abbreviated variable names
#> # ¹hair_color, ²skin_color, ³eye_color, ⁴birth_year, ⁵homeworld
|
or if we want to keep observations with non-NA value, then we could create a negation using !:
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
|
filter(starwars, !is.na(mass))
#> # A tibble: 59 × 14
#> name height mass hair_…¹ skin_…² eye_c…³ birth…⁴ sex gender homew…⁵
#> <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr> <chr>
#> 1 Luke Skywa… 172 77 blond fair blue 19 male mascu… Tatooi…
#> 2 C-3PO 167 75 <NA> gold yellow 112 none mascu… Tatooi…
#> 3 R2-D2 96 32 <NA> white,… red 33 none mascu… Naboo
#> 4 Darth Vader 202 136 none white yellow 41.9 male mascu… Tatooi…
#> 5 Leia Organa 150 49 brown light brown 19 fema… femin… Aldera…
#> 6 Owen Lars 178 120 brown,… light blue 52 male mascu… Tatooi…
#> 7 Beru White… 165 75 brown light blue 47 fema… femin… Tatooi…
#> 8 R5-D4 97 32 <NA> white,… red NA none mascu… Tatooi…
#> 9 Biggs Dark… 183 84 black light brown 24 male mascu… Tatooi…
#> 10 Obi-Wan Ke… 182 77 auburn… fair blue-g… 57 male mascu… Stewjon
#> # … with 49 more rows, 4 more variables: species <chr>, films <list>,
#> # vehicles <list>, starships <list>, and abbreviated variable names
#> # ¹hair_color, ²skin_color, ³eye_color, ⁴birth_year, ⁵homeworld
|
Quick-check on conditions availability
Just like when we check the variables using glimpse(), we also need to check our variables to make sure about any available options to be filtered. For our case, we already know that we want to get only human observations, but straighly jump into filter() is very unwise:
1
2
3
4
5
6
|
starwars %>%
select(name, species, homeworld, gender, mass, height) %>%
filter(species == "human")
#> # A tibble: 0 × 6
#> # … with 6 variables: name <chr>, species <chr>, homeworld <chr>, gender <chr>,
#> # mass <dbl>, height <int>
|
It seems like there is no "human" species, but actually it’s just because we specified a wrong condition. For categorical variable like species, I really suggest you to check it using unique() first:
1
2
3
4
5
6
7
8
9
10
11
|
unique(starwars$species)
#> [1] "Human" "Droid" "Wookiee" "Rodian"
#> [5] "Hutt" "Yoda's species" "Trandoshan" "Mon Calamari"
#> [9] "Ewok" "Sullustan" "Neimodian" "Gungan"
#> [13] NA "Toydarian" "Dug" "Zabrak"
#> [17] "Twi'lek" "Vulptereen" "Xexto" "Toong"
#> [21] "Cerean" "Nautolan" "Tholothian" "Iktotchi"
#> [25] "Quermian" "Kel Dor" "Chagrian" "Geonosian"
#> [29] "Mirialan" "Clawdite" "Besalisk" "Kaminoan"
#> [33] "Aleena" "Skakoan" "Muun" "Togruta"
#> [37] "Kaleesh" "Pau'an"
|
Which shows that it’s actually "Human". For numerical variables, you can use summary() to help you quick-check the value range:
1
2
3
|
summary(starwars$height)
#> Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
#> 66.0 167.0 180.0 174.4 191.0 264.0 6
|
After we quick-check the values within our variable of interest, it’s safe now to call filter() function:
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
|
starwars %>%
select(name, species, homeworld, gender, mass, height) %>%
filter(species == "Human")
#> # A tibble: 35 × 6
#> name species homeworld gender mass height
#> <chr> <chr> <chr> <chr> <dbl> <int>
#> 1 Luke Skywalker Human Tatooine masculine 77 172
#> 2 Darth Vader Human Tatooine masculine 136 202
#> 3 Leia Organa Human Alderaan feminine 49 150
#> 4 Owen Lars Human Tatooine masculine 120 178
#> 5 Beru Whitesun lars Human Tatooine feminine 75 165
#> 6 Biggs Darklighter Human Tatooine masculine 84 183
#> 7 Obi-Wan Kenobi Human Stewjon masculine 77 182
#> 8 Anakin Skywalker Human Tatooine masculine 84 188
#> 9 Wilhuff Tarkin Human Eriadu masculine NA 180
#> 10 Han Solo Human Corellia masculine 80 180
#> # … with 25 more rows
|
Manipulating variables using mutate()
When I mentioned mutate() terms, most of you that are not familiar with dplyr probably not sure on its meaning. This function is basically helping us to manipulating the variables inside our dataframe. For example, you could change the units of mass and height index:
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
|
mutate(starwars, mass = mass * 100, height = height / 100)
#> # A tibble: 87 × 14
#> name height mass hair_…¹ skin_…² eye_c…³ birth…⁴ sex gender homew…⁵
#> <chr> <dbl> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr> <chr>
#> 1 Luke Skywa… 1.72 7700 blond fair blue 19 male mascu… Tatooi…
#> 2 C-3PO 1.67 7500 <NA> gold yellow 112 none mascu… Tatooi…
#> 3 R2-D2 0.96 3200 <NA> white,… red 33 none mascu… Naboo
#> 4 Darth Vader 2.02 13600 none white yellow 41.9 male mascu… Tatooi…
#> 5 Leia Organa 1.5 4900 brown light brown 19 fema… femin… Aldera…
#> 6 Owen Lars 1.78 12000 brown,… light blue 52 male mascu… Tatooi…
#> 7 Beru White… 1.65 7500 brown light blue 47 fema… femin… Tatooi…
#> 8 R5-D4 0.97 3200 <NA> white,… red NA none mascu… Tatooi…
#> 9 Biggs Dark… 1.83 8400 black light brown 24 male mascu… Tatooi…
#> 10 Obi-Wan Ke… 1.82 7700 auburn… fair blue-g… 57 male mascu… Stewjon
#> # … with 77 more rows, 4 more variables: species <chr>, films <list>,
#> # vehicles <list>, starships <list>, and abbreviated variable names
#> # ¹hair_color, ²skin_color, ³eye_color, ⁴birth_year, ⁵homeworld
|
But it doesn’t limited to manipulating the existing variables; you could also make a new variable using this function. Say, we want to add Body Mass Index (BMI) to our current analysis, we could create it using mutate():
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
|
starwars %>%
select(name, species, homeworld, gender, mass, height) %>%
filter(species == "Human") %>%
mutate(bmi = mass / height * 100)
#> # A tibble: 35 × 7
#> name species homeworld gender mass height bmi
#> <chr> <chr> <chr> <chr> <dbl> <int> <dbl>
#> 1 Luke Skywalker Human Tatooine masculine 77 172 44.8
#> 2 Darth Vader Human Tatooine masculine 136 202 67.3
#> 3 Leia Organa Human Alderaan feminine 49 150 32.7
#> 4 Owen Lars Human Tatooine masculine 120 178 67.4
#> 5 Beru Whitesun lars Human Tatooine feminine 75 165 45.5
#> 6 Biggs Darklighter Human Tatooine masculine 84 183 45.9
#> 7 Obi-Wan Kenobi Human Stewjon masculine 77 182 42.3
#> 8 Anakin Skywalker Human Tatooine masculine 84 188 44.7
#> 9 Wilhuff Tarkin Human Eriadu masculine NA 180 NA
#> 10 Han Solo Human Corellia masculine 80 180 44.4
#> # … with 25 more rows
|
This function is very versatile, so play around with your creativity here :sunglasses:
Data aggregation using group_by and summarise
We are reaching the final parts! Let’s recall the part of our objective that we haven’t done yet: “across the homeworld and gender”. This particular phrase is indicating that we want each of our observation to representing its homeworld and gender instead of each character’s name like what we currently have. We can achieve this using group_by() and summarise() function.
To declare a group, you just need to specify the name of variables that you want to set as group identifier:
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
|
starwars %>%
select(name, species, homeworld, gender, mass, height) %>%
filter(species == "Human") %>%
mutate(bmi = mass / height * 100) %>%
group_by(homeworld, gender)
#> # A tibble: 35 × 7
#> # Groups: homeworld, gender [21]
#> name species homeworld gender mass height bmi
#> <chr> <chr> <chr> <chr> <dbl> <int> <dbl>
#> 1 Luke Skywalker Human Tatooine masculine 77 172 44.8
#> 2 Darth Vader Human Tatooine masculine 136 202 67.3
#> 3 Leia Organa Human Alderaan feminine 49 150 32.7
#> 4 Owen Lars Human Tatooine masculine 120 178 67.4
#> 5 Beru Whitesun lars Human Tatooine feminine 75 165 45.5
#> 6 Biggs Darklighter Human Tatooine masculine 84 183 45.9
#> 7 Obi-Wan Kenobi Human Stewjon masculine 77 182 42.3
#> 8 Anakin Skywalker Human Tatooine masculine 84 188 44.7
#> 9 Wilhuff Tarkin Human Eriadu masculine NA 180 NA
#> 10 Han Solo Human Corellia masculine 80 180 44.4
#> # … with 25 more rows
|
Then the tibble will automatically give you information about which group is active.
Note that the order of specified groups is also matters. The function will process the group operation from the right-most variable. So the best practice here to set the order from major to minor group:
1
2
3
|
... %>%
group_by(<major group>, <minor group>, <definitely the minorest group>) %>%
...
|
After we set the group, we can use summarise() to do the aggregation. This function is almost identic to how mutate works, but the output will be aggregated value by the group we specified; note that if we don’t specify any group, then it will aggregate the value to one observation that representing our dataset.
Let’s try to take the mean of height, mass, and bmi variables:
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
|
starwars %>%
select(name, species, homeworld, gender, mass, height) %>%
filter(species == "Human") %>%
mutate(bmi = mass / height * 100) %>%
group_by(homeworld, gender) %>%
summarise(
mass = mean(mass, na.rm = TRUE), # use na.rm since there are
height = mean(height, na.rm = TRUE), # some missing values
bmi = mean(bmi, na.rm = TRUE)
)
#> `summarise()` has grouped output by 'homeworld'. You can override using the
#> `.groups` argument.
#> # A tibble: 21 × 5
#> # Groups: homeworld [16]
#> homeworld gender mass height bmi
#> <chr> <chr> <dbl> <dbl> <dbl>
#> 1 Alderaan feminine 49 150 32.7
#> 2 Alderaan masculine 79 190. 42.0
#> 3 Bespin masculine 79 175 45.1
#> 4 Bestine IV masculine 110 180 61.1
#> 5 Chandrila feminine NaN 150 NaN
#> 6 Concord Dawn masculine 79 183 43.2
#> 7 Corellia masculine 78.5 175 44.9
#> 8 Coruscant feminine NaN 167 NaN
#> 9 Coruscant masculine NaN 170 NaN
#> 10 Eriadu masculine NaN 180 NaN
#> # … with 11 more rows
|
When we done with group operations, don’t forget to set the group declaration to off. The most straighforward ways to do this is by passing the result to ungroup():
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
|
starwars %>%
select(name, species, homeworld, gender, mass, height) %>%
filter(species == "Human") %>%
mutate(bmi = mass / height * 100) %>%
group_by(homeworld, gender) %>%
summarise(
mass = mean(mass, na.rm = TRUE),
height = mean(height, na.rm = TRUE),
bmi = mean(bmi, na.rm = TRUE)
) %>%
ungroup()
#> `summarise()` has grouped output by 'homeworld'. You can override using the
#> `.groups` argument.
#> # A tibble: 21 × 5
#> homeworld gender mass height bmi
#> <chr> <chr> <dbl> <dbl> <dbl>
#> 1 Alderaan feminine 49 150 32.7
#> 2 Alderaan masculine 79 190. 42.0
#> 3 Bespin masculine 79 175 45.1
#> 4 Bestine IV masculine 110 180 61.1
#> 5 Chandrila feminine NaN 150 NaN
#> 6 Concord Dawn masculine 79 183 43.2
#> 7 Corellia masculine 78.5 175 44.9
#> 8 Coruscant feminine NaN 167 NaN
#> 9 Coruscant masculine NaN 170 NaN
#> 10 Eriadu masculine NaN 180 NaN
#> # … with 11 more rows
|
Re-ordering observation using arrange()
Our last results actually enough for showing the summary of mass and height across homeworld and gender. But it doesn’t give us some basic informations, like, which homeworld-gender combination has the lowest or higher bmi.
Fortunately, dplyr provided arrange() function for this specific task. By default, this function will arrange the specified variable in ascending mode:
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
|
starwars %>%
select(name, species, homeworld, gender, mass, height) %>%
filter(species == "Human") %>%
mutate(bmi = mass / height * 100) %>%
group_by(homeworld, gender) %>%
summarise(
mass = mean(mass, na.rm = TRUE),
height = mean(height, na.rm = TRUE),
bmi = mean(bmi, na.rm = TRUE)
) %>%
ungroup() %>%
arrange(bmi)
#> `summarise()` has grouped output by 'homeworld'. You can override using the
#> `.groups` argument.
#> # A tibble: 21 × 5
#> homeworld gender mass height bmi
#> <chr> <chr> <dbl> <dbl> <dbl>
#> 1 Naboo feminine 45 162. 27.3
#> 2 Alderaan feminine 49 150 32.7
#> 3 Serenno masculine 80 193 41.5
#> 4 Alderaan masculine 79 190. 42.0
#> 5 Stewjon masculine 77 182 42.3
#> 6 Kamino masculine 78.2 183 42.7
#> 7 Concord Dawn masculine 79 183 43.2
#> 8 Socorro masculine 79 177 44.6
#> 9 Haruun Kal masculine 84 188 44.7
#> 10 Corellia masculine 78.5 175 44.9
#> # … with 11 more rows
|
And if you want descending sorting, you just need to wrap the variable using desc():
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
|
starwars %>%
select(name, species, homeworld, gender, mass, height) %>%
filter(species == "Human") %>%
mutate(bmi = mass / height * 100) %>%
group_by(homeworld, gender) %>%
summarise(
mass = mean(mass, na.rm = TRUE),
height = mean(height, na.rm = TRUE),
bmi = mean(bmi, na.rm = TRUE)
) %>%
ungroup() %>%
arrange(desc(bmi))
#> `summarise()` has grouped output by 'homeworld'. You can override using the
#> `.groups` argument.
#> # A tibble: 21 × 5
#> homeworld gender mass height bmi
#> <chr> <chr> <dbl> <dbl> <dbl>
#> 1 Bestine IV masculine 110 180 61.1
#> 2 Tatooine masculine 100. 184. 54.0
#> 3 <NA> masculine 89 193 46.1
#> 4 Tatooine feminine 75 164 45.5
#> 5 Bespin masculine 79 175 45.1
#> 6 Naboo masculine 80 178. 45.0
#> 7 Corellia masculine 78.5 175 44.9
#> 8 Haruun Kal masculine 84 188 44.7
#> 9 Socorro masculine 79 177 44.6
#> 10 Concord Dawn masculine 79 183 43.2
#> # … with 11 more rows
|
Now we know that the females from Naboo have the lowest mean BMI, while the males from Bestine have the highest mean BMI :grin:
In this article, you already see how easy–and tidy!–it is to do data wrangling tasks using tibble and dplyr. But we just scratched the surface of tidyverse; there are so many package and functionalities available for us to explore!
For the next episode, we will try more realistic cases, and see how tidyverse packages could help us in a more complex data wrangling tasks. So stay tuned on this series! :sunglasses:
Here is a tidyverse logo for you:
1
2
3
4
5
|
#> ⬢ __ _ __ . ⬡ ⬢ .
#> / /_(_)__/ /_ ___ _____ _______ ___
#> / __/ / _ / // / |/ / -_) __(_-</ -_)
#> \__/_/\_,_/\_, /|___/\__/_/ /___/\__/
#> ⬢ . /___/ ⬡ . ⬢
|