Featured image of post Data Wars: Episode IV

Data Wars: Episode IV

R. Dimas Bagas Herlambang

Data wrangling sometimes could become very tedious. No matter what language do you prefer: R, python, or even SQL, the process of preprocessing your dataset is generally very time consuming. But, this is not the case if you know how to properly use packages included in tidyverse{target="_blank"}.

The term of tidyverse is actually referring to a set of packages that you’ll find very helpful in any data analysis tasks; many of them are already popular among R users, like dplyr{target="_blank"}, ggplot2{target="_blank"}, and lubridate{target="_blank"}. These packages are intended to work side-by-side; they share the same API style, and sharing a common data representation format. But another things that often missed is that the tidyverse itself is also a package! The tidyverse package is intended to make your environment set-up process faster; you only need to install tidyverse library to install the bundled packages, and just load the tidyverse library to load all the core packages.

In this articles (or maybe I should say, in this episode), I will cover the basic usage of core packages in tidyverse, using (of course) starwars{target="_blank"} dataset which included along with dplyr library.

There will be several sequels, and even prequels, for data wrangling using R and tidyverse tutorial–or to put it simple, data wars :grin:–so stay tuned!

Enter the tidyverse

As I mentioned before, you can install all of the included packages just by installing tidyverse. So let’s start by installing the package first:

1
install.packages("tidyverse")

If you are in an interactive session, you could see which packages are installed along with tidyverse. Just to make sure, let’s load the library:

1
2
3
4
5
6
7
8
9
library(tidyverse)
#> ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
#> ✔ ggplot2 3.4.0     ✔ purrr   1.0.1
#> ✔ tibble  3.1.8     ✔ dplyr   1.1.0
#> ✔ tidyr   1.3.0     ✔ stringr 1.5.0
#> ✔ readr   2.1.3     ✔ forcats 0.5.2
#> ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
#> ✖ dplyr::filter() masks stats::filter()
#> ✖ dplyr::lag()    masks stats::lag()

When importing the library, it give you messages about which packages that are attached to your session. But it only give you information about the core packages that fully loaded to your session, while actually tidyverse also attaching some great libraries in the background. You can see them all through sessionInfo():

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
sessionInfo()
#> R version 4.2.2 (2022-10-31 ucrt)
#> Platform: x86_64-w64-mingw32/x64 (64-bit)
#> Running under: Windows 10 x64 (build 19045)
#> 
#> Matrix products: default
#> 
#> locale:
#> [1] LC_COLLATE=English_Indonesia.utf8  LC_CTYPE=English_Indonesia.utf8   
#> [3] LC_MONETARY=English_Indonesia.utf8 LC_NUMERIC=C                      
#> [5] LC_TIME=English_Indonesia.utf8    
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#> [1] forcats_0.5.2   stringr_1.5.0   dplyr_1.1.0     purrr_1.0.1    
#> [5] readr_2.1.3     tidyr_1.3.0     tibble_3.1.8    ggplot2_3.4.0  
#> [9] tidyverse_1.3.2
#> 
#> loaded via a namespace (and not attached):
#>  [1] tidyselect_1.2.0    xfun_0.35           bslib_0.4.2        
#>  [4] haven_2.5.1         gargle_1.3.0        colorspace_2.0-3   
#>  [7] vctrs_0.5.2         generics_0.1.3      htmltools_0.5.4    
#> [10] yaml_2.3.6          utf8_1.2.3          rlang_1.0.6        
#> [13] jquerylib_0.1.4     pillar_1.8.1        withr_2.5.0        
#> [16] glue_1.6.2          DBI_1.1.3           dbplyr_2.2.1       
#> [19] readxl_1.4.1        modelr_0.1.10       lifecycle_1.0.3    
#> [22] munsell_0.5.0       blogdown_1.16       gtable_0.3.1       
#> [25] cellranger_1.1.0    rvest_1.0.3         evaluate_0.19      
#> [28] knitr_1.41          tzdb_0.3.0          fastmap_1.1.0      
#> [31] fansi_1.0.4         broom_1.0.2         backports_1.4.1    
#> [34] scales_1.2.1        googlesheets4_1.0.1 cachem_1.0.6       
#> [37] jsonlite_1.8.4      fs_1.5.2            hms_1.1.2          
#> [40] digest_0.6.30       stringi_1.7.8       bookdown_0.32      
#> [43] grid_4.2.2          cli_3.4.1           tools_4.2.2        
#> [46] magrittr_2.0.3      sass_0.4.4          crayon_1.5.2       
#> [49] pkgconfig_2.0.3     ellipsis_0.3.2      xml2_1.3.3         
#> [52] reprex_2.0.2        timechange_0.2.0    googledrive_2.0.0  
#> [55] lubridate_1.9.1     assertthat_0.2.1    rmarkdown_2.19     
#> [58] httr_1.4.4          rstudioapi_0.14     R6_2.5.1           
#> [61] compiler_4.2.2

As you can see, it actually loads many packages in the background. Don’t worry, I will not cover all of them in this article; But I will cover them for sure in later sequels and prequels :grin:. If you really curious about those packages, I suggest you to checkout the detailed list at the tidyverse official documentations{target="_blank"}.

Use the standard data representation: tibble

The first step in learning the tidy data wrangling is to be familiar with its standard data representation. Before I explain it in detail, let me give you a quick example by printing the iris dataset:

1
2
3
4
5
6
7
8
head(iris)
#>   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> 1          5.1         3.5          1.4         0.2  setosa
#> 2          4.9         3.0          1.4         0.2  setosa
#> 3          4.7         3.2          1.3         0.2  setosa
#> 4          4.6         3.1          1.5         0.2  setosa
#> 5          5.0         3.6          1.4         0.2  setosa
#> 6          5.4         3.9          1.7         0.4  setosa

and compare the print output with starwars dataset from dplyr:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
starwars
#> # A tibble: 87 × 14
#>    name        height  mass hair_…¹ skin_…² eye_c…³ birth…⁴ sex   gender homew…⁵
#>    <chr>        <int> <dbl> <chr>   <chr>   <chr>     <dbl> <chr> <chr>  <chr>  
#>  1 Luke Skywa…    172    77 blond   fair    blue       19   male  mascu… Tatooi…
#>  2 C-3PO          167    75 <NA>    gold    yellow    112   none  mascu… Tatooi…
#>  3 R2-D2           96    32 <NA>    white,… red        33   none  mascu… Naboo  
#>  4 Darth Vader    202   136 none    white   yellow     41.9 male  mascu… Tatooi…
#>  5 Leia Organa    150    49 brown   light   brown      19   fema… femin… Aldera…
#>  6 Owen Lars      178   120 brown,… light   blue       52   male  mascu… Tatooi…
#>  7 Beru White…    165    75 brown   light   blue       47   fema… femin… Tatooi…
#>  8 R5-D4           97    32 <NA>    white,… red        NA   none  mascu… Tatooi…
#>  9 Biggs Dark…    183    84 black   light   brown      24   male  mascu… Tatooi…
#> 10 Obi-Wan Ke…    182    77 auburn… fair    blue-g…    57   male  mascu… Stewjon
#> # … with 77 more rows, 4 more variables: species <chr>, films <list>,
#> #   vehicles <list>, starships <list>, and abbreviated variable names
#> #   ¹​hair_color, ²​skin_color, ³​eye_color, ⁴​birth_year, ⁵​homeworld

There are sooo many differences between the two outputs that I could elaborate more, but let’s focus on several important points:

  • When printing iris dataset, I had to call it through head() to avoid printing the whole dataset. This is not the case for starwars dataset
  • Base print output is just as-is, not giving some useful information
  • If you see the details in starwars’ print output, it is showing that it can store a more complex object, like list in this case

Why they behave differently? The answer lies in their class:

1
2
3
4
5
class(iris)
#> [1] "data.frame"

class(starwars)
#> [1] "tbl_df"     "tbl"        "data.frame"

The starwars dataset class is showing that it’s actually a tibble; which often referred as tbl or tbl_* when related to a specific data class. A tibble is a little bit different than your usual data.frame. I already outlined the main differences above, but in additition to that, it should be noted that tibble is the main data representation across the tidyverse; and also some extension that adopting the tidy principle, e.g., tbl_ts from tsibble{target="_blank"} for tidy time series data representation.

There are so many other advantages if you use tibble. For example, if you want to subset just one column, a simple data.frame will failed you again:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
iris[, "Species"]
#>   [1] setosa     setosa     setosa     setosa     setosa     setosa    
#>   [7] setosa     setosa     setosa     setosa     setosa     setosa    
#>  [13] setosa     setosa     setosa     setosa     setosa     setosa    
#>  [19] setosa     setosa     setosa     setosa     setosa     setosa    
#>  [25] setosa     setosa     setosa     setosa     setosa     setosa    
#>  [31] setosa     setosa     setosa     setosa     setosa     setosa    
#>  [37] setosa     setosa     setosa     setosa     setosa     setosa    
#>  [43] setosa     setosa     setosa     setosa     setosa     setosa    
#>  [49] setosa     setosa     versicolor versicolor versicolor versicolor
#>  [55] versicolor versicolor versicolor versicolor versicolor versicolor
#>  [61] versicolor versicolor versicolor versicolor versicolor versicolor
#>  [67] versicolor versicolor versicolor versicolor versicolor versicolor
#>  [73] versicolor versicolor versicolor versicolor versicolor versicolor
#>  [79] versicolor versicolor versicolor versicolor versicolor versicolor
#>  [85] versicolor versicolor versicolor versicolor versicolor versicolor
#>  [91] versicolor versicolor versicolor versicolor versicolor versicolor
#>  [97] versicolor versicolor versicolor versicolor virginica  virginica 
#> [103] virginica  virginica  virginica  virginica  virginica  virginica 
#> [109] virginica  virginica  virginica  virginica  virginica  virginica 
#> [115] virginica  virginica  virginica  virginica  virginica  virginica 
#> [121] virginica  virginica  virginica  virginica  virginica  virginica 
#> [127] virginica  virginica  virginica  virginica  virginica  virginica 
#> [133] virginica  virginica  virginica  virginica  virginica  virginica 
#> [139] virginica  virginica  virginica  virginica  virginica  virginica 
#> [145] virginica  virginica  virginica  virginica  virginica  virginica 
#> Levels: setosa versicolor virginica

It will drop its “table” representation and convert the data into a vector, instead of a data.frame with one column. But this is not the case with tibble:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
iris_tbl <- as_tibble(iris)

iris_tbl[, "Species"]
#> # A tibble: 150 × 1
#>    Species
#>    <fct>  
#>  1 setosa 
#>  2 setosa 
#>  3 setosa 
#>  4 setosa 
#>  5 setosa 
#>  6 setosa 
#>  7 setosa 
#>  8 setosa 
#>  9 setosa 
#> 10 setosa 
#> # … with 140 more rows

Why this is justified as a more tidy way? because if you recall the R’s standard way to access a specific part from an object is actually using $:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
iris_tbl$Species
#>   [1] setosa     setosa     setosa     setosa     setosa     setosa    
#>   [7] setosa     setosa     setosa     setosa     setosa     setosa    
#>  [13] setosa     setosa     setosa     setosa     setosa     setosa    
#>  [19] setosa     setosa     setosa     setosa     setosa     setosa    
#>  [25] setosa     setosa     setosa     setosa     setosa     setosa    
#>  [31] setosa     setosa     setosa     setosa     setosa     setosa    
#>  [37] setosa     setosa     setosa     setosa     setosa     setosa    
#>  [43] setosa     setosa     setosa     setosa     setosa     setosa    
#>  [49] setosa     setosa     versicolor versicolor versicolor versicolor
#>  [55] versicolor versicolor versicolor versicolor versicolor versicolor
#>  [61] versicolor versicolor versicolor versicolor versicolor versicolor
#>  [67] versicolor versicolor versicolor versicolor versicolor versicolor
#>  [73] versicolor versicolor versicolor versicolor versicolor versicolor
#>  [79] versicolor versicolor versicolor versicolor versicolor versicolor
#>  [85] versicolor versicolor versicolor versicolor versicolor versicolor
#>  [91] versicolor versicolor versicolor versicolor versicolor versicolor
#>  [97] versicolor versicolor versicolor versicolor virginica  virginica 
#> [103] virginica  virginica  virginica  virginica  virginica  virginica 
#> [109] virginica  virginica  virginica  virginica  virginica  virginica 
#> [115] virginica  virginica  virginica  virginica  virginica  virginica 
#> [121] virginica  virginica  virginica  virginica  virginica  virginica 
#> [127] virginica  virginica  virginica  virginica  virginica  virginica 
#> [133] virginica  virginica  virginica  virginica  virginica  virginica 
#> [139] virginica  virginica  virginica  virginica  virginica  virginica 
#> [145] virginica  virginica  virginica  virginica  virginica  virginica 
#> Levels: setosa versicolor virginica

which gives you the exact results.

So my concluding notes are:

  • Use tibble format whenever it’s possible. I rarely found any situation that using tbl or tbl_* object is not acceptable, because it is still inheriting the standard data.frame class.
  • When working a specific case, e.g., time series dataset, or big data, always look if there are any tidy representation for the data; see tsibble{target="_blank"} and sparklyr{target="_blank"} for example.

Easier data manipulation with dplyr

The dplyr package is probably the most important part of tidyverse. It brings so many convenient utilities for data manipulation, particularly for some basic tasks; yet sometimes, could be troublesome if you only use base R functionalities.

Among many advanced features inside dplyr, I suggest you to start with the cores data manipulation functions:

  • select(){target="_blank"}: for selecting a subset of data by column
  • filter(){target="_blank"}: for filtering a subset of data by row using condition
  • arrange(){target="_blank"}: for re-arranging the rows order
  • mutate(){target="_blank"}: for mutating (:confused:) a new or existing variables

As implied by my description, most of dplyr functions could be described by its names; except mutate, which is very likely to be confusing at the first time.

Those function can be naturally combined with the core aggregation functions:

  • group_by(){target="_blank"}: for specifying that following operations should be done by group(s)
  • summarise(){target="_blank"}: for obtaining the summarise of the data by the specified group(s).

These functions probably would cover most of your data wrangling tasks; and if done in a proper tidy way, then you’ll find that data wrangling process will be easier. Generally, most of these workflow could be combined through a pipeline using %>% (pipe) operator{target="_blank"} to make it tidier.

Let’s try out these features using starwars dataset. The first step–and I think the most important step–that we should do is setting our output goal. For this tutorial, let’s try to find out how the height and mass differ across homeworld and gender for human species. So we will transform the dataset from:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
starwars
#> # A tibble: 87 × 14
#>    name        height  mass hair_…¹ skin_…² eye_c…³ birth…⁴ sex   gender homew…⁵
#>    <chr>        <int> <dbl> <chr>   <chr>   <chr>     <dbl> <chr> <chr>  <chr>  
#>  1 Luke Skywa…    172    77 blond   fair    blue       19   male  mascu… Tatooi…
#>  2 C-3PO          167    75 <NA>    gold    yellow    112   none  mascu… Tatooi…
#>  3 R2-D2           96    32 <NA>    white,… red        33   none  mascu… Naboo  
#>  4 Darth Vader    202   136 none    white   yellow     41.9 male  mascu… Tatooi…
#>  5 Leia Organa    150    49 brown   light   brown      19   fema… femin… Aldera…
#>  6 Owen Lars      178   120 brown,… light   blue       52   male  mascu… Tatooi…
#>  7 Beru White…    165    75 brown   light   blue       47   fema… femin… Tatooi…
#>  8 R5-D4           97    32 <NA>    white,… red        NA   none  mascu… Tatooi…
#>  9 Biggs Dark…    183    84 black   light   brown      24   male  mascu… Tatooi…
#> 10 Obi-Wan Ke…    182    77 auburn… fair    blue-g…    57   male  mascu… Stewjon
#> # … with 77 more rows, 4 more variables: species <chr>, films <list>,
#> #   vehicles <list>, starships <list>, and abbreviated variable names
#> #   ¹​hair_color, ²​skin_color, ³​eye_color, ⁴​birth_year, ⁵​homeworld

into this format:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
# the output goal:

#> `summarise()` has grouped output by 'homeworld'. You can override using the
#> `.groups` argument.
#> # A tibble: 21 × 5
#>    homeworld    gender     mass height   bmi
#>    <chr>        <chr>     <dbl>  <dbl> <dbl>
#>  1 Bestine IV   masculine 110     180   61.1
#>  2 Tatooine     masculine 100.    184.  54.0
#>  3 <NA>         masculine  89     193   46.1
#>  4 Tatooine     feminine   75     164   45.5
#>  5 Bespin       masculine  79     175   45.1
#>  6 Naboo        masculine  80     178.  45.0
#>  7 Corellia     masculine  78.5   175   44.9
#>  8 Haruun Kal   masculine  84     188   44.7
#>  9 Socorro      masculine  79     177   44.6
#> 10 Concord Dawn masculine  79     183   43.2
#> # … with 11 more rows

Column subsetting using select()

select() is probably the first function you will call in every data manipulation tasks. By using select(), you can narrow down the number variables to fewer variables–which are what you truly need.

The select() usage (and also every core functions in dplyr) is very straighforward. You just need to pass the dataset and select the column name you want to select:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
select(starwars, name)
#> # A tibble: 87 × 1
#>    name              
#>    <chr>             
#>  1 Luke Skywalker    
#>  2 C-3PO             
#>  3 R2-D2             
#>  4 Darth Vader       
#>  5 Leia Organa       
#>  6 Owen Lars         
#>  7 Beru Whitesun lars
#>  8 R5-D4             
#>  9 Biggs Darklighter 
#> 10 Obi-Wan Kenobi    
#> # … with 77 more rows

It’s very easy right? Before we continue to our objective, there are some tips you need to consider.

Chaining operation using %>%

When we doing a complex data manipulation using dplyr, I suggest to use %>% operator to make our code tidier:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
starwars %>%
  select(name)
#> # A tibble: 87 × 1
#>    name              
#>    <chr>             
#>  1 Luke Skywalker    
#>  2 C-3PO             
#>  3 R2-D2             
#>  4 Darth Vader       
#>  5 Leia Organa       
#>  6 Owen Lars         
#>  7 Beru Whitesun lars
#>  8 R5-D4             
#>  9 Biggs Darklighter 
#> 10 Obi-Wan Kenobi    
#> # … with 77 more rows

This is because %>% operator could help you to make a chaining operation. The pipe operator will pass the results from its left-hand side, to the first parameter in the function of its right-hand side. So you could make a chaining operation like this:

1
2
3
4
5
6
starwars %>%
  select(name) %>%
  anotherfuns(<arguments>) %>%
  anotherfuns(<arguments>) %>%
  <continue the chain as long as necessary> %>%
  anotherfuns(<arguments>)

So as long as anotherfuns you chained could accept the previous result as its first argument, you can chaining the operation as long as you want.

Checking the variables using glimpse()

Now back to our objective. Remember that we want to make a summary of height and mass of human species across homeworld and gender. So we actually need only those few variables to go on. Still, remembering the name of those variables could be frustating sometimes, so I suggest to use glimpse() function before you jump into variable selection:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
glimpse(starwars)
#> Rows: 87
#> Columns: 14
#> $ name       <chr> "Luke Skywalker", "C-3PO", "R2-D2", "Darth Vader", "Leia Or…
#> $ height     <int> 172, 167, 96, 202, 150, 178, 165, 97, 183, 182, 188, 180, 2…
#> $ mass       <dbl> 77.0, 75.0, 32.0, 136.0, 49.0, 120.0, 75.0, 32.0, 84.0, 77.…
#> $ hair_color <chr> "blond", NA, NA, "none", "brown", "brown, grey", "brown", N…
#> $ skin_color <chr> "fair", "gold", "white, blue", "white", "light", "light", "…
#> $ eye_color  <chr> "blue", "yellow", "red", "yellow", "brown", "blue", "blue",…
#> $ birth_year <dbl> 19.0, 112.0, 33.0, 41.9, 19.0, 52.0, 47.0, NA, 24.0, 57.0, …
#> $ sex        <chr> "male", "none", "none", "male", "female", "male", "female",…
#> $ gender     <chr> "masculine", "masculine", "masculine", "masculine", "femini…
#> $ homeworld  <chr> "Tatooine", "Tatooine", "Naboo", "Tatooine", "Alderaan", "T…
#> $ species    <chr> "Human", "Droid", "Droid", "Human", "Human", "Human", "Huma…
#> $ films      <list> <"The Empire Strikes Back", "Revenge of the Sith", "Return…
#> $ vehicles   <list> <"Snowspeeder", "Imperial Speeder Bike">, <>, <>, <>, "Imp…
#> $ starships  <list> <"X-wing", "Imperial shuttle">, <>, <>, "TIE Advanced x1",…

The glimpse() function is practically the simplified version of str(); but it only print the most general information like the variable names, class, etc. Another alternative is to use the base colnames(), but it doesn’t provide additional information like glimpse(), so I suggest to use the later.

From glimpse() output, now we have the list of variable names and can be sure that we have to select species, homeworld, gender, mass, height:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
starwars %>%
  select(name, species, homeworld, gender, mass, height)
#> # A tibble: 87 × 6
#>    name               species homeworld gender     mass height
#>    <chr>              <chr>   <chr>     <chr>     <dbl>  <int>
#>  1 Luke Skywalker     Human   Tatooine  masculine    77    172
#>  2 C-3PO              Droid   Tatooine  masculine    75    167
#>  3 R2-D2              Droid   Naboo     masculine    32     96
#>  4 Darth Vader        Human   Tatooine  masculine   136    202
#>  5 Leia Organa        Human   Alderaan  feminine     49    150
#>  6 Owen Lars          Human   Tatooine  masculine   120    178
#>  7 Beru Whitesun lars Human   Tatooine  feminine     75    165
#>  8 R5-D4              Droid   Tatooine  masculine    32     97
#>  9 Biggs Darklighter  Human   Tatooine  masculine    84    183
#> 10 Obi-Wan Kenobi     Human   Stewjon   masculine    77    182
#> # … with 77 more rows

Now the dataset looks a lot simpler :grin:

Filtering specific observation using filter()

You already see that a simple column selection could help you in starting the data wrangling process. But row/observation selection is sometimes a compulsory step rather than just a helper step. For example, you want to exclude or include specific observations which inheriting some specific traits. If you recall our objective, and to be more specific at “for human species”, then we are absolutely need to filter our dataset to only containing human species.

As aforementioned, filter() function is also very easy to use: just pass the dataset object as the first argument, then you can pass the condition(s) as the following argument(s).

To declare the condition, the most common way is to use the base condition testing:

  • == and != for testing equal, and not equal condition
  • > or >= for testing greater than, and greater than or equal
  • < or <= for testing less than, and less than or equal

Let’s try to filter observation with mass less than or equal to 50 from starwars dataset:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
filter(starwars, mass <= 50)
#> # A tibble: 13 × 14
#>    name        height  mass hair_…¹ skin_…² eye_c…³ birth…⁴ sex   gender homew…⁵
#>    <chr>        <int> <dbl> <chr>   <chr>   <chr>     <dbl> <chr> <chr>  <chr>  
#>  1 R2-D2           96    32 <NA>    white,… red          33 none  mascu… Naboo  
#>  2 Leia Organa    150    49 brown   light   brown        19 fema… femin… Aldera…
#>  3 R5-D4           97    32 <NA>    white,… red          NA none  mascu… Tatooi…
#>  4 Yoda            66    17 white   green   brown       896 male  mascu… <NA>   
#>  5 Wicket Sys…     88    20 brown   brown   brown         8 male  mascu… Endor  
#>  6 Sebulba        112    40 none    grey, … orange       NA male  mascu… Malast…
#>  7 Dud Bolt        94    45 none    blue, … yellow       NA male  mascu… Vulpter
#>  8 Adi Gallia     184    50 none    dark    blue         NA fema… femin… Corusc…
#>  9 Barriss Of…    166    50 black   yellow  blue         40 fema… femin… Mirial 
#> 10 Ratts Tyer…     79    15 none    grey, … unknown      NA male  mascu… Aleen …
#> 11 Wat Tambor     193    48 none    green,… unknown      NA male  mascu… Skako  
#> 12 Sly Moore      178    48 none    pale    white        NA <NA>  <NA>   Umbara 
#> 13 Padmé Amid…    165    45 brown   light   brown        46 fema… femin… Naboo  
#> # … with 4 more variables: species <chr>, films <list>, vehicles <list>,
#> #   starships <list>, and abbreviated variable names ¹​hair_color, ²​skin_color,
#> #   ³​eye_color, ⁴​birth_year, ⁵​homeworld

It might be looks very simple, but let’s dive deeper on how condition testing works to see its full potential.

Behind the scene of condition testing

Before I explain the details, let’s take a look at the first 5 mass value from starwars:

1
2
head(starwars$mass)
#> [1]  77  75  32 136  49 120

If we test the condition of <= 50 to above data, we will actually get this output:

1
2
head(starwars$mass) <= 50
#> [1] FALSE FALSE  TRUE FALSE  TRUE FALSE

So basically filtering condition is picking up rows with the TRUE value; and it will discard the rest. Based on this mechanism, we could also use other helper functions to test a condition, e.g., when we want to see rows with NA values, we can use is.na() function:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
filter(starwars, is.na(mass))
#> # A tibble: 28 × 14
#>    name        height  mass hair_…¹ skin_…² eye_c…³ birth…⁴ sex   gender homew…⁵
#>    <chr>        <int> <dbl> <chr>   <chr>   <chr>     <dbl> <chr> <chr>  <chr>  
#>  1 Wilhuff Ta…    180    NA auburn… fair    blue         64 male  mascu… Eriadu 
#>  2 Mon Mothma     150    NA auburn  fair    blue         48 fema… femin… Chandr…
#>  3 Arvel Cryn…     NA    NA brown   fair    brown        NA male  mascu… <NA>   
#>  4 Finis Valo…    170    NA blond   fair    blue         91 male  mascu… Corusc…
#>  5 Rugor Nass     206    NA none    green   orange       NA male  mascu… Naboo  
#>  6 Ric Olié       183    NA brown   fair    blue         NA <NA>  <NA>   Naboo  
#>  7 Watto          137    NA black   blue, … yellow       NA male  mascu… Toydar…
#>  8 Quarsh Pan…    183    NA black   dark    brown        62 <NA>  <NA>   Naboo  
#>  9 Shmi Skywa…    163    NA black   fair    brown        72 fema… femin… Tatooi…
#> 10 Bib Fortuna    180    NA none    pale    pink         NA male  mascu… Ryloth 
#> # … with 18 more rows, 4 more variables: species <chr>, films <list>,
#> #   vehicles <list>, starships <list>, and abbreviated variable names
#> #   ¹​hair_color, ²​skin_color, ³​eye_color, ⁴​birth_year, ⁵​homeworld

or if we want to keep observations with non-NA value, then we could create a negation using !:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
filter(starwars, !is.na(mass))
#> # A tibble: 59 × 14
#>    name        height  mass hair_…¹ skin_…² eye_c…³ birth…⁴ sex   gender homew…⁵
#>    <chr>        <int> <dbl> <chr>   <chr>   <chr>     <dbl> <chr> <chr>  <chr>  
#>  1 Luke Skywa…    172    77 blond   fair    blue       19   male  mascu… Tatooi…
#>  2 C-3PO          167    75 <NA>    gold    yellow    112   none  mascu… Tatooi…
#>  3 R2-D2           96    32 <NA>    white,… red        33   none  mascu… Naboo  
#>  4 Darth Vader    202   136 none    white   yellow     41.9 male  mascu… Tatooi…
#>  5 Leia Organa    150    49 brown   light   brown      19   fema… femin… Aldera…
#>  6 Owen Lars      178   120 brown,… light   blue       52   male  mascu… Tatooi…
#>  7 Beru White…    165    75 brown   light   blue       47   fema… femin… Tatooi…
#>  8 R5-D4           97    32 <NA>    white,… red        NA   none  mascu… Tatooi…
#>  9 Biggs Dark…    183    84 black   light   brown      24   male  mascu… Tatooi…
#> 10 Obi-Wan Ke…    182    77 auburn… fair    blue-g…    57   male  mascu… Stewjon
#> # … with 49 more rows, 4 more variables: species <chr>, films <list>,
#> #   vehicles <list>, starships <list>, and abbreviated variable names
#> #   ¹​hair_color, ²​skin_color, ³​eye_color, ⁴​birth_year, ⁵​homeworld

Quick-check on conditions availability

Just like when we check the variables using glimpse(), we also need to check our variables to make sure about any available options to be filtered. For our case, we already know that we want to get only human observations, but straighly jump into filter() is very unwise:

1
2
3
4
5
6
starwars %>%
  select(name, species, homeworld, gender, mass, height) %>%
  filter(species == "human")
#> # A tibble: 0 × 6
#> # … with 6 variables: name <chr>, species <chr>, homeworld <chr>, gender <chr>,
#> #   mass <dbl>, height <int>

It seems like there is no "human" species, but actually it’s just because we specified a wrong condition. For categorical variable like species, I really suggest you to check it using unique() first:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
unique(starwars$species)
#>  [1] "Human"          "Droid"          "Wookiee"        "Rodian"        
#>  [5] "Hutt"           "Yoda's species" "Trandoshan"     "Mon Calamari"  
#>  [9] "Ewok"           "Sullustan"      "Neimodian"      "Gungan"        
#> [13] NA               "Toydarian"      "Dug"            "Zabrak"        
#> [17] "Twi'lek"        "Vulptereen"     "Xexto"          "Toong"         
#> [21] "Cerean"         "Nautolan"       "Tholothian"     "Iktotchi"      
#> [25] "Quermian"       "Kel Dor"        "Chagrian"       "Geonosian"     
#> [29] "Mirialan"       "Clawdite"       "Besalisk"       "Kaminoan"      
#> [33] "Aleena"         "Skakoan"        "Muun"           "Togruta"       
#> [37] "Kaleesh"        "Pau'an"

Which shows that it’s actually "Human". For numerical variables, you can use summary() to help you quick-check the value range:

1
2
3
summary(starwars$height)
#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
#>    66.0   167.0   180.0   174.4   191.0   264.0       6

After we quick-check the values within our variable of interest, it’s safe now to call filter() function:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
starwars %>%
  select(name, species, homeworld, gender, mass, height) %>%
  filter(species == "Human")
#> # A tibble: 35 × 6
#>    name               species homeworld gender     mass height
#>    <chr>              <chr>   <chr>     <chr>     <dbl>  <int>
#>  1 Luke Skywalker     Human   Tatooine  masculine    77    172
#>  2 Darth Vader        Human   Tatooine  masculine   136    202
#>  3 Leia Organa        Human   Alderaan  feminine     49    150
#>  4 Owen Lars          Human   Tatooine  masculine   120    178
#>  5 Beru Whitesun lars Human   Tatooine  feminine     75    165
#>  6 Biggs Darklighter  Human   Tatooine  masculine    84    183
#>  7 Obi-Wan Kenobi     Human   Stewjon   masculine    77    182
#>  8 Anakin Skywalker   Human   Tatooine  masculine    84    188
#>  9 Wilhuff Tarkin     Human   Eriadu    masculine    NA    180
#> 10 Han Solo           Human   Corellia  masculine    80    180
#> # … with 25 more rows

Manipulating variables using mutate()

When I mentioned mutate() terms, most of you that are not familiar with dplyr probably not sure on its meaning. This function is basically helping us to manipulating the variables inside our dataframe. For example, you could change the units of mass and height index:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
mutate(starwars, mass = mass * 100, height = height / 100)
#> # A tibble: 87 × 14
#>    name        height  mass hair_…¹ skin_…² eye_c…³ birth…⁴ sex   gender homew…⁵
#>    <chr>        <dbl> <dbl> <chr>   <chr>   <chr>     <dbl> <chr> <chr>  <chr>  
#>  1 Luke Skywa…   1.72  7700 blond   fair    blue       19   male  mascu… Tatooi…
#>  2 C-3PO         1.67  7500 <NA>    gold    yellow    112   none  mascu… Tatooi…
#>  3 R2-D2         0.96  3200 <NA>    white,… red        33   none  mascu… Naboo  
#>  4 Darth Vader   2.02 13600 none    white   yellow     41.9 male  mascu… Tatooi…
#>  5 Leia Organa   1.5   4900 brown   light   brown      19   fema… femin… Aldera…
#>  6 Owen Lars     1.78 12000 brown,… light   blue       52   male  mascu… Tatooi…
#>  7 Beru White…   1.65  7500 brown   light   blue       47   fema… femin… Tatooi…
#>  8 R5-D4         0.97  3200 <NA>    white,… red        NA   none  mascu… Tatooi…
#>  9 Biggs Dark…   1.83  8400 black   light   brown      24   male  mascu… Tatooi…
#> 10 Obi-Wan Ke…   1.82  7700 auburn… fair    blue-g…    57   male  mascu… Stewjon
#> # … with 77 more rows, 4 more variables: species <chr>, films <list>,
#> #   vehicles <list>, starships <list>, and abbreviated variable names
#> #   ¹​hair_color, ²​skin_color, ³​eye_color, ⁴​birth_year, ⁵​homeworld

But it doesn’t limited to manipulating the existing variables; you could also make a new variable using this function. Say, we want to add Body Mass Index (BMI) to our current analysis, we could create it using mutate():

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
starwars %>%
  select(name, species, homeworld, gender, mass, height) %>%
  filter(species == "Human") %>%
  mutate(bmi = mass / height * 100)
#> # A tibble: 35 × 7
#>    name               species homeworld gender     mass height   bmi
#>    <chr>              <chr>   <chr>     <chr>     <dbl>  <int> <dbl>
#>  1 Luke Skywalker     Human   Tatooine  masculine    77    172  44.8
#>  2 Darth Vader        Human   Tatooine  masculine   136    202  67.3
#>  3 Leia Organa        Human   Alderaan  feminine     49    150  32.7
#>  4 Owen Lars          Human   Tatooine  masculine   120    178  67.4
#>  5 Beru Whitesun lars Human   Tatooine  feminine     75    165  45.5
#>  6 Biggs Darklighter  Human   Tatooine  masculine    84    183  45.9
#>  7 Obi-Wan Kenobi     Human   Stewjon   masculine    77    182  42.3
#>  8 Anakin Skywalker   Human   Tatooine  masculine    84    188  44.7
#>  9 Wilhuff Tarkin     Human   Eriadu    masculine    NA    180  NA  
#> 10 Han Solo           Human   Corellia  masculine    80    180  44.4
#> # … with 25 more rows

This function is very versatile, so play around with your creativity here :sunglasses:

Data aggregation using group_by and summarise

We are reaching the final parts! Let’s recall the part of our objective that we haven’t done yet: “across the homeworld and gender”. This particular phrase is indicating that we want each of our observation to representing its homeworld and gender instead of each character’s name like what we currently have. We can achieve this using group_by() and summarise() function.

To declare a group, you just need to specify the name of variables that you want to set as group identifier:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
starwars %>%
  select(name, species, homeworld, gender, mass, height) %>%
  filter(species == "Human") %>%
  mutate(bmi = mass / height * 100) %>%
  group_by(homeworld, gender)
#> # A tibble: 35 × 7
#> # Groups:   homeworld, gender [21]
#>    name               species homeworld gender     mass height   bmi
#>    <chr>              <chr>   <chr>     <chr>     <dbl>  <int> <dbl>
#>  1 Luke Skywalker     Human   Tatooine  masculine    77    172  44.8
#>  2 Darth Vader        Human   Tatooine  masculine   136    202  67.3
#>  3 Leia Organa        Human   Alderaan  feminine     49    150  32.7
#>  4 Owen Lars          Human   Tatooine  masculine   120    178  67.4
#>  5 Beru Whitesun lars Human   Tatooine  feminine     75    165  45.5
#>  6 Biggs Darklighter  Human   Tatooine  masculine    84    183  45.9
#>  7 Obi-Wan Kenobi     Human   Stewjon   masculine    77    182  42.3
#>  8 Anakin Skywalker   Human   Tatooine  masculine    84    188  44.7
#>  9 Wilhuff Tarkin     Human   Eriadu    masculine    NA    180  NA  
#> 10 Han Solo           Human   Corellia  masculine    80    180  44.4
#> # … with 25 more rows

Then the tibble will automatically give you information about which group is active.

Note that the order of specified groups is also matters. The function will process the group operation from the right-most variable. So the best practice here to set the order from major to minor group:

1
2
3
... %>%
  group_by(<major group>, <minor group>, <definitely the minorest group>) %>%
  ...

After we set the group, we can use summarise() to do the aggregation. This function is almost identic to how mutate works, but the output will be aggregated value by the group we specified; note that if we don’t specify any group, then it will aggregate the value to one observation that representing our dataset.

Let’s try to take the mean of height, mass, and bmi variables:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
starwars %>%
  select(name, species, homeworld, gender, mass, height) %>%
  filter(species == "Human") %>%
  mutate(bmi = mass / height * 100) %>%
  group_by(homeworld, gender) %>%
  summarise(
    mass = mean(mass, na.rm = TRUE), # use na.rm since there are
    height = mean(height, na.rm = TRUE), # some missing values
    bmi = mean(bmi, na.rm = TRUE)
  )
#> `summarise()` has grouped output by 'homeworld'. You can override using the
#> `.groups` argument.
#> # A tibble: 21 × 5
#> # Groups:   homeworld [16]
#>    homeworld    gender     mass height   bmi
#>    <chr>        <chr>     <dbl>  <dbl> <dbl>
#>  1 Alderaan     feminine   49     150   32.7
#>  2 Alderaan     masculine  79     190.  42.0
#>  3 Bespin       masculine  79     175   45.1
#>  4 Bestine IV   masculine 110     180   61.1
#>  5 Chandrila    feminine  NaN     150  NaN  
#>  6 Concord Dawn masculine  79     183   43.2
#>  7 Corellia     masculine  78.5   175   44.9
#>  8 Coruscant    feminine  NaN     167  NaN  
#>  9 Coruscant    masculine NaN     170  NaN  
#> 10 Eriadu       masculine NaN     180  NaN  
#> # … with 11 more rows

When we done with group operations, don’t forget to set the group declaration to off. The most straighforward ways to do this is by passing the result to ungroup():

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
starwars %>%
  select(name, species, homeworld, gender, mass, height) %>%
  filter(species == "Human") %>%
  mutate(bmi = mass / height * 100) %>%
  group_by(homeworld, gender) %>%
  summarise(
    mass = mean(mass, na.rm = TRUE),
    height = mean(height, na.rm = TRUE),
    bmi = mean(bmi, na.rm = TRUE)
  ) %>%
  ungroup()
#> `summarise()` has grouped output by 'homeworld'. You can override using the
#> `.groups` argument.
#> # A tibble: 21 × 5
#>    homeworld    gender     mass height   bmi
#>    <chr>        <chr>     <dbl>  <dbl> <dbl>
#>  1 Alderaan     feminine   49     150   32.7
#>  2 Alderaan     masculine  79     190.  42.0
#>  3 Bespin       masculine  79     175   45.1
#>  4 Bestine IV   masculine 110     180   61.1
#>  5 Chandrila    feminine  NaN     150  NaN  
#>  6 Concord Dawn masculine  79     183   43.2
#>  7 Corellia     masculine  78.5   175   44.9
#>  8 Coruscant    feminine  NaN     167  NaN  
#>  9 Coruscant    masculine NaN     170  NaN  
#> 10 Eriadu       masculine NaN     180  NaN  
#> # … with 11 more rows

Re-ordering observation using arrange()

Our last results actually enough for showing the summary of mass and height across homeworld and gender. But it doesn’t give us some basic informations, like, which homeworld-gender combination has the lowest or higher bmi.

Fortunately, dplyr provided arrange() function for this specific task. By default, this function will arrange the specified variable in ascending mode:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
starwars %>%
  select(name, species, homeworld, gender, mass, height) %>%
  filter(species == "Human") %>%
  mutate(bmi = mass / height * 100) %>%
  group_by(homeworld, gender) %>%
  summarise(
    mass = mean(mass, na.rm = TRUE),
    height = mean(height, na.rm = TRUE),
    bmi = mean(bmi, na.rm = TRUE)
  ) %>%
  ungroup() %>%
  arrange(bmi)
#> `summarise()` has grouped output by 'homeworld'. You can override using the
#> `.groups` argument.
#> # A tibble: 21 × 5
#>    homeworld    gender     mass height   bmi
#>    <chr>        <chr>     <dbl>  <dbl> <dbl>
#>  1 Naboo        feminine   45     162.  27.3
#>  2 Alderaan     feminine   49     150   32.7
#>  3 Serenno      masculine  80     193   41.5
#>  4 Alderaan     masculine  79     190.  42.0
#>  5 Stewjon      masculine  77     182   42.3
#>  6 Kamino       masculine  78.2   183   42.7
#>  7 Concord Dawn masculine  79     183   43.2
#>  8 Socorro      masculine  79     177   44.6
#>  9 Haruun Kal   masculine  84     188   44.7
#> 10 Corellia     masculine  78.5   175   44.9
#> # … with 11 more rows

And if you want descending sorting, you just need to wrap the variable using desc():

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
starwars %>%
  select(name, species, homeworld, gender, mass, height) %>%
  filter(species == "Human") %>%
  mutate(bmi = mass / height * 100) %>%
  group_by(homeworld, gender) %>%
  summarise(
    mass = mean(mass, na.rm = TRUE),
    height = mean(height, na.rm = TRUE),
    bmi = mean(bmi, na.rm = TRUE)
  ) %>%
  ungroup() %>%
  arrange(desc(bmi))
#> `summarise()` has grouped output by 'homeworld'. You can override using the
#> `.groups` argument.
#> # A tibble: 21 × 5
#>    homeworld    gender     mass height   bmi
#>    <chr>        <chr>     <dbl>  <dbl> <dbl>
#>  1 Bestine IV   masculine 110     180   61.1
#>  2 Tatooine     masculine 100.    184.  54.0
#>  3 <NA>         masculine  89     193   46.1
#>  4 Tatooine     feminine   75     164   45.5
#>  5 Bespin       masculine  79     175   45.1
#>  6 Naboo        masculine  80     178.  45.0
#>  7 Corellia     masculine  78.5   175   44.9
#>  8 Haruun Kal   masculine  84     188   44.7
#>  9 Socorro      masculine  79     177   44.6
#> 10 Concord Dawn masculine  79     183   43.2
#> # … with 11 more rows

Now we know that the females from Naboo have the lowest mean BMI, while the males from Bestine have the highest mean BMI :grin:

Concluding remarks

In this article, you already see how easy–and tidy!–it is to do data wrangling tasks using tibble and dplyr. But we just scratched the surface of tidyverse; there are so many package and functionalities available for us to explore!

For the next episode, we will try more realistic cases, and see how tidyverse packages could help us in a more complex data wrangling tasks. So stay tuned on this series! :sunglasses:

Here is a tidyverse logo for you:

1
2
3
4
5
#> ⬢ __  _    __   .    ⬡           ⬢  . 
#>  / /_(_)__/ /_ ___  _____ _______ ___ 
#> / __/ / _  / // / |/ / -_) __(_-</ -_)
#> \__/_/\_,_/\_, /|___/\__/_/ /___/\__/ 
#>      ⬢  . /___/      ⬡      .       ⬢
Built with Hugo
Theme Stack designed by Jimmy