Featured image of post Introduction to tidymodels

Introduction to tidymodels

R. Dimas Bagas Herlambang

The following presentation is produced by the team at Algoritma for its internal training This presentation is intended for a restricted audience only. It may not be reproduced, distributed, translated or adapted in any form outside these individuals and organizations without permission.

Outline

Why tidymodels Matters?

  • Things we think we’re doing it right
  • Things we never think we could do it better

Setting the Cross-Validation Scheme using rsample

  • Rethinking: Why we need validation?
  • Tidy way to rsample-ing your dataset

Data Preprocess using recipes

  • Rethinking: How we should treat train and test?
  • Reproducible preprocess recipes

Model Fitting using parsnip

  • Rethinking: How many machine learning packages you used?
  • One vegeta.. I mean package to rule them all: parsnip

Model Evaluation using yardstick

  • Rethinking: How we measure the goodness of our model?
  • It’s always good to bring your own yardstick

Why tidymodels Matters?

Things we think we’re doing it right

Sample splitting

1
2
3
4
5
6
7
# import libs
library(plotly)
library(randomForest)
library(ranger)
library(tidyverse)
library(tidymodels)
library(themis)
1
2
3
4
5
# import additional libs
library(lubridate)

# prepare example datasets
attrition <- read_csv("data_input/attrition.csv")
1
2
3
4
5
6
7
8
# set seed
set.seed(100)

# train rows
in_train <- sample(1:nrow(attrition), nrow(attrition) * 0.8)

# check target distribution in train
prop.table(table(attrition$attrition[in_train]))
#> 
#>        no       yes 
#> 0.8469388 0.1530612
1
2
# check target distribution in test
prop.table(table(attrition$attrition[-in_train]))
#> 
#>        no       yes 
#> 0.8061224 0.1938776

Numeric scaling

1
2
3
4
5
# scale age in full dataset
age_scaled <- scale(attrition$age)

# check mean and standard deviation
attr(age_scaled, "scaled:center")
#> [1] 36.92381
1
attr(age_scaled, "scaled:scale")
#> [1] 9.135373
1
2
3
4
5
# scale age in train dataset
age_train_scaled <- scale(attrition$age[in_train])

# check mean and standard deviation
attr(age_train_scaled, "scaled:center")
#> [1] 36.84014
1
attr(age_train_scaled, "scaled:scale")
#> [1] 9.210907
1
2
3
4
5
# scale age in test dataset
age_test_scaled <- scale(attrition$age[-in_train])

# check mean and standard deviation
attr(age_test_scaled, "scaled:center")
#> [1] 37.2585
1
attr(age_test_scaled, "scaled:scale")
#> [1] 8.834148

Things we never think we could do it better

How we see model performance

#> Confusion Matrix and Statistics
#> 
#>           Reference
#> Prediction   no  yes
#>        no  1029  204
#>        yes  204   33
#>                                           
#>                Accuracy : 0.7224          
#>                  95% CI : (0.6988, 0.7452)
#>     No Information Rate : 0.8388          
#>     P-Value [Acc > NIR] : 1               
#>                                           
#>                   Kappa : -0.0262         
#>                                           
#>  Mcnemar's Test P-Value : 1               
#>                                           
#>             Sensitivity : 0.13924         
#>             Specificity : 0.83455         
#>          Pos Pred Value : 0.13924         
#>          Neg Pred Value : 0.83455         
#>              Prevalence : 0.16122         
#>          Detection Rate : 0.02245         
#>    Detection Prevalence : 0.16122         
#>       Balanced Accuracy : 0.48690         
#>                                           
#>        'Positive' Class : yes             
#> 

How we use Receiver Operating Curve

Setting the Cross-Validation Scheme using rsample

Rethinking: Why we need validation?

Tidy way to rsample-ing your dataset

rsample is part of tidymodels that could help us in splitting or resampling or machine learning dataset.

There are so many splitting and resampling approach provided by rsample–as you could see in its full function references page. In this introduction, we will use two most general function:

Initial splitting

Initial random sampling for splitting train and test could be done using initial_split():

1
2
3
4
5
6
7
8
# set seed
set.seed(100)

# create initial split
splitted <- initial_split(attrition, prop = 0.8)

# check train dataset
training(splitted)
#> # A tibble: 1,176 × 35
#>    attrition   age business_travel   daily_rate department    distance_from_home
#>    <chr>     <dbl> <chr>                  <dbl> <chr>                      <dbl>
#>  1 no           53 travel_rarely            238 sales                          1
#>  2 no           28 non_travel              1103 research_dev…                 16
#>  3 no           28 travel_rarely            736 sales                         26
#>  4 no           25 travel_rarely            949 research_dev…                  1
#>  5 no           51 travel_frequently        237 sales                          9
#>  6 yes          32 non_travel              1474 sales                         11
#>  7 no           34 travel_frequently       1003 research_dev…                  2
#>  8 no           40 travel_frequently        593 research_dev…                  9
#>  9 no           36 travel_rarely            884 sales                          1
#> 10 no           27 travel_rarely           1167 research_dev…                  4
#> # ℹ 1,166 more rows
#> # ℹ 29 more variables: education <dbl>, education_field <chr>,
#> #   employee_count <dbl>, employee_number <dbl>,
#> #   environment_satisfaction <dbl>, gender <chr>, hourly_rate <dbl>,
#> #   job_involvement <dbl>, job_level <dbl>, job_role <chr>,
#> #   job_satisfaction <dbl>, marital_status <chr>, monthly_income <dbl>,
#> #   monthly_rate <dbl>, num_companies_worked <dbl>, over_18 <chr>, …
1
2
# check test dataset
testing(splitted)
#> # A tibble: 294 × 35
#>    attrition   age business_travel   daily_rate department    distance_from_home
#>    <chr>     <dbl> <chr>                  <dbl> <chr>                      <dbl>
#>  1 no           27 travel_rarely            591 research_dev…                  2
#>  2 no           59 travel_rarely           1324 research_dev…                  3
#>  3 no           38 travel_frequently        216 research_dev…                 23
#>  4 no           34 travel_rarely           1346 research_dev…                 19
#>  5 yes          36 travel_rarely           1218 sales                          9
#>  6 yes          34 travel_rarely            699 research_dev…                  6
#>  7 no           30 travel_rarely            125 research_dev…                  9
#>  8 no           36 travel_rarely            852 research_dev…                  5
#>  9 no           33 travel_frequently       1141 sales                          1
#> 10 no           27 travel_frequently        994 sales                          8
#> # ℹ 284 more rows
#> # ℹ 29 more variables: education <dbl>, education_field <chr>,
#> #   employee_count <dbl>, employee_number <dbl>,
#> #   environment_satisfaction <dbl>, gender <chr>, hourly_rate <dbl>,
#> #   job_involvement <dbl>, job_level <dbl>, job_role <chr>,
#> #   job_satisfaction <dbl>, marital_status <chr>, monthly_income <dbl>,
#> #   monthly_rate <dbl>, num_companies_worked <dbl>, over_18 <chr>, …

But sometimes, simple random sampling is not enough:

1
2
# target distribution in full dataset
prop.table(table(attrition$attrition))
#> 
#>        no       yes 
#> 0.8387755 0.1612245
1
2
# target distribution in train dataset
prop.table(table(training(splitted)$attrition))
#> 
#>        no       yes 
#> 0.8469388 0.1530612
1
2
# target distribution in test dataset
prop.table(table(testing(splitted)$attrition))
#> 
#>        no       yes 
#> 0.8061224 0.1938776

This is where we need strata argument to use stratified random sampling:

1
2
3
4
5
6
7
8
# set seed
set.seed(100)

# create stratified initial split
splitted <- initial_split(attrition, prop = 0.8, strata = "attrition")

# target distribution in full dataset
prop.table(table(attrition$attrition))
#> 
#>        no       yes 
#> 0.8387755 0.1612245
1
2
# target distribution in train dataset
prop.table(table(training(splitted)$attrition))
#> 
#>        no       yes 
#> 0.8391489 0.1608511
1
2
# target distribution in test dataset
prop.table(table(testing(splitted)$attrition))
#> 
#>        no       yes 
#> 0.8372881 0.1627119

Cross-sectional resampling

To use k-fold validation splits–and optionally, with repetition–we could use vfold_cv():

1
2
3
4
5
6
7
8
# set seed
set.seed(100)

# create k-fold splits with repetition
resampled <- vfold_cv(attrition, v = 3, repeats = 2, strata = "attrition")

# quick check
resampled
#> #  3-fold cross-validation repeated 2 times using stratification 
#> # A tibble: 6 × 3
#>   splits            id      id2  
#>   <list>            <chr>   <chr>
#> 1 <split [980/490]> Repeat1 Fold1
#> 2 <split [980/490]> Repeat1 Fold2
#> 3 <split [980/490]> Repeat1 Fold3
#> 4 <split [980/490]> Repeat2 Fold1
#> 5 <split [980/490]> Repeat2 Fold2
#> 6 <split [980/490]> Repeat2 Fold3

Each train and test dataset are stored in splits column. We could use analysis() and assessment() to get the train and test dataset, consecutively:

1
2
# check train dataset on an example split
analysis(resampled$splits[[1]])
#> # A tibble: 980 × 35
#>    attrition   age business_travel   daily_rate department    distance_from_home
#>    <chr>     <dbl> <chr>                  <dbl> <chr>                      <dbl>
#>  1 yes          41 travel_rarely           1102 sales                          1
#>  2 no           49 travel_frequently        279 research_dev…                  8
#>  3 no           27 travel_rarely            591 research_dev…                  2
#>  4 no           59 travel_rarely           1324 research_dev…                  3
#>  5 no           36 travel_rarely           1299 research_dev…                 27
#>  6 no           35 travel_rarely            809 research_dev…                 16
#>  7 no           29 travel_rarely            153 research_dev…                 15
#>  8 no           31 travel_rarely            670 research_dev…                 26
#>  9 no           34 travel_rarely           1346 research_dev…                 19
#> 10 yes          28 travel_rarely            103 research_dev…                 24
#> # ℹ 970 more rows
#> # ℹ 29 more variables: education <dbl>, education_field <chr>,
#> #   employee_count <dbl>, employee_number <dbl>,
#> #   environment_satisfaction <dbl>, gender <chr>, hourly_rate <dbl>,
#> #   job_involvement <dbl>, job_level <dbl>, job_role <chr>,
#> #   job_satisfaction <dbl>, marital_status <chr>, monthly_income <dbl>,
#> #   monthly_rate <dbl>, num_companies_worked <dbl>, over_18 <chr>, …
1
2
# check test dataset on an example split
assessment(resampled$splits[[1]])
#> # A tibble: 490 × 35
#>    attrition   age business_travel   daily_rate department    distance_from_home
#>    <chr>     <dbl> <chr>                  <dbl> <chr>                      <dbl>
#>  1 yes          37 travel_rarely           1373 research_dev…                  2
#>  2 no           33 travel_frequently       1392 research_dev…                  3
#>  3 no           32 travel_frequently       1005 research_dev…                  2
#>  4 no           30 travel_rarely           1358 research_dev…                 24
#>  5 no           38 travel_frequently        216 research_dev…                 23
#>  6 no           29 travel_rarely           1389 research_dev…                 21
#>  7 yes          34 travel_rarely            699 research_dev…                  6
#>  8 no           44 travel_rarely           1459 research_dev…                 10
#>  9 yes          24 travel_rarely            813 research_dev…                  1
#> 10 yes          50 travel_rarely            869 sales                          3
#> # ℹ 480 more rows
#> # ℹ 29 more variables: education <dbl>, education_field <chr>,
#> #   employee_count <dbl>, employee_number <dbl>,
#> #   environment_satisfaction <dbl>, gender <chr>, hourly_rate <dbl>,
#> #   job_involvement <dbl>, job_level <dbl>, job_role <chr>,
#> #   job_satisfaction <dbl>, marital_status <chr>, monthly_income <dbl>,
#> #   monthly_rate <dbl>, num_companies_worked <dbl>, over_18 <chr>, …

Data Preprocess using recipes

Rethinking: How do we should treat train and test?

Reproducible preprocess recipes

recipes is part of tidymodels that could help us in making a reproducible data preprocess.

There are so many data preprocess approach provided by recipes–as you could see in its full function references page . In this introduction, we will use several preprocess steps related to our example dataset.

There are several steps that we could apply to our dataset–some are very fundamental and sometimes mandatory, but some are also tuneable:

  • step_rm() : Manually remove unused columns.
  • step_nzv() : Automatically filter near-zero varianced columns.
  • step_string2factor() : Manually convert to factor columns. Downsampling step to balancing target’s class distribution (tuneable).
  • step_center() : Normalize the mean of numeric column(s) to zero (tuneable).
  • step_scale() : Normalize the standard deviation of numeric column(s) to one (tuneable).
  • step_pca() : Shrink numeric column(s) to several PCA components (tuneable).

Designing your first preprocess recipes

  1. Initiate a recipe using recipe()
    • Define your formula in the first argument.
    • Supply a template dataset in data argument.
  2. Pipe to every needed step_*()–always remember to put every step in proper consecutive manner.
  3. After finished with every needed step_*(), pipe to prep() function to train your recipe
    • It will automatically convert all character columns to factor;
    • If you used step_string2factor(), don’t forget to specify strings_as_factors = FALSE
    • It will train the recipe to the specified dataset in the recipe()’s data argument;
    • If you want to train to other dataset, you can supply the new dataset to training argument, and set the fresh argument to TRUE

Let’s see an example of defining a recipe:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
# define preprocess recipe from train dataset
rec <- recipe(attrition ~ ., data = training(splitted)) %>% 
  step_rm(employee_count, employee_number) %>%
  step_nzv(all_predictors()) %>% 
  step_string2factor(all_nominal(), -attrition) %>%
  step_string2factor(attrition, levels = c("yes", "no")) %>%
  step_downsample(attrition, under_ratio = 1/1, seed = 100) %>%
  step_center(all_numeric()) %>%
  step_scale(all_numeric()) %>%
  step_pca(all_numeric(), threshold = 0.85) %>%
  prep(strings_as_factors = FALSE)

# quick check
rec
#> Recipe
#> 
#> Inputs:
#> 
#>       role #variables
#>    outcome          1
#>  predictor         34
#> 
#> Training data contained 1175 data points and no missing data.
#> 
#> Operations:
#> 
#> Variables removed employee_count, employee_number [trained]
#> Sparse, unbalanced variable filter removed over_18, standard_hours [trained]
#> Factor variables from business_travel, department, education_field, g... [trained]
#> Factor variables from attrition [trained]
#> Down-sampling based on attrition [trained]
#> Centering for age, daily_rate, distance_from_home, education,... [trained]
#> Scaling for age, daily_rate, distance_from_home, education,... [trained]
#> PCA extraction with age, daily_rate, distance_from_home, education, ... [trained]

There are two ways of obtaining the result from our recipe:

  • juice() : Extract preprocessed dataset from prep()-ed recipe. Normally, we will use this function to get preprocessed train dataset.

  • bake() : Apply a recipe to new dataset. Normally, we use we will use this function to preprocess new dataset, such as test dataset, or prediction dataset.

These are some example on how to get preprocessed train and test dataset:

1
2
3
4
5
# get preprocessed train dataset
data_train <- juice(rec)

# quick check
data_train
#> # A tibble: 378 × 22
#>    business_travel department     education_field gender job_role marital_status
#>    <fct>           <fct>          <fct>           <fct>  <fct>    <fct>         
#>  1 travel_rarely   sales          life_sciences   female sales_e… single        
#>  2 travel_rarely   research_deve… other           male   laborat… single        
#>  3 travel_rarely   research_deve… life_sciences   male   laborat… single        
#>  4 travel_rarely   sales          life_sciences   male   sales_r… single        
#>  5 travel_rarely   sales          technical_degr… male   sales_r… married       
#>  6 travel_rarely   research_deve… medical         male   researc… married       
#>  7 travel_rarely   research_deve… life_sciences   male   laborat… single        
#>  8 travel_rarely   research_deve… technical_degr… female researc… married       
#>  9 travel_rarely   research_deve… life_sciences   male   laborat… single        
#> 10 travel_rarely   research_deve… technical_degr… male   laborat… single        
#> # ℹ 368 more rows
#> # ℹ 16 more variables: over_time <fct>, attrition <fct>, PC01 <dbl>,
#> #   PC02 <dbl>, PC03 <dbl>, PC04 <dbl>, PC05 <dbl>, PC06 <dbl>, PC07 <dbl>,
#> #   PC08 <dbl>, PC09 <dbl>, PC10 <dbl>, PC11 <dbl>, PC12 <dbl>, PC13 <dbl>,
#> #   PC14 <dbl>
1
2
3
4
5
# get preprocessed test dataset
data_test <- bake(rec, testing(splitted))

# quick check
data_test
#> # A tibble: 295 × 22
#>    business_travel   department   education_field gender job_role marital_status
#>    <fct>             <fct>        <fct>           <fct>  <fct>    <fct>         
#>  1 travel_rarely     research_de… medical         female laborat… married       
#>  2 travel_frequently research_de… life_sciences   male   manufac… single        
#>  3 travel_rarely     research_de… medical         male   healthc… married       
#>  4 non_travel        research_de… medical         male   laborat… divorced      
#>  5 travel_rarely     research_de… medical         male   researc… single        
#>  6 travel_frequently research_de… life_sciences   female researc… single        
#>  7 travel_rarely     research_de… medical         male   laborat… single        
#>  8 travel_rarely     research_de… medical         female researc… divorced      
#>  9 travel_rarely     sales        marketing       male   sales_r… married       
#> 10 travel_rarely     sales        marketing       female sales_r… married       
#> # ℹ 285 more rows
#> # ℹ 16 more variables: over_time <fct>, attrition <fct>, PC01 <dbl>,
#> #   PC02 <dbl>, PC03 <dbl>, PC04 <dbl>, PC05 <dbl>, PC06 <dbl>, PC07 <dbl>,
#> #   PC08 <dbl>, PC09 <dbl>, PC10 <dbl>, PC11 <dbl>, PC12 <dbl>, PC13 <dbl>,
#> #   PC14 <dbl>

Model Fitting using parsnip

Rethinking: How many machine learning packages you used?

One vegeta.. I mean package to rule them all: parsnip

parsnip is part of tidymodels that could help us in model fitting and prediction flows.

There are so many models supported by parsnip–as you could see in its full model list . In this introduction, we will use random forest as an example model.

There are two part of defining a model that should be noted:

  • Defining model’s specification: In this part, we need to define the model’s specification, such as mtry and trees for random forest, through model specific functions. For example, you can use rand_forest() to define a random forest specification. Make sure to check full model list to see every model and its available arguments.

  • Defining model’s engine: In this part, we need to define the model’s engine, which determines the package we will use to fit our model. This part could be done using set_engine() function. Note that in addition to defining which package we want to use as our engine, we could also passing package specific arguments to this function.

This is an example of defining a random forest model using randomForest::randomForest() as our engine:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
# set-up model specification
model_spec <- rand_forest(
  mode = "classification",
  mtry = ncol(data_train) - 2,
  trees = 500,
  min_n = 15
)

# set-up model engine
model_engine <- set_engine(
  object = model_spec,
  engine = "randomForest"
)

# quick check
model_engine
#> Random Forest Model Specification (classification)
#> 
#> Main Arguments:
#>   mtry = ncol(data_train) - 2
#>   trees = 500
#>   min_n = 15
#> 
#> Computational engine: randomForest

To fit our model, we have two options:

  • Formula interface
  • X-Y interface

Note that some packages are behaving differently inside formula and x-y interface. For example, randomForest::randomForest() would convert all of our categorical variables into dummy variables in formula interface, but not in x-y interface.

Fit using formula interface using fit() function:

1
2
3
4
5
6
7
8
9
# fit the model
model <- fit(
  object = model_engine,
  formula = attrition ~ .,
  data = data_train
)

# quick check
model
#> parsnip model object
#> 
#> 
#> Call:
#>  randomForest(x = maybe_data_frame(x), y = y, ntree = ~500, mtry = min_cols(~ncol(data_train) -      2, x), nodesize = min_rows(~15, x)) 
#>                Type of random forest: classification
#>                      Number of trees: 500
#> No. of variables tried at each split: 20
#> 
#>         OOB estimate of  error rate: 28.31%
#> Confusion matrix:
#>     yes  no class.error
#> yes 141  48   0.2539683
#> no   59 130   0.3121693

Or through x-y interface using fit_xy() function:

1
2
3
4
5
6
7
8
9
# fit the model
model <- fit_xy(
  object = model_engine,
  x = select(data_train, -attrition),
  y = select(data_train, attrition)
)

# quick check
model
#> parsnip model object
#> 
#> 
#> Call:
#>  randomForest(x = maybe_data_frame(x), y = y, ntree = ~500, mtry = min_cols(~ncol(data_train) -      2, x), nodesize = min_rows(~15, x)) 
#>                Type of random forest: classification
#>                      Number of trees: 500
#> No. of variables tried at each split: 20
#> 
#>         OOB estimate of  error rate: 29.89%
#> Confusion matrix:
#>     yes  no class.error
#> yes 136  53   0.2804233
#> no   60 129   0.3174603

In this workflow, it should be relatively easy to change the model engine.

Let’s try to fit a same model specification, but now using ranger::ranger():

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
# set-up other model engine
model_engine <- set_engine(
  object = model_spec,
  engine = "ranger",
  seed = 100,
  num.threads = parallel::detectCores() / 2,
  importance = "impurity"
)

# quick check
model_engine
#> Random Forest Model Specification (classification)
#> 
#> Main Arguments:
#>   mtry = ncol(data_train) - 2
#>   trees = 500
#>   min_n = 15
#> 
#> Engine-Specific Arguments:
#>   seed = 100
#>   num.threads = parallel::detectCores()/2
#>   importance = impurity
#> 
#> Computational engine: ranger

Now let’s try to fit the model, and see the result:

1
2
3
4
5
6
7
8
9
# fit the model
model <- fit(
  object = model_engine,
  formula = attrition ~ .,
  data = data_train
)

# quick check
model
#> parsnip model object
#> 
#> Ranger result
#> 
#> Call:
#>  ranger::ranger(x = maybe_data_frame(x), y = y, mtry = min_cols(~ncol(data_train) -      2, x), num.trees = ~500, min.node.size = min_rows(~15, x),      seed = ~100, num.threads = ~parallel::detectCores()/2, importance = ~"impurity",      verbose = FALSE, probability = TRUE) 
#> 
#> Type:                             Probability estimation 
#> Number of trees:                  500 
#> Sample size:                      378 
#> Number of independent variables:  21 
#> Mtry:                             20 
#> Target node size:                 15 
#> Variable importance mode:         impurity 
#> Splitrule:                        gini 
#> OOB prediction error (Brier s.):  0.2088251

Notice that ranger::ranger() doesn’t behave differently between fit() and fit_xy():

1
2
3
4
5
6
7
8
9
# fit the model
model <- fit_xy(
  object = model_engine,
  x = select(data_train, -attrition),
  y = select(data_train, attrition)
)

# quick check
model
#> parsnip model object
#> 
#> Ranger result
#> 
#> Call:
#>  ranger::ranger(x = maybe_data_frame(x), y = y, mtry = min_cols(~ncol(data_train) -      2, x), num.trees = ~500, min.node.size = min_rows(~15, x),      seed = ~100, num.threads = ~parallel::detectCores()/2, importance = ~"impurity",      verbose = FALSE, probability = TRUE) 
#> 
#> Type:                             Probability estimation 
#> Number of trees:                  500 
#> Sample size:                      378 
#> Number of independent variables:  21 
#> Mtry:                             20 
#> Target node size:                 15 
#> Variable importance mode:         impurity 
#> Splitrule:                        gini 
#> OOB prediction error (Brier s.):  0.2088251

To get the prediction, we could use predict() as usual–but note that it would return a tidied tibble instead of a vector, as in type = "class" cases, or a raw data.frame, as in type = "prob" cases.

In this way, the prediction results would be very convenient for further usage, such as simple recombining with the original dataset:

1
2
3
4
5
6
7
8
# get prediction on test
predicted <- data_test %>% 
  bind_cols(predict(model, data_test)) %>% 
  bind_cols(predict(model, data_test, type = "prob"))

# quick check
predicted %>% 
  select(attrition, matches(".pred"))
#> # A tibble: 295 × 4
#>    attrition .pred_class .pred_yes .pred_no
#>    <fct>     <fct>           <dbl>    <dbl>
#>  1 no        no              0.336    0.664
#>  2 no        no              0.161    0.839
#>  3 no        no              0.232    0.768
#>  4 no        yes             0.699    0.301
#>  5 yes       no              0.472    0.528
#>  6 yes       yes             0.635    0.365
#>  7 no        yes             0.542    0.458
#>  8 no        no              0.330    0.670
#>  9 yes       yes             0.790    0.210
#> 10 no        yes             0.553    0.447
#> # ℹ 285 more rows

Model Evaluation using yardstick

Rethinking: How do we measure the goodness of our model?

It’s always good to bring your own yardstick

yardstick is part of tidymodels that could help us in calculating model performance metrics.

There are so many metrics available by yardstick–as you could see in its full function references page . In this introduction, we will calculate some model performance metrics for classification task as an example.

There are two ways of calculating model performance metrics, which differ in its input and output:

  • tibble approach: We pass a dataset containing the truth and estimate to the function, and it will return a tibble containing the results, e.g., precision() function.
  • vector approach: We pass a vector as the truth and a vector as the estimate to the function, and it will return a vector which show the results, e.g., precision_vec() function.

Note that some function, like conf_mat() , only accept tibble approach, since it is not returned a vector of length one.

Let’s start by reporting the confusion matrix:

1
2
3
4
# show confusion matrix
predicted %>% 
  conf_mat(truth = attrition, estimate = .pred_class) %>% 
  autoplot(type = "heatmap")

Now, to calculate the performance metrics, let’s try to use the tibble approach–which also support group_by:

1
2
3
# calculate accuracy
predicted %>% 
  accuracy(attrition, .pred_class)
#> # A tibble: 1 × 3
#>   .metric  .estimator .estimate
#>   <chr>    <chr>          <dbl>
#> 1 accuracy binary         0.753
1
2
3
4
5
# calculate accuracy by group
predicted %>% 
  group_by(department) %>% 
  accuracy(attrition, .pred_class) %>% 
  ungroup()
#> # A tibble: 3 × 4
#>   department           .metric  .estimator .estimate
#>   <fct>                <chr>    <chr>          <dbl>
#> 1 human_resources      accuracy binary         0.8  
#> 2 research_development accuracy binary         0.790
#> 3 sales                accuracy binary         0.677

Or using vector approach, which is more flexible in general:

1
2
3
4
5
6
7
8
# metrics summary
predicted %>% 
  summarise(
    accuracy = accuracy_vec(attrition, .pred_class),
    sensitivity = sens_vec(attrition, .pred_class),
    specificity = spec_vec(attrition, .pred_class),
    precision = precision_vec(attrition, .pred_class)
  )
#> # A tibble: 1 × 4
#>   accuracy sensitivity specificity precision
#>      <dbl>       <dbl>       <dbl>     <dbl>
#> 1    0.753       0.667       0.769     0.360
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
# metrics summary by group
predicted %>% 
  group_by(department) %>% 
  summarise(
    accuracy = accuracy_vec(attrition, .pred_class),
    sensitivity = sens_vec(attrition, .pred_class),
    specificity = spec_vec(attrition, .pred_class),
    precision = precision_vec(attrition, .pred_class)
  ) %>% 
  ungroup()
#> # A tibble: 3 × 5
#>   department           accuracy sensitivity specificity precision
#>   <fct>                   <dbl>       <dbl>       <dbl>     <dbl>
#> 1 human_resources         0.8         0.5         0.909     0.667
#> 2 research_development    0.790       0.609       0.816     0.326
#> 3 sales                   0.677       0.762       0.654     0.372

Sometimes the model performance metrics could also improving the models final results. For example, through Receiver Operating Curve, we could assess the probability threshold effect to sensitivity and specificity:

1
2
3
predicted %>% 
  roc_curve(attrition, .pred_yes) %>% 
  autoplot()

And, since it’s returning a tibble, we could do further data wrangling to help us see it more clearly:

1
2
3
4
5
6
# get roc curve data on test dataset
pred_test_roc <- predicted %>%
  roc_curve(attrition, .pred_yes)

# quick check
pred_test_roc
#> # A tibble: 297 × 3
#>    .threshold specificity sensitivity
#>         <dbl>       <dbl>       <dbl>
#>  1  -Inf          0                 1
#>  2     0.0546     0                 1
#>  3     0.0699     0.00405           1
#>  4     0.0946     0.00810           1
#>  5     0.103      0.0121            1
#>  6     0.106      0.0162            1
#>  7     0.112      0.0202            1
#>  8     0.120      0.0243            1
#>  9     0.123      0.0283            1
#> 10     0.128      0.0324            1
#> # ℹ 287 more rows

With some ggplot2:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
# tidying
pred_test_roc <- pred_test_roc %>% 
  mutate_if(~ is.numeric(.), ~ round(., 4)) %>% 
  gather(metric, value, -.threshold)

# plot sensitivity-specificity trade-off
p <- ggplot(pred_test_roc, aes(x = .threshold, y = value)) +
  geom_line(aes(colour = metric)) +
  labs(x = "Probability Threshold to be Classified as Positive", y = "Value", colour = "Metrics") +
  theme_minimal()

and plotly magic, it would be perfect:

1
2
# convert to plotly
ggplotly(p)
Built with Hugo
Theme Stack designed by Jimmy