1. Setup
Libraries and Setup
We'll set up caching for this notebook, because some of the code we will write is computationally expensive.
You will need to use install.packages() to install any packages that are not already installed on your machine. You then load each package into your workspace using the library() function.
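A minimal sketch of this setup step follows. The exact package list is an assumption (the article names purrr, caret and dplyr; the others are the helpers those examples typically rely on), and the CRAN mirror URL is just one common choice.

```r
# Packages used in this walkthrough (the exact list is an assumption).
required_pkgs <- c("knitr", "dplyr", "purrr", "tidyr", "tibble",
                   "caret", "rpart", "randomForest")

# Install only what is missing, then attach everything.
missing_pkgs <- setdiff(required_pkgs, rownames(installed.packages()))
if (length(missing_pkgs) > 0) {
  install.packages(missing_pkgs, repos = "https://cloud.r-project.org")
}
invisible(lapply(required_pkgs, library, character.only = TRUE))

# Cache chunk results so expensive model fits are not re-run on every knit.
knitr::opts_chunk$set(cache = TRUE)
```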
2. Nested Dataframe
You'll learn how to use purrr, caret and dplyr to quickly create several dataset + model combinations, store the data and model objects neatly in one tibble, and post-process them programmatically. These tools enable succinct functional programming in which a lot gets done with just a few lines of code. The data used is loan.csv. In this article we will predict the default variable, which takes the value yes or no.
The loan data is divided into two sets, loan and test. The loan set is used to train the classification models, and the test set is used to evaluate the models once they have been built.
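The split can be sketched as below. Because loan.csv is not bundled here, the example builds a small toy stand-in; every column except default and checking_balance is hypothetical, and the 80/20 ratio is an assumption rather than the article's setting.

```r
# Toy stand-in for loan.csv. With the real file you would start from
# loan_all <- read.csv("loan.csv", stringsAsFactors = TRUE)
set.seed(42)
n <- 200
amount <- round(runif(n, 250, 18000))
loan_all <- data.frame(
  checking_balance = factor(sample(c("< 0 DM", "1 - 200 DM", "> 200 DM", "unknown"),
                                   n, replace = TRUE)),
  amount   = amount,                          # hypothetical predictor
  age      = sample(19:75, n, replace = TRUE), # hypothetical predictor
  default  = factor(ifelse(amount + rnorm(n, sd = 4000) > 9000, "yes", "no"))
)

# Hold out 20% of the rows as test data (the ratio is an assumption).
train_idx <- sample(seq_len(n), size = floor(0.8 * n))
loan <- loan_all[train_idx, ]   # used to fit the models
test <- loan_all[-train_idx, ]  # used only for evaluation
```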
2.1 Single Data Frame x Multiple Models
Before creating a nested dataframe, we must first prepare the models. Each model is wrapped in a function so that it is easy to apply with the map() family of functions from the purrr package. Wrapping the model in a function also lets us fix the parameters it will use. Two models are created here, a decision tree and a random forest, using the caret package.
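One way to write such wrapper functions is sketched below. The names treeModel and rfModel, the 5-fold cross-validation and the tuning settings are assumptions, not the article's original values. The smoke test uses R's built-in iris data only to show that the wrapper runs.

```r
library(caret)  # the "rpart" and "rf" methods also need rpart and randomForest installed

# Decision tree wrapper: X is a data frame of predictors, Y a factor outcome.
treeModel <- function(X, Y) {
  ctrl <- trainControl(method = "cv", number = 5)
  train(x = as.data.frame(X), y = Y, method = "rpart",
        trControl = ctrl, tuneLength = 3)
}

# Random forest wrapper with the same interface.
rfModel <- function(X, Y) {
  ctrl <- trainControl(method = "cv", number = 5)
  train(x = as.data.frame(X), y = Y, method = "rf",
        trControl = ctrl, tuneGrid = data.frame(mtry = 2), ntree = 100)
}

# Quick smoke test on a built-in dataset.
set.seed(1)
fit <- treeModel(iris[, 1:4], iris$Species)
```

Because both wrappers share the signature (X, Y), they can be stored side by side in a list-column and called uniformly.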
After defining the models as functions, we store them in a dataframe.
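A minimal sketch of this step uses tibble::enframe() to turn a named list of functions into a two-column tibble. The wrapper bodies are shortened stand-ins for illustration, and the labels rpart and rf are assumed names.

```r
library(tibble)

# Shortened stand-ins for the model wrappers (not called in this example).
treeModel <- function(X, Y) caret::train(x = X, y = Y, method = "rpart")
rfModel   <- function(X, Y) caret::train(x = X, y = Y, method = "rf")

# A named list becomes one row per element: the name goes to modelName,
# the function itself goes into the list-column model.
model_list <- enframe(list(rpart = treeModel, rf = rfModel),
                      name = "modelName", value = "model")
model_list
```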
model_list has two columns, modelName and model. modelName holds the name of each model, and model holds the corresponding model function.
Next, the dataframe is replicated once for each model we want to use. Here the loan dataset is replicated as many times as there are models, using the rep() function.
nested.loan has two columns, Id and rawdata, where rawdata contains the loan dataframe. rawdata is then separated into train.y, which contains the default variable, and train.x, which contains the remaining variables.
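The replication with rep() and the split into train.x and train.y can be sketched together as follows. The toy loan data stands in for loan.csv (its amount and age columns are hypothetical), and n_models is hard-coded to 2 for illustration.

```r
library(dplyr)
library(purrr)
library(tibble)

# Toy stand-in for the loan training data.
set.seed(42)
n <- 100
loan <- data.frame(
  amount  = round(runif(n, 250, 18000)),
  age     = sample(19:75, n, replace = TRUE),
  default = factor(sample(c("no", "yes"), n, replace = TRUE))
)

n_models <- 2  # one copy of the data per model

# Replicate the data frame and store the copies in a list-column.
nested.loan <- list(loan) %>%
  rep(n_models) %>%
  enframe(name = "Id", value = "rawdata")

# Separate predictors (train.x) from the outcome (train.y) for every copy.
nested.loan <- nested.loan %>%
  mutate(train.x = map(rawdata, ~ select(.x, -default)),
         train.y = map(rawdata, ~ .x$default))
```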
The next step is to join nested.loan with model_list using bind_cols().
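A small sketch of the column bind is shown below, with placeholder contents so that it runs on its own. Note that bind_cols() matches rows by position, so both tibbles must have the same number of rows in the intended order.

```r
library(dplyr)
library(tibble)

# Placeholder versions of the two tibbles (contents are illustrative only).
nested.loan <- tibble(Id = 1:2,
                      rawdata = list(head(iris), head(iris)))
model_list  <- tibble(modelName = c("rpart", "rf"),
                      model = list(function(X, Y) NULL, function(X, Y) NULL))

# Row i of nested.loan is paired with row i of model_list.
nested.loan <- bind_cols(nested.loan, model_list)
names(nested.loan)
```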
The models we have created can be fitted with the invoke_map() function, which calls each function in a list with its matching list of parameters.
To see how well each model performs, we can look at its accuracy.
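The fitting and accuracy steps can be sketched end to end as below. This is not the article's original chunk: invoke_map() is deprecated as of purrr 1.0.0, so the sketch swaps in pmap() to call each stored function on its own data. The toy data, the 3-fold cross-validation and the tuning values are assumptions, so the accuracies will not match the figures reported in the article.

```r
library(dplyr)
library(purrr)
library(tibble)
library(caret)

# Toy stand-in for the loan training data (amount, age, duration are hypothetical).
set.seed(42)
n <- 200
amount <- round(runif(n, 250, 18000))
loan <- data.frame(
  amount   = amount,
  age      = sample(19:75, n, replace = TRUE),
  duration = sample(6:60, n, replace = TRUE),
  default  = factor(ifelse(amount + rnorm(n, sd = 4000) > 9000, "yes", "no"))
)

ctrl <- trainControl(method = "cv", number = 3)
treeModel <- function(X, Y) train(x = as.data.frame(X), y = Y, method = "rpart",
                                  trControl = ctrl)
rfModel   <- function(X, Y) train(x = as.data.frame(X), y = Y, method = "rf",
                                  trControl = ctrl, ntree = 50,
                                  tuneGrid = data.frame(mtry = 2))
model_list <- enframe(list(rpart = treeModel, rf = rfModel),
                      name = "modelName", value = "model")

nested.loan <- list(loan) %>%
  rep(nrow(model_list)) %>%
  enframe(name = "Id", value = "rawdata") %>%
  mutate(train.x = map(rawdata, ~ select(.x, -default)),
         train.y = map(rawdata, ~ .x$default)) %>%
  bind_cols(model_list)

# Call each model function on its own train.x / train.y, then pull the best
# cross-validated accuracy out of each caret fit.
loan.fits <- nested.loan %>%
  mutate(modelFits = pmap(list(model, train.x, train.y),
                          function(f, X, Y) f(X, Y)),
         Accuracy  = map_dbl(modelFits, ~ max(.x$results$Accuracy)))

loan.fits %>% select(modelName, Accuracy)
```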
From the above results it can be seen that the random forest model produces an accuracy of 0.75 and the decision tree an accuracy of 0.74. Next, we predict on the test data using the fitted models. The test data must be replicated once per model and then joined to the nested.loan data using left_join().
Now we create a pred variable that contains the results of predict().
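The replication of the test data, the left_join() and the prediction step might look like the sketch below. The names nested.test and testdata, the toy data and the 80/20 split are assumptions. Only a decision tree wrapper is used twice here to keep the example short.

```r
library(dplyr)
library(purrr)
library(tibble)
library(caret)

# Toy stand-in data, split into loan (train) and test.
set.seed(42)
n <- 200
amount <- round(runif(n, 250, 18000))
loan_all <- data.frame(
  amount  = amount,
  age     = sample(19:75, n, replace = TRUE),
  default = factor(ifelse(amount + rnorm(n, sd = 4000) > 9000, "yes", "no"))
)
idx  <- sample(seq_len(n), size = floor(0.8 * n))
loan <- loan_all[idx, ]
test <- loan_all[-idx, ]

ctrl <- trainControl(method = "cv", number = 3)
treeModel <- function(X, Y) train(x = as.data.frame(X), y = Y, method = "rpart",
                                  trControl = ctrl)
model_list <- enframe(list(tree_a = treeModel, tree_b = treeModel),
                      name = "modelName", value = "model")

# Fit one model per row, as in the previous steps.
nested.loan <- list(loan) %>%
  rep(nrow(model_list)) %>%
  enframe(name = "Id", value = "rawdata") %>%
  bind_cols(model_list) %>%
  mutate(modelFits = map2(model, rawdata,
                          ~ .x(select(.y, -default), .y$default)))

# Replicate the test data once per model and join it on Id.
nested.test <- list(test) %>%
  rep(nrow(model_list)) %>%
  enframe(name = "Id", value = "testdata")

result <- nested.loan %>%
  left_join(nested.test, by = "Id") %>%
  mutate(pred = map2(modelFits, testdata,
                     ~ predict(.x, newdata = select(.y, -default))))
```

Each element of pred is a factor of predicted classes with one entry per test row, so it can be compared directly with testdata$default.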
2.2 Multiple Data Frames x Single Model
Now we will split the loan data by the checking_balance variable, which has 4 levels: < 0 DM, > 200 DM, 1 - 200 DM, and unknown.
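One way to do this split is group_by() followed by tidyr::nest(), sketched below on toy data. The name nested.split is taken from the later section of the article; the toy columns other than checking_balance and default are hypothetical.

```r
library(dplyr)
library(purrr)
library(tidyr)

# Toy stand-in for the loan training data.
set.seed(42)
n <- 200
loan <- data.frame(
  checking_balance = factor(sample(c("< 0 DM", "1 - 200 DM", "> 200 DM", "unknown"),
                                   n, replace = TRUE)),
  amount  = round(runif(n, 250, 18000)),
  age     = sample(19:75, n, replace = TRUE),
  default = factor(sample(c("no", "yes"), n, replace = TRUE))
)

# One row per checking_balance level; the matching rows go into rawdata.
nested.split <- loan %>%
  group_by(checking_balance) %>%
  nest() %>%
  ungroup() %>%
  rename(rawdata = data) %>%
  mutate(train.x = map(rawdata, ~ select(.x, -default)),
         train.y = map(rawdata, ~ .x$default))
```

The grouping column is dropped from each nested data frame, so checking_balance does not appear among the predictors of the per-group models.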
The model that will be used is a random forest.
To see how well the model performs on each subset, we can look at the accuracy obtained from each fit.
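Fitting one random forest per subset and reading off its accuracy can be sketched as follows. The toy data, the rfModel settings (3-fold cross-validation, mtry = 2, 50 trees) and the column names modelFits and Accuracy are assumptions.

```r
library(dplyr)
library(purrr)
library(tidyr)
library(caret)

# Toy stand-in for the loan training data.
set.seed(42)
n <- 200
amount <- round(runif(n, 250, 18000))
loan <- data.frame(
  checking_balance = factor(sample(c("< 0 DM", "1 - 200 DM", "> 200 DM", "unknown"),
                                   n, replace = TRUE)),
  amount   = amount,
  age      = sample(19:75, n, replace = TRUE),
  duration = sample(6:60, n, replace = TRUE),
  default  = factor(ifelse(amount + rnorm(n, sd = 4000) > 9000, "yes", "no"))
)

rfModel <- function(X, Y) {
  train(x = as.data.frame(X), y = Y, method = "rf", ntree = 50,
        trControl = trainControl(method = "cv", number = 3),
        tuneGrid = data.frame(mtry = 2))
}

# Split by checking_balance, fit the same model to each subset,
# and keep the best cross-validated accuracy per subset.
nested.split <- loan %>%
  group_by(checking_balance) %>%
  nest() %>%
  ungroup() %>%
  rename(rawdata = data) %>%
  mutate(train.x   = map(rawdata, ~ select(.x, -default)),
         train.y   = map(rawdata, ~ .x$default),
         modelFits = map2(train.x, train.y, rfModel),
         Accuracy  = map_dbl(modelFits, ~ max(.x$results$Accuracy)))

nested.split %>% select(checking_balance, Accuracy)
```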
2.3 Multiple Data Frames x Multiple Models
To run multiple models against multiple datasets, we must repeat each dataset once for every model that will be used. nested.split is the loan data divided by the checking_balance variable, which has 4 levels, and there are 2 models, random forest and decision tree, so the resulting table has 8 rows (4 x 2).
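The 4 x 2 crossing can be sketched with a simple index grid, as below. The name nested.multi is hypothetical, the rawdata entries are placeholders, and the model functions are stand-ins that are not called here.

```r
library(dplyr)
library(tibble)

# Placeholder nested.split: one row per checking_balance level.
nested.split <- tibble(
  checking_balance = c("< 0 DM", "1 - 200 DM", "> 200 DM", "unknown"),
  rawdata = rep(list(head(iris)), 4)  # placeholder data frames
)
model_list <- tibble(
  modelName = c("rpart", "rf"),
  model = list(function(X, Y) caret::train(x = X, y = Y, method = "rpart"),
               function(X, Y) caret::train(x = X, y = Y, method = "rf"))
)

# Every (data row, model row) pair; model_row varies fastest.
grid <- expand.grid(model_row = seq_len(nrow(model_list)),
                    data_row  = seq_len(nrow(nested.split)))

# Repeat the rows of each tibble according to the grid and bind them.
nested.multi <- bind_cols(nested.split[grid$data_row, ],
                          model_list[grid$model_row, ])
nrow(nested.multi)
```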
Now we can fit each model to each data category.
And we can see the accuracy of each model.
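Putting the pieces together, the sketch below fits both models to all four subsets and tabulates the accuracies. As before, the toy data, the wrapper settings and the name nested.multi are assumptions, and pmap() is used in place of the deprecated invoke_map().

```r
library(dplyr)
library(purrr)
library(tidyr)
library(tibble)
library(caret)

# Toy stand-in for the loan training data.
set.seed(42)
n <- 240
amount <- round(runif(n, 250, 18000))
loan <- data.frame(
  checking_balance = factor(sample(c("< 0 DM", "1 - 200 DM", "> 200 DM", "unknown"),
                                   n, replace = TRUE)),
  amount   = amount,
  age      = sample(19:75, n, replace = TRUE),
  duration = sample(6:60, n, replace = TRUE),
  default  = factor(ifelse(amount + rnorm(n, sd = 4000) > 9000, "yes", "no"))
)

ctrl <- trainControl(method = "cv", number = 3)
treeModel <- function(X, Y) train(x = as.data.frame(X), y = Y, method = "rpart",
                                  trControl = ctrl)
rfModel   <- function(X, Y) train(x = as.data.frame(X), y = Y, method = "rf",
                                  trControl = ctrl, ntree = 50,
                                  tuneGrid = data.frame(mtry = 2))
model_list <- enframe(list(rpart = treeModel, rf = rfModel),
                      name = "modelName", value = "model")

nested.split <- loan %>%
  group_by(checking_balance) %>%
  nest() %>%
  ungroup() %>%
  rename(rawdata = data)

# Cross the 4 subsets with the 2 models (8 rows), then fit and score each pair.
grid <- expand.grid(model_row = seq_len(nrow(model_list)),
                    data_row  = seq_len(nrow(nested.split)))
nested.multi <- bind_cols(nested.split[grid$data_row, ],
                          model_list[grid$model_row, ]) %>%
  mutate(train.x   = map(rawdata, ~ select(.x, -default)),
         train.y   = map(rawdata, ~ .x$default),
         modelFits = pmap(list(model, train.x, train.y),
                          function(f, X, Y) f(X, Y)),
         Accuracy  = map_dbl(modelFits, ~ max(.x$results$Accuracy)))

nested.multi %>% select(checking_balance, modelName, Accuracy)
```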