Machine Learning the 'Tidy' Way

A tutorial on K-nearest neighbor and boosted trees using the tidymodels framework

Introduction to machine learning with tidymodels

Tidymodels provides a clean, organized, and, most importantly, consistent programming syntax for data pre-processing, model specification, model fitting, model evaluation, and prediction.

Anatomy of tidymodels

tidymodels is a meta-package that installs and loads the core packages listed below, which you need for modeling and machine learning:

rsample: provides infrastructure for efficient data splitting and resampling

parsnip: a tidy, unified interface to models that can be used to try a range of models without getting bogged down in the syntactical minutiae of the underlying packages

recipes: a tidy interface to data pre-processing tools for feature engineering

workflows: workflows bundle your pre-processing, modeling, and post-processing together

tune: helps you optimize the hyperparameters of your model and pre-processing steps

yardstick: measures the effectiveness of models using performance metrics

dials: contains tools to create and manage values of tuning parameters and is designed to integrate well with the parsnip package

broom: summarizes key information about models in tidy tibble()s

First, let’s load the tidymodels meta-package (along with the tidyverse):

library(tidymodels)
library(tidyverse)
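
To preview how these pieces fit together before diving into the amphibian data, here is a minimal, purely illustrative regression sketch on the built-in mtcars data (not part of the original analysis; the variable names car_split, car_rec, car_mod, car_wf, and car_res are hypothetical):

# Minimal end-to-end sketch showing how the core packages compose
set.seed(1)
car_split <- initial_split(mtcars, prop = 0.8)              # rsample
car_rec <- recipe(mpg ~ ., data = training(car_split)) %>%  # recipes
  step_normalize(all_predictors())
car_mod <- linear_reg() %>% set_engine("lm")                # parsnip
car_wf <- workflow() %>%                                    # workflows
  add_recipe(car_rec) %>%
  add_model(car_mod)
car_res <- last_fit(car_wf, car_split)                      # tune
collect_metrics(car_res)                                    # yardstick metrics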

Data

I’ll demonstrate its features using an existing data set, AmphiBIO, from Brunno Oliveira:

Amphibio data

Download data:

# install.packages("downloader")
# library(downloader)
# 
# url <- "https://ndownloader.figstatic.com/files/8828578"
# download(url, dest="dial_broom/amphibio.zip", mode="wb") 
# unzip("dial_broom/amphibio.zip", exdir = "./dial_broom")

library(readr)

amphibio_raw <- read_csv("AmphiBIO_v1.csv")

The data consist of natural history information for amphibians, including habitat type, diet, body size, etc.

Here’s the breakdown of taxonomic spread in the data:

  • Order: N = 3
  • Family: N = 61
  • Genera: N = 531
  • Species: N = 6776
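
These counts can be recomputed directly from the raw data; a quick sketch, assuming the taxonomic columns in AmphiBIO_v1.csv are named Order, Family, Genus, and Species:

# Recompute the taxonomic counts from the raw data
# (assumes columns named Order, Family, Genus, and Species)
amphibio_raw %>%
  summarize(across(c(Order, Family, Genus, Species), n_distinct))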

There are also a lot of missing data, and the variables that do exist are on wildly different scales. We’ll clean this up:

# Select the traits of interest, drop rows with missing values,
# log-transform the skewed variables, and keep only Anura and Caudata
amphibio <- amphibio_raw %>%
  select("Order"
         ,"Body_mass_g"
         ,"Body_size_mm"
         ,"Litter_size_min_n"
         ,"Litter_size_max_n"
         ,"Reproductive_output_y"
         ) %>%
  na.omit() %>%
  mutate(Body_mass_g = log(Body_mass_g),
         Body_size_mm = log(Body_size_mm),
         Litter_size_min_n = log(Litter_size_min_n),
         Litter_size_max_n = log(Litter_size_max_n),
         Reproductive_output_y = log(Reproductive_output_y)) %>%
  filter(Order != "Gymnophiona")
  
amphibio %>%
  group_by(Order) %>%
  summarize(n = n())

Now let’s have a peek at the data:

amphibio %>% 
  pivot_longer(!Order, names_to = "Metric", values_to = "Value") %>%
  ggplot(aes(Order, Value, col = Order)) + 
    geom_boxplot() + 
    facet_wrap(~Metric)

There are some trends in the data:

  • caudates tend to have larger body sizes
  • anurans have larger litter sizes

Given these data, one possible modeling application is to predict Order from the traits above using two models: K-nearest neighbors (KNN) and boosted regression trees.

To start the modeling process, we’ll use rsample to split the data into training and testing sets and to create cross-validation folds.

set.seed(42)

tidy_split <- initial_split(amphibio, prop = 0.95)
tidy_train <- training(tidy_split)
tidy_test <- testing(tidy_split)
tidy_kfolds <- vfold_cv(tidy_train)
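
initial_split() draws a random split and vfold_cv() defaults to 10 folds; as an optional sanity check (not part of the original code), you can confirm the split sizes and class balance:

# Optional check: split sizes and class balance in the training set
dim(tidy_train)
dim(tidy_test)
tidy_train %>% count(Order)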

We can use recipes to preprocess the data:

# Recipes package 
## For preprocessing, feature engineering, and feature elimination 
tidy_rec <- recipe(Order ~ ., data = tidy_train) %>% 
  step_dummy(all_nominal(), -all_outcomes()) %>% 
  step_normalize(all_predictors()) %>%
  prep()
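
Because the recipe is already prep()-ed, you can inspect the processed training data it produces; a quick optional check, not required for the rest of the workflow:

# Peek at the processed training set produced by the prepped recipe
# (new_data = NULL returns the training data in recent recipes versions;
# older versions use juice() instead)
bake(tidy_rec, new_data = NULL) %>% glimpse()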

Now that we’ve created a recipe to process the data for modeling, we can use parsnip to model the data:

First, let’s have a look at each model’s description:

library("webshot")
# ?boost_tree

boost_tree()

Description

boost_tree() defines a model that creates a series of decision trees forming an ensemble. Each tree depends on the results of previous trees. All trees in the ensemble are combined to produce a final prediction.

There are different ways to fit this model. See the engine-specific pages for more details:

  • xgboost (default)

  • C5.0

  • spark

    ?nearest_neighbor

nearest_neighbor() defines a model that uses the K most similar data points from the training set to predict new samples.

There are different ways to fit this model. See the engine-specific pages for more details:

  • kknn (default)

Now, let’s specify the models (the tune() placeholders mark the hyperparameters we’ll optimize later):

# Parsnip package 
## Standardized api for creating models 
tidy_boosted_model <- boost_tree(trees = tune(),
                                min_n = tune(),
                                learn_rate = tune()) %>% 
  set_mode("classification") %>% 
  set_engine("xgboost")

tidy_knn_model <- nearest_neighbor(neighbors = tune()) %>% 
  set_mode("classification") %>% 
  set_engine("kknn")

Our basic model specifications are complete, but the tune() placeholders still need values, so next we’ll use dials (and then tune) to optimize the hyperparameters.

dials

For boosted regression trees, there are three basic tuning parameters (each backed by a dials parameter object; see the sketch after the list below):

parameters(tidy_boosted_model)

## Collection of 3 parameters for tuning
## 
##  identifier       type    object
##       trees      trees nparam[+]
##       min_n      min_n nparam[+]
##  learn_rate learn_rate nparam[+]

  • trees: An integer for the number of trees contained in the ensemble.
  • min_n: An integer for the minimum number of data points in a node that is required for the node to be split further.
  • learn_rate: A number for the rate at which the boosting algorithm adapts from iteration-to-iteration (specific engines only).
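
Each of these corresponds to a dials parameter object with a default range; printing them is a quick way to see the values the grids below are drawn from:

# Each tuning parameter is backed by a dials parameter object with a
# default range that grid functions draw from
trees()
min_n()
learn_rate()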

KNN has a single parameter to tune: neighbors.

parameters(tidy_knn_model)

## Collection of 1 parameters for tuning
## 
##  identifier      type    object
##   neighbors neighbors nparam[+]

  • neighbors: A single integer for the number of neighbors to consider (often called k). For kknn, a value of 5 is used if neighbors is not specified.

So we can use dials to define the candidate parameter values, which can then be evaluated with tune.

# Dials creates the parameter grids 
# Tune applies the parameter grids to the models 
# Dials package 
boosted_params <- 5   # levels (regular grid) or size (random / max-entropy grids)
knn_params <- 10      # levels or size for the knn grid

?grid_regular

boosted_grid <- grid_regular(parameters(tidy_boosted_model), levels = boosted_params)
boosted_grid

## # A tibble: 125 x 3
##    trees min_n   learn_rate
##    <int> <int>        <dbl>
##  1     1     2 0.0000000001
##  2   500     2 0.0000000001
##  3  1000     2 0.0000000001
##  4  1500     2 0.0000000001
##  5  2000     2 0.0000000001
##  6     1    11 0.0000000001
##  7   500    11 0.0000000001
##  8  1000    11 0.0000000001
##  9  1500    11 0.0000000001
## 10  2000    11 0.0000000001
## # ... with 115 more rows

knn_grid <- grid_regular(parameters(tidy_knn_model), levels = knn_params)
knn_grid

## # A tibble: 10 x 1
##    neighbors
##        <int>
##  1         1
##  2         2
##  3         4
##  4         5
##  5         7
##  6         8
##  7        10
##  8        11
##  9        13
## 10        15

Implement the tuning grids using tune:

tune

# install.packages(c("xgboost", "kknn"))
library(xgboost)
library(kknn)

# Tune package 
# system.time(
#   boosted_tune <- tune_grid(tidy_boosted_model,
#                             tidy_rec,
#                             resamples = tidy_kfolds,
#                             grid = boosted_grid)
# )
# write_rds(boosted_tune, "boosted_tune.rds")
boosted_tune <- read_rds("boosted_tune.rds")

# system.time(
#   knn_tune <- tune_grid(tidy_knn_model,
#                         tidy_rec,
#                         resamples = tidy_kfolds,
#                         grid = knn_grid)
# ) 
# write_rds(knn_tune, "knn_tune.rds")
knn_tune <- read_rds("knn_tune.rds")

# Use tune::select_best() to extract the best parameters by ROC AUC
boosted_param <- boosted_tune %>% select_best("roc_auc")
knn_param <- knn_tune %>% select_best("roc_auc")
# Apply the best parameters to the model specifications
tidy_boosted_model_final <- finalize_model(tidy_boosted_model, boosted_param)
tidy_knn_model_final <- finalize_model(tidy_knn_model, knn_param)
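
If you want to see more than the single best combination, tune::show_best() lists the top candidates for a given metric; a quick, optional check:

# show_best() returns the top candidates rather than just the single best,
# which helps show how sensitive performance is to the parameters
boosted_tune %>% show_best("roc_auc", n = 5)
knn_tune %>% show_best("roc_auc", n = 5)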

Now we’ll try different options from dials for parameter tuning, using two additional methods for grid specification:

  • random grid with dials::grid_random
  • maximum entropy grid with dials::grid_max_entropy

grid_random

boosted_grid_rand <- grid_random(parameters(tidy_boosted_model), size = boosted_params)
boosted_grid_rand

## # A tibble: 5 x 3
##   trees min_n learn_rate
##   <int> <int>      <dbl>
## 1   190    21   2.32e- 5
## 2  1816    12   3.60e- 8
## 3   293    28   3.14e-10
## 4   314     8   2.52e- 7
## 5  1363     5   5.92e- 6

knn_grid_rand <- grid_random(parameters(tidy_knn_model), size = knn_params)
knn_grid_rand

## # A tibble: 7 x 1
##   neighbors
##       <int>
## 1         1
## 2        10
## 3         5
## 4         3
## 5        11
## 6         8
## 7         2

# system.time(
#   boosted_tune_rand <- tune_grid(tidy_boosted_model,
#                                  tidy_rec,
#                                  resamples = tidy_kfolds,
#                                  grid = boosted_grid_rand)
# )
# write_rds(boosted_tune_rand, "boosted_tune_rand.rds")
boosted_tune_rand <- read_rds("boosted_tune_rand.rds")

# system.time(
#   knn_tune_rand <- tune_grid(tidy_knn_model,
#                              tidy_rec,
#                              resamples = tidy_kfolds,
#                              grid = knn_grid_rand)
# )
# write_rds(knn_tune_rand, "knn_tune_rand.rds")
knn_tune_rand <- read_rds("knn_tune_rand.rds")

# Use tune::select_best() to extract the best parameters by ROC AUC
boosted_param_rand <- boosted_tune_rand %>% select_best("roc_auc")
knn_param_rand <- knn_tune_rand %>% select_best("roc_auc")

grid_max_entropy

boosted_grid_maxent <- grid_max_entropy(parameters(tidy_boosted_model), size = boosted_params)
boosted_grid_maxent

## # A tibble: 5 x 3
##   trees min_n learn_rate
##   <int> <int>      <dbl>
## 1   433    25   4.27e-10
## 2  1671    13   3.28e-10
## 3  1520     3   3.21e- 6
## 4   672     3   3.06e-10
## 5  1371    22   2.32e- 5

knn_grid_maxent <- grid_max_entropy(parameters(tidy_knn_model), size = knn_params)
knn_grid_maxent

## # A tibble: 10 x 1
##    neighbors
##        <int>
##  1         3
##  2        10
##  3         1
##  4        15
##  5        13
##  6         4
##  7         6
##  8         8
##  9         9
## 10        11

# system.time(
#   boosted_tune_maxent <- tune_grid(tidy_boosted_model,
#                                    tidy_rec,
#                                    resamples = tidy_kfolds,
#                                    grid = boosted_grid_maxent)
# )
# write_rds(boosted_tune_maxent, "boosted_tune_maxent.rds")
boosted_tune_maxent <- read_rds("boosted_tune_maxent.rds")

# system.time(
#   knn_tune_maxent <- tune_grid(tidy_knn_model,
#                                tidy_rec,
#                                resamples = tidy_kfolds,
#                                grid = knn_grid_maxent)
# )
# write_rds(knn_tune_maxent, "knn_tune_maxent.rds")
knn_tune_maxent <- read_rds("knn_tune_maxent.rds")

# Use tune::select_best() to extract the best parameters by ROC AUC
boosted_param_maxent <- boosted_tune_maxent %>% select_best("roc_auc")
knn_param_maxent <- knn_tune_maxent %>% select_best("roc_auc")

workflows

For bundling the model, tuned parameters, and preprocessing into a single object

boosted_wf <- workflow() %>% 
  add_model(tidy_boosted_model_final) %>% 
  add_recipe(tidy_rec)

knn_wf <- workflow() %>% 
  add_model(tidy_knn_model_final) %>% 
  add_recipe(tidy_rec)

yardstick

For measuring model performance. Here, last_fit() (from tune) fits each workflow on the training set and evaluates it on the test set using yardstick metrics:

boosted_res <- last_fit(boosted_wf, tidy_split)
knn_res <- last_fit(knn_wf, tidy_split)

mods <- bind_rows(
  boosted_res %>% mutate(model = "xgb"),
  knn_res %>% mutate(model = "knn")) %>% 
  unnest(.metrics)
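
An alternative to unnesting the .metrics and .predictions columns by hand is to use tune’s collector helpers; a minimal sketch:

# collect_metrics()/collect_predictions() pull results out of last_fit() objects
collect_metrics(boosted_res)
collect_predictions(knn_res)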

ggplot(bind_rows(mods$.predictions), aes(Order, .pred_Anura)) + 
  geom_boxplot()
ggplot(bind_rows(mods$.predictions), aes(Order, .pred_Caudata)) + 
  geom_boxplot()
ggplot(mods, aes(x = model, y = .estimate, col = model)) + 
  geom_point() + 
  facet_wrap(~.metric)

Confusion matrix to visualize model predictions against truth

boosted_res %>% unnest(.predictions) %>% 
  conf_mat(truth = Order, estimate = .pred_class) %>%
  autoplot()
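
Beyond the plot, calling summary() on the same conf_mat object returns a table of derived metrics (accuracy, kappa, sensitivity, specificity, and others); a quick sketch:

# Derived metrics from the confusion matrix as a tidy tibble
boosted_res %>% unnest(.predictions) %>% 
  conf_mat(truth = Order, estimate = .pred_class) %>%
  summary()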

Fit the final workflows to the entire data set:

final_boosted_model <- fit(boosted_wf, amphibio)

## [15:25:37] WARNING: amalgamation/../src/learner.cc:1095: Starting in XGBoost 1.3.0, the default evaluation metric used with the objective 'binary:logistic' was changed from 'error' to 'logloss'. Explicitly set eval_metric if you'd like to restore the old behavior.

final_knn_model <- fit(knn_wf, amphibio)

broom

Now we can use broom to tidy the results from these models, and provide an intuitive view of their meaning!

augment()

First, we’ll use augment() to obtain predictions from each model; for classification fits, it binds the predicted class and class probabilities onto the data set we supply.

boosted_aug <- augment(final_boosted_model, new_data = amphibio[,-1])
knn_aug <- augment(final_knn_model, new_data = amphibio[,-1])

boosted_aug_long <- boosted_aug %>%
  pivot_longer(-c(.pred_class, .pred_Anura, .pred_Caudata), names_to = "predictor", values_to = "value") 

Now we can evaluate the models using yardstick!

yardstick

final_boosted_model %>%
  predict(bake(tidy_rec, new_data = tidy_test), type = "prob") %>%
  bind_cols(tidy_test) %>%
  roc_auc(factor(Order), .pred_Anura)

## # A tibble: 1 x 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 roc_auc binary         0.759

final_boosted_model %>%
  predict(bake(tidy_rec, new_data = tidy_test), type = "prob") %>%
  bind_cols(tidy_test) %>%
  roc_curve(factor(Order), .pred_Anura) %>%
  autoplot() 

Evaluating the KNN model:

final_knn_model %>%
  predict(bake(tidy_rec, new_data = tidy_test), type = "prob") %>%
  bind_cols(tidy_test) %>%
  roc_auc(factor(Order), .pred_Anura)

## # A tibble: 1 x 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 roc_auc binary           0.5

final_knn_model %>%
  predict(bake(tidy_rec, new_data = tidy_test), type = "prob") %>%
  bind_cols(tidy_test) %>%
  roc_curve(factor(Order), .pred_Anura) %>%
  autoplot()

Visualizing predictions:

library(viridis)

ggplot(boosted_aug_long, aes(x = value, y = .pred_Anura, col = .pred_class)) + 
  geom_point() + 
  facet_wrap(~predictor) + 
  scale_color_viridis_d("Truth", option = "D") +
  theme_bw()
ggplot(boosted_aug_long, aes(x = value, y = .pred_Caudata, col = .pred_class)) + 
  geom_point() + 
  facet_wrap(~predictor) + 
  scale_color_viridis_d("Truth", option = "D") +
  theme_bw()