Tidymodels: Plotting Predicted vs True Values using the functions collect_predictions() and ggplot() in R

Kaikash777 · December 13, 2020, 2:31pm

Overview

I have produced four models using the tidymodels package with the data frame FID (see below):

General Linear Model
Bagged Tree
Random Forest
Boosted Trees

The data frame contains three predictors:

Year (numeric)
Month (Factor)
Days (numeric)

The dependent variable is Frequency (numeric)

I am following this tutorial:-

Issue

I would like to plot the quantitative estimates for how well my model performed and whether these values can be compared across different kinds of models. I want to visually show the predicted frequency (dependent variable) plotted against the true frequency when they were published (see an example of what I want to plot below), for all the resampled datasets.

I am experiencing this error message

Error in FUN(X[[i]], ...) : object 'Frequency' not found

If anyone can help me with this error, I would be deeply appreciative.

Many thanks in advance.

R-Code

##Open the tidymodels package

library(tidymodels)
library(glmnet)
library(parsnip)
library(rpart.plot)
library(rpart)
library(tidyverse) # manipulating data
library(skimr) # data visualization
library(baguette) # bagged trees
library(future) # parallel processing & decrease computation time
library(xgboost) # boosted trees
library(ranger)
library(yardstick)
library(purrr)
library(forcats)
library(ggplot)

#split this single dataset into two: a training set and a testing set
data_split <- initial_split(FID)

##Create data frames for the two sets:
train_data <- training(data_split)
test_data  <- testing(data_split)

##Resample the data with 10-fold cross-validation (10-fold by default)
cv <- vfold_cv(train_data, v=10)

###########################################################
##Produce the recipe

rec <- recipe(Frequency ~ ., data = FID) %>% 
          step_nzv(all_predictors(), freq_cut = 0, unique_cut = 0) %>% # remove variables with zero variances
          step_novel(all_nominal()) %>% # prepares test data to handle previously unseen factor levels 
          step_medianimpute(all_numeric(), -all_outcomes(), -has_role("id vars"))  %>% # replaces missing numeric observations with the median
          step_dummy(all_nominal(), -has_role("id vars")) # dummy codes categorical variables

##########################################################
##Produce Models
##########################################################
##General Linear Models
##########################################################

##Produce the glm model
mod_glm<-linear_reg(mode="regression",
                       penalty = 0.1, 
                       mixture = 1) %>% 
                            set_engine("glmnet")

##Create workflow
wflow_glm <- workflow() %>% 
                add_recipe(rec) %>%
                      add_model(mod_glm)

##Fit the model

#############################################################
##Estimate how well that model performs, let’s fit many times, 
##once to each of these resampled folds, and then evaluate on the held out 
##part of each resampled fold.
#############################################################

plan(multisession)

fit_glm <- fit_resamples(
                        wflow_glm,
                        cv,
                        metrics = metric_set(rmse, rsq),
                        control = control_resamples(save_pred = TRUE)
                        )

##Collect model predictions for each fold for the frequency 

Predictions<-fit_glm %>% 
                  collect_predictions()

Predictions

##Plot the predicted and true values 
fit_glm %>%
      collect_predictions() %>%
      ggplot(aes(Frequency, .pred, color = id)) +
      geom_abline(lty = 2, color = "gray80", size = 1.5) +
      geom_point(alpha = 0.3) +
      labs(
      x = "Truth",
      y = "Predicted year",
      color = NULL,
      title = "Predicted and True Years for Frequency",
      subtitle = "Each Cross-validation Fold is Shown in a Different Color"
      )`

Desired Plot

Data Frame - FID


structure(list(Year = c(2015, 2015, 2015, 2015, 2015, 2015, 2015, 
2015, 2015, 2015, 2015, 2015, 2016, 2016, 2016, 2016, 2016, 2016, 
2016, 2016, 2016, 2016, 2016, 2016, 2017, 2017, 2017, 2017, 2017, 
2017, 2017, 2017, 2017, 2017, 2017, 2017), Month = structure(c(5L, 
4L, 8L, 1L, 9L, 7L, 6L, 2L, 12L, 11L, 10L, 3L, 5L, 4L, 8L, 1L, 
9L, 7L, 6L, 2L, 12L, 11L, 10L, 3L, 5L, 4L, 8L, 1L, 9L, 7L, 6L, 
2L, 12L, 11L, 10L, 3L), .Label = c("April", "August", "December", 
"February", "January", "July", "June", "March", "May", "November", 
"October", "September"), class = "factor"), Frequency = c(36, 
28, 39, 46, 5, 0, 0, 22, 10, 15, 8, 33, 33, 29, 31, 23, 8, 9, 
7, 40, 41, 41, 30, 30, 44, 37, 41, 42, 20, 0, 7, 27, 35, 27, 
43, 38), Days = c(31, 28, 31, 30, 6, 0, 0, 29, 15, 
29, 29, 31, 31, 29, 30, 30, 7, 0, 7, 30, 30, 31, 30, 27, 31, 
28, 30, 30, 21, 0, 7, 26, 29, 27, 29, 29)), row.names = c(NA, 
-36L), class = "data.frame")

Max · December 13, 2020, 4:52pm

A few of these lines fail and, once I fix them, I cannot reproduce your issue.

Can you please provide a minimal reprex (reproducible example)? The goal of a reprex is to make it as easy as possible for me to recreate your problem so that I can fix it: please help me help you!

If you've never heard of a reprex before, start by reading "What is a reprex", and follow the advice further down that page.

Kaikash777 · December 14, 2020, 1:49am

Hi Max, This example is a reprex. I have double checked with a clear Global environment in R Studio and I am still getting the same error. Please find another copy of the code. Thank you for your time and patience, it is deeply appreciated.

##Open the tidymodels package
library(tidymodels)
library(glmnet)
library(parsnip)
library(rpart.plot)
library(rpart)
library(tidyverse) # manipulating data
library(skimr) # data visualization
library(baguette) # bagged trees
library(future) # parallel processing & decrease computation time
library(xgboost) # boosted trees
library(ranger)
library(yardstick)
library(purrr)
library(forcats)
library(ggplot)

#split this single dataset into two: a training set and a testing set
data_split <- initial_split(FID)
# Create data frames for the two sets:
train_data <- training(data_split)
test_data  <- testing(data_split)

# resample the data with 10-fold cross-validation (10-fold by default)
cv <- vfold_cv(train_data, v=10)

###########################################################
##Produce the recipe

rec <- recipe(Frequency ~ ., data = FID) %>% 
          step_nzv(all_predictors(), freq_cut = 0, unique_cut = 0) %>% # remove variables with zero variances
          step_novel(all_nominal()) %>% # prepares test data to handle previously unseen factor levels 
          step_medianimpute(all_numeric(), -all_outcomes(), -has_role("id vars"))  %>% # replaces missing numeric observations with the median
          step_dummy(all_nominal(), -has_role("id vars")) # dummy codes categorical variables

##########################################################
##Produce Models
##########################################################
##General Linear Models
##########################################################

##Produce the glm model
mod_glm<-linear_reg(mode="regression",
                       penalty = 0.1, 
                       mixture = 1) %>% 
                            set_engine("glmnet")

##Create workflow
wflow_glm <- workflow() %>% 
                add_recipe(rec) %>%
                      add_model(mod_glm)

##Fit the model

###########################################################################
##Estimate how well that model performs, let’s fit many times, 
##once to each of these resampled folds, and then evaluate on the heldout 
##part of each resampled fold.
##########################################################################
plan(multisession)

fit_glm <- fit_resamples(
                        wflow_glm,
                        cv,
                        metrics = metric_set(rmse, rsq),
                        control = control_resamples(save_pred = TRUE)
                        )
##Collect model predictions for each fold for the number of blue whale sightings

Predictions<-fit_glm %>% 
                  collect_predictions()

Predictions

##Plot the predicted and true values 
fit_glm %>%
      collect_predictions() %>%
      ggplot(aes(Frequency, .pred, color = id)) +
      geom_abline(lty = 2, color = "gray80", size = 1.5) +
      geom_point(alpha = 0.3) +
      labs(
      x = "Truth",
      y = "Predicted year",
      color = NULL,
      title = "Predicted and True Years for Frequency",
      subtitle = "Each Cross-validation Fold is Shown in a Different Color"
      )

Max · December 14, 2020, 1:13pm

I'll look at this in a bit. However, just for future reference, it would be good to make a minimal reprex, meaning:

not unneeded libraries, function calls or plots.
sequential processing only (unless you know thee issue is with parallel processing)
set the seeds; I won't get the same random numbers as you and might not be able to reproduce the issue.
don't have the data code somewhere else

Max · December 14, 2020, 1:15pm

Are these the same results that you get?

library(tidymodels)
#> ── Attaching packages ────────────────────────────────────── tidymodels 0.1.2 ──
#> ✓ broom     0.7.2          ✓ recipes   0.1.15    
#> ✓ dials     0.0.9.9000     ✓ rsample   0.0.8     
#> ✓ dplyr     1.0.2          ✓ tibble    3.0.4     
#> ✓ ggplot2   3.3.2          ✓ tidyr     1.1.2     
#> ✓ infer     0.5.3          ✓ tune      0.1.2     
#> ✓ modeldata 0.1.0          ✓ workflows 0.2.1     
#> ✓ parsnip   0.1.4          ✓ yardstick 0.0.7     
#> ✓ purrr     0.3.4
#> ── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
#> x purrr::discard() masks scales::discard()
#> x dplyr::filter()  masks stats::filter()
#> x dplyr::lag()     masks stats::lag()
#> x recipes::step()  masks stats::step()

FID <- 
   tibble::tribble(
      ~Year,      ~Month, ~Frequency, ~Days,
      2015,   "January",         36,    31,
      2015,  "February",         28,    28,
      2015,     "March",         39,    31,
      2015,     "April",         46,    30,
      2015,       "May",          5,     6,
      2015,      "June",          0,     0,
      2015,      "July",          0,     0,
      2015,    "August",         22,    29,
      2015, "September",         10,    15,
      2015,   "October",         15,    29,
      2015,  "November",          8,    29,
      2015,  "December",         33,    31,
      2016,   "January",         33,    31,
      2016,  "February",         29,    29,
      2016,     "March",         31,    30,
      2016,     "April",         23,    30,
      2016,       "May",          8,     7,
      2016,      "June",          9,     0,
      2016,      "July",          7,     7,
      2016,    "August",         40,    30,
      2016, "September",         41,    30,
      2016,   "October",         41,    31,
      2016,  "November",         30,    30,
      2016,  "December",         30,    27,
      2017,   "January",         44,    31,
      2017,  "February",         37,    28,
      2017,     "March",         41,    30,
      2017,     "April",         42,    30,
      2017,       "May",         20,    21,
      2017,      "June",          0,     0,
      2017,      "July",          7,     7,
      2017,    "August",         27,    26,
      2017, "September",         35,    29,
      2017,   "October",         27,    27,
      2017,  "November",         43,    29,
      2017,  "December",         38,    29
   )

#split this single dataset into two: a training set and a testing set
data_split <- initial_split(FID)
# Create data frames for the two sets:
train_data <- training(data_split)
test_data  <- testing(data_split)

# resample the data with 10-fold cross-validation (10-fold by default)

set.seed(1)
cv <- vfold_cv(train_data, v=10)

###########################################################
##Produce the recipe

rec <- recipe(Frequency ~ ., data = FID) %>% 
   step_nzv(all_predictors(), freq_cut = 0, unique_cut = 0) %>% # remove variables with zero variances
   step_novel(all_nominal()) %>% # prepares test data to handle previously unseen factor levels 
   step_medianimpute(all_numeric(), -all_outcomes(), -has_role("id vars"))  %>% # replaces missing numeric observations with the median
   step_dummy(all_nominal(), -has_role("id vars")) # dummy codes categorical variables

##########################################################
##Produce Models
##########################################################
##General Linear Models
##########################################################

##Produce the glm model
mod_glm<-linear_reg(mode="regression",
                    penalty = 0.1, 
                    mixture = 1) %>% 
   set_engine("glmnet")

##Create workflow
wflow_glm <- workflow() %>% 
   add_recipe(rec) %>%
   add_model(mod_glm)

##Fit the model

###########################################################################
##Estimate how well that model performs, let’s fit many times, 
##once to each of these resampled folds, and then evaluate on the heldout 
##part of each resampled fold.
##########################################################################

fit_glm <- fit_resamples(
   wflow_glm,
   cv,
   metrics = metric_set(rmse, rsq),
   control = control_resamples(save_pred = TRUE)
)
#> 
#> Attaching package: 'rlang'
#> The following objects are masked from 'package:purrr':
#> 
#>     %@%, as_function, flatten, flatten_chr, flatten_dbl, flatten_int,
#>     flatten_lgl, flatten_raw, invoke, list_along, modify, prepend,
#>     splice
#> 
#> Attaching package: 'vctrs'
#> The following object is masked from 'package:tibble':
#> 
#>     data_frame
#> The following object is masked from 'package:dplyr':
#> 
#>     data_frame
#> Loading required package: Matrix
#> 
#> Attaching package: 'Matrix'
#> The following objects are masked from 'package:tidyr':
#> 
#>     expand, pack, unpack
#> Loaded glmnet 4.0-2
##Collect model predictions for each fold for the number of blue whale sightings

Predictions<-fit_glm %>% 
   collect_predictions()


##Plot the predicted and true values 
fit_glm %>%
   collect_predictions() %>%
   ggplot(aes(Frequency, .pred, color = id)) +
   geom_abline(lty = 2, color = "gray80", size = 1.5) +
   geom_point(alpha = 0.3) +
   labs(
      x = "Truth",
      y = "Predicted year",
      color = NULL,
      title = "Predicted and True Years for Frequency",
      subtitle = "Each Cross-validation Fold is Shown in a Different Color"
   )

^{Created on 2020-12-14 by the reprex package (v0.3.0)}

Kaikash777 · December 14, 2020, 2:37pm

Hi Max. Thank you for your help, I really appreciated it. After I downloaded the rlang package, the error went away and I produced the graph.

Sorry, I used the seed 45L, but it's at the beginning of my code, and I forgot to insert it.

Additionally, I will only include the packages that are relevant to the models that I am referring too in the question. I am using all those packages for all my models, but you are right. Even the libraries need to be part of the reprex.

Many thanks again for your advice, I don't find this topic easy being a newbie to tidymodels.

Max · December 14, 2020, 3:59pm

Not a problem. Thanks!

Kaikash777 · December 15, 2020, 3:00am

Hey Max. I hope you don't think I am overstepping but would it be possible to please help me understand the warning message for this question:

Tidymodels: Tuning hyperparameters in a bagged tree using the tune_grid() function

I am tuning all four models now: (1) glm; (2) bagged trees; (3) random forest; and (4) boosted trees.

Could I please also ask your opinion about using the function poisson_reg()? My independent predictor (frequency) represent physical counts. A commenter on Stack Overflow suggested that I use this function poissonreg::poisson_reg() in my models to prevent data leakage and thus turn the values into positive integers, which makes sense.

In your opinion, what do you think of incorporating this function into my models? Sorry for asking so many questions but I want to understand exactly what I am doing. Many thanks for your time and patience. Take care.

      ##Produce the glm model
       mod_glm<-linear_reg(mode="regression",
                   penalty = 0.1, 
                   mixture = 1) %>% 
                       poissonreg::poisson_reg() %>% 
                                   set_engine("glmnet")
   #Error
   Error: `mode` should be one of: 'unknown', 'regression'
   Run `rlang::last_error()` to see where the error occurred.
   In addition: Warning message:
   In if (!(mode %in% spec_modes)) 
   rlang::abort(glue::glue("`mode` should be one of: ",  :
  the condition has length > 1 and only the first element will be used

Max · December 15, 2020, 2:17pm

Poisson regression is run using (just) the poission_reg() function. Type ?poisson_reg to see examples.

This is different from linear regression; Poisson regression is not a simple linear model so you don't need to use linear_reg().

I'm happy to answer questions, but a lot of this can be found in the help pages. You also might want to look at the Get Started sections of tidymodels.org.

Kaikash777 · December 17, 2020, 3:11am

Hi Max.

Firstly, I would like to thank you for sending me the tutorial regarding 'Get started with sections of tidymodels'. That was a really good read, and it really helped. I am literally just starting out and this provided clarity.

I have read thoroughly online to solve my warning message regarding the tuning of my bagged tree model using the function tune_grid():

#Warning message

  ! Fold02: internal: A correlation computation is required, but `estimate` is constant and has 0 sta...
  ! Fold07: internal: A correlation computation is required, but `estimate` is constant and has 0 sta...
  ! Fold10: internal: A correlation computation is required, but `estimate` is constant and has 0 sta...

One of your explanations for this warning message in this post:

It happens when your model produces a single predicted value. R2 needs the variance (which is then zero) and produces an NA value.

There is not much mention online for this warning message, but I also found post 1, and post 2, where you and your colleague Davis Vaughan suggest that you silently return zero back into the model.

Could I please ask:

Why is this warning message occurring when I am attempting to tune the hyperparameters on my bagged model?
Do you have any suggestions about how to fix the warning issue?
Would I need to silently replace the zero for the tuning of my model?

Here's the question and reproducible code regarding tuning my bagged model using the function tune_grid().

               ##Tune the bagged model
               bag_res <- tune_grid(
                                   wflow_bag %>% update_model(tune_spec_bag),
                                   cv,
                                   grid = bag_grid,
                                   metrics=metric_set(rmse, rsq),
                                   control = control_resamples(save_pred = TRUE)
                                   )

Results from collect_predictions()

      ##Produce the prediction metrics for the tuned models
      bag_predictions<-bag_res %>%
                            collect_predictions()

      id     .pred  .row tree_depth Frequency_Blue .config
      <chr>  <dbl> <int>      <int>          <dbl> <chr>  
    1 Fold01 34.1      1          1             36 Model01
    2 Fold01 10.1      5          1              5 Model01
    3 Fold01 34.1     14          1             41 Model01
    4 Fold01 35.3      1          2             36 Model02
    5 Fold01  9.51     5          2              5 Model02
    6 Fold01 34.3     14          2             41 Model02
    7 Fold01 39.7      1          4             36 Model03
    8 Fold01  9.02     5          4              5 Model03
    9 Fold01 36.7     14          4             41 Model03
   10 Fold01 42.3      1          5             36 Model04
   # … with 260 more rows

If you are able to help me advance further and to understand this warning message better, I would like to express my deepest appreciation.

Many thanks in advance if you can help.

Max · December 18, 2020, 11:48pm

It's not really an error but a warning.

It happens when your model produces a single predicted value. R2 needs the variance (which is then zero) and produces an NA value.

So your model is trying to predict with the mean of the outcome data (= a single predicted value). This most likely occurs if there is little to no signal in your data. If the tree cannot make any splits, it uses the same mean.

There's really not anything that you can do about it. The performance metrics for those tuning parameter values will not be good so the tuning process would not select those settings.

Kaikash777 · December 19, 2020, 1:51am

Hi Max,

Thank you so much for explaining this warning message to me, I have spent days trying to find a solution. So ultimately, the performance metrics for the tuned bagged tree model is correct. However, I will assume the bagged tree model would not be the best model for my data.

I deeply appreciate your time and patience for a beginner.

system · January 9, 2021, 1:52am

This topic was automatically closed 21 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.