Error with "step_string2factor()" and model global explanations

ML_Rookie_2021 · May 22, 2022, 11:27pm

I am trying to replicate code shown in the Tidy Modeling with R book using this Kaggle dataset. However, I am running into some issues with recipes and global model explanations.

Here's the code to reproduce my work -

# Libraries ----
library(tidyverse)
library(janitor)
library(tidymodels)
library(DALEXtra)

# Load Data ----
campaign_tbl_raw <- data.table::fread("../data/marketing_campaign.csv", sep = ";") %>% 
    clean_names() %>% 
    as_tibble()

campaign_tbl <- campaign_tbl_raw %>% 
    filter(!income > 200000) %>% 
    mutate(response = as.factor(response))

# Data Split ----
set.seed(123)
data_split <- initial_split(campaign_tbl, prop = 0.8, strata = response)
train_tbl <- training(data_split) 
test_tbl <- testing(data_split)


# Recipe ----
glmnet_base_recipe <- glmnet_recipe <- recipe(formula = response ~ ., data = train_tbl) %>% 
    step_rm(starts_with("z_")) %>% 
    update_role(id, new_role = "indicator") %>% 
    step_string2factor(one_of(education, marital_status)) %>% 
    step_mutate(dt_customer = as.numeric(dt_customer)) %>% 
    step_novel(all_nominal(), -all_outcomes()) %>% 
    step_dummy(all_nominal(), -all_outcomes()) %>% 
    step_zv(all_predictors()) %>% 
    step_normalize(year_birth, income, dt_customer, recency, starts_with("mnt_"), starts_with("num_")) %>% 
    themis::step_upsample(response, over_ratio = 0.5)

glmnet_base_recipe %>% prep() %>% juice() %>% glimpse()

First Issue - When I try to prep and glimpse the recipe above, I get the following error -

Error in `instrument_base_errors()`:
! object 'education' not found
Caused by error in `map_lgl()`:
! object 'education' not found
Run `rlang::last_error()` to see where the error occurred.

Unfortunately, I'm not sure how to interpret the "Run `rlang::last_error()` to see where the error occurred" message. However, when I take out the step_string2factor(one_of(education, marital_status)) step, the recipe works just fine, so I update the recipe and proceed -

# Recipe ----
glmnet_base_recipe <- glmnet_recipe <- recipe(formula = response ~ ., data = train_tbl) %>% 
    step_rm(starts_with("z_")) %>% 
    update_role(id, new_role = "indicator") %>% 
    step_mutate(dt_customer = as.numeric(dt_customer)) %>% 
    step_novel(all_nominal(), -all_outcomes()) %>% 
    step_dummy(all_nominal(), -all_outcomes()) %>% 
    step_zv(all_predictors()) %>% 
    step_normalize(year_birth, income, dt_customer, recency, starts_with("mnt_"), starts_with("num_")) %>% 
    themis::step_upsample(response, over_ratio = 0.5)

# Model Spec ----
base_glmnet_spec <- logistic_reg(
    penalty = 0.1,
    mixture = 0.5
) %>% 
    set_mode("classification") %>% 
    set_engine("glmnet")

# Workflow Spec ----
glmnet_base_workflow <- workflow() %>% 
    add_recipe(glmnet_base_recipe) %>% 
    add_model(base_glmnet_spec)   

# Fit ----
glmnet_base_fit <- glmnet_base_workflow %>% 
    fit(train_tbl)

Second Issue - I am trying to follow the steps used in TMWR to explain models and predictions. In the book, they first build an explainer (see section 18.1). I follow the same steps using the code below -

# Explainer ----
explainer_glmnet <- explain_tidymodels(
    glmnet_base_fit,
    data = train_tbl,
    y = train_tbl$response,
    label = "lm base",
    verbose = FALSE
)

However, I get a warning -

Warning message:
In Ops.factor(y, predict_function(model, data)) :
  ‘-’ not meaningful for factors

I googled the warning and learned that it indicates a data type that is not suitable for the computation. However, in my case I'm not sure where that happens or how to fix it.

Finally, I try to replicate the global explanations in TMWR (see section 18.3) with the code -

# Variable Importance Via model_parts() ----
set.seed(123)
vip_glmnet <- model_parts(explainer_glmnet, loss_function = loss_one_minus_auc)

However I get the following error -

Error in Summary.factor(1L, na.rm = FALSE) : 
  ‘sum’ not meaningful for factors

I'm assuming this has something to do with the earlier warning message, but I'm at a loss for how to fix it. Any help will be appreciated.

Max · May 23, 2022, 9:22pm

Can you please provide a minimal reprex (reproducible example)? The goal of a reprex is to make it as easy as possible for me to recreate your problem so that I can fix it: please help me help you!

If you've never heard of a reprex before, start by reading "What is a reprex", and follow the advice further down that page.

I'm not sure that we can figure it out without being able to reproduce the issue.

ML_Rookie_2021 · May 24, 2022, 12:39pm

@Max Thank you. I created some data, though I'm not able to reproduce the exact error I get with the data I'm using from Kaggle. I'm also not able to reproduce the step_string2factor() error. However, the code below reproduces something similar when I try to do the model explanation. I'm assuming it's because the example in the book was a regression problem while I'm doing a classification problem.

library(tidyverse)
library(tidymodels)
library(DALEXtra)

# Create Data ----
var1 <- floor(runif(40, 0, 99))
var2 <- runif(40, 0, 1493.0)
var3 <- runif(40, 0, 199.0)
var4 <- runif(40, 0, 27)
var5 <- c(rep("phd", 10), rep("masters", 10), rep("grad", 15), rep("basic", 5))
var6 <- c(rep("divorced", 10), rep("single", 10), rep("married", 15), rep("unknown", 5))
target <- c(rep(0, 35), rep(1, 5))

df <- tibble(var1, var2, var3, var4, var5, var6, target)
df <- df[sample(1:nrow(df)), ]

# Data Split ----
set.seed(123)
data_split <- initial_split(df, prop = 0.8, strata = target)
train_tbl <- training(data_split) %>% mutate(target = as.factor(target))
test_tbl <- testing(data_split) %>% mutate(target = as.factor(target))


# Recipe ----
glmnet_recipe <- recipe(formula = target ~ ., data = train_tbl) %>% 
    step_string2factor(one_of(var5, var6)) %>% 
    step_novel(all_nominal(), -all_outcomes()) %>% 
    step_dummy(all_nominal(), -all_outcomes()) %>% 
    step_zv(all_predictors()) %>% 
    step_normalize(var1, var2, var3, var4) %>% 
    themis::step_upsample(target, over_ratio = 0.5)
#> Registered S3 methods overwritten by 'themis':
#>   method                  from   
#>   bake.step_downsample    recipes
#>   bake.step_upsample      recipes
#>   prep.step_downsample    recipes
#>   prep.step_upsample      recipes
#>   tidy.step_downsample    recipes
#>   tidy.step_upsample      recipes
#>   tunable.step_downsample recipes
#>   tunable.step_upsample   recipes

# Model Spec ----
glmnet_spec <- logistic_reg(
    penalty = 0.1,
    mixture = 0.5
) %>% 
    set_mode("classification") %>% 
    set_engine("glmnet")

# Workflow Spec ----
glmnet_workflow <- workflow() %>% 
    add_recipe(glmnet_recipe) %>% 
    add_model(glmnet_spec)   

# Fit ----
glmnet_fit <- glmnet_workflow %>% 
    fit(train_tbl)
#> Warning: Unknown columns: `phd`, `masters`, `grad`, `basic`, `divorced`,
#> `single`, `married`, `unknown`

# Explainer ----
explainer_glmnet <- explain_tidymodels(
    glmnet_fit,
    data = train_tbl,
    y = train_tbl$target,
    label = "lm",
    verbose = FALSE
)
#> Warning in Ops.factor(y, predict_function(model, data)): '-' not meaningful for
#> factors

# Variable Importance Via model_parts() ----
set.seed(123)
vip_glmnet <- model_parts(explainer_glmnet, loss_function = loss_one_minus_auc)
#> Error in Summary.factor(structure(1L, .Label = c("0", "1"), class = "factor"), : 'sum' not meaningful for factors

Max · May 24, 2022, 1:27pm

The first issue can be solved by converting the character columns to factor before the recipe. During resampling, the complete set of values might not be in the character data. Converting them to factors then will misconfigure the levels. I'd do an across() before anything else (see the code below).
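To make the point about levels concrete, here is a minimal base R sketch. The toy vector and the hand-picked resample indices are made up for illustration and are not part of the original data.

```r
# Hypothetical toy data; the name `education` is only for illustration.
education <- c("phd", "masters", "grad", "basic", "grad", "phd")

# Converting the full column first records the complete set of levels.
full_factor <- factor(education)
levels(full_factor)

# A resample that happens to contain no "basic" rows.
resample_idx <- c(1, 2, 3, 5)

# Subsetting the existing factor keeps every level, including "basic".
levels_kept <- levels(full_factor[resample_idx])

# Converting after subsetting only sees the values that are present,
# so the "basic" level is silently missing.
levels_lost <- levels(factor(education[resample_idx]))
```

Comparing `levels_kept` with `levels_lost` shows why doing the conversion once, before splitting or resampling, tends to be the safer order.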

For the DALEX bit, those functions assume that the outcome is always numeric. The help file says this (though not obviously) and the example doesn't help. So convert it to 0/1 beforehand.

(I think that future versions of DALEX can get around this since it looks like they are allowing yardstick to be used with their explainers).
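As a rough base R illustration of why a factor outcome trips up the residual calculation, and of two ways to get a 0/1 numeric vector, consider the sketch below. The `target` and `predicted` vectors are hypothetical stand-ins, not output from the model in this thread.

```r
# Hypothetical outcome and predicted probabilities, for illustration only.
target    <- factor(c(0, 0, 1, 0, 1))
predicted <- c(0.1, 0.2, 0.8, 0.3, 0.6)

# Arithmetic on a factor is undefined: R returns NA values and raises the
# "'-' not meaningful for factors" warning (suppressed here to keep output clean).
factor_residuals <- suppressWarnings(target - predicted)

# as.numeric() on a factor returns the internal level codes (1 and 2 here),
# not the labels, so it is not a drop-in conversion to 0/1.
level_codes <- as.numeric(target)

# Compare against the label, or go through as.character(), to recover 0/1.
y_numeric     <- ifelse(target == "1", 1, 0)
y_numeric_alt <- as.numeric(as.character(target))

# With a numeric outcome the residuals are well defined.
numeric_residuals <- y_numeric - predicted
```

Either `y_numeric` or `y_numeric_alt` could be passed as `y`. The `ifelse()` form makes it explicit which level is treated as the event.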

Here's code that works. Sorry that neither of these issues is clearer.

library(tidyverse)
library(tidymodels)
library(DALEXtra)
#> Loading required package: DALEX
#> Welcome to DALEX (version: 2.4.0).
#> Find examples and detailed introduction at: http://ema.drwhy.ai/
#> 
#> Attaching package: 'DALEX'
#> The following object is masked from 'package:dplyr':
#> 
#>     explain

# Create Data ----
var1 <- floor(runif(40, 0, 99))
var2 <- runif(40, 0, 1493.0)
var3 <- runif(40, 0, 199.0)
var4 <- runif(40, 0, 27)
var5 <- c(rep("phd", 10), rep("masters", 10), rep("grad", 15), rep("basic", 5))
var6 <- c(rep("divorced", 10), rep("single", 10), rep("married", 15), rep("unknown", 5))
target <- c(rep(0, 35), rep(1, 5))

df <- tibble(var1, var2, var3, var4, var5, var6, target)
df <- df[sample(1:nrow(df)), ]

# Added this to convert character columns to factors
df <- mutate(df, across(where(is.character), ~ as.factor(.x)))

# Data Split ----
set.seed(123)
data_split <- initial_split(df, prop = 0.8, strata = target)
train_tbl <- training(data_split) %>% mutate(target = as.factor(target))
test_tbl <- testing(data_split) %>% mutate(target = as.factor(target))


# Recipe ----
glmnet_recipe <- recipe(formula = target ~ ., data = train_tbl) %>% 
  # Removed the string2factor step
  step_novel(all_nominal_predictors()) %>% 
  step_dummy(all_nominal_predictors()) %>% 
  step_zv(all_predictors()) %>% 
  step_normalize(all_numeric_predictors()) %>% 
  themis::step_upsample(target, over_ratio = 0.5)

# Model Spec ----
glmnet_spec <- logistic_reg(
  penalty = 0.1,
  mixture = 0.5
) %>% 
  set_mode("classification") %>% 
  set_engine("glmnet")

# Workflow Spec ----
glmnet_workflow <- workflow() %>% 
  add_recipe(glmnet_recipe) %>% 
  add_model(glmnet_spec)   

# Fit ----
glmnet_fit <- glmnet_workflow %>% 
  fit(train_tbl)

# Explainer ----
explainer_glmnet <- explain_tidymodels(
  glmnet_fit,
  data = train_tbl,
  y = ifelse(train_tbl$target == "1", 1, 0),  # <-- convert to numeric :-/
  label = "lm",
  verbose = FALSE
)

# Variable Importance Via model_parts() ----
set.seed(123)
vip_glmnet <- model_parts(explainer_glmnet, loss_function = loss_one_minus_auc)

Created on 2022-05-24 by the reprex package (v2.0.1)
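A side note on the original `object 'education' not found` error: a likely contributor is that `one_of()` expects quoted column names, so unquoted names inside it are evaluated as ordinary R objects instead of columns. That would also fit the "Unknown columns" warning in the earlier reprex, where `var5` and `var6` existed as global vectors and their values were treated as column names. The sketch below uses `dplyr::select()` instead of a recipe step to keep it short, and `toy_tbl` and its columns are hypothetical.

```r
library(dplyr)

# Hypothetical toy data; column names mirror the original post for illustration only.
toy_tbl <- tibble(
  education      = c("phd", "masters"),
  marital_status = c("single", "married"),
  income         = c(50000, 60000)
)

# Unquoted names inside one_of() are looked up as regular objects, so this
# fails when no object called `education` exists in the calling environment.
unquoted_result <- tryCatch(
  select(toy_tbl, one_of(education, marital_status)),
  error = function(e) conditionMessage(e)
)

# Quoted names work, as does listing the bare column names without a helper.
quoted_cols <- select(toy_tbl, one_of("education", "marital_status"))
bare_cols   <- select(toy_tbl, education, marital_status)
```

In a recipe, the equivalent of `bare_cols` would be writing the step as `step_string2factor(education, marital_status)`, although converting before the recipe, as in the code above, avoids the step altogether.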

ML_Rookie_2021 · May 24, 2022, 10:28pm

@Max This is perfect. Thank you.
