Error when generating predictions from a model with case weights

I've been fitting a series of models using case weights. The aim is to down-weight the influence of a large number of background points relative to observed presence points. I have been using logistic regression and machine learning, fitting a GLM, GAM, RF and BRT, all of which can accept case weights.

The models are fit via spatial cross-validation using the excellent spatialsample package, and I have had some help applying the weights to the folds.

From the series of fitted models I would like to generate partial dependence plots. However, when a fitted model is passed to DALEXtra::explain_tidymodels and DALEX::model_profile I get the following error:

Error in `quantile()`:
! `quantile.hardhat_importance_weights()` not implemented.

Typically, although a model is fitted with weights, the weights are not needed when generating predictions for new data. Can anyone offer advice on how to generate predictions from a tidymodels workflow fitted with case weights, either using DALEX or an alternative?
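For instance, with a plain glm() in base R, the weights matter only at fit time; a minimal sketch with made-up data:

# Minimal sketch with made-up data: the weights are used at fit time
# only, so newdata needs no weights column. (glm() warns about
# non-integer weights with a binomial family, but fits regardless.)
d <- data.frame(y = rbinom(100, 1, 0.3), x = rnorm(100))
d$w <- ifelse(d$y == 1, 1, 0.5)
fit <- glm(y ~ x, data = d, family = binomial, weights = w)
predict(fit, newdata = data.frame(x = c(-1, 0, 1)), type = "response")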

Thanks in advance for any help or suggestions; here's a reprex to illustrate the issue.

James

set.seed(1107)

# packages
library(sf)
#> Linking to GEOS 3.11.0, GDAL 3.5.3, PROJ 9.1.0; sf_use_s2() is TRUE
library(tidymodels)
library(spatialsample)
library(DALEXtra)
#> Loading required package: DALEX
#> Welcome to DALEX (version: 2.4.3).
#> Find examples and detailed introduction at: http://ema.drwhy.ai/
#> Additional features will be available after installation of: ggpubr.
#> Use 'install_dependencies()' to get all suggested dependencies
#> 
#> Attaching package: 'DALEX'
#> The following object is masked from 'package:dplyr':
#> 
#>     explain
#> Anaconda not found on your computer. Conda related functionality such as create_env.R and condaenv and yml parameters from explain_scikitlearn will not be available

## Data prep:
# pak::pkg_install("Nowosad/spDataLarge")
data("lsl", "study_mask", package = "spDataLarge")
ta <- terra::rast(system.file("raster/ta.tif", package = "spDataLarge"))
lsl <- lsl |> 
  st_as_sf(coords = c("x", "y"), crs = "EPSG:32717")

# convert to 0, 1 as is typical in species distribution modelling
lsl <- lsl |> 
  mutate(lslpts = factor(as.numeric(lslpts)-1)) |>
  # Creating a dummy case weights column, to get past initial verification by recipe
  mutate(cwts = hardhat::importance_weights(NA))

# set up case weights as a recipe step
lsl_recipe <- recipes::recipe(
  lslpts ~ slope + cplan + cprof + elev + log10_carea, 
  data = sf::st_drop_geometry(lsl)
) |> 
  recipes::step_mutate(
    cwts = hardhat::importance_weights(
      ifelse(lslpts == 1, 1, sum(lslpts == 1) / sum(lslpts == 0))
    ),
    # Need to set the "case_weights" role explicitly:
    role = "case_weights"
  )

# split into folds
lsl_folds <- spatial_block_cv(lsl, method = "random", v = 10)

# try GLM
glm_model <- logistic_reg() |> 
  set_engine("glm") |> 
  set_mode("classification")

# Using weights instead: no add_formula, because the formula is in our recipe
glm_wflow_wts <- workflow(preprocessor = lsl_recipe) |> 
  add_model(glm_model) |> 
  add_case_weights(cwts) |>
  fit_resamples(lsl_folds)

# generate partial dependence profile for model
# ideally want to generate profile for each fold to verify model fit
explain_tidymodels(glm_wflow_wts,
                   data = lsl_folds$splits[[1]] |> analysis() |> st_drop_geometry(),
                   y = lsl_folds$splits[[1]] |> analysis() |> st_drop_geometry() |> pull(lslpts)) |>
  model_profile(N = 1000, type = "partial")
#> Warning: Unknown or uninitialised column: `spec`.
#> Preparation of a new explainer is initiated
#>   -> model label       :  data.frame  (  default  )
#>   -> data              :  308  rows  7  cols 
#>   -> target variable   :  308  values 
#>   -> predict function  :  yhat.default will be used (  default  )
#>   -> predicted values  :  No value for predict function target column. (  default  )
#>   -> model_info        :  package Model of class: resample_results package unrecognized , ver. Unknown , task regression (  default  ) 
#>   -> model_info        :  Model info detected regression task but 'y' is a factor .  (  WARNING  )
#>   -> model_info        :  By deafult regressions tasks supports only numercical 'y' parameter. 
#>   -> model_info        :  Consider changing to numerical vector.
#>   -> model_info        :  Otherwise I will not be able to calculate residuals or loss function.
#>   -> predicted values  :  the predict_function returns an error when executed (  WARNING  ) 
#>   -> residual function :  difference between y and yhat (  default  )
#>   -> residuals         :  the residual_function returns an error when executed (  WARNING  ) 
#>   A new explainer has been created!
#> Error in `quantile()`:
#> ! `quantile.hardhat_importance_weights()` not implemented.
#> Backtrace:
#>      ▆
#>   1. └─DALEX::model_profile(...)
#>   2.   ├─ingredients::ceteris_paribus(...)
#>   3.   └─ingredients:::ceteris_paribus.explainer(...)
#>   4.     └─ingredients:::ceteris_paribus.default(...)
#>   5.       ├─ingredients:::calculate_variable_split(...)
#>   6.       └─ingredients:::calculate_variable_split.default(...)
#>   7.         └─base::lapply(...)
#>   8.           └─ingredients (local) FUN(X[[i]], ...)
#>   9.             ├─base::unique(quantile(selected_column, probs = probs))
#>  10.             ├─stats::quantile(selected_column, probs = probs)
#>  11.             └─vctrs:::quantile.vctrs_vctr(selected_column, probs = probs)
#>  12.               └─vctrs:::stop_unimplemented(x, "quantile")
#>  13.                 └─vctrs:::stop_vctrs(...)
#>  14.                   └─rlang::abort(message, class = c(class, "vctrs_error"), ..., call = call)

Created on 2023-05-29 with reprex v2.0.2

I think there are two separate issues here. First, when making predictions your recipe is looking for the columns it used to calculate the case weights and not finding them; if we use the data.frame() method of recipe() and make a slight adjustment to how we use recipes for cross-validation versus the final model fit, we can work around that issue.

As for the second, I'm not sure why quantile() is being called on the case weights column. I think that defining a fake quantile() method might work, but I really have no idea; if you're able to compare your results from model_profile() via this method to something you know works, that would be ideal. There might be an issue with how either DALEX or ingredients handles the data from the model, since it attempts to use the case weights column when it shouldn't be needed, but I don't know those packages well enough to comment.
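If it helps with that comparison, here's a rough sketch of a hand-rolled partial dependence calculation for a single variable that you could check model_profile() against; `fitted_wflow` and `dat` are placeholders for a fitted workflow and the (geometry-dropped) training data, which would need to carry every column used at fit time, including the case weights column, and I'm assuming 0/1 outcome levels as in your reprex:

# Hand-rolled partial dependence for one predictor: fix the predictor
# at each grid value and average the predicted presence probability
# over the data.
manual_pdp <- function(fitted_wflow, dat, var, grid_len = 20) {
  grid <- seq(min(dat[[var]]), max(dat[[var]]), length.out = grid_len)
  yhat <- sapply(grid, function(v) {
    dat[[var]] <- v
    mean(predict(fitted_wflow, new_data = dat, type = "prob")$.pred_1)
  })
  data.frame(x = grid, yhat = yhat)
}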

set.seed(1107)

library(sf)
#> Linking to GEOS 3.11.1, GDAL 3.6.2, PROJ 9.1.1; sf_use_s2() is TRUE
library(tidymodels)
library(spatialsample)
library(DALEXtra)
#> Loading required package: DALEX
#> Welcome to DALEX (version: 2.4.3).
#> Find examples and detailed introduction at: http://ema.drwhy.ai/
#> 
#> Attaching package: 'DALEX'
#> The following object is masked from 'package:dplyr':
#> 
#>     explain
data("lsl", "study_mask", package = "spDataLarge")
ta <- terra::rast(system.file("raster/ta.tif", package = "spDataLarge"))
lsl <- lsl |> 
  st_as_sf(coords = c("x", "y"), crs = "EPSG:32717")
lsl <- lsl |> 
  mutate(lslpts = factor(as.numeric(lslpts)-1)) |>
  mutate(
    cwts = ifelse(lslpts == 1, 1, sum(lslpts == 1) / sum(lslpts == 0)),
    cwts = hardhat::importance_weights(cwts)
  )
lsl_folds <- spatial_block_cv(lsl, method = "random", v = 10)
glm_model <- logistic_reg() |> 
  set_engine("glm") |> 
  set_mode("classification")

# This is where I started changing things

# First off, define your formula as its own object outside tidymodels functions
lsl_formula <- lslpts ~ slope + cplan + cprof + elev + log10_carea

# Then set up your recipe using the data.frame method to recipe
# so that we can explicitly say our cwts column should be used for case weights
lsl_recipe <- recipes::recipe(
  sf::st_drop_geometry(lsl),
  # Save the case weights column alongside our formula
  vars = c(all.vars(lsl_formula), "cwts"),
  # I'm assuming that you've only got one outcome variable here
  roles = c(
    "outcome", 
    rep("predictor", length(all.vars(lsl_formula)) - 1), 
    "case_weight"
  )
) # Other recipe steps go here for other preprocessing, but NOT 
  # the case-weights creating recipe step

# Once your recipe is finished, create a sub-recipe that adds the 
# dynamic case-weights creating step
lsl_recipe_cv <- lsl_recipe |> 
  recipes::step_mutate(
    cwts = hardhat::importance_weights(
      ifelse(lslpts == 1, 1, sum(lslpts == 1) / sum(lslpts == 0))
    ),
    role = "case_weights"
  )

# Set up a primary workflow:
glm_wflow_wts <- workflow() |> 
  add_model(glm_model) |> 
  add_case_weights(cwts)

# When doing anything relating to cross-validation, use the sub-recipe:
glm_wflow_wts |> 
  add_recipe(lsl_recipe_cv) |> 
  fit_resamples(lsl_folds) |> 
  collect_metrics()
#> # A tibble: 2 × 6
#>   .metric  .estimator  mean     n std_err .config             
#>   <chr>    <chr>      <dbl> <int>   <dbl> <chr>               
#> 1 accuracy binary     0.752    10  0.0220 Preprocessor1_Model1
#> 2 roc_auc  binary     0.824    10  0.0301 Preprocessor1_Model1

# When fitting the final model, though, use the first recipe:
glm_wflow_wts_fit <- glm_wflow_wts |> 
  add_recipe(lsl_recipe) |> 
  fit(lsl)

# That fixes issue 1. Issue 2 is that something in {ingredients} is calling 
# quantile() on the case weights column
#
# I don't think it does anything else with the case weights column after that...
# so what if we just defined a function that works for that
# 
# I _think_ this hack is fine, but haven't done enough verification to confirm;
# I think there's an underlying bug in either DALEX or ingredients, but I don't 
# know what these packages do enough to confirm
quantile.hardhat_importance_weights <- \(x, ...) rep(NA, length(x))
explain_tidymodels(glm_wflow_wts_fit,
                   data = sf::st_drop_geometry(lsl),
                   y = lsl$lslpts) |>
  model_profile(N = 1000, type = "partial")
#> Preparation of a new explainer is initiated
#>   -> model label       :  workflow  (  default  )
#>   -> data              :  350  rows  7  cols 
#>   -> target variable   :  350  values 
#>   -> predict function  :  yhat.workflow  will be used (  default  )
#>   -> predicted values  :  No value for predict function target column. (  default  )
#>   -> model_info        :  package tidymodels , ver. 1.0.0 , task classification (  default  ) 
#>   -> model_info        :  Model info detected classification task but 'y' is a factor .  (  WARNING  )
#>   -> model_info        :  By deafult classification tasks supports only numercical 'y' parameter. 
#>   -> model_info        :  Consider changing to numerical vector with 0 and 1 values.
#>   -> model_info        :  Otherwise I will not be able to calculate residuals or loss function.
#>   -> predicted values  :  numerical, min =  0.00233623 , mean =  0.5 , max =  0.9858769  
#>   -> residual function :  difference between y and yhat (  default  )
#> Warning in Ops.factor(y, predict_function(model, data)): '-' not meaningful for
#> factors
#>   -> residuals         :  numerical, min =  NA , mean =  NA , max =  NA  
#>   A new explainer has been created!
#> Top profiles    : 
#>   _vname_  _label_         _x_    _yhat_ _ids_
#> 1   cprof workflow -0.16039415 0.8868163     0
#> 2   cplan workflow -0.15974538 0.9686094     0
#> 3   cplan workflow -0.12989512 0.9399790     0
#> 4   cplan workflow -0.12365616 0.9316212     0
#> 5   cplan workflow -0.10901864 0.9077660     0
#> 6   cplan workflow -0.09494406 0.8781567     0

Created on 2023-06-01 with reprex v2.0.2

Thanks @MikeMahoney218. I hadn't considered using two recipes!

I was trying to work forward from the cross-validation recipe, applying it to the folds object to generate multiple partial dependence profiles. However, I think I can adapt your suggestion as follows to get the output I was after:

1. Use the cross-validation approach with lsl_recipe_cv to check AUC scores for the various models; ultimately I'm trying several combinations of polynomial terms in the GLMs, and exploring different parameters in the RFs and BRTs.

2. Then use the lsl_recipe formulation, but apply it to each fold separately and pass the fits to DALEXtra::explain_tidymodels() and model_profile().

3. Use the outputs to generate the partial dependence profiles for each fold (see below).

set.seed(1107)

# load libraries
library(sf)
#> Linking to GEOS 3.11.0, GDAL 3.5.3, PROJ 9.1.0; sf_use_s2() is TRUE
library(tidymodels)
library(spatialsample)
library(DALEXtra)
#> Loading required package: DALEX
#> Welcome to DALEX (version: 2.4.3).
#> Find examples and detailed introduction at: http://ema.drwhy.ai/
#> Additional features will be available after installation of: ggpubr.
#> Use 'install_dependencies()' to get all suggested dependencies
#> 
#> Attaching package: 'DALEX'
#> The following object is masked from 'package:dplyr':
#> 
#>     explain
#> Anaconda not found on your computer. Conda related functionality such as create_env.R and condaenv and yml parameters from explain_scikitlearn will not be available

# example data set
data("lsl", "study_mask", package = "spDataLarge")
ta <- terra::rast(system.file("raster/ta.tif", package = "spDataLarge"))

lsl <- lsl |> 
  st_as_sf(coords = c("x", "y"), crs = "EPSG:32717")

lsl <- lsl |> 
  mutate(lslpts = factor(as.numeric(lslpts)-1)) |>
  mutate(
    cwts = ifelse(lslpts == 1, 1, sum(lslpts == 1) / sum(lslpts == 0)),
    cwts = hardhat::importance_weights(cwts)
  )
lsl_folds <- spatial_block_cv(lsl, method = "random", v = 10)

glm_model <- logistic_reg() |> 
  set_engine("glm") |> 
  set_mode("classification")

# First off, define your formula as its own object outside tidymodels functions
lsl_formula <- lslpts ~ slope + cplan + cprof + elev + log10_carea

# Then set up your recipe using the data.frame method to recipe
# so that we can explicitly say our cwts column should be used for case weights
lsl_recipe <- recipes::recipe(
  sf::st_drop_geometry(lsl),
  # Save the case weights column alongside our formula
  vars = c(all.vars(lsl_formula), "cwts"),
  # I'm assuming that you've only got one outcome variable here
  roles = c(
    "outcome", 
    rep("predictor", length(all.vars(lsl_formula)) - 1), 
    "case_weight"
  )
) # Other recipe steps go here for other preprocessing, but NOT 
# the case-weights creating recipe step

# Once your recipe is finished, create a sub-recipe that adds the 
# dynamic case-weights creating step
lsl_recipe_cv <- lsl_recipe |> 
  recipes::step_mutate(
    cwts = hardhat::importance_weights(
      ifelse(lslpts == 1, 1, sum(lslpts == 1) / sum(lslpts == 0))
    ),
    role = "case_weights"
  )

# Set up a primary workflow:
glm_wflow_wts <- workflow() |>
  add_model(glm_model) |> 
  add_case_weights(cwts)

# When doing anything relating to cross-validation, use the sub-recipe:
glm_wflow_wts |> 
  add_recipe(lsl_recipe_cv) |> 
  fit_resamples(lsl_folds) |> 
  collect_metrics()
#> # A tibble: 2 × 6
#>   .metric  .estimator  mean     n std_err .config             
#>   <chr>    <chr>      <dbl> <int>   <dbl> <chr>               
#> 1 accuracy binary     0.752    10  0.0220 Preprocessor1_Model1
#> 2 roc_auc  binary     0.824    10  0.0301 Preprocessor1_Model1

# Fit final model using the first recipe
glm_wflow_wts_fit <- glm_wflow_wts |>
  add_recipe(lsl_recipe)

# but manipulate the input data to generate PDP plots for separate folds
glm_final_model_fits_list <- 
  lapply(lsl_folds$splits, 
         FUN = function(x) fit(glm_wflow_wts_fit, # the primary workflow
                               analysis(x) |> # recompute the case weights within each fold's analysis set
                                 mutate(cwts = hardhat::importance_weights(
                                   ifelse(lslpts == 1, 1, sum(lslpts == 1) / sum(lslpts == 0))))))

# define a hack function
quantile.hardhat_importance_weights <- \(x, ...) rep(NA, length(x))

# a wrapper function so you don't need 10 chunks of explainer scripts
explain_wrapper <- function(input_model, input_data){
  # analysis set with geometry dropped and the case weights recomputed,
  # since the recipe expects the cwts column at prediction time
  pred <- input_data |> analysis() |> st_drop_geometry() |> 
    mutate(cwts = hardhat::importance_weights(ifelse(lslpts == 1, 1, sum(lslpts == 1) / sum(lslpts == 0))))
  # dplyr::select() returns a data frame, hence the "Converted to a
  # vector" warning in the output below
  lslpts <- input_data |> analysis() |> st_drop_geometry() |> dplyr::select(lslpts)
  pred_out <- explain_tidymodels(input_model, data = pred, y = lslpts) |> 
    model_profile(N = 1000, type = "partial")
  # keep only the aggregated profiles element of the model_profile object
  pred_out <- pred_out[[2]] |> as_tibble()
  return(pred_out)
}

# map explainer wrapper to the list of models and the 10 folds
glm_pdp_preds <- map2(glm_final_model_fits_list, lsl_folds$splits, explain_wrapper)
#> Preparation of a new explainer is initiated
#>   -> model label       :  workflow  (  default  )
#>   -> data              :  315  rows  7  cols 
#>   -> target variable   :  Argument 'y' was a data frame. Converted to a vector. (  WARNING  )
#>   -> target variable   :  315  values 
#>   -> predict function  :  yhat.workflow  will be used (  default  )
#>   -> predicted values  :  No value for predict function target column. (  default  )
#>   -> model_info        :  package tidymodels , ver. 1.0.0 , task classification (  default  ) 
#>   -> model_info        :  Model info detected classification task but 'y' is a factor .  (  WARNING  )
#>   -> model_info        :  By deafult classification tasks supports only numercical 'y' parameter. 
#>   -> model_info        :  Consider changing to numerical vector with 0 and 1 values.
#>   -> model_info        :  Otherwise I will not be able to calculate residuals or loss function.
#>   -> predicted values  :  numerical, min =  0.003623632 , mean =  0.499563 , max =  0.9803047  
#>   -> residual function :  difference between y and yhat (  default  )
#> Warning in Ops.factor(y, predict_function(model, data)): '-' not meaningful for
#> factors
#>   -> residuals         :  numerical, min =  NA , mean =  NA , max =  NA  
#>   A new explainer has been created!  
#> (near-identical explainer preparation output repeated for each of the remaining nine folds)
# format dataframe
glm_pdp_preds <- glm_pdp_preds |> bind_rows(.id = "id")
names(glm_pdp_preds) <- c("id", "var_name", "label", "x", "yhat", "ids")

# plot to check
ggplot() +
  theme_bw() +
  geom_line(aes(x = x, y = yhat, group = id),
            alpha = 0.5,
            data = glm_pdp_preds) +
  facet_wrap(~var_name, 
             scales = "free_x")


Created on 2023-06-05 with reprex v2.0.2

This still relies on your idea of using a function to override quantile.hardhat_importance_weights.

An alternative solution would be the addition of a helper function to the spatialsample package that adds a weights column to the spatialsample::spatial_block_cv object. The weights could then be passed as case weights without the quantile() workaround, which I think would resolve some of the issues further down the line. However, I'm happy to stick with this solution for now.
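Something like this purely hypothetical sketch is what I have in mind (no such function exists in spatialsample, and unlike the per-fold recomputation above it computes the weights once on the full data):

# Purely hypothetical sketch, not a real spatialsample function:
# write a weights column into the data stored in each split, so that
# downstream code finds cwts already in place.
add_fold_cwts <- function(folds) {
  folds$splits <- lapply(folds$splits, function(s) {
    s$data <- dplyr::mutate(
      s$data,
      cwts = hardhat::importance_weights(
        ifelse(lslpts == 1, 1, sum(lslpts == 1) / sum(lslpts == 0))
      )
    )
    s
  })
  folds
}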

Thanks again!

> An alternative solution would be the addition of a helper function to the spatialsample package that adds a weights column to the spatialsample::spatial_block_cv object. The weights could then be passed as case weights without the quantile() workaround, which I think would resolve some of the issues further down the line. However, I'm happy to stick with this solution for now.

I've been thinking about this for my own purposes recently, and have mostly (but not entirely) landed on "this is the wrong place in the system to address this". rsample, and by extension spatialsample, never alters the data you provide it; this is partly how the resamples are kept memory-efficient, but it also reflects that most tidymodels workflows expect to do their data wrangling inside recipes and tune. That said, particularly with spatial data, it might be nice to have easier ways to think about dynamic weights. I've been thinking about this in terms of IDW weighting, à la method 3 of de Bruin et al., but it could apply to a lot of different domains.
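To make that concrete, here's a rough sketch of the general shape only (not de Bruin et al.'s actual method; the normalization, and even whether near or far points should count more, is exactly the open design question):

# Rough sketch only: derive a weight for each analysis point from its
# distance to the nearest assessment point in the fold.
distance_weights <- function(split) {
  train <- analysis(split)
  test  <- assessment(split)
  d <- units::drop_units(sf::st_distance(train, test))
  nearest <- apply(d, 1, min)
  hardhat::importance_weights(nearest / max(nearest))
}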

All that to say: I wouldn't expect anything on this front in the immediate future... but I'm sure thinking about it.
