workflow and null model: response object not found

I'm new to workflows (and tidymodels in general), so hopefully the fix is obvious. I'm trying out the null_model interface on the Titanic dataset, adapting this section in the tidymodels docs:

reprex

Preamble:

library(tidyverse)
library(tidymodels)

set.seed(10191)

Load and split:

# train.csv can be found here: https://www.kaggle.com/c/titanic/data
df <- read_csv("train.csv")

split <- initial_split(df, prop=0.8, strata=Survived)
df_train <- training(split)
df_test <- testing(split)

Convert response variable to a factor:

null_rec <- recipe(Survived ~ ., data=df_train) %>% 
  step_mutate(Survived=factor(as.logical(Survived)))

Create null model:

null_mod <- null_model(mode="classification") %>% 
  set_engine("parsnip")

Combine recipe and model into a workflow:

null_wflow <- workflow() %>% 
  add_recipe(null_rec) %>%
  add_model(null_mod)

"Fit" the model:

null_fit <- null_wflow %>% 
  fit(data=df_train)

Try to make a prediction:

null_pred <- null_fit %>% 
  predict(df_test, type="prob") %>%
  bind_cols(df_test %>% select(Survived))

# Error in factor(as.logical(Survived)) : object 'Survived' not found

I see this error if I replicate the above steps with a logistic regression model as well.

system/package info

  • Windows 10
  • R: 4.0.0
  • RStudio: 1.2.5042
  • parsnip: 0.1.1
  • recipes: 0.1.12
  • tidymodels: 0.1.0
  • workflows: 0.1.1

Thanks!

I haven't used tidymodels myself, but it would make sense to me to prepare the variables in df_train and df_test in the same way. step_mutate(Survived=factor(as.logical(Survived))) is applied to df_train only; perhaps do the conversion earlier in your flow, so that it happens before the data is split?

No, as I understand it, when you fit the workflow with a recipe and a model, calling predict with the workflow and a new dataset should apply any transformations specified in the recipe to the new data. At least, that's what the next section in the tutorial suggests.

Sorry, I looked at this a little, but I have no idea how to do this the tidymodels way. It's only version 0.1, so maybe it's not fully featured yet.

In the tutorial you linked, the arr_delay outcome variable is defined before the recipe is created, indeed before even the split. So perhaps outcome/target variables are not suitable for step_mutate() and similar steps?

Wow, that sure did it.

This works:

df <- read_csv("train.csv") %>%
    mutate(Survived=factor(as.logical(Survived)))

and

null_rec <- recipe(Survived ~ ., data=df_train)

leaving everything else the same.

Thanks!


That's great. I suppose the downside is that if someone gave you new data as a CSV, you'd have to convert Survived manually rather than rely on the recipe to do it?
Perhaps it's worth raising this as an issue on the recipes GitHub.
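
One way to keep that manual conversion consistent across files is to wrap it in a small function and call it on every data frame right after loading. The sketch below is only an illustration: prep_outcome() is a hypothetical helper name, not part of tidymodels, and the toy data frames stand in for real CSV files.

```r
library(dplyr)

# Hypothetical helper (not a tidymodels function): apply the same outcome
# conversion to any incoming data frame, and leave the data untouched when
# Survived is absent, as it would be for unlabeled data.
prep_outcome <- function(data) {
  if ("Survived" %in% names(data)) {
    data <- data %>%
      # Fixing the levels keeps both classes even if a file contains only one
      mutate(Survived = factor(as.logical(Survived), levels = c(FALSE, TRUE)))
  }
  data
}

# Toy stand-ins for data read with read_csv()
labeled   <- prep_outcome(tibble(Survived = c(0, 1, 1), Age = c(22, 38, 26)))
unlabeled <- prep_outcome(tibble(Age = c(30, 41)))
```

With this approach the recipe itself stays as recipe(Survived ~ ., data=df_train), and the conversion lives in one place outside it.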


Yes, I think I'll do that.

See the package vignette on skipping steps. In general, when baking new data (i.e. executing the finished recipe) you can't ensure that the outcome will be available. Skipping steps that involve the outcome is the way to get around this.
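
To make the point about baking concrete, here is a minimal sketch using recipes on its own, with a tiny made-up data frame (the values are invented, not taken from the Titanic file). It assumes a reasonably recent recipes version, where bake() takes a new_data argument. The same step is prepped twice, once with the default skip=FALSE and once with skip=TRUE, and then baked on new data that has no Survived column.

```r
library(recipes)

# Made-up training data and outcome-free new data (illustrative values only)
toy      <- data.frame(Survived = c(0, 1, 1, 0), Age = c(22, 38, 26, 35))
new_data <- data.frame(Age = c(30, 41))

# Default behaviour: the step is also applied when baking new data
rec_noskip <- recipe(Survived ~ ., data = toy) %>%
  step_mutate(Survived = factor(as.logical(Survived))) %>%
  prep()

# skip = TRUE: the step runs during prep() on the training set,
# but is skipped when bake() is called on new data
rec_skip <- recipe(Survived ~ ., data = toy) %>%
  step_mutate(Survived = factor(as.logical(Survived)), skip = TRUE) %>%
  prep()

# Baking without the outcome fails for the non-skipped step...
noskip_result <- try(bake(rec_noskip, new_data = new_data), silent = TRUE)

# ...and succeeds for the skipped one
skip_result <- bake(rec_skip, new_data = new_data)

# The training set kept inside the prepped recipe still has the factor outcome
trained <- bake(rec_skip, new_data = NULL)
```

Here noskip_result holds the captured error rather than a data frame, while skip_result contains only the Age column.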


That's a good point. I just tried it, skipping mutate when loading the data and putting step_mutate back in the recipe with skip=TRUE, and I can fit the model without any problems. But of course it makes more sense to do the mutate when loading, rather than having to mutate twice.
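
For anyone landing here later, a sketch of the skip=TRUE variant in the workflow from the original post might look like the following. It uses a small synthetic data frame in place of train.csv (the values are random and purely illustrative), and the strata argument is dropped to keep the toy example simple; everything else follows the code in the question.

```r
library(tidymodels)

set.seed(10191)

# Synthetic stand-in for the Titanic data (random values, not the Kaggle file)
toy <- tibble(
  Survived = rep(c(0, 1), times = 50),
  Age      = runif(100, 1, 80),
  Fare     = runif(100, 5, 100)
)

split     <- initial_split(toy, prop = 0.8)
toy_train <- training(split)
toy_test  <- testing(split)

# The outcome conversion stays in the recipe, but is skipped at predict time
null_rec <- recipe(Survived ~ ., data = toy_train) %>%
  step_mutate(Survived = factor(as.logical(Survived)), skip = TRUE)

null_mod <- null_model(mode = "classification") %>%
  set_engine("parsnip")

null_fit <- workflow() %>%
  add_recipe(null_rec) %>%
  add_model(null_mod) %>%
  fit(data = toy_train)

# Class probabilities for the held-out rows
null_pred <- predict(null_fit, toy_test, type = "prob")
```

Note that because the step is skipped on new data, Survived in toy_test is still numeric, so it would need converting separately before comparing it against the predictions.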

Thanks!
