workflow and null model: response object not found

aliddell · June 1, 2020, 1:08pm

I'm new to workflows (and tidymodels in general), so hopefully the fix is obvious. I'm trying out the null_model interface on the Titanic dataset, adapting this section in the tidymodels docs:

reprex

Preamble:

library(tidyverse)
library(tidymodels)

set.seed(10191)

Load and split:

# train.csv can be found here: https://www.kaggle.com/c/titanic/data
df <- read_csv("train.csv")

split <- initial_split(df, prop=0.8, strata=Survived)
df_train = training(split)
df_test = testing(split)

Convert response variable to a factor:

null_rec <- recipe(Survived ~ ., data=df_train) %>% 
  step_mutate(Survived=factor(as.logical(Survived)))

Create null model:

null_mod <- null_model(mode="classification") %>% 
  set_engine("parsnip")

Combine recipe and model into a workflow:

null_wflow <- workflow() %>% 
  add_recipe(null_rec) %>%
  add_model(null_mod)

"Fit" the model:

null_fit <- null_wflow %>% 
  fit(data=df_train)

Try to make a prediction:

null_pred <- null_fit %>% 
  predict(df_test, type="prob") %>%
  bind_cols(df_test %>% select(Survived))

# Error in factor(as.logical(Survived)) : object 'Survived' not found

I see this error if I replicate the above steps with a logistic regression model as well.

system/package info

Windows 10
R: 4.0.0
RStudio: 1.2.5042
parsnip: 0.1.1
recipes: 0.1.12
tidymodels: 0.1.0
workflows: 0.1.1

Thanks!

nirgrahamuk · June 1, 2020, 4:27pm

I haven't used tidymodels myself, but it would make sense to me to prepare the variables in df_train and df_test in the same way. step_mutate(Survived=factor(as.logical(Survived))) is applied to df_train only; perhaps do it earlier in your flow, so that the mutation happens before the data is split?

aliddell · June 1, 2020, 4:51pm

No, as I understand it, when you fit the workflow with a recipe and a model, calling predict with the workflow and a new dataset should apply any transformations specified in the recipe to the new data. At least, that's what the next section in the tutorial suggests.

nirgrahamuk · June 1, 2020, 6:19pm

Sorry, I looked at this a little, but I have no idea how to do this the tidymodels way. It's only version 0.1, so maybe it's not fully featured yet.

In the tutorial you linked, the arr_delay outcome variable is defined before the recipe is created, indeed before even the split. So perhaps outcome/target variables are not suitable for step_mutate() and similar steps?

aliddell · June 1, 2020, 6:26pm

Wow, that sure did it.

This works:

df <- read_csv("train.csv") %>%
    mutate(Survived=factor(as.logical(Survived)))

and

null_rec <- recipe(Survived ~ ., data=df_train)

leaving everything else the same.

Thanks!

nirgrahamuk · June 1, 2020, 6:28pm

That's great. I suppose the downside is that if someone provided you with new data as a CSV, you'd have to convert Survived manually rather than rely on the recipe to do it?
Perhaps it's worth raising this as an issue on the recipes GitHub.
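
One way to keep that manual conversion in a single place might be a small loader function that every CSV goes through. This is only a sketch: read_titanic() is a hypothetical helper name, not something from tidymodels, and it assumes the file may or may not contain a Survived column coded as 0/1.

```r
library(readr)
library(dplyr)

# Hypothetical helper (not part of any package): read a Titanic-style CSV
# and convert Survived to a factor only when the column is present.
read_titanic <- function(path) {
  df <- read_csv(path)
  if ("Survived" %in% names(df)) {
    # Fix the levels explicitly so that every file yields the same factor,
    # even a file that happens to contain only one class.
    df <- mutate(df, Survived = factor(as.logical(Survived), levels = c(FALSE, TRUE)))
  }
  df
}
```

With a loader like this, the recipe can stay as recipe(Survived ~ ., data=df_train), and unlabeled new data passes through unchanged.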

aliddell · June 1, 2020, 6:30pm

Yes, I think I'll do that.

Max · June 1, 2020, 7:45pm

See the package vignette on skipping steps. In general, when baking new data (i.e. executing the finished recipe), you can't be sure that the outcome will be available. Skipping steps that involve the outcome is the way to get around this.
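
A minimal sketch of the difference, using recipes directly rather than a workflow. The data here is an assumption for illustration: mtcars stands in for the Titanic file, with its 0/1 column am playing the role of Survived, and new_obs is a made-up "new data" set that lacks the outcome.

```r
library(recipes)

# Toy stand-in for the Titanic data (an assumption, not the thread's data).
train   <- mtcars[1:24, c("am", "mpg", "wt")]
new_obs <- mtcars[25:32, c("mpg", "wt")]  # new data with no outcome column

# Without skip: the step also runs at bake() time, so it needs `am`.
rec_noskip <- recipe(am ~ ., data = train) %>%
  step_mutate(am = factor(as.logical(am)))

# Capture the error message instead of stopping the script.
noskip_result <- tryCatch(
  bake(prep(rec_noskip), new_data = new_obs, all_predictors()),
  error = function(e) conditionMessage(e)
)

# With skip = TRUE: the step is applied to the training data during prep()
# and is skipped whenever new data is baked.
rec_skip <- recipe(am ~ ., data = train) %>%
  step_mutate(am = factor(as.logical(am)), skip = TRUE)

prepped <- prep(rec_skip)
train_processed <- juice(prepped)  # processed training set, `am` is a factor
new_processed   <- bake(prepped, new_data = new_obs, all_predictors())
```

One consequence of skip = TRUE is that an am column present in new data is left as 0/1 by bake(), so it still has to be converted by hand before comparing it with predictions.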

aliddell · June 2, 2020, 1:23am

That's a good point. I just tried it, skipping mutate when loading the data and putting step_mutate back in the recipe with skip=TRUE, and I can fit the model without any problems. But of course it makes more sense to do the mutate when loading, rather than having to mutate twice.

Thanks!
