xgboost works with add_formula but not with recipe

Hi! I'm trying to fit an xgboost model (regression) for some Airbnb data. I'm using the tidymodels framework. I go through my usual steps when working with tidymodels:

  1. Split data
data_split <- initial_split(listings_regre,
                            strata = "y",
                            prop = 0.8)
data_train <- training(data_split)
data_test  <- testing(data_split)
  2. Create recipe
rec <- recipe(y ~ ., data = data_train) %>%
  step_nzv(all_nominal()) %>%
  step_dummy(all_nominal())
  3. Create model
xgb_mod <-
  boost_tree() %>%
  set_engine('xgboost') %>%
  set_mode('regression')
  4. Create workflow
xgb_flow <- workflow() %>%
  add_model(xgb_mod) %>%
  add_recipe(rec)
  5. Fit model
xgb_fit <- xgb_flow %>%
  last_fit(split = data_split)

Then I get:

preprocessor 1/1, model 1/1: Error in xgboost::xgb.DMatrix(x, label = y, missing = NA): 'data' has class 'character' and length 682192.
  'data' accepts either a numeric matrix or a single filename.

But if I change the workflow to

xgb_flow <- workflow() %>%
  add_model(xgb_mod) %>% 
  add_formula(y ~ .)

Everything works just fine.

I understood from here that both of these should work, but that is not what happens. Does anybody know what is wrong with my recipe? I prefer working with recipes, so I'd rather use the first option.

Thank you in advance

You haven't provided a reprex, so it might be difficult to help you directly with what you are doing.

But I'm going to go out on a limb and guess that the issue is with the data type of the y column: the recipe is probably transforming all the predictor variables but not the outcome, and xgboost doesn't know how to handle a character outcome?

I'm away from my computer at the moment, so I can't yet test my theory on a made-up example.
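
One way to test a guess like this is to prep and bake the recipe and list every column that is still not numeric afterwards. The sketch below uses a small made-up data frame (the names toy, created and kind are hypothetical, not from the original data) and only assumes the recipes package is installed. It also shows that calling as.matrix() on a data frame with a leftover non-numeric column yields a character matrix, which would be consistent with the 'data' has class 'character' error, assuming the engine receives the baked data as a matrix.

```r
library(recipes)

# Hypothetical toy data: numeric outcome, one Date column, one factor
toy <- data.frame(
  y       = c(1.5, 2.0, 3.2, 4.1),
  created = as.Date("2021-02-08") + 0:3,
  kind    = factor(c("a", "b", "a", "b"))
)

# Same pattern as the recipe in the question: only nominal columns are handled
rec <- recipe(y ~ ., data = toy) %>%
  step_dummy(all_nominal())

# Apply the recipe to the data it was trained on
baked <- bake(prep(rec), new_data = NULL)

# Keep only the columns that are still not numeric after preprocessing
non_numeric <- Filter(function(col) !is.numeric(col), baked)

# First class of each offending column
vapply(non_numeric, function(col) class(col)[1], character(1))

# A single leftover non-numeric column turns the whole matrix into character
is.character(as.matrix(baked))
```

Any column reported here needs its own recipe step before the data can reach xgboost as a numeric matrix.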

You're right, no reprex was provided. Here's my attempt to do so:

Data from: Get the Data - Inside Airbnb (the first one, from Amsterdam; download listings.csv.gz)
My first, failed attempt:

listings <- read_csv(here("listings.csv") )

listings <- listings %>%
  mutate(Price = parse_number(Price),
         across(where(is.character), as.factor)
  )

data_split <- initial_split(listings,
                            strata = "Price",
                            prop = 0.8)
train <- training(data_split)
test  <- testing(data_split)

rec_xgb <- recipe(Price ~ ., data =  train) %>%  
  step_nzv(all_nominal()) %>%
  step_dummy(all_nominal())

xgb_mod <-
  boost_tree() %>% 
  set_engine('xgboost') %>%
  set_mode('regression')

xgb_flow <- workflow() %>%
  add_model(xgb_mod) %>% 
  add_recipe(rec_xgb)

xgb_fit <- xgb_flow %>% 
  last_fit(split = data_split) 

Then changed the workflow to:

xgb_flow <- workflow() %>%
  add_model(xgb_mod) %>%
  add_formula(Price ~ .)

And it worked.

As you can see, my dependent variable is not a character; it is numeric.

Thank you for your response and help. I hope this clears up the situation.

I don't think your demonstration shows that. Doesn't it rather demonstrate that your flow executes when using the formula approach, despite price being character?

Furthermore, isn't the column called price rather than Price in the data?


listings <- read_csv("http://data.insideairbnb.com/the-netherlands/north-holland/amsterdam/2021-02-08/data/listings.csv.gz")
str(listings$price)

I'm sorry for the poor reprex. I couldn't reproduce my problem as it is because we did a lot of preprocessing and translation of variables.

Nonetheless, I think we've figured it out:

Dates were the problem: we were passing date columns to the model without any coercion, and that was what caused the error. I will look into step_date and other methods that may help with this issue.

Thank you so much for all your help.
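
For anyone landing here with the same error, this is a minimal sketch of the step_date approach mentioned above. It is not the preprocessing from the original project: the data frame and the column names (toy, last_seen, room_type) are made up for illustration, and the choice of date features is an assumption. The idea is to derive features from the date, drop the raw Date column, and dummy-code whatever nominal predictors remain.

```r
library(recipes)

# Hypothetical toy data: numeric outcome, a Date column, and a factor
toy <- data.frame(
  y         = c(10, 12, 9, 15, 11, 14),
  last_seen = as.Date("2021-01-01") + c(0, 31, 59, 90, 120, 151),
  room_type = factor(c("a", "b", "a", "b", "a", "b"))
)

rec <- recipe(y ~ ., data = toy) %>%
  # Derive year, month and day-of-week features from the date
  step_date(last_seen, features = c("year", "month", "dow")) %>%
  # Drop the raw Date column, which xgboost cannot consume
  step_rm(last_seen) %>%
  # Dummy-code nominal predictors, including the new month and dow factors
  step_dummy(all_nominal_predictors())

baked <- bake(prep(rec), new_data = NULL)

# Every remaining column should now be numeric
all(vapply(baked, is.numeric, logical(1)))
```

A recipe built this way can be passed to add_recipe() in the workflow in the same way as the original one.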

Glad you figured it out, and thanks for sharing the date info.
