Tidymodels: tune::last_fit() fails when using rsample::initial_time_split() instead of rsample::initial_split()

I created a tidymodels pipeline that works fine until I change how the rsplit object is created, from initial_split() to initial_time_split(). After that change, last_fit() fails with this error:

x : internal: Error in data.frame(..., check.names = FALSE): arguments imply differing number of rows: 2, 0
Warning message:
All models failed in [fit_resamples()]. See the .notes column.

I need to use initial_time_split() because I have a specific set for training and another for testing (they cover different time periods), so a random draw of rows doesn't work.
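
For context, initial_time_split() takes the first `prop` share of rows as the training set, so a flag-based split only lines up if the flagged rows come first. Here is a minimal sketch on made-up data, assuming `train` is a 0/1 flag (the toy tibble and the `toy_*` names are hypothetical, not the real dataset):

```r
library(rsample)
library(dplyr)

# Hypothetical toy data: `train` is assumed to be a 0/1 flag
# marking the rows that belong to the training period
toy <- tibble::tibble(
  id    = 1:8,
  train = c(1, 0, 1, 1, 1, 0, 1, 1)
)

# Sort so that all flagged training rows come first
toy_sorted <- arrange(toy, desc(train))

# Share of training rows; 6 / 8 = 0.75 here
toy_prop <- mean(toy_sorted$train == 1)

toy_split <- initial_time_split(toy_sorted, prop = toy_prop)

# Check that the split matches the flag exactly
stopifnot(all(training(toy_split)$train == 1))
stopifnot(all(testing(toy_split)$train == 0))
```

With real data it is worth running the same two checks, because the training size is derived from `prop` times the number of rows and could land one row off the intended boundary.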

Does anyone know how to fix this problem, or know a better way to do this? Thank you very much for your time!

Here is the code that produces the error. Note that if you swap the df_split line after the ATTENTION comment for the commented-out line below it, the code runs fine.

library(tidymodels)
library(dplyr)
library(fastDummies)


#download the data
data = read.csv("https://uc4f84ae07955bebed2c3804d381.dl.dropboxusercontent.com/cd/0/get/A-PcNiWAKII0M8OlwmYxE1fYXFhtTUPnLw2x_AvL3IlUR_HE8_IPdTVPaYj1mtQwByPgcq2qpj-bfb4O8-wgW4rqgAnff4cLbNSSGe44FewPUmxenJZBpvxXikDQUyVVXXY/file?_download_id=89899030331414668534005769335371927551009313297349360886651029534&_notify_domain=www.dropbox.com&dl=1")
data = data %>%
  select(-X) %>%
  dummy_cols(remove_selected_columns = TRUE) %>%
  mutate(label = as.factor(label))
  

#Creation of train and test splits
set.seed(123)

proportion = sum(data$train)/nrow(data) #proportion of train observations in the data

# ATTENTION #
df_split = initial_time_split(data, prop = proportion) # this line seems to cause the problem
#df_split = initial_split(data, prop = 3/4, strata = label) # using this line instead of the previous one avoids the error


df_train <- training(df_split)
df_test  <- testing(df_split)

#Recipe
recipe <- 
  recipe(label ~ ., data = df_train) %>%
  update_role(x, y, train, new_role = "ID")

#model specification
cores = parallel::detectCores() 

xgb_spec <- boost_tree(
  trees = tune(), 
  tree_depth = tune(),
  min_n = tune(), 
  loss_reduction = tune(),                     
  sample_size = tune(),
  mtry = tune(),         
  learn_rate = tune()
) %>% 
  set_engine("xgboost", nthread = cores) %>% 
  set_mode("classification")

#Hyperparameters
set.seed(123)
xgb_grid <- grid_max_entropy(
  trees(),
  tree_depth(),
  min_n(),
  loss_reduction(),
  sample_size = sample_prop(),
  finalize(mtry(), df_train),
  learn_rate(),
  size = 7
)

#Cross validation
set.seed(123)
folds <- vfold_cv(df_train, v = 2, strata = label) 

#Model workflow
xgb_workflow <- 
  workflow() %>% 
  add_model(xgb_spec) %>% 
  add_recipe(recipe)


### Training###
set.seed(123)
xgb_res = xgb_workflow %>% 
  tune_grid(resamples = folds,
            grid = xgb_grid,
            metrics = metric_set(roc_auc)) 
#selection of best model
best_auc <- select_best(xgb_res, "roc_auc")

#adding best model to workflow
final_xgb <- finalize_workflow(
  xgb_workflow,
  best_auc)

#Last fit
final_res <- last_fit(object = final_xgb, split = df_split)

x : internal: Error in data.frame(..., check.names = FALSE): arguments imply differing number of rows: 2, 0
Warning message:
All models failed in [fit_resamples()]. See the `.notes` column.

Session Info

R version 4.0.2 (2020-06-22)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 18362)

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252 LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C LC_TIME=English_United States.1252

attached base packages:
[1] stats graphics grDevices utils datasets methods base

other attached packages:
[1] fastDummies_1.6.1 xgboost_1.1.1.1 vip_0.2.2 themis_0.1.2 yardstick_0.0.7 workflows_0.1.3 tune_0.1.1 tidyr_1.1.1 tibble_3.0.3 rsample_0.0.7 recipes_0.1.13 purrr_0.3.4 parsnip_0.1.3 modeldata_0.0.2 infer_0.5.3 ggplot2_3.3.2 dplyr_1.0.2 dials_0.0.8 scales_1.1.1 broom_0.7.0 tidymodels_0.1.1

I'm no expert, and I can't download your test data, but could it be that the labels are not evenly distributed through your dataset, so that the split with initial_time_split() gives you a testing set without one of the labels? That is exactly the error message you would get in that case.
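
A quick way to check this is to tabulate the outcome in each partition. The sketch below uses a small made-up data frame (the `toy_*` names and the "a"/"b" labels are hypothetical) in which the last rows happen to contain only one class:

```r
library(rsample)

# Hypothetical toy data: the last two rows both have label "a"
toy <- data.frame(
  x     = 1:8,
  label = factor(c("a", "a", "b", "a", "b", "a", "a", "a"))
)

# The first 6 rows become training, the last 2 become testing
toy_split <- initial_time_split(toy, prop = 0.75)

train_counts <- table(training(toy_split)$label)
test_counts  <- table(testing(toy_split)$label)

train_counts  # a: 4, b: 2
test_counts   # a: 2, b: 0

# TRUE when some class never appears in the testing set
any(test_counts == 0)
```

Because the factor keeps both levels, a missing class shows up as a zero count rather than disappearing from the table.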

Thank you very much for your response! I just updated the data link, sorry for that.
Regarding your suggestion, both df_train and df_test include the same labels, so maybe the problem is somewhere else.

It is a bug. I just figured out the issue but Athos beat me to it.
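
Until a fixed version is installed, one possible workaround (a sketch, not something confirmed in this thread) is to skip last_fit() and do the final step by hand: fit the finalized workflow on the training set, then predict on the testing set. The example below is self-contained and uses a made-up dataset with a logistic regression in place of the tuned xgboost workflow; in the original code, `final_xgb` and `df_split` would take the place of the hypothetical `toy_wf` and `toy_split`.

```r
library(tidymodels)

set.seed(123)

# Hypothetical toy data with a two-class outcome
toy <- tibble(
  x1    = rnorm(100),
  x2    = rnorm(100),
  label = factor(ifelse(x1 + rnorm(100) > 0, "yes", "no"),
                 levels = c("yes", "no"))
)

toy_split <- initial_time_split(toy, prop = 0.75)

# Stand-in for the finalized workflow
toy_wf <- workflow() %>%
  add_recipe(recipe(label ~ ., data = training(toy_split))) %>%
  add_model(logistic_reg() %>% set_engine("glm"))

# Fit on the training rows only
toy_fit <- fit(toy_wf, data = training(toy_split))

# Class probabilities for the held-out rows, joined to the true labels
toy_pred <- predict(toy_fit, new_data = testing(toy_split), type = "prob") %>%
  bind_cols(testing(toy_split))

# Test-set ROC AUC; "yes" is the first level, so .pred_yes is the event column
toy_auc <- roc_auc(toy_pred, truth = label, .pred_yes)
toy_auc
```

This gives the same kind of held-out estimate that last_fit() would report, without going through the resampling code path that produced the error.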

One note about your data though... it might be problematic to do classification on logical outcomes that are converted to factors. There are very few reserved words in R but you might be tempting fate a bit by doing this.
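
To illustrate the point about logical outcomes: converting a logical vector with as.factor() produces the level names "FALSE" and "TRUE", which then end up in things like prediction column names. A minimal sketch of recoding to descriptive levels instead (the "yes"/"no" names and the `label_*` variables are only an example):

```r
# Hypothetical logical outcome, as read.csv() would produce
# from a column of TRUE/FALSE values
label_lgl <- c(TRUE, FALSE, TRUE, TRUE)

# Direct conversion keeps the reserved words as level names
levels(as.factor(label_lgl))
# "FALSE" "TRUE"

# Recode to descriptive levels, with the event of interest first
label_fct <- factor(label_lgl,
                    levels = c(TRUE, FALSE),
                    labels = c("yes", "no"))

levels(label_fct)
# "yes" "no"
```

In the original pipeline this would replace the `mutate(label = as.factor(label))` step, for example with `mutate(label = factor(label, levels = c(TRUE, FALSE), labels = c("yes", "no")))`.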

Thank you very much Max! I will follow your advice regarding the labels as well.
