Inconsistent error message: "Error: `x` must be a vector, not a `rsplit/vfold_split` object"

I am encountering the above error message when applying crossing() to what I think is a nested data frame, though I'm not sure:

model_ranger <- train_cv %>% 
  crossing(mtry = c(1,2)) # %>%

Sometimes results in:

Error: x must be a vector, not a rsplit/vfold_split object

The error is intermittent: if I keep rerunning the code block, it sometimes works. (Discovered by a combination of accident and desperation.)

The object in question:

> class(train_cv)
[1] "vfold_cv"   "rset"       "tbl_df"     "tbl"        "data.frame"
> train_cv
#  5-fold cross-validation using stratification 
# A tibble: 5 x 4
  splits            id    train                   validate               
* <named list>      <chr> <named list>            <named list>           
1 <split [72K/18K]> Fold1 <df[,11] [72,000 × 11]> <df[,11] [18,001 × 11]>
2 <split [72K/18K]> Fold2 <df[,11] [72,001 × 11]> <df[,11] [18,000 × 11]>
3 <split [72K/18K]> Fold3 <df[,11] [72,001 × 11]> <df[,11] [18,000 × 11]>
4 <split [72K/18K]> Fold4 <df[,11] [72,001 × 11]> <df[,11] [18,000 × 11]>
5 <split [72K/18K]> Fold5 <df[,11] [72,001 × 11]> <df[,11] [18,000 × 11]>

I arrived here with the following block of code, where pdata, my starting point, is a regular data frame.

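library(dplyr)   # %>% and mutate()
library(purrr)   # map()
library(tidyr)   # crossing()
library(rsample) # initial_split(), vfold_cv(), training(), testing()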

# create train test split
set.seed(123)
pdata_split <- initial_split(pdata, 0.9)
training_data <- training(pdata_split)
testing_data <- testing(pdata_split)

# 5 fold split stratified on spender
train_cv <- vfold_cv(training_data, 5, strata = spender) %>% 
  
  # create training and validation sets within each fold
  mutate(train = map(splits, ~training(.x)), 
         validate = map(splits, ~testing(.x)))

Now that I've created the splits, I have separate code blocks that fit a ranger random forest and an xgb model on the same folds. For ranger I start with:

model_ranger <- train_cv %>% 
  crossing(mtry = c(1,2)) # %>% 

Error: x must be a vector, not a rsplit/vfold_split object

I tried to recreate this using diamonds built in dataset, but it worked. It's just with my actual data this happens. Intermittently.

Any ideas on how to solve?

A reproducible example, called a reprex, will attract more and likely better answers.

Here's an example:

suppressPackageStartupMessages(library(dplyr))
suppressPackageStartupMessages(library(purrr)) 
suppressPackageStartupMessages(library(rsample))
pdata <- iris
pdata_split <- initial_split(pdata, 0.9)
training_data <- training(pdata_split)
testing_data <- testing(pdata_split)
train_cv <- vfold_cv(training_data, 5, strata = Species) %>% mutate(train = map(splits, ~training(.x)), validate = map(splits, ~testing(.x)))
(model_ranger <- train_cv %>% crossing(mtry = c(1,2)))
#> # A tibble: 10 x 5
#>    splits           id    train              validate           mtry
#>    <named list>     <chr> <named list>       <named list>      <dbl>
#>  1 <split [107/29]> Fold1 <df[,5] [107 × 5]> <df[,5] [29 × 5]>     1
#>  2 <split [107/29]> Fold1 <df[,5] [107 × 5]> <df[,5] [29 × 5]>     2
#>  3 <split [108/28]> Fold2 <df[,5] [108 × 5]> <df[,5] [28 × 5]>     1
#>  4 <split [108/28]> Fold2 <df[,5] [108 × 5]> <df[,5] [28 × 5]>     2
#>  5 <split [109/27]> Fold3 <df[,5] [109 × 5]> <df[,5] [27 × 5]>     1
#>  6 <split [109/27]> Fold3 <df[,5] [109 × 5]> <df[,5] [27 × 5]>     2
#>  7 <split [110/26]> Fold4 <df[,5] [110 × 5]> <df[,5] [26 × 5]>     1
#>  8 <split [110/26]> Fold4 <df[,5] [110 × 5]> <df[,5] [26 × 5]>     2
#>  9 <split [110/26]> Fold5 <df[,5] [110 × 5]> <df[,5] [26 × 5]>     1
#> 10 <split [110/26]> Fold5 <df[,5] [110 × 5]> <df[,5] [26 × 5]>     2

Created on 2020-01-08 by the reprex package (v0.3.0)

It's hard to figure out the source of the error without knowing the actual pdata input or a faux-pdata with the same structure.

Hi. Yes, I did try using the diamonds dataset. I created folds using rsample::vfold_cv() but each time I tried using crossing() it did work.

I cannot share my own rather large dataset since it's company data.

What makes this problem hard is that if I understood why it sometimes works and why I cannot reproduce on other data, I'd probably know how to solve it.
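
One thing that might help narrow down intermittent behaviour is recording which crossing() the session actually resolves to, along with the versions of the packages named in the error. This is only a diagnostic sketch, not an explanation of the cause. It assumes tidyr, vctrs and rsample are installed, and the variable names are illustrative.

```r
library(tidyr)

# Which namespace does the unqualified crossing() resolve to?
# Anything other than "tidyr" would suggest masking by another package or object.
crossing_source <- environmentName(environment(crossing))
print(crossing_source)

# Versions of the packages involved in the error message
pkg_versions <- sapply(
  c("tidyr", "vctrs", "rsample"),
  function(p) as.character(packageVersion(p))
)
print(pkg_versions)

# Names that are defined in more than one place on the search path
print(conflicts(detail = FALSE))
```

Comparing this output between a run that fails and a run that succeeds may show whether the session state differs between them.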

pdata, the data used for the splits, is just a regular df:

> pdata %>% glimpse()
Observations: 1,000,000
Variables: 11
$ s                        <chr> "IDFV-FEDC6007-08AC-4810-88A1-F7176467F387", "7081C69E-ECE2-4E39-B7AC-3A58B129E7DE", "8BBD5…
$ IOS                      <dbl> 1, 0, 1, 0, 0, 1, 0, 0, 1, 1, 0, 1, 0, 1, 1, 0, 0, 0, 1, 0, 1, 0, 0, 1, 1, 1, 1, 0, 0, 1, 0…
$ is_publisher_organic     <dbl> 1, 1, 0, 0, 1, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1…
$ is_publisher_facebook    <dbl> 0, 0, 0, 0, 0, 1, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0…
$ sessions_d7              <dbl> 1, 1, 12, 1, 1, 1, 2, 1, 1, 2, 8, 12, 3, 1, 3, 14, 2, 4, 1, 1, 2, 1, 1, 2, 14, 11, 1, 4, 2,…
$ sum_session_time_secs_d7 <dbl> 106, 800, 19426, 1431, 1323, 196, 4011, 288, 1152, 4005, 10352, 13402, 4171, 5646, 170, 192…
$ d7_utility_sum           <dbl> 1.65927871, 11.00098870, 211.61885361, 19.43254448, 17.89554574, 2.57431089, 49.26038019, 4…
$ recent_utility_ratio     <dbl> 1.00, 1.00, 0.86, 1.00, 1.00, 1.00, 1.00, 1.00, 1.00, 1.00, 0.55, 0.95, 0.24, 1.00, 0.41, 1…
$ spend_7d                 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 192, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ spend_30d                <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 192, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ spender                  <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, …
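
Since the real data cannot be shared, a faux pdata with the same column names and types as the glimpse() output above could serve as a shareable stand-in. The sketch below is an assumption-laden simulation: every value is randomly generated, the distributions are arbitrary, and names such as faux_pdata and faux_cv are hypothetical.

```r
library(dplyr)
library(purrr)
library(rsample)

set.seed(123)
n <- 1000  # far smaller than the real data; increase to test whether size matters

# Hypothetical stand-in for pdata: same column names and types, invented values
faux_pdata <- tibble(
  s = paste0("ID-", seq_len(n)),
  IOS = as.numeric(rbinom(n, 1, 0.5)),
  is_publisher_organic = as.numeric(rbinom(n, 1, 0.3)),
  is_publisher_facebook = as.numeric(rbinom(n, 1, 0.3)),
  sessions_d7 = as.numeric(rpois(n, 3) + 1),
  sum_session_time_secs_d7 = round(rexp(n, 1 / 3000)),
  d7_utility_sum = rexp(n, 1 / 20),
  recent_utility_ratio = round(runif(n), 2),
  spend_7d = ifelse(runif(n) < 0.3, round(rexp(n, 1 / 100)) + 1, 0),
  spend_30d = spend_7d,
  spender = spend_30d > 0
)

# Same steps as the original post, applied to the simulated data
faux_split <- initial_split(faux_pdata, prop = 0.9)
faux_train <- training(faux_split)

faux_cv <- vfold_cv(faux_train, v = 5, strata = spender) %>%
  mutate(train = map(splits, training),
         validate = map(splits, testing))
```

If calling crossing(mtry = c(1, 2)) on faux_cv reproduces the error, the whole script can be posted as a reprex; if it never does, that suggests the issue depends on something other than the column structure.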

Have you screened for NAs and other data problems? I'm on iOS so can't check, but base R has a complete.cases() function for this, I think.

Yes, all rows are complete cases. The data frame with list columns is at the top of my post:

> class(train_cv)
[1] "vfold_cv"   "rset"       "tbl_df"     "tbl"        "data.frame"
> train_cv
#  5-fold cross-validation using stratification 
# A tibble: 5 x 4
  splits            id    train                   validate               
* <named list>      <chr> <named list>            <named list>           
1 <split [72K/18K]> Fold1 <df[,11] [72,000 × 11]> <df[,11] [18,001 × 11]>
2 <split [72K/18K]> Fold2 <df[,11] [72,001 × 11]> <df[,11] [18,000 × 11]>
3 <split [72K/18K]> Fold3 <df[,11] [72,001 × 11]> <df[,11] [18,000 × 11]>
4 <split [72K/18K]> Fold4 <df[,11] [72,001 × 11]> <df[,11] [18,000 × 11]>
5 <split [72K/18K]> Fold5 <df[,11] [72,001 × 11]> <df[,11] [18,000 × 11]>

The data within each fold is complete too.

I was thinking of the source data. Although train is complete, do we know that its lists are? I'm speculating, because I haven't thought about the role of sampling in the workflow. If sampling is involved and pdata has some holes in it, that would explain the intermittent nature of the errors: some draws sweep in data that jam up calculations. Since iris and diamonds are clean, we'd never see that. Just a thought.
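
The question of whether the list-columns themselves are complete can be checked fold by fold. The following is a minimal sketch that uses iris as a stand-in for the real training data; demo_cv and fold_check are illustrative names, and the same map_lgl() calls could be pointed at train_cv instead.

```r
library(dplyr)
library(purrr)
library(rsample)

set.seed(123)
# iris stands in for the real training data here
demo_cv <- vfold_cv(iris, v = 5, strata = Species) %>%
  mutate(train = map(splits, training),
         validate = map(splits, testing))

# One row per fold: are all rows complete, and is every split an rsplit object?
fold_check <- tibble(
  id = demo_cv$id,
  train_complete = map_lgl(demo_cv$train, ~ all(complete.cases(.x))),
  validate_complete = map_lgl(demo_cv$validate, ~ all(complete.cases(.x))),
  splits_are_rsplit = map_lgl(demo_cv$splits, ~ inherits(.x, "rsplit"))
)
print(fold_check)
```

A FALSE in any column would identify the fold worth inspecting more closely.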

Hi, thanks for the suggestion. I just checked, and the input data frame to vfold_cv() is indeed complete:

> pdata %>% complete.cases() %>% table()
.
   TRUE 
3055787 
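
With the data itself appearing complete, one possible way to sidestep the error (without explaining it) is to keep the rsplit list-column out of crossing() altogether: cross plain row positions with the tuning values, then look the splits up by position. This is a sketch under that assumption, using iris as a stand-in; grid and model_grid are hypothetical names, and base expand.grid() is swapped in for crossing().

```r
library(dplyr)
library(purrr)
library(rsample)

set.seed(123)
# iris stands in for the real training data here
demo_cv <- vfold_cv(iris, v = 5, strata = Species)

# Cross fold positions (plain integers) with the tuning values
grid <- expand.grid(fold = seq_len(nrow(demo_cv)), mtry = c(1, 2))
grid <- grid[order(grid$fold, grid$mtry), ]

# Look up each split by position, then derive the train/validate sets
model_grid <- tibble(
  id = demo_cv$id[grid$fold],
  splits = unname(demo_cv$splits[grid$fold]),
  mtry = grid$mtry
) %>%
  mutate(train = map(splits, training),
         validate = map(splits, testing))

print(model_grid)
```

The result has one row per fold and mtry combination, the same shape the crossing() call was meant to produce, but as a plain tibble rather than an rset.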
