Error: `x` must be a vector, not a `rsplit/vfold_split` object

I'm getting an error when working with a list column after using rsample. I created a sanitized sample of the data as a csv file, and it seems to reproduce the issue. The csv file is here.

(Incidentally, is there a better way to share example data on here?)

Here is the script and error message:

# (so you know what's loaded in my environment)
library(tufte)
library(tidyverse)
library(lubridate)
library(foreach)
library(doParallel)
library(scales)
library(kableExtra)
library(rmarkdown)
library(dbplyr)
library(DBI)
library(odbc)
library(rlang)
library(rsample)
library(Metrics)

example_data <- read_csv("example_data.csv")

example_split <- initial_split(example_data, 0.9)
training_data <- training(example_split)
testing_data <- testing(example_split)

# 5 fold split stratified on j (the sanitized name of the spender column)
train_cv <- vfold_cv(training_data, 5, strata = j) %>% 
  
  # create training and validation sets within each fold
  mutate(train = map(splits, ~training(.x)), 
         validate = map(splits, ~testing(.x)))

# everything works up till this point. It's when I try to do anything with train_cv that I encounter issues. 

blah <- train_cv %>% 
  crossing(mtry = c(1,2))

> Error: `x` must be a vector, not a `rsplit/vfold_split` object

blah2 <- train_cv %>% 
  crossing(nrounds = c(100, 150, 200))
> Error: `x` must be a vector, not a `rsplit/vfold_split` object

Here's how train_cv looks before trying to use it with crossing():

train_cv
#  5-fold cross-validation using stratification 
# A tibble: 5 x 4
  splits            id    train                  validate              
* <named list>      <chr> <named list>           <named list>          
1 <split [72K/18K]> Fold1 <tibble [72,000 × 10]> <tibble [18,001 × 10]>
2 <split [72K/18K]> Fold2 <tibble [72,001 × 10]> <tibble [18,000 × 10]>
3 <split [72K/18K]> Fold3 <tibble [72,001 × 10]> <tibble [18,000 × 10]>
4 <split [72K/18K]> Fold4 <tibble [72,001 × 10]> <tibble [18,000 × 10]>
5 <split [72K/18K]> Fold5 <tibble [72,001 × 10]> <tibble [18,000 × 10]>

Desired outcome is that there will be a new column 'mtry' and each existing row (fold) will have a row for mtry = 1 and another for mtry = 2.
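For reference, a minimal sketch of that outcome on a toy tibble with no list columns. The names `folds` and `grid` are made up for illustration and are not part of the script above.

```r
library(tibble)
library(tidyr)

# Toy stand-in for train_cv: one row per fold, no list columns
folds <- tibble(id = c("Fold1", "Fold2", "Fold3"))

# crossing() pairs every row of folds with every value of mtry
grid <- crossing(folds, mtry = c(1, 2))

nrow(grid)  # one row per fold/mtry combination
```

Each fold id appears once per mtry value, which is the shape wanted for train_cv.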

How can I continue to work with train_cv after creating folds and then use crossing to experiment with various hyper parameters in my workflow?

[EDIT]
This error is inconsistent. I restarted my session and it ran fine, but in my script I can't restart the session every time I reach this code block. What could be loaded in my environment that would lead to this error? Screenshot:

And then after clearing workspace and starting a fresh session, everything works:

In case it's informative, here's what I'm shown when I click 'show traceback' on the error message:

Error: `x` must be a vector, not a `rsplit/vfold_split` object
23. stop(fallback)
22. signal_abort(cnd)
21. abort(message, .subclass = c(.subclass, "vctrs_error"), ...)
20. stop_vctrs(msg, "vctrs_error_scalar_type", actual = x)
19. stop_scalar_type(.Primitive("quote")(structure(list(data = structure(list( s = c("92DF3481-4F83-47E4-AE08-E7AD35EBC2B9", "IDFV-DB587A66-50ED-4468-999D-81CB9D872B81", "EAB6422C-17D7-428C-B25A-9BA9EB5C6FE2", "IDFV-A228265A-CB20-40EE-BEFF-85A532525DA2", "IDFV-109FD148-8287-47BF-A301-130557CC2583", "6B611B4F-6C1B-45D2-99E7-3BA9BF67CC27", ...
18. vec_unique_loc(x)
17. vec_slice(x, vec_unique_loc(x))
16. vec_unique(x)
15. vec_proxy_compare(x)
14. is.data.frame(proxy)
13. order_proxy(vec_proxy_compare(x), direction = direction, na_value = na_value)
12. vec_order(x, direction = direction, na_value = na_value)
11. vec_sort(vec_unique(x))
10. .f(.x[[i]], ...)
9. map(cols, sorted_unique)
8. crossing(., mtry = c(1, 2))
7. function_list[[i]](value)
6. freduce(value, `_function_list`)
5. `_fseq`(`_lhs`)
4. eval(quote(`_fseq`(`_lhs`)), env, env)
3. eval(quote(`_fseq`(`_lhs`)), env, env)
2. withVisible(eval(quote(`_fseq`(`_lhs`)), env, env))
1. train_cv %>% crossing(mtry = c(1, 2)) %>% mutate(model_binary = map2(.x = train, .y = mtry, ~ranger::ranger(formula = spender ~ d7_utility_sum + recent_utility_ratio, probability = T, mtry = .y, data = .x %>% filter(spend_7d == 0) %>% mutate(spender = factor(spender)))), ...
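Since the error depends on session state, one way to compare a failing session with a fresh one is to record what is attached and which package crossing() actually resolves to. This is only a diagnostic sketch. It assumes tidyr is installed, the name `crossing_source` is made up here, and it does not by itself explain the error.

```r
library(tidyr)

# Which namespace does crossing() come from in this session?
crossing_source <- environmentName(environment(crossing))
crossing_source

# Versions of the packages that appear in the traceback
packageVersion("tidyr")
packageVersion("vctrs")

# Attached packages, in search-path order
search()

# Names that are defined in more than one attached package
conflicts(detail = TRUE)
```

Running this once in the failing session and once after a restart gives two listings that can be compared side by side.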

Hi, again, Doug

Here's a reprex that refactors the code a bit and doesn't produce the error. But is blah the object you need?

suppressPackageStartupMessages(library(dplyr)) 
suppressPackageStartupMessages(library(purrr)) 
suppressPackageStartupMessages(library(readr)) 
suppressPackageStartupMessages(library(rsample)) 

# downloaded the file from googlesheets; there's a googlesheet package
# that I haven't grokked yet; gists are a great alternative

example_data <- read_csv("~/Desktop/example_data.csv")
#> Parsed with column specification:
#> cols(
#>   a = col_double(),
#>   b = col_double(),
#>   c = col_double(),
#>   d = col_double(),
#>   e = col_double(),
#>   f = col_double(),
#>   g = col_double(),
#>   h = col_double(),
#>   i = col_double(),
#>   j = col_logical()
#> )
example_split <- initial_split(example_data, 0.9)
training_data <- training(example_split)
testing_data <- testing(example_split)
train_cv <- vfold_cv(training_data, 5, strata = j) %>% mutate(train = map(splits, ~training(.x)), validate = map(splits, ~testing(.x)))
train_splits_id <- train_cv %>% select(splits, id)
blah <- crossing(mtry = train_splits_id)
blah
#> # A tibble: 5 x 1
#>   mtry$splits       $id  
#>   <named list>      <chr>
#> 1 <split [72K/18K]> Fold1
#> 2 <split [72K/18K]> Fold2
#> 3 <split [72K/18K]> Fold3
#> 4 <split [72K/18K]> Fold4
#> 5 <split [72K/18K]> Fold5

Created on 2020-01-14 by the reprex package (v0.3.0)

Hi @technocrat, I think there's a disconnect. What I have before trying to apply crossing() is this:

> library(tufte)
> library(tidyverse)
> library(lubridate)
> library(foreach)
> library(doParallel) # includes package just parallel
> library(scales)
> library(kableExtra)
> library(rmarkdown)
> library(dbplyr)
> library(DBI)
> library(odbc)
> library(rlang)
> library(rsample)
> library(Metrics)
> 
> example_data <- read_csv("example_data.csv")
Parsed with column specification:
cols(
  a = col_double(),
  b = col_double(),
  c = col_double(),
  d = col_double(),
  e = col_double(),
  f = col_double(),
  g = col_double(),
  h = col_double(),
  i = col_double(),
  j = col_logical()
)
> 
> example_split <- initial_split(example_data, 0.9)
> training_data <- training(example_split)
> testing_data <- testing(example_split)
> 
> # 5 fold split stratified on j (the sanitized name of the spender column)
> train_cv <- vfold_cv(training_data, 5, strata = j) %>% 
+   
+   # create training and validation sets within each fold
+   mutate(train = map(splits, ~training(.x)), 
+          validate = map(splits, ~testing(.x)))
> train_cv
#  5-fold cross-validation using stratification 
# A tibble: 5 x 4
  splits            id    train                  validate              
* <named list>      <chr> <named list>           <named list>          
1 <split [72K/18K]> Fold1 <tibble [72,000 × 10]> <tibble [18,001 × 10]>
2 <split [72K/18K]> Fold2 <tibble [72,001 × 10]> <tibble [18,000 × 10]>
3 <split [72K/18K]> Fold3 <tibble [72,001 × 10]> <tibble [18,000 × 10]>
4 <split [72K/18K]> Fold4 <tibble [72,001 × 10]> <tibble [18,000 × 10]>
5 <split [72K/18K]> Fold5 <tibble [72,001 × 10]> <tibble [18,000 × 10]>

Here's what train_cv looks like as of now:

> train_cv
#  5-fold cross-validation using stratification 
# A tibble: 5 x 4
  splits            id    train                  validate              
* <named list>      <chr> <named list>           <named list>          
1 <split [72K/18K]> Fold1 <tibble [72,000 × 10]> <tibble [18,001 × 10]>
2 <split [72K/18K]> Fold2 <tibble [72,001 × 10]> <tibble [18,000 × 10]>
3 <split [72K/18K]> Fold3 <tibble [72,001 × 10]> <tibble [18,000 × 10]>
4 <split [72K/18K]> Fold4 <tibble [72,001 × 10]> <tibble [18,000 × 10]>
5 <split [72K/18K]> Fold5 <tibble [72,001 × 10]> <tibble [18,000 × 10]>

Now I need to take train_cv and add a new column 'mtry', so that each existing row of train_cv appears once for each value in c(1, 2).

# this should work
train_cv <- train_cv %>%  crossing(mtry = c(1,2))

This is how train_cv should look after successfully running crossing()

blah
# A tibble: 10 x 5
   splits            id    train                  validate                mtry
   <named list>      <chr> <named list>           <named list>           <dbl>
 1 <split [72K/18K]> Fold1 <tibble [72,000 × 10]> <tibble [18,001 × 10]>     1
 2 <split [72K/18K]> Fold1 <tibble [72,000 × 10]> <tibble [18,001 × 10]>     2
 3 <split [72K/18K]> Fold2 <tibble [72,001 × 10]> <tibble [18,000 × 10]>     1
 4 <split [72K/18K]> Fold2 <tibble [72,001 × 10]> <tibble [18,000 × 10]>     2
 5 <split [72K/18K]> Fold3 <tibble [72,001 × 10]> <tibble [18,000 × 10]>     1
 6 <split [72K/18K]> Fold3 <tibble [72,001 × 10]> <tibble [18,000 × 10]>     2
 7 <split [72K/18K]> Fold4 <tibble [72,001 × 10]> <tibble [18,000 × 10]>     1
 8 <split [72K/18K]> Fold4 <tibble [72,001 × 10]> <tibble [18,000 × 10]>     2
 9 <split [72K/18K]> Fold5 <tibble [72,001 × 10]> <tibble [18,000 × 10]>     1
10 <split [72K/18K]> Fold5 <tibble [72,001 × 10]> <tibble [18,000 × 10]>     2

So, whereas train_cv started out with 5 rows, it now has 10: for each fold, one row with mtry = 1 and one row with mtry = 2.

Thanks, I'll take another look. Getting the question right is always the hardest part!

Yeah! Appreciate any feedback. This problem is particularly hard because the error is intermittent. It only happens sometimes.

One of those annoying code chunks that choke reprex.

The following code, like yours, intermittently throws

Error: `x` must be a vector, not a `rsplit/vfold_split` object
Run rlang::last_error() to see where the error occurred.

However, it proceeds to happily produce my re_train_cv variable even with the error message!

suppressPackageStartupMessages(library(dplyr))
library(purrr)
library(readr)
library(rsample)
library(Metrics)

mtry = c(1,2)
example_data <- read_csv("~/Desktop/example_data.csv")
Parsed with column specification:
cols(
  a = col_double(),
  b = col_double(),
  c = col_double(),
  d = col_double(),
  e = col_double(),
  f = col_double(),
  g = col_double(),
  h = col_double(),
  i = col_double(),
  j = col_logical()
)
example_split <- initial_split(example_data, 0.9)
training_data <- training(example_split)
testing_data <- testing(example_split)
train_cv <- vfold_cv(training_data, 5, strata = j) %>% mutate(train = map(splits, ~training(.x)), validate = map(splits, ~testing(.x)))
re_train_cv <- train_cv %>%  crossing(mtry)
re_train_cv
# A tibble: 10 x 5
   splits            id    train                  validate                mtry
   <named list>      <chr> <named list>           <named list>           <dbl>
 1 <split [72K/18K]> Fold1 <tibble [72,000 × 10]> <tibble [18,001 × 10]>     1
 2 <split [72K/18K]> Fold1 <tibble [72,000 × 10]> <tibble [18,001 × 10]>     2
 3 <split [72K/18K]> Fold2 <tibble [72,001 × 10]> <tibble [18,000 × 10]>     1
 4 <split [72K/18K]> Fold2 <tibble [72,001 × 10]> <tibble [18,000 × 10]>     2
 5 <split [72K/18K]> Fold3 <tibble [72,001 × 10]> <tibble [18,000 × 10]>     1
 6 <split [72K/18K]> Fold3 <tibble [72,001 × 10]> <tibble [18,000 × 10]>     2
 7 <split [72K/18K]> Fold4 <tibble [72,001 × 10]> <tibble [18,000 × 10]>     1
 8 <split [72K/18K]> Fold4 <tibble [72,001 × 10]> <tibble [18,000 × 10]>     2
 9 <split [72K/18K]> Fold5 <tibble [72,001 × 10]> <tibble [18,000 × 10]>     1
10 <split [72K/18K]> Fold5 <tibble [72,001 × 10]> <tibble [18,000 × 10]>     2

Hi @technocrat. OK, but in my code the script does not keep running when the error shows; it stops :(

Did you try

mtry = c(1,2)

I couldn't get it past the error at all until I did that; why the dickens the error shows up but doesn't stop anything, wtf?

I don't follow? In my original code I have ... %>% crossing(mtry = c(1, 2)). Are you suggesting something else?

I don't know why it changed the result, but I used

mtry = c(1,2)
...
crossing(mtry)

Hmmm. OK, going to try that when back in the office and see how it goes. Thanks for the tip!

I wish I could explain the behavior, but heck! It does seem to work!

Welp, I gave that a try but am still hitting this error :(
Thanks for trying.
[Screenshot taken 2020-01-17 at 11:35 AM]

For the avoidance of doubt:

First run

Second run

Third run

(Inexplicable)

I'm not even sure which library's github page to go to to report an issue. Would you suspect this is to do with rsample? Or Plyr?

I thought tidyr::crossing was the function acting inconsistently on the same data, but I was, once again, wrong. With a separately saved train_cv, it's perfectly consistent. So, I'd lob an issue into rsample; if it's an artifact of random sampling, can it be controlled with set.seed()?
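On the set.seed() question: a seed set immediately before vfold_cv() makes the fold assignment repeatable, which at least lets the same folds be compared across runs. A minimal sketch, using the built-in mtcars data as a stand-in for example_data.csv (an assumption, since the real file isn't reproduced here); `folds_a`, `folds_b` and `same_folds` are names invented for the example.

```r
library(rsample)

set.seed(123)
folds_a <- vfold_cv(mtcars, v = 5)

set.seed(123)
folds_b <- vfold_cv(mtcars, v = 5)

# Same seed, so the first fold's analysis set holds the same rows
same_folds <- identical(
  analysis(folds_a$splits[[1]]),
  analysis(folds_b$splits[[1]])
)
same_folds
```

Whether the seed has any bearing on the crossing() error is not established here; it only removes one source of run-to-run variation.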

Submitted an issue, let's see if anyone sheds any light on it. Thanks for trying to help me!

I have not been able to reproduce the error message since doing this:

library(tidyverse)
library(rsample)

example_data <- read_csv("example_data.csv")

example_split <- initial_split(example_data, 0.9)
training_data <- training(example_split) %>% as_tibble()
testing_data <- testing(example_split) %>% as_tibble()

# 5 fold split stratified on j
set.seed(123)
train_cv <- vfold_cv(training_data, 5, strata = j) %>%
  
  # create training and validation sets within each fold
  mutate(train = map(splits, ~training(.x)), 
         validate = map(splits, ~testing(.x))) %>% 
  group_by(id) %>% nest() %>% unnest()

blah <- train_cv %>%
  crossing(mtry = c(1, 2))

Adding:
%>% group_by(id) %>% nest() %>% unnest()

This just seems redundant, since the data is already nested by fold, but it seems to get around the error. I don't know why, but there you go.
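Another possible way around it, sketched here as an untested alternative for this particular setup: keep the splits column out of crossing() altogether. The traceback shows crossing() calling vec_unique() on each input, so the sketch builds the grid from the fold ids and the mtry values only, then joins the list columns back on by id. It uses mtcars as a stand-in for the real data, analysis() and assessment() in place of training() and testing() on the fold splits, and the made-up names `param_grid` and `tuning_grid`. It has not been checked against the package versions in the original session.

```r
library(dplyr)
library(purrr)
library(tidyr)
library(rsample)

set.seed(123)
train_cv <- vfold_cv(mtcars, v = 5) %>%
  # per-fold training and validation sets
  mutate(train = map(splits, analysis),
         validate = map(splits, assessment))

# Cross only plain vectors: fold ids and candidate mtry values
param_grid <- crossing(id = train_cv$id, mtry = c(1, 2))

# Attach the splits and per-fold data by fold id
tuning_grid <- left_join(param_grid, train_cv, by = "id")

dim(tuning_grid)
```

The result has one row per fold and mtry combination, with the splits, train and validate columns carried along untouched.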
