Error: `x` must be a vector, not a `rsplit/vfold_split` object

dougfir · January 15, 2020, 12:57am

I'm getting an error when working with a list column after using rsample. I created a sample of sanitized data as a csv, and it seems to reproduce the issue. The csv file is here.

(Incidentally is there a better way to share example data on here?)

Here is the script and error message:

# (so you know what's loaded in my environment)
library(tufte)
library(tidyverse)
library(lubridate)
library(foreach)
library(doParallel)
library(scales)
library(kableExtra)
library(rmarkdown)
library(dbplyr)
library(DBI)
library(odbc)
library(rlang)
library(rsample)
library(Metrics)

example_data <- read_csv("example_data.csv")

example_split <- initial_split(example_data, 0.9)
training_data <- training(example_split)
testing_data <- testing(example_split)

# 5 fold split stratified on j
train_cv <- vfold_cv(training_data, 5, strata = j) %>% 
  
  # create training and validation sets within each fold
  mutate(train = map(splits, ~training(.x)), 
         validate = map(splits, ~testing(.x)))

# everything works up till this point. It's when I try to do anything with train_cv that I encounter issues. 

blah <- train_cv %>% 
  crossing(mtry = c(1,2))

> Error: `x` must be a vector, not a `rsplit/vfold_split` object

blah2 <- train_cv %>% 
  crossing(nrounds = c(100, 150, 200))
> Error: `x` must be a vector, not a `rsplit/vfold_split` object

Here's how train_cv looks before trying to use it with crossing():

train_cv
#  5-fold cross-validation using stratification 
# A tibble: 5 x 4
  splits            id    train                  validate              
* <named list>      <chr> <named list>           <named list>          
1 <split [72K/18K]> Fold1 <tibble [72,000 × 10]> <tibble [18,001 × 10]>
2 <split [72K/18K]> Fold2 <tibble [72,001 × 10]> <tibble [18,000 × 10]>
3 <split [72K/18K]> Fold3 <tibble [72,001 × 10]> <tibble [18,000 × 10]>
4 <split [72K/18K]> Fold4 <tibble [72,001 × 10]> <tibble [18,000 × 10]>
5 <split [72K/18K]> Fold5 <tibble [72,001 × 10]> <tibble [18,000 × 10]>

Desired outcome is that there will be a new column 'mtry' and each existing row (fold) will have a row for mtry = 1 and another for mtry = 2.

How can I continue to work with train_cv after creating folds and then use crossing to experiment with various hyper parameters in my workflow?
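One possible way to get that shape, as a rough sketch rather than a confirmed fix: cross only the atomic columns (the fold id and the mtry values) and then join the list columns back by id, so that crossing() is never asked to sort or de-duplicate the column of split objects. Here mtcars stands in for training_data, and the names folds, grid and folds_by_mtry are made up for the illustration.

```r
library(dplyr)
library(tidyr)
library(rsample)

set.seed(123)
# mtcars is a stand-in for the thread's training_data (assumption)
folds <- vfold_cv(mtcars, v = 5)

# Cross only atomic vectors: the fold ids and the candidate mtry values.
grid <- crossing(id = folds$id, mtry = c(1, 2))

# Bring the list column of splits back in by fold id.
folds_by_mtry <- left_join(grid, as_tibble(folds), by = "id")

nrow(folds_by_mtry)
names(folds_by_mtry)
```

Each fold should then appear once per mtry value, with the splits column carried along untouched.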

[EDIT]
This error is intermittent. I tried restarting my session and it ran fine, but in my script I cannot restart my session each time I reach this code block. What could be loaded in my workspace that would lead to this error?

After clearing the workspace and starting a fresh session, everything works.

In case it's informative, here's what I'm shown when I click 'show traceback' on the error message:

Error: `x` must be a vector, not a `rsplit/vfold_split` object
23. stop(fallback)
22. signal_abort(cnd)
21. abort(message, .subclass = c(.subclass, "vctrs_error"), ...)
20. stop_vctrs(msg, "vctrs_error_scalar_type", actual = x)
19. stop_scalar_type(.Primitive("quote")(structure(list(data = structure(list( s = c("92DF3481-4F83-47E4-AE08-E7AD35EBC2B9", "IDFV-DB587A66-50ED-4468-999D-81CB9D872B81", "EAB6422C-17D7-428C-B25A-9BA9EB5C6FE2", "IDFV-A228265A-CB20-40EE-BEFF-85A532525DA2", "IDFV-109FD148-8287-47BF-A301-130557CC2583", "6B611B4F-6C1B-45D2-99E7-3BA9BF67CC27", ...
18. vec_unique_loc(x)
17. vec_slice(x, vec_unique_loc(x))
16. vec_unique(x)
15. vec_proxy_compare(x)
14. is.data.frame(proxy)
13. order_proxy(vec_proxy_compare(x), direction = direction, na_value = na_value)
12. vec_order(x, direction = direction, na_value = na_value)
11. vec_sort(vec_unique(x))
10. .f(.x[[i]], ...)
9. map(cols, sorted_unique)
8. crossing(., mtry = c(1, 2))
7. function_list[[i]](value)
6. freduce(value, `_function_list`)
5. `_fseq`(`_lhs`)
4. eval(quote(`_fseq`(`_lhs`)), env, env)
3. eval(quote(`_fseq`(`_lhs`)), env, env)
2. withVisible(eval(quote(`_fseq`(`_lhs`)), env, env))
1. train_cv %>% crossing(mtry = c(1, 2)) %>% mutate(model_binary = map2(.x = train, .y = mtry, ~ranger::ranger(formula = spender ~ d7_utility_sum + recent_utility_ratio, probability = T, mtry = .y, data = .x %>% filter(spend_7d == 0) %>% mutate(spender = factor(spender)))), ...
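Reading the traceback, crossing() maps sorted_unique over every column, which calls vec_sort(vec_unique(x)), and the failure is a vctrs "scalar type" error. As a small sketch of what vctrs counts as a vector (this does not explain why the error is intermittent), the check below uses a hypothetical toy_split class as a stand-in for an rsplit: a list with an S3 class that does not inherit from "list".

```r
library(vctrs)

# Hypothetical stand-in for an rsplit object: a classed list with no
# "list" in its class vector and no vec_proxy() method.
toy_split <- structure(list(in_id = 1:3), class = "toy_split")

is_vec_atomic  <- vec_is(1:3)              # an atomic vector
is_vec_list    <- vec_is(list(toy_split))  # a bare list holding the object
is_vec_classed <- vec_is(toy_split)        # the classed object on its own

c(atomic = is_vec_atomic, bare_list = is_vec_list, classed = is_vec_classed)
```

So a list column of such objects is fine as a column, but any code path that hands a single element to a vctrs function expecting a vector would be rejected in the way the message describes.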

technocrat · January 15, 2020, 3:20am

Hi, again, Doug

Here's a reprex that refactors the code a bit and doesn't produce the error. But is blah the object you need?

suppressPackageStartupMessages(library(dplyr)) 
suppressPackageStartupMessages(library(purrr)) 
suppressPackageStartupMessages(library(readr)) 
suppressPackageStartupMessages(library(rsample)) 

# downloaded the file from googlesheets; there's a googlesheet package
# that I haven't grokked yet; gists are a great alternative

example_data <- read_csv("~/Desktop/example_data.csv")
#> Parsed with column specification:
#> cols(
#>   a = col_double(),
#>   b = col_double(),
#>   c = col_double(),
#>   d = col_double(),
#>   e = col_double(),
#>   f = col_double(),
#>   g = col_double(),
#>   h = col_double(),
#>   i = col_double(),
#>   j = col_logical()
#> )
example_split <- initial_split(example_data, 0.9)
training_data <- training(example_split)
testing_data <- testing(example_split)
train_cv <- vfold_cv(training_data, 5, strata = j) %>% mutate(train = map(splits, ~training(.x)), validate = map(splits, ~testing(.x)))
train_splits_id <- train_cv %>% select(splits, id)
blah <- crossing(mtry = train_splits_id)
blah
#> # A tibble: 5 x 1
#>   mtry$splits       $id  
#>   <named list>      <chr>
#> 1 <split [72K/18K]> Fold1
#> 2 <split [72K/18K]> Fold2
#> 3 <split [72K/18K]> Fold3
#> 4 <split [72K/18K]> Fold4
#> 5 <split [72K/18K]> Fold5

Created on 2020-01-14 by the reprex package (v0.3.0)

dougfir · January 15, 2020, 10:08pm

Hi @technocrat, I think there's a disconnect. What I have before trying to apply crossing() is this:

> library(tufte)
> library(tidyverse)
> library(lubridate)
> library(foreach)
> library(doParallel) # also attaches the parallel package
> library(scales)
> library(kableExtra)
> library(rmarkdown)
> library(dbplyr)
> library(DBI)
> library(odbc)
> library(rlang)
> library(rsample)
> library(Metrics)
> 
> example_data <- read_csv("example_data.csv")
Parsed with column specification:
cols(
  a = col_double(),
  b = col_double(),
  c = col_double(),
  d = col_double(),
  e = col_double(),
  f = col_double(),
  g = col_double(),
  h = col_double(),
  i = col_double(),
  j = col_logical()
)
> 
> example_split <- initial_split(example_data, 0.9)
> training_data <- training(example_split)
> testing_data <- testing(example_split)
> 
> # 5 fold split stratified on j
> train_cv <- vfold_cv(training_data, 5, strata = j) %>% 
+   
+   # create training and validation sets within each fold
+   mutate(train = map(splits, ~training(.x)), 
+          validate = map(splits, ~testing(.x)))
> train_cv
#  5-fold cross-validation using stratification 
# A tibble: 5 x 4
  splits            id    train                  validate              
* <named list>      <chr> <named list>           <named list>          
1 <split [72K/18K]> Fold1 <tibble [72,000 × 10]> <tibble [18,001 × 10]>
2 <split [72K/18K]> Fold2 <tibble [72,001 × 10]> <tibble [18,000 × 10]>
3 <split [72K/18K]> Fold3 <tibble [72,001 × 10]> <tibble [18,000 × 10]>
4 <split [72K/18K]> Fold4 <tibble [72,001 × 10]> <tibble [18,000 × 10]>
5 <split [72K/18K]> Fold5 <tibble [72,001 × 10]> <tibble [18,000 × 10]>

Now I need to take train_cv and add a new column 'mtry', so that each row in train_cv is repeated once for each value in c(1, 2).

# this should work
train_cv <- train_cv %>%  crossing(mtry = c(1,2))

This is how train_cv should look after successfully running crossing()

blah
# A tibble: 10 x 5
   splits            id    train                  validate                mtry
   <named list>      <chr> <named list>           <named list>           <dbl>
 1 <split [72K/18K]> Fold1 <tibble [72,000 × 10]> <tibble [18,001 × 10]>     1
 2 <split [72K/18K]> Fold1 <tibble [72,000 × 10]> <tibble [18,001 × 10]>     2
 3 <split [72K/18K]> Fold2 <tibble [72,001 × 10]> <tibble [18,000 × 10]>     1
 4 <split [72K/18K]> Fold2 <tibble [72,001 × 10]> <tibble [18,000 × 10]>     2
 5 <split [72K/18K]> Fold3 <tibble [72,001 × 10]> <tibble [18,000 × 10]>     1
 6 <split [72K/18K]> Fold3 <tibble [72,001 × 10]> <tibble [18,000 × 10]>     2
 7 <split [72K/18K]> Fold4 <tibble [72,001 × 10]> <tibble [18,000 × 10]>     1
 8 <split [72K/18K]> Fold4 <tibble [72,001 × 10]> <tibble [18,000 × 10]>     2
 9 <split [72K/18K]> Fold5 <tibble [72,001 × 10]> <tibble [18,000 × 10]>     1
10 <split [72K/18K]> Fold5 <tibble [72,001 × 10]> <tibble [18,000 × 10]>     2

So, whereas train_cv started out with 5 rows, it now has 10 rows: for each fold, one row with mtry = 1 and one row with mtry = 2.
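For context, here is a rough sketch of how a 10-row grid like this could then be used to compare mtry values, in the spirit of the map2()/ranger::ranger() call visible in the traceback. Everything specific here is an assumption for illustration: mtcars replaces the real data, mpg is the outcome, and the grid is built by crossing fold ids and joining, not by calling crossing() on train_cv directly. It uses analysis() and assessment(), the rsample accessors for resample splits, and needs the ranger package installed.

```r
library(dplyr)
library(purrr)
library(tidyr)
library(rsample)

set.seed(123)
# mtcars stands in for training_data; mpg is an arbitrary outcome (assumptions)
folds <- vfold_cv(mtcars, v = 5) %>%
  mutate(train = map(splits, analysis),
         validate = map(splits, assessment))

# One row per fold and mtry value, built from atomic columns only.
grid <- crossing(id = folds$id, mtry = c(1, 2)) %>%
  left_join(as_tibble(folds), by = "id")

results <- grid %>%
  mutate(
    # Fit one random forest per (fold, mtry) row.
    model = map2(train, mtry, ~ ranger::ranger(mpg ~ ., data = .x, mtry = .y)),
    # Root mean squared error on the held-out part of the fold.
    rmse = map2_dbl(model, validate, ~ {
      pred <- predict(.x, data = .y)$predictions
      sqrt(mean((.y$mpg - pred)^2))
    })
  )

results %>% group_by(mtry) %>% summarise(mean_rmse = mean(rmse))
```

Averaging rmse per mtry value gives a simple comparison across the five folds.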

technocrat · January 15, 2020, 10:20pm

Thanks, I'll take another look. Getting the question right is always the hardest part!

dougfir · January 15, 2020, 10:21pm

Yeah! Appreciate any feedback. This problem is particularly hard because the error is intermittent. It only happens sometimes.

technocrat · January 17, 2020, 12:34am

One of those annoying code chunks that choke reprex.

The following code, like yours, intermittently throws

Error: x must be a vector, not a rsplit/vfold_split object
Run rlang::last_error() to see where the error occurred.

However, it proceeds to happily produce my re_train_cv variable even with the error message!

suppressPackageStartupMessages(library(dplyr))
library(purrr)
library(readr)
library(rsample)
library(Metrics)

mtry = c(1,2)
example_data <- read_csv("~/Desktop/example_data.csv")
Parsed with column specification:
cols(
  a = col_double(),
  b = col_double(),
  c = col_double(),
  d = col_double(),
  e = col_double(),
  f = col_double(),
  g = col_double(),
  h = col_double(),
  i = col_double(),
  j = col_logical()
)
example_split <- initial_split(example_data, 0.9)
training_data <- training(example_split)
testing_data <- testing(example_split)
train_cv <- vfold_cv(training_data, 5, strata = j) %>% mutate(train = map(splits, ~training(.x)), validate = map(splits, ~testing(.x)))
re_train_cv <- train_cv %>%  crossing(mtry)
re_train_cv
# A tibble: 10 x 5
   splits            id    train                  validate                mtry
   <named list>      <chr> <named list>           <named list>           <dbl>
 1 <split [72K/18K]> Fold1 <tibble [72,000 × 10]> <tibble [18,001 × 10]>     1
 2 <split [72K/18K]> Fold1 <tibble [72,000 × 10]> <tibble [18,001 × 10]>     2
 3 <split [72K/18K]> Fold2 <tibble [72,001 × 10]> <tibble [18,000 × 10]>     1
 4 <split [72K/18K]> Fold2 <tibble [72,001 × 10]> <tibble [18,000 × 10]>     2
 5 <split [72K/18K]> Fold3 <tibble [72,001 × 10]> <tibble [18,000 × 10]>     1
 6 <split [72K/18K]> Fold3 <tibble [72,001 × 10]> <tibble [18,000 × 10]>     2
 7 <split [72K/18K]> Fold4 <tibble [72,001 × 10]> <tibble [18,000 × 10]>     1
 8 <split [72K/18K]> Fold4 <tibble [72,001 × 10]> <tibble [18,000 × 10]>     2
 9 <split [72K/18K]> Fold5 <tibble [72,001 × 10]> <tibble [18,000 × 10]>     1
10 <split [72K/18K]> Fold5 <tibble [72,001 × 10]> <tibble [18,000 × 10]>     2

dougfir · January 17, 2020, 12:52am

Hi @technocrat. OK, but in my code it does not keep running when the error shows; it stops the script.

technocrat · January 17, 2020, 1:53am

Did you try

mtry = c(1,2)

I couldn't get it past the error at all until I did that; why the dickens the error shows up but doesn't stop anything, wtf?

dougfir · January 17, 2020, 1:57am

I don't follow? In my original code I have ... %>% crossing(mtry = c(1, 2)). Are you suggesting something else?

technocrat · January 17, 2020, 2:50am

I don't know why it changed the result, but I used

mtry = c(1,2)
...
crossing(mtry)

dougfir · January 17, 2020, 3:35am

Hmmm. OK, going to try that when back in the office and see how it goes. Thanks for the tip!

technocrat · January 17, 2020, 4:37am

I wish I could explain the behavior, but heck! It does seem to work!

dougfir · January 17, 2020, 7:36pm

Welp, I gave that a try but am still hitting this error
Thanks for trying

technocrat · January 17, 2020, 8:03pm

For the avoidance of doubt:

First run

Second run

Third run

(Inexplicable)

dougfir · January 17, 2020, 8:11pm

I'm not even sure which library's github page to go to to report an issue. Would you suspect this is to do with rsample? Or Plyr?

technocrat · January 17, 2020, 9:34pm

I thought tidyr::crossing was the function acting inconsistently on the same data, but I was, once again, wrong. With a separately saved train_cv, it's perfectly consistent. So, I'd lob an issue into rsample; if it's an artifact of random sampling, can it be controlled with set.seed(?)?
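On the set.seed() question, a minimal sketch of making the resampling itself repeatable, assuming mtcars as stand-in data and a made-up helper make_folds(). Whether seeding has anything to do with the error is not established here; the sketch only shows that seeding before every random step, including initial_split(), gives the same folds on each run.

```r
library(rsample)

# Hypothetical helper: seed first, then run every random step.
make_folds <- function(seed) {
  set.seed(seed)
  split <- initial_split(mtcars, prop = 0.9)  # random
  vfold_cv(training(split), v = 5)            # random
}

a <- make_folds(123)
b <- make_folds(123)

# Compare the analysis set of the first fold across the two runs.
same_first_fold <- identical(analysis(a$splits[[1]]), analysis(b$splits[[1]]))
same_first_fold
```

If set.seed() is called only after initial_split(), the training data can still differ between runs, so the folds would not be comparable run to run.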

dougfir · January 17, 2020, 10:37pm

Submitted an issue, let's see if anyone sheds any light on it. Thanks for trying to help me!

github.com/tidymodels/rsample

Error: `x` must be a vector, not a `rsplit/vfold_split` object

opened 10:36PM - 17 Jan 20 UTC

closed 02:39AM - 30 Mar 20 UTC

gcameron89777

Error: `x` must be a vector, not a `rsplit/vfold_split` object

I am experiencing the above error when using tidyr::crossing() just after creating a rsplit object using `vfold_cv()`. The error is intermittent; it only happens sometimes. [Others have been able to reproduce](https://community.rstudio.com/t/error-x-must-be-a-vector-not-a-rsplit-vfold-split-object/49396), sometimes.

[Example csv file to reproduce](https://drive.google.com/open?id=1c5Qu2U_DgX-hC1HDPOWCHnGm50u7aNPZ).

```r
library(tidyverse)
library(rsample)

example_data <- read_csv("example_data.csv")

example_split <- initial_split(example_data, 0.9)
training_data <- training(example_split)
testing_data <- testing(example_split)

# 5 fold split stratified on j
set.seed(123)
train_cv <- vfold_cv(training_data, 5, strata = j) %>%

  # create training and validation sets within each fold
  mutate(train = map(splits, ~training(.x)),
         validate = map(splits, ~testing(.x)))

blah <- train_cv %>%
  crossing(mtry = c(1,2))
> Error: `x` must be a vector, not a `rsplit/vfold_split` object
```

train_cv looks like this:

```
train_cv
#  5-fold cross-validation using stratification
# A tibble: 5 x 4
  splits            id    train                  validate
* <named list>      <chr> <named list>           <named list>
1 <split [72K/18K]> Fold1 <tibble [72,000 × 10]> <tibble [18,001 × 10]>
2 <split [72K/18K]> Fold2 <tibble [72,001 × 10]> <tibble [18,000 × 10]>
3 <split [72K/18K]> Fold3 <tibble [72,001 × 10]> <tibble [18,000 × 10]>
4 <split [72K/18K]> Fold4 <tibble [72,001 × 10]> <tibble [18,000 × 10]>
5 <split [72K/18K]> Fold5 <tibble [72,001 × 10]> <tibble [18,000 × 10]>
```

I would like to use the same train_cv object in my script for trying different models with their own tuning parameters.

In the example above, if `crossing(mtry = c(1, 2))` works, the desired output would take `train_cv` and make it look like this:

```
# A tibble: 10 x 5
   splits            id    train                  validate                mtry
   <named list>      <chr> <named list>           <named list>           <dbl>
 1 <split [72K/18K]> Fold1 <tibble [72,000 × 10]> <tibble [18,001 × 10]>     1
 2 <split [72K/18K]> Fold1 <tibble [72,000 × 10]> <tibble [18,001 × 10]>     2
 3 <split [72K/18K]> Fold2 <tibble [72,001 × 10]> <tibble [18,000 × 10]>     1
 4 <split [72K/18K]> Fold2 <tibble [72,001 × 10]> <tibble [18,000 × 10]>     2
 5 <split [72K/18K]> Fold3 <tibble [72,001 × 10]> <tibble [18,000 × 10]>     1
 6 <split [72K/18K]> Fold3 <tibble [72,001 × 10]> <tibble [18,000 × 10]>     2
 7 <split [72K/18K]> Fold4 <tibble [72,001 × 10]> <tibble [18,000 × 10]>     1
 8 <split [72K/18K]> Fold4 <tibble [72,001 × 10]> <tibble [18,000 × 10]>     2
 9 <split [72K/18K]> Fold5 <tibble [72,001 × 10]> <tibble [18,000 × 10]>     1
10 <split [72K/18K]> Fold5 <tibble [72,001 × 10]> <tibble [18,000 × 10]>     2
```

### Session Info:

```r
sessionInfo()
R version 3.6.0 (2019-04-26)
Platform: x86_64-redhat-linux-gnu (64-bit)
Running under: Amazon Linux 2

Matrix products: default
BLAS/LAPACK: /usr/lib64/R/lib/libRblas.so

locale:
 [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 LC_MONETARY=en_US.UTF-8
 [6] LC_MESSAGES=en_US.UTF-8 LC_PAPER=en_US.UTF-8 LC_NAME=C LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] parallel stats graphics grDevices utils datasets methods base

other attached packages:
 [1] Metrics_0.1.4 rsample_0.0.5 rlang_0.4.2 odbc_1.2.2 DBI_1.1.0 dbplyr_1.4.2 rmarkdown_2.0 kableExtra_1.1.0
 [9] scales_1.1.0 doParallel_1.0.15 iterators_1.0.12 foreach_1.4.7 lubridate_1.7.4 forcats_0.4.0 stringr_1.4.0 dplyr_0.8.3
[17] purrr_0.3.3 readr_1.3.1 tidyr_1.0.0 tibble_2.1.3 ggplot2_3.2.1 tidyverse_1.3.0 tufte_0.5

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.3 lattice_0.20-38 listenv_0.8.0 utf8_1.1.4 assertthat_0.2.1 zeallot_0.1.0 digest_0.6.23 packrat_0.5.0
 [9] R6_2.4.1 cellranger_1.1.0 backports_1.1.5 reprex_0.3.0 evaluate_0.14 httr_1.4.1 pillar_1.4.3 lazyeval_0.2.2
[17] readxl_1.3.1 data.table_1.12.8 rstudioapi_0.10 furrr_0.1.0 blob_1.2.0 webshot_0.5.2 bit_1.1-15.1 munsell_0.5.0
[25] broom_0.5.3 compiler_3.6.0 modelr_0.1.5 xfun_0.12 pkgconfig_2.0.3 globals_0.12.5 htmltools_0.4.0 tidyselect_0.2.5
[33] codetools_0.2-16 future_1.16.0 fansi_0.4.1 viridisLite_0.3.0 crayon_1.3.4 withr_2.1.2 grid_3.6.0 nlme_3.1-143
[41] jsonlite_1.6 gtable_0.3.0 lifecycle_0.1.0 magrittr_1.5 cli_2.0.1 stringi_1.4.5 fs_1.3.1 xml2_1.2.2
[49] generics_0.0.2 vctrs_0.2.1 tools_3.6.0 bit64_0.9-7 glue_1.3.1 hms_0.5.3 colorspace_1.4-1 rvest_0.3.5
[57] knitr_1.27 haven_2.2.0
```

Not sure if this is an actual issue or a problem with my code. I tried the rstudio community forum first.

dougfir · January 18, 2020, 12:03am

I have not been able to reproduce the error message since doing this:

library(tidyverse)
library(rsample)

example_data <- read_csv("example_data.csv")

example_split <- initial_split(example_data, 0.9)
training_data <- training(example_split) %>% as_tibble()
testing_data <- testing(example_split) %>% as_tibble()

# 5 fold split stratified on j
set.seed(123)
train_cv <- vfold_cv(training_data, 5, strata = j) %>%
  
  # create training and validation sets within each fold
  mutate(train = map(splits, ~training(.x)), 
         validate = map(splits, ~testing(.x))) %>% 
  group_by(id) %>% nest() %>% unnest()

blah <- train_cv %>%
  crossing(mtry = c(1, 2))

Adding:
%>% group_by(id) %>% nest() %>% unnest()

Just seems repetitive since this is already done. But it seems to get around the error. Don't know why but there you go.

system · February 8, 2020, 12:03am

This topic was automatically closed 21 days after the last reply. New replies are no longer allowed.