Creating a user-specified (non-random) validation set

I have an "rsplit" object created by

rsample::initial_time_split()

Now I want to create just one validation set based on one column or on row order. I tried "validation_split()", but it only allows random sampling. I then tried "group_vfold_cv()", which gives the appropriate grouping but, as the name says, performs cross-validation and therefore gives me 2 resamples.

folds = group_vfold_cv(training(df_split), group = 'column')
# Group 2-fold cross-validation 
# A tibble: 2 x 2
  splits                 id       
  <list>                 <chr>    
1 <rsplit [40912/72608]> Resample1
2 <rsplit [72608/40912]> Resample2

I would like to make something like this:

folds = group_vfold_cv(training(df_split), group = 'column') %>%
          filter(id == "Resample2")

But this breaks its class and converts it to a tibble that will not be recognized by the tuning function (tune_grid()).
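
One possible way around this, sketched on the assumption that the installed rsample version provides `manual_rset()`, is to pull the single "rsplit" out of the folds by position and wrap it in a new resampling object instead of filtering the tibble. The data and the names `train`, `chosen` and `val_set` are made up for illustration.

```r
library(rsample)

set.seed(123)

# Made-up data standing in for training(df_split)
train <- data.frame(
  x = runif(100),
  y = runif(100),
  group_column = rep(c(1, 0), 50)
)

# Two groups give two resamples, as in the output above
folds <- group_vfold_cv(train, group = "group_column")

# Take one rsplit by position instead of filtering the tibble.
# Groups are assigned to folds at random, so check
# assessment(folds$splits[[i]]) before settling on an index.
chosen <- folds$splits[[2]]

# Wrap the single rsplit in a new rset with one row
val_set <- manual_rset(list(chosen), ids = "Validation")

class(val_set)
```

If this behaves as expected, `val_set` should be usable as the `resamples` argument of `tune_grid()`, since it is still an "rset" rather than a bare tibble.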

Does anyone know a way to accomplish this?

Here is a reprex of what I would like to do:

library(tidymodels)

df = tibble( x = runif(100, 0 ,1), y = runif(100, 0,1), group_column = rep(c(1,0), 50))

df_split = initial_split(df, prop = 3/4)

#the filter changes the class that is needed for the tune_grid function
folds = group_vfold_cv(training(df_split), group = 'group_column') %>%
  filter(id == "Resample2")

boost_spec <- parsnip::boost_tree(
  trees = tune(),
  tree_depth = tune()) %>%
  set_engine("xgboost") %>%
  set_mode("regression")
  
# Named boost_recipe so that it does not mask the recipe() function
boost_recipe <- recipe(y ~ ., data = head(training(df_split)))

boost_workflow = workflow() %>% 
  add_recipe(boost_recipe) %>%
  add_model(boost_spec)

set.seed(123)
boost_grid <- grid_max_entropy(
  trees(),
  tree_depth(),
  size = 2)

boost_res = boost_workflow %>%
  tune_grid(resamples = folds,
            grid = boost_grid,
            metrics = metric_set(rmse))

Thanks a lot!

Can you make a simple dummy version with data? A reprex (FAQ: How to do a minimal reproducible example (reprex) for beginners) makes it easier, because I can then create the exact objects on my side.

Thanks, I just added a reprex.

Hi,

It looks as if other people have asked for the ability to manually split their data based on a column. Could you see whether the code below works for you? Also have a look here: feature request - manual split creation · Issue #158 · tidymodels/rsample · GitHub

library(tidymodels)

df = tibble( x = runif(100, 0 ,1), y = runif(100, 0,1), group_column = rep(c(1,0), 50))


df <- df %>% 
  arrange(group_column) %>% 
  mutate(.row = row_number())


split_prop <- (last(which(df$group_column == 1))) / nrow(df)

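# Row numbers for each side of the split: group 1 is used for fitting,
# group 0 is held out
indices <-
  list(analysis   = df$.row[df$group_column == 1],
       assessment = df$.row[df$group_column == 0]
  )

# Build the rsplit from the indices, dropping the helper .row column
split <- make_splits(indices, df %>% select(-.row))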
training(split)
#> # A tibble: 50 x 3
#>         x     y group_column
#>     <dbl> <dbl>        <dbl>
#>  1 0.684  0.958            1
#>  2 0.469  0.304            1
#>  3 0.870  0.535            1
#>  4 0.107  0.899            1
#>  5 0.537  0.212            1
#>  6 0.0980 0.553            1
#>  7 0.0834 0.257            1
#>  8 0.0133 0.790            1
#>  9 0.0419 0.888            1
#> 10 0.0560 0.576            1
#> # ... with 40 more rows

testing(split)
#> # A tibble: 50 x 3
#>         x      y group_column
#>     <dbl>  <dbl>        <dbl>
#>  1 0.977  0.802             0
#>  2 0.839  0.0102            0
#>  3 0.979  0.0793            0
#>  4 0.0670 0.815             0
#>  5 0.573  0.287             0
#>  6 0.152  0.672             0
#>  7 0.203  0.373             0
#>  8 0.587  0.635             0
#>  9 0.709  0.446             0
#> 10 0.0289 0.198             0
#> # ... with 40 more rows

Created on 2021-05-09 by the reprex package (v2.0.0)

Thank you very much for your response, but I am not looking to split the data into training and testing sets. I want to make a validation set from an already created training split. Does this make sense?
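
For what it is worth, the same `make_splits()` idea can probably be applied one level down, to the training set rather than the full data. The sketch below is not from the thread: it assumes `manual_rset()` is available in the installed rsample, and the names `train`, `fit_rows`, `val_rows`, `val_split` and `val_set` are hypothetical. The choice of group 0 as the validation group is arbitrary.

```r
library(rsample)

set.seed(123)

df <- data.frame(
  x = runif(100),
  y = runif(100),
  group_column = rep(c(1, 0), 50)
)

# The existing training/testing split stays untouched
df_split <- initial_split(df, prop = 3/4)
train <- training(df_split)

# Row positions within the training set, not within the full data
fit_rows <- which(train$group_column == 1)
val_rows <- which(train$group_column == 0)

# A single analysis/assessment split inside the training set
val_split <- make_splits(
  list(analysis = fit_rows, assessment = val_rows),
  data = train
)

# Wrap it in an rset so that it has the class tuning functions look for
val_set <- manual_rset(list(val_split), ids = "Validation")
val_set
```

Passing `resamples = val_set` to `tune_grid()` in the reprex above should then tune against that one validation set, while `testing(df_split)` stays held out for the final evaluation.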
