Custom Train Fold/Splits for time series data using rsample

cmeuli07 · May 24, 2021, 8:44pm

Hey all, I need some help building an rset object with user-defined folds using the rsample package. The goal is to build an rset object for time series data in which each split covers date ranges that I specify. I want to be able to pass the rset object to tune::tune_bayes.

For the data given below, I would like the splits to follow an every-4-weeks rule (the end of the training window moves forward 4 weeks with each fold) and come out as follows:
fold 1 train min/max = 2017-01-07 to 2020-03-28; fold 1 test min/max = 2020-04-04 to 2020-08-15
fold 2 train min/max = 2017-01-07 to 2020-04-25; fold 2 test min/max = 2020-05-02 to 2020-09-12
fold 3 train min/max = 2017-01-07 to 2020-05-23; fold 3 test min/max = 2020-05-30 to 2020-10-10
fold 4 train min/max = 2017-01-07 to 2020-06-20; fold 4 test min/max = 2020-06-27 to 2020-11-07
fold 5 train min/max = 2017-01-07 to 2020-07-18; fold 5 test min/max = 2020-07-25 to 2020-12-05
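
The rule behind these five folds can be written as plain date arithmetic. The sketch below is only an illustration in base R: the variable names are hypothetical, and the step (28 days), gap (7 days) and test span (19 weeks, i.e. 20 weekly rows) are read off the dates listed above rather than taken from any package.

```r
# Hypothetical parameter names; the values are inferred from the fold list above.
train_start     <- as.Date("2017-01-07")
first_train_end <- as.Date("2020-03-28")
n_folds         <- 5
step_days       <- 28      # the training end moves forward 4 weeks per fold
gap_days        <- 7       # the test window starts one week after training ends
test_span_days  <- 19 * 7  # first to last test date (20 weekly observations)

# End of the training window for folds 1..n_folds
train_end <- first_train_end + step_days * (seq_len(n_folds) - 1)

fold_bounds <- data.frame(
  fold      = seq_len(n_folds),
  train_min = train_start,
  train_max = train_end,
  test_min  = train_end + gap_days,
  test_max  = train_end + gap_days + test_span_days
)

print(fold_bounds)
```

Each row of fold_bounds should match the corresponding fold line above, which is a quick way to confirm the rule before building any splits.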

The reprex data is:

dput(df_reprex)
structure(list(period = structure(c(17173, 17180, 17187, 17194, 
17201, 17208, 17215, 17222, 17229, 17236, 17243, 17250, 17257, 
17264, 17271, 17278, 17285, 17292, 17299, 17306, 17313, 17320, 
17327, 17334, 17341, 17348, 17355, 17362, 17369, 17376, 17383, 
17390, 17397, 17404, 17411, 17418, 17425, 17432, 17439, 17446, 
17453, 17460, 17467, 17474, 17481, 17488, 17495, 17502, 17509, 
17516, 17523, 17530, 17537, 17544, 17551, 17558, 17565, 17572, 
17579, 17586, 17593, 17600, 17607, 17614, 17621, 17628, 17635, 
17642, 17649, 17656, 17663, 17670, 17677, 17684, 17691, 17698, 
17705, 17712, 17719, 17726, 17733, 17740, 17747, 17754, 17761, 
17768, 17775, 17782, 17789, 17796, 17803, 17810, 17817, 17824, 
17831, 17838, 17845, 17852, 17859, 17866, 17873, 17880, 17887, 
17894, 17896, 17901, 17908, 17915, 17922, 17929, 17936, 17943, 
17950, 17957, 17964, 17971, 17978, 17985, 17992, 17999, 18006, 
18013, 18020, 18027, 18034, 18041, 18048, 18055, 18062, 18069, 
18076, 18083, 18090, 18097, 18104, 18111, 18118, 18125, 18132, 
18139, 18146, 18153, 18160, 18167, 18174, 18181, 18188, 18195, 
18202, 18209, 18216, 18223, 18230, 18237, 18244, 18251, 18258, 
18265, 18272, 18279, 18286, 18293, 18300, 18307, 18314, 18321, 
18328, 18335, 18342, 18349, 18356, 18363, 18370, 18377, 18384, 
18391, 18398, 18405, 18412, 18419, 18426, 18433, 18440, 18447, 
18454, 18461, 18468, 18475, 18482, 18489, 18496, 18503, 18510, 
18517, 18524, 18531, 18538, 18545, 18552, 18559, 18566, 18573, 
18580, 18587, 18594, 18601), class = "Date"), units = c(1, 1, 
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1)), class = "data.frame", row.names = c(NA, 
-206L))

DavoWW · May 25, 2021, 1:58am

Hi @cmeuli07,
I'm not sure I understand your requirements exactly, but does an approach like this get you close to what you want?

indat <- structure(list(period = structure(c(17173, 17180, 17187, 17194, 
17201, 17208, 17215, 17222, 17229, 17236, 17243, 17250, 17257, 
17264, 17271, 17278, 17285, 17292, 17299, 17306, 17313, 17320, 
17327, 17334, 17341, 17348, 17355, 17362, 17369, 17376, 17383, 
17390, 17397, 17404, 17411, 17418, 17425, 17432, 17439, 17446, 
17453, 17460, 17467, 17474, 17481, 17488, 17495, 17502, 17509, 
17516, 17523, 17530, 17537, 17544, 17551, 17558, 17565, 17572, 
17579, 17586, 17593, 17600, 17607, 17614, 17621, 17628, 17635, 
17642, 17649, 17656, 17663, 17670, 17677, 17684, 17691, 17698, 
17705, 17712, 17719, 17726, 17733, 17740, 17747, 17754, 17761, 
17768, 17775, 17782, 17789, 17796, 17803, 17810, 17817, 17824, 
17831, 17838, 17845, 17852, 17859, 17866, 17873, 17880, 17887, 
17894, 17896, 17901, 17908, 17915, 17922, 17929, 17936, 17943, 
17950, 17957, 17964, 17971, 17978, 17985, 17992, 17999, 18006, 
18013, 18020, 18027, 18034, 18041, 18048, 18055, 18062, 18069, 
18076, 18083, 18090, 18097, 18104, 18111, 18118, 18125, 18132, 
18139, 18146, 18153, 18160, 18167, 18174, 18181, 18188, 18195, 
18202, 18209, 18216, 18223, 18230, 18237, 18244, 18251, 18258, 
18265, 18272, 18279, 18286, 18293, 18300, 18307, 18314, 18321, 
18328, 18335, 18342, 18349, 18356, 18363, 18370, 18377, 18384, 
18391, 18398, 18405, 18412, 18419, 18426, 18433, 18440, 18447, 
18454, 18461, 18468, 18475, 18482, 18489, 18496, 18503, 18510, 
18517, 18524, 18531, 18538, 18545, 18552, 18559, 18566, 18573, 
18580, 18587, 18594, 18601), class = "Date"), units = c(1, 1, 
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1)), class = "data.frame", row.names = c(NA, 
-206L))

library(tidyverse)

# Create new columns that define the folds and the train/test split based on the date limits.
# We need multiple columns because the fold limits overlap - correct?
# Only done for the first two folds.
indat %>%
  mutate(type1=
  case_when(period >= as.Date("2017-01-07") & period <= as.Date("2020-03-28") ~ "train",
            period >= as.Date("2020-04-04") & period <= as.Date("2020-08-15") ~ "test")) %>% 
  mutate(fold1=
  case_when(period >= as.Date("2017-01-07") & period <= as.Date("2020-08-15") ~ 1)) %>% 
  
  mutate(type2=
  case_when(period >= as.Date("2017-01-07") & period <= as.Date("2020-04-25") ~ "train",
            period >= as.Date("2020-05-02") & period <= as.Date("2020-09-12") ~ "test")) %>% 
  mutate(fold2=
  case_when(period >= as.Date("2017-01-07") & period <= as.Date("2020-09-12") ~ 2)) -> new.dat

head(new.dat); tail(new.dat)
#>       period units type1 fold1 type2 fold2
#> 1 2017-01-07     1 train     1 train     2
#> 2 2017-01-14     1 train     1 train     2
#> 3 2017-01-21     1 train     1 train     2
#> 4 2017-01-28     1 train     1 train     2
#> 5 2017-02-04     1 train     1 train     2
#> 6 2017-02-11     1 train     1 train     2
#>         period units type1 fold1 type2 fold2
#> 201 2020-10-31     1  <NA>    NA  <NA>    NA
#> 202 2020-11-07     1  <NA>    NA  <NA>    NA
#> 203 2020-11-14     1  <NA>    NA  <NA>    NA
#> 204 2020-11-21     1  <NA>    NA  <NA>    NA
#> 205 2020-11-28     1  <NA>    NA  <NA>    NA
#> 206 2020-12-05     1  <NA>    NA  <NA>    NA

# Use the new columns to extract the subsets for analysis, e.g.
fold1 <- new.dat %>% 
  filter(fold1 == 1)

head(fold1); tail(fold1)            
#>       period units type1 fold1 type2 fold2
#> 1 2017-01-07     1 train     1 train     2
#> 2 2017-01-14     1 train     1 train     2
#> 3 2017-01-21     1 train     1 train     2
#> 4 2017-01-28     1 train     1 train     2
#> 5 2017-02-04     1 train     1 train     2
#> 6 2017-02-11     1 train     1 train     2
#>         period units type1 fold1 type2 fold2
#> 185 2020-07-11     1  test     1  test     2
#> 186 2020-07-18     1  test     1  test     2
#> 187 2020-07-25     1  test     1  test     2
#> 188 2020-08-01     1  test     1  test     2
#> 189 2020-08-08     1  test     1  test     2
#> 190 2020-08-15     1  test     1  test     2

Created on 2021-05-25 by the reprex package (v2.0.0)

cmeuli07 · May 25, 2021, 1:32pm

Hey Davo, I appreciate you taking a shot at my problem! Unfortunately, I don't think this will work, because I need an object of class "rset" that I can feed to tune::tune_bayes. Rset objects are returned by some functions in the rsample package. I've updated my original post above to make this requirement clearer.

cmeuli07 · May 25, 2021, 2:26pm

So, I did some more research and was able to answer my own question. Here's my solution:

library(tidyverse)
library(tidymodels)
library(lubridate)  # provides ymd(); not attached by library(tidyverse) before tidyverse 2.0.0

split1 <- make_splits(list("analysis" = which(between(df_reprex$period, ymd('2017-01-07'), ymd('2020-03-28'))),
                           "assessment" = which(between(df_reprex$period, ymd('2020-04-04'), ymd('2020-08-15')))),
                      df_reprex)

split2 <- make_splits(list("analysis" = which(between(df_reprex$period, ymd('2017-01-07'), ymd('2020-04-25'))),
                           "assessment" = which(between(df_reprex$period, ymd('2020-05-02'), ymd('2020-09-12')))),
                      df_reprex)

split3 <- make_splits(list("analysis" = which(between(df_reprex$period, ymd('2017-01-07'), ymd('2020-05-23'))),
                           "assessment" = which(between(df_reprex$period, ymd('2020-05-30'), ymd('2020-10-10')))),
                      df_reprex)

split4 <- make_splits(list("analysis" = which(between(df_reprex$period, ymd('2017-01-07'), ymd('2020-06-20'))),
                           "assessment" = which(between(df_reprex$period, ymd('2020-06-27'), ymd('2020-11-07')))),
                      df_reprex)

split5 <- make_splits(list("analysis" = which(between(df_reprex$period, ymd('2017-01-07'), ymd('2020-07-18'))),
                           "assessment" = which(between(df_reprex$period, ymd('2020-07-25'), ymd('2020-12-05')))),
                      df_reprex)

df_train_folds <- manual_rset(splits = list(split1, split2, split3, split4, split5),
                              ids = c("Split 1", "Split 2", "Split 3", "Split 4", "Split 5"))
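
For anyone who wants more folds without copying the make_splits() call by hand, the same idea can be wrapped in a loop. This is a minimal sketch, not a definitive implementation: make_fold() is a hypothetical helper, df_demo is a simplified stand-in for df_reprex (plain weekly dates, whereas the real data also contains one extra row dated 2018-12-31), and the 28-day step, 7-day gap and 19-week test span are inferred from the fold dates in the question. Only make_splits(), manual_rset(), analysis() and assessment() come from rsample.

```r
library(rsample)

# Stand-in for df_reprex (assumption): one row per week over the same date range.
# Swap in the real data frame when using this.
df_demo <- data.frame(
  period = seq(as.Date("2017-01-07"), as.Date("2020-12-05"), by = "week"),
  units  = 1
)

# Hypothetical helper: build the k-th split from the every-4-weeks rule.
make_fold <- function(k, data) {
  train_start <- as.Date("2017-01-07")
  train_end   <- as.Date("2020-03-28") + 28 * (k - 1)
  test_start  <- train_end + 7
  test_end    <- test_start + 19 * 7
  make_splits(
    list(
      analysis   = which(data$period >= train_start & data$period <= train_end),
      assessment = which(data$period >= test_start & data$period <= test_end)
    ),
    data
  )
}

folds <- manual_rset(
  splits = lapply(1:5, make_fold, data = df_demo),
  ids    = paste("Split", 1:5)
)

# Check the date range of every fold against the list in the question.
fold_ranges <- do.call(rbind, lapply(folds$splits, function(s) {
  data.frame(
    train_min = min(analysis(s)$period),
    train_max = max(analysis(s)$period),
    test_min  = min(assessment(s)$period),
    test_max  = max(assessment(s)$period)
  )
}))

print(fold_ranges)
```

The resulting folds object inherits from "rset", so it can be passed as the resamples argument of tune::tune_bayes in the same way as df_train_folds above.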
