How do I split into training/test sets in this specific way?

I'm following one of the fantastic Tidymodels tutorials on XGBoost from @julia, and everything works as expected for my own data except one thing: the split of training and test sets.

My data have 3 replicates, i.e. each sample was measured 3 times.

I don't want to split the data such that sampleA_R1 and sampleA_R2 end up in the training set, but sampleA_R3 ends up in the test set.

How can I prevent this while still using the same simple syntax from {tidymodels} and {rsample}?

Minimal reprex (mtcars with 3 replicates):

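library(tidymodels)  # attaches dplyr (%>%, mutate) and rsample

# Make a version of mtcars that has 3 replicates for each row
# (as if each car were measured 3 times rather than once)
mtcars_R1 <-
  mtcars %>%
  `rownames<-`(paste0(rownames(.), "_R1"))

# Replicate 2: same cars, every value doubled
mtcars_R2 <-
  mtcars %>%
  `rownames<-`(paste0(rownames(.), "_R2")) %>%
  mutate(across(everything(), ~ .x * 2))

# Replicate 3: same cars, every value tripled
mtcars_R3 <-
  mtcars %>%
  `rownames<-`(paste0(rownames(.), "_R3")) %>%
  mutate(across(everything(), ~ .x * 3))

# Stack the three replicates into one data frame
mtcars_new <- rbind(mtcars_R1, mtcars_R2, mtcars_R3)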
> head(mtcars_new)
                      mpg cyl disp  hp drat    wt  qsec vs am gear carb
Mazda RX4_R1         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag_R1     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
Datsun 710_R1        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
Hornet 4 Drive_R1    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
Hornet Sportabout_R1 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
Valiant_R1           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

Now if we split the data:

vb_split <- initial_split(mtcars_new, strata = mpg)
vb_train <- training(vb_split)
vb_test <- testing(vb_split)

In this example, Mazda RX4 Wag_R1 and Mazda RX4 Wag_R3 are in vb_train, but Mazda RX4 Wag_R2 is in vb_test.

Look at group_initial_split().

Hey there, @cwright1!

The group_initial_split() function from rsample should take care of this use case. Check out the "Grouped Resampling" section of this article on the rsample website for more information.

Hi @simoncouch ,

Thank you! I looked at the group_initial_split() function and it seems that's what I need. I can't seem to make it work with the reprex I posted, though.

Can you help me with what the code should look like here?

initial_split() has the strata parameter (which I still want to use). Does group_initial_split() need a way to group on the part of the name before "_R1/2/3"? Is that right?

Thanks for the help here to you and @nirgrahamuk !

You stated what you don't want. As a thought exercise, try stating it positively rather than negatively, i.e. what you do want.

I want all replicates of a given sample to be together in either the training set or the test set.

For example, I want sampleA_R1, sampleA_R2, and sampleA_R3 to all be in the test set, or all be in the training set.

I tried using the 'root' sample name (sampleA, for example, rather than sampleA_R1) for the group argument of group_initial_split(), but that doesn't seem to be how it's used:

library(stringr)  # str_extract() is not attached by tidymodels

# Add a column holding the car name without the "_R1/2/3" suffix
mtcars_new <-
  cbind(data.frame("rootnames" = str_extract(rownames(mtcars_new), ".+?(?=_)")),
        mtcars_new)

vb_split <- group_initial_split(mtcars_new, group=rootnames, strata=mpg)
Error in `check_grouped_strata()`:
! `strata` must be constant across all members of each `group`.
Run `rlang::last_error()` to see where the error occurred.

How can I keep all replicates of a sample together this way?

Your grouping looks right, but I think you would need to drop the strata = mpg requirement.
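
For reference, here is a minimal sketch of the grouped split without stratification. It rebuilds the reprex data in base R so that it runs on its own (the `reps` helper list is my own shorthand, not part of the original post) and assumes an rsample version that provides `group_initial_split()`.

```r
library(rsample)

# Rebuild the reprex: three replicates per car, plus a grouping column
# holding the car name without the "_R1/2/3" suffix
reps <- lapply(1:3, function(i) {
  out <- mtcars * i                      # replicate i: all values scaled by i
  out$rootnames <- rownames(mtcars)      # group id shared by all replicates
  rownames(out) <- paste0(rownames(mtcars), "_R", i)
  out
})
mtcars_new <- do.call(rbind, reps)

set.seed(123)
# Split by car, so all replicates of a car land on the same side
vb_split <- group_initial_split(mtcars_new, group = rootnames)
vb_train <- training(vb_split)
vb_test  <- testing(vb_split)

# Cars present in both sets; this should be empty
intersect(vb_train$rootnames, vb_test$rootnames)
```

Because whole groups are assigned to one side, the realized training proportion can differ somewhat from `prop`.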

It seems that stratifying grouped samples was not supported, but it's an area of development. There's a lot to read here:
Stratification in grouped resampling · Issue #317 · tidymodels/rsample (github.com)
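
That said, the error message above reads as if `strata` is accepted as long as it is constant within each group. In the reprex it is not, because mpg is multiplied by 2 and 3 in the second and third replicates. Below is a rough sketch of one possible workaround, assuming your installed rsample behaves the way the error message implies: stratify on a per-car variable instead. The `mpg_bin` column and the choice of three quantile bins are my own illustration, not something from the rsample docs.

```r
library(rsample)

# Rebuild the reprex: three replicates per car, plus a grouping column
reps <- lapply(1:3, function(i) {
  out <- mtcars * i
  out$rootnames <- rownames(mtcars)
  rownames(out) <- paste0(rownames(mtcars), "_R", i)
  out
})
mtcars_new <- do.call(rbind, reps)

# Hypothetical group-level stratum: bin each car's unscaled mpg into
# tertiles, so every replicate of a car gets the same label
breaks <- quantile(mtcars$mpg, probs = c(0, 1/3, 2/3, 1))
mtcars_new$mpg_bin <- cut(
  mtcars[mtcars_new$rootnames, "mpg"],   # look up the original mpg by car name
  breaks = breaks,
  include.lowest = TRUE,
  labels = c("low", "mid", "high")
)

set.seed(123)
vb_split <- group_initial_split(mtcars_new, group = rootnames, strata = mpg_bin)
vb_train <- training(vb_split)
vb_test  <- testing(vb_split)

# Cars present in both sets; this should be empty
intersect(vb_train$rootnames, vb_test$rootnames)
```

With real data, the equivalent would be any variable measured once per sample (or a per-sample summary of the outcome) rather than the replicate-level value.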
