I'm following one of the fantastic Tidymodels tutorials on XGBoost from @julia, and everything works as expected for my own data except one thing: the split of training and test sets.
My data have 3 replicates, i.e. each sample was measured 3 times.
I don't want to split the data such that sampleA_R1 and sampleA_R2 end up in the training set, but sampleA_R3 ends up in the test set.
How can I prevent this, still using the same simple syntax from {tidymodels} and {rsample}?
Minimal reprex (mtcars with 3 replicates):
library(tidymodels)

# Make a version of mtcars that has 3 replicates for each row
# (as if each car were measured 3 times rather than once)
mtcars_R1 <-
  mtcars %>%
  `rownames<-`(paste0(rownames(.), "_R1"))
mtcars_R2 <-
  mtcars %>%
  `rownames<-`(paste0(rownames(.), "_R2")) %>%
  mutate_all(function(x) x * 2)
mtcars_R3 <-
  mtcars %>%
  `rownames<-`(paste0(rownames(.), "_R3")) %>%
  mutate_all(function(x) x * 3)
# Stack the three replicates into one data frame
mtcars_new <- rbind(mtcars_R1, mtcars_R2, mtcars_R3)
> head(mtcars_new)
mpg cyl disp hp drat wt qsec vs am gear carb
Mazda RX4_R1 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
Mazda RX4 Wag_R1 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
Datsun 710_R1 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
Hornet 4 Drive_R1 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
Hornet Sportabout_R1 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
Valiant_R1 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
Now if we split the data:
vb_split <- initial_split(mtcars_new, strata = mpg)
vb_train <- training(vb_split)
vb_test <- testing(vb_split)
In this example, Mazda RX4 Wag_R1 and Mazda RX4 Wag_R3 are in vb_train, but Mazda RX4 Wag_R2 is in vb_test.
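
One direction that might work, sketched here under several assumptions: recent releases of {rsample} provide group_initial_split(), which keeps all rows sharing a grouping value on the same side of the split. That needs the sample identity in a column rather than in the row names, so the sketch below rebuilds the data with two hypothetical columns, sample_id and replicate, that are not part of the reprex above. The strata = mpg argument is left out because, as far as I understand, stratification combined with grouping expects the strata value to be constant within each group, which is not the case for these scaled replicates.

```r
library(tidymodels)  # attaches rsample, dplyr, tibble, ...

# Rebuild the replicated data with the car name stored in a column.
# `sample_id` and `replicate` are hypothetical names introduced for this sketch.
make_replicate <- function(k) {
  mtcars %>%
    tibble::rownames_to_column("sample_id") %>%
    mutate(across(-sample_id, ~ .x * k), replicate = paste0("R", k))
}
mtcars_grouped <- bind_rows(lapply(1:3, make_replicate))

set.seed(123)
# Assumes an rsample version that includes group_initial_split()
vb_split <- group_initial_split(mtcars_grouped, group = sample_id, prop = 3/4)
vb_train <- training(vb_split)
vb_test  <- testing(vb_split)

# Cars that appear in both sets; an empty result means no replicate leakage
shared_ids <- intersect(vb_train$sample_id, vb_test$sample_id)
length(shared_ids)
```

If shared_ids comes back empty, every car's three replicates landed together in either vb_train or vb_test. The same grouping idea should carry over to resampling via group_vfold_cv(), though the proportion of rows in each set is only approximate because whole groups are assigned at once.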