test/train split and repeated measures

bragks · October 14, 2018, 7:23pm

Originally posted on Stack Overflow. Trying my luck here with a few modifications.

I want to fit a random forest on this data where y = "happy" after x = "ate". Some of these people were lucky and got two free meals, while some only got one. Could I use rsample to make sure that the same id (in this case 2) does not appear in both the train and test split? If not, how should I do it?

library(tibble)
library(rsample)

set.seed(123)
dframe <- tibble(id = c(1,1,2,2,3,4,5,5,6,7), 
                 ate = sample(c("cookie", "slug"), size = 10, replace = TRUE), 
                 happy = sample(c("yes", "no"), size = 10, replace = TRUE))

dframe_split <- initial_split(dframe, prop = 3/4, strata = "ate")
dframe_train <- training(dframe_split)
dframe_test <- testing(dframe_split)

Created on 2018-10-14 by the reprex package (v0.2.0).

joels · October 15, 2018, 3:46pm

I haven't used rsample before and I don't see an obvious way to sample by id with rsample functions, but you could use the base R sample function instead:

library(tidyverse)

set.seed(5)
# Sample 75% of the unique ids (rounded down) so each person's rows stay together
train_ids = sample(unique(dframe$id), size=floor(0.75*length(unique(dframe$id))))

dframe_train = dframe %>% filter(id %in% train_ids)
dframe_test = dframe %>% filter(!id %in% train_ids)

Or, staying completely in base R:

dframe_train = dframe[dframe$id %in% train_ids, ]
dframe_test = dframe[!dframe$id %in% train_ids, ]
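
To confirm that the id-based split really keeps each person on one side, a quick check along these lines might help. This is only a sketch: it rebuilds the example data as a plain data.frame so it runs on its own, and the name shared_ids is just an illustrative helper, not something from the posts above.

```r
# Rebuild the example data in base R (same ids as in the question)
set.seed(123)
dframe <- data.frame(
  id    = c(1, 1, 2, 2, 3, 4, 5, 5, 6, 7),
  ate   = sample(c("cookie", "slug"), size = 10, replace = TRUE),
  happy = sample(c("yes", "no"), size = 10, replace = TRUE)
)

# Sample whole ids, not rows; floor() makes the integer count explicit
set.seed(5)
ids <- unique(dframe$id)
train_ids <- sample(ids, size = floor(0.75 * length(ids)))

dframe_train <- dframe[dframe$id %in% train_ids, ]
dframe_test  <- dframe[!dframe$id %in% train_ids, ]

# No id should appear in both sets, and no rows should be lost
shared_ids <- intersect(dframe_train$id, dframe_test$id)
stopifnot(length(shared_ids) == 0)
stopifnot(nrow(dframe_train) + nrow(dframe_test) == nrow(dframe))
```

Because ids contribute different numbers of rows, the row-level proportion will usually not be exactly 75%, even though 75% of the ids (rounded down) go to training.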

bragks · October 15, 2018, 5:37pm

Excellent, this is exactly what I was looking for! Why can I never remember %in%...

Max · October 16, 2018, 2:47pm

There is also an rsample function for this purpose: group_vfold_cv.
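
A minimal sketch of how group_vfold_cv could be used here, assuming the example data from the question. The choice of v = 4 is arbitrary and only for illustration; note that this produces a set of grouped cross-validation resamples, not a single train/test split, so the sketch simply pulls out the first resample.

```r
library(rsample)

# Rebuild the example data so the sketch runs on its own
set.seed(123)
dframe <- data.frame(
  id    = c(1, 1, 2, 2, 3, 4, 5, 5, 6, 7),
  ate   = sample(c("cookie", "slug"), size = 10, replace = TRUE),
  happy = sample(c("yes", "no"), size = 10, replace = TRUE)
)

# Grouped V-fold CV: all rows sharing an id are assigned to the same fold
# (v = 4 is an illustrative assumption, not a recommendation)
folds <- group_vfold_cv(dframe, group = "id", v = 4)

# Extract the modelling and held-out rows of the first resample
first_split  <- folds$splits[[1]]
dframe_train <- analysis(first_split)
dframe_test  <- assessment(first_split)

# The same id should never show up on both sides
stopifnot(length(intersect(dframe_train$id, dframe_test$id)) == 0)
```

Each element of folds$splits can be handled the same way, which makes it straightforward to fit the random forest on analysis() and evaluate it on assessment() for every fold.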