How to split data into train and test sets when there is a group column?

I want to split my data set into training and test data, but one of my columns is a group identifier. All members of a group must end up in the same set, either train or test. For example, if the group column looks like this:

         group
           1
           1
           1
           1
           1
           2
           2
           2
           3
           3

If one row of the first group is in the training set, then all of the first 5 rows must be there too, and ...

I think the easiest approach is to build the training and test sets by sampling values of the group column rather than individual rows. Let's say your data are in a data frame named DF, there are ten groups labeled 1 to 10, and you want the training set to contain 7 of the groups.

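library(dplyr)
# Randomly pick 7 of the 10 group labels for the training set
training <- sample(1:10, 7, replace = FALSE)
training

# Assign rows by group label, so no group is split across the two sets
TrnDF <- DF %>% filter(group %in% training)
TestDF <- DF %>% filter(!group %in% training)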

That may cause a problem if the groups are of very different sizes, because 7 of 10 groups need not correspond to roughly 70% of the rows, but that is inherent in the data.
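If the group labels are not known in advance, the same idea can be written so that the labels are read from the data. This is only a sketch in base R: the toy DF, the value column, the seed, and the 70% share of groups are all assumptions made for illustration.

```r
set.seed(42)  # assumption: fixed seed so the split is reproducible

# Toy data mirroring the question: groups of size 5, 3 and 2
DF <- data.frame(group = c(1, 1, 1, 1, 1, 2, 2, 2, 3, 3),
                 value = rnorm(10))

groups  <- unique(DF$group)              # group labels actually present
n_train <- floor(0.7 * length(groups))   # assumed: 70% of the groups

# Sample positions rather than values, which avoids the sample() surprise
# when a numeric vector has length 1
training <- groups[sample.int(length(groups), n_train)]

TrnDF  <- DF[DF$group %in% training, ]
TestDF <- DF[!(DF$group %in% training), ]

# Group labels present in both sets (should be empty)
intersect(TrnDF$group, TestDF$group)

# Share of rows, not groups, that ended up in the training set
nrow(TrnDF) / nrow(DF)
```

The last line shows how far the row-level proportion drifts from the group-level one when group sizes differ.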

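rsample, which is part of tidymodels, has the following function:

initial_split(data, prop = 3/4, strata = NULL, breaks = 4, ...)

Note that the strata argument does stratified sampling: it tries to keep the distribution of that variable similar in the training and test sets, so passing your group column there would spread each group across both sets rather than keep it together. For grouped splits, recent versions of rsample provide group_initial_split(data, group, prop = 3/4, ...), which assigns all rows of a group to the same set.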
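For a split that keeps each group intact, here is a minimal sketch using group_initial_split() from rsample. It assumes your installed rsample is recent enough to export that function. The toy DF, its group sizes, the seed, and prop = 0.7 are made up for illustration.

```r
library(rsample)

set.seed(123)  # assumption: fixed seed so the split is reproducible

# Toy data: ten groups of unequal size (sizes are arbitrary)
sizes <- c(5, 3, 2, 4, 6, 3, 5, 2, 4, 6)
DF <- data.frame(group = rep(1:10, times = sizes),
                 value = rnorm(sum(sizes)))

# Whole groups are assigned to either the training or the test set
split  <- group_initial_split(DF, group = group, prop = 0.7)
TrnDF  <- training(split)
TestDF <- testing(split)

# Group labels present in both sets (should be empty)
intersect(TrnDF$group, TestDF$group)
```

Because whole groups are moved together, the share of rows in the training set will generally only approximate prop.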
