I have a dataset with important groupings that I want to preserve, so that no group leaks between the training and test sets, and I would like to do this with the caret package. I understand that groupKFold() can be used to keep groups together during cross-validation on the training set, with the resulting folds passed to the index argument of trainControl().
My question is: how can I prevent leakage of a group between the training and test sets (e.g., when certain persons/groups each have multiple rows in the data frame)? I assume I need to do this before I use groupKFold() on the training set.
Apologies if I am misunderstanding what CV is doing here.
Sample data with code is below.
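library(caret)  # groupKFold(), trainControl(), train()
library(dplyr)  # select()
# Make random dataframe (seeded so the example is reproducible)
set.seed(2019)
x1 <- runif(n = 50, min = 1, max = 50)
x2 <- runif(n = 50, min = 100, max = 250)
class <- sample(x = c(0, 1), size = 50, replace = TRUE)
group <- sample(x = c(1:20), size = 50, replace = TRUE)
df <- data.frame(x1, x2, class, group)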
# perform 10-fold CV while keeping each group within a single fold
group_folds <- groupKFold(df$group, k = 10)
group_fit_control <- trainControl(
  index = group_folds,  # use grouped CV folds
  method = "cv"
)
# fit model
set.seed(2019)
svm_fit <- train(as.factor(class) ~ .,
                 data = select(df, -group),
                 method = "svmLinear",
                 trControl = group_fit_control)
# run prediction
predict(svm_fit, newdata = select(df, -group))
I think I need to split the data into training and test sets so that no group in the training set also appears in the test set, and then pass only the training set to train() above. How can I do this without data leakage?
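To make the idea concrete, here is a minimal base-R sketch of the kind of split I have in mind. It samples whole group IDs rather than rows, so every row of a held-out group lands in the test set. The 20% hold-out fraction and the names test_groups, train_df and test_df are my own assumptions for illustration, and I am not sure this is the recommended way to do it with caret.

```r
# Sketch: hold out whole groups so no group spans both train and test.
set.seed(2019)

# Toy data shaped like the sample above
df <- data.frame(
  x1    = runif(50, min = 1, max = 50),
  x2    = runif(50, min = 100, max = 250),
  class = sample(c(0, 1), size = 50, replace = TRUE),
  group = sample(1:20, size = 50, replace = TRUE)
)

# Sample roughly 20% of the distinct group IDs for the test set
# (the 20% fraction is an arbitrary assumption)
group_ids   <- unique(df$group)
n_test      <- max(1, round(0.2 * length(group_ids)))
test_groups <- group_ids[sample.int(length(group_ids), n_test)]

# Rows follow their group: every row of a held-out group goes to test
test_df  <- df[df$group %in% test_groups, ]
train_df <- df[!df$group %in% test_groups, ]

# Check: no group ID appears in both sets
stopifnot(length(intersect(train_df$group, test_df$group)) == 0)
```

Presumably I would then call groupKFold(train_df$group, k = 10) and train() on train_df only, keeping test_df aside for predict(). Because groups differ in size, the row-level split will only be approximately 80/20.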
Thanks for any help.