How to prevent data leakage between training and test set when I have repeated measures?


I have a dataset with important groupings, and I want to preserve them so that no group leaks between the training and test sets. I would like to do this with the caret package. I understand that groupKFold() can be used to preserve groupings during cross-validation on the training set, with the resulting folds passed to the index argument of trainControl().

My question is how to prevent data leakage per group between the training and test sets (i.e., suppose certain persons/groups each have multiple rows in the data frame). I assume I need to do this before I use groupKFold() on the training set.

Apologies if I am misunderstanding what CV is doing here.

Sample data with code is below.

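    # Load required packages (method = "svmLinear" also needs kernlab installed)
    library(caret)
    library(dplyr)

    set.seed(123)  # make the random data reproducible

    # Make random dataframe
    x1 <- runif(n = 50, min = 1, max = 50)
    x2 <- runif(n = 50, min = 100, max = 250)
    class <- sample(x = c(0, 1), size = 50, replace = TRUE)
    group <- sample(x = c(1:20), size = 50, replace = TRUE)
    df <- data.frame(x1, x2, class, group)

    # perform 10 fold CV while preserving group
    # (groupKFold returns the training row indices for each fold)
    group_folds <- groupKFold(df$group, k = 10)

    group_fit_control <- trainControl(## use grouped CV folds
                              index = group_folds,
                              method = "cv")

    # fit model (group is dropped so it is not used as a predictor)
    svm_fit <- train(as.factor(class) ~ ., 
            data = select(df, - group), 
            method = "svmLinear",
            trControl = group_fit_control)

    # run prediction (note: this predicts on the same rows used for fitting)
    predict(svm_fit, newdata = select(df, - group))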

I think I need to split the data so that no group in the training set also appears in the test set, and then use only the training set in train() above. How can I do this without data leakage?

Thanks for any help.



Take a look at



Thank you, that's essentially what I wrote as sample code above. I think part of what I'm not understanding is how groupKFold creates a holdout/test set. Or is it used only on the training set to make sure that important groups are preserved during CV?



Per the docs:

"For Group k-fold cross-validation, the data are split such that no group is contained in both the modeling and holdout sets."

So, I think that I may not be on the same page with you. Are we talking about group as a partition of the dataset into k equal portions? Alternatively is there a group factor in the data frame that we want to exclude?

If it's the latter, my admittedly limited understanding is that you will hold out a factor at a time and either end up with k-1 models or feed them back into an "outer" k-fold.
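To make the quoted property concrete, here is a minimal sketch of how one might check it. The helper `fold_is_group_clean()` is a hypothetical name, not part of caret, and the hand-built folds only illustrate the idea rather than reproduce caret's implementation. It assumes each fold is given as a vector of modeling (training) row indices, which is the form groupKFold() returns for the index argument; the "holdout" is then the remaining rows of that fold, not a separate final test set.

```r
# Hypothetical helper: TRUE if no group appears in both the modeling rows
# (train_idx) and the held-out rows (all remaining rows) of one fold.
fold_is_group_clean <- function(train_idx, group) {
  holdout_idx <- setdiff(seq_along(group), train_idx)
  length(intersect(group[train_idx], group[holdout_idx])) == 0
}

set.seed(1)
group <- sample(1:20, size = 50, replace = TRUE)

# Hand-built grouped folds (illustration only): assign each unique group
# to one of 5 folds, then keep, per fold, the rows whose group is NOT held out.
ids <- unique(group)
fold_of_group <- sample(rep(1:5, length.out = length(ids)))
manual_folds <- lapply(1:5, function(f) {
  which(!group %in% ids[fold_of_group == f])
})
manual_clean <- vapply(manual_folds, fold_is_group_clean, logical(1),
                       group = group)
print(manual_clean)

# The same check applied to caret's folds, if caret is installed.
if (requireNamespace("caret", quietly = TRUE)) {
  caret_folds <- caret::groupKFold(group, k = 5)
  caret_clean <- vapply(caret_folds, fold_is_group_clean, logical(1),
                        group = group)
  print(caret_clean)
}
```

Either way, the folds only partition the rows passed in. If a final test set is wanted, it has to be separated before the folds are built.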



We don't have a separate function to do this for the train/test split. You could use sample() on the unique values of the group and allocate the rows with those group values to the test set.
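A minimal sketch of that approach, using base R only. The data frame mirrors the sample data from the question, and the 25% test fraction and the names `test_groups`, `train_df`, and `test_df` are assumptions chosen for illustration.

```r
set.seed(42)

# Toy data in the same shape as the question's example
df <- data.frame(
  x1    = runif(50, min = 1, max = 50),
  x2    = runif(50, min = 100, max = 250),
  class = sample(c(0, 1), size = 50, replace = TRUE),
  group = sample(1:20, size = 50, replace = TRUE)
)

# Sample whole groups rather than rows.
# The 0.25 test fraction is an arbitrary assumption.
group_ids     <- unique(df$group)
n_test_groups <- ceiling(0.25 * length(group_ids))
test_groups   <- sample(group_ids, size = n_test_groups)

# Every row of a sampled group goes to the test set; the rest is training
test_df  <- df[df$group %in% test_groups, ]
train_df <- df[!df$group %in% test_groups, ]

# Number of groups shared between the two sets (should be 0)
length(intersect(train_df$group, test_df$group))
```

After this, groupKFold(train_df$group, k = ...) can be run on the training rows only, so the grouped CV folds never see the test groups. Two caveats: the split is by number of groups, so the row-level test fraction will vary with group sizes, and sample() treats a single number as 1:n, so this needs more than one unique group.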



Thank you for this very helpful clarification. Is there an alternative way outside of caret to achieve a balanced distribution between train/test while preserving important groups? Among the many things I appreciate about createDataPartition is that it is very useful in creating a balanced distribution of the outcome class for train and test set.