Cross Validation: Data Leakage for Group Aggregation

john.smith · June 10, 2018, 11:55am

Hi,

I think i have found the answer to my problem...assuming my assumptions above are valid.
Using the rsample package i was able to split the data-set into a stratified ten fold cross validation.
I updated the training and test set of each portion to calculate the average per school and then was able to model my new cross validated data-set using the notes by Dr Max Kuhn here from slide 30 onwards. i found this very difficult (there is a distinct possibility i missed something) but working through it was very rewarding

I hope someone finds it useful


# Assume the data is the same for the previous post
library(lavaan)
library(tidyverse)
set.seed(300)

# Create a training and test set and set it to 80% for the modelling aspect
# https://github.com/topepo/rstudio-conf-2018/blob/master/Materials/Part_2_Basic_Principals.pdf
library(rsample)
data_split <- initial_split(mydf, prop = 0.8, strata = "dep")
train <- training(data_split)
test <- testing(data_split)
nrow(train)/nrow(mydf)


# Create Cross Validation Plan
# Split it into 10 resamples basically train and test pairs
# The train segment is called analysis and the test is called assesment
# We will use 10 pairs
cv_splits <- vfold_cv(train, v=10, strata = "dep")

# Sample sizes 
cv_splits$splits[[1]]

# Analysis gets the training split for the first fold
analysis(cv_splits$splits[[1]])

# assessment gets the test split for the first fold
assessment(cv_splits$splits[[1]])

create_fold_aggregates <- function(fold){
  
  # We create a test and a training set per fold
  train_fold <- analysis(fold) %>% rownames_to_column() %>% mutate(data_type = 'train')
  test_fold <- assessment(fold) %>% rownames_to_column() %>% mutate(data_type = 'test')
  
  # Combine the train fold and the test fold together
  est_data <- bind_rows(train_fold, test_fold)
  
  # Generate mean aggregate per school per dataset type
  # So the mean per school will differ in the training and the test
  # It will also differ across folds
  est_data_agg <- est_data %>% 
    select(data_type, school, x7, x8) %>%
    mutate(ind_mean = rowMeans(select(., x7, x8))) %>% 
    group_by(data_type, school) %>% 
    summarise(school_mean = mean(ind_mean)) %>% 
    ungroup()
  
  # Bring individual scores and group scores together
  df_calc <-  est_data %>% 
    left_join(est_data_agg) %>% 
    select(-x7,-x8) %>% 
    arrange(as.numeric(rowname)) %>% 
    column_to_rownames("rowname") %>% 
    select(-data_type)
    data.frame()
  
  # Update Fold
  fold$data <- df_calc
  
  return(fold)
  
}

# Create a new list which consists of the updated school aggregates per fold for train and test
new_splits <- map(.x = cv_splits$splits, .f = create_fold_aggregates)

# Bring it back into the fold
cv_splits$splits <- new_splits

Thanks