Cross Validation: Data Leakage for Group Aggregation

john.smith · June 2, 2018, 3:57pm

Hi,

I am currently working with some data where particular people are from a specific country. I have attributes for those people where some are scored at individual level and some at group level. The group level scores are aggregated to country level which means they will only ever be 6 scores (one per country) in the play dataset below.

# Create some sample data
# 1000 people from 6 countries
library(tidyverse)
country <- c("Ireland","England","France","Germany","USA","Spain")

mydf <- data.frame(countries = sample(country,1000,replace = TRUE),
               x = sample(1:1000, replace = TRUE),
               y = sample(1:1000, replace = TRUE),
               z = sample(1:1000, replace = TRUE)
) 

# Demonstrate aggregation levels for variable z for the entire dataset
mydf %>% group_by(countries) %>% 
  summarise(mu = mean(z)) %>% 
  ungroup()

My question is two-fold and I hope not ridiculous

If i was to model x and z as a function of y using cross validation, should i do the group aggregation within the fold. Assume that i am trying to use my model for some sort of inference. If i aggregated to the country level outside of the cross validation step it would seem that i could inadvertently leak information to my test fold in the cross validation (all data being used to aggregate to the country level).
Is there a clever way to do something like this with caret or with recipes

Thank you all for your time
Have a nice weekend

JohnMount · June 2, 2018, 7:32pm

This is exactly what vtreat::kWayCrossValidation() is designed for. It builds a cross-validation plan where for each set of rows you want to make an estimate for you have a complementary set of rows to generate that estimate.

The code below applies the methodology to computing the mean per-row for the set of columns. One can wrap it for re-use. Notice how in the example different rows labeled "Germany" have different estimates of the means. This is because they were formed using different sets of estimating rows, and this does a good job reducing the group-contamination. Within reasonable limits these per group means do a very good job of simulating having come from data disjoint from the data set at hand.

We have some notes on the theory here.

set.seed(355626)
country <- c("Ireland","England","France","Germany","USA","Spain")

mydf <- data.frame(countries = sample(country,1000,replace = TRUE),
                   x = sample(1:1000, replace = TRUE),
                   y = sample(1:1000, replace = TRUE),
                   z = sample(1:1000, replace = TRUE)
) 

vars <- c("x", "y", "z")

cross_plan <- vtreat::kWayCrossValidation(nrow(mydf), nSplits = 10, mydf, mydf$countries)


for(vi in vars) {
  mydf[[paste0(vi, "_mean")]] <- NA_real_
}

for(i in seq_len(length(cross_plan))) {
  est_data <- mydf[cross_plan[[i]]$train, , drop = FALSE]
  for(vi in vars) {
    mi <- paste0(vi, "_mean")
    est_map <- tapply(est_data[[vi]], est_data$countries, mean)
    mydf[[mi]][cross_plan[[i]]$app] <- est_map[mydf$countries[cross_plan[[i]]$app]]
  }
}

head(mydf)

#  countries   x   y   z   x_mean   y_mean   z_mean
#1   Ireland 281 829 709 470.4934 465.3289 509.0329
#2   Germany 551 681 493 481.9880 499.5602 464.0301
#3   Germany 712 124 650 483.7081 495.1180 463.3106
#4     Spain 554 794 149 497.9041 506.0548 504.9315
#5       USA 916 133 259 479.9200 548.9400 480.8267
#6   England  21 886 678 481.5221 464.9191 527.7647

john.smith · June 9, 2018, 11:09am

Hi @JohnMount

Than you for taking the time out to provide such a detailed and comprehensive answer. I had a look at the vignette for your package and was wondering if you have the time could you clarify/confirm my understanding

From looking into the package and your code above the function kWayCrossValidation returns a list of data that has been split into a list with ten training sets and ten validation sets
The code then proceeds to create a mean variable to country level within each fold. So for example taking the first fold of train and test dataset, you calculate the mean for the training set first on country level and a separate mean within the test set?
You then iterate through each train and test set fold (ten in told) and make the calculation on train and test set separately for each pair
The final output would be a ten list item with a train and test dataset where each pair would have a separate mean calculated.

Below is a reprex i tried to implement based on more realistic data taken from the lavaan package. The data is the HolzingerSwineford1939 dataset and comprises the mental ability test scores of seventh- and eighth-grade children from two different schools

I have changed it slightly to make an artificial example using your library and returned the results in a dataframe so its easier to see what happened

library(lavaan)
library(tidyverse)
set.seed(100)
mydf <- HolzingerSwineford1939 %>% 
  select(ageyr:x9) %>% 
  filter(complete.cases(.))

# We will assume that x1 to x3 is a single attribute on individual level - attrib_1
# Similiarly we assume x4 to x6 is a single attribute on individual level - attrib_2
# Finally we assume that x7 to x8 is some measurement on school level - attrib_3
# We are trying to predict x9 - dep
mydf <- mydf %>% 
  mutate(attrib_1 = rowMeans(select(.,x1:x3)),
         attrib_2 = rowMeans(select(.,x4:x6)),
         dep = x9) %>% 
  select(ageyr:grade, x7, x8, starts_with('attr'), dep)

# Create a training and test set
library(caret)
index <- createDataPartition(y=mydf$dep, p=0.7, list=FALSE)
train <- mydf[index,]
test <- mydf[-index,]

# Create Cross Validation Plan
cross_plan <- vtreat::kWayCrossValidation(nrow(train), nSplits = 10, train, train$dep)
cross_plan_index <- seq(1:length(cross_plan))

# Create a function to calculate the mean per school per fold for both train and test
# Function will return a dataframe with the mean per school
create_fold_aggregates <- function(fold, cross_plan, mydf) {
  
  # We create a test and a training set per fold
  train_fold <- mydf[cross_plan[[fold]][["train"]], ] %>% rownames_to_column() %>% mutate(data_type = 'train')
  test_fold <- mydf[cross_plan[[fold]][["app"]], ] %>% rownames_to_column() %>% mutate(data_type = 'test')
  
  # Combine the train fold and the test fod together
  est_data <- bind_rows(train_fold, test_fold)
  
  # Generate mean aggregate per school per dataset type
  # So the mean per school will differ in the training and the test
  # It will also differ across folds
  est_data_agg <- est_data %>% 
    select(data_type, school, x7, x8) %>%
    mutate(ind_mean = rowMeans(select(., x7, x8))) %>% 
    group_by(data_type, school) %>% 
    summarise(school_mean = mean(ind_mean)) %>% 
    ungroup()
  
  # Bring individual scores and group scores together
  est_data <- est_data %>% 
    left_join(est_data_agg) %>% 
    select(-x7,-x8) %>% 
    rename(index = rowname) %>% 
    mutate(fold = paste0("fold", fold))
  
  return(est_data)
  
}

# Review the new aggregates per fold for train and test
test <- map_df(.x = cross_plan_index, .f = create_fold_aggregates, cross_plan, train)

# Plot the mean per school per fold for both test and train just to get an undestanding on how the means change
ggplot(test, aes(x=fold, y = school_mean, colour = school, shape = data_type)) +
  geom_point(size = 4) +
  theme_minimal()

I notice in the vignette you use cross validation when implementing a logistic regression. If i wanted to utilize the above with say the structure of caret to try many different models to try and predict the variable 'dep', Is this possible? I would like to take advantage of carets diagnostic tools for model performance as well as using the same structure to produce other models but wouldn't have a clue how to force caret to use our new calculated train/test data with custom folds

Thanks again for your time. I appreciate this is a bit long winded

john.smith · June 10, 2018, 11:55am

Hi,

I think i have found the answer to my problem...assuming my assumptions above are valid.
Using the rsample package i was able to split the data-set into a stratified ten fold cross validation.
I updated the training and test set of each portion to calculate the average per school and then was able to model my new cross validated data-set using the notes by Dr Max Kuhn here from slide 30 onwards. i found this very difficult (there is a distinct possibility i missed something) but working through it was very rewarding

I hope someone finds it useful


# Assume the data is the same for the previous post
library(lavaan)
library(tidyverse)
set.seed(300)

# Create a training and test set and set it to 80% for the modelling aspect
# https://github.com/topepo/rstudio-conf-2018/blob/master/Materials/Part_2_Basic_Principals.pdf
library(rsample)
data_split <- initial_split(mydf, prop = 0.8, strata = "dep")
train <- training(data_split)
test <- testing(data_split)
nrow(train)/nrow(mydf)


# Create Cross Validation Plan
# Split it into 10 resamples basically train and test pairs
# The train segment is called analysis and the test is called assesment
# We will use 10 pairs
cv_splits <- vfold_cv(train, v=10, strata = "dep")

# Sample sizes 
cv_splits$splits[[1]]

# Analysis gets the training split for the first fold
analysis(cv_splits$splits[[1]])

# assessment gets the test split for the first fold
assessment(cv_splits$splits[[1]])

create_fold_aggregates <- function(fold){
  
  # We create a test and a training set per fold
  train_fold <- analysis(fold) %>% rownames_to_column() %>% mutate(data_type = 'train')
  test_fold <- assessment(fold) %>% rownames_to_column() %>% mutate(data_type = 'test')
  
  # Combine the train fold and the test fold together
  est_data <- bind_rows(train_fold, test_fold)
  
  # Generate mean aggregate per school per dataset type
  # So the mean per school will differ in the training and the test
  # It will also differ across folds
  est_data_agg <- est_data %>% 
    select(data_type, school, x7, x8) %>%
    mutate(ind_mean = rowMeans(select(., x7, x8))) %>% 
    group_by(data_type, school) %>% 
    summarise(school_mean = mean(ind_mean)) %>% 
    ungroup()
  
  # Bring individual scores and group scores together
  df_calc <-  est_data %>% 
    left_join(est_data_agg) %>% 
    select(-x7,-x8) %>% 
    arrange(as.numeric(rowname)) %>% 
    column_to_rownames("rowname") %>% 
    select(-data_type)
    data.frame()
  
  # Update Fold
  fold$data <- df_calc
  
  return(fold)
  
}

# Create a new list which consists of the updated school aggregates per fold for train and test
new_splits <- map(.x = cv_splits$splits, .f = create_fold_aggregates)

# Bring it back into the fold
cv_splits$splits <- new_splits

Thanks