Using Mutate with Recipes and rsample for Group-Level Calculations

Hi,

This is a rather long question that I was hoping someone could help me with.
I have a dataset, taken from the lavaan package, of measurements of students at a school.
It was a questionnaire where specific questions are averaged to create factors.

For a completely fictitious example, let's say x1:x3 are three questions that are supposed to measure how extroverted someone is. To get the extroversion score, you would take the mean, so mean(x1:x3) would produce this score. I wish to calculate the same on school level: take the mean of x1:x3 per student, and then take the mean of that number per school. Each student will then have a score on individual level, and each school will also have a score.

I would then like to predict a variable using the recipes and caret packages. To avoid leaking data and compromising my test scores, I would use ten-fold cross-validation. That gives ten analysis-assessment splits, roughly 90-10% each time.

In an ideal scenario I would make the split on the raw data and then, in each analysis-assessment split, calculate the group scores within each segment separately. More concretely, the analysis portion of each split would not contain all the students, so the school average would be slightly different in each fold for each portion of the data. Below is a fictitious example:


# RStudio Community Question
library(lavaan)
library(tidyverse)
library(rsample)

# Builds fake group scores per school
# So individual scores exist per individual and group scores for each school
build_grp_scores <- function(mydf) {
  
  # Create fake constructs for the train and test set
  temp <- mydf %>% 
    select(id, school, grade, x1:x9) %>% 
    mutate(fct1 = rowMeans(select(.,x1:x3)),
           fct2 = rowMeans(select(.,x4:x7)),
           fct3 = rowMeans(select(.,x1:x5))) %>% 
    dplyr::select(id, school, grade, x1, x5, x7, starts_with('fct'))
  
  # Create The group construct
  group_scores <- temp %>% 
    group_by(school) %>% 
    summarise(sch_avg_fct1 = mean(fct1, na.rm = TRUE),
              sch_avg_fct2 = mean(fct2, na.rm = TRUE),
              sch_avg_fct3 = mean(fct3, na.rm = TRUE),
              sch_avg_grade = mean(grade, na.rm = TRUE))
  
  
  # Attach the school-level scores to each student (explicit join key)
  temp <- temp %>% 
    inner_join(group_scores, by = "school")
  
  temp
}


# Create dataset and a random dependent variable
mydf <- HolzingerSwineford1939
tgt <- as.factor(sample(c(1, 0), nrow(mydf), replace = TRUE))
mydf <- mydf %>% mutate(tgt = tgt)

# Create a ten fold partition of the dataset
cv_splits <- vfold_cv(mydf, v=10, strata = "tgt")

# Sample sizes to check
cv_splits$splits[[1]]

# Build up the group scores for training and test
build_grp_scores(analysis(cv_splits$splits[[1]])) %>% glimpse()
build_grp_scores(assessment(cv_splits$splits[[1]])) %>% glimpse()

The final two lines in this code show the variation in the school averages, which is exactly what I'm looking for.
My questions are now twofold:

How do I update my cv_splits object with the new version of the data derived from my function build_grp_scores?

Is it then possible to take my new resamples and feed them into a recipe or caret model?

Thank you for your time. Sorry for the length.

Do you need to collapse the data within the resamples? That might be hard to do with recipes and caret.

What type(s) of models would you fit to the data?

Perhaps consider doing leave-group-out with student as the group:

> lv_student_out <- group_vfold_cv(mydf, "id")
> lv_student_out
# Group -fold cross-validation 
# A tibble: 301 x 2
   splits       id         
   <list>       <chr>      
 1 <S3: rsplit> Resample001
 2 <S3: rsplit> Resample002
 3 <S3: rsplit> Resample003
 4 <S3: rsplit> Resample004
 5 <S3: rsplit> Resample005
 6 <S3: rsplit> Resample006
 7 <S3: rsplit> Resample007
 8 <S3: rsplit> Resample008
 9 <S3: rsplit> Resample009
10 <S3: rsplit> Resample010
# ... with 291 more rows

Hi @Max

Thanks for coming back to me so quickly. The aggregate would be calculated per segment per split. I actually was thinking of this while reading your book on feature engineering about data leakage (3.4.6).

From a social science perspective, constructs/factors can be measured on group or individual level. While the individual level wouldn't succumb to data leakage because it's calculated across a single record, the group scores would be impacted because different individuals are selected in each segment during the re-sampling. I think this is quite interesting (I may be alone in this :) ) because it's a sample statistic and so would be expected to vary across the sub-samples.

Of course, we could just take a single training and test split, calculate the group scores separately, and validate it that way. But if you don't have many groups you would only get two scores per group, so it may as well be a factor. With many samples we could also compare it to published research a little more easily.

To answer your question, I was going to try to work through an example of MARS using recipes and rsample.

I was thinking that, if you need to summarize within each resample, use recipes, rsample and other packages directly. caret wouldn't be able to help you there.

If these are unfamiliar, these class notes can be helpful to learn.
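As a rough illustration of using recipes and rsample directly, here is a minimal sketch that estimates the preprocessing on the analysis set of each split and applies it to the matching assessment set. The helper name fit_one_split, the seed, the two-factor formula, and the use of glm as a stand-in for MARS are all assumptions made for this example, not something taken from the thread.

```r
library(lavaan)     # provides HolzingerSwineford1939
library(tidyverse)
library(rsample)
library(recipes)

set.seed(42)  # assumption: fixed seed so the fake target is reproducible

# Individual-level factor scores plus a random binary target
mydf <- HolzingerSwineford1939 %>%
  mutate(fct1 = rowMeans(dplyr::select(., x1:x3)),
         fct2 = rowMeans(dplyr::select(., x4:x7)),
         tgt  = factor(sample(c(1, 0), n(), replace = TRUE))) %>%
  dplyr::select(fct1, fct2, tgt)

cv_splits <- vfold_cv(mydf, v = 10, strata = "tgt")

# Hypothetical helper: preprocessing is estimated on the analysis data only,
# then applied unchanged to the assessment data
fit_one_split <- function(split) {
  rec <- recipe(tgt ~ fct1 + fct2, data = analysis(split)) %>%
    step_center(all_numeric()) %>%
    step_scale(all_numeric()) %>%
    prep(training = analysis(split))

  train <- bake(rec, new_data = analysis(split))
  test  <- bake(rec, new_data = assessment(split))

  # glm is only a placeholder model for this sketch
  model <- glm(tgt ~ fct1 + fct2, data = train, family = binomial())
  prob  <- predict(model, newdata = test, type = "response")
  pred  <- factor(ifelse(prob > 0.5, "1", "0"), levels = levels(test$tgt))

  mean(pred == test$tgt)  # assessment-set accuracy for this split
}

accuracy <- map_dbl(cv_splits$splits, fit_one_split)
```

Because the target is random, the accuracies should hover around chance. The point is only that nothing from the assessment set is used when the recipe is prepped.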


Hi @Max

Sorry to come back to this. It took me longer than expected to go through the notes as I needed to familiarize myself with purrr. Thank you for creating them, they really are intuitive and easy to pick up.

My only question regarding the above, and I think you might have already answered it with group_vfold_cv, is how do I get my updated dataset for each cross-validation split into the rsample object? More concretely, if I create my aggregates per group (in the case above, per school), how do I get them into both my train and my test samples? Below is updated code, which is a bit shorter because it uses purrr:


library(lavaan)
library(tidyverse)
library(rsample)

# Builds fake group scores per school
# So individual scores exist per individual and group scores for each school
build_grp_scores <- function(data_split) {
  
  fold <- analysis(data_split)
  
  # Create The group construct
  group_scores <- fold %>% 
    group_by(school) %>% 
    summarise(sch_avg_fct1 = mean(fct1, na.rm = TRUE),
              sch_avg_fct2 = mean(fct2, na.rm = TRUE),
              sch_avg_fct3 = mean(fct3, na.rm = TRUE),
              sch_avg_grade = mean(grade, na.rm = TRUE))
  
  # Combine our new grouped scores with the individual scores
  updated_fold <- fold %>% 
    inner_join(group_scores, by = "school")
  
  updated_fold
}

# Create dataset and attributes on individual level
mydf <- HolzingerSwineford1939 %>% 
  select(id, school, grade, x1:x9) %>% 
  mutate(fct1 = rowMeans(select(.,x1:x3)),
         fct2 = rowMeans(select(.,x4:x7)),
         fct3 = rowMeans(select(.,x1:x5))) %>% 
  dplyr::select(id, school, grade, starts_with('fct'))

# Create a fake target and cross validation stratified on that target
tgt <- as.factor(sample(c(1, 0), nrow(mydf), replace = TRUE))
mydf <- mydf %>% mutate(tgt = tgt)
cv_splits <- vfold_cv(mydf, v=10, strata = "tgt")

# Generate your new datasets per sample first with only the analysis data
map(cv_splits$splits, build_grp_scores)

# TODO
# Update the CV Splits with the new Analysis Data
# Run the same function for the assessment split
# Update the cv_splits$splits with the new assessment data
# Using recipes model it....

Again, thank you very much for your help
Have a nice weekend

Hi @Max

You can ignore this. I just realized that, with rsample, I can save the training and test splits as list columns in a data frame and then proceed as normal with any modelling by supplying cv_splits$train and cv_splits$test in place of analysis(cv_splits$splits) and assessment(cv_splits$splits).
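For anyone finding this later, here is a minimal sketch of what I mean, assuming the same data preparation as above. The helper name add_school_means, the seed, and the column names train and test are just choices for this example.

```r
library(lavaan)     # provides HolzingerSwineford1939
library(tidyverse)
library(rsample)

set.seed(123)  # assumption: fixed seed so the fake target is reproducible

# Individual-level factor scores plus a fake binary target
mydf <- HolzingerSwineford1939 %>%
  mutate(fct1 = rowMeans(dplyr::select(., x1:x3)),
         fct2 = rowMeans(dplyr::select(., x4:x7)),
         fct3 = rowMeans(dplyr::select(., x1:x5)),
         tgt  = factor(sample(c(1, 0), n(), replace = TRUE))) %>%
  dplyr::select(id, school, grade, starts_with("fct"), tgt)

# Hypothetical helper: school-level means computed only from the rows it is given
add_school_means <- function(df) {
  school_means <- df %>%
    group_by(school) %>%
    summarise(sch_avg_fct1  = mean(fct1, na.rm = TRUE),
              sch_avg_fct2  = mean(fct2, na.rm = TRUE),
              sch_avg_fct3  = mean(fct3, na.rm = TRUE),
              sch_avg_grade = mean(grade, na.rm = TRUE))
  inner_join(df, school_means, by = "school")
}

# Store the updated analysis and assessment data as list columns
cv_splits <- vfold_cv(mydf, v = 10, strata = "tgt") %>%
  mutate(train = map(splits, ~ add_school_means(analysis(.x))),
         test  = map(splits, ~ add_school_means(assessment(.x))))

# The data for the first fold, with school averages computed per segment
glimpse(cv_splits$train[[1]])
glimpse(cv_splits$test[[1]])
```

The original splits column is left untouched, so analysis() and assessment() still return the raw data if it is needed.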

Thanks
