Using mutate with recipes and rsample for group-level calculations




This is a rather long question that I was hoping someone could help me with.
I have a dataset, taken from the lavaan package, of measurements of students at a school.
It comes from a questionnaire in which specific questions are averaged to create factors.

For a completely fictitious example, let's say x1:x3 are three questions that are supposed to measure how extroverted someone is. To get the extroversion score, you take the mean of x1 to x3 for each student. I wish to calculate the same at school level: take the mean of x1:x3 per student and then take the mean of that number per school. Each student will then have a score at individual level, and each school will also have a score.

I would then like to predict a variable using the recipes and caret packages. To avoid leaking data and compromising my test scores, I would use ten-fold cross-validation. That gives ten analysis-assessment splits (with ten folds, roughly 90-10% each time).
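As a quick check of what ten-fold splitting produces, here is a minimal sketch that counts the rows in each portion of every fold. It assumes the rsample and lavaan packages are installed; `fold_sizes` is a name made up for this illustration.

```r
library(rsample)

# Assumption: the same lavaan data set as in the question (301 students)
mydf <- lavaan::HolzingerSwineford1939

set.seed(123)
cv_splits <- vfold_cv(mydf, v = 10)

# Number of rows in the analysis and assessment portion of every fold
fold_sizes <- data.frame(
  fold         = cv_splits$id,
  n_analysis   = vapply(cv_splits$splits, function(s) nrow(analysis(s)), integer(1)),
  n_assessment = vapply(cv_splits$splits, function(s) nrow(assessment(s)), integer(1))
)
fold_sizes
```

If repeated 70-30% splits are what is wanted, `mc_cv(mydf, prop = 0.7, times = 10)` from rsample may be closer to that design than `vfold_cv()`.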

In an ideal scenario I would make the split on the raw data and then, in each analysis-assessment split, calculate the group scores within each segment separately. More concretely, the analysis portion of each split would not contain all the students, so the school average would be slightly different in each fold for each portion of the data. Below is a fictitious example.

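# RStudio Community Question
library(tidyverse)  # dplyr verbs, %>% and glimpse()
library(rsample)    # vfold_cv(), analysis(), assessment()
library(lavaan)     # HolzingerSwineford1939 data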

# Builds fake group scores per school
# So individual scores exist per individual and group scores for each school
build_grp_scores <- function(mydf) {
  # Create fake constructs for the train and test set
  temp <- mydf %>% 
    select(id, school, grade, x1:x9) %>% 
    mutate(fct1 = rowMeans(select(.,x1:x3)),
           fct2 = rowMeans(select(.,x4:x7)),
           fct3 = rowMeans(select(.,x1:x5))) %>% 
    dplyr::select(id, school, grade, x1, x5, x7, starts_with('fct'))
  # Create The group construct
  group_scores <- temp %>% 
    group_by(school) %>% 
    summarise(sch_avg_fct1 = mean(fct1, na.rm = TRUE),
              sch_avg_fct2 = mean(fct2, na.rm = TRUE),
              sch_avg_fct3 = mean(fct3, na.rm = TRUE),
              sch_avg_grade = mean(grade, na.rm = TRUE))
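  # Assumed completion (the post is cut off at this point): join the school
  # averages back onto the student rows and return the result
  temp %>% 
    left_join(group_scores, by = "school")
}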

# Create dataset and a random dependent variable
set.seed(123)  # make the random target reproducible
mydf <- lavaan::HolzingerSwineford1939
tgt <- as.factor(sample(c(1, 0), nrow(mydf), replace = TRUE))
mydf <- mydf %>% mutate(tgt = tgt)

# Create a ten fold partition of the dataset
cv_splits <- vfold_cv(mydf, v=10, strata = "tgt")

# Sample sizes to check

# Build up the group scores for training and test
build_grp_scores(analysis(cv_splits$splits[[1]])) %>% glimpse()
build_grp_scores(assessment(cv_splits$splits[[1]])) %>% glimpse()

The final two lines of this code show the variation in the school averages, which is exactly what I'm looking for.
My questions are twofold:

How do I update my cv_splits object with the new version of the data derived from my function build_grp_scores?

Is it then possible to take my new resamples and feed them into a recipe or a caret model?

Thank you for your time. Sorry for the length.


Do you need to collapse the data within the resamples? That might be hard to do with recipes and caret.

What type(s) of models would you fit to the data?

Perhaps consider doing leave-group-out with student as the group:

> lv_student_out <- group_vfold_cv(mydf, "id")
> lv_student_out
# Group -fold cross-validation 
# A tibble: 301 x 2
   splits       id         
   <list>       <chr>      
 1 <S3: rsplit> Resample001
 2 <S3: rsplit> Resample002
 3 <S3: rsplit> Resample003
 4 <S3: rsplit> Resample004
 5 <S3: rsplit> Resample005
 6 <S3: rsplit> Resample006
 7 <S3: rsplit> Resample007
 8 <S3: rsplit> Resample008
 9 <S3: rsplit> Resample009
10 <S3: rsplit> Resample010
# ... with 291 more rows


Hi @Max

Thanks for coming back to me so quickly. The aggregate would be calculated per segment per split. I was actually thinking of this while reading the section on data leakage (3.4.6) in your book on feature engineering.

From a social science perspective, constructs/factors can be measured at group or individual level. While the individual level wouldn't be affected by data leakage, because it is calculated within a single record, the group scores would be, because different individuals are selected in each segment during resampling. I think this is quite interesting (I may be alone in this :slight_smile: ) because it's a sample statistic and so would be expected to vary across the sub-samples.
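To make that variation concrete, here is a rough sketch that computes one school-level average from the analysis portion of every fold and then summarises how much it moves. The helper `school_means()` and the single construct `fct1` (row mean of x1:x3) are assumptions made for illustration, not part of the original code.

```r
library(dplyr)
library(purrr)
library(rsample)

# Assumption: the same lavaan data set as in the question
mydf <- lavaan::HolzingerSwineford1939

set.seed(123)
cv_splits <- vfold_cv(mydf, v = 10)

# Hypothetical helper: school average of fct1, using only the
# students that fall in the analysis portion of one split
school_means <- function(split) {
  analysis(split) %>%
    mutate(fct1 = rowMeans(select(., x1:x3))) %>%
    group_by(school) %>%
    summarise(sch_avg_fct1 = mean(fct1, na.rm = TRUE))
}

# One row per fold and school
means_by_fold <- map2(
  cv_splits$splits, cv_splits$id,
  ~ mutate(school_means(.x), fold = .y)
) %>%
  bind_rows()

# Spread of the school average across the ten folds
means_by_fold %>%
  group_by(school) %>%
  summarise(lowest  = min(sch_avg_fct1),
            highest = max(sch_avg_fct1),
            spread  = sd(sch_avg_fct1))
```

Because each analysis set leaves out a different tenth of the students, the per-school averages should differ slightly from fold to fold, which is the sample-statistic behaviour described above.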

Of course, we could just take a single training and test split, calculate the group scores separately in each, and validate that way. But if you don't have many groups you would only get two scores per group, so it may as well be a factor. With many resamples it would also be a little easier to compare the results to published research.

To answer your question, I was going to try to work through an example of MARS using recipes and rsample.


I was thinking that, if you need to summarize within each resample, use recipes, rsample and other packages directly. caret wouldn't be able to help you there.
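A minimal sketch of what working with rsample directly could look like is below. It is only one possible approach: `add_school_scores()` and `fit_one()` are hypothetical helpers, only one construct is computed, and a plain logistic regression from base R stands in for MARS to keep the example free of extra packages.

```r
library(dplyr)
library(purrr)
library(rsample)

# Assumption: same data and random target as in the question
set.seed(123)
mydf <- lavaan::HolzingerSwineford1939 %>%
  mutate(tgt = factor(sample(c(1, 0), n(), replace = TRUE)))

cv_splits <- vfold_cv(mydf, v = 10, strata = "tgt")

# Hypothetical helper: individual score plus the school average of that
# score, computed only from the rows it is given
add_school_scores <- function(dat) {
  dat %>%
    mutate(fct1 = rowMeans(select(., x1:x3))) %>%
    group_by(school) %>%
    mutate(sch_avg_fct1 = mean(fct1, na.rm = TRUE)) %>%
    ungroup()
}

# Store the processed data next to each split as list columns
cv_splits$analysis_set   <- map(cv_splits$splits, ~ add_school_scores(analysis(.x)))
cv_splits$assessment_set <- map(cv_splits$splits, ~ add_school_scores(assessment(.x)))

# Fit on the analysis set, score on the assessment set
# (logistic regression used here as a stand-in for MARS)
fit_one <- function(train, test) {
  fit  <- glm(tgt ~ fct1 + sch_avg_fct1, data = train, family = binomial())
  prob <- predict(fit, newdata = test, type = "response")
  mean((prob > 0.5) == (test$tgt == "1"))
}

cv_splits$accuracy <- map2_dbl(cv_splits$analysis_set,
                               cv_splits$assessment_set,
                               fit_one)
mean(cv_splits$accuracy)
```

Here the school averages are computed separately inside each segment, as described in the question. An alternative worth considering is to estimate the averages on the analysis set only and join them onto the assessment set, which is closer to how a recipe estimates its steps on the training data.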

If these are unfamiliar, these class notes can help you learn them.