Hi,
This is a rather long question that I was hoping someone could help me with.
I have a dataset, taken from the lavaan package, of measurements of school students.
It comes from a questionnaire in which specific questions are averaged to create factors.
For a completely fictitious example, let's say x1:x3 are three questions that are supposed to measure how extroverted someone is. To get the extroversion score you would take the mean, so mean(x1:x3) would produce this score. I wish to calculate the same at school level: for example, take the mean of x1:x3 for each student and then take the mean of that number per school. Each student will then have a score at the individual level, and each school will also have a score.
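To make the two levels concrete, here is a minimal sketch on a hypothetical toy data frame (the `toy` data and the column names `extro` and `sch_avg_extro` are made up for illustration and are not part of the lavaan dataset):

```r
library(dplyr)

# Hypothetical toy data: six students in two schools, three questionnaire items
toy <- data.frame(
  id     = 1:6,
  school = c("A", "A", "A", "B", "B", "B"),
  x1     = c(1, 2, 3, 4, 5, 6),
  x2     = c(2, 3, 4, 5, 6, 7),
  x3     = c(3, 4, 5, 6, 7, 8)
)

scored <- toy %>%
  # Individual-level score: mean of the three items for each student
  mutate(extro = rowMeans(select(., x1:x3))) %>%
  # School-level score: mean of the individual scores within each school
  group_by(school) %>%
  mutate(sch_avg_extro = mean(extro, na.rm = TRUE)) %>%
  ungroup()

scored
```

Every student keeps their own `extro` value, while `sch_avg_extro` is repeated for all students in the same school.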
I would then like to predict a variable using the recipes and caret packages. To avoid leaking data and compromising my test scores, I would use ten-fold cross-validation. So I would have ten analysis-assessment splits, roughly 90%-10% each time.
In an ideal scenario I would make the split on the raw data and then, within each analysis-assessment split, calculate the group scores for each portion separately. More concretely, the analysis portion of each split does not contain all the students, so the school average would be slightly different in each fold for each portion of the data. Below is a fictitious example:
# RStudio Community Question
library(lavaan)
library(tidyverse)
library(rsample)
# Builds fake group scores per school,
# so individual scores exist per student and group scores per school
build_grp_scores <- function(mydf) {
  # Create fake constructs (individual-level factor scores)
  temp <- mydf %>%
    select(id, school, grade, x1:x9) %>%
    mutate(fct1 = rowMeans(select(., x1:x3)),
           fct2 = rowMeans(select(., x4:x7)),
           fct3 = rowMeans(select(., x1:x5))) %>%
    dplyr::select(id, school, grade, x1, x5, x7, starts_with('fct'))
  # Create the group constructs (school-level averages)
  group_scores <- temp %>%
    group_by(school) %>%
    summarise(sch_avg_fct1 = mean(fct1, na.rm = TRUE),
              sch_avg_fct2 = mean(fct2, na.rm = TRUE),
              sch_avg_fct3 = mean(fct3, na.rm = TRUE),
              sch_avg_grade = mean(grade, na.rm = TRUE))
  # Attach the school averages to every student; join key made explicit
  temp <- temp %>%
    inner_join(group_scores, by = "school")
  temp
}
# Create dataset and a random dependent variable
set.seed(123)  # arbitrary seed so the random target and folds are reproducible
mydf <- HolzingerSwineford1939
tgt <- as.factor(sample(c(1, 0), nrow(mydf), replace = TRUE))
mydf <- mydf %>% mutate(tgt = tgt)
# Create a ten-fold partition of the dataset
cv_splits <- vfold_cv(mydf, v = 10, strata = "tgt")
# Sample sizes to check
cv_splits$splits[[1]]
# Build up the group scores for the analysis and assessment portions
build_grp_scores(analysis(cv_splits$splits[[1]])) %>% glimpse()
build_grp_scores(assessment(cv_splits$splits[[1]])) %>% glimpse()
The final two lines of this code show the variation in the school averages, which is exactly what I'm looking for.
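To show the same variation across every fold rather than only the first one, here is a minimal sketch of what I have in mind. It assumes a hypothetical toy dataset and a simplified helper, `add_school_means()`, standing in for my real function (neither is part of any package), and it simply stores the scored data frames in list columns next to the splits rather than changing the split objects themselves:

```r
library(dplyr)
library(purrr)
library(rsample)

set.seed(123)

# Hypothetical toy data: 60 students in 3 schools with one individual score
toy <- data.frame(
  id     = 1:60,
  school = rep(c("A", "B", "C"), each = 20),
  score  = rnorm(60)
)

# Hypothetical helper: attach school means computed only from the rows supplied
add_school_means <- function(df) {
  df %>%
    group_by(school) %>%
    mutate(sch_avg_score = mean(score, na.rm = TRUE)) %>%
    ungroup()
}

folds <- vfold_cv(toy, v = 10)

# Score each portion separately so assessment rows never inform analysis means
folds$analysis_scored   <- map(folds$splits, ~ add_school_means(analysis(.x)))
folds$assessment_scored <- map(folds$splits, ~ add_school_means(assessment(.x)))

# School A's mean as seen in the analysis portion of each fold
sch_a_means <- map_dbl(folds$analysis_scored,
                       ~ .x$sch_avg_score[.x$school == "A"][1])
sch_a_means
```

The school means differ slightly from fold to fold, which is the behaviour I want, but the scored data live beside the original splits instead of inside them.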
My questions are now twofold:
1. How do I update my cv_splits object with the new version of the data derived from my function build_grp_scores?
2. Is it then possible to take my new rsample splits and feed them into a recipe or caret model?
Thank you for your time. Sorry for the length.