Hi,
I think i have found the answer to my problem...assuming my assumptions above are valid.
Using the rsample
package i was able to split the data-set into a stratified ten fold cross validation.
I updated the training and test set of each portion to calculate the average per school and then was able to model my new cross validated data-set using the notes by Dr Max Kuhn here from slide 30 onwards. i found this very difficult (there is a distinct possibility i missed something) but working through it was very rewarding
I hope someone finds it useful
# Assume the data is the same for the previous post
library(lavaan)
library(tidyverse)
set.seed(300)
# Create a training and test set and set it to 80% for the modelling aspect
# https://github.com/topepo/rstudio-conf-2018/blob/master/Materials/Part_2_Basic_Principals.pdf
library(rsample)
data_split <- initial_split(mydf, prop = 0.8, strata = "dep")
train <- training(data_split)
test <- testing(data_split)
nrow(train)/nrow(mydf)
# Create Cross Validation Plan
# Split it into 10 resamples basically train and test pairs
# The train segment is called analysis and the test is called assesment
# We will use 10 pairs
cv_splits <- vfold_cv(train, v=10, strata = "dep")
# Sample sizes
cv_splits$splits[[1]]
# Analysis gets the training split for the first fold
analysis(cv_splits$splits[[1]])
# assessment gets the test split for the first fold
assessment(cv_splits$splits[[1]])
create_fold_aggregates <- function(fold){
# We create a test and a training set per fold
train_fold <- analysis(fold) %>% rownames_to_column() %>% mutate(data_type = 'train')
test_fold <- assessment(fold) %>% rownames_to_column() %>% mutate(data_type = 'test')
# Combine the train fold and the test fold together
est_data <- bind_rows(train_fold, test_fold)
# Generate mean aggregate per school per dataset type
# So the mean per school will differ in the training and the test
# It will also differ across folds
est_data_agg <- est_data %>%
select(data_type, school, x7, x8) %>%
mutate(ind_mean = rowMeans(select(., x7, x8))) %>%
group_by(data_type, school) %>%
summarise(school_mean = mean(ind_mean)) %>%
ungroup()
# Bring individual scores and group scores together
df_calc <- est_data %>%
left_join(est_data_agg) %>%
select(-x7,-x8) %>%
arrange(as.numeric(rowname)) %>%
column_to_rownames("rowname") %>%
select(-data_type)
data.frame()
# Update Fold
fold$data <- df_calc
return(fold)
}
# Create a new list which consists of the updated school aggregates per fold for train and test
new_splits <- map(.x = cv_splits$splits, .f = create_fold_aggregates)
# Bring it back into the fold
cv_splits$splits <- new_splits
Thanks