Dear Max,
Thanks a lot for your thorough answer and the great advice regarding the consequences of using SMOTE in this context. Looking forward to the added functionality with the differential case weights!
Until then I will try to stick with this approach and, failing that, fall back to simpler methods. Indeed, Patient = Runner in this example. I created the following reprex:
# Load the packages first so that %>% and mutate() are available below
library(tidyverse)
library(tidymodels)
library(multilevelmod)
library(themis)

# One row per runner and week; NewRRI is the outcome
Df <- tibble::tribble(
~year_week,~Runner,~NewRRI,~Distance, ~HR, ~Gender,~Age,~BMI,~PreviousRRI,
"2019-41", "M01" , 0, 5000, 120, "Male", 23, 18, 1,
"2019-41", "M02" , 0, 6000, 125,"Female", 36, 20, 0,
"2019-41", "M03" , 0, 8000, 130, "Male", 56, 21, 0,
"2019-42", "M01" , 0, 5500, 122, "Male", 23, 18, 1,
"2019-42", "M02" , 0, 7000, 128,"Female", 36, 20, 0,
"2019-42", "M03" , 0, 15000, 132, "Male", 56, 21, 0,
"2019-43", "M01" , 1, 3000, 120, "Male", 23, 18, 1,
"2019-43", "M02" , 0, 9000, 127,"Female", 36, 20, 0,
"2019-43", "M03" , 0, 9500, 131, "Male", 56, 21, 0,
"2019-44", "M01" , 0, 15000, 125, "Male", 23, 18, 1,
"2019-44", "M02" , 0, 9000, 127,"Female", 36, 20, 0,
"2019-44", "M03" , 0, 9500, 131, "Male", 56, 21, 0
) %>%
  # Convert the categorical columns and the outcome to factors
  mutate(Gender = as.factor(Gender),
         PreviousRRI = as.factor(PreviousRRI),
         NewRRI = as.factor(NewRRI),
         Runner = as.factor(Runner))

# Order by week so that the time-based split holds out the latest weeks
Df <- Df %>% arrange(year_week)
Df_splits <- initial_time_split(Df, prop = 0.8)
RunningData_train <- training(Df_splits)
RunningData_test <- testing(Df_splits)
# Now apply original code
Then I get the following error message:
Error: All columns selected for the step should be numeric
What should I change in the code to avoid this error message?
In case this does not work, the alternative approach with the many-models structure most likely also has its limits: the runners are nested with "nest()" and a separate model is mapped to each nested runner, so each model only sees the few observations of that one runner. I guess this means it is also not a viable strategy to imitate the multilevel structure?
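To make the concern about the many-models structure concrete, here is a minimal sketch of what I have in mind. It uses a cut-down copy of the reprex data (only runners M01 and M02, only three columns), and the names Df_small, Df_nested, n_weeks and n_injuries are just illustrative. It only nests and counts; it does not fit any model:

```r
library(tidyverse)

# Cut-down copy of the reprex data: two runners, four weeks each
Df_small <- tibble::tribble(
  ~year_week, ~Runner, ~NewRRI, ~Distance,
  "2019-41",  "M01",   0,       5000,
  "2019-41",  "M02",   0,       6000,
  "2019-42",  "M01",   0,       5500,
  "2019-42",  "M02",   0,       7000,
  "2019-43",  "M01",   1,       3000,
  "2019-43",  "M02",   0,       9000,
  "2019-44",  "M01",   0,       15000,
  "2019-44",  "M02",   0,       9000
) %>%
  mutate(NewRRI = as.factor(NewRRI))

# Many-models structure: one row per runner, with that runner's
# observations stored in the list-column "data"
Df_nested <- Df_small %>%
  nest(data = -Runner) %>%
  mutate(
    # Number of weekly observations available to a per-runner model
    n_weeks = map_int(data, nrow),
    # Number of weeks with a new injury (the minority class)
    n_injuries = map_int(data, ~ sum(.x$NewRRI == "1"))
  )

Df_nested
```

A model mapped over the "data" column would only ever see a single runner's rows, and a runner without any injury week has only one outcome class to learn from, which is why I doubt this structure can stand in for the multilevel model.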
Lastly, I found this article on "SMOTE-NC / ENC", which makes it possible to apply the SMOTE algorithm to data with nominal as well as continuous features, yet most likely with the drawbacks you mentioned, as new data is added to existing Runner/Patient IDs:
https://arxiv.org/abs/2103.07612
Thank you again for your help and consideration, it is highly appreciated.
Kind regards!