Multilevel Prediction in Tidymodels with Imbalanced Nested Data

Dear R Studio Community,

I hope I can draw on your expertise regarding a prediction task in R/Tidymodels. I intend to predict injuries in runners. The daily/weekly training data on which the predictions are based is nested within the individual runners over a timeframe of a few months. This made me consider multilevel models, specifically multilevel binary logistic regression (MLBLR).

As the data is also very imbalanced, I tried resampling via SMOTE. Because half the runners did not incur any injury and the other half mostly only one, I am uncertain whether this can succeed: there will be no or only one injury instance per runner within the training set to base the resampling on, and consequently no injury instance in the test set for runners whose injury falls within the training set. This most likely makes SMOTE resampling impossible.

So far I have tried to predict injuries manually via an MLBLR without resampling, adapting only the prediction probability threshold; because of the imbalance, the outcome was only negative predictions. Understandably, I did not manage to resample via SMOTE in this scenario. Should I rather look at other methods such as undersampling the non-injury instances, or are there specific resampling procedures (preferably synthetic data creation) for multilevel data that take the nested structure into account?

I also tried to implement multilevel modelling in the Tidymodels workflow, which I prefer because resampling is made easy there. First, I looked at the "multilevelmod" package, which adds multilevel engines (lme4) to the workflow. Second, I tried the many-models structure, nesting by runner and then fitting a model to each runner. Unfortunately, I only got the latter method working. For the former, I most likely used "stan-glmer" incorrectly as an engine (Code 1); the latter I got working, with mixed results, via simple oversampling (Screenshot 2). Third, I am not sure whether to additionally look at fitting generalized linear models using mixed models via the embed package in Tidymodels.

I would be very grateful to hear your take on this, specifically how to approach this issue of implementing a multilevel model + resampling in the Tidymodels workflow. Thank you very much in advance.

Kind regards!

Multilevelmod: https://github.com/tidymodels/multilevelmod
Many Models: https://r4ds.had.co.nz/many-models.html
Embed: https://embed.tidymodels.org/articles/Applications/GLM.html

Code 1:


    mlbr_mod <- linear_reg() %>% set_engine("stan-glmer")

    # Recipe:
    mlbr_mod_recipe <- recipe(NewRRI ~ Distance + HR + Gender + Age + BMI +
                                PreviousRRI + Runner, data = RunningData_train) %>%
      step_dummy(all_nominal_predictors()) %>%
      step_string2factor(Runner) %>%
      step_smote(NewRRI, over_ratio = 0.5)

    mlbr_mod_workflow <- workflow() %>%
      add_recipe(mlbr_mod_recipe) %>%
      add_model(mlbr_mod, formula = NewRRI ~ . - Runner + (1 | Runner))

    # Fit the model:
    mlbr_mod_workflow %>% fit(data = RunningData_train)

    # Train on original set and test on test set using last_fit()
    mlbr_last_fit <- mlbr_mod_workflow %>%
      last_fit(RunningData_splits,
               metrics = metric_set(bal_accuracy, accuracy, f_meas, precision,
                                    roc_auc, sensitivity, recall, kap))

    # Performance on test set:
    mlbr_metrics <- mlbr_last_fit %>% collect_metrics()
    mlbr_metrics

The code fails at the step where I try to fit the model. The error message says that it can't subset columns that don't exist: "X Column 'Patient' doesn't exist". The input data is structured like this:



    Runner - factor:   A1, A1, A1, B1, B1, B1, C1, C1, C1 ... (=IDs)
    NewRRI - factor:   0,  0,  1,  0,  0,  0,   0, 0,  0 ...
    Distance - numeric:340,500,734,110,389,766,833,420,1100 ...
    HR - numeric:      120,110,130,142,98, 112,104,117,130 ...
    Gender - factor:   Male,Female,Male,Male,Male,Female,Male,Female,Female, ...
    Age - numeric:     23, 36, 56, 35, 67, 24, 52, 39, 29, ...
    BMI - numeric:     18, 20, 21, 25, 23, 24, 21, 22, 20, ...
    PreviousRRI -factor:0, 0,  1,  0,  0,  1,  1,  0,  0, ...

First, for your error, I suspect that the recipe's use of

    step_dummy(all_nominal_predictors())

converts a column to dummy variables that should not have been converted. It is unclear, since the error says Patient but that isn't a column in your data (that's why we usually ask for a reprex). If this is the case, you can just name the specific variables that should be converted to indicators.
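
A minimal sketch of that suggestion, assuming a toy data frame with a few of the columns described in the question (the values and the name `toy` are made up for illustration). Naming the columns in `step_dummy()` leaves the grouping factor `Runner` untouched, so it is still available to later steps and to the model formula:

```r
library(tidymodels)

# Toy data shaped like the question's columns; values are invented
toy <- data.frame(
  Runner      = factor(c("A1", "A1", "B1", "B1")),
  NewRRI      = factor(c(0, 1, 0, 0)),
  Distance    = c(340, 500, 110, 389),
  Gender      = factor(c("Male", "Male", "Female", "Female")),
  PreviousRRI = factor(c(0, 0, 1, 1))
)

# Convert only the named predictors to indicator columns
rec <- recipe(NewRRI ~ ., data = toy) %>%
  step_dummy(Gender, PreviousRRI)

# prep() estimates the step, bake(new_data = NULL) returns the processed training data
baked <- rec %>% prep() %>% bake(new_data = NULL)
names(baked)
```

In the baked data `Runner` should still be a factor, while `Gender` and `PreviousRRI` are replaced by numeric indicator columns.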

For your main question, I don't know that there are any tools (anywhere) for dealing with imbalanced multilevel data. Synthesis isn't a great idea because you would be making up new data that may not correspond to a real patient. If it did, we'd be making a new patient, and I doubt that the correlation structure would be consistent with the rest of the data.

I think that the best approach, which is not implemented yet, is to use differential case weights based on the outcome level. We are working on that right now across all of our packages.
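
To illustrate the idea only, here is a sketch of outcome-based weights computed by hand. It assumes inverse class-frequency weighting, which is one common choice and not necessarily what the packages will implement; the column name `case_wt` and the toy data are hypothetical:

```r
library(dplyr)

# Toy outcome with 8 non-injury and 2 injury rows (invented values)
toy <- tibble(NewRRI = factor(c(0, 0, 0, 0, 0, 0, 0, 0, 1, 1)))

weighted <- toy %>%
  add_count(NewRRI, name = "n_class") %>%   # rows per outcome level
  mutate(
    # Inverse class frequency: each outcome level gets the same total weight
    case_wt = n() / (n_distinct(NewRRI) * n_class)
  ) %>%
  select(-n_class)

# One weight per outcome level
distinct(weighted)
```

Rows from the rare class receive a larger weight than rows from the common class, and no synthetic rows are created, so the nesting of observations within runners is left as it is.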

Dear Max,

Thanks a lot for your thorough answer and the great advice regarding the consequence of using SMOTE in this context. Looking forward to the added functionalities with the differential case weights!

Until then I will try to stick with this approach and fall back on simpler methods. Indeed, Patient = Runner in this example. I created the following reprex:

    # Load packages first so that %>% and mutate() are available
    library(tidyverse)
    library(tidymodels)
    library(multilevelmod)
    library(themis)

    Df <- tibble::tribble(
      ~year_week, ~Runner, ~NewRRI, ~Distance, ~HR,  ~Gender, ~Age, ~BMI, ~PreviousRRI,
      "2019-41",   "M01",       0,      5000, 120,   "Male",   23,   18,            1,
      "2019-41",   "M02",       0,      6000, 125, "Female",   36,   20,            0,
      "2019-41",   "M03",       0,      8000, 130,   "Male",   56,   21,            0,
      "2019-42",   "M01",       0,      5500, 122,   "Male",   23,   18,            1,
      "2019-42",   "M02",       0,      7000, 128, "Female",   36,   20,            0,
      "2019-42",   "M03",       0,     15000, 132,   "Male",   56,   21,            0,
      "2019-43",   "M01",       1,      3000, 120,   "Male",   23,   18,            1,
      "2019-43",   "M02",       0,      9000, 127, "Female",   36,   20,            0,
      "2019-43",   "M03",       0,      9500, 131,   "Male",   56,   21,            0,
      "2019-44",   "M01",       0,     15000, 125,   "Male",   23,   18,            1,
      "2019-44",   "M02",       0,      9000, 127, "Female",   36,   20,            0,
      "2019-44",   "M03",       0,      9500, 131,   "Male",   56,   21,            0
    ) %>%
      mutate(Gender = as.factor(Gender),
             PreviousRRI = as.factor(PreviousRRI),
             NewRRI = as.factor(NewRRI),
             Runner = as.factor(Runner))

    # Time-ordered split: the first 80% of rows are used for training
    Df <- Df %>% arrange(year_week)
    Df_splits <- initial_time_split(Df, prop = 0.8)
    RunningData_train <- training(Df_splits)
    RunningData_test <- testing(Df_splits)

    # Now apply original code

Then I get the following error message:

    Error: All columns selected for the step should be numeric

What should I change in the code to avoid this error?

If this does not work, the alternative many-models approach most likely has its limits too: because the data is nested per runner with "nest()" and a separate model is mapped to each runner, every model sees only a few observations. So I guess this is also not a viable strategy for imitating the multilevel structure?

Lastly, I found this article on "SMOTE-NC / ENC", which makes applying the SMOTE algorithm possible, yet most likely with the drawbacks you mentioned, as new data is added to existing Runner/Patient IDs:

https://arxiv.org/abs/2103.07612

Thank you again for your help and consideration, it is highly appreciated.
Kind regards!
