Cross-validated AUC with Logistic Regression

Dear all,

I would like to ask how I can compute a cross-validated AUC. I saw the following R code online and am not sure if this is the correct way to do it. If not, can someone please share the correct R code for a 10-fold cvAUC?

Thank you very much! :)

(cv_logit <- train(
  life_exp_cat ~ total_expenditure + schooling + adult_mortality +
    infant_deaths + thinness_1_19_years + alcohol + status,
  data = life_exp_new,
  method = "glm",
  trControl = trainControl(method = "cv", number = 10)
))

Hi @Rlearner123,

I believe the above code is based on the {caret} package. The tidymodels family of packages largely supersedes {caret}, so after a quick note on the caret code itself (just below), I'll show how you could do this with the modern and more modular API.
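To answer the question as asked: your code does run 10-fold cross-validation, but with the default trainControl() it will report Accuracy and Kappa rather than AUC. A sketch of the caret way to get a cross-validated ROC AUC (assuming life_exp_cat is a two-level factor whose levels are valid R names, which twoClassSummary requires):

library(caret)

(cv_logit <- train(
  life_exp_cat ~ total_expenditure + schooling + adult_mortality +
    infant_deaths + thinness_1_19_years + alcohol + status,
  data = life_exp_new,
  method = "glm",
  metric = "ROC",                      # select on AUC instead of Accuracy
  trControl = trainControl(
    method = "cv", number = 10,
    classProbs = TRUE,                 # needed to compute ROC curves
    summaryFunction = twoClassSummary  # reports ROC, Sens, Spec
  )
))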

We will use the Ames Housing data, since I don't have access to the data in the original problem. Let's see if we can predict whether a house was built before or after 1960 from just a couple of predictors, and measure model performance with the cross-validated AUC.

library(tidymodels)
library(AmesHousing)

# Prepare the data
data <- 
  make_ames() %>% 
  mutate(
    before_1960 = factor(if_else(Year_Built < 1960, 1L, 0L))
  ) %>% 
  select(-Year_Built)

# Get the resampling folds
folds <- vfold_cv(data, v = 10, strata = 'before_1960')

# Define a model
model <- logistic_reg(engine = 'glm')

# Define a workflow
wflow <- 
  workflow() %>% 
  add_formula(before_1960 ~ Sale_Price + Lot_Area + Roof_Style) %>% 
  add_model(model)

# Fit the model to the resamples, compute metrics
results <-
  fit_resamples(
    wflow,
    resamples = folds,
    metrics = metric_set(roc_auc),
    control = control_resamples(event_level = 'second')
  )

# Compute CV-AUC
collect_metrics(results)
#> # A tibble: 1 × 6
#>   .metric .estimator  mean     n std_err .config             
#>   <chr>   <chr>      <dbl> <int>   <dbl> <chr>               
#> 1 roc_auc binary     0.830    10 0.00751 Preprocessor1_Model1

Hi Matt,
Thank you very much for the help. Can you please explain the output from the last piece of code? Is the mean of 0.83 the cvAUC? Also, can you please teach me how to get the 95% CI and p-value for this cvAUC model?

#> # A tibble: 1 × 6
#>   .metric .estimator  mean     n std_err .config             
#>   <chr>   <chr>      <dbl> <int>   <dbl> <chr>               
#> 1 roc_auc binary     0.830    10 0.00751 Preprocessor1_Model1

Since we fit a model to each of the 10 training sets and then measure the AUC on each of the 10 testing sets, we have 10 test AUCs in total. We then average those 10 AUCs to get the overall cross-validated AUC, which is shown in the mean column. The n column shows how many values went into the average; this number will change if you use more or fewer resamples, such as with bootstrapping, LOO-CV, or a different number of folds in vfold_cv(). Hope that helps.
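If you want to see each fold's AUC rather than just their average, collect_metrics() can return the unsummarised values:

# One row per fold instead of the averaged estimate
collect_metrics(results, summarize = FALSE)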

As for the 95% CI, this is not straightforward to do within the tidymodels framework. You can certainly do it yourself by using control_resamples(save_pred = TRUE) and then computing the AUCs with the pROC package, or whichever package you prefer. It is a bit long-winded, but something like this should work:

# Fit the model to the resamples, save predictions this time
results <-
  fit_resamples(
    wflow,
    resamples = folds,
    metrics = metric_set(roc_auc),
    control = control_resamples(
      event_level = 'second',
      save_pred = TRUE
    )
  )

library(pROC)

collect_predictions(results) %>% 
  group_nest(id) %>% 
  mutate(
    aucs = map(data, function(x) {
      # per-fold ROC with a 95% CI; as.numeric() returns c(lower, auc, upper)
      r <- roc(before_1960 ~ .pred_1, data = x, 
               direction = '<', ci = TRUE)
      as.numeric(r$ci)
    })
  ) %>% 
  unnest_wider(aucs) %>% 
  rename(auc = ...2, conf.low = ...1, conf.high = ...3) %>% 
  summarise(
    n = n(),
    across(c(auc, conf.low, conf.high), mean)
  )

#> # A tibble: 1 × 4
#>       n   auc conf.low conf.high
#>   <int> <dbl>    <dbl>     <dbl>
#> 1    10 0.829    0.780     0.878

Hi Matt,
Thank you very much for the detailed response. I understand it better now!
Would you mind sharing how I can obtain the p-value, sensitivity, and specificity for this cvAUC model?

Sorry for the multiple responses, Matt! I just tried the second piece of code you shared and it is not working for my dataset. Can you please explain where "x" and ".pred_1" come from?

Would you mind sharing how I can obtain the p-value, sensitivity, and specificity for this cvAUC model?

I think this question needs to be refined a bit. The cross-validated AUC is one metric we use to evaluate model performance; sensitivity and specificity are other such metrics. However, AUC is computed from the sensitivity and specificity at every probability threshold across the range of model-predicted probabilities, whereas for sens/spec we need to assign class labels at a single threshold, not probabilities. Second, what p-value are you looking for? For what test or comparison? This needs to be well defined.
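To make the threshold point concrete, here is a minimal sketch (assuming the results object from above, fit with save_pred = TRUE): we have to pick a cutoff, 0.5 here, to turn probabilities into class labels before sensitivity can be computed.

collect_predictions(results) %>% 
  mutate(
    # assign class labels at an (arbitrary) 0.5 probability cutoff
    .pred_class = factor(if_else(.pred_1 > 0.5, 1L, 0L), levels = c(0L, 1L))
  ) %>% 
  sens(truth = before_1960, estimate = .pred_class, event_level = 'second')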

Can you please explain where "x" and ".pred_1" come from?

.pred_1 is the name of the column that contains the predicted probabilities of belonging to class 1; this is a naming convention used by the tidymodels functions. x is just the name of the argument passed to the anonymous function I used in the map() call.
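If map() is unfamiliar, here is a toy illustration (with made-up values): it calls the function once per element of the list, binding each element to the argument x.

library(purrr)

map(list(1:3, 4:6), function(x) mean(x))
#> [[1]]
#> [1] 2
#> 
#> [[2]]
#> [1] 5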


Hi Matt,

I am under the impression that every AUC is associated with a sensitivity and a specificity. Does this apply to the cvAUC as well (maybe there are 10 sens and 10 specs and we have to average them, just like the AUC)? The same goes for the p-value: for an unadjusted AUC, I know the p-value tests against AUC = 0.5. Can we do that for the cvAUC as well?

You can obtain the sensitivity and specificity alongside the AUC via:

results <-
  fit_resamples(
    wflow,
    resamples = folds,
    metrics = metric_set(roc_auc, sensitivity, specificity),
    control = control_resamples(
      event_level = 'second',
      save_pred = TRUE
    )
  )

# Compute CV-AUC, CV-Sens, CV-Spec
collect_metrics(results)
#> # A tibble: 3 × 6
#>   .metric .estimator  mean     n std_err .config             
#>   <chr>   <chr>      <dbl> <int>   <dbl> <chr>               
#> 1 roc_auc binary     0.831    10 0.00713 Preprocessor1_Model1
#> 2 sens    binary     0.550    10 0.0175  Preprocessor1_Model1
#> 3 spec    binary     0.870    10 0.0111  Preprocessor1_Model1

If you want to test the AUC against an AUC of 0.5, this should work:

avg_preds <- 
  collect_predictions(results) %>% 
  group_by(.row) %>% 
  summarise(
    y = max(before_1960 == 1),  # 0/1 outcome for each observation
    p = mean(.pred_1)           # out-of-fold predicted probability
  )

# roc.area tests the null hypothesis that the true AUC is 0.5
verification::roc.area(avg_preds$y, avg_preds$p)$p.value
#> Registered S3 method overwritten by 'verification':
#>   method    from
#>   lines.roc pROC
#> [1] 5.521164e-186

Thank you Matt

Re: the 95% CI, I somehow got the error message below. Do you know what it means?

Error: Problem with mutate() column aucs.
i aucs = map(...).
x object 'disease' not found

(disease is my outcome; I simply replaced before_1960 with it, but it won't work, sadly!)

Hi Matt - I am following up on the 95% CI question from my previous post. Do you know how I can fix this issue? Thank you!

If you are not able to share your data, it would be best if you could share the structure of your data, or some synthetic data with that structure. It is very difficult to write code that works across any and all data sets.
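For example, str() prints a compact summary of each column and its type, which is often enough to go on:

str(data)  # where data is the name of your data set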


Hi Matt - thank you for the suggestion. Please see my data attached (somehow I can only upload a PDF). R sample - cvAUC.pdf (77.4 KB)

Please run the following code first to get the disease vs non-disease status.

data1 <- R.sample...cvAUC
data <- data1[-48:-68, ]  ## remove empty rows at the end
data$disease <- cut(data$Post.score, br = c(0, 5, 80), labels = c("none", "disease"))

Here is the code I tried in order to get the cvAUC's 95% CI. Can you please spot why it is not working?

folds <- vfold_cv(data, v = 10, strata = 'disease')
model <- logistic_reg(engine = 'glm')
wflow <- 
  workflow() %>% 
  add_formula(disease ~ Pre.score + BMI) %>% 
  add_model(model)

results <-
  fit_resamples(
    wflow,
    resamples = folds,
    metrics = metric_set(roc_auc),
    control = control_resamples(
      event_level = 'second',
      save_pred = TRUE
    )
  )

collect_predictions(results) %>% 
  group_nest(id) %>% 
  mutate(
    aucs = map(data, function(x) {
      roc = roc(data$disease ~ .pred_disease, data = x,  
          direction = '<', ci = TRUE)
      as.numeric(roc$ci)
    })
  ) %>% 
  unnest_wider(aucs) %>% 
  rename(auc = ...2, conf.low = ...1, conf.high = ...3) %>% 
  summarise(
    n = n(),
    across(c(auc, conf.low, conf.high), mean)
  )

If you have your data loaded into an R session, please run dput(data), where data is the name of your data set, and then copy and paste the output into code fences (i.e. ``` ) here on the forum.
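For example, for a built-in data set this prints a text representation that anyone can paste straight into R:

dput(head(mtcars, 3))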


Hi Matt, I am not sure if I did this correctly, but is this what you meant?

structure(list(BMI = c(28.91, 33.9, 32.25, 40.08, 30.71, 33.67, 
53.51, 56.36, 38.5, 30, 36.05, 32.03, 34.11, 32.95, 36.81, 27.32, 
27.27, 24.3, 22.76, 22.47, 29.3, 21.75, 26.7, 28.2, 29.82, 26.6, 
35.35, 37.2, 29.4, 20.9, 28.7, 38.35, 25.8, 27.3, 24.3, 23.8, 
35.5, 24.2, 27.91, 32.1, 43.2, 37.76, 26.79, 35.81, 36.35, 35.55, 
25.76), Pre.score = c(15.87301587, 5.310939628, 5.707491082, 
3.089700997, 41.27569847, 4.200567644, 14.30503889, 6.699928724, 
4.148471616, 3.70212766, 19.41605839, 11.99368753, 5.991232343, 
4.42804428, 20.77562327, 3.432367595, 1.1574886, 4.186655037, 
29.27974948, 0.874453467, 13.66459627, 0.564652739, 13.06160259, 
19.66829268, 0, 8.623969562, 13.18289786, 3.505734689, 4.769475358, 
0.208768267, 1.23683005, 11.11934766, 0.567000567, 4.077429984, 
1.179835538, 2.582781457, 18.62888102, 0.540151242, 9.014084507, 
2.714285714, 15.17327505, 17.41935484, 11.01306036, 13.53987378, 
7.38645816, 18.69688385, 14.6287403), Post.score = c(1.309408341, 
7.213930348, 25.26690391, 12.92719168, 8.702064897, 5.556698909, 
16.09399246, 8.097784568, 4.505119454, 1.120709783, 1.708011387, 
5.040871935, 0.937744204, 6.898584906, 16.31768953, 5.823792932, 
3.003754693, 1.416005149, 44.515357, 4.358683314, 5.233572398, 
0.376175549, 38.43137255, 22.97383535, 1.367088608, 7.234251969, 
8.444902163, 5.696202532, 6.324262169, 3.12922542, 8.610271903, 
53.125, 4.962950198, 7.529843893, 2.871287129, 3.155728333, 15.67839196, 
3.181336161, 3.718393654, 3.9408867, 29.10839161, 21.28337983, 
7.73073889, 12.6340882, 18.53658537, 17.49837978, 15.8557047), 
    disease = structure(c(1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 1L, 
    1L, 1L, 2L, 1L, 2L, 2L, 2L, 1L, 1L, 2L, 1L, 2L, 1L, 2L, 2L, 
    1L, 2L, 2L, 2L, 2L, 1L, 2L, 2L, 1L, 2L, 1L, 1L, 2L, 1L, 1L, 
    1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L), .Label = c("none", "disease"
    ), class = "factor")), row.names = c(NA, 47L), class = "data.frame")

This code should work for your data. I changed it to 5 folds, as 10 folds produced warnings about AUCs of 1, since the held-out sets were so small with N = 47 data points.

folds <- vfold_cv(data, v = 5, strata = 'disease')
model <- logistic_reg(engine = 'glm')
wflow <- 
  workflow() %>% 
  add_formula(disease ~ Pre.score + BMI) %>% 
  add_model(model)

results <-
  fit_resamples(
    wflow,
    resamples = folds,
    metrics = metric_set(roc_auc),
    control = control_resamples(
      event_level = 'second',
      save_pred = TRUE
    )
  )

collect_predictions(results) %>% 
  group_nest(id) %>% 
  mutate(
    aucs = map(data, function(x) {
      # reference the column as `disease`, not `data$disease` -- the
      # formula is evaluated in x, which is supplied via data = x
      r <- pROC::roc(disease ~ .pred_disease, data = x,  
                     direction = '<', ci = TRUE)
      as.numeric(r$ci)
    })
  ) %>% 
  unnest_wider(aucs) %>% 
  rename(auc = ...2, conf.low = ...1, conf.high = ...3) %>% 
  summarise(
    n = n(),
    across(c(auc, conf.low, conf.high), mean)
  )

Thank you for sharing this again, Matt!

I think I know where those errors came from. In my previous dataset, I named the data "mydata", so I wrote "map(mydata, function(x) ...)", which gives the error. It has to be "data" regardless of how I named my dataset, correct?

Also, do you know why I get different numbers every time I run this code? If so, which numbers should I report?

It has to be "data" regardless of how I named my dataset, correct?

Not quite. When using group_nest(), the default is to nest the data into a column called data. This can be changed, but by default your nested data will be in data, and so that name must match the name used in map().
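For example, the nested column can be renamed via the .key argument, in which case map() must use that name instead (a minimal sketch):

collect_predictions(results) %>% 
  group_nest(id, .key = 'preds') %>% 
  mutate(aucs = map(preds, function(x) nrow(x)))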

Also, do you know why I get different numbers every time I run this code? If so, which numbers should I report?

Resampling (vfold_cv()) has a random component. Each time you run the code from the top, you will select different folds for cross-validation, which has some effect on the results. If you want the results to be reproducible, you need to set a seed, e.g. with set.seed().
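For example, running this once at the top of your script fixes the folds (2022 is an arbitrary choice of seed):

set.seed(2022)
folds <- vfold_cv(data, v = 5, strata = 'disease')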


This is very helpful, Matt!!!

I want to follow up with one last question on this topic. How can we impute missing data when building the cvAUC model? I know we could use MICE, but I am having trouble incorporating it into our cvAUC code.

Below is another dataset, in which I have some missing data for both of my predictors (one continuous (percentage) and one categorical (smoking status)). Can you please teach me how to build the cvAUC model using this dataset?

structure(list(Smoking.Status = c("smoking", "smoking", "smoking", 
"smoking", "smoking", "non-smoking", "smoking", "non-smoking", 
"non-smoking", "non-smoking", "smoking", "non-smoking", "non-smoking", 
"smoking", "non-smoking", "smoking", "smoking", "non-smoking", 
"non-smoking", "", "", "", "", "", "", "", "non-smoking", "", 
"", "non-smoking", "smoking", "non-smoking", "non-smoking", "smoking", 
"non-smoking", "non-smoking", "non-smoking", "non-smoking", "non-smoking", 
"", "non-smoking", "smoking", "non-smoking", "non-smoking", "smoking", 
"non-smoking", "smoking"), percentage = c(5.5, 72.1, 7.9, 80.6, 
56.3, 11.5, 15.3, 12.3, 30.9, 27.5, 0.3, 5.3, 19.6, 19.8, 0.3, 
40.5, 16.8, 38, 13.8, NA, 15.8, 15.3, 22.8, 17.2, NA, NA, NA, 
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 31.6, 41.3, 
21.4, 0.4, 41.2, 7.6, 29.9), Post.score = c(1.309408341, 7.213930348, 
25.26690391, 12.92719168, 8.702064897, 5.556698909, 16.09399246, 
8.097784568, 4.505119454, 1.120709783, 1.708011387, 5.040871935, 
0.937744204, 6.898584906, 16.31768953, 5.823792932, 3.003754693, 
1.416005149, 44.515357, 4.358683314, 5.233572398, 0.376175549, 
38.43137255, 22.97383535, 1.367088608, 7.234251969, 8.444902163, 
5.696202532, 6.324262169, 3.12922542, 8.610271903, 53.125, 4.962950198, 
7.529843893, 2.871287129, 3.155728333, 15.67839196, 3.181336161, 
3.718393654, 3.9408867, 29.10839161, 21.28337983, 7.73073889, 
12.6340882, 18.53658537, 17.49837978, 15.8557047)), row.names = c(NA, 
47L), class = "data.frame")

data_1$disease <- cut(data_1$Post.score, br = c(0, 5, 80), labels = c("none", "disease")) ## same as before - dichotomized disease outcome

You don't necessarily have to use MICE; it is one option for missing data. The tidymodels ecosystem also supports imputation at fit time via the recipes package.

Check out the recipes package here: https://recipes.tidymodels.org/

Check out the available imputation steps in the function reference: https://recipes.tidymodels.org/reference/

If you choose to go this route, you can define a recipe for your model-fitting needs and replace add_formula() with add_recipe().

This should get you started:

# Recode empty strings in Smoking.Status as real NAs so the imputation steps can see them
data <- mutate(data, Smoking.Status = na_if(Smoking.Status, ''))

folds <- vfold_cv(data, v = 5, strata = 'disease')

model <- logistic_reg(engine = 'glm')

rec <- 
  recipe(disease ~ Smoking.Status + percentage, data = data) %>% 
  step_integer(Smoking.Status) %>%        # encode smoking status as an integer
  step_impute_median(Smoking.Status) %>%  # impute missing smoking status with the median
  step_impute_linear(percentage)          # impute percentage from the other predictors

# We can check how this imputation will work
View(juice(prep(rec, data)))
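# (in newer recipes versions, juice(prep(rec, data)) is equivalent to
# bake(prep(rec, data), new_data = NULL))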

wflow <- 
  workflow() %>% 
  add_recipe(rec) %>% 
  add_model(model)

results <-
  fit_resamples(
    wflow,
    resamples = folds,
    metrics = metric_set(roc_auc),
    control = control_resamples(
      event_level = 'second',
      save_pred = TRUE
    )
  )
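
As before, you can then pull out the cross-validated AUC:

# Average the per-fold AUCs into the cross-validated estimate
collect_metrics(results)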