Cross-validated AUC with Logistic Regression

Dear all,

I would like to ask how I can compute a cross-validated AUC. I saw the following R code online and am not sure if this is the correct way to do it. If not, can someone please share the correct R code for a 10-fold cvAUC?

Thank you very much! :)

(cv_logit <- train(
  life_exp_cat ~ total_expenditure + schooling + adult_mortality +
    infant_deaths + thinness_1_19_years + alcohol + status,
  data = life_exp_new,
  method = "glm",
  trControl = trainControl(method = "cv", number = 10)
))

Hi @Rlearner123,

I believe the above code is based on the {caret} package. The tidymodels family of packages largely supersedes {caret}, so after a quick note on the caret code itself (just below), I'll show how you could do this with the modern and more modular API.
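To answer the question as asked: your code does run 10-fold cross-validation, but with the default trainControl() it will report Accuracy and Kappa rather than AUC. A sketch of the caret way to get a cross-validated ROC AUC (assuming life_exp_cat is a two-level factor whose levels are valid R names, which twoClassSummary requires):

library(caret)

(cv_logit <- train(
  life_exp_cat ~ total_expenditure + schooling + adult_mortality +
    infant_deaths + thinness_1_19_years + alcohol + status,
  data = life_exp_new,
  method = "glm",
  metric = "ROC",                      # select on AUC instead of Accuracy
  trControl = trainControl(
    method = "cv", number = 10,
    classProbs = TRUE,                 # needed to compute ROC curves
    summaryFunction = twoClassSummary  # reports ROC, Sens, Spec
  )
))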

We will use the Ames Housing data, since I don't have access to the data in the original problem. Let's see if we can predict whether a house was built before or after 1960 from just a couple of predictors, and measure model performance with the cross-validated AUC.

library(tidymodels)
library(AmesHousing)

# Prepare the data
data <- 
  make_ames() %>% 
  mutate(
    before_1960 = factor(if_else(Year_Built < 1960, 1L, 0L))
  ) %>% 
  select(-Year_Built)

# Get the resampling folds
folds <- vfold_cv(data, v = 10, strata = 'before_1960')

# Define a model
model <- logistic_reg(engine = 'glm')

# Define a workflow
wflow <- 
  workflow() %>% 
  add_formula(before_1960 ~ Sale_Price + Lot_Area + Roof_Style) %>% 
  add_model(model)

# Fit the model to the resamples, compute metrics
results <-
  fit_resamples(
    wflow,
    resamples = folds,
    metrics = metric_set(roc_auc),
    control = control_resamples(event_level = 'second')
  )

# Compute CV-AUC
collect_metrics(results)
#> # A tibble: 1 × 6
#>   .metric .estimator  mean     n std_err .config             
#>   <chr>   <chr>      <dbl> <int>   <dbl> <chr>               
#> 1 roc_auc binary     0.830    10 0.00751 Preprocessor1_Model1

Hi Matt,
Thank you very much for the help. Can you please explain the output from the last piece of code? Is the mean of 0.83 the cvAUC? Also, can you please teach me how to get the 95% CI and p-value for this cvAUC model?

#> # A tibble: 1 × 6
#>   .metric .estimator  mean     n std_err .config             
#>   <chr>   <chr>      <dbl> <int>   <dbl> <chr>               
#> 1 roc_auc binary     0.830    10 0.00751 Preprocessor1_Model1

Since we fit a model to each of the 10 training sets and then measure the AUC on each of the 10 testing sets, we have 10 test AUCs in total. We then average those 10 AUCs to get the overall cross-validated AUC, which is shown in the mean column. The n column shows how many values went into the average; this number will change if you use more or fewer resamples, such as with bootstrapping, LOO-CV, or a different number of folds in vfold_cv(). Hope that helps.
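If you want to see each fold's AUC rather than just their average, collect_metrics() can return the unsummarised values:

# One row per fold instead of the averaged estimate
collect_metrics(results, summarize = FALSE)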

As for the 95% CI, this is not straightforward to do within the tidymodels framework. You can certainly do it yourself by using control_resamples(save_pred = TRUE) and then computing the AUCs with the pROC package, or whichever package you prefer. It is a bit long-winded, but something like this should work:

# Fit the model to the resamples, save predictions this time
results <-
  fit_resamples(
    wflow,
    resamples = folds,
    metrics = metric_set(roc_auc),
    control = control_resamples(
      event_level = 'second',
      save_pred = TRUE
    )
  )

library(pROC)

collect_predictions(results) %>% 
  group_nest(id) %>% 
  mutate(
    aucs = map(data, function(x) {
      # per-fold ROC with a 95% CI; as.numeric() returns c(lower, auc, upper)
      r <- roc(before_1960 ~ .pred_1, data = x, 
               direction = '<', ci = TRUE)
      as.numeric(r$ci)
    })
  ) %>% 
  unnest_wider(aucs) %>% 
  rename(auc = ...2, conf.low = ...1, conf.high = ...3) %>% 
  summarise(
    n = n(),
    across(c(auc, conf.low, conf.high), mean)
  )

#> # A tibble: 1 × 4
#>       n   auc conf.low conf.high
#>   <int> <dbl>    <dbl>     <dbl>
#> 1    10 0.829    0.780     0.878

Hi Matt,
Thank you very much for the detailed response. I understand it better now!
Would you mind sharing how I can obtain the p-value, sensitivity, and specificity for this cvAUC model?

Sorry for the multiple responses, Matt! I just tried the second piece of code you shared and it is not working for my dataset. Can you please explain where "x" and ".pred_1" come from?

Would you mind sharing how I can obtain the p-value, sensitivity, and specificity for this cvAUC model?

I think this question needs to be refined a bit. The cross-validated AUC is one metric we use to evaluate model performance; sensitivity and specificity are other such metrics. However, AUC is computed from the sensitivity and specificity at every probability threshold across the range of model-predicted probabilities, whereas for sens/spec we need to assign class labels at a single threshold, not probabilities. Second, what p-value are you looking for? For what test or comparison? This needs to be well defined.
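To make the threshold point concrete, here is a minimal sketch (assuming the results object from above, fit with save_pred = TRUE): we have to pick a cutoff, 0.5 here, to turn probabilities into class labels before sensitivity can be computed.

collect_predictions(results) %>% 
  mutate(
    # assign class labels at an (arbitrary) 0.5 probability cutoff
    .pred_class = factor(if_else(.pred_1 > 0.5, 1L, 0L), levels = c(0L, 1L))
  ) %>% 
  sens(truth = before_1960, estimate = .pred_class, event_level = 'second')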

Can you please explain where "x" and ".pred_1" come from?

.pred_1 is the name of the column that contains the predicted probabilities of belonging to class 1; this is a naming convention used by the tidymodels functions. x is just the name of the argument passed to the anonymous function I used in the map() call.
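If map() is unfamiliar, here is a toy illustration (with made-up values): it calls the function once per element of the list, binding each element to the argument x.

library(purrr)

map(list(1:3, 4:6), function(x) mean(x))
#> [[1]]
#> [1] 2
#> 
#> [[2]]
#> [1] 5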


Hi Matt,

I am under the impression that every AUC is associated with a sensitivity and a specificity. Does this apply to the cvAUC as well (maybe there are 10 sens and 10 specs and we have to average them, just like the AUC)? The same goes for the p-value: for an unadjusted AUC, I know the p-value tests against AUC = 0.5. Can we do that for the cvAUC as well?

You can obtain the sensitivity and specificity alongside the AUC via:

results <-
  fit_resamples(
    wflow,
    resamples = folds,
    metrics = metric_set(roc_auc, sensitivity, specificity),
    control = control_resamples(
      event_level = 'second',
      save_pred = TRUE
    )
  )

# Compute CV-AUC, CV-Sens, CV-Spec
collect_metrics(results)
#> # A tibble: 3 × 6
#>   .metric .estimator  mean     n std_err .config             
#>   <chr>   <chr>      <dbl> <int>   <dbl> <chr>               
#> 1 roc_auc binary     0.831    10 0.00713 Preprocessor1_Model1
#> 2 sens    binary     0.550    10 0.0175  Preprocessor1_Model1
#> 3 spec    binary     0.870    10 0.0111  Preprocessor1_Model1

If you want to test the AUC against an AUC of 0.5, this should work:

avg_preds <- 
  collect_predictions(results) %>% 
  group_by(.row) %>% 
  summarise(
    y = max(before_1960 == 1),  # 0/1 outcome for each observation
    p = mean(.pred_1)           # out-of-fold predicted probability
  )

# roc.area tests the null hypothesis that the true AUC is 0.5
verification::roc.area(avg_preds$y, avg_preds$p)$p.value
#> Registered S3 method overwritten by 'verification':
#>   method    from
#>   lines.roc pROC
#> [1] 5.521164e-186

Thank you Matt

Re: the 95% CI, I somehow got the error message below. Do you know what it means?

Error: Problem with mutate() column aucs.
i aucs = map(...).
x object 'disease' not found

(disease is my outcome; I simply replaced before_1960 with it, but it won't work, sadly!)

Hi Matt - I am following up on the 95% CI question from my previous post. Do you know how I can fix this issue? Thank you!

If you are not able to share your data, it would be best if you could share the structure of your data, or some synthetic data with that structure. It is very difficult to write code that works across any and all data sets.
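For example, str() prints a compact summary of each column and its type, which is often enough to go on:

str(data)  # where data is the name of your data set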


Hi Matt - thank you for the suggestion. Please see my data attached (somehow I can only upload a PDF). R sample - cvAUC.pdf (77.4 KB)

Please run the following code first to get the disease vs non-disease status.

data1 <- R.sample...cvAUC
data <- data1[-48:-68, ]  ## remove empty rows at the end
data$disease <- cut(data$Post.score, br = c(0, 5, 80), labels = c("none", "disease"))

Here is the code I tried in order to get the cvAUC's 95% CI. Can you please spot why it is not working?

folds <- vfold_cv(data, v = 10, strata = 'disease')
model <- logistic_reg(engine = 'glm')
wflow <- 
  workflow() %>% 
  add_formula(disease ~ Pre.score + BMI) %>% 
  add_model(model)

results <-
  fit_resamples(
    wflow,
    resamples = folds,
    metrics = metric_set(roc_auc),
    control = control_resamples(
      event_level = 'second',
      save_pred = TRUE
    )
  )

collect_predictions(results) %>% 
  group_nest(id) %>% 
  mutate(
    aucs = map(data, function(x) {
      roc = roc(data$disease ~ .pred_disease, data = x,  
          direction = '<', ci = TRUE)
      as.numeric(roc$ci)
    })
  ) %>% 
  unnest_wider(aucs) %>% 
  rename(auc = ...2, conf.low = ...1, conf.high = ...3) %>% 
  summarise(
    n = n(),
    across(c(auc, conf.low, conf.high), mean)
  )

If you have your data loaded into an R session, please run dput(data), where data is the name of your data set, and then copy and paste the output into code fences (i.e. ``` ) here on the forum.
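For example, for a built-in data set this prints a text representation that anyone can paste straight into R:

dput(head(mtcars, 3))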


Hi Matt, I am not sure if I did this correctly, but is this what you meant?

structure(list(BMI = c(28.91, 33.9, 32.25, 40.08, 30.71, 33.67, 
53.51, 56.36, 38.5, 30, 36.05, 32.03, 34.11, 32.95, 36.81, 27.32, 
27.27, 24.3, 22.76, 22.47, 29.3, 21.75, 26.7, 28.2, 29.82, 26.6, 
35.35, 37.2, 29.4, 20.9, 28.7, 38.35, 25.8, 27.3, 24.3, 23.8, 
35.5, 24.2, 27.91, 32.1, 43.2, 37.76, 26.79, 35.81, 36.35, 35.55, 
25.76), Pre.score = c(15.87301587, 5.310939628, 5.707491082, 
3.089700997, 41.27569847, 4.200567644, 14.30503889, 6.699928724, 
4.148471616, 3.70212766, 19.41605839, 11.99368753, 5.991232343, 
4.42804428, 20.77562327, 3.432367595, 1.1574886, 4.186655037, 
29.27974948, 0.874453467, 13.66459627, 0.564652739, 13.06160259, 
19.66829268, 0, 8.623969562, 13.18289786, 3.505734689, 4.769475358, 
0.208768267, 1.23683005, 11.11934766, 0.567000567, 4.077429984, 
1.179835538, 2.582781457, 18.62888102, 0.540151242, 9.014084507, 
2.714285714, 15.17327505, 17.41935484, 11.01306036, 13.53987378, 
7.38645816, 18.69688385, 14.6287403), Post.score = c(1.309408341, 
7.213930348, 25.26690391, 12.92719168, 8.702064897, 5.556698909, 
16.09399246, 8.097784568, 4.505119454, 1.120709783, 1.708011387, 
5.040871935, 0.937744204, 6.898584906, 16.31768953, 5.823792932, 
3.003754693, 1.416005149, 44.515357, 4.358683314, 5.233572398, 
0.376175549, 38.43137255, 22.97383535, 1.367088608, 7.234251969, 
8.444902163, 5.696202532, 6.324262169, 3.12922542, 8.610271903, 
53.125, 4.962950198, 7.529843893, 2.871287129, 3.155728333, 15.67839196, 
3.181336161, 3.718393654, 3.9408867, 29.10839161, 21.28337983, 
7.73073889, 12.6340882, 18.53658537, 17.49837978, 15.8557047), 
    disease = structure(c(1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 1L, 
    1L, 1L, 2L, 1L, 2L, 2L, 2L, 1L, 1L, 2L, 1L, 2L, 1L, 2L, 2L, 
    1L, 2L, 2L, 2L, 2L, 1L, 2L, 2L, 1L, 2L, 1L, 1L, 2L, 1L, 1L, 
    1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L), .Label = c("none", "disease"
    ), class = "factor")), row.names = c(NA, 47L), class = "data.frame")

This code should work for your data. I changed it to 5 folds, as 10 folds produced warnings about AUCs of 1, since the held-out sets were so small with N = 47 data points.

folds <- vfold_cv(data, v = 5, strata = 'disease')
model <- logistic_reg(engine = 'glm')
wflow <- 
  workflow() %>% 
  add_formula(disease ~ Pre.score + BMI) %>% 
  add_model(model)

results <-
  fit_resamples(
    wflow,
    resamples = folds,
    metrics = metric_set(roc_auc),
    control = control_resamples(
      event_level = 'second',
      save_pred = TRUE
    )
  )

collect_predictions(results) %>% 
  group_nest(id) %>% 
  mutate(
    aucs = map(data, function(x) {
      # reference the column as `disease`, not `data$disease` -- the
      # formula is evaluated in x, which is supplied via data = x
      r <- pROC::roc(disease ~ .pred_disease, data = x,  
                     direction = '<', ci = TRUE)
      as.numeric(r$ci)
    })
  ) %>% 
  unnest_wider(aucs) %>% 
  rename(auc = ...2, conf.low = ...1, conf.high = ...3) %>% 
  summarise(
    n = n(),
    across(c(auc, conf.low, conf.high), mean)
  )

Thank you for sharing this again, Matt!

I think I know where those errors came from. In my previous dataset, I named the data "mydata", so I wrote "map(mydata, function(x) ...)", which gives the error. It has to be "data" regardless of how I named my dataset, correct?

Also, do you know why I get different numbers every time I run this code? If so, which numbers should I report?

It has to be "data" regardless of how I named my dataset, correct?

Not quite. When using group_nest(), the default is to nest the data into a column called data. This can be changed, but by default your nested data will be in data, and so that name must match the name used in map().
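For example, the nested column can be renamed via the .key argument, in which case map() must use that name instead (a minimal sketch):

collect_predictions(results) %>% 
  group_nest(id, .key = 'preds') %>% 
  mutate(aucs = map(preds, function(x) nrow(x)))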

Also, do you know why I get different numbers every time I run this code? If so, which numbers should I report?

Resampling (vfold_cv()) has a random component. Each time you run the code from the top, you will select different folds for cross-validation, which has some effect on the results. If you want the results to be reproducible, you need to set a seed, e.g. with set.seed().
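For example, running this once at the top of your script fixes the folds (2022 is an arbitrary choice of seed):

set.seed(2022)
folds <- vfold_cv(data, v = 5, strata = 'disease')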


This is very helpful, Matt!!!

I want to follow up with one last question on this topic. How can we impute missing data when building the cvAUC model? I know we could use MICE, but I am having trouble incorporating it into our cvAUC code.

Below is another dataset, in which I have some missing data for both of my predictors (one continuous (percentage) and one categorical (smoking status)). Can you please teach me how to build the cvAUC model using this dataset?

structure(list(Smoking.Status = c("smoking", "smoking", "smoking", 
"smoking", "smoking", "non-smoking", "smoking", "non-smoking", 
"non-smoking", "non-smoking", "smoking", "non-smoking", "non-smoking", 
"smoking", "non-smoking", "smoking", "smoking", "non-smoking", 
"non-smoking", "", "", "", "", "", "", "", "non-smoking", "", 
"", "non-smoking", "smoking", "non-smoking", "non-smoking", "smoking", 
"non-smoking", "non-smoking", "non-smoking", "non-smoking", "non-smoking", 
"", "non-smoking", "smoking", "non-smoking", "non-smoking", "smoking", 
"non-smoking", "smoking"), percentage = c(5.5, 72.1, 7.9, 80.6, 
56.3, 11.5, 15.3, 12.3, 30.9, 27.5, 0.3, 5.3, 19.6, 19.8, 0.3, 
40.5, 16.8, 38, 13.8, NA, 15.8, 15.3, 22.8, 17.2, NA, NA, NA, 
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 31.6, 41.3, 
21.4, 0.4, 41.2, 7.6, 29.9), Post.score = c(1.309408341, 7.213930348, 
25.26690391, 12.92719168, 8.702064897, 5.556698909, 16.09399246, 
8.097784568, 4.505119454, 1.120709783, 1.708011387, 5.040871935, 
0.937744204, 6.898584906, 16.31768953, 5.823792932, 3.003754693, 
1.416005149, 44.515357, 4.358683314, 5.233572398, 0.376175549, 
38.43137255, 22.97383535, 1.367088608, 7.234251969, 8.444902163, 
5.696202532, 6.324262169, 3.12922542, 8.610271903, 53.125, 4.962950198, 
7.529843893, 2.871287129, 3.155728333, 15.67839196, 3.181336161, 
3.718393654, 3.9408867, 29.10839161, 21.28337983, 7.73073889, 
12.6340882, 18.53658537, 17.49837978, 15.8557047)), row.names = c(NA, 
47L), class = "data.frame")

data_1$disease <- cut(data_1$Post.score, br = c(0, 5, 80), labels = c("none", "disease")) ## same as before - dichotomized disease outcome

You don't necessarily have to use MICE; it is one option for missing data. The tidymodels ecosystem also supports imputation at fit time via the recipes package.

Check out the recipes package here: https://recipes.tidymodels.org/

Check out the available imputation steps in the function reference: https://recipes.tidymodels.org/reference/

If you choose to go this route, you can define a recipe for your model-fitting needs and replace add_formula() with add_recipe().

This should get you started:

# Recode empty strings in Smoking.Status as real NAs so the imputation steps can see them
data <- mutate(data, Smoking.Status = na_if(Smoking.Status, ''))

folds <- vfold_cv(data, v = 5, strata = 'disease')

model <- logistic_reg(engine = 'glm')

rec <- 
  recipe(disease ~ Smoking.Status + percentage, data = data) %>% 
  step_integer(Smoking.Status) %>%        # encode smoking status as an integer
  step_impute_median(Smoking.Status) %>%  # impute missing smoking status with the median
  step_impute_linear(percentage)          # impute percentage from the other predictors

# We can check how this imputation will work
View(juice(prep(rec, data)))
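# (in newer recipes versions, juice(prep(rec, data)) is equivalent to
# bake(prep(rec, data), new_data = NULL))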

wflow <- 
  workflow() %>% 
  add_recipe(rec) %>% 
  add_model(model)

results <-
  fit_resamples(
    wflow,
    resamples = folds,
    metrics = metric_set(roc_auc),
    control = control_resamples(
      event_level = 'second',
      save_pred = TRUE
    )
  )
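
As before, you can then pull out the cross-validated AUC:

# Average the per-fold AUCs into the cross-validated estimate
collect_metrics(results)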