How do I obtain the coefficients, z scores, and p-values for each fold of a k-fold cross validation in R?

I'm performing 5-fold cross-validation using glm to perform logistic regression. This all works fine, but I'd like to obtain the summary(model) information for each fold(meaning the coefficients, p valuess, z scores, etc that you get when performing summary() ), as well as the sensitivity and specificity for each fold if possible. Can someone help?

Here is a reproducible example of what I have so far using the built-in cars dataset:

library(caret)

data("mtcars")
str(mtcars)


mtcars$vs<-as.factor(mtcars$vs)
df0<-na.omit(mtcars)

set.seed(123) 
train.control <- trainControl(method = "cv", number = 5)
# Train the model
model <- train(vs ~., data = mtcars, method = "glm",
               trControl = train.control)


print(model)

summary(model)

model$resample

confusionMatrix(model)

pred.mod  <- predict(model)
confusionMatrix(data=pred.mod, reference=mtcars$vs)

Thank you!

caret doesn't really let you do that. tidymodels does via the extract option.

library(tidymodels)

data(ad_data, package = "modeldata")

set.seed(1)
rs <- vfold_cv(ad_data, repeats = 5)

lr_mod <- logistic_reg() %>% set_engine("glm")

# A function to pull out the model and tidy it
get_model_stats <- function(x) {
  # x is a workflow, so pull the model out
  mod <- pull_workflow_fit(x)
  # return the stats in a good format
  tidy(mod)
}

lr_models <-
  lr_mod %>%
  fit_resamples(Class ~ male + p_tau + age,
                resamples = rs,
                control = control_resamples(extract = get_model_stats))

# Pull out the results; it's a bit of a slog
model_stats <- 
  lr_models %>% 
  select(starts_with("id"), .extracts) %>% 
  bind_rows() %>% 
  unnest(".extracts") %>%  
  unnest(".extracts") %>% 
  select(-.config)

model_stats
#> # A tibble: 200 x 7
#>    id      id2    term        estimate std.error statistic     p.value
#>    <chr>   <chr>  <chr>          <dbl>     <dbl>     <dbl>       <dbl>
#>  1 Repeat1 Fold01 (Intercept)  260.      105.         2.47 0.0133     
#>  2 Repeat1 Fold01 male          -0.706     0.288     -2.45 0.0142     
#>  3 Repeat1 Fold01 p_tau         -1.72      0.355     -4.85 0.00000121 
#>  4 Repeat1 Fold01 age         -255.      107.        -2.39 0.0169     
#>  5 Repeat1 Fold02 (Intercept)  230.      103.         2.24 0.0251     
#>  6 Repeat1 Fold02 male          -0.676     0.286     -2.36 0.0182     
#>  7 Repeat1 Fold02 p_tau         -1.64      0.329     -4.98 0.000000632
#>  8 Repeat1 Fold02 age         -225.      104.        -2.16 0.0310     
#>  9 Repeat1 Fold03 (Intercept)  266.      105.         2.54 0.0112     
#> 10 Repeat1 Fold03 male          -0.697     0.291     -2.40 0.0165     
#> # … with 190 more rows

Created on 2021-04-19 by the reprex package (v1.0.0.9000)

1 Like

Thank you for the reply, the code works as shown. Unfortunately

model$resample
confusionMatrix(model)
pred.mod  <- predict(model)
confusionMatrix(data=pred.mod, reference=mtcars$vs)

Now no longer work. Is there an equivalent way to do this with the tidyverse package? Thank you!

Which part doesn't work? Can you do another reprex?

It looks like you are getting the confusion matrix for the training set (resampled and repredicted).

library(tidymodels)

data(ad_data, package = "modeldata")

set.seed(1)
rs <- vfold_cv(ad_data, repeats = 5)

lr_mod <- logistic_reg() %>% set_engine("glm")

lr_models <-
  lr_mod %>%
  fit_resamples(Class ~ male + p_tau + age,
                resamples = rs,
                control = control_resamples(save_pred = TRUE))

# Resampled confsion matrix: 
conf_mat_resampled(lr_models, tidy = FALSE)
#>          Impaired Control
#> Impaired     24.8    66.2
#> Control      19.2   222.8

# Repredicted
lr_fit <- 
  lr_mod %>% 
  fit(Class ~ male + p_tau + age, data = ad_data)

training_predictions <- augment(lr_fit, new_data = ad_data)

training_predictions %>% 
  conf_mat(truth = Class, estimate = .pred_class)
#>           Truth
#> Prediction Impaired Control
#>   Impaired       25      17
#>   Control        66     225

Created on 2021-04-20 by the reprex package (v1.0.0.9000)

Ah I think I may have been doing the wrong thing in my original code anyways: My apologies, I'm still quite new to both statistics and R. Allow me to start over: I am trying to obtain the confusion matrixes for each fold in the cross-validation by considering the predictions for test data in each fold. I am also trying to obtain the sensitivity and specificity for each fold. Is there a way to do that? Thank you!

Ok, got it. There's some code below (that is very tidyverse) to do what you want.

One note: I figured out from my last post that the results of conf_mat_resamples(tidy = FALSE) are transposed. I"m fixing that now. I use the tidy format of that matrix below (which is correct).

library(tidymodels)

data(ad_data, package = "modeldata")

set.seed(1)
rs <- vfold_cv(ad_data, repeats = 5)

lr_mod <- logistic_reg() %>% set_engine("glm")

perf_metrics <- metric_set(roc_auc, accuracy, sensitivity, specificity)

lr_models <-
  lr_mod %>%
  fit_resamples(Class ~ male + p_tau + age,
                resamples = rs,
                metrics = perf_metrics,
                control = control_resamples(save_pred = TRUE))
# Sens and spec

collect_metrics(lr_models, summarize = FALSE)
#> # A tibble: 200 x 6
#>    id      id2    .metric  .estimator .estimate .config             
#>    <chr>   <chr>  <chr>    <chr>          <dbl> <chr>               
#>  1 Repeat1 Fold01 accuracy binary         0.765 Preprocessor1_Model1
#>  2 Repeat1 Fold01 sens     binary         0.455 Preprocessor1_Model1
#>  3 Repeat1 Fold01 spec     binary         0.913 Preprocessor1_Model1
#>  4 Repeat1 Fold01 roc_auc  binary         0.723 Preprocessor1_Model1
#>  5 Repeat1 Fold02 accuracy binary         0.735 Preprocessor1_Model1
#>  6 Repeat1 Fold02 sens     binary         0.2   Preprocessor1_Model1
#>  7 Repeat1 Fold02 spec     binary         0.958 Preprocessor1_Model1
#>  8 Repeat1 Fold02 roc_auc  binary         0.804 Preprocessor1_Model1
#>  9 Repeat1 Fold03 accuracy binary         0.618 Preprocessor1_Model1
#> 10 Repeat1 Fold03 sens     binary         0.222 Preprocessor1_Model1
#> # … with 190 more rows
# Confusion matrices

# Averaged confusion matrix
conf_mat_resampled(lr_models)
#> # A tibble: 4 x 3
#>   Prediction Truth     Freq
#>   <fct>      <fct>    <dbl>
#> 1 Impaired   Impaired  24.8
#> 2 Impaired   Control   19.2
#> 3 Control    Impaired  66.2
#> 4 Control    Control  223.

# Individual tables
confusion_matrices <- 
  lr_models %>% 
  collect_predictions() %>% 
  group_nest(id, id2) %>% 
  mutate(confusion = 
         map(data, ~ conf_mat(.x, truth = Class, estimate = .pred_class)))

confusion_matrices$confusion[[1]]
#>           Truth
#> Prediction Impaired Control
#>   Impaired        5       2
#>   Control         6      21

Created on 2021-04-21 by the reprex package (v1.0.0.9000)

Thank you Max! This mostly works, there are two error message that show up right at the end of the script that I can't quite fix:

library(tidymodels)

data(mtcars)

mtcars$vs<-as.factor(mtcars$vs)

set.seed(1)

rs <- vfold_cv(mtcars, v = 5)

lr_mod <- logistic_reg() %>% set_engine("glm")

perf_metrics <- metric_set(roc_auc, accuracy, sensitivity, specificity)

lr_models <-
  
  lr_mod %>%
  
  fit_resamples(vs ~ mpg+am+gear,
                
                resamples = rs,
                
                metrics = perf_metrics,
                
                control = control_resamples(save_pred = TRUE))


collect_metrics(lr_models, summarize = FALSE)
#> # A tibble: 20 x 5
#>    id    .metric  .estimator .estimate .config             
#>    <chr> <chr>    <chr>          <dbl> <fct>               

#>  1 Fold1 accuracy binary         0.857 Preprocessor1_Model1
#>  2 Fold1 sens     binary         0.8   Preprocessor1_Model1
#>  3 Fold1 spec     binary         1     Preprocessor1_Model1
#>  4 Fold1 roc_auc  binary         0.8   Preprocessor1_Model1
#>  5 Fold2 accuracy binary         0.571 Preprocessor1_Model1
#>  6 Fold2 sens     binary         0.667 Preprocessor1_Model1
#>  7 Fold2 spec     binary         0.5   Preprocessor1_Model1
#>  8 Fold2 roc_auc  binary         0.5   Preprocessor1_Model1
#>  9 Fold3 accuracy binary         0.667 Preprocessor1_Model1
#> 10 Fold3 sens     binary         0.667 Preprocessor1_Model1
#> 11 Fold3 spec     binary         0.667 Preprocessor1_Model1
#> 12 Fold3 roc_auc  binary         0.889 Preprocessor1_Model1
#> 13 Fold4 accuracy binary         0.667 Preprocessor1_Model1
#> 14 Fold4 sens     binary         1     Preprocessor1_Model1
#> 15 Fold4 spec     binary         0.333 Preprocessor1_Model1
#> 16 Fold4 roc_auc  binary         0.667 Preprocessor1_Model1
#> 17 Fold5 accuracy binary         1     Preprocessor1_Model1
#> 18 Fold5 sens     binary         1     Preprocessor1_Model1
#> 19 Fold5 spec     binary         1     Preprocessor1_Model1
#> 20 Fold5 roc_auc  binary         1     Preprocessor1_Model1

conf_mat_resampled(lr_models)
#> # A tibble: 4 x 3
#>   Prediction Truth  Freq
#>   <fct>      <fct> <dbl>
#> 1 0          0       3  
#> 2 0          1       1  
#> 3 1          0       0.6
#> 4 1          1       1.8

confusion_matrices <- 
  lr_models %>% 
  collect_predictions() %>% 
  group_nest(id, id2) %>% 
  mutate(confusion = 
           map(data, ~ conf_mat(.x, truth = vs, estimate = .pred_class)))
#> Error: Must group by variables found in `.data`.
#> * Column `id2` is not found.

confusion_matrices$confusion[[1]]
#> Error in eval(expr, envir, enclos): object 'confusion_matrices' not found

Thank you!

I think that you changed the resampling to regular V-fold (with no repeats). When that is the case, there is only one id field, so get rid of the id2 in the group_nest() call.

Hi Max, thank you for the help. I had thought that might be the case, but I still get an error, albeit a slightly different one:

library(tidymodels)
library(readr)
#> 
#> Attaching package: 'readr'
#> The following object is masked from 'package:yardstick':
#> 
#>     spec
#> The following object is masked from 'package:scales':
#> 
#>     col_factor
library(Rmisc)
#> Loading required package: lattice
#> Loading required package: plyr
#> ------------------------------------------------------------------------------
#> You have loaded plyr after dplyr - this is likely to cause problems.
#> If you need functions from both plyr and dplyr, please load plyr first, then dplyr:
#> library(plyr); library(dplyr)
#> ------------------------------------------------------------------------------
#> 
#> Attaching package: 'plyr'
#> The following object is masked from 'package:purrr':
#> 
#>     compact
#> The following objects are masked from 'package:dplyr':
#> 
#>     arrange, count, desc, failwith, id, mutate, rename, summarise,
#>     summarize
#library(boot)
setwd("~/covid_survey/Cleaned_Up")
df0<- read_csv("PTSD_cleaned.csv")
#> 
#> ── Column specification ────────────────────────────────────────────────────────
#> cols(
#>   .default = col_double(),
#>   URN = col_character(),
#>   StartDate = col_character(),
#>   CompletionDate = col_character()
#> )
#> ℹ Use `spec()` for the full column specifications.


df0$PCLV_Binary_Score<-as.factor(df0$PCLV_Binary_Score)
df0$Gender<-as.factor(df0$Gender)
df0$Ethnicity<-as.factor(df0$Ethnicity)
df0$Age<-as.factor(df0$Age)
df0$Location<-as.factor(df0$Location)
df0$Income<-as.factor(df0$Income)
df0$Education<-as.factor(df0$Education)
df0$Working_From_Home<-as.factor(df0$Working_From_Home)

PTSD<-na.omit(df0)




set.seed(123)

rs <- vfold_cv(PTSD, v = 5)



lr_mod <- logistic_reg() %>% set_engine("glm")



perf_metrics <- metric_set(accuracy, sensitivity, specificity)



lr_models <-
  
  lr_mod %>%
  
  fit_resamples(PCLV_Binary_Score ~ Traditional_Time_Hrs+
                  Private_Use+
                  Public_Use+
                  Gender+
                  Ethnicity+
                  Age+
                  Location+
                  Income+
                  Education+
                  Working_From_Home+
                  COVID_Risk+
                  Anxiety_Diagnosed+
                  Depression_Diagnosed+
                  PTSD_Diagnosed+
                  Lived_Alone,
                resamples = rs,
                metrics = perf_metrics,
                control = control_resamples(save_pred = TRUE))
#> ! Fold1: preprocessor 1/1, model 1/1: glm.fit: algorithm did not converge, glm.fi...
#> ! Fold1: preprocessor 1/1, model 1/1 (predictions): prediction from a rank-defici...
#> ! Fold2: preprocessor 1/1, model 1/1: glm.fit: algorithm did not converge, glm.fi...
#> ! Fold2: preprocessor 1/1, model 1/1 (predictions): prediction from a rank-defici...
#> ! Fold3: preprocessor 1/1, model 1/1: glm.fit: algorithm did not converge, glm.fi...
#> ! Fold3: preprocessor 1/1, model 1/1 (predictions): prediction from a rank-defici...
#> ! Fold4: preprocessor 1/1, model 1/1: glm.fit: algorithm did not converge, glm.fi...
#> ! Fold4: preprocessor 1/1, model 1/1 (predictions): prediction from a rank-defici...
#> ! Fold5: preprocessor 1/1, model 1/1: glm.fit: algorithm did not converge, glm.fi...
#> ! Fold5: preprocessor 1/1, model 1/1 (predictions): prediction from a rank-defici...

collect_metrics(lr_models, summarize = FALSE)
#> # A tibble: 15 x 5
#>    id    .metric  .estimator .estimate .config             
#>    <chr> <chr>    <chr>          <dbl> <fct>               
#>  1 Fold1 accuracy binary         0.931 Preprocessor1_Model1
#>  2 Fold1 sens     binary         0.943 Preprocessor1_Model1
#>  3 Fold1 spec     binary         0.8   Preprocessor1_Model1
#>  4 Fold2 accuracy binary         0.793 Preprocessor1_Model1
#>  5 Fold2 sens     binary         0.811 Preprocessor1_Model1
#>  6 Fold2 spec     binary         0.6   Preprocessor1_Model1
#>  7 Fold3 accuracy binary         0.825 Preprocessor1_Model1
#>  8 Fold3 sens     binary         0.86  Preprocessor1_Model1
#>  9 Fold3 spec     binary         0.571 Preprocessor1_Model1
#> 10 Fold4 accuracy binary         0.825 Preprocessor1_Model1
#> 11 Fold4 sens     binary         0.868 Preprocessor1_Model1
#> 12 Fold4 spec     binary         0.25  Preprocessor1_Model1
#> 13 Fold5 accuracy binary         0.807 Preprocessor1_Model1
#> 14 Fold5 sens     binary         0.852 Preprocessor1_Model1
#> 15 Fold5 spec     binary         0     Preprocessor1_Model1
conf_mat_resampled(lr_models)
#> # A tibble: 4 x 3
#>   Prediction Truth  Freq
#>   <fct>      <fct> <dbl>
#> 1 0          0      45.6
#> 2 0          1       2.4
#> 3 1          0       7  
#> 4 1          1       2.4


print(collect)
#> function (x, ...) 
#> {
#>     UseMethod("collect")
#> }
#> <bytecode: 0x7f89fd0be760>
#> <environment: namespace:dplyr>

confusion_matrices <- 
  lr_models %>% 
  collect_predictions() %>% 
  group_nest(id) %>% 
  mutate(confusion = 
           map(PTSD, ~ conf_mat(.x, truth = PCLV_Binary_Score, estimate = .pred_class)))
#> Error in UseMethod("conf_mat"): no applicable method for 'conf_mat' applied to an object of class "character"

confusion_matrices$confusion[[1]]
#> Error in eval(expr, envir, enclos): object 'confusion_matrices' not found

Can you run this?

lr_models %>% 
  collect_predictions() %>% 
  str()

Unfortunately not. It yields:

Error in eval(f) : 'list' object cannot be coerced to type 'double'

Also I apologise, I forgot to mention that I had to switch over to my actual dataset in the previous reprex because the errors were showing up there but not for mtcars.

I'm not sure what the issue is. Can you make a small, reproducible example with the same issue?

This topic was automatically closed 21 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.