How do I obtain the coefficients, z scores, and p-values for each fold of a k-fold cross validation in R?

Liber · April 19, 2021, 7:48pm

I'm performing 5-fold cross-validation using glm to perform logistic regression. This all works fine, but I'd like to obtain the summary(model) information for each fold(meaning the coefficients, p valuess, z scores, etc that you get when performing summary() ), as well as the sensitivity and specificity for each fold if possible. Can someone help?

Here is a reproducible example of what I have so far using the built-in cars dataset:

library(caret)

data("mtcars")
str(mtcars)


mtcars$vs<-as.factor(mtcars$vs)
df0<-na.omit(mtcars)

set.seed(123) 
train.control <- trainControl(method = "cv", number = 5)
# Train the model
model <- train(vs ~., data = mtcars, method = "glm",
               trControl = train.control)


print(model)

summary(model)

model$resample

confusionMatrix(model)

pred.mod  <- predict(model)
confusionMatrix(data=pred.mod, reference=mtcars$vs)

Thank you!

Max · April 19, 2021, 10:20pm

caret doesn't really let you do that. tidymodels does via the extract option.

library(tidymodels)

data(ad_data, package = "modeldata")

set.seed(1)
rs <- vfold_cv(ad_data, repeats = 5)

lr_mod <- logistic_reg() %>% set_engine("glm")

# A function to pull out the model and tidy it
get_model_stats <- function(x) {
  # x is a workflow, so pull the model out
  mod <- pull_workflow_fit(x)
  # return the stats in a good format
  tidy(mod)
}

lr_models <-
  lr_mod %>%
  fit_resamples(Class ~ male + p_tau + age,
                resamples = rs,
                control = control_resamples(extract = get_model_stats))

# Pull out the results; it's a bit of a slog
model_stats <- 
  lr_models %>% 
  select(starts_with("id"), .extracts) %>% 
  bind_rows() %>% 
  unnest(".extracts") %>%  
  unnest(".extracts") %>% 
  select(-.config)

model_stats
#> # A tibble: 200 x 7
#>    id      id2    term        estimate std.error statistic     p.value
#>    <chr>   <chr>  <chr>          <dbl>     <dbl>     <dbl>       <dbl>
#>  1 Repeat1 Fold01 (Intercept)  260.      105.         2.47 0.0133     
#>  2 Repeat1 Fold01 male          -0.706     0.288     -2.45 0.0142     
#>  3 Repeat1 Fold01 p_tau         -1.72      0.355     -4.85 0.00000121 
#>  4 Repeat1 Fold01 age         -255.      107.        -2.39 0.0169     
#>  5 Repeat1 Fold02 (Intercept)  230.      103.         2.24 0.0251     
#>  6 Repeat1 Fold02 male          -0.676     0.286     -2.36 0.0182     
#>  7 Repeat1 Fold02 p_tau         -1.64      0.329     -4.98 0.000000632
#>  8 Repeat1 Fold02 age         -225.      104.        -2.16 0.0310     
#>  9 Repeat1 Fold03 (Intercept)  266.      105.         2.54 0.0112     
#> 10 Repeat1 Fold03 male          -0.697     0.291     -2.40 0.0165     
#> # … with 190 more rows

^{Created on 2021-04-19 by the reprex package (v1.0.0.9000)}

Liber · April 20, 2021, 6:22pm

Thank you for the reply, the code works as shown. Unfortunately

model$resample
confusionMatrix(model)
pred.mod  <- predict(model)
confusionMatrix(data=pred.mod, reference=mtcars$vs)

Now no longer work. Is there an equivalent way to do this with the tidyverse package? Thank you!

Max · April 20, 2021, 8:37pm

Which part doesn't work? Can you do another reprex?

It looks like you are getting the confusion matrix for the training set (resampled and repredicted).

library(tidymodels)

data(ad_data, package = "modeldata")

set.seed(1)
rs <- vfold_cv(ad_data, repeats = 5)

lr_mod <- logistic_reg() %>% set_engine("glm")

lr_models <-
  lr_mod %>%
  fit_resamples(Class ~ male + p_tau + age,
                resamples = rs,
                control = control_resamples(save_pred = TRUE))

# Resampled confsion matrix: 
conf_mat_resampled(lr_models, tidy = FALSE)
#>          Impaired Control
#> Impaired     24.8    66.2
#> Control      19.2   222.8

# Repredicted
lr_fit <- 
  lr_mod %>% 
  fit(Class ~ male + p_tau + age, data = ad_data)

training_predictions <- augment(lr_fit, new_data = ad_data)

training_predictions %>% 
  conf_mat(truth = Class, estimate = .pred_class)
#>           Truth
#> Prediction Impaired Control
#>   Impaired       25      17
#>   Control        66     225

^{Created on 2021-04-20 by the reprex package (v1.0.0.9000)}

Liber · April 21, 2021, 1:16am

Ah I think I may have been doing the wrong thing in my original code anyways: My apologies, I'm still quite new to both statistics and R. Allow me to start over: I am trying to obtain the confusion matrixes for each fold in the cross-validation by considering the predictions for test data in each fold. I am also trying to obtain the sensitivity and specificity for each fold. Is there a way to do that? Thank you!

Max · April 21, 2021, 1:33pm

Ok, got it. There's some code below (that is very tidyverse) to do what you want.

One note: I figured out from my last post that the results of conf_mat_resamples(tidy = FALSE) are transposed. I"m fixing that now. I use the tidy format of that matrix below (which is correct).

library(tidymodels)

data(ad_data, package = "modeldata")

set.seed(1)
rs <- vfold_cv(ad_data, repeats = 5)

lr_mod <- logistic_reg() %>% set_engine("glm")

perf_metrics <- metric_set(roc_auc, accuracy, sensitivity, specificity)

lr_models <-
  lr_mod %>%
  fit_resamples(Class ~ male + p_tau + age,
                resamples = rs,
                metrics = perf_metrics,
                control = control_resamples(save_pred = TRUE))

# Sens and spec

collect_metrics(lr_models, summarize = FALSE)
#> # A tibble: 200 x 6
#>    id      id2    .metric  .estimator .estimate .config             
#>    <chr>   <chr>  <chr>    <chr>          <dbl> <chr>               
#>  1 Repeat1 Fold01 accuracy binary         0.765 Preprocessor1_Model1
#>  2 Repeat1 Fold01 sens     binary         0.455 Preprocessor1_Model1
#>  3 Repeat1 Fold01 spec     binary         0.913 Preprocessor1_Model1
#>  4 Repeat1 Fold01 roc_auc  binary         0.723 Preprocessor1_Model1
#>  5 Repeat1 Fold02 accuracy binary         0.735 Preprocessor1_Model1
#>  6 Repeat1 Fold02 sens     binary         0.2   Preprocessor1_Model1
#>  7 Repeat1 Fold02 spec     binary         0.958 Preprocessor1_Model1
#>  8 Repeat1 Fold02 roc_auc  binary         0.804 Preprocessor1_Model1
#>  9 Repeat1 Fold03 accuracy binary         0.618 Preprocessor1_Model1
#> 10 Repeat1 Fold03 sens     binary         0.222 Preprocessor1_Model1
#> # … with 190 more rows

# Confusion matrices

# Averaged confusion matrix
conf_mat_resampled(lr_models)
#> # A tibble: 4 x 3
#>   Prediction Truth     Freq
#>   <fct>      <fct>    <dbl>
#> 1 Impaired   Impaired  24.8
#> 2 Impaired   Control   19.2
#> 3 Control    Impaired  66.2
#> 4 Control    Control  223.

# Individual tables
confusion_matrices <- 
  lr_models %>% 
  collect_predictions() %>% 
  group_nest(id, id2) %>% 
  mutate(confusion = 
         map(data, ~ conf_mat(.x, truth = Class, estimate = .pred_class)))

confusion_matrices$confusion[[1]]
#>           Truth
#> Prediction Impaired Control
#>   Impaired        5       2
#>   Control         6      21

^{Created on 2021-04-21 by the reprex package (v1.0.0.9000)}

Liber · April 21, 2021, 7:07pm

Thank you Max! This mostly works, there are two error message that show up right at the end of the script that I can't quite fix:

library(tidymodels)

data(mtcars)

mtcars$vs<-as.factor(mtcars$vs)

set.seed(1)

rs <- vfold_cv(mtcars, v = 5)

lr_mod <- logistic_reg() %>% set_engine("glm")

perf_metrics <- metric_set(roc_auc, accuracy, sensitivity, specificity)

lr_models <-
  
  lr_mod %>%
  
  fit_resamples(vs ~ mpg+am+gear,
                
                resamples = rs,
                
                metrics = perf_metrics,
                
                control = control_resamples(save_pred = TRUE))


collect_metrics(lr_models, summarize = FALSE)
#> # A tibble: 20 x 5
#>    id    .metric  .estimator .estimate .config             
#>    <chr> <chr>    <chr>          <dbl> <fct>               

#>  1 Fold1 accuracy binary         0.857 Preprocessor1_Model1
#>  2 Fold1 sens     binary         0.8   Preprocessor1_Model1
#>  3 Fold1 spec     binary         1     Preprocessor1_Model1
#>  4 Fold1 roc_auc  binary         0.8   Preprocessor1_Model1
#>  5 Fold2 accuracy binary         0.571 Preprocessor1_Model1
#>  6 Fold2 sens     binary         0.667 Preprocessor1_Model1
#>  7 Fold2 spec     binary         0.5   Preprocessor1_Model1
#>  8 Fold2 roc_auc  binary         0.5   Preprocessor1_Model1
#>  9 Fold3 accuracy binary         0.667 Preprocessor1_Model1
#> 10 Fold3 sens     binary         0.667 Preprocessor1_Model1
#> 11 Fold3 spec     binary         0.667 Preprocessor1_Model1
#> 12 Fold3 roc_auc  binary         0.889 Preprocessor1_Model1
#> 13 Fold4 accuracy binary         0.667 Preprocessor1_Model1
#> 14 Fold4 sens     binary         1     Preprocessor1_Model1
#> 15 Fold4 spec     binary         0.333 Preprocessor1_Model1
#> 16 Fold4 roc_auc  binary         0.667 Preprocessor1_Model1
#> 17 Fold5 accuracy binary         1     Preprocessor1_Model1
#> 18 Fold5 sens     binary         1     Preprocessor1_Model1
#> 19 Fold5 spec     binary         1     Preprocessor1_Model1
#> 20 Fold5 roc_auc  binary         1     Preprocessor1_Model1

conf_mat_resampled(lr_models)
#> # A tibble: 4 x 3
#>   Prediction Truth  Freq
#>   <fct>      <fct> <dbl>
#> 1 0          0       3  
#> 2 0          1       1  
#> 3 1          0       0.6
#> 4 1          1       1.8

confusion_matrices <- 
  lr_models %>% 
  collect_predictions() %>% 
  group_nest(id, id2) %>% 
  mutate(confusion = 
           map(data, ~ conf_mat(.x, truth = vs, estimate = .pred_class)))
#> Error: Must group by variables found in `.data`.
#> * Column `id2` is not found.

confusion_matrices$confusion[[1]]
#> Error in eval(expr, envir, enclos): object 'confusion_matrices' not found

Thank you!

Max · April 21, 2021, 7:50pm

I think that you changed the resampling to regular V-fold (with no repeats). When that is the case, there is only one id field, so get rid of the id2 in the group_nest() call.

Liber · April 21, 2021, 8:58pm

Hi Max, thank you for the help. I had thought that might be the case, but I still get an error, albeit a slightly different one:

library(tidymodels)
library(readr)
#> 
#> Attaching package: 'readr'
#> The following object is masked from 'package:yardstick':
#> 
#>     spec
#> The following object is masked from 'package:scales':
#> 
#>     col_factor
library(Rmisc)
#> Loading required package: lattice
#> Loading required package: plyr
#> ------------------------------------------------------------------------------
#> You have loaded plyr after dplyr - this is likely to cause problems.
#> If you need functions from both plyr and dplyr, please load plyr first, then dplyr:
#> library(plyr); library(dplyr)
#> ------------------------------------------------------------------------------
#> 
#> Attaching package: 'plyr'
#> The following object is masked from 'package:purrr':
#> 
#>     compact
#> The following objects are masked from 'package:dplyr':
#> 
#>     arrange, count, desc, failwith, id, mutate, rename, summarise,
#>     summarize
#library(boot)
setwd("~/covid_survey/Cleaned_Up")
df0<- read_csv("PTSD_cleaned.csv")
#> 
#> ── Column specification ────────────────────────────────────────────────────────
#> cols(
#>   .default = col_double(),
#>   URN = col_character(),
#>   StartDate = col_character(),
#>   CompletionDate = col_character()
#> )
#> ℹ Use `spec()` for the full column specifications.


df0$PCLV_Binary_Score<-as.factor(df0$PCLV_Binary_Score)
df0$Gender<-as.factor(df0$Gender)
df0$Ethnicity<-as.factor(df0$Ethnicity)
df0$Age<-as.factor(df0$Age)
df0$Location<-as.factor(df0$Location)
df0$Income<-as.factor(df0$Income)
df0$Education<-as.factor(df0$Education)
df0$Working_From_Home<-as.factor(df0$Working_From_Home)

PTSD<-na.omit(df0)




set.seed(123)

rs <- vfold_cv(PTSD, v = 5)



lr_mod <- logistic_reg() %>% set_engine("glm")



perf_metrics <- metric_set(accuracy, sensitivity, specificity)



lr_models <-
  
  lr_mod %>%
  
  fit_resamples(PCLV_Binary_Score ~ Traditional_Time_Hrs+
                  Private_Use+
                  Public_Use+
                  Gender+
                  Ethnicity+
                  Age+
                  Location+
                  Income+
                  Education+
                  Working_From_Home+
                  COVID_Risk+
                  Anxiety_Diagnosed+
                  Depression_Diagnosed+
                  PTSD_Diagnosed+
                  Lived_Alone,
                resamples = rs,
                metrics = perf_metrics,
                control = control_resamples(save_pred = TRUE))
#> ! Fold1: preprocessor 1/1, model 1/1: glm.fit: algorithm did not converge, glm.fi...
#> ! Fold1: preprocessor 1/1, model 1/1 (predictions): prediction from a rank-defici...
#> ! Fold2: preprocessor 1/1, model 1/1: glm.fit: algorithm did not converge, glm.fi...
#> ! Fold2: preprocessor 1/1, model 1/1 (predictions): prediction from a rank-defici...
#> ! Fold3: preprocessor 1/1, model 1/1: glm.fit: algorithm did not converge, glm.fi...
#> ! Fold3: preprocessor 1/1, model 1/1 (predictions): prediction from a rank-defici...
#> ! Fold4: preprocessor 1/1, model 1/1: glm.fit: algorithm did not converge, glm.fi...
#> ! Fold4: preprocessor 1/1, model 1/1 (predictions): prediction from a rank-defici...
#> ! Fold5: preprocessor 1/1, model 1/1: glm.fit: algorithm did not converge, glm.fi...
#> ! Fold5: preprocessor 1/1, model 1/1 (predictions): prediction from a rank-defici...

collect_metrics(lr_models, summarize = FALSE)
#> # A tibble: 15 x 5
#>    id    .metric  .estimator .estimate .config             
#>    <chr> <chr>    <chr>          <dbl> <fct>               
#>  1 Fold1 accuracy binary         0.931 Preprocessor1_Model1
#>  2 Fold1 sens     binary         0.943 Preprocessor1_Model1
#>  3 Fold1 spec     binary         0.8   Preprocessor1_Model1
#>  4 Fold2 accuracy binary         0.793 Preprocessor1_Model1
#>  5 Fold2 sens     binary         0.811 Preprocessor1_Model1
#>  6 Fold2 spec     binary         0.6   Preprocessor1_Model1
#>  7 Fold3 accuracy binary         0.825 Preprocessor1_Model1
#>  8 Fold3 sens     binary         0.86  Preprocessor1_Model1
#>  9 Fold3 spec     binary         0.571 Preprocessor1_Model1
#> 10 Fold4 accuracy binary         0.825 Preprocessor1_Model1
#> 11 Fold4 sens     binary         0.868 Preprocessor1_Model1
#> 12 Fold4 spec     binary         0.25  Preprocessor1_Model1
#> 13 Fold5 accuracy binary         0.807 Preprocessor1_Model1
#> 14 Fold5 sens     binary         0.852 Preprocessor1_Model1
#> 15 Fold5 spec     binary         0     Preprocessor1_Model1
conf_mat_resampled(lr_models)
#> # A tibble: 4 x 3
#>   Prediction Truth  Freq
#>   <fct>      <fct> <dbl>
#> 1 0          0      45.6
#> 2 0          1       2.4
#> 3 1          0       7  
#> 4 1          1       2.4


print(collect)
#> function (x, ...) 
#> {
#>     UseMethod("collect")
#> }
#> <bytecode: 0x7f89fd0be760>
#> <environment: namespace:dplyr>

confusion_matrices <- 
  lr_models %>% 
  collect_predictions() %>% 
  group_nest(id) %>% 
  mutate(confusion = 
           map(PTSD, ~ conf_mat(.x, truth = PCLV_Binary_Score, estimate = .pred_class)))
#> Error in UseMethod("conf_mat"): no applicable method for 'conf_mat' applied to an object of class "character"

confusion_matrices$confusion[[1]]
#> Error in eval(expr, envir, enclos): object 'confusion_matrices' not found

Max · April 21, 2021, 9:06pm

Can you run this?

lr_models %>% 
  collect_predictions() %>% 
  str()

Liber · April 21, 2021, 9:58pm

Unfortunately not. It yields:

Error in eval(f) : 'list' object cannot be coerced to type 'double'

Also I apologise, I forgot to mention that I had to switch over to my actual dataset in the previous reprex because the errors were showing up there but not for mtcars.

Max · April 22, 2021, 5:36pm

I'm not sure what the issue is. Can you make a small, reproducible example with the same issue?

system · May 13, 2021, 5:36pm

This topic was automatically closed 21 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.