Large standard error of prediction from parsnip vs base R

levi.baguley · June 5, 2020, 2:54am

It seems like predict is producing a standard error that is too large. I get 0.820 with a parsnip model but 0.194 with a base R model. 0.194 seems more reasonable for a standard error, since the ends of the confidence interval lie roughly 2*0.194 above and below my prediction. What am I misunderstanding?

library(parsnip)
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union

# example data
mod_dat <- mtcars %>%
  as_tibble() %>%
  mutate(cyl_8 = as.numeric(cyl == 8)) %>%
  select(mpg, cyl_8)

parsnip_mod <- logistic_reg() %>%
  set_engine("glm") %>%
  fit(as.factor(cyl_8) ~ mpg, data = mod_dat)

base_mod <- glm(as.factor(cyl_8) ~ mpg, data = mod_dat, family = "binomial")

parsnip_pred <- tibble(mpg = 18) %>%
  bind_cols(predict(parsnip_mod, new_data = ., type = 'prob'),
            predict(parsnip_mod, new_data = ., type = 'conf_int', std_error = TRUE)) %>%
  select(!ends_with("_0"))
#> New names:
#> * lo -> lo...1
#> * hi -> hi...2
#> * lo -> lo...3
#> * hi -> hi...4

base_pred <- predict(base_mod, tibble(mpg = 18), se.fit = TRUE, type = "response") %>%
  unlist()

# these give the same prediction but different SE
parsnip_pred
#> # A tibble: 1 x 5
#>     mpg .pred_1 .pred_lower_1 .pred_upper_1 .std_error
#>   <dbl>   <dbl>         <dbl>         <dbl>      <dbl>
#> 1    18   0.614         0.230         0.895      0.820
base_pred
#>          fit.1       se.fit.1 residual.scale 
#>      0.6140551      0.1942435      1.0000000

Created on 2020-06-04 by the reprex package (v0.3.0)

toryn_stat · June 25, 2020, 2:39am

After a little digging, it appears that the standard error in the parsnip prediction is on the scale of the link function (log odds). I do not see this stated clearly in the parsnip documentation... Anyway, the two standard errors are related by the delta method: the link-scale standard error is multiplied by the derivative of the inverse link, evaluated at the prediction. For logistic regression, the standard error of the predicted probability is approximately

\hat{p}(1-\hat{p}) \times se_{link} = 0.614 \times (1 - 0.614) \times 0.820 \approx 0.194
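
One way to check this relationship is with the base R model from the question alone, since predict() for a glm can report the standard error on either scale. This is only a sketch: the names new_dat, link_pred, resp_pred, p_hat and delta_se are made up for illustration, and it does not inspect what parsnip does internally.

```r
# Refit the base R model from the question (mtcars ships with R).
mod_dat <- transform(mtcars, cyl_8 = as.numeric(cyl == 8))
base_mod <- glm(as.factor(cyl_8) ~ mpg, data = mod_dat, family = "binomial")
new_dat <- data.frame(mpg = 18)

# Standard error on the link (log-odds) scale.
link_pred <- predict(base_mod, new_dat, type = "link", se.fit = TRUE)

# Standard error on the response (probability) scale.
resp_pred <- predict(base_mod, new_dat, type = "response", se.fit = TRUE)

# Delta method: multiply the link-scale SE by the derivative of the
# inverse logit, which is p * (1 - p) at the predicted probability.
p_hat <- plogis(link_pred$fit)
delta_se <- p_hat * (1 - p_hat) * link_pred$se.fit

# Compare the three quantities side by side.
c(link_se     = unname(link_pred$se.fit),
  delta_se    = unname(delta_se),
  response_se = unname(resp_pred$se.fit))
```

If the explanation above is right, link_se should be close to the 0.820 reported by parsnip, and delta_se should agree with response_se, the 0.194 from base R.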
