Computing confidence intervals with dplyr

Andrea · May 28, 2019, 5:25pm

Hi, all, long time no see!

A quick question: when you analyze a new data frame, and you want to summarise it including t-statistics confidence intervals for numeric variables, what functions/packages do you use? My old workflow, largely copied from

is:

library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union

test <- data.frame(
           n = c(298, 298, 298, 298, 298, 298, 298, 298, 298, 298, 1, 1, 1, 1,
                 1, 1, 1, 1, 1, 1, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3),
         run = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10,
                 1, 2, 3, 4, 5, 6, 7, 8, 9, 10),
         mAP = c(0.8112, 0.8006, 0.8076, 0.7999, 0.8067, 0.8046, 0.8004, 0.799,
                 0.8052, 0.8002, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0.8333, 0.8333,
                 0.8333, 1, 0.8333, 1, 1, 0.8333, 1, 1)
)

lower_ci <- function(mean, se, n, conf_level = 0.95){
  lower_ci <- mean - qt(1 - ((1 - conf_level) / 2), n - 1) * se
}
upper_ci <- function(mean, se, n, conf_level = 0.95){
  upper_ci <- mean + qt(1 - ((1 - conf_level) / 2), n - 1) * se
}

foobar <- test %>%
 group_by(n) %>%
  summarise(smean = mean(mAP, na.rm = TRUE),
            ssd = sd(mAP, na.rm = TRUE)) %>%
  mutate(se = ssd / sqrt(n),
         lower_ci = lower_ci(smean, se, n),
         upper_ci = upper_ci(smean, se, n))
#> Warning in qt(1 - ((1 - conf_level)/2), n - 1): Si è prodotto un NaN

#> Warning in qt(1 - ((1 - conf_level)/2), n - 1): Si è prodotto un NaN

^{Created on 2019-05-28 by the reprex package (v0.3.0)}

but I've been away from R for a while, and I may have missed/forgotten better ways to do it. What do you think? Would you modify/improve something? Would you use a completely different approach altogether? Let me know!

joels · May 29, 2019, 4:19am

In your example, n is a group identifier, but then you also use it as the number of observations. It seems like the code for your example should use the actual number of observations in each group as the "n" argument in the CI functions, rather than the n column in the data frame. For example:

foobar <- test %>%
  group_by(n) %>%
  summarise(smean = mean(mAP, na.rm = TRUE),
            ssd = sd(mAP, na.rm = TRUE),
            count = n()) %>%
  mutate(se = ssd / sqrt(count),
         lower_ci = lower_ci(smean, se, count),
         upper_ci = upper_ci(smean, se, count))

foobar

      n smean     ssd count      se lower_ci upper_ci
1     1 1     0          10 0          1        1    
2     3 0.917 0.0879     10 0.0278     0.854    0.980
3   298 0.804 0.00412    10 0.00130    0.801    0.806

If you want to use a function in a pre-existing package, you could use mean_cl_normal from ggplot2 (mean_cl_normal is wrapper around Hmisc::smean.cl.normal()) as a quick way to achieve something similar. Other than returning the upper and lower confidence limits with a single function call, Hmisc::smean.cl.normal is using the same method to calculate the confidence limits.

mean_cl_normal uses y, ymin, and ymax as the names for the mean and confidence limits, respectively, so I've also renamed them. Use mean_cl_boot instead of mean_cl_normal if you want bootstrapped confidence intervals instead of confidence limits that assume normality.

library(tidyverse) 

test %>% 
  group_by(n) %>% 
  summarise(ci = list(mean_cl_normal(mAP) %>% 
                        rename(mean=y, lwr=ymin, upr=ymax))) %>% 
  unnest

      n  mean   lwr   upr
1     1 1     1     1    
2     3 0.917 0.854 0.980
3   298 0.804 0.801 0.806

You could also use the Hmisc functions directly. Hmisc::smean.cl.normal returns a named vector, rather than a data frame, so the code to get wide output is slightly different:

test %>% 
  group_by(n) %>% 
  summarise(ci = list(enframe(Hmisc::smean.cl.normal(mAP)))) %>% 
  unnest %>% 
  spread(name, value)

      n Lower  Mean Upper
1     1 1     1     1    
2     3 0.854 0.917 0.980
3   298 0.801 0.804 0.806

Andrea · May 29, 2019, 7:43am

n your example, n is a group identifier, but then you also use it as the number of observations.

Yep! Buggity bug I found out later, but I was too tired to get online again and fix it. I would have done it today. Thanks for catching it!

If you want to use a function in a pre-existing package, you could use mean_cl_normal from ggplot2 ( mean_cl_normal is wrapper around Hmisc::smean.cl.normal() )

Nice! But your code is a bit cryptic for my limited pipe expansion skills

 summarise(ci = list(mean_cl_normal(mAP) %>% 
                        rename(mean=y, lwr=ymin, upr=ymax))) %>% 
  unnest

If I understand correctly, you are:

callling mean_cl_normal(mAP), which returns a list and not a vector
you rename the components of the list with rename
now you have a nested dataframe, where a column is a column of lists. Thus you use unnest to create 3 columns from the components of those columns. Right?

Andrea · May 29, 2019, 8:19am

PS I wrote a generic function using your code, in case other people may find it useful. It's untested!

library(dplyr)
library(ggplot2)

ci <- function(dataframe, summary_var, group_var = NULL){
  summary_var <- enquo(summary_var)
  
  if (!rlang::quo_is_null(group_var)) {
    group_var    <- enquo(group_var)
    dataframe <- dataframe %>% group_by(!! group_var)
  }  
  
  summary_df <- dataframe %>%
    summarise(ci = list(mean_cl_normal(!! summary_var) %>% 
                        rename(mean = y, lower_ci = ymin, upper_ci = ymax))) %>% 
  tidyr::unnest
  }

joels · May 29, 2019, 2:15pm

Just one change to your summary:

mean_cl_normal returns a data frame, but it needs to be wrapped (nested) in a list so that summarise returns a single summary value (the list element).

Andrea · June 5, 2019, 2:15pm

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.