Use dplyr to do grouped t-tests and get number of observations simultaneously

Leon · February 11, 2019, 1:56pm

This works, but it seems a bit cumbersome having to create a temporary tibble with the number of observations and then join that with the test estimates. I wonder, is there a better way?

Given the data d:

library('tidyverse')
library('broom')
set.seed(354654)
d = tibble(value = rnorm(100),
       category = sample(1:5, replace = TRUE, 100),
       group = sample(c('A', 'B'), replace = TRUE, 100)) %>% 
  arrange(category)

I.e.

> d
# A tibble: 100 x 3
     value category group
     <dbl>    <int> <chr>
 1  0.596         1 B    
 2  0.0992        1 B    
 3 -1.17          1 B    
 4 -0.341         1 B    
 5  0.222         1 A    
 6  0.479         1 B    
 7 -0.155         1 A    
 8  0.921         1 B    
 9  0.795         1 B    
10  0.215         1 B    
# … with 90 more rows

I want to run five t.test calls, one per category, comparing the two levels of group, and also get the number of observations in each group. I can do this like so:

est = d %>% group_by(category) %>% do(tidy(t.test(value ~ group, data = .)))
ns = d %>% count(category, group) %>% spread(group, n)
est %>% full_join(ns, by = 'category')
# A tibble: 5 x 13
# Groups:   category [?]
  category estimate estimate1 estimate2 statistic p.value parameter conf.low conf.high method                  alternative     A     B
     <int>    <dbl>     <dbl>     <dbl>     <dbl>   <dbl>     <dbl>    <dbl>     <dbl> <chr>                   <chr>       <int> <int>
1        1    0.296    0.290   -0.00634     0.889   0.385     18.8    -0.402     0.994 Welch Two Sample t-test two.sided       9    13
2        2   -0.698   -0.668    0.0299     -1.18    0.298      4.23   -2.30      0.903 Welch Two Sample t-test two.sided       5     7
3        3    0.359    0.388    0.0292      0.801   0.435     15.6    -0.592     1.31  Welch Two Sample t-test two.sided      14    10
4        4    0.387    0.0910  -0.296       0.791   0.442     13.9    -0.664     1.44  Welch Two Sample t-test two.sided       8    13
5        5    0.271    0.232   -0.0388      0.713   0.485     18.5    -0.526     1.07  Welch Two Sample t-test two.sided       7    14

But I'd prefer not to have to create a temporary tibble and then join it.

tbradley · February 11, 2019, 5:29pm

You can use list-columns via group_by + nest to do it like this:

library('tidyverse')
library('broom')
set.seed(354654)
d = tibble(value = rnorm(100),
           category = sample(1:5, replace = TRUE, 100),
           group = sample(c('A', 'B'), replace = TRUE, 100)) %>% 
  arrange(category)

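d %>% 
  # one list-column row per category/group combination
  group_by(category, group) %>% 
  nest() %>% 
  # widen so each category has an A and a B list-column of data frames
  spread(key = group, value = data) %>% 
  mutate(
    # Welch t-test of A vs. B, tidied into a one-row tibble
    t_test = map2(A, B, ~{t.test(.x$value, .y$value) %>% tidy()}),
    # replace the nested data with the number of observations
    A = map(A, nrow),
    B = map(B, nrow)
  ) %>% 
  unnest()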
#> # A tibble: 5 x 13
#>   category     A     B estimate estimate1 estimate2 statistic p.value
#>      <int> <int> <int>    <dbl>     <dbl>     <dbl>     <dbl>   <dbl>
#> 1        1     9    13    0.296    0.290   -0.00634     0.889   0.385
#> 2        2     5     7   -0.698   -0.668    0.0299     -1.18    0.298
#> 3        3    14    10    0.359    0.388    0.0292      0.801   0.435
#> 4        4     8    13    0.387    0.0910  -0.296       0.791   0.442
#> 5        5     7    14    0.271    0.232   -0.0388      0.713   0.485
#> # ... with 5 more variables: parameter <dbl>, conf.low <dbl>,
#> #   conf.high <dbl>, method <chr>, alternative <chr>

Created on 2019-02-11 by the reprex package (v0.2.0).
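
For anyone on a newer tidyr (1.0.0 or later), where calling unnest() without naming columns is deprecated, here is a minimal sketch of the same idea that nests once per category and skips spread. It is only one possible variant, not a replacement for the code above: the names res, n_A and n_B are made up for illustration, and it assumes every category has at least two observations in each group (otherwise t.test errors). Because sample() changed in R 3.6.0, the counts may not match the output shown earlier even with the same seed.

```r
library(dplyr)
library(tidyr)
library(purrr)
library(broom)

set.seed(354654)
d <- tibble(value = rnorm(100),
            category = sample(1:5, replace = TRUE, 100),
            group = sample(c('A', 'B'), replace = TRUE, 100))

res <- d %>%
  # one row per category, with value and group packed into a list-column
  nest(data = c(value, group)) %>%
  mutate(
    # observations per group (hypothetical column names)
    n_A = map_int(data, ~ sum(.x$group == "A")),
    n_B = map_int(data, ~ sum(.x$group == "B")),
    # Welch t-test per category, tidied into a one-row tibble
    t_test = map(data, ~ tidy(t.test(value ~ group, data = .x)))
  ) %>%
  select(-data) %>%
  # name the column explicitly, as newer tidyr expects
  unnest(t_test) %>%
  arrange(category)

res
```

Using map_int rather than map keeps the counts as an integer column straight away, so nothing depends on unnest() flattening them.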

Leon · February 11, 2019, 6:08pm

Excellent @tbradley - Standing Applause
