Calculate mean over a subset of multiple specific variables, but without stating variables names

Leon · April 25, 2018, 1:16pm

So, I am interested in calculating the mean over a set of variables in a tibble, but as I have a lot of these variables, I do not want to specifically name them. The following works, but I do not like looking at it

# Dummy function for making numeric data with NAs
set_dat = function(n){
  sample(c(rnorm(n), rep(NA,n)), n, replace = TRUE)
}

# Create tibble of example data
d = tibble(smpl_id = sample(c('id1','id2','id3'), 50, replace = TRUE),
           origin  = sample(1:3, 50, replace = TRUE),
           a_group = set_dat(50),
           b_group = set_dat(50),
           c_group = set_dat(50),
           d_group = set_dat(50))

# Calculate mean of numeric variables with NAs containing 'group'
d %>%
  mutate(mean_group = d %>% select(contains('group')) %>%
                      apply(., 1, function(row){ mean(row, na.rm = TRUE) }))

Perhaps a more tidy approach exists? E.g. using rowwise() somehow? As I have been able to read, it seems that people are using various 'hacks' of rowMeans()

mishabalyasin · April 25, 2018, 3:20pm

I think, the best way is to tuck the logic into its own function and then your code will look much cleaner. But the function itself is more or less identical to what you are using except aforementioned dplyr::rowwise which does exactly what you thought it will do

library(tidyverse)
set_dat = function(n){
  sample(c(rnorm(n), rep(NA,n)), 50, replace = TRUE)
}

# Create tibble of example data
d = tibble(smpl_id = sample(c('id1','id2','id3'), 50, replace = TRUE),
           origin  = sample(1:3, 50, replace = TRUE),
           a_group = set_dat(50),
           b_group = set_dat(50),
           c_group = set_dat(50),
           d_group = set_dat(50))

# Calculate mean of numeric variables with NAs containing 'group'
# 
group_mean <- function(x, contains = "group"){
  x %>%
    select(contains(contains)) %>%
    dplyr::rowwise() %>%
    rowMeans(na.rm = TRUE) %>%
    unname()
}

res1 <- d %>%
  mutate(mean_group = group_mean(.))

res2 <- d %>%
  mutate(mean_group = d %>% select(contains('group')) %>%
           apply(., 1, function(row){ mean(row, na.rm = TRUE) }))

identical(res1, res2)
#> [1] TRUE

Created on 2018-04-25 by the reprex package (v0.2.0).

unname is needed because rowMeans adds names to the vector mean_group and then identical returns FALSE since, well, it is not identical

prosoitos · April 25, 2018, 4:40pm

Another approach (no better, just different. Maybe a little faster):

library(tidyverse)

mean_group <- d %>%
  select(contains("group")) %>%
  rowMeans(na.rm = TRUE)

bind_cols(d, mean_group = mean_group)

alistaire · April 25, 2018, 5:20pm

You don't need rowwise, since rowMeans will evaluate rowwise anyway:

library(tidyverse)
set.seed(47)

set_dat = function(n){
  sample(c(rnorm(n), rep(NA, n)), 50, replace = TRUE)
}

d = tibble(smpl_id = sample(c('id1','id2','id3'), 50, replace = TRUE),
           origin  = sample(1:3, 50, replace = TRUE),
           a_group = set_dat(50),
           b_group = set_dat(50),
           c_group = set_dat(50),
           d_group = set_dat(50))

d %>% mutate(group_mean = rowMeans(select(., contains('group')), na.rm = TRUE))
#> # A tibble: 50 x 7
#>    smpl_id origin a_group  b_group c_group d_group group_mean
#>    <chr>    <int>   <dbl>    <dbl>   <dbl>   <dbl>      <dbl>
#>  1 id3          1   0.926  NA       -0.143   2.44      1.08  
#>  2 id2          3  NA       1.10    NA      NA         1.10  
#>  3 id3          1  -0.413  -0.371   NA      NA        -0.392 
#>  4 id3          1  NA       0.0349  NA      NA         0.0349
#>  5 id2          1  -0.833  NA       NA      NA        -0.833 
#>  6 id3          2  NA      NA       NA       0.271     0.271 
#>  7 id2          1  NA      -0.419   NA      -0.329    -0.374 
#>  8 id2          1  NA      -0.263   -1.51   -0.164    -0.646 
#>  9 id2          2  NA      NA       NA       0.436     0.436 
#> 10 id3          2  NA      NA       NA       1.41      1.41  
#> # ... with 40 more rows

It—like apply—will also coerce to a matrix, which is ok in the sense that you're unlikely to take the mean of multiple types, but still possibly expensive.

There are ways to avoid coercion, though each has its flaws:

Reshape to a tidier long format.

d %>% 
    rowid_to_column('i') %>% 
    gather(group, value, contains('group')) %>% 
    group_by(i) %>% 
    mutate(group_mean = mean(value, na.rm = TRUE)) %>% 
    spread(group, value)
#> # A tibble: 50 x 8
#> # Groups:   i [50]
#>        i smpl_id origin group_mean a_group  b_group c_group d_group
#>    <int> <chr>    <int>      <dbl>   <dbl>    <dbl>   <dbl>   <dbl>
#>  1     1 id3          1     1.08     0.926  NA       -0.143   2.44 
#>  2     2 id2          3     1.10    NA       1.10    NA      NA    
#>  3     3 id3          1    -0.392   -0.413  -0.371   NA      NA    
#>  4     4 id3          1     0.0349  NA       0.0349  NA      NA    
#>  5     5 id2          1    -0.833   -0.833  NA       NA      NA    
#>  6     6 id3          2     0.271   NA      NA       NA       0.271
#>  7     7 id2          1    -0.374   NA      -0.419   NA      -0.329
#>  8     8 id2          1    -0.646   NA      -0.263   -1.51   -0.164
#>  9     9 id2          2     0.436   NA      NA       NA       0.436
#> 10    10 id3          2     1.41    NA      NA       NA       1.41 
#> # ... with 40 more rows

This is not necessarily any faster, though this data does arguably belong in long form.

Use purrr::pmap to iterate over rows.

d %>% mutate(group_mean = pmap_dbl(select(., contains('group')), 
                                   ~mean(c(...), na.rm = TRUE)))
#> # A tibble: 50 x 7
#>    smpl_id origin a_group  b_group c_group d_group group_mean
#>    <chr>    <int>   <dbl>    <dbl>   <dbl>   <dbl>      <dbl>
#>  1 id3          1   0.926  NA       -0.143   2.44      1.08  
#>  2 id2          3  NA       1.10    NA      NA         1.10  
#>  3 id3          1  -0.413  -0.371   NA      NA        -0.392 
#>  4 id3          1  NA       0.0349  NA      NA         0.0349
#>  5 id2          1  -0.833  NA       NA      NA        -0.833 
#>  6 id3          2  NA      NA       NA       0.271     0.271 
#>  7 id2          1  NA      -0.419   NA      -0.329    -0.374 
#>  8 id2          1  NA      -0.263   -1.51   -0.164    -0.646 
#>  9 id2          2  NA      NA       NA       0.436     0.436 
#> 10 id3          2  NA      NA       NA       1.41      1.41  
#> # ... with 40 more rows

This avoids coercion, but at the cost of vectorization, so this won't scale well.

Use purrr::reduce to add the variables as vectors.

d %>% mutate(group_mean = reduce(select(., contains('group')), 
                                 ~.x + coalesce(.y, 0), 
                                 .init = 0) / 
                          reduce(select(., contains('group')), 
                                 ~.x + !is.na(.y), 
                                 .init = 0))
#> # A tibble: 50 x 7
#>    smpl_id origin a_group  b_group c_group d_group group_mean
#>    <chr>    <int>   <dbl>    <dbl>   <dbl>   <dbl>      <dbl>
#>  1 id3          1   0.926  NA       -0.143   2.44      1.08  
#>  2 id2          3  NA       1.10    NA      NA         1.10  
#>  3 id3          1  -0.413  -0.371   NA      NA        -0.392 
#>  4 id3          1  NA       0.0349  NA      NA         0.0349
#>  5 id2          1  -0.833  NA       NA      NA        -0.833 
#>  6 id3          2  NA      NA       NA       0.271     0.271 
#>  7 id2          1  NA      -0.419   NA      -0.329    -0.374 
#>  8 id2          1  NA      -0.263   -1.51   -0.164    -0.646 
#>  9 id2          2  NA      NA       NA       0.436     0.436 
#> 10 id3          2  NA      NA       NA       1.41      1.41  
#> # ... with 40 more rows

This approach should be pretty efficient, as the math is vectorized and it only iterates over columns. It requires more math and programming, though, making it easier to screw up. It could be refactored to only iterate once, though the function would need to be more complicated.

Leon · April 25, 2018, 7:16pm

I think this is the cleanest of the answers provided - Awesome! Thanks for input!

Although I am still not completely satisfied with having to use d inside mutate()

Leon · April 26, 2018, 7:36am

and of course

rowMeans(X, na.rm = TRUE)

is the same as

apply(X, 1, function(x_i){ mean(x_i, na.rm = TRUE) })

Which basically is the difference between my suggestion and @alistaire's suggestion