Calculate mean over a subset of multiple specific variables, but without stating variables names

dplyr
tidyverse

#1

So, I am interested in calculating the mean over a set of variables in a tibble, but as I have a lot of these variables, I do not want to specifically name them. The following works, but I do not like looking at it :wink:

# Dummy function for making numeric data with NAs
set_dat = function(n){
  sample(c(rnorm(n), rep(NA,n)), n, replace = TRUE)
}

# Create tibble of example data
d = tibble(smpl_id = sample(c('id1','id2','id3'), 50, replace = TRUE),
           origin  = sample(1:3, 50, replace = TRUE),
           a_group = set_dat(50),
           b_group = set_dat(50),
           c_group = set_dat(50),
           d_group = set_dat(50))

# Calculate mean of numeric variables with NAs containing 'group'
d %>%
  mutate(mean_group = d %>% select(contains('group')) %>%
                      apply(., 1, function(row){ mean(row, na.rm = TRUE) }))

Perhaps a more tidy approach exists? E.g. using rowwise() somehow? As I have been able to read, it seems that people are using various 'hacks' of rowMeans()


#2

I think, the best way is to tuck the logic into its own function and then your code will look much cleaner. But the function itself is more or less identical to what you are using except aforementioned dplyr::rowwise which does exactly what you thought it will do :slight_smile:

library(tidyverse)
set_dat = function(n){
  sample(c(rnorm(n), rep(NA,n)), 50, replace = TRUE)
}

# Create tibble of example data
d = tibble(smpl_id = sample(c('id1','id2','id3'), 50, replace = TRUE),
           origin  = sample(1:3, 50, replace = TRUE),
           a_group = set_dat(50),
           b_group = set_dat(50),
           c_group = set_dat(50),
           d_group = set_dat(50))

# Calculate mean of numeric variables with NAs containing 'group'
# 
group_mean <- function(x, contains = "group"){
  x %>%
    select(contains(contains)) %>%
    dplyr::rowwise() %>%
    rowMeans(na.rm = TRUE) %>%
    unname()
}

res1 <- d %>%
  mutate(mean_group = group_mean(.))

res2 <- d %>%
  mutate(mean_group = d %>% select(contains('group')) %>%
           apply(., 1, function(row){ mean(row, na.rm = TRUE) }))

identical(res1, res2)
#> [1] TRUE

Created on 2018-04-25 by the reprex package (v0.2.0).

unname is needed because rowMeans adds names to the vector mean_group and then identical returns FALSE since, well, it is not identical :slight_smile:


#3

Another approach (no better, just different. Maybe a little faster):

library(tidyverse)

mean_group <- d %>%
  select(contains("group")) %>%
  rowMeans(na.rm = TRUE)

bind_cols(d, mean_group = mean_group)

#4

You don't need rowwise, since rowMeans will evaluate rowwise anyway:

library(tidyverse)
set.seed(47)

set_dat = function(n){
  sample(c(rnorm(n), rep(NA, n)), 50, replace = TRUE)
}

d = tibble(smpl_id = sample(c('id1','id2','id3'), 50, replace = TRUE),
           origin  = sample(1:3, 50, replace = TRUE),
           a_group = set_dat(50),
           b_group = set_dat(50),
           c_group = set_dat(50),
           d_group = set_dat(50))

d %>% mutate(group_mean = rowMeans(select(., contains('group')), na.rm = TRUE))
#> # A tibble: 50 x 7
#>    smpl_id origin a_group  b_group c_group d_group group_mean
#>    <chr>    <int>   <dbl>    <dbl>   <dbl>   <dbl>      <dbl>
#>  1 id3          1   0.926  NA       -0.143   2.44      1.08  
#>  2 id2          3  NA       1.10    NA      NA         1.10  
#>  3 id3          1  -0.413  -0.371   NA      NA        -0.392 
#>  4 id3          1  NA       0.0349  NA      NA         0.0349
#>  5 id2          1  -0.833  NA       NA      NA        -0.833 
#>  6 id3          2  NA      NA       NA       0.271     0.271 
#>  7 id2          1  NA      -0.419   NA      -0.329    -0.374 
#>  8 id2          1  NA      -0.263   -1.51   -0.164    -0.646 
#>  9 id2          2  NA      NA       NA       0.436     0.436 
#> 10 id3          2  NA      NA       NA       1.41      1.41  
#> # ... with 40 more rows

It—like apply—will also coerce to a matrix, which is ok in the sense that you're unlikely to take the mean of multiple types, but still possibly expensive.

There are ways to avoid coercion, though each has its flaws:

  1. Reshape to a tidier long format.
d %>% 
    rowid_to_column('i') %>% 
    gather(group, value, contains('group')) %>% 
    group_by(i) %>% 
    mutate(group_mean = mean(value, na.rm = TRUE)) %>% 
    spread(group, value)
#> # A tibble: 50 x 8
#> # Groups:   i [50]
#>        i smpl_id origin group_mean a_group  b_group c_group d_group
#>    <int> <chr>    <int>      <dbl>   <dbl>    <dbl>   <dbl>   <dbl>
#>  1     1 id3          1     1.08     0.926  NA       -0.143   2.44 
#>  2     2 id2          3     1.10    NA       1.10    NA      NA    
#>  3     3 id3          1    -0.392   -0.413  -0.371   NA      NA    
#>  4     4 id3          1     0.0349  NA       0.0349  NA      NA    
#>  5     5 id2          1    -0.833   -0.833  NA       NA      NA    
#>  6     6 id3          2     0.271   NA      NA       NA       0.271
#>  7     7 id2          1    -0.374   NA      -0.419   NA      -0.329
#>  8     8 id2          1    -0.646   NA      -0.263   -1.51   -0.164
#>  9     9 id2          2     0.436   NA      NA       NA       0.436
#> 10    10 id3          2     1.41    NA      NA       NA       1.41 
#> # ... with 40 more rows

This is not necessarily any faster, though this data does arguably belong in long form.

  1. Use purrr::pmap to iterate over rows.
d %>% mutate(group_mean = pmap_dbl(select(., contains('group')), 
                                   ~mean(c(...), na.rm = TRUE)))
#> # A tibble: 50 x 7
#>    smpl_id origin a_group  b_group c_group d_group group_mean
#>    <chr>    <int>   <dbl>    <dbl>   <dbl>   <dbl>      <dbl>
#>  1 id3          1   0.926  NA       -0.143   2.44      1.08  
#>  2 id2          3  NA       1.10    NA      NA         1.10  
#>  3 id3          1  -0.413  -0.371   NA      NA        -0.392 
#>  4 id3          1  NA       0.0349  NA      NA         0.0349
#>  5 id2          1  -0.833  NA       NA      NA        -0.833 
#>  6 id3          2  NA      NA       NA       0.271     0.271 
#>  7 id2          1  NA      -0.419   NA      -0.329    -0.374 
#>  8 id2          1  NA      -0.263   -1.51   -0.164    -0.646 
#>  9 id2          2  NA      NA       NA       0.436     0.436 
#> 10 id3          2  NA      NA       NA       1.41      1.41  
#> # ... with 40 more rows

This avoids coercion, but at the cost of vectorization, so this won't scale well.

  1. Use purrr::reduce to add the variables as vectors.
d %>% mutate(group_mean = reduce(select(., contains('group')), 
                                 ~.x + coalesce(.y, 0), 
                                 .init = 0) / 
                          reduce(select(., contains('group')), 
                                 ~.x + !is.na(.y), 
                                 .init = 0))
#> # A tibble: 50 x 7
#>    smpl_id origin a_group  b_group c_group d_group group_mean
#>    <chr>    <int>   <dbl>    <dbl>   <dbl>   <dbl>      <dbl>
#>  1 id3          1   0.926  NA       -0.143   2.44      1.08  
#>  2 id2          3  NA       1.10    NA      NA         1.10  
#>  3 id3          1  -0.413  -0.371   NA      NA        -0.392 
#>  4 id3          1  NA       0.0349  NA      NA         0.0349
#>  5 id2          1  -0.833  NA       NA      NA        -0.833 
#>  6 id3          2  NA      NA       NA       0.271     0.271 
#>  7 id2          1  NA      -0.419   NA      -0.329    -0.374 
#>  8 id2          1  NA      -0.263   -1.51   -0.164    -0.646 
#>  9 id2          2  NA      NA       NA       0.436     0.436 
#> 10 id3          2  NA      NA       NA       1.41      1.41  
#> # ... with 40 more rows

This approach should be pretty efficient, as the math is vectorized and it only iterates over columns. It requires more math and programming, though, making it easier to screw up. It could be refactored to only iterate once, though the function would need to be more complicated.


#5

I think this is the cleanest of the answers provided - Awesome! Thanks for input! :+1:

Although I am still not completely satisfied with having to use d inside mutate() :wink:


#6

and of course

rowMeans(X, na.rm = TRUE)

is the same as

apply(X, 1, function(x_i){ mean(x_i, na.rm = TRUE) })

Which basically is the difference between my suggestion and @alistaire's suggestion :+1: