Calculate rowwise and elementwise means over tibble of lists

Leon · May 10, 2018, 7:12am

A bit tricky - Input needed

Given a tibble my_dat created like so:

# Simple function for generating data
mk_dat = function(m, d){
  d %>% rnorm %>% list %>% rep(m) %>% return
}

# Dimensions of example data
n_row = 5
n_col = 3
depth = 10

# Create tibble example
my_dat = tibble(1:n_row) %>% select
for( var_name in letters[1:n_col] ){
  my_dat = my_dat %>% mutate(!!var_name := mk_dat(m = n_row, d = depth))
}

I.e. basically a tensor disguised as a tibble:

> my_dat
# A tibble: 5 x 3
  a          b          c         
  <list>     <list>     <list>    
1 <dbl [10]> <dbl [10]> <dbl [10]>
2 <dbl [10]> <dbl [10]> <dbl [10]>
3 <dbl [10]> <dbl [10]> <dbl [10]>
4 <dbl [10]> <dbl [10]> <dbl [10]>
5 <dbl [10]> <dbl [10]> <dbl [10]>

I want to create a new variable, which is the rowwise and elementwise mean, such that the first element of the first list in the new variable will be something like:

mean(c(my_dat$a[[1]][1], my_dat$b[[1]][1], my_dat$c[[1]][1]))

and the second element in the first list will be:

mean(c(my_dat$a[[1]][2], my_dat$b[[1]][2], my_dat$c[[1]][2]))

etc...

I could code something complicated, but am interested in input to a more clever solution?

Leon · May 10, 2018, 10:23am

I.e. more clever as in more tidy - Perhaps using map() somehow?

Leon · May 10, 2018, 2:31pm

Ok, so suffering from a complete lack of patience, I think I solved it, though it's not too pretty:

# Create dummy test data
test_dat = 1:150 %>% split(sort(. %% 15)) %>% unname

# Define tibble with dummy data
my_dat = tibble(a = test_dat[1:5],
                b = test_dat[6:10],
                c = test_dat[11:15])

# Do magic
my_dat_aug = my_dat %>%
  mutate(d = my_dat %>% select(a, b, c) %>% map(unlist) %>% bind_cols %>%
           rowMeans %>% split(sort(. %% nrow(my_dat))) %>% unname)

...and check if it worked:

> test_dat %>% head
[[1]]
 [1]  1  2  3  4  5  6  7  8  9 10

[[2]]
 [1] 11 12 13 14 15 16 17 18 19 20

[[3]]
 [1] 21 22 23 24 25 26 27 28 29 30

[[4]]
 [1] 31 32 33 34 35 36 37 38 39 40

[[5]]
 [1] 41 42 43 44 45 46 47 48 49 50

[[6]]
 [1] 51 52 53 54 55 56 57 58 59 60
> my_dat
# A tibble: 5 x 3
  a          b          c         
  <list>     <list>     <list>    
1 <int [10]> <int [10]> <int [10]>
2 <int [10]> <int [10]> <int [10]>
3 <int [10]> <int [10]> <int [10]>
4 <int [10]> <int [10]> <int [10]>
5 <int [10]> <int [10]> <int [10]>
> my_dat_aug
# A tibble: 5 x 4
  a          b          c          d         
  <list>     <list>     <list>     <list>    
1 <int [10]> <int [10]> <int [10]> <dbl [10]>
2 <int [10]> <int [10]> <int [10]> <dbl [10]>
3 <int [10]> <int [10]> <int [10]> <dbl [10]>
4 <int [10]> <int [10]> <int [10]> <dbl [10]>
5 <int [10]> <int [10]> <int [10]> <dbl [10]>
> my_dat_aug %>% pull(d)
[[1]]
 [1] 51 52 53 54 55 56 57 58 59 60

[[2]]
 [1] 61 62 63 64 65 66 67 68 69 70

[[3]]
 [1] 71 72 73 74 75 76 77 78 79 80

[[4]]
 [1] 81 82 83 84 85 86 87 88 89 90

[[5]]
 [1]  91  92  93  94  95  96  97  98  99 100

mishabalyasin · May 10, 2018, 2:38pm

Does that do what you want?

library(tidyverse)
# Simple function for generating data
mk_dat = function(m, d){
  d %>% rnorm %>% list %>% rep(m) %>% return
}

# Dimensions of example data
n_row = 5
n_col = 3
depth = 10

# Create tibble example
my_dat = tibble(1:n_row) %>% select

for( var_name in letters[1:n_col] ){
  my_dat = my_dat %>% mutate(!!var_name := mk_dat(m = n_row, d = depth))
}

my_dat %>%
  dplyr::mutate(means = purrr::pmap(., function(a, b, c){
    purrr::pmap_dbl(list(a, b, c), ~mean(..1, ..2, ..3))
  }))
#> # A tibble: 5 x 4
#>   a          b          c          means     
#>   <list>     <list>     <list>     <list>    
#> 1 <dbl [10]> <dbl [10]> <dbl [10]> <dbl [10]>
#> 2 <dbl [10]> <dbl [10]> <dbl [10]> <dbl [10]>
#> 3 <dbl [10]> <dbl [10]> <dbl [10]> <dbl [10]>
#> 4 <dbl [10]> <dbl [10]> <dbl [10]> <dbl [10]>
#> 5 <dbl [10]> <dbl [10]> <dbl [10]> <dbl [10]>

Created on 2018-05-10 by the reprex package (v0.2.0).

I'm not sure if that's the best idea to do it this way. rowMeans supports more than 2 dimensions, so you should probably try to do it that way, unless you absolutely have to stay with tibbles.

Leon · May 10, 2018, 6:43pm

Unfortunately not, but I have converged on this, which will:

map_means = function(x, na_rm = TRUE){
  if( !is_tibble(x) ){ stop("tibble expected") }
  n_col = length(x[[1]][[1]])
  x %>% unnest %>% rowMeans(na.rm = na_rm) %>%
    matrix(ncol = n_col, byrow = TRUE) %>% apply(1, list) %>% map(1) %>% 
    return
}