Calculate sd across a subset of columns based on a regex

Leon · January 22, 2019, 8:12pm

Okay, so I'm looking for a tidy solution to the following:

Given a tibble d:

library('tidyverse')
set.seed(733744)
n = 10
d = tibble(x1 = rnorm(n), x2 = rnorm(n),
           y1 = rnorm(n), y2 = rnorm(n),
           z1 = rnorm(n), z2 = rnorm(n))

I.e.

> d
# A tibble: 10 x 6
       x1     x2       y1      y2     z1       z2
    <dbl>  <dbl>    <dbl>   <dbl>  <dbl>    <dbl>
 1  1.40  -1.59   0.0458  -0.558   0.484  0.794  
 2  1.24   0.124 -0.0210  -1.57   -0.234 -2.30   
 3 -0.234 -1.93   0.804    0.845  -1.90   0.00116
 4  0.549  1.12  -0.221   -0.421   0.169  1.11   
 5  0.633 -0.140  0.00652 -0.200   0.202  1.12   
 6 -0.257  0.963 -1.86    -0.208   0.237  0.544  
 7  0.283 -0.152  1.47     0.423   0.747  0.518  
 8  2.31  -1.46  -0.908    0.603  -0.506  0.850  
 9 -0.616  0.165  0.651   -0.0481 -1.05  -0.619  
10  0.559 -1.18   0.878   -1.19   -1.91   1.02

I want to calculate the row-wise sd() across columns x and y, but NOT z. The following works, but is not tidy IMHO:

# Equivalent to rowMeans, but for sd
rowSds = function(x){ return(apply(x, 1, sd)) }

# Calculate sd across columns, where the column name contain x or y
d %>% mutate(sd_xy = d %>% select(matches("x|y")) %>% rowSds)

So I am looking for something like this, but without having to write out the variable names:

d %>% rowwise %>% mutate(sd_xy = c(x1, x2, y1, y2) %>% sd) %>% ungroup

There must be some NSE way to do this elegantly?

lemairev · January 22, 2019, 9:33pm

I do believe there is a better way than that...

d %>% 
 select(matches("x|y")) %>% 
 rowwise %>% 
 do(data.frame(., sd_xy = sd(unlist(.))))

It appears easier to use apply here (at least for me)!

Leon · January 23, 2019, 7:41am

Thanks for input, but still not really that tidy

lemairev · January 23, 2019, 1:59pm

too bad! I'm probably not enough familiar with tidy meaning

last attempt based on that

d %>% mutate(sd_xy = pmap_dbl(select(., matches("x|y")), lift_vd(sd)))

thank you for the question, I'm discovering some interesting stuff

tbradley · January 23, 2019, 2:33pm

So here is a method that I would consider "tidy" in terms of it being all of tidyverse functions, but it is certainly more verbose than your current solution. However, if you define it in a function as shown then it is not too bad, IMO:

library('tidyverse')
set.seed(733744)
n = 10
d = tibble(x1 = rnorm(n), x2 = rnorm(n),
           y1 = rnorm(n), y2 = rnorm(n),
           z1 = rnorm(n), z2 = rnorm(n))


rowwise_sd <- function(data, ...){
  my_cols <- enquos(...)
  data %>% 
    group_by(row = row_number()) %>% 
    nest() %>% 
    mutate(row_sd = map(data, ~{
      .x %>% 
        dplyr::select(!!!my_cols) %>% 
        gather(key = col, value = value) %>% 
        summarize(sd = sd(value))
    })) %>% 
    unnest() %>% 
    select(-row)
}


d %>% rowwise_sd(x1, x2, y1, y2)

#> # A tibble: 10 x 7
#>        x1     x2       y1      y2     z1       z2    sd
#>     <dbl>  <dbl>    <dbl>   <dbl>  <dbl>    <dbl> <dbl>
#>  1  1.40  -1.59   0.0458  -0.558   0.484  0.794   1.25 
#>  2  1.24   0.124 -0.0210  -1.57   -0.234 -2.30    1.16 
#>  3 -0.234 -1.93   0.804    0.845  -1.90   0.00116 1.30 
#>  4  0.549  1.12  -0.221   -0.421   0.169  1.11    0.710
#>  5  0.633 -0.140  0.00652 -0.200   0.202  1.12    0.382
#>  6 -0.257  0.963 -1.86    -0.208   0.237  0.544   1.16 
#>  7  0.283 -0.152  1.47     0.423   0.747  0.518   0.688
#>  8  2.31  -1.46  -0.908    0.603  -0.506  0.850   1.69 
#>  9 -0.616  0.165  0.651   -0.0481 -1.05  -0.619   0.525
#> 10  0.559 -1.18   0.878   -1.19   -1.91   1.02    1.11

d %>% rowwise_sd(matches("x|y"))
#> # A tibble: 10 x 7
#>        x1     x2       y1      y2     z1       z2    sd
#>     <dbl>  <dbl>    <dbl>   <dbl>  <dbl>    <dbl> <dbl>
#>  1  1.40  -1.59   0.0458  -0.558   0.484  0.794   1.25 
#>  2  1.24   0.124 -0.0210  -1.57   -0.234 -2.30    1.16 
#>  3 -0.234 -1.93   0.804    0.845  -1.90   0.00116 1.30 
#>  4  0.549  1.12  -0.221   -0.421   0.169  1.11    0.710
#>  5  0.633 -0.140  0.00652 -0.200   0.202  1.12    0.382
#>  6 -0.257  0.963 -1.86    -0.208   0.237  0.544   1.16 
#>  7  0.283 -0.152  1.47     0.423   0.747  0.518   0.688
#>  8  2.31  -1.46  -0.908    0.603  -0.506  0.850   1.69 
#>  9 -0.616  0.165  0.651   -0.0481 -1.05  -0.619   0.525
#> 10  0.559 -1.18   0.878   -1.19   -1.91   1.02    1.11

Created on 2019-01-23 by the reprex package (v0.2.0).

Leon · January 23, 2019, 6:40pm

This... This is why this is an awesome forum! Thanks for the discussion @lemairev and @tbradley

I'm going with @lemairev's suggestion as a solution, which I definitely think is tidy - Excellent!

Leon · January 30, 2019, 6:40pm

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.