Successively combining data across groups (need successive pooled standard deviations)

jefriedel · May 30, 2020, 6:19pm

I am trying to calculate pooled standard deviations across successive groups. I can get the means and standard deviations per group with summarize. I can also get simple differences between successive groups on the summary table with a lag.

library(tidyverse)

data = tibble(
  groups = c(1,1,2,2,3,3),
  vals = (c(1,2,10,20,100,200))
) %>%
  group_by(groups)

summary = data %>% 
  summarize(mean = mean(vals),
            sd = sd(vals)) %>%
  mutate(mean_diff = mean-lag(mean))

Ideally, the solution will get the successive pooled standard deviations to the summary table included above.The pooled standard deviations I would need (from the example data) are:

#Group 1 and 2 pooled
sd(data %>% filter(between(groups,1,2)) %>% pull(vals))

#Group 2 and 3 pooled
sd(data %>% filter(between(groups,2,3)) %>% pull(vals))

It seems like the data should be nested in some manner and the standard deviation could be obtained on the nested data. However, the data for my middle groups (in this case group 2) need to be nested more than once because these groups go into more than one pooled standard deviation. This type of multiple nesting does not seem possible.

Any suggestions are greatly appreciated.

nirgrahamuk · May 30, 2020, 7:11pm

you seem to ask how to calculate something, then you provide code that calculates it. Then your question becomes how to put it in a single table...
you could simply do the latter like this

#Group 1 and 2 pooled
g12_pooled_sd <- sd(data %>% filter(between(groups,1,2)) %>% pull(vals))

#Group 2 and 3 pooled
g23_pooled_sd <- sd(data %>% filter(between(groups,2,3)) %>% pull(vals))

summary$pooled_sd <- c(NA,g12_pooled_sd,g23_pooled_sd)

summary
 A tibble: 3 x 5
  groups  mean     sd mean_diff pooled_sd
   <dbl> <dbl>  <dbl>     <dbl>     <dbl>
1      1   1.5  0.707      NA       NA   
2      2  15    7.07       13.5      8.81
3      3 150   70.7       135       88.1

If you can explain what I might have missed, I can give it another look for you.

jefriedel · May 30, 2020, 7:21pm

Your point is valid, sorry for the lack of clarity on my part. I kept my example data to three groups, just enough to illustrate my question. The solution I provided on calculating the pooled SDs was just to show what the end result would be for those two specific pairwise groups. My real data has a few thousand rows, several dozen groups, and the number of rows per group is variable, so it is impractical to hand calculate every pairwise pooled standard deviation.

nirgrahamuk · May 30, 2020, 7:42pm

what can be done once can be repeated programatically with the right tools...

library(tidyverse)

(data = tibble(
  groups = c(1,1,2,2,3,3),
  vals = (c(1,2,10,20,100,200))
) %>%
  group_by(groups))

(summary = data %>% 
  summarize(mean = mean(vals),
            sd = sd(vals)) %>%
  mutate(mean_diff = mean-lag(mean)))

total_groups_to_do <- unique(data$groups) %>% sort
lead_total_groups_to_do <- lead(total_groups_to_do)

gtodo <- tibble(g1=total_groups_to_do,
                             g2=lead_total_groups_to_do) %>% na.omit()

library(purrr)
pooled_sd_results <- map2_dbl(.x=gtodo$g1,
                          .y=gtodo$g2,
                          .f = ~ sd(data %>% 
                                      filter(between(groups,.x,.y)) %>% 
                                      pull(vals)))


summary$pooled_sd <- c(NA,pooled_sd_results)

summary

jefriedel · May 30, 2020, 8:42pm

Ahh, perfect! I got too wrapped up trying to solve it with a dplyr based solution. I appreciate the help.

system · June 6, 2020, 8:42pm

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.