Dplyr group_by mean not working and reprex

ab604 · August 30, 2018, 10:58am

I'm having trouble using dplyr group_by to do a grouped mean and not sure what is going on.

When I create this reprex to perform a grouped mean it works exactly as usual and as I want it to:

library(tidyverse)
set.seed(2095)
df <- tibble(prot_id = LETTERS[1:10],
             S1 = rnorm(10,30,1),
             S2 = rnorm(10,30,1),
             S3 = rnorm(10,30,1))

df
#> # A tibble: 10 x 4
#>    prot_id    S1    S2    S3
#>    <chr>   <dbl> <dbl> <dbl>
#>  1 A        29.5  28.9  30.6
#>  2 B        29.5  32.4  30.3
#>  3 C        31.4  30.5  30.3
#>  4 D        30.3  28.4  30.9
#>  5 E        29.6  29.7  29.1
#>  6 F        30.5  30.9  29.0
#>  7 G        28.8  31.7  29.5
#>  8 H        28.6  30.3  30.8
#>  9 I        28.8  30.0  30.1
#> 10 J        30.3  29.5  29.0

df %>% group_by(prot_id) %>%
  mutate(mean_s = mean(c(S1,S2,S3)))
#> # A tibble: 10 x 5
#> # Groups:   prot_id [10]
#>    prot_id    S1    S2    S3 mean_s
#>    <chr>   <dbl> <dbl> <dbl>  <dbl>
#>  1 A        29.5  28.9  30.6   29.7
#>  2 B        29.5  32.4  30.3   30.7
#>  3 C        31.4  30.5  30.3   30.8
#>  4 D        30.3  28.4  30.9   29.9
#>  5 E        29.6  29.7  29.1   29.5
#>  6 F        30.5  30.9  29.0   30.2
#>  7 G        28.8  31.7  29.5   30.0
#>  8 H        28.6  30.3  30.8   29.9
#>  9 I        28.8  30.0  30.1   29.6
#> 10 J        30.3  29.5  29.0   29.6

mean(c(df$S1[1],df$S2[1],df$S3[1]))
#> [1] 29.66834
mean(c(df$S1[2],df$S2[2],df$S3[2]))
#> [1] 30.72519

Created on 2018-08-30 by the reprex package (v0.2.0).

However, when I run exactly the same code in RStudio I get this:


> df %>% group_by(prot_id) %>%
+   mutate(mean_s = mean(c(S1,S2,S3)))
# A tibble: 10 x 5
# Groups:   prot_id [10]
   prot_id    S1    S2    S3 mean_s
   <chr>   <dbl> <dbl> <dbl>  <dbl>
 1 A        29.5  28.9  30.6   30.0
 2 B        29.5  32.4  30.3   30.0
 3 C        31.4  30.5  30.3   30.0
 4 D        30.3  28.4  30.9   30.0
 5 E        29.6  29.7  29.1   30.0
 6 F        30.5  30.9  29.0   30.0
 7 G        28.8  31.7  29.5   30.0
 8 H        28.6  30.3  30.8   30.0
 9 I        28.8  30.0  30.1   30.0
10 J        30.3  29.5  29.0   30.0

I've done this many times before, but today it's not working. Does anyone have any ideas what I'm doing wrong?

jdlong · August 30, 2018, 11:32am

Hmm, this is odd. It looks almost like a rounding issue, but I would expect that to affect all the numbers, not just the mean_s column.

Can you try this and see if you get the same thing:

df %>% group_by(prot_id) %>%
  mutate(mean_s = mean(c(S1,S2,S3))) ->
  out_df

as.data.frame( out_df )

FWIW, I don't get the rounding on rstudio.cloud.

ab604 · August 30, 2018, 11:57am

Thanks James. This is what I get:

> df %>% group_by(prot_id) %>%
+   mutate(mean_s = mean(c(S1,S2,S3))) ->
+   out_df
> 
> as.data.frame( out_df )
   prot_id       S1       S2       S3   mean_s
1        A 29.46594 28.92563 30.61345 29.98498
2        B 29.47469 32.36655 30.33432 29.98498
3        C 31.43226 30.52121 30.34889 29.98498
4        D 30.30061 28.39447 30.93854 29.98498
5        E 29.62858 29.72010 29.11866 29.98498
6        F 30.50273 30.91271 29.03644 29.98498
7        G 28.77819 31.73520 29.47392 29.98498
8        H 28.61218 30.34214 30.82327 29.98498
9        I 28.83311 30.00083 30.07306 29.98498
10       J 30.33777 29.50011 29.00388 29.98498
>

Probably irrelevant, but the only difference today doing this and other days is that I'm running another process in the background outside of R that is using 125 GB of my 128 GB of RAM.

JohnMount · August 30, 2018, 11:58am

I am not sure whether `mean(c(S1, S2, S3))` is correct `dplyr` (I would have thought it would be `mean(S1, S2, S3)`, but no error is being thrown, so the notation seems to be allowed). That being said, I get a bunch of weird results with this code pattern.

library("dplyr")
packageVersion("dplyr")
set.seed(2095)
df <- tibble(prot_id = LETTERS[1:10],
             S1 = sample(0:1, 10, replace = TRUE),
             S2 = sample(0:1, 10, replace = TRUE),
             S3 = sample(0:1, 10, replace = TRUE))
ls()
# [1] "df"

df %>% group_by(prot_id) %>% mutate(mean_s = mean(S1, S2, S3))
# # A tibble: 10 x 5
# # Groups:   prot_id [10]
#    prot_id    S1    S2    S3 mean_s
#    <chr>   <int> <int> <int>  <dbl>
#  1 A           0     1     0      0
#  2 B           0     1     1      0
#  3 C           0     0     1      0
#  4 D           0     1     0      0
#  5 E           1     0     1      1
#  6 F           1     0     0      1
#  7 G           1     0     0      1
#  8 H           1     1     0      1
#  9 I           0     1     0      0
# 10 J           0     0     0      0
ls()
# [1] "df"

df %>% group_by(prot_id) %>% mutate(mean_s = mean(c(S1, S2, S3)))
# A tibble: 10 x 5
# Groups:   prot_id [10]
#    prot_id    S1    S2    S3 mean_s
#    <chr>   <int> <int> <int>  <dbl>
#  1 A           0     1     0  0.333
#  2 B           0     1     1  0.667
#  3 C           0     0     1  0.333
#  4 D           0     1     0  0.333
#  5 E           1     0     1  0.667
#  6 F           1     0     0  0.333
#  7 G           1     0     0  0.333
#  8 H           1     1     0  0.667
#  9 I           0     1     0  0.333
# 10 J           0     0     0  0
ls()
# [1] "df"

mean(0:1)
# [1] 0.5
str(0:1)
# int [1:2] 0 1

Sorry about the incorrect note above; I did not mean to mislead. I am leaving it up to avoid further confusion. Obviously only mean(c(S1, S2, S3)) is the correct notation (in mean(S1, S2, S3), S2 and S3 are lost in the ...).

hadley · August 30, 2018, 12:17pm

mean() is a base R function that takes a single vector x, with ... being passed to methods. I'm not sure why you think it would have a different syntax within dplyr.

Your code simply returns the mean of a single number; you'll notice that mean_s is identical to S1 in your code.
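
A minimal base R sketch of the point above, using made-up numbers rather than the thread's data. In `mean.default()`, the second and third positional arguments are `trim` and `na.rm`, so extra values passed positionally are never averaged:

```r
# Three scalars passed positionally: only the first one is treated as data.
# The 4 is matched to `trim` and the 6 to `na.rm` in mean.default().
positional <- mean(2, 4, 6)
positional
#> [1] 2

# Combining the values into one vector averages all of them.
combined <- mean(c(2, 4, 6))
combined
#> [1] 4
```

With one row per group, `S1`, `S2` and `S3` are each a single number inside `mutate()`, which is why `mean(S1, S2, S3)` reproduces `S1`.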

ab604 · August 30, 2018, 12:17pm

Thanks John and James, that was really helpful. I think I must have had a conflicting package loaded that caused the problem: I started a new session to run your code, John, and everything started working.

Take home lessons being to have the conflicted package loaded and that I don't need c() in my grouped means.

hadley · August 30, 2018, 12:18pm

Regardless of how I run it, I get the same results:

library(tidyverse)
set.seed(2095)
df <- tibble(
  prot_id = LETTERS[1:10],
  S1 = rnorm(10,30,1),
  S2 = rnorm(10,30,1),
  S3 = rnorm(10,30,1)
)

df %>% 
  group_by(prot_id) %>%
  mutate(
    mean1 = mean(c(S1,S2,S3)),
    mean2 = (S1 + S2 + S3) / 3,
    diff = mean1 - mean2
  )
#> # A tibble: 10 x 7
#> # Groups:   prot_id [10]
#>    prot_id    S1    S2    S3 mean1 mean2      diff
#>    <chr>   <dbl> <dbl> <dbl> <dbl> <dbl>     <dbl>
#>  1 A        29.5  28.9  30.6  29.7  29.7 -3.55e-15
#>  2 B        29.5  32.4  30.3  30.7  30.7 -3.55e-15
#>  3 C        31.4  30.5  30.3  30.8  30.8  0.      
#>  4 D        30.3  28.4  30.9  29.9  29.9  0.      
#>  5 E        29.6  29.7  29.1  29.5  29.5  0.      
#>  6 F        30.5  30.9  29.0  30.2  30.2  0.      
#>  7 G        28.8  31.7  29.5  30.0  30.0  3.55e-15
#>  8 H        28.6  30.3  30.8  29.9  29.9  0.      
#>  9 I        28.8  30.0  30.1  29.6  29.6  0.      
#> 10 J        30.3  29.5  29.0  29.6  29.6  0.

Are you sure there isn't something weird about your interactive session?

mara · August 30, 2018, 12:19pm

That's incorrect. You do need the c(), see Hadley's comment, above.

This is why John got the correct results in the second run with 0s and 1s, and not in the first.

ab604 · August 30, 2018, 12:19pm

Oh, I do need c(). Thank you.

I'll mark this as resolved. What threw me most was that the reprex worked, but my session didn't.

hadley · August 30, 2018, 12:20pm

TRUST IN REPREX. It knows best

mara · August 30, 2018, 12:22pm

Yeah, reprex runs in a "clean" session, so if it's working and local isn't, look for either local conflicting variables, or package conflicts!
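
A rough sketch of how one might check for this, assuming dplyr is installed. The `toy` tibble and its values are invented for illustration. It shows that a constant `mean_s` is what you get when the grouping is ignored, and how to see where the bare name `mutate` resolves. One plausible culprit is another package's `mutate()` (for example plyr's, attached after dplyr) masking the dplyr one, but that is only a guess; the thread never identifies the package.

```r
library(dplyr)

# Tiny deterministic stand-in for the thread's data (values are made up).
toy <- tibble(
  prot_id = c("A", "B"),
  S1 = c(1, 3),
  S2 = c(2, 4),
  S3 = c(3, 5)
)

# Grouping respected: one mean per prot_id.
grouped <- toy %>%
  group_by(prot_id) %>%
  dplyr::mutate(mean_s = mean(c(S1, S2, S3))) %>%
  ungroup()
grouped$mean_s
#> [1] 2 4

# Grouping absent or ignored: every row gets the mean of all six values,
# which is the same symptom as the constant column in the problem session.
ungrouped <- toy %>%
  dplyr::mutate(mean_s = mean(c(S1, S2, S3)))
ungrouped$mean_s
#> [1] 3 3

# Which attached environments define `mutate`? More than one entry means
# the first one listed is masking the others.
find("mutate")

# Which package does the bare name currently resolve to?
environmentName(environment(mutate))
```

Writing `dplyr::mutate()` explicitly, as above, sidesteps whatever happens to be first on the search path.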

JohnMount · August 30, 2018, 12:22pm

Oops, I see my error. mean(S1, S2, S3) is just a poor way to have written mean(S1) (and not a dplyr issue). Sorry about that.

Is this something strict checks for (populating the ... arguments with non-trivial data)?

hadley · August 30, 2018, 12:28pm

This specific problem is explored in ellipsis
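
To illustrate the idea only, here is a hand-rolled base R sketch, not the ellipsis API. `strict_mean()` is a hypothetical helper that refuses any extra arguments instead of silently swallowing them. It assumes R 3.5.0 or later for `...length()`.

```r
# Hypothetical wrapper: not part of base R, dplyr, or ellipsis.
strict_mean <- function(x, ...) {
  # Fail loudly if anything was passed beyond `x`.
  if (...length() > 0) {
    stop("unexpected extra arguments; did you mean mean(c(...))?")
  }
  mean(x)
}

ok <- strict_mean(c(2, 4, 6))
ok
#> [1] 4

# The mistaken call now errors instead of quietly returning 2.
msg <- tryCatch(
  strict_mean(2, 4, 6),
  error = function(e) conditionMessage(e)
)
msg
```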

jonspring · August 30, 2018, 2:03pm

This particular specification of the problem in your reprex is an example of "average of each row" rather than a "grouped mean," since the groups are already fully specified by their row. For this set of data, where each prot_id exists in only one row, the group_by(prot_id) part is redundant.

It's possible to get the same result by using a grouped mean if you first gather your data into "tidy data" form, and then summarize for the mean.

df %>%
  gather(S_type, value, S1:S3) %>%
  group_by(prot_id) %>%
  summarise(mean_s = mean(value))
#> # A tibble: 10 x 2
#>    prot_id mean_s
#>    <chr>    <dbl>
#>  1 A         29.7
#>  2 B         30.7
#>  3 C         30.8
#>  4 D         29.9
#>  5 E         29.5
#>  6 F         30.2
#>  7 G         30.0
#>  8 H         29.9
#>  9 I         29.6
#> 10 J         29.6
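
For the "average of each row" reading, a base R alternative is `rowMeans()`, which needs no grouping or reshaping. This is a sketch on made-up values with the thread's column names, not the thread's actual data:

```r
# Invented values; only the column names follow the thread.
df_toy <- data.frame(
  prot_id = c("A", "B"),
  S1 = c(1, 3),
  S2 = c(2, 4),
  S3 = c(3, 5)
)

# Average across the three S columns within each row.
df_toy$mean_s <- unname(rowMeans(df_toy[, c("S1", "S2", "S3")]))
df_toy$mean_s
#> [1] 2 4
```

In tidyr 1.0.0 and later, `gather()` is superseded by `pivot_longer()`, so the reshaping approach above may look slightly different in newer code.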