summarize anomaly

dplyr

#1

I have a problem getting summarize() to summarize by group correctly for a series of summaries of the same data.frame. The following is reproducible, but leads to an incorrect result.

library(dplyr)
testcase <- data.frame(StudyID = sample(LETTERS[1:10], 100, 
              replace = TRUE), 
              intercept = rnorm(100), stepchg = rnorm(100))
testcase %>% group_by(StudyID) %>% summarize(
  Intcpt=median(intercept,na.rm=T), MAD_Intcpt=mad(intercept,na.rm=T),
  stepchg=median(stepchg, na.rm=T), madstepchg=mad(stepchg, na.rm = T))
testcase %>% group_by(StudyID) %>% summarize(
  madstepchg=mad(stepchg, na.rm = T))

My results are below. Note that when I ask for four summaries, the first three are fine, but the last one just gives 0s. If I ask for the last summary by itself, the result is correct. I am not sure whether I have made a dumb mistake, or whether there is a bug in the summarize() code. I don't want to report a bug until I am sure that the problem is not just a stupid mistake on my part. (The original data has some NAs, so I have to include "na.rm = T", but the result is the same if I leave that out.)

> testcase %>% group_by(StudyID) %>% summarize(
+   Intcpt=median(intercept,na.rm=T), MAD_Intcpt=mad(intercept,na.rm=T),
+   stepchg=median(stepchg, na.rm=T), madstepchg=mad(stepchg, na.rm = T))
# A tibble: 10 x 5
   StudyID  Intcpt MAD_Intcpt  stepchg madstepchg
   <fct>     <dbl>      <dbl>    <dbl>      <dbl>
 1 A       -0.392       0.846 -0.118            0
 2 B        0.0805      1.51   1.22             0
 3 C        0.0362      1.06  -0.585            0
 4 D       -0.0266      0.410  0.263            0
 5 E        0.370       1.66  -0.272            0
 6 F       -0.324       1.27  -0.181            0
 7 G        0.450       0.197  0.240            0
 8 H       -0.390       0.741  0.0800           0
 9 I        0.427       0.536 -0.00189          0
10 J        0.0361      0.637 -0.0393           0
> testcase %>% group_by(StudyID) %>% summarize(
+   madstepchg=mad(stepchg, na.rm = T))
# A tibble: 10 x 2
   StudyID madstepchg
   <fct>        <dbl>
 1 A            0.734
 2 B            0.863
 3 C            0.448
 4 D            1.36 
 5 E            0.777
 6 F            0.769
 7 G            0.715
 8 H            0.757
 9 I            0.684
10 J            0.719

Thanks in advance to anyone that can help me fix the above.
Larry Hunsicker


#2

Hey @lhunsicker! My hunch here is that this is neither a bug nor a stupid mistake: summarise() may, like mutate(), evaluate its arguments sequentially. That means that by the time madstepchg is evaluated, its reference to stepchg may point to the summary you just created (the per-group median), not the original stepchg column.

Because summarise(), unlike mutate(), returns a vector of length 1 for each of its calculated summaries, you're essentially running mad() on a single-element vector, which I'm guessing would have a median absolute deviation of 0.
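
Here is a quick base R check of that guess. It is only a sketch: one_value is a made-up number standing in for a per-group median, not anything from your data.

```r
# Hypothetical single value, standing in for one group's median
one_value <- 0.263

# mad() is a scaled median of abs(x - median(x)); with a single value
# the only deviation is 0, so the result is 0
mad_of_one <- mad(one_value)
print(mad_of_one)

# sd() divides by n - 1, so a single value gives NA instead of 0
sd_of_one <- sd(one_value)
print(sd_of_one)
```

So a column of 0s from mad() (or NAs from sd()) is a hint that the function saw one value per group instead of the whole column.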

If my hunch is right, the easiest way around this is going to be to assign your median() and mad() summaries of stepchg to new, separate names, like:

testcase %>%
  group_by(StudyID) %>%
  summarize(
    Intcpt = median(intercept,na.rm=T),
    MAD_Intcpt = mad(intercept,na.rm=T),
    stepchg_median = median(stepchg, na.rm=T),
    stepchg_mad = mad(stepchg, na.rm = T))

The critical thing is that the median() summary doesn't have the same name as the original column, so that the mad() summary can still refer to the original stepchg. Does that work?
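
If you would rather keep the name stepchg for the median, another option (again assuming the sequential-evaluation hunch holds) is to compute mad() before the summary that overwrites the column. This is a minimal sketch on simulated data; the seed and the names reordered and reference are only for illustration.

```r
library(dplyr)

set.seed(1)  # arbitrary seed, only so the sketch is repeatable
testcase <- data.frame(
  StudyID = sample(LETTERS[1:10], 100, replace = TRUE),
  stepchg = rnorm(100)
)

# mad() comes first, so stepchg still refers to the original column here;
# the median then overwrites the name afterwards
reordered <- testcase %>%
  group_by(StudyID) %>%
  summarize(
    madstepchg = mad(stepchg, na.rm = TRUE),
    stepchg = median(stepchg, na.rm = TRUE)
  )

# Reference values computed outside dplyr, for comparison
reference <- tapply(testcase$stepchg, testcase$StudyID, mad, na.rm = TRUE)
reference <- unname(as.vector(reference[as.character(reordered$StudyID)]))
print(all.equal(reordered$madstepchg, reference))
```

Renaming is probably still the safer habit, since it doesn't depend on the order of the arguments.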

(FWIW: I think the value of sequential evaluation is clear for mutate() but perhaps more questionable for summarise(). Perhaps it's done for consistency between the two, because tidyeval can sometimes throw people off. I'd love to hear what others think, though!)


#3

From the summarise() documentation, in the examples:

# Note that with data frames, newly created summaries immediately
# overwrite existing variables
mtcars %>%
  group_by(cyl) %>%
  summarise(disp = mean(disp), sd = sd(disp))
#> # A tibble: 3 x 3
#>     cyl  disp    sd
#>   <dbl> <dbl> <dbl>
#> 1     4  105.    NA
#> 2     6  183.    NA
#> 3     8  353.    NA

I have seen people miss critical details of summarise() before because they’re included in the examples and not in the main documentation. This bit is obliquely mentioned in the main text of the docs (under “Backend variations”). I’d love to see more of the details of how the function works in the main text of the documentation, but I haven’t really done anything useful about that opinion, like submitting a PR :upside_down_face:.


#4

Gee! You guys are good! I added a 1 to the first stepchg, and now the code works perfectly. Actually, I knew about the sequential evaluation of arguments in summarise(). So I agree that this is not a bug -- and maybe a rather sophisticated error instead of a stupid one.
Many thanks to both rensa and jcblum.
Larry Hunsicker


#5

This is a good reminder to me that when teaching dplyr it's important to illustrate that summarize and mutate are sequential. This, like the dropping of the last grouping variable automagically after a summarize, is not a bug, but certainly can be surprising.
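
For anyone who hasn't run into the grouping behaviour, here is a small sketch using mtcars. The object names by_two and once are just for illustration, and newer dplyr versions may also print a message about the resulting grouping.

```r
library(dplyr)

# Group by two variables
by_two <- mtcars %>% group_by(cyl, gear)
print(group_vars(by_two))

# One summarise() peels off the last grouping variable (gear),
# so the result is still grouped by cyl
once <- by_two %>% summarise(n = n())
print(group_vars(once))
```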


#6

I haven't done a documentation PR before, so I've set myself a reminder for this for Wednesday arvo after my review :smile:


#7

Not sure summarize is in fact sequentially evaluating its arguments (or even can for operators that change vector lengths).

dplyr::summarize(data.frame(x=1), x = max(x), min_x = min(x))
#    x min_x
# 1 1     0

packageVersion("dplyr")
# [1] ‘0.7.7’
