How to avoid data masking in summarise step when programming with dplyr

An unintended bug I often encounter when programming functions with dplyr is unexpected data masking in a summarise step. I'll illustrate what I mean with a simple example.

Let's say you want to do a grouped summary, but happen to accidentilly use the same name as one of the columns in your summary step:

library(dplyr)

# Unintended result due to data masking in summary step
mtcars %>% 
  group_by(gear) %>% 
  summarise(
    wt = sum(mpg),
    # wt here refers the value computed in the summary step 
    # instead of wt column grouped by gear
    mean = mean(wt)
  )
#> # A tibble: 3 x 3
#>    gear    wt  mean
#>   <dbl> <dbl> <dbl>
#> 1     3  242.  242.
#> 2     4  294.  294.
#> 3     5  107.  107.

# Same result when trying to use .data
mtcars %>% 
  group_by(gear) %>% 
  summarise(
    wt = sum(mpg),
    mean = mean(.data$wt)
  )
#> # A tibble: 3 x 3
#>    gear    wt  mean
#>   <dbl> <dbl> <dbl>
#> 1     3  242.  242.
#> 2     4  294.  294.
#> 3     5  107.  107.

# Same result when using across()
mtcars %>% 
  group_by(gear) %>% 
  summarise(
    wt = sum(mpg),
    mean = across("wt", mean)
  )
#> # A tibble: 3 x 3
#>    gear    wt mean$wt
#>   <dbl> <dbl>   <dbl>
#> 1     3  242.    242.
#> 2     4  294.    294.
#> 3     5  107.    107.

# Using . the entire wt column gets used instead of grouped by gear
mtcars %>% 
  group_by(gear) %>% 
  summarise(
    wt = sum(mpg),
    mean = mean(.$wt)
  )
#> # A tibble: 3 x 3
#>    gear    wt  mean
#>   <dbl> <dbl> <dbl>
#> 1     3  242.  3.22
#> 2     4  294.  3.22
#> 3     5  107.  3.22

# Desired output
mtcars %>% 
  group_by(gear) %>% 
  summarise(,
    mean = mean(wt),
    wt = sum(mpg)
  ) %>% 
  select(gear, wt, mean)
#> # A tibble: 3 x 3
#>    gear    wt  mean
#>   <dbl> <dbl> <dbl>
#> 1     3  242.  3.89
#> 2     4  294.  2.62
#> 3     5  107.  2.63

This tends to happen unexpectedly when this type of functionality is wrapped in a function where you don't know what the column names will be and they happen to coincide with the names you use in the summarise step. E.g.

group_and_summarise <- function(data, var, by) {
  data %>% 
    group_by(.data[[by]]) %>% 
    summarise(
      wt = sum(.data[[var]]),
      mean = mean(.data[[var]])
    )
}

group_and_summarise(mtcars, "wt", "gear")
#> # A tibble: 3 x 3
#>    gear    wt  mean
#>   <dbl> <dbl> <dbl>
#> 1     3  58.4  58.4
#> 2     4  31.4  31.4
#> 3     5  13.2  13.2

Is there a way using tidyeval to avoid this data masking and ensure you summarising the data in a correct way?

(Unsatisfactory) workarounds I'm aware of would be:

  • Ordering the summary computations in a way to avoid this risk
  • Using names in your summary steps that are less likely to coincide with column names (e.g. .wt instead of wt)

But in both approaches, this bug can still sneak up on you.

Nice post. It should be possible to set intermediate results in a private variable and remove it at the end. But in this case I think it makes sense to namespace the new columns with some sort of prefix, like you suggest with .. Even if you fix the use of the new value for wt, isn't it going to be surprising to the callers that the wt column was overwritten? It seems that overwriting data could lead to unexpected results down the line.

Using . to add new computed columns to a data frame is a pattern that is used in tidymodels packages IIRC.

1 Like

@davis mentions this nice trick. Since data frame are auto-spliced, i.e. mutate(foo = 1) is the same as mutate(data.frame(foo = 1)), you can use them to create your own temporary namespace within a single expression. This way you don't run the risk of having your variables overwritten:

group_and_summarise <- function(data, var, by) {
  data %>% 
    group_by(.data[[by]]) %>% 
    summarise(
      data.frame(
        wt = sum(.data[[var]]),
        mean = mean(.data[[var]])
      )
    )
}
1 Like

Thanks @lionel and @davis! That's a neat trick that does the job. Using data.frame() encapsulates a couple of summarise() steps in a way that avoids the possibility of data masking.

To answer your question about potentially being confusing for users: in the context I'm working in, it's common for summary computations to be weight-adjusted and report on the weight by group. In the insurance sector, that weight is referred to as the exposure (reflecting how exposed you are to risk) and keeping this named the same way in summaries is most intuitive. I guess it's somewhat similar to count() always giving you a column named n, even though the input wt column might have be named n already (note that there is a name argument in count to overwrite that default though). I don't see it as overwriting the original data column, as it's a summary.

starwars %>% 
  rename(n = height) %>% 
  count(gender, wt = n)
#> # A tibble: 3 x 2
#>   gender        n
#>   <chr>     <int>
#> 1 feminine   2635
#> 2 masculine 10944
#> 3 <NA>        544

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.