non-standard evaluation with `group_by`

chris.prener · December 6, 2018, 9:00pm

I'd love any advice folks have about this little reprex below. The general goal here is to take quoted or unquoted input in a function (called f here). The function f replaces a simple group_by()/summarize() process. This is a simplification of a more complicated workflow, meant to isolate an issue I've run into.

This process works well for both unquoted and quoted inputs (see the first two tests below). It falls apart, however, when a string is stored in a variable (x in this case) and that variable is passed to f.

# dependencies
suppressMessages(library(dplyr))
library(ggplot2)

# simplified function
f <- function(.data, group, value){

  # save parameters to list
  paramList <- as.list(match.call())

  # nse
  if (!is.character(paramList$group)) {
    groupQ <- rlang::enquo(group)
  } else if (is.character(paramList$group)) {
    groupQ <- rlang::quo(!! rlang::sym(group))
  }

  if (!is.character(paramList$value)) {
    valueQ <- rlang::enquo(value)
  } else if (is.character(paramList$value)) {
    valueQ <- rlang::quo(!! rlang::sym(value))
  }

  # group and summarize
  .data %>%
    dplyr::group_by(!!groupQ) %>%
    dplyr::summarize(sum = base::sum(!!valueQ)) -> out

  # return output
  return(out)

}

# test unquoted input
mpg %>%
  f(group = class, value = hwy)
#> # A tibble: 7 x 2
#>   class        sum
#>   <chr>      <int>
#> 1 2seater      124
#> 2 compact     1330
#> 3 midsize     1119
#> 4 minivan      246
#> 5 pickup       557
#> 6 subcompact   985
#> 7 suv         1124

# test quoted input
mpg %>%
  f(group = "class", value = "hwy")
#> # A tibble: 7 x 2
#>   class        sum
#>   <chr>      <int>
#> 1 2seater      124
#> 2 compact     1330
#> 3 midsize     1119
#> 4 minivan      246
#> 5 pickup       557
#> 6 subcompact   985
#> 7 suv         1124

# test input via a stored value
x <- "class"

mpg %>%
  f(group = x, value = "hwy")
#> Error in grouped_df_impl(data, unname(vars), drop): Column `x` is unknown

Created on 2018-12-06 by the reprex package (v0.2.1)

What I find interesting is that this approach works with other dplyr functions, like select. When I create a simple function g that can take quoted or unquoted input, it works with both, and it also works when the column name is stored in a variable that is supplied for the appropriate argument.

# dependencies
suppressMessages(library(dplyr))
library(ggplot2)


# simplified function
g <- function(.data, value){

  # save parameters to list
  paramList <- as.list(match.call())

  # nse
  if (!is.character(paramList$value)) {
    valueQ <- rlang::enquo(value)
  } else if (is.character(paramList$value)) {
    valueQ <- rlang::quo(!! rlang::sym(value))
  }

  # select the column
  .data %>%
    dplyr::select(!!valueQ) -> out

  # return output
  return(out)

}

# test unquoted input
mpg %>%
  g(value = hwy)
#> # A tibble: 234 x 1
#>      hwy
#>    <int>
#>  1    29
#>  2    29
#>  3    31
#>  4    30
#>  5    26
#>  6    26
#>  7    27
#>  8    26
#>  9    25
#> 10    28
#> # ... with 224 more rows

# test quoted input
mpg %>%
  g(value = "hwy")
#> # A tibble: 234 x 1
#>      hwy
#>    <int>
#>  1    29
#>  2    29
#>  3    31
#>  4    30
#>  5    26
#>  6    26
#>  7    27
#>  8    26
#>  9    25
#> 10    28
#> # ... with 224 more rows

# test input via a stored value
x <- "hwy"

mpg %>%
  g(value = x)
#> # A tibble: 234 x 1
#>      hwy
#>    <int>
#>  1    29
#>  2    29
#>  3    31
#>  4    30
#>  5    26
#>  6    26
#>  7    27
#>  8    26
#>  9    25
#> 10    28
#> # ... with 224 more rows

Created on 2018-12-06 by the reprex package (v0.2.1)

I'd love to know what is going on here and what I am missing with group_by(). Any help would be very much appreciated!

technocrat · December 6, 2018, 10:20pm

I'm not exactly sure where you're trying to end up, but normally in dplyr grouping, I'd expect to see something like

> mpg %>% group_by(class) %>% summarize(avg = mean(hwy))
# A tibble: 7 x 2
  class        avg
  <chr>      <dbl>
1 2seater     24.8
2 compact     28.3
3 midsize     27.3
4 minivan     22.4
5 pickup      16.9
6 subcompact  28.1
7 suv         18.1

chris.prener · December 6, 2018, 10:33pm

I ended up finding the solution (see the third test at the bottom of the reprex). In order to pass the stored string "class" into the function, the input needs to be turned into a quosure before being passed as an argument. Once it is in ~class form, it is passed to the function with the bang-bang operator (!!).

# dependencies
suppressMessages(library(dplyr))
library(ggplot2)

# simplified function
f <- function(.data, group, value){

  # save parameters to list
  paramList <- as.list(match.call())

  # nse
  if (!is.character(paramList$group)) {
    groupQ <- rlang::enquo(group)
  } else if (is.character(paramList$group)) {
    groupQ <- rlang::quo(!! rlang::sym(group))
  }

  if (!is.character(paramList$value)) {
    valueQ <- rlang::enquo(value)
  } else if (is.character(paramList$value)) {
    valueQ <- rlang::quo(!! rlang::sym(value))
  }

  # group and summarize
  .data %>%
    dplyr::group_by(!!groupQ) %>%
    dplyr::summarize(sum = base::sum(!!valueQ)) -> out

  # return output
  return(out)

}

# test unquoted input
mpg %>%
  f(group = class, value = hwy)
#> # A tibble: 7 x 2
#>   class        sum
#>   <chr>      <int>
#> 1 2seater      124
#> 2 compact     1330
#> 3 midsize     1119
#> 4 minivan      246
#> 5 pickup       557
#> 6 subcompact   985
#> 7 suv         1124

# test quoted input
mpg %>%
  f(group = "class", value = "hwy")
#> # A tibble: 7 x 2
#>   class        sum
#>   <chr>      <int>
#> 1 2seater      124
#> 2 compact     1330
#> 3 midsize     1119
#> 4 minivan      246
#> 5 pickup       557
#> 6 subcompact   985
#> 7 suv         1124

# test input via a stored value
x <- "class"
xQ <- rlang::quo(!! rlang::sym(x))

mpg %>%
  f(group = !!xQ, value = "hwy")
#> # A tibble: 7 x 2
#>   class        sum
#>   <chr>      <int>
#> 1 2seater      124
#> 2 compact     1330
#> 3 midsize     1119
#> 4 minivan      246
#> 5 pickup       557
#> 6 subcompact   985
#> 7 suv         1124

Created on 2018-12-06 by the reprex package (v0.2.1)

cderv · December 7, 2018, 7:11am

I think this is due to the way you handle NSE inside your f function. match.call() stores each argument as it was written, not its value, so when you define x <- "class" and call f(group = x), paramList$group is the symbol x rather than a string. is.character(paramList$group) is therefore FALSE, and the function runs groupQ <- rlang::enquo(group). groupQ ends up being like quo(x), and group_by() then looks for a column named x, which does not exist. The input needs to pass through your else if clause, groupQ <- rlang::quo(!! rlang::sym(group)), and that is what you end up doing outside the function with xQ <- rlang::quo(!! rlang::sym(x)). So if you modify your f function to deal with x <- "class", it should work. When the input is provided as a character, it should pass through sym or ensym.
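
To make the point above concrete, here is a minimal sketch of one way f could be modified. It first shows what match.call() records for a literal string versus a variable, then resolves each argument to a column symbol. The names show_arg, as_col, and f2 are hypothetical, and the rule that a bare name matching a column wins over a variable of the same name is an assumption of this sketch, not something from the original function.

```r
library(dplyr, warn.conflicts = FALSE)

# match.call() records each argument as written, not its value.
show_arg <- function(group) class(as.list(match.call())$group)

x <- "class"
show_arg("class")  # "character": a literal string is stored as-is
show_arg(x)        # "name": the symbol x, so is.character() is FALSE

# Hypothetical helper: resolve a captured argument to a column symbol.
as_col <- function(quo, data) {
  expr <- rlang::quo_get_expr(quo)
  # literal string such as "class"
  if (rlang::is_string(expr)) return(rlang::sym(expr))
  # bare column name such as class (assumption: columns take precedence)
  if (rlang::is_symbol(expr) && rlang::as_string(expr) %in% names(data)) {
    return(expr)
  }
  # anything else: evaluate in the caller's environment and expect a string
  val <- rlang::eval_tidy(quo)
  if (!rlang::is_string(val)) {
    stop("`group` and `value` must each name a single column", call. = FALSE)
  }
  rlang::sym(val)
}

# Hypothetical rewrite of f that accepts all three input styles.
f2 <- function(.data, group, value) {
  groupS <- as_col(rlang::enquo(group), .data)
  valueS <- as_col(rlang::enquo(value), .data)

  .data %>%
    dplyr::group_by(!!groupS) %>%
    dplyr::summarize(sum = base::sum(!!valueS))
}

# stored value, which failed with the original f
ggplot2::mpg %>%
  f2(group = x, value = "hwy")
```

With this version, f2(group = class, value = hwy) and f2(group = "class", value = "hwy") should give the same result as the stored-value call, without building a quosure outside the function.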

Note that you could also use the variant group_by_at, which works with character vectors or column names generated by vars(). It could be very useful in your case.

Here are some examples to help show how NSE works here.
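
As a rough illustration of the group_by_at route, the sketch below assumes the function only ever receives column names as strings, whether typed directly or stored in a variable. The name f_chr is hypothetical, and this version deliberately drops support for bare (unquoted) column names.

```r
library(dplyr, warn.conflicts = FALSE)

# Hypothetical variant of f that accepts column names as strings only.
f_chr <- function(.data, group, value) {
  .data %>%
    dplyr::group_by_at(.vars = group) %>%                    # character input is fine here
    dplyr::summarize(sum = base::sum(!!rlang::sym(value)))   # string -> symbol
}

x <- "class"

ggplot2::mpg %>%
  f_chr(group = x, value = "hwy")
```

Because group and value are evaluated normally inside f_chr, there is no need for match.call() or for a quosure built outside the function.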

library(dplyr, warn.conflicts = FALSE)
library(ggplot2)

x <- "class"

# effect of tidy evaluation
rlang::qq_show(!!quo(x))
#> ^x
rlang::qq_show(quo(!!x))
#> quo("class")
rlang::qq_show(sym(x))
#> sym(x)
rlang::qq_show(!!sym(x))
#> class
rlang::qq_show(sym(!!x))
#> sym("class")
rlang::qq_show(!!x)
#> "class"

# does not work because there is no column named x
mpg %>%
  dplyr::group_by(!!quo(x))
#> Error in grouped_df_impl(data, unname(vars), drop): Column `x` is unknown

# works because !!sym(x) unquotes to the symbol class
mpg %>%
  dplyr::group_by(!!sym(x))
#> # A tibble: 234 x 11
#> # Groups:   class [7]
#>    manufacturer model displ  year   cyl trans drv     cty   hwy fl    cla~
#>    <chr>        <chr> <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <ch>
#>  1 audi         a4      1.8  1999     4 auto~ f        18    29 p     com~
#>  2 audi         a4      1.8  1999     4 manu~ f        21    29 p     com~
#>  3 audi         a4      2    2008     4 manu~ f        20    31 p     com~
#>  4 audi         a4      2    2008     4 auto~ f        21    30 p     com~
#>  5 audi         a4      2.8  1999     6 auto~ f        16    26 p     com~
#>  6 audi         a4      2.8  1999     6 manu~ f        18    26 p     com~
#>  7 audi         a4      3.1  2008     6 auto~ f        18    27 p     com~
#>  8 audi         a4 q~   1.8  1999     4 manu~ 4        18    26 p     com~
#>  9 audi         a4 q~   1.8  1999     4 auto~ 4        16    25 p     com~
#> 10 audi         a4 q~   2    2008     4 manu~ 4        20    28 p     com~
#> # ... with 224 more rows

# works because the *_at variants know how to deal with character input
mpg %>%
  dplyr::group_by_at(.vars = x)
#> # A tibble: 234 x 11
#> # Groups:   class [7]
#>    manufacturer model displ  year   cyl trans drv     cty   hwy fl    cla~
#>    <chr>        <chr> <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <ch>
#>  1 audi         a4      1.8  1999     4 auto~ f        18    29 p     com~
#>  2 audi         a4      1.8  1999     4 manu~ f        21    29 p     com~
#>  3 audi         a4      2    2008     4 manu~ f        20    31 p     com~
#>  4 audi         a4      2    2008     4 auto~ f        21    30 p     com~
#>  5 audi         a4      2.8  1999     6 auto~ f        16    26 p     com~
#>  6 audi         a4      2.8  1999     6 manu~ f        18    26 p     com~
#>  7 audi         a4      3.1  2008     6 auto~ f        18    27 p     com~
#>  8 audi         a4 q~   1.8  1999     4 manu~ 4        18    26 p     com~
#>  9 audi         a4 q~   1.8  1999     4 auto~ 4        16    25 p     com~
#> 10 audi         a4 q~   2    2008     4 manu~ 4        20    28 p     com~
#> # ... with 224 more rows

Created on 2018-12-07 by the reprex package (v0.2.1)

chris.prener · December 10, 2018, 12:38pm

Thanks @cderv - this is super helpful. I haven't used the *_at functions before, and also didn't know about rlang::qq_show. Much easier than how I've been debugging quasi quotation...
