Interesting tidy eval use cases

hadley · January 8, 2019, 9:36pm

Hi all,

To help me come up with good exercises, I'm looking for small but useful functions where you've found tidyeval helpful. Ideally, they'd be under 20 lines, and accompanied by a couple of sentences describing what the goal is. Would you mind sharing where you've found it useful?

Thanks!

Hadley

dylanjm · January 8, 2019, 9:56pm

I've seen similar functions around online but I use this function frequently that lies in the same vein as dplyr::count() but instead of just counting, it returns the proportion of values within a group. Here is the code:

library(tidyverse)

proportion <- function(df, vars){
  vars_col <- enquo(vars)

  df %>% 
    count(!!vars_col, sort = T) %>% 
    mutate(prop_n = prop.table(n)) 
}

iris %>% 
  proportion(Species)
#> # A tibble: 3 x 3
#>   Species        n prop_n
#>   <fct>      <int>  <dbl>
#> 1 setosa        50  0.333
#> 2 versicolor    50  0.333
#> 3 virginica     50  0.333

I'm just starting to learn tidy eval but I've also found it useful when refactoring ggplot's into reusable functions.

hadley · January 8, 2019, 10:07pm

I'd be tempted to write that as below, so that you can get within group proportions:

library(dplyr)
prop <- function(df, ...) {
  out <- df %>% 
    group_by(...) %>% 
    summarise(n = n()) %>% 
    mutate(prop = n / sum(n))

  out
}

starwars %>% 
  prop(homeworld, species) %>% 
  filter(substr(homeworld, 1, 1) == "T")
#> # A tibble: 6 x 4
#> # Groups:   homeworld [5]
#>   homeworld species        n  prop
#>   <chr>     <chr>      <int> <dbl>
#> 1 Tatooine  Droid          2   0.2
#> 2 Tatooine  Human          8   0.8
#> 3 Toydaria  Toydarian      1   1  
#> 4 Trandosha Trandoshan     1   1  
#> 5 Troiken   Xexto          1   1  
#> 6 Tund      Toong          1   1

^{Created on 2019-01-08 by the reprex package (v0.2.1.9000)}

hadley · January 8, 2019, 10:10pm

Actually, maybe this would be better so you can apply it after you've counted:

prop <- function(df, x = n) {
  x <- enquo(x)
  df %>% mutate(prop = !!x / sum(!!x))
}

joeystan · January 8, 2019, 10:29pm

I can't claim credit, but I did ask the question that led @danr to provide this useful function (which Kieran Healey wrote about here) . It's essentially spread() but with multiple value arguments. I use it all the time, though I usually call it spread_n in my code.

df <- data.frame(month=rep(1:3,2),
                                 student=rep(c("Amy", "Bob"), each=3),
                                 A=c(9, 7, 6, 8, 6, 9),
                                 B=c(6, 7, 8, 5, 6, 7))

t1 <- df %>%
    gather(variable, value, -(month:student)) %>%
    unite(temp, student, variable) %>%
    spread(temp, value)
#   month Amy_A Amy_B Bob_A Bob_B
# 1     1     9     6     8     5
# 2     2     7     7     6     6
# 3     3     6     8     9     7

myspread <- function(df, key, value) {
    # quote key
    keyq <- rlang::enquo(key)
    # break value vector into quotes
    valueq <- rlang::enquo(value)
    s <- rlang::quos(!!valueq)
    df %>% gather(variable, value, !!!s) %>%
        unite(temp, !!keyq, variable) %>%
        spread(temp, value)
}

t2 <- df %>% myspread(student, c(A, B))

identical(t1, t2)

nacnudus · January 8, 2019, 10:55pm

This is a version of summary() that uses all combinations of grouping variables.

library(tidyverse)

margins <- function(.data, ...) {
  groups <- dplyr::groups(.data)
    # Create all combinations of groups, excluding no groups
    # https://stackoverflow.com/a/27953641/937932
  combinations <-
    do.call(c, lapply(seq_along(groups), combn, x = groups, simplify = FALSE))
  n <- length(combinations)
  out <- vector(mode = "list", length = n)
  for (i in rev(seq_len(n))) { # summarise by one column, then two, then etc.
    out[[i]] <-
      .data %>%
      dplyr::group_by(!!!combinations[[i]]) %>%
      dplyr::summarise(...) %>%
      dplyr::group_by(!!!combinations[[i]]) # Reapply the original groups
  }
  out %>%
    dplyr::bind_rows() %>%
    dplyr::select(!!!groups, dplyr::everything())
}

mtcars %>%
  dplyr::group_by(cyl, gear, am) %>%
  margins(mean_mpg = mean(mpg),
          min_hp = min(hp)) %>%
  print(n = Inf)
#> # A tibble: 36 x 5
#> # Groups:   cyl [4]
#>      cyl  gear    am mean_mpg min_hp
#>    <dbl> <dbl> <dbl>    <dbl>  <dbl>
#>  1     4    NA    NA     26.7     52
#>  2     6    NA    NA     19.7    105
#>  3     8    NA    NA     15.1    150
#>  4    NA     3    NA     16.1     97
#>  5    NA     4    NA     24.5     52
#>  6    NA     5    NA     21.4     91
#>  7    NA    NA     0     17.1     62
#>  8    NA    NA     1     24.4     52
#>  9     4     3    NA     21.5     97
#> 10     4     4    NA     26.9     52
#> 11     4     5    NA     28.2     91
#> 12     6     3    NA     19.8    105
#> 13     6     4    NA     19.8    110
#> 14     6     5    NA     19.7    175
#> 15     8     3    NA     15.0    150
#> 16     8     5    NA     15.4    264
#> 17     4    NA     0     22.9     62
#> 18     4    NA     1     28.1     52
#> 19     6    NA     0     19.1    105
#> 20     6    NA     1     20.6    110
#> 21     8    NA     0     15.0    150
#> 22     8    NA     1     15.4    264
#> 23    NA     3     0     16.1     97
#> 24    NA     4     0     21.0     62
#> 25    NA     4     1     26.3     52
#> 26    NA     5     1     21.4     91
#> 27     4     3     0     21.5     97
#> 28     4     4     0     23.6     62
#> 29     4     4     1     28.0     52
#> 30     4     5     1     28.2     91
#> 31     6     3     0     19.8    105
#> 32     6     4     0     18.5    123
#> 33     6     4     1     21      110
#> 34     6     5     1     19.7    175
#> 35     8     3     0     15.0    150
#> 36     8     5     1     15.4    264

milesmcbain · January 8, 2019, 10:55pm

@Alicia posted this example which I thought was neat:

filter_loudly <- function(x, ...){
   in_rows <- nrow(x)
   out <- filter(x, !!!rlang::enquos(...))
   out_rows <- nrow(out)
   message("Filtered out ",in_rows-out_rows," rows.")
   return(out)
}

Could also extend the idea to a noisy join that reports duplicates and failed matches, as these are often things you immediately check.

hadley · January 8, 2019, 11:49pm

Unfortunately that doesn't actually need tidy eval, as the usual semantics of ... are enough:

filter_loudly <- function(x, ...){
  in_rows <- nrow(x)
  out <- filter(x, ...)
  out_rows <- nrow(out)
  message("Filtered out ", in_rows - out_rows, " rows.")
  out
}

(this works reliably because of tidy eval, but you don't need to explicitly use it)

hadley · January 8, 2019, 11:55pm

Nice! You can simplify that a bit to:

myspread <- function(df, key, ...) {
    key <- rlang::enquo(key)

    df %>% 
      gather("variable", "value", ...) %>%
      unite(temp, !!key, variable) %>%
      spread(temp, value)
}

hadley · January 9, 2019, 12:03am

Nice idea, thanks!

(BTW I think the send group_by() is redundant since the grouping variables will be reset by bind_rows())

em-ee · January 9, 2019, 12:30am

There’s probably a much easier way to do this (and not sure how widely useful it is!), but I was working with data which had missing codes (e.g. 777,888,999) mixed in with actual measured values, so I wanted to split them out so I could work with the actual data without having to filter out the missing codes all the time, and analyze the missing codes separately if needed.

The short version of the function just duplicates the columns (assumes that the first column is a list of ID numbers) and names each new column ‘originalname_NA’

(The longer version then replaces all the missing codes with NA in the original columns, and the original data with NA in the new columns - here:

na_columns <- function(data){
  
  for (indexCol in 2:ncol(data)){
    
    columnName <- colnames(data)[indexCol]
    varname <- paste0(columnName, "_NA")
    
    data <- mutate(data, !!varname := data[[indexCol]])
    
  }
  
  data
  
}

df <- data.frame(id_no = c(1,2,3,4,5,6,7,8,9,10),
                 age = c(45,32,44,23,47,999,45,999,50,38),
                 score = c(23,24,26,30,888,999,999,999,30,28),
                 score_2 = c(24,24,888,30,22,20,999,999,30,999))

na_columns(df)

# id_no   age   score   score_2   age_NA    score_NA    score_2_NA
# 1       45        23        24      45      23        24
# 2       32        24        24      32      24        24
# 3       44        26        888     44      26        888
# 4       23        30        30      23      30        30
# 5       47        888       22      47      888       22
# 6       999       999       20      999     999       20
# 7       45        999       999     45      999       999
# 8       999       999       999     999     999       999
# 9       50        30        30      50      30        30
# 10      38        28        999     38      28        999

milesmcbain · January 9, 2019, 2:16am

Yes okay good point.

Emilhvitfeldt · January 9, 2019, 7:43am

Goal: Create a named vector from two variables in a data.frame. One variable for the values and one variable for the names.
(I think this is a valid use of tidy eval)

library(rlang)
word_dict <- function(data, word, score) {
  word <- quo_name(enquo(word))
  score <- quo_name(enquo(score))

  x <- data[[score]]
  names(x) <- data[[word]]
  x
}

test_data <- data.frame(
  letter = letters,
  number = letters %in% c("a", "e", "i", "o", "u")
)

word_dict(test_data, letter, number)
#>     a     b     c     d     e     f     g     h     i     j     k     l 
#>  TRUE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE 
#>     m     n     o     p     q     r     s     t     u     v     w     x 
#> FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE 
#>     y     z 
#> FALSE FALSE
``

Alicia · January 9, 2019, 8:12am

Yeah I guess that was a bad example! (Or a good anti-example, those can be helpful too!). I think tidyeval can be helpful for more complex iterations of that same function, e.g. to also print out how many rows each condition individually filters out for when you have multiple conditions used for filtering

filter_loudly <- function(x, ...){
  in_rows <- nrow(x)
  args <- rlang::enquos(...)
  out <- dplyr::filter(x, !!!args) # !!!args not really needed here, could just pass ...
  out_rows <- nrow(out)
  message("Filtered out ",in_rows-out_rows," rows.")
  for (arg in args){
    temp_rows <- nrow(dplyr::filter(x, !!arg))
    message(in_rows - temp_rows, " rows don't match ", rlang::quo_text(arg))
  }
  return(out)
}

(Unless there is another easy way of splitting up '...' and passing to separate functions without evaluating?)

MikeKSmith · January 9, 2019, 9:32am

My example is very similar to this:

categoricalSummary <- function(data, variable){
  variable <- enquo(variable)
  data %>%
    count(!!variable)%>%
    rename(Values = !!variable)  %>%
    complete(Values, fill=list(n=0)) %>%
    mutate(Variable = quo_name(variable)) %>%
    mutate(Values = as.character(Values)) %>%
    select(Variable, Values, n)
}

So it can be used like this:

myTab <- mysub %>%
  group_by(Sex) %>%
  categoricalSummary(variable = Compliance) %>%
  ungroup() %>%
  mutate(prop = prop.table(n))
myTab

# # A tibble: 4 x 5
#  Sex    Variable   Values     n   prop
# <fct>  <chr>      <chr>  <dbl>  <dbl>
# 1 Male   Compliance No         0 0     
# 2 Male   Compliance Yes       59 0.797 
# 3 Female Compliance No         2 0.0270
# 4 Female Compliance Yes       13 0.176

I want to pick out categorical covariates from a dataset, summarise with optional grouping variables and show proportions relative to the total number of observations.

MikeKSmith · January 9, 2019, 9:41am

For summarising continuous data:

continuousSummary <- function(data, variable){
  variable <- enquo(variable)

  data %>%
  summarise_at(quo_name(variable),
               funs(N = length(.),
                    mean = mean(.),
                    sd = sd(.),
                    median = median(.),
                    min = min(.),
                    max = max(.))) %>%
  mutate(range = paste(min, "-", max),
         CV = 100*sd/mean) %>%
  mutate(Variable = quo_name(variable)) %>%
  select(Variable, everything())
}

Which can then be used on individual variables:

continuousSummary(mysub, variable = AGE)
#  Variable  N     mean       sd median min max   range       CV
#1      AGE 74 65.17568 7.822269     66  43  78 43 - 78 12.00182

Or used in pipelines with grouping:

mysub %>%
  group_by(SEX) %>%
  continuousSummary(variable = AGE)

# # A tibble: 2 x 10
# Variable   SEX     N  mean    sd median   min   max range      CV
#   <chr>    <int> <int> <dbl> <dbl>  <int> <dbl> <dbl> <chr>   <dbl>
# 1 AGE          1    59  65.6  7.82     67    43    78 43 - 78  11.9
# 2 AGE          2    15  63.4  7.83     61    55    78 55 - 78  12.3

The aim with both the categoricalSummary and continuousSummary functions is to be able to produce content which can then be gathered and shaped ready for display with the gt package.

lionel · January 9, 2019, 12:41pm

Hi Mike, thanks for the contribution!

There's a point that we haven't made clear in our documentation yet. You should normally use enquo() and enquos() when you expect actions (any complex R expression). Here you're really taking a selection, i.e. variable names. Your function is not strict enough about its inputs. One way to fix this is to add proper input checking by using the new as_name() function instead of quo_name(), see the NEWS for rlang 0.3.1 about this. With as_name(), you'll get a more informative error when the user supplies an action instead of a variable.

However, for the purpose of selection, using tidyselect is generally much better. You can use either tidyselect::vars_pull() to get pull()-like selection, including with negative indice, or tidyselect::vars_select() when multiple selections make sense. In this case, I think it makes sense to generalise your function to multiple selections.

Normally to implement selection semantics, we forward the dots and the data names to tidyselect. This returns a character vector of selected names:

continuous_summary <- function(.data, ...) {
  sel <- tidyselect::vars_select(tbl_vars(.data), ...)

  <rest of implementation>
}

Here we don't even have to do this because tidyr::gather() takes selections, so we can pass the dots directly:

continuous_summary <- function(.data, ...) {
  .data %>%
    tidyr::gather("Variable", "Value", ...) %>%
    group_by(Variable, add = TRUE) %>%
    summarise_at(
      "Value",
      list(
        N =      ~ length(.),
        mean =   ~ mean(.),
        sd =     ~ sd(.),
        median = ~ median(.),
        min =    ~ min(.),
        max =    ~ max(.)
      )
    ) %>%
    mutate(
      range = paste(min, "-", max),
      CV = 100 * sd / mean
    )
}

This supports several variables, possibly grouped:

mtcars %>% continuous_summary(starts_with("d"), qsec)
#> # A tibble: 3 x 9
#>   Variable     N   mean      sd median   min    max range          CV
#>   <chr>    <int>  <dbl>   <dbl>  <dbl> <dbl>  <dbl> <chr>       <dbl>
#> 1 disp        32 231.   124.    196.   71.1  472    71.1 - 472   53.7
#> 2 drat        32   3.60   0.535   3.70  2.76   4.93 2.76 - 4.93  14.9
#> 3 qsec        32  17.8    1.79   17.7  14.5   22.9  14.5 - 22.9  10.0

mtcars %>% group_by(am) %>% continuous_summary(starts_with("d"))
#> # A tibble: 4 x 10
#> # Groups:   am [2]
#>      am Variable     N   mean      sd median    min    max range          CV
#>   <dbl> <chr>    <int>  <dbl>   <dbl>  <dbl>  <dbl>  <dbl> <chr>       <dbl>
#> 1     0 disp        19 290.   110.    276.   120.   472    120.1 - 472 37.9 
#> 2     0 drat        19   3.29   0.392   3.15   2.76   3.92 2.76 - 3.92 11.9 
#> 3     1 disp        13 144.    87.2   120.    71.1  351    71.1 - 351  60.8 
#> 4     1 drat        13   4.05   0.364   4.08   3.54   4.93 3.54 - 4.93  8.99

lionel · January 9, 2019, 1:01pm

Instead of capturing your inputs by action, we can also use tidyselect to capture them by selection. Please see remarks in my previous comment about action versus selection for more context on the following suggestion.

Since you take a fixed number of inputs, let's use vars_pull() instead of vars_select().

word_dict <- function(data, word, score) {
  vars <- tbl_vars(data)
  score <- tidyselect::vars_pull(vars, !!enquo(score))
  word <- tidyselect::vars_pull(vars, !!enquo(word))

  x <- data[[score]]
  names(x) <- data[[word]]
  x
}

Your function can be used in the same way:

test_data %>% word_dict(letter, number)
#>     a     b     c     d     e     f     g     h     i     j     k     l
#>  TRUE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE
#>     m     n     o     p     q     r     s     t     u     v     w     x
#> FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE
#>     y     z
#> FALSE FALSE

And now supports dplyr::pull() features:

test_data %>% word_dict(-1, -2)
#>  TRUE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE
#>   "a"   "b"   "c"   "d"   "e"   "f"   "g"   "h"   "i"   "j"   "k"   "l"
#> FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE
#>   "m"   "n"   "o"   "p"   "q"   "r"   "s"   "t"   "u"   "v"   "w"   "x"
#> FALSE FALSE
#>   "y"   "z"

MikeKSmith · January 9, 2019, 1:35pm

I'm smiling because it seems that many cases where we're using tidy evaluation can actually be refactored using other means... So now I'm a bit confused where you would use tidy-evaluation...?

hadley · January 9, 2019, 1:53pm

Yes, this is very interesting - it seems like the enquo() + !! pattern is commonly used, but for the more complex problems my intuition is to always start with a solution that doesn't need tidy eval.