Interesting tidy eval use cases


#16

My example is very similar to this:

categoricalSummary <- function(data, variable){
  variable <- enquo(variable)
  data %>%
    count(!!variable) %>%
    rename(Values = !!variable) %>%
    complete(Values, fill = list(n = 0)) %>%
    mutate(Variable = quo_name(variable)) %>%
    mutate(Values = as.character(Values)) %>%
    select(Variable, Values, n)
}

So it can be used like this:

myTab <- mysub %>%
  group_by(Sex) %>%
  categoricalSummary(variable = Compliance) %>%
  ungroup() %>%
  mutate(prop = prop.table(n))
myTab

# # A tibble: 4 x 5
#  Sex    Variable   Values     n   prop
# <fct>  <chr>      <chr>  <dbl>  <dbl>
# 1 Male   Compliance No         0 0     
# 2 Male   Compliance Yes       59 0.797 
# 3 Female Compliance No         2 0.0270
# 4 Female Compliance Yes       13 0.176 

I want to pick out categorical covariates from a dataset, summarise them with optional grouping variables, and show proportions relative to the total number of observations.


#17

For summarising continuous data:

continuousSummary <- function(data, variable){
  variable <- enquo(variable)

  data %>%
    summarise_at(quo_name(variable),
                 funs(N = length(.),
                      mean = mean(.),
                      sd = sd(.),
                      median = median(.),
                      min = min(.),
                      max = max(.))) %>%
    mutate(range = paste(min, "-", max),
           CV = 100 * sd / mean) %>%
    mutate(Variable = quo_name(variable)) %>%
    select(Variable, everything())
}

Which can then be used on individual variables:

continuousSummary(mysub, variable = AGE)
#  Variable  N     mean       sd median min max   range       CV
#1      AGE 74 65.17568 7.822269     66  43  78 43 - 78 12.00182

Or used in pipelines with grouping:

mysub %>%
  group_by(SEX) %>%
  continuousSummary(variable = AGE)

# # A tibble: 2 x 10
# Variable   SEX     N  mean    sd median   min   max range      CV
#   <chr>    <int> <int> <dbl> <dbl>  <int> <dbl> <dbl> <chr>   <dbl>
# 1 AGE          1    59  65.6  7.82     67    43    78 43 - 78  11.9
# 2 AGE          2    15  63.4  7.83     61    55    78 55 - 78  12.3

The aim with both the categoricalSummary and continuousSummary functions is to be able to produce content which can then be gathered and shaped ready for display with the gt package.


#18

Hi Mike, thanks for the contribution!

There's a point that we haven't made clear in our documentation yet. You should normally use enquo() and enquos() when you expect actions (any complex R expression). Here you're really taking a selection, i.e. variable names, so your function is not strict enough about its inputs. One way to fix this is to add proper input checking by using the new as_name() function instead of quo_name(); see the NEWS for rlang 0.3.1 for details. With as_name(), you'll get a more informative error when the user supplies an action instead of a variable.
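
For example, here is a minimal sketch of your categoricalSummary() with that check added:

categoricalSummary <- function(data, variable){
  variable <- enquo(variable)

  # as_name() errors informatively if the user passes a complex expression
  # rather than a bare column name
  var_name <- rlang::as_name(variable)

  data %>%
    count(!!variable) %>%
    rename(Values = !!variable) %>%
    complete(Values, fill = list(n = 0)) %>%
    mutate(Variable = var_name) %>%
    mutate(Values = as.character(Values)) %>%
    select(Variable, Values, n)
}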

However, for the purpose of selection, using tidyselect is generally much better. You can use either tidyselect::vars_pull() to get pull()-like selection, including with negative indices, or tidyselect::vars_select() when multiple selections make sense. In this case, I think it makes sense to generalise your function to multiple selections.

Normally to implement selection semantics, we forward the dots and the data names to tidyselect. This returns a character vector of selected names:

continuous_summary <- function(.data, ...) {
  sel <- tidyselect::vars_select(tbl_vars(.data), ...)

  # ... rest of the implementation ...
}

Here we don't even have to do this because tidyr::gather() takes selections, so we can pass the dots directly:

continuous_summary <- function(.data, ...) {
  .data %>%
    tidyr::gather("Variable", "Value", ...) %>%
    group_by(Variable, add = TRUE) %>%
    summarise_at(
      "Value",
      list(
        N =      ~ length(.),
        mean =   ~ mean(.),
        sd =     ~ sd(.),
        median = ~ median(.),
        min =    ~ min(.),
        max =    ~ max(.)
      )
    ) %>%
    mutate(
      range = paste(min, "-", max),
      CV = 100 * sd / mean
    )
}

This supports several variables, possibly grouped:

mtcars %>% continuous_summary(starts_with("d"), qsec)
#> # A tibble: 3 x 9
#>   Variable     N   mean      sd median   min    max range          CV
#>   <chr>    <int>  <dbl>   <dbl>  <dbl> <dbl>  <dbl> <chr>       <dbl>
#> 1 disp        32 231.   124.    196.   71.1  472    71.1 - 472   53.7
#> 2 drat        32   3.60   0.535   3.70  2.76   4.93 2.76 - 4.93  14.9
#> 3 qsec        32  17.8    1.79   17.7  14.5   22.9  14.5 - 22.9  10.0

mtcars %>% group_by(am) %>% continuous_summary(starts_with("d"))
#> # A tibble: 4 x 10
#> # Groups:   am [2]
#>      am Variable     N   mean      sd median    min    max range          CV
#>   <dbl> <chr>    <int>  <dbl>   <dbl>  <dbl>  <dbl>  <dbl> <chr>       <dbl>
#> 1     0 disp        19 290.   110.    276.   120.   472    120.1 - 472 37.9 
#> 2     0 drat        19   3.29   0.392   3.15   2.76   3.92 2.76 - 3.92 11.9 
#> 3     1 disp        13 144.    87.2   120.    71.1  351    71.1 - 351  60.8 
#> 4     1 drat        13   4.05   0.364   4.08   3.54   4.93 3.54 - 4.93  8.99

#19

Instead of capturing your inputs by action, we can also use tidyselect to capture them by selection. Please see the remarks in my previous comment about action versus selection for more context on the following suggestion.

Since you take a fixed number of inputs, let's use vars_pull() instead of vars_select().

word_dict <- function(data, word, score) {
  vars <- tbl_vars(data)
  score <- tidyselect::vars_pull(vars, !!enquo(score))
  word <- tidyselect::vars_pull(vars, !!enquo(word))

  x <- data[[score]]
  names(x) <- data[[word]]
  x
}

Your function can be used in the same way:

test_data %>% word_dict(letter, number)
#>     a     b     c     d     e     f     g     h     i     j     k     l
#>  TRUE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE
#>     m     n     o     p     q     r     s     t     u     v     w     x
#> FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE
#>     y     z
#> FALSE FALSE

And now supports dplyr::pull() features:

test_data %>% word_dict(-1, -2)
#>  TRUE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE
#>   "a"   "b"   "c"   "d"   "e"   "f"   "g"   "h"   "i"   "j"   "k"   "l"
#> FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE
#>   "m"   "n"   "o"   "p"   "q"   "r"   "s"   "t"   "u"   "v"   "w"   "x"
#> FALSE FALSE
#>   "y"   "z"

#20

I'm smiling because it seems that many cases where we're using tidy evaluation can actually be refactored using other means... So now I'm a bit confused about where you would use tidy evaluation...? :grinning:


#21

Yes, this is very interesting - it seems like the enquo() + !! pattern is commonly used, but for the more complex problems my intuition is to always start with a solution that doesn't need tidy eval.


#22

Hi @lionel,

This works really nicely for continuous outcomes, but not when I try to implement something similar for categorical outcomes. The issue is in the gather() step: it seems to convert the values to character instead of preserving factors, so when you use complete() it doesn't see the Value column as a factor. (This is probably straying from the discussion around tidy evaluation, though...)

library(tidyverse)
fac_mtcars <- mtcars %>%
  mutate_at(vars(cyl,carb), as.factor) %>%
  select(cyl, carb) %>%
  as_tibble()

table(fac_mtcars$cyl, fac_mtcars$carb)
#>    
#>     1 2 3 4 6 8
#>   4 5 6 0 0 0 0
#>   6 2 0 0 4 1 0
#>   8 0 4 3 6 0 1

## Doesn't honour the `complete` since Value is of type `character`
## rather than `factor`.
fac_mtcars %>%
  group_by(cyl) %>%
  gather("Variable", "Value", carb) %>%
  group_by(Variable, add = TRUE) %>%
  count(Value) %>%
  complete(Value, fill=list(n=0))
#> # A tibble: 9 x 4
#> # Groups:   cyl, Variable [3]
#>   cyl   Variable Value     n
#>   <fct> <chr>    <chr> <dbl>
#> 1 4     carb     1         5
#> 2 4     carb     2         6
#> 3 6     carb     1         2
#> 4 6     carb     4         4
#> 5 6     carb     6         1
#> 6 8     carb     2         4
#> 7 8     carb     3         3
#> 8 8     carb     4         6
#> 9 8     carb     8         1
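
One workaround I can see (nothing to do with tidy eval) is to restore the factor levels after gathering, with the levels hardcoded for the single gathered column:

fac_mtcars %>%
  group_by(cyl) %>%
  gather("Variable", "Value", carb) %>%
  # gather() coerces the gathered factor to character, so put the original
  # levels back before counting; complete() can then fill the missing
  # combinations with n = 0
  mutate(Value = factor(Value, levels = levels(fac_mtcars$carb))) %>%
  group_by(Variable, add = TRUE) %>%
  count(Value) %>%
  complete(Value, fill = list(n = 0))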

#23

I take it back! Capturing by action is definitely better in that case. Here's a rather elegant way to implement word_dict(): forward the inputs to transmute() and use tibble::deframe() to turn the resulting two-column data frame into a named vector. cc @jennybryan

word_dict <- function(data, word, score) {
  word <- enquo(word)
  score <- enquo(score)

  data %>%
    transmute(!!word, !!score) %>%
    tibble::deframe()
}

This implementation allows simple selections as before:

test_data %>% word_dict(letter, number)
#>     a     b     c     d     e     f     g     h     i     j     k     l
#>  TRUE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE
#>     m     n     o     p     q     r     s     t     u     v     w     x
#> FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE
#>     y     z
#> FALSE FALSE

However, because it is now taking actions, you can transform the vectors on the fly:

test_data %>% word_dict(toupper(letter), number * 10)
#>  A  B  C  D  E  F  G  H  I  J  K  L  M  N  O  P  Q  R  S  T  U  V  W  X  Y  Z
#> 10  0  0  0 10  0  0  0 10  0  0  0  0  0 10  0  0  0  0  0 10  0  0  0  0  0

#24

Here's an example that converts user input into a partial Stan model, allowing variables and expressions to be unquoted (disclaimer: I'm not familiar with Stan at all, so I don't know how useful this is in practice :wink:). I think tidy eval is particularly interesting when it's used as a bridge between a DSL (or another language) and R code.

library(rlang)

generate_stan_model_code <- function(x, y) {
  quos <- enquos(x = x, y = y, .ignore_empty = "all")
  labels <- paste(names(quos), "~", purrr::map_chr(quos, as_label))
  labels <- paste0("    ", labels, ";", collapse = "\n")
  glue::glue("
parameters {
    real<lower=0,upper=1> x;
    real<lower=0,upper=1> y;
}
model {
{{labels}}
}",
    .open = "{{", .close = "}}"
  )
}

generate_stan_model_code(beta(1, 1))
#> parameters {
#>     real<lower=0,upper=1> x;
#>     real<lower=0,upper=1> y;
#> }
#> model {
#>     x ~ beta(1, 1);
#> }
generate_stan_model_code(beta(1, 1), normal(0, 100))
#> parameters {
#>     real<lower=0,upper=1> x;
#>     real<lower=0,upper=1> y;
#> }
#> model {
#>     x ~ beta(1, 1);
#>     y ~ normal(0, 100);
#> }

mu <- 0.1
generate_stan_model_code(normal(!!mu, 1), normal(!!mu, 100))
#> parameters {
#>     real<lower=0,upper=1> x;
#>     real<lower=0,upper=1> y;
#> }
#> model {
#>     x ~ normal(0.1, 1);
#>     y ~ normal(0.1, 100);
#> }

get_mu <- function() runif(1)
generate_stan_model_code(normal(!!get_mu(), 1), normal(!!get_mu(), 100))
#> parameters {
#>     real<lower=0,upper=1> x;
#>     real<lower=0,upper=1> y;
#> }
#> model {
#>     x ~ normal(0.994106655474752, 1);
#>     y ~ normal(0.661887304158881, 100);
#> }

Created on 2019-01-10 by the reprex package (v0.2.1)


#25

I think base::deparse() would be more appropriate than as_label() here. The latter will simplify complex expressions. You'll have to squash the quosure before deparsing. Though I would just capture with enexprs() instead of enquos(), since there's no evaluation going on.
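
Something along these lines (untested sketch):

generate_stan_model_code <- function(x, y) {
  # Capture the raw expressions (no evaluation is needed) and deparse them fully
  exprs_list <- rlang::enexprs(x = x, y = y, .ignore_empty = "all")
  labels <- paste(names(exprs_list), "~",
                  purrr::map_chr(exprs_list, ~ paste(deparse(.x), collapse = " ")))
  labels <- paste0("    ", labels, ";", collapse = "\n")
  glue::glue("
parameters {
    real<lower=0,upper=1> x;
    real<lower=0,upper=1> y;
}
model {
{{labels}}
}",
    .open = "{{", .close = "}}"
  )
}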


#26

Thanks, true. When we want a more nested DSL, maybe quosures will be needed, but I can't come up with a good example...


#27

I've been using this to tidy up data that has subheaders embedded in a data variable. The subheaders are matched with regex and put into their own variable.

The function:

untangle2 <- function(df, regex, orig, new) {
  orig <- dplyr::enquo(orig)
  new <- dplyr::ensym(new)
  to_fill <- dplyr::mutate(
    df,
    !!new := dplyr::if_else(grepl(regex, !!orig), !!orig, NA_character_)
  )
  dffilled <- tidyr::fill(to_fill, !!new)
  dplyr::filter(dffilled, !grepl(regex, !!orig))
}

Example usage:

dat <- tibble::tibble(
  site = c("Wet Season", "a", "b", "Dry Season", "a", "b"),
  rain = c(NA, 52, 41, NA, 12, 9)
)

dat %>% untangle2("Season", site, Season)
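
The output looks something like this:

#> # A tibble: 4 x 3
#>   site   rain Season
#>   <chr> <dbl> <chr>
#> 1 a        52 Wet Season
#> 2 b        41 Wet Season
#> 3 a        12 Dry Season
#> 4 b         9 Dry Season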


#28

I often find myself wanting to use tidyselect helpers to specify a series of columns for which I want to calculate a rowwise sum or mean. I use the below to wrangle rowSums() or rowMeans() into accepting tidyselect helpers. I'm sure it's a bit hacky, but it seems to do the trick!

(I also added a .value argument so I can specify the name of the output column.)

library(tidyverse)

tidyselect_row_sums <- function (.data, ..., .value = "row_sum", na.rm = FALSE) {
  
  dots <- exprs(...)
  value <- sym(.value)
  cols <- select(.data, !!!dots)
  out <- mutate(.data, !!value := rowSums(cols, na.rm = na.rm))
  return (out)
}
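
The rowMeans() version used in the example below is the same apart from swapping rowSums() for rowMeans() (sketched here for completeness):

tidyselect_row_means <- function(.data, ..., .value = "row_mean", na.rm = FALSE) {
  dots <- exprs(...)
  value <- sym(.value)
  cols <- select(.data, !!!dots)
  mutate(.data, !!value := rowMeans(cols, na.rm = na.rm))
}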

iris %>%
  tidyselect_row_means(starts_with("Sepal"), .value = "Sepal.Mean")
# A tibble: 150 x 6
#   Sepal.Length Sepal.Width Petal.Length Petal.Width Species Sepal.Mean
#          <dbl>       <dbl>        <dbl>       <dbl> <fct>        <dbl>
# 1          5.1         3.5          1.4         0.2 setosa        4.3 
# 2          4.9         3            1.4         0.2 setosa        3.95
# 3          4.7         3.2          1.3         0.2 setosa        3.95
# 4          4.6         3.1          1.5         0.2 setosa        3.85
# 5          5           3.6          1.4         0.2 setosa        4.3 
# ... with 145 more rows

#29

A little off-topic, but if there's work being done on tidyeval learning resources, I'd love a cheat sheet. I often feel like I read through resources and grok the differences between different functions, but then it all falls out of my head later. Being able to just see which basic data types go in and out of each function would be really helpful :slightly_smiling_face:


#30

Have a look at https://www.rstudio.com/resources/cheatsheets/ :wink:


#31

Okay, yeah, that's on me :laughing:


#32

I'm not sure how interesting it is, but I often make some quick functions that are essentially boilerplate code to help with exploratory analysis, especially plotting e.g.:

library(ggplot2)
library(rlang)   # for enquo() and !!

myhist <- function(.data, field, binwidth = 50) {
    ggplot(.data, aes(!!enquo(field))) +
        geom_histogram(binwidth = binwidth)
}
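
For example (using mtcars here just for illustration):

myhist(mtcars, mpg, binwidth = 2)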

I can then (re)make histograms (in this example) quickly, with different bins, and easily add other features I might want (e.g. a log10-scaled x-axis if the data are difficult to see). I've also made similar functions for other types of plot, as well as for numerical/statistical summaries of the data.


#33

Thanks for sharing! That's very much what I imagine the typical use of tidyeval to be.


#34

3 posts were split to a new topic: provocative question: Will tidyeval kill the tidyverse?


#37

For writing functions that use dplyr/DBI interface:

library(dplyr)

con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
DBI::dbWriteTable(con, "mtcars", mtcars)

filterFun <- function(db, tab, col, val){
    col <- rlang::enquo(col)

    tbl(db, tab) %>%
        filter(!!col > val) %>%
        collect()
}

filterFun(db = con,
          tab = "mtcars",
          col = cyl,
          val = 4)
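
The generated SQL can also be inspected before collecting, e.g.:

tbl(con, "mtcars") %>%
    filter(cyl > 4) %>%
    show_query()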
