Interesting tidy eval use cases

Hi @lionel,

This works really nicely for continuous outcomes, but not when I try to implement something similar for categorical outcomes. The issue is in gather(): it seems to convert the values to character instead of preserving factors, so complete() doesn't see the Value column as a factor. (This is probably straying from the discussion around tidy evaluation, though...)

library(tidyverse)
fac_mtcars <- mtcars %>%
  mutate_at(vars(cyl,carb), as.factor) %>%
  select(cyl, carb) %>%
  as_tibble()

table(fac_mtcars$cyl, fac_mtcars$carb)
#>    
#>     1 2 3 4 6 8
#>   4 5 6 0 0 0 0
#>   6 2 0 0 4 1 0
#>   8 0 4 3 6 0 1

## Doesn't honour the `complete` since Value is of type `character`
## rather than `factor`.
fac_mtcars %>%
  group_by(cyl) %>%
  gather("Variable", "Value", carb) %>%
  group_by(Variable, add = TRUE) %>%
  count(Value) %>%
  complete(Value, fill=list(n=0))
#> # A tibble: 9 x 4
#> # Groups:   cyl, Variable [3]
#>   cyl   Variable Value     n
#>   <fct> <chr>    <chr> <dbl>
#> 1 4     carb     1         5
#> 2 4     carb     2         6
#> 3 6     carb     1         2
#> 4 6     carb     4         4
#> 5 6     carb     6         1
#> 6 8     carb     2         4
#> 7 8     carb     3         3
#> 8 8     carb     4         6
#> 9 8     carb     8         1
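(A possible workaround, just a sketch on my part: convert Value back to a factor with the original levels before counting, so that complete() can expand it.)

fac_mtcars %>%
  group_by(cyl) %>%
  gather("Variable", "Value", carb) %>%
  mutate(Value = factor(Value, levels = levels(fac_mtcars$carb))) %>%
  group_by(Variable, add = TRUE) %>%
  count(Value) %>%
  complete(Value, fill = list(n = 0))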

I take it back! Capturing by action is definitely better in that case. Here's a rather elegant way to implement word_dict() by forwarding inputs to transmute() and making use of tibble::deframe() to transform a two-column data frame into a named vector. cc @jennybryan

word_dict <- function(data, word, score) {
  word <- enquo(word)
  score <- enquo(score)

  data %>%
    transmute(!!word, !!score) %>%
    tibble::deframe()
}

This implementation allows simple selections as before:

test_data %>% word_dict(letter, number)
#>     a     b     c     d     e     f     g     h     i     j     k     l
#>  TRUE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE
#>     m     n     o     p     q     r     s     t     u     v     w     x
#> FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE
#>     y     z
#> FALSE FALSE

However, because it now captures actions, you can transform the vectors on the fly:

test_data %>% word_dict(toupper(letter), number * 10)
#>  A  B  C  D  E  F  G  H  I  J  K  L  M  N  O  P  Q  R  S  T  U  V  W  X  Y  Z
#> 10  0  0  0 10  0  0  0 10  0  0  0  0  0 10  0  0  0  0  0 10  0  0  0  0  0

Here's an example that converts user input to partial Stan code, allowing you to unquote variables and expressions (disclaimer: I'm not familiar with Stan at all, so I don't know how useful this is in practice :wink:). I think tidy eval is particularly interesting when it's used as a bridge between some DSL (or other language) and R code.

library(rlang)

generate_stan_model_code <- function(x, y) {
  quos <- enquos(x = x, y = y, .ignore_empty = "all")
  labels <- paste(names(quos), "~", purrr::map_chr(quos, as_label))
  labels <- paste0("    ", labels, ";", collapse = "\n")
  glue::glue("
parameters {
    real<lower=0,upper=1> x;
    real<lower=0,upper=1> y;
}
model {
{{labels}}
}",
    .open = "{{", .close = "}}"
  )
}

generate_stan_model_code(beta(1, 1))
#> parameters {
#>     real<lower=0,upper=1> x;
#>     real<lower=0,upper=1> y;
#> }
#> model {
#>     x ~ beta(1, 1);
#> }
generate_stan_model_code(beta(1, 1), normal(0, 100))
#> parameters {
#>     real<lower=0,upper=1> x;
#>     real<lower=0,upper=1> y;
#> }
#> model {
#>     x ~ beta(1, 1);
#>     y ~ normal(0, 100);
#> }

mu <- 0.1
generate_stan_model_code(normal(!!mu, 1), normal(!!mu, 100))
#> parameters {
#>     real<lower=0,upper=1> x;
#>     real<lower=0,upper=1> y;
#> }
#> model {
#>     x ~ normal(0.1, 1);
#>     y ~ normal(0.1, 100);
#> }

get_mu <- function() runif(1)
generate_stan_model_code(normal(!!get_mu(), 1), normal(!!get_mu(), 100))
#> parameters {
#>     real<lower=0,upper=1> x;
#>     real<lower=0,upper=1> y;
#> }
#> model {
#>     x ~ normal(0.994106655474752, 1);
#>     y ~ normal(0.661887304158881, 100);
#> }

Created on 2019-01-10 by the reprex package (v0.2.1)


I think base::deparse() would be more appropriate than as_label() here. The latter will simplify complex expressions. You'll have to squash the quosure before deparsing. Though I would just capture with enexprs() instead of enquos(), since there's no evaluation going on.
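Concretely, that suggestion might look like this (a sketch of my reading, not tested against complex input):

generate_stan_model_code <- function(x, y) {
  exprs <- rlang::enexprs(x = x, y = y, .ignore_empty = "all")
  labels <- paste(names(exprs), "~", purrr::map_chr(exprs, deparse))
  labels <- paste0("    ", labels, ";", collapse = "\n")
  glue::glue("
parameters {
    real<lower=0,upper=1> x;
    real<lower=0,upper=1> y;
}
model {
{{labels}}
}",
    .open = "{{", .close = "}}"
  )
}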


Thanks, true. When we want a more nested DSL, quosures may be needed, but I can't come up with a good example...

I've been using this to tidy up data that has subheaders embedded in a data variable. The subheaders are matched with regex and put into their own variable.

The function:

untangle2 <- function(df, regex, orig, new) {
  orig <- dplyr::enquo(orig)
  new <- dplyr::ensym(new)
  to_fill <- dplyr::mutate(
    df,
    !!new := dplyr::if_else(grepl(regex, !!orig), !!orig, NA_character_)
  )
  dffilled <- tidyr::fill(to_fill, !!new)
  dplyr::filter(dffilled, !grepl(regex, !!orig))
}

Example usage:

dat <- tibble::tibble(
  site = c("Wet Season", "a", "b", "Dry Season", "a", "b"),
  rain = c(NA, 52, 41, NA, 12, 9)
)

dat %>% untangle2("Season", site, Season)
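The expected result (four data rows, each carrying its former subheader):

#> # A tibble: 4 x 3
#>   site   rain Season    
#>   <chr> <dbl> <chr>     
#> 1 a        52 Wet Season
#> 2 b        41 Wet Season
#> 3 a        12 Dry Season
#> 4 b         9 Dry Season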

I often find myself wanting to use tidyselect helpers to specify a series of columns for which I want to calculate a rowwise sum or mean. I use the below to wrangle rowSums() or rowMeans() into accepting tidyselect helpers. I'm sure it's a bit hacky, but it seems to do the trick!

(I also added a .value argument so I can specify the name of the output column.)

library(tidyverse)

tidyselect_row_sums <- function(.data, ..., .value = "row_sum", na.rm = FALSE) {
  dots <- exprs(...)
  value <- sym(.value)
  cols <- select(.data, !!!dots)
  mutate(.data, !!value := rowSums(cols, na.rm = na.rm))
}
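
The example below calls a row-means variant that isn't shown in the post; presumably it's defined the same way with rowMeans() swapped in (my sketch):

tidyselect_row_means <- function(.data, ..., .value = "row_mean", na.rm = FALSE) {
  dots <- exprs(...)
  value <- sym(.value)
  cols <- select(.data, !!!dots)
  mutate(.data, !!value := rowMeans(cols, na.rm = na.rm))
}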

iris %>%
  tidyselect_row_means(starts_with("Sepal"), .value = "Sepal.Mean")
# A tibble: 150 x 6
#   Sepal.Length Sepal.Width Petal.Length Petal.Width Species Sepal.Mean
#          <dbl>       <dbl>        <dbl>       <dbl> <fct>        <dbl>
# 1          5.1         3.5          1.4         0.2 setosa        4.3 
# 2          4.9         3            1.4         0.2 setosa        3.95
# 3          4.7         3.2          1.3         0.2 setosa        3.95
# 4          4.6         3.1          1.5         0.2 setosa        3.85
# 5          5           3.6          1.4         0.2 setosa        4.3 
# ... with 145 more rows

A little off-topic, but if there's work being done on tidyeval learning resources, I'd love a cheat sheet. I often feel like I read through resources and grok the differences between different functions, but then it all falls out of my head later. Being able to just see which basic data types go in and out of each function would be really helpful :slightly_smiling_face:


Have a look at https://www.rstudio.com/resources/cheatsheets/ :wink:


Okay, yeah, that's on me :laughing:

I'm not sure how interesting it is, but I often make quick functions that are essentially boilerplate code to help with exploratory analysis, especially plotting, e.g.:

library(tidyverse)

myhist <- function(.data, field, binwidth = 50) {
  ggplot(.data, aes(!!enquo(field))) +
    geom_histogram(binwidth = binwidth)
}

I can then (re)make histograms (in this example) quickly, with different bins, and easily add other features I might want (e.g. a log10-scaled x-axis if the data are hard to see). I've also made similar functions for other types of plot, as well as numerical/statistical summaries of the data.
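
For instance (hypothetical calls, with mtcars as a stand-in dataset):

myhist(mtcars, disp)                 # default binwidth of 50
myhist(mtcars, disp, binwidth = 25)  # narrower bins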


Thanks for sharing! That's very much what I imagine the typical use of tidyeval to be.

3 posts were split to a new topic: provocative question: Will tidyeval kill the tidyverse?

For writing functions that use the dplyr/DBI interface:

library(dplyr)

con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
DBI::dbWriteTable(con, "mtcars", mtcars)

filterFun <- function(db, tab, col, val) {
  col <- rlang::enquo(col)

  tbl(db, tab) %>%
    filter(!!col > val) %>%
    collect()
}

filterFun(db = con,
          tab = "mtcars",
          col = cyl,
          val = 4)
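# collects the 21 rows of mtcars where cyl > 4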

I was working on a Bayesian nonlinear model. I wanted to quickly plot samples from the prior predictive distribution, so I could see what kinds of curves were plausible (before the model saw the data).

I wrote a function that takes sampling statements like asymptote = rbeta(n, 10, 20), where n is another argument to the function. The nonstandard evaluation in this case captures the sampling statement and plugs in the value of n.

plot_priors <- function(x, n, asymptote = rbeta(n, 10, 20),
                        midpoint = rnorm(n, 0, 1),
                        scale = abs(rnorm(n, .1, .025))) {
  # ...
}

# example use
plot_priors(
  x = 0:54,
  n = 40,
  midpoint = rnorm(n, 2, 2),
  scale = rnorm(n, 10, 5))
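
The body is elided above; a minimal sketch of the capture-and-evaluate step (my guess at the mechanism, not the code from the Rpubs demo):

draw_prior <- function(n, statement = rbeta(n, 10, 20)) {
  statement <- rlang::enquo(statement)
  # evaluate the captured sampling statement with `n` supplied via a data mask
  rlang::eval_tidy(statement, data = list(n = n))
}

draw_prior(3, rnorm(n, 2, 2))  # three draws from Normal(2, 2)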

I posted a demo of the function and plots on Rpubs.


My most convoluted attempt is:

add_first_by_row <- function(df, cols, new_name, predicate) {
  cond <- as_mapper(predicate)
  condq <- enexpr(cond)
  new_name <- enexpr(new_name) %>% as.character()
  arg_names <- c(cols, "...")
  col_syms <- syms(cols)

  args <- rep(0, times = length(cols) + 1) %>% set_names(arg_names)
  new_fun <- new_function(
    args,
    expr(c(!!!col_syms) %>% {.[(!!condq)(.)]} %>% .[1])
  )

  df %>% mutate(!!new_name := pmap_dbl(df, new_fun))
}

which allows the user to find the first value in a row, from a selection of columns, that meets a predicate function (originally my colleague wanted a way to find the first non-NA value, but I figured you could generalise the predicate).

So, you could do:

df = tibble(a = c("b", "d", "l", "m"), x = c(NA,NA, 1, 3), y = c(2,NA, 2, 3), z = c(1,2, 0, NA))
vars <- c("x", "y", "z")
df %>% add_first_by_row(vars, new_column, ~ . > 1)
# A tibble: 4 x 5
  a         x     y     z new_column
  <chr> <dbl> <dbl> <dbl>      <dbl>
1 b        NA     2     1         NA
2 d        NA    NA     2         NA
3 l         1     2     0          2
4 m         3     3    NA          3

I use tidy eval a lot, but I often wonder whether it's really needed in my use cases. Then again, if you feel comfortable with it, why prefer "standard" eval to "non-standard"?

Here are a couple we used during our Big Data class:

library(dplyr)
library(purrr)

my_mean <- function(.data, x) {
  x <- enquo(x)
  summarise(
    .data,
    !!quo_name(x) := mean(!!x, na.rm = TRUE)
  )
}

de_select <- function(.data, ...) {
  vars <- enquos(...)
  vars <- map(vars, ~ expr(-!!.x))
  select(
    .data,
    !!!vars
  )
}
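
Hypothetical usage (my own examples, not from the class materials):

mtcars %>% my_mean(mpg)          # one-row summary with a column named "mpg"
mtcars %>% de_select(mpg, cyl)   # drops mpg and cyl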

I think your function feels so complex because you're attacking the problem at the slightly wrong level. I think it's easier if you think about writing a vectorised function that you can use in concert with mutate():

library(tidyverse)

df <- tibble(
  a = c("b", "d", "l", "m"), 
  x = c(NA, NA, 1, 3), 
  y = c(2, NA, 2, 3), 
  z = c(1, 2, 0, NA)
)

first_match <- function(.f, ...) {
  tibble(...) %>% 
    transpose() %>% 
    map_dbl(~ purrr::detect(.x, .f) %||% NA)
}

df %>% mutate(new_col = first_match(~ . > 1, x, y, z))  
#> # A tibble: 4 x 5
#>   a         x     y     z new_col
#>   <chr> <dbl> <dbl> <dbl>   <dbl>
#> 1 b        NA     2     1       2
#> 2 d        NA    NA     2       2
#> 3 l         1     2     0       2
#> 4 m         3     3    NA       3

We're planning on thinking about ways for first_match() to have selection semantics, so you could simplify further to first_match(~ . > 1, x:z).


Ah, transpose, yes I hadn't thought of that. I knew there was a better name out there too, ha! Thanks for the feedback.

One more...

Sometimes you might have rows in your data which are duplicated in some columns, but not others, and you might have a good reason for wanting to merge the un-duplicated rows. Then you could use this function:

concat_groups <- function(data, group_var, cat_var) {
  group_varq <- enquo(group_var)
  cat_varq <- enquo(cat_var)
  cat_var <- as.character(cat_varq) %>% .[2] %>% paste0("s")

  data %>%
    group_by(!!group_varq) %>%
    summarise(!!cat_var := !!cat_varq %>% unique() %>% str_c(collapse = "/"))
}

So, a practical example might be:

site_info <- tibble(
  site = c("A", "B", "B", "C"),
  category = c("school", "office", "shop", "field")
)
site_info %>% concat_groups(site, category)
# A tibble: 3 x 2
  site  categorys  
  <chr> <chr>      
1 A     school     
2 B     office/shop
3 C     field

The function name probably needs work, though...