Interesting tidy eval use cases

Hi @lionel,

This works really nicely for continuous outcomes, but not when I try to implement something similar for categorical outcomes. The issue is in gather(): it seems to convert the values to character instead of preserving factors, so complete() doesn't see the Value column as a factor. (This is probably straying from the discussion around tidy evaluation, though...)

library(tidyverse)
fac_mtcars <- mtcars %>%
  mutate_at(vars(cyl,carb), as.factor) %>%
  select(cyl, carb) %>%
  as_tibble()

table(fac_mtcars$cyl, fac_mtcars$carb)
#>    
#>     1 2 3 4 6 8
#>   4 5 6 0 0 0 0
#>   6 2 0 0 4 1 0
#>   8 0 4 3 6 0 1

## Doesn't honour the `complete` since Value is of type `character`
## rather than `factor`.
fac_mtcars %>%
  group_by(cyl) %>%
  gather("Variable", "Value", carb) %>%
  group_by(Variable, add = TRUE) %>%
  count(Value) %>%
  complete(Value, fill=list(n=0))
#> # A tibble: 9 x 4
#> # Groups:   cyl, Variable [3]
#>   cyl   Variable Value     n
#>   <fct> <chr>    <chr> <dbl>
#> 1 4     carb     1         5
#> 2 4     carb     2         6
#> 3 6     carb     1         2
#> 4 6     carb     4         4
#> 5 6     carb     6         1
#> 6 8     carb     2         4
#> 7 8     carb     3         3
#> 8 8     carb     4         6
#> 9 8     carb     8         1
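(A possible workaround, just a sketch on my part: convert Value back to a factor with the original levels before counting, so that complete() can expand it.)

fac_mtcars %>%
  group_by(cyl) %>%
  gather("Variable", "Value", carb) %>%
  mutate(Value = factor(Value, levels = levels(fac_mtcars$carb))) %>%
  group_by(Variable, add = TRUE) %>%
  count(Value) %>%
  complete(Value, fill = list(n = 0))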

I take it back! Capturing by action is definitely better in that case. Here's a rather elegant way to implement word_dict() by forwarding inputs to transmute() and making use of tibble::deframe() to transform a two-column data frame into a named vector. cc @jennybryan

word_dict <- function(data, word, score) {
  word <- enquo(word)
  score <- enquo(score)

  data %>%
    transmute(!!word, !!score) %>%
    tibble::deframe()
}

This implementation allows simple selections as before:

test_data %>% word_dict(letter, number)
#>     a     b     c     d     e     f     g     h     i     j     k     l
#>  TRUE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE
#>     m     n     o     p     q     r     s     t     u     v     w     x
#> FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE
#>     y     z
#> FALSE FALSE

However, because it now captures actions, you can transform the vectors on the fly:

test_data %>% word_dict(toupper(letter), number * 10)
#>  A  B  C  D  E  F  G  H  I  J  K  L  M  N  O  P  Q  R  S  T  U  V  W  X  Y  Z
#> 10  0  0  0 10  0  0  0 10  0  0  0  0  0 10  0  0  0  0  0 10  0  0  0  0  0

Here's an example that converts user input to partial Stan code, allowing you to unquote variables and expressions (disclaimer: I'm not familiar with Stan at all, so I don't know how useful this is in practice :wink:). I think tidy eval is particularly interesting when it's used as a bridge between some DSL (or other language) and R code.

library(rlang)

generate_stan_model_code <- function(x, y) {
  quos <- enquos(x = x, y = y, .ignore_empty = "all")
  labels <- paste(names(quos), "~", purrr::map_chr(quos, as_label))
  labels <- paste0("    ", labels, ";", collapse = "\n")
  glue::glue("
parameters {
    real<lower=0,upper=1> x;
    real<lower=0,upper=1> y;
}
model {
{{labels}}
}",
    .open = "{{", .close = "}}"
  )
}

generate_stan_model_code(beta(1, 1))
#> parameters {
#>     real<lower=0,upper=1> x;
#>     real<lower=0,upper=1> y;
#> }
#> model {
#>     x ~ beta(1, 1);
#> }
generate_stan_model_code(beta(1, 1), normal(0, 100))
#> parameters {
#>     real<lower=0,upper=1> x;
#>     real<lower=0,upper=1> y;
#> }
#> model {
#>     x ~ beta(1, 1);
#>     y ~ normal(0, 100);
#> }

mu <- 0.1
generate_stan_model_code(normal(!!mu, 1), normal(!!mu, 100))
#> parameters {
#>     real<lower=0,upper=1> x;
#>     real<lower=0,upper=1> y;
#> }
#> model {
#>     x ~ normal(0.1, 1);
#>     y ~ normal(0.1, 100);
#> }

get_mu <- function() runif(1)
generate_stan_model_code(normal(!!get_mu(), 1), normal(!!get_mu(), 100))
#> parameters {
#>     real<lower=0,upper=1> x;
#>     real<lower=0,upper=1> y;
#> }
#> model {
#>     x ~ normal(0.994106655474752, 1);
#>     y ~ normal(0.661887304158881, 100);
#> }

Created on 2019-01-10 by the reprex package (v0.2.1)


I think base::deparse() would be more appropriate than as_label() here. The latter will simplify complex expressions. You'll have to squash the quosure before deparsing. Though I would just capture with enexprs() instead of enquos(), since there's no evaluation going on.
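Concretely, that suggestion might look like this (a sketch of my reading, not tested against complex input):

generate_stan_model_code <- function(x, y) {
  exprs <- rlang::enexprs(x = x, y = y, .ignore_empty = "all")
  labels <- paste(names(exprs), "~", purrr::map_chr(exprs, deparse))
  labels <- paste0("    ", labels, ";", collapse = "\n")
  glue::glue("
parameters {
    real<lower=0,upper=1> x;
    real<lower=0,upper=1> y;
}
model {
{{labels}}
}",
    .open = "{{", .close = "}}"
  )
}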


Thanks, true. When we want a more nested DSL, quosures may be needed, but I can't come up with a good example...

I've been using this to tidy up data that has subheaders embedded in a data variable. The subheaders are matched with regex and put into their own variable.

The function:

untangle2 <- function(df, regex, orig, new) {
  orig <- dplyr::enquo(orig)
  new <- dplyr::ensym(new)
  to_fill <- dplyr::mutate(
    df,
    !!new := dplyr::if_else(grepl(regex, !!orig), !!orig, NA_character_)
  )
  dffilled <- tidyr::fill(to_fill, !!new)
  dplyr::filter(dffilled, !grepl(regex, !!orig))
}

Example usage:

dat <- tibble::tibble(
  site = c("Wet Season", "a", "b", "Dry Season", "a", "b"),
  rain = c(NA, 52, 41, NA, 12, 9)
)

dat %>% untangle2("Season", site, Season)
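The expected result (four data rows, each carrying its former subheader):

#> # A tibble: 4 x 3
#>   site   rain Season    
#>   <chr> <dbl> <chr>     
#> 1 a        52 Wet Season
#> 2 b        41 Wet Season
#> 3 a        12 Dry Season
#> 4 b         9 Dry Season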

I often find myself wanting to use tidyselect helpers to specify a series of columns for which I want to calculate a rowwise sum or mean. I use the below to wrangle rowSums() or rowMeans() into accepting tidyselect helpers. I'm sure it's a bit hacky, but it seems to do the trick!

(I also added a .value argument so I can specify the name of the output column.)

library(tidyverse)

tidyselect_row_sums <- function(.data, ..., .value = "row_sum", na.rm = FALSE) {
  dots <- exprs(...)
  value <- sym(.value)
  cols <- select(.data, !!!dots)
  mutate(.data, !!value := rowSums(cols, na.rm = na.rm))
}
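
The example below calls a row-means variant that isn't shown in the post; presumably it's defined the same way with rowMeans() swapped in (my sketch):

tidyselect_row_means <- function(.data, ..., .value = "row_mean", na.rm = FALSE) {
  dots <- exprs(...)
  value <- sym(.value)
  cols <- select(.data, !!!dots)
  mutate(.data, !!value := rowMeans(cols, na.rm = na.rm))
}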

iris %>%
  tidyselect_row_means(starts_with("Sepal"), .value = "Sepal.Mean")
# A tibble: 150 x 6
#   Sepal.Length Sepal.Width Petal.Length Petal.Width Species Sepal.Mean
#          <dbl>       <dbl>        <dbl>       <dbl> <fct>        <dbl>
# 1          5.1         3.5          1.4         0.2 setosa        4.3 
# 2          4.9         3            1.4         0.2 setosa        3.95
# 3          4.7         3.2          1.3         0.2 setosa        3.95
# 4          4.6         3.1          1.5         0.2 setosa        3.85
# 5          5           3.6          1.4         0.2 setosa        4.3 
# ... with 145 more rows

A little off-topic, but if there's work being done on tidyeval learning resources, I'd love a cheat sheet. I often feel like I read through resources and grok the differences between different functions, but then it all falls out of my head later. Being able to just see which basic data types go in and out of each function would be really helpful :slightly_smiling_face:


Have a look at https://www.rstudio.com/resources/cheatsheets/ :wink:


Okay, yeah, that's on me :laughing:

I'm not sure how interesting it is, but I often make quick functions that are essentially boilerplate code to help with exploratory analysis, especially plotting, e.g.:

library(tidyverse)

myhist <- function(.data, field, binwidth = 50) {
  ggplot(.data, aes(!!enquo(field))) +
    geom_histogram(binwidth = binwidth)
}

I can then (re)make histograms (in this example) quickly, with different bins, and easily add other features I might want (e.g. a log10-scaled x-axis if the data are hard to see). I've also made similar functions for other types of plot, as well as numerical/statistical summaries of the data.
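
For instance (hypothetical calls, with mtcars as a stand-in dataset):

myhist(mtcars, disp)                 # default binwidth of 50
myhist(mtcars, disp, binwidth = 25)  # narrower bins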


Thanks for sharing! That's very much what I imagine the typical use of tidyeval to be.

3 posts were split to a new topic: provocative question: Will tidyeval kill the tidyverse?

For writing functions that use the dplyr/DBI interface:

library(dplyr)

con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
DBI::dbWriteTable(con, "mtcars", mtcars)

filterFun <- function(db, tab, col, val) {
  col <- rlang::enquo(col)

  tbl(db, tab) %>%
    filter(!!col > val) %>%
    collect()
}

filterFun(db = con,
          tab = "mtcars",
          col = cyl,
          val = 4)
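# collects the 21 rows of mtcars where cyl > 4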

I was working on a Bayesian nonlinear model. I wanted to quickly plot samples from the prior predictive distribution, so I could see what kinds of curves were plausible (before the model saw the data).

I wrote a function that takes sampling statements like asymptote = rbeta(n, 10, 20), where n is another argument to the function. The nonstandard evaluation in this case captures the sampling statement and plugs in the value of n.

plot_priors <- function(x, n, asymptote = rbeta(n, 10, 20),
                        midpoint = rnorm(n, 0, 1),
                        scale = abs(rnorm(n, .1, .025))) {
  # ...
}

# example use
plot_priors(
  x = 0:54,
  n = 40,
  midpoint = rnorm(n, 2, 2),
  scale = rnorm(n, 10, 5))
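
The body is elided above; a minimal sketch of the capture-and-evaluate step (my guess at the mechanism, not the code from the Rpubs demo):

draw_prior <- function(n, statement = rbeta(n, 10, 20)) {
  statement <- rlang::enquo(statement)
  # evaluate the captured sampling statement with `n` supplied via a data mask
  rlang::eval_tidy(statement, data = list(n = n))
}

draw_prior(3, rnorm(n, 2, 2))  # three draws from Normal(2, 2)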

I posted a demo of the function and plots on Rpubs.


My most convoluted attempt is:

add_first_by_row <- function(df, cols, new_name, predicate) {
  cond <- as_mapper(predicate)
  condq <- enexpr(cond)
  new_name <- enexpr(new_name) %>% as.character()
  arg_names <- c(cols, "...")
  col_syms <- syms(cols)

  args <- rep(0, times = length(cols) + 1) %>% set_names(arg_names)
  new_fun <- new_function(
    args,
    expr(c(!!!col_syms) %>% {.[(!!condq)(.)]} %>% .[1])
  )

  df %>% mutate(!!new_name := pmap_dbl(df, new_fun))
}

which allows the user to find the first value in a row, from a selection of columns, that meets a predicate function (originally my colleague wanted a way to find the first non-NA value, but I figured you could generalise the predicate).

So, you could do:

df = tibble(a = c("b", "d", "l", "m"), x = c(NA,NA, 1, 3), y = c(2,NA, 2, 3), z = c(1,2, 0, NA))
vars <- c("x", "y", "z")
df %>% add_first_by_row(vars, new_column, ~ . > 1)
# A tibble: 4 x 5
  a         x     y     z new_column
  <chr> <dbl> <dbl> <dbl>      <dbl>
1 b        NA     2     1         NA
2 d        NA    NA     2         NA
3 l         1     2     0          2
4 m         3     3    NA          3

I use tidy eval a lot, but I often wonder whether it's really needed in my use cases. Then again, if you feel comfortable with it, why prefer "standard" eval to "non-standard"?

Here are a couple we used during our Big Data class:

library(dplyr)
library(purrr)

my_mean <- function(.data, x) {
  x <- enquo(x)
  summarise(
    .data,
    !!quo_name(x) := mean(!!x, na.rm = TRUE)
  )
}

de_select <- function(.data, ...) {
  vars <- enquos(...)
  vars <- map(vars, ~ expr(-!!.x))
  select(
    .data,
    !!!vars
  )
}
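
Hypothetical usage (my own examples, not from the class materials):

mtcars %>% my_mean(mpg)          # one-row summary with a column named "mpg"
mtcars %>% de_select(mpg, cyl)   # drops mpg and cyl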

I think your function feels so complex because you're attacking the problem at the slightly wrong level. I think it's easier if you think about writing a vectorised function that you can use in concert with mutate():

library(tidyverse)

df <- tibble(
  a = c("b", "d", "l", "m"), 
  x = c(NA, NA, 1, 3), 
  y = c(2, NA, 2, 3), 
  z = c(1, 2, 0, NA)
)

first_match <- function(.f, ...) {
  tibble(...) %>% 
    transpose() %>% 
    map_dbl(~ purrr::detect(.x, .f) %||% NA)
}

df %>% mutate(new_col = first_match(~ . > 1, x, y, z))  
#> # A tibble: 4 x 5
#>   a         x     y     z new_col
#>   <chr> <dbl> <dbl> <dbl>   <dbl>
#> 1 b        NA     2     1       2
#> 2 d        NA    NA     2       2
#> 3 l         1     2     0       2
#> 4 m         3     3    NA       3

We're planning on thinking about ways for first_match() to have selection semantics, so you could simplify further to first_match(~ . > 1, x:z).


Ah, transpose, yes I hadn't thought of that. I knew there was a better name out there too, ha! Thanks for the feedback.

One more...

Sometimes you might have rows in your data which are duplicated in some columns, but not others, and you might have a good reason for wanting to merge the un-duplicated rows. Then you could use this function:

concat_groups <- function(data, group_var, cat_var) {
  group_varq <- enquo(group_var)
  cat_varq <- enquo(cat_var)
  cat_var <- as.character(cat_varq) %>% .[2] %>% paste0("s")

  data %>%
    group_by(!!group_varq) %>%
    summarise(!!cat_var := !!cat_varq %>% unique() %>% str_c(collapse = "/"))
}

So, a practical example might be:

site_info <- tibble(
  site = c("A", "B", "B", "C"),
  category = c("school", "office", "shop", "field")
)
site_info %>% concat_groups(site, category)
# A tibble: 3 x 2
  site  categorys  
  <chr> <chr>      
1 A     school     
2 B     office/shop
3 C     field

The function name probably needs work, though...