Good explanation of how to use scoped verbs?

dplyr
teaching

#1

Is there good documentation lying around somewhere on how to use dplyr’s scoped verbs, the ones like summarize_at() and mutate_each()? I’m thinking of a vignette sort of thing.

I get the general idea of how to use them, but could stand a solid walk through from somebody who actually knows what they’re talking about instead of me bumbling around the documentation!


#2

Have you seen what’s available via RDocumentation?


#3

Yeah, but that’s largely just reference material: it’s fine as technical documentation, but it doesn’t give me a good sense of how to use vars() and funs(), i.e. when I should and shouldn’t, or how to reference the data within funs(), etc.

I’m really hoping for something more like a vignette that explains when/how to use each and why the syntax is the way it is…


#4

Ah, I see what you’re saying. There is a short bit of information in the compatibility vignette - Deprecation of mutate_each() and summarise_each(). Probably not exactly what you’re looking for, but could be helpful.


#5

That is helpful. Not comprehensive, but starts to give me a handle on it!


#6

It would be really great if you could give some examples of what you’re having problems with, because to me the examples seem fine (but I’m obviously so steeped in it that I can’t see the problem).


#7

Oh but that reminds me that I did write a page giving more details for the class I teach at Stanford: https://dcl-2017-04.github.io/curriculum/manip-scoped.html


#8

Oh, wow–that page was really useful, pretty much exactly what I was looking for! In terms of things I didn’t understand till I read through it, I’d say I was struggling with:

  1. when (and why) to use funs() or not
  2. when (and why) to use vars()
  3. how exactly all the arguments should work (i.e. what is a predicate vs the actual function)
  4. how to write a predicate function for one of the _if functions that’s more complex than a single function (seems like every time I reach for these functions it’s because I’ve got some crazy, overly complex idea).
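The difference between the predicate and the function in points 3 and 4 can be sketched roughly as follows. This is an illustrative example only: the toy tibble and the helper is_big() are hypothetical names, not taken from the linked page. In the _if verbs the predicate is called once per column and must return a single TRUE or FALSE; the second function is the one actually applied to the selected columns.

```r
library(dplyr)

# Hypothetical toy data for illustration
df <- tibble(name = c("a", "b"), small = c(1, 2), big = c(100, 200))

# Predicate: is.numeric() decides WHICH columns are used.
# Function:  mean() is WHAT gets applied to those columns.
means <- summarise_if(df, is.numeric, mean)

# A more complex predicate is just an ordinary function that takes a
# column and returns one logical value (is_big is a made-up helper).
is_big <- function(col) is.numeric(col) && mean(col) > 10

# Only the column for which is_big() returns TRUE is halved
halved <- mutate_if(df, is_big, function(x) x / 2)
```

Because the predicate is a plain function, any amount of logic can go inside it as long as it returns a single logical value per column.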

If you were going to generalize that doc for broader use, the only feedback I’d give as I read through it was:

  1. Explicitly mention that you can use . within funs() to reference the column in the function
  2. Further explain ways to control how new columns will be named (particularly when using the _at functions)
  3. Is there a way to write predicate functions for the _if() functions using lambda functions? (Ah! a bit of experimentation reveals that you can use funs() and . there too!!)

Also, filter_all() with all_vars() or any_vars() seems awesome!
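For anyone curious, a minimal sketch of filter_all() with any_vars() and all_vars() might look like this. The scores tibble is made up for illustration; inside the helpers, . stands for each column in turn.

```r
library(dplyr)

# Hypothetical toy data for illustration
scores <- tibble(a = c(1, 5, 9), b = c(2, 6, 1), c = c(3, 7, 2))

# Keep rows where AT LEAST ONE column is greater than 6
any_big <- filter_all(scores, any_vars(. > 6))

# Keep rows where EVERY column is greater than 4
all_big <- filter_all(scores, all_vars(. > 4))
```

Here any_big keeps the second and third rows, while all_big keeps only the second.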


#9

To echo @crazybilly I have trouble wrapping my head around when I should use those calls or not.


#10

You only need them when you want to summarise multiple variables, or apply multiple functions, in one call.
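As a rough illustration using the built-in mtcars data: the scoped call below gives the same result as spelling out each column by hand. The second part assumes dplyr 0.8.0 or later, where a named list of functions can be passed in place of funs(); on older versions funs(mean, sd) would play that role.

```r
library(dplyr)

# One function, several variables: scoped verb vs. writing it out
by_cyl <- mtcars %>%
  group_by(cyl) %>%
  summarise_at(vars(mpg, hp), mean)

by_hand <- mtcars %>%
  group_by(cyl) %>%
  summarise(mpg = mean(mpg), hp = mean(hp))

# Several functions, several variables (assumes dplyr >= 0.8.0 for the
# named-list form); output columns are named <variable>_<function>
both <- mtcars %>%
  summarise_at(vars(mpg, hp), list(mean = mean, sd = sd))
```

With a single variable and a single function, plain summarise() is just as short, which is the point made above.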


#11

hey @crazybilly,

re: #1 & #2 when and why to use funs()/ vars() - if it is of any help here is a short example of how some of these things save me lots of time & typing.

In my daily workflow i get data coming in that has many columns, say 50 - and a subset of them (like 10) are dates and datetimes unfortunately coded as characters. Think of them as start date, end date, departure date, return date etc. I want to convert them to POSIX datetimes and maybe further extract days of week, days of month and similar features.

So first approach would be to do this separately for each char-date column

# the format is something like "2017-09-16 15:30:00"
my_format <- "%Y-%m-%d %H:%M:%S"

my_data %>%
  mutate(
    col1_date = as.POSIXct(col1_date, format = my_format),
    col2_date = as.POSIXct(col2_date, format = my_format),
    ...
    col10_date = as.POSIXct(col10_date, format = my_format)
  )

In this case it comes in really handy that one can simply do

my_data %>%
  mutate_at(
    vars(contains("date")),
    funs(posix = as.POSIXct),
    format = my_format
  )

and be done with all of them in one call.

One assumption that makes this easy is that all the char-date columns have “date” in their name, which keeps the vars() call simple. This might not always be the case in general, but it’s easy to rename such columns by selecting them by hand and appending “date” to their names, for example.

Note also the convenience that naming the function posix in the funs() call results in the new columns having “_posix” appended to their original names automatically.

Furthermore, to get a feature such as day of week for each of the new posix date columns, I could just supply vars(contains("posix")) and funs(wday = lubridate::wday) in another call and get all of their days of week in one go.
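Put together on toy data, the two steps could look something like the sketch below. Several things here are assumptions for illustration: the column names start_date and end_date are made up, tz = "UTC" is added so the result does not depend on the local time zone, the named-list form requires dplyr 0.8.0 or later (it stands in for funs()), and base R's weekdays() is swapped in for lubridate::wday to avoid an extra dependency.

```r
library(dplyr)

my_format <- "%Y-%m-%d %H:%M:%S"

# Hypothetical toy data: two character columns with "date" in the name
my_data <- tibble(
  id         = 1:2,
  start_date = c("2017-09-16 15:30:00", "2017-09-17 08:00:00"),
  end_date   = c("2017-09-18 10:00:00", "2017-09-19 12:45:00")
)

# Step 1: convert every "date" column; format and tz are passed on to
# as.POSIXct() through ... and new columns get the suffix "_posix"
converted <- my_data %>%
  mutate_at(
    vars(contains("date")),
    list(posix = as.POSIXct),
    format = my_format,
    tz = "UTC"
  )

# Step 2: derive a day-of-week column from each new "_posix" column
# (weekdays() is a base R stand-in for lubridate::wday)
with_wday <- converted %>%
  mutate_at(vars(contains("posix")), list(wday = weekdays))
```

The original character columns are kept, and each step adds one new column per matched column.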

Hope this small example helps a bit to show how practical these tools are.

cheers,
david


#12

Are you sure you mean as.POSIXlt()? That shouldn’t work inside mutate, and if it does it’s a bug in dplyr.


#13

ooops sorry. you are right POSIXlt is the list one and doesn’t work with tibbles. I corrected the examples to POSIXct and i removed the unnecessary underscore when naming the functions in funs().

Thanks for all the great packages and teaching Hadley - this community website is a great idea, I’ve learned so many new things already.


#14

My question was, I guess, a bit more in the weeds: why do you have to use vars() instead of just using select()-style calls?

Oh! I just looked at the source code for vars() which is just:

function (...) 
{
  quos(...)
}

So basically, you’re just one step out in an NSE sort of thing: you’re using vars() so mutate_at() knows where to look for the column names you pass it.

Looks like funs() operates on the same principle: use quos() to wrap the function in a quosure so it’s clear where it should be applied (i.e. to the original data frame).


#15

Keep in mind that select() uses the ... argument to allow multiple inputs. Since the mutate_at() family uses ... for additional arguments to the function(s), multiple inputs to .vars have to be wrapped somehow. Hence, vars(). Note also that vars() accepts the same style of arguments that select() does, such as starts_with().
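A small sketch with the built-in iris data may help show that vars() takes select()-style arguments. The specific column choices are only for illustration.

```r
library(dplyr)

# The same helper works in select() and inside vars()
sel <- iris %>% select(starts_with("Sepal")) %>% names()

maxes <- iris %>%
  summarise_at(vars(starts_with("Sepal")), max)

# Bare column names and helpers can be mixed inside vars(),
# just as they can in select()
mixed <- iris %>%
  summarise_at(vars(Species, ends_with("Width")), n_distinct)
```

In both cases the columns that vars() picks are the ones select() would pick with the same arguments.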


#16

The scoped helpers have three inputs, each of which can be of arbitrary length: variables, functions, and extra arguments. We needed some way to disambiguate between them, so we chose to make vars() and funs() explicit.

We could’ve also used an args() helper to put everything on an equal footing, but chose not to, since the scoped verbs are similar in spirit to the apply/map functions, which use ... for extra args.
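To make the three inputs concrete, here is a minimal sketch on a made-up tibble. It assumes dplyr 0.8.0 or later, where a named list of functions stands in for funs(). The variables go in vars(), the functions go in the list, and anything left over in ... (here na.rm = TRUE) is passed to every function.

```r
library(dplyr)

# Hypothetical toy data with some missing values
df <- tibble(
  x     = c(1, NA, 3),
  y     = c(4, 5, NA),
  label = c("a", "b", "c")
)

res <- summarise_at(
  df,
  vars(x, y),                       # input 1: variables
  list(avg = mean, med = median),   # input 2: functions
  na.rm = TRUE                      # input 3: extra args via ...
)
```

Both mean() and median() accept na.rm, which is why a single extra argument can be shared between them here.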


#17

Thanks @hadley! That post explained it very well. Now I’m sure these functions will save me a lot of typing =)

-Alex