tidyverse `tbl` object argument style guide

I'm currently writing some content for new R users and want to address the general paradigm that the first argument of tidyverse functions is a tbl, which supports the use of the pipe.

One thing I've noticed is that there seems to be quite a bit of variation in the naming convention of this first argument. I am wondering if there is some philosophy underlying this.

For example, within {dplyr}, select(), filter(), and mutate() use the argument name .data, whereas many of the suffixed versions of the verbs use .tbl, e.g. select_*(), arrange_*(), summarise_*(). Additionally, tally() and count() take the argument name x.

Looking at {tidyr}, pivot_wider() and pivot_longer() use the argument name data, as do the unnests, while hoist() uses .data.
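
For anyone who wants to check these names directly rather than read the help pages, here is a minimal sketch. `first_arg()` is a hypothetical helper written for this post, not part of any package; it just reads the first formal argument of a function. The dplyr part only runs if dplyr is installed, and what it prints depends on the installed version.

```r
# Hypothetical helper: name of a function's first formal argument.
# (Returns NULL for primitive functions, which have no formals.)
first_arg <- function(fn) names(formals(fn))[1]

# Base R generics are not consistent either.
base_examples <- c(
  head      = first_arg(head),
  subset    = first_arg(subset),
  transform = first_arg(transform)
)
print(base_examples)

# Same check for some dplyr verbs, guarded so the script still runs without dplyr.
if (requireNamespace("dplyr", quietly = TRUE)) {
  verbs <- c("select", "filter", "mutate", "count", "tally")
  dplyr_examples <- vapply(
    verbs,
    function(v) first_arg(getExportedValue("dplyr", v)),
    character(1)
  )
  print(dplyr_examples)
}
```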

Are these argument names just indicative of when the functions were written, and perhaps of who wrote them, and am I looking too hard for consistency and pattern?

I forget where I heard or read Hadley say "data first", but that has been a helpful way to explain the advantage of pipes to students. Put another way, writing a line of code is often a process of slowly figuring out what one wants to do as one is doing it. Am I going to look at the head(), str(), or glimpse()? Maybe I want a plot or a regression? Even without knowing the sequence of commands to follow, I can confidently begin writing my 'sentence' of code starting with the data. So, my bias would be for the first argument to be simply data. Though your question is about tidy style, this would also be consistent with two foundational base functions, plot() and lm(), which use data as the data.frame argument.
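
A small sketch of what "starting the sentence with the data" can look like, using only base R so nothing needs to be installed. It assumes R 4.1 or later for the native pipe `|>`; the built-in mtcars data set and the `kpl` column are just illustrative choices.

```r
# Start with the data, then decide what to look at.
n_rows <- mtcars |> head(3) |> nrow()

# subset() and transform() both take the data frame as their first argument,
# so steps can be added one at a time as the analysis takes shape.
heavy <- mtcars |>
  subset(wt > 3) |>
  transform(kpl = mpg * 0.425)  # rough miles-per-gallon to km-per-litre conversion

head(heavy)
```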

My understanding of these conventions:

  • .data vs. data: The in-progress tidyverse design guide suggests that if a function takes ..., its other arguments should be prefixed with a dot (.), so that they are less likely to clash with names the user passes through .... This explains why dplyr verbs like filter(), mutate(), etc. have .data, but the tidyr functions pivot_wider() and pivot_longer() have data.

  • .data vs. .tbl: It seems like tbl is accidental, e.g. one of the reasons given for retiring sample_n() is:

    The name of the first argument, tbl, is inconsistent with other single table verbs which use .data.

    The mutate_*(), summarise_*() functions are being retired in favor of across(), so that inconsistency should disappear.

    I'm not sure about count() and tally(). It seems like these should be .data for consistency, but changing them now is probably too disruptive.
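
To make the reasoning behind the dot prefix concrete, here is a minimal base R sketch. Both helpers are hypothetical and written only for this illustration; they are not dplyr or tidyr functions. Each adds columns passed through `...`, and the only difference is the name of the first argument.

```r
# Hypothetical helper: first argument is `data`, new columns arrive via `...`.
add_cols_plain <- function(data, ...) {
  stopifnot(is.data.frame(data))
  new_cols <- list(...)
  for (nm in names(new_cols)) data[[nm]] <- new_cols[[nm]]
  data
}

# Same helper, but with the first argument prefixed with a dot.
add_cols_dotted <- function(.data, ...) {
  stopifnot(is.data.frame(.data))
  new_cols <- list(...)
  for (nm in names(new_cols)) .data[[nm]] <- new_cols[[nm]]
  .data
}

df <- data.frame(id = 1:3)

# Creating a column called `data` collides with the formal argument `data`:
# the character vector is matched to `data`, and `df` ends up in `...`.
plain <- tryCatch(
  add_cols_plain(df, data = c("a", "b", "c")),
  error = function(e) e
)
inherits(plain, "error")
#> [1] TRUE

# With `.data` as the formal, the same call works and adds the column.
dotted <- add_cols_dotted(df, data = c("a", "b", "c"))
names(dotted)
#> [1] "id"   "data"
```

A collision is still possible if someone wants a column literally named `.data`, but that is much less likely than `data` or `x`.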

I've heard that the tidyverse docs will also start to use "data frame" to describe the type of object that can be passed to this argument, rather than "tbl" or "tibble".

I had never thought about this problem with dots before and now I'm obsessed with it :fearful:

I can't think of a case where it would happen, so maybe it's no big deal, but it feels like it'd still be pretty easy to have argument collision. But, then again, I don't think I've ever had that happen, so maybe not?