Teaching dplyr functions which have base R equivalents

I do a fair bit of R instruction and lean heavily on the tidyverse. In extolling its virtues, I find myself pointing out advantages that are also present in base R functions. A good example would be mutate() and transform(). Both allow the user to create new columns using bare variable names (i.e. not df$Something or df[['Something']]). The function signatures and return values are basically identical. I never use transform() but that's largely a product of habit.
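
For instance, a quick toy sketch (illustrative data only) of just how similar the two calls look:

df <- data.frame(Something = 1:3)

# Both create a new column from a bare variable name
transform(df, Doubled = Something * 2)
dplyr::mutate(df, Doubled = Something * 2)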

I know it's impossible to cover everything, but I hate the idea that a student will encounter transform() and wonder whether they should favor one function or the other.

2 Likes

Hi @PirateGrunt,

My thoughts are that as long as the student is using piped commands to perform their analysis, they won't need to worry about alternate base R functions, because those are less likely to work in such a chain of piped functions. I would suggest that after introducing mutate() to students in this manner: mutate(df, something = 1), you get them used to writing df %>% mutate(something = 1) going forward. This way, piped commands such as df %>% mutate(something = 1) %>% select(something) become second nature to them, reducing the need to wonder whether an alternate base R function should be used, because every transformation is possible with tidyverse commands. Again, these are just my thoughts.

1 Like

Some functions like transform() do work in a piped operation. For example, the two operations below will produce (almost) the same output. (If dfInput is a tibble, the first will return a tibble, whereas the second won't, because transform() returns a plain data frame.)

dfResult <- dfInput %>%
    mutate(NewCol = OldCol + 1)

dfResult2 <- dfInput %>%
    transform(NewCol = OldCol + 1)

So what response do we give the student who asks "When should I use mutate() and when should I use transform()?" I'm leaning towards "They're both fine, but I use mutate(). mutate() will always work with piped operations, but I don't have the same guarantee with transform()." Is the argument any stronger than that?

I think it's a confusing but true fact of learning to code that there are often multiple ways of doing the same (or very similar) things, the nuances of which always sound wishy-washy until you run into them yourself! So much of my code from my earliest projects reflects a mashup of what was used in the examples I was looking at, and whatever happened to stick with me conceptually first (the reasons for which are still unknown to me — I mean, I'm sure I could post-hoc confabulate, but…).

There are absolutely advantages to using and sticking with a paradigm, but I don't think the way to do it is to pretend nothing else exists. That said, showing five different ways to do the same thing (especially when the words themselves are similar) is usually confusing for someone who's still just trying to wrap their heads around what's being done.

And then there's the Narcissism of Small Differences (Lacanian edition, complete with “choice of neurosis”), and the instinct to ferociously defend our choices because of a very basic cognitive dissonance.

No solid answers, just my thoughts. I'm not a teacher, so I don't have the experience to back up what does and does not land with others consistently.

5 Likes

I definitely agree that it's good to stick to one paradigm at the beginning, but also to mention other possibilities and make people a bit aware of them.

I usually start with the big picture in a mix of slides and command line, while letting them install the tidyverse in the background.

Big picture:

  • R has many packages (show some package universe graphic)
  • Some of them are always available, and the rest of R is built more or less on top of them. These packages are called the base packages. Unfortunately they are a bit quirky. (Show some syntax like iris[1:3, ] and some quirks like sample(4:4, 4, T); see the sketch after this list.) One should be aware of these quirks; they mostly exist for historical reasons and backward compatibility.
  • There is a collection of packages that share specific idioms and let you do most of the things the base packages do. The cool thing about these packages is that they hide those quirks, so you normally don't get strange surprises. They are also very consistent, and the names are intuitive. These packages are called the tidyverse.
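
A quick illustration of the sample() quirk mentioned above (my own toy example): when the first argument is a single number, sample() draws from 1:x rather than from x itself.

sample(4:4, 4, replace = TRUE)      # draws from 1:4, e.g. 2 3 1 4 -- not four 4s
sample(c(4, 5), 4, replace = TRUE)  # draws from {4, 5}, as you'd expect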

Then I go along in RStudio with interactive usage of dplyr and the pipe, ggplot2, lm() and the broom package, finishing by showing a picture of what they just did: the data science workflow.

All together: as a beginner, you need to think less about code and you see results more quickly when you start with the tidyverse. Many advanced users also stick with it. Of course you will meet base R and other packages along the way...

I think one usually learns a programming language twice. Once to set things up and get things done, and a second time to learn best practices.

I usually don't give courses, but I have introduced R many times and have had good experiences with this approach.

4 Likes

For the particular case of transform() and subset(), I don't think there's any particular point in teaching them, as the docs warn:

Warning

This is a convenience function intended for use interactively. For programming it is better to use the standard subsetting arithmetic functions, and in particular the non-standard evaluation of argument transform can have unanticipated consequences.

...which is why baseRs avoid them in favor of [ and [[ (and [<- and [[<-). [ and [[ are still very much worth learning for a tidyverse user, even if they could be avoided to an extent.
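
For what it's worth, a small sketch of the standard-evaluation style the docs recommend (my own toy examples):

iris[iris$Species == "setosa", c("Sepal.Length", "Species")]  # like subset()
iris[["Sepal.Width"]]                                          # extract one column
iris[["Area"]] <- iris$Sepal.Length * iris$Sepal.Width         # like transform()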

While mutate() and filter() on the surface behave like transform() and subset(), the former have a robust tidy evaluation NSE framework built in that's consistent with most of (and eventually all of) the rest of the tidyverse. While that won't matter at first to beginners, eventually they'll hit a hard case where it does.

3 Likes

It's hard to imagine getting very far with web data without them! :+1:

I have a small team of analysts who are new to R, and run short weekly workshops for them showing the basics of R. In doing so I show that there are many ways to do things, using base R versions in some cases, but with the main focus being the use of dplyr and associated tidyverse tools for taking base-level incident data and aggregating / summarising at a number of different geographical levels.

I agree entirely with Tazinho's point. For my team, R is a tool which is being used because it provides ways to do things that other tools we use cannot do. Over the past four years of using R, I have got used to the feeling of being lost at times with base R: the differences in the way even basic functions like the apply family work, opaque help files, learning by trying things out and failing miserably on many occasions. To keep it simple and show how things work, I now use dplyr, as all the base aggregation tasks my team need can in the main be built up using the consistent functions provided by the package.

In some ways it's more of a struggle for me to explain to my team why they must use a series of function calls such as select() %>% mutate() %>% group_by() %>% summarise() to do something which in SQL they would do in a single statement (e.g. SELECT a, left(b, 1) as b1, c, sum(d) as total FROM e GROUP BY ... etc), but that may reflect more on the vagaries of SQL than it does on dplyr.
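
For what it's worth, here is a rough sketch of how that SQL statement maps onto dplyr verbs (the table and column names are just the placeholders from the example, with a toy table so it runs):

library(dplyr)

# A toy table matching the placeholder names in the SQL above
e <- data.frame(a = 1, b = c("xy", "xz"), c = 2, d = 3:4)

# SELECT a, left(b, 1) AS b1, c, sum(d) AS total FROM e GROUP BY a, b1, c
e %>%
  mutate(b1 = substr(b, 1, 1)) %>%
  group_by(a, b1, c) %>%
  summarise(total = sum(d))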

For me, I use mutate in preference to transform, as it's the method provided by dplyr and is built specifically to work consistently with all the other operations dplyr provides. I use non-dplyr functions when there isn't one available in the dplyr or tidyverse set to do so, and find that in general piping works with them too. For consistency, and to make it easier for others to carry on my work if I'm unavailable, I simply stick with dplyr as much as I can.

1 Like

It's worth noting that mutate() allows you to use the variables you create within the same mutate() call. I randomly stumbled on this in the dplyr vignette:

dplyr::mutate() is similar to the base transform(), but allows you to refer to columns that you’ve just created:

mutate(flights,
  gain = arr_delay - dep_delay,
  gain_per_hour = gain / (air_time / 60)
)
4 Likes

That's a good one. I use that loads; I suppose I'd never thought about it being something that transform() doesn't have.

1 Like

I know that it's right there in the name, but does anyone have any notion about what these "unanticipated" consequences might be? The docs seem to suggest that transform()'s implementation of NSE is poorly done or poorly tested or both. Do we have an idea of how the NSE implementation for mutate() is different/better? I have faith that it is, but if faith were all I needed, I wouldn't have needed to learn logic :slight_smile:.

The NSE for mutate() (and most, if not all, of dplyr) at this point is all in rlang, which is discussed elsewhere and which I am definitely a fan of. If you haven't yet, I definitely recommend checking out the programming with dplyr vignette. As someone who programs with dplyr with some regularity, I definitely prefer tidyeval to the previous (lazyeval) implementation, which in turn I preferred to most of the base R approaches.
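
To make that concrete, here is a minimal sketch of the pattern from that vignette, wrapping mutate() in a function (the function and column names are my own):

library(dplyr)

# A wrapper that accepts a bare column name and programs against mutate()
add_doubled <- function(data, col) {
  col <- enquo(col)                  # capture the column as a quosure
  mutate(data, doubled = !!col * 2)  # unquote it inside mutate()
}

mtcars %>% add_doubled(mpg) %>% head()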

One important point that has not been noted is that dplyr now dispatches its verbs against back-ends other than local tibbles/data.frames (e.g. dbplyr, sparklyr, and the like). So although a database back-end may not be important to the student at present, it most certainly will be in research/industry. +1 for dplyr solutions as a result!

local_tibble %>% mutate() is great, but I love that I can execute on a database by using database_tibble %>% mutate(). More reading on that, if you're unfamiliar.
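
As an example, a minimal sketch using an in-memory SQLite database (my own toy setup, not from the links above):

library(dplyr)
library(dbplyr)

con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
DBI::dbWriteTable(con, "mtcars", mtcars)

remote_mtcars <- tbl(con, "mtcars")

# The same verb works on the remote table; dbplyr translates it to SQL
remote_mtcars %>%
  mutate(kpl = mpg * 0.425) %>%
  show_query()   # inspect the generated SQL

remote_mtcars %>%
  mutate(kpl = mpg * 0.425) %>%
  collect()      # run the query and bring the result back as a local tibble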

2 Likes

It might also be worth mentioning (here, not in a beginners' class :slight_smile:) that mutate can easily be implemented in a beautiful and creative way in base R, as is done in Hadley's plyr package.

I think it is one of the most beautiful exercises in Advanced R, and I hadn't seen the idea before: a loop over a list of expressions, changing the environment in which they are evaluated at each iteration.

See 10.3.3 in
https://bookdown.org/Tazinho/Advanced-R-Solutions/non-standard-evaluation.html
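
For the curious, here is a minimal sketch of that idea (my own simplification, not the book's solution): loop over the captured expressions, evaluating each one with the data frame as the environment and adding the result before moving on, so later expressions can see earlier ones.

simple_mutate <- function(df, ...) {
  exprs <- as.list(substitute(list(...)))[-1]   # capture unevaluated expressions
  for (nm in names(exprs)) {
    df[[nm]] <- eval(exprs[[nm]], df, parent.frame())
  }
  df
}

simple_mutate(head(iris),
              area = Sepal.Length * Sepal.Width,
              avg.dim = sqrt(area))   # avg.dim can refer to area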

I think it's mostly scoping worries, though transform is both laxer and stricter than mutate, and not always how you'd think, e.g.

library(magrittr)
set.seed(47)

some_data <- data.frame(i = 1:4)

# Recycling works...
some_data %>% transform(x = rnorm(2))
#>   i         x
#> 1 1 1.9946963
#> 2 2 0.7111425
#> 3 3 1.9946963
#> 4 4 0.7111425

# ...but not partial recycling
data.frame(i = 1:5) %>% transform(x = rnorm(2))
#> Error in data.frame(structure(list(i = 1:5), .Names = "i", row.names = c(NA, : arguments imply differing number of rows: 5, 2

# Referring to previously created variables doesn't work...
some_data %>% 
    transform(x = rnorm(2), 
              y = x)
#> Error in eval(substitute(list(...)), `_data`, parent.frame()): object 'x' not found

# ...unless they're in different calls
some_data %>% 
    transform(x = rnorm(2)) %>% 
    transform(y = x)
#>   i           x           y
#> 1 1 -0.98548216 -0.98548216
#> 2 2  0.01513086  0.01513086
#> 3 3 -0.98548216 -0.98548216
#> 4 4  0.01513086  0.01513086

x <- 2

# If a global variable by the same name exists, it will grab it even if there's an earlier one in the call...
some_data %>% 
    transform(x = rnorm(2), 
              y = x)
#>   i          x y
#> 1 1 -0.2520459 2
#> 2 2 -1.4657503 2
#> 3 3 -0.2520459 2
#> 4 4 -1.4657503 2

# ...unless the calls are separated, in which case the data frame `x` takes priority
some_data %>% 
    transform(x = rnorm(2)) %>% 
    transform(y = x)
#>   i           x           y
#> 1 1 -0.92245624 -0.92245624
#> 2 2  0.03960243  0.03960243
#> 3 3 -0.92245624 -0.92245624
#> 4 4  0.03960243  0.03960243

# If you want to get a global variable with the same name as a data frame variable, you have to tell it where to look...
some_data %>% 
    transform(x = rnorm(2),
              y = substitute(x, env = globalenv()))
#>   i          x y
#> 1 1  0.4938202 2
#> 2 2 -1.8282292 2
#> 3 3  0.4938202 2
#> 4 4 -1.8282292 2

# ...in which case it doesn't matter how you separate the calls
some_data %>% 
    transform(x = rnorm(2)) %>% 
    transform(y = substitute(x, env = globalenv()))
#>   i          x y
#> 1 1 0.09147291 2
#> 2 2 0.67077922 2
#> 3 3 0.09147291 2
#> 4 4 0.67077922 2

dplyr is stricter about recycling (only length-1 vectors), but handles scoping very similarly. When you're working with a data frame and environment you control, that behavior makes coding very quick. When the code will be operating on an arbitrary, unknown data frame or in an arbitrary environment, that behavior becomes risky for both transform and mutate (Could there be a similarly-named vector in a parent environment?), and quickly leads to either abandoning both in favor of safer [[ syntax or some gymnastic defensive coding.

So it's not that transform's NSE is more risky than mutate's (though it is less powerful; try setting a variable name to a stored string in transform), it's that dplyr prioritizes uses where the author knows what the data looks like (the vast majority of code) and accepts that the programming cases will require a more thorough knowledge of its NSE system, whereas transform just advises users to avoid it for programmatic usage, as controlling its NSE system effectively is significantly more of a pain than using [[ syntax.
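
As a quick illustration of that power gap (my own toy example, using the rlang := operator that dplyr supports):

library(dplyr)

new_name <- "x_doubled"

# dplyr lets you set a column name from a stored string...
mtcars %>% mutate(!!new_name := mpg * 2) %>% head()

# ...whereas transform() has no comparable mechanism; its left-hand side
# must be a literal column name.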

Since most of these programmatic cases will come when writing packages, and most of those cases will end up written in base R anyway to avoid a large dependency graph, the actual number of relevant cases for programming with dplyr is limited to packages that build on the tidyverse framework, e.g. tidytext. And occasionally people trying to operate on a terribly arranged data structure, though such an approach is rarely simpler than tidying first.

3 Likes

@jdlong To get that behavior with base R, I've often used within instead of transform. Except for column order of the output, the following are identical:

mutate(iris,
       area = Sepal.Length * Sepal.Width,
       avg.dim = sqrt(area)
)
within(iris, {
    area <- Sepal.Length * Sepal.Width
    avg.dim <- sqrt(area)
})

within will also work on a list though, while mutate won't. Sometimes that's handy.
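
A tiny illustration of that point (my own toy example):

params <- list(n = 10, p = 0.3)

# within() happily accepts a plain list...
within(params, expected <- n * p)

# ...whereas mutate() expects a data frame:
# dplyr::mutate(params, expected = n * p)   # errors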

1 Like

Lots of great points in this thread! I would also point out that base R functions that might seem equivalent to functions in the tidyverse may lack some of the "ecosystem advantage". For example, transform() has no awareness of grouped data frames, so the two examples below produce very different results:

df <- data.frame(x = 1:10, y = rep(c('a', 'b'), each = 5)) 

df %>%
  group_by(y) %>% 
  mutate(z = x / sum(x))

      x      y          z
   <int> <fctr>      <dbl>
 1     1      a 0.06666667
 2     2      a 0.13333333
 3     3      a 0.20000000
 4     4      a 0.26666667
 5     5      a 0.33333333
 6     6      b 0.15000000
 7     7      b 0.17500000
 8     8      b 0.20000000
 9     9      b 0.22500000
10    10      b 0.25000000

df %>%
  group_by(y) %>% 
  transform(z = x / sum(x))

    x y          z
1   1 a 0.01818182
2   2 a 0.03636364
3   3 a 0.05454545
4   4 a 0.07272727
5   5 a 0.09090909
6   6 b 0.10909091
7   7 b 0.12727273
8   8 b 0.14545455
9   9 b 0.16363636
10 10 b 0.18181818

I keep seeing this claim repeated, but usually without specific examples. NSE is a bit dicey in general, and rlang NSE has changed a lot since its introduction. rlang and base are different, but that doesn't immediately imply which one is better. Is base really poorly done, or is it just convenient for rlang that it be perceived that way?

For example, I have sat in on a class where it was claimed that with() is a hazard, but frankly with() usually works quite well.

This doesn't answer your question, but we are trying to clarify the relative "stability" (for lack of a better word) of packages by using lifecycle badges.

[Which, incidentally, you've just reminded me I need to write up more visibly. That'll be rolled out with links from package READMEs/sites/etc after this release of usethis, but I imagine it'd help to expand upon it a bit.]

3 Likes

Sorry, I didn't mean to harp on stability. Mostly I was thinking NSE interfaces tend to get complicated as they are forced to deal with more real-world corner cases. So a new interface is often going to look neater, regardless of what the future may end up being. So one doesn't want to get too excited about early comparisons (favorable or unfavorable).

From Augustine's laws
Law Number XVII: Software is like entropy. It is difficult to grasp, weighs nothing, and obeys the Second Law of Thermodynamics; i.e., it always increases.

2 Likes

I think it is better in the sense that interfaces developed with the tidy evaluation tools in rlang can be used interactively and can safely be programmed against. I've attempted to describe the dangers of (e.g.) base::subset() at https://adv-r.hadley.nz/evaluation.html#base-subset

OTOH I agree that most of these "dangers" are not encountered in everyday interactive use. But they make the functions terminal in the sense that it's ill-advised to wrap with(), subset(), etc. inside another function.
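
To illustrate the kind of wrapping problem that chapter describes, here is a condensed sketch along the same lines (the data and numbers are my own):

f1 <- function(df, y) {
  subset(df, x == y)
}

df <- data.frame(x = 1:3, y = 3:1)

# Intended: the rows where x == 3. Actual: `y` is found in the data frame
# first, so the condition becomes x == df$y and row 2 is returned instead.
f1(df, 3)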