Teaching: install/load packages individually, or use tidyverse package?

I teach a series of short workshops that cover R essentials, data manipulation with dplyr, tidying with tidyr, plotting with ggplot2, and other (bioinformatics) domain-specific topics.

I lead by asking students to install the tidyverse package and load it at the top of nearly every script. This conveniently loads dplyr, tidyr, readr, and ggplot2, but it introduces a complexity from the beginning -- newcomers are trying to wrestle with R, RStudio, understanding packages, and writing code for perhaps the first time. On top of all this, I then need to explain the tidyverse package as a kind of "meta-package" that conveniently installs and loads lots of other packages. And this further obscures the fact that they're using functions from specific packages: filter from dplyr, gather from tidyr, read_csv from readr, etc. You could argue that it doesn't matter in the beginning, but when I later teach other classes with Bioconductor packages, I run into a namespace issue where I have to explain that a student needs to use dplyr::filter() instead of the filter() that the Bioconductor package uses.
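To make the namespace issue concrete, here is a minimal sketch. It assumes dplyr is installed (the dplyr part is skipped otherwise), and the data frame and its column names are made up for illustration. It uses stats::filter() from base R as the "other" filter(); a Bioconductor package that defines its own filter() would clash in the same way.

```r
# stats::filter() ships with base R and does linear filtering of a series.
# dplyr::filter() subsets rows of a data frame. Same name, different jobs.
x <- c(1, 2, 3, 4, 5)

# Writing the namespace explicitly always gives the stats version,
# no matter which packages happen to be attached.
moving_avg <- stats::filter(x, rep(1 / 3, 3))
print(moving_avg)

if (requireNamespace("dplyr", quietly = TRUE)) {
  # Hypothetical toy data, not from any real workshop.
  df <- data.frame(gene = c("a", "b", "c"), count = c(10, 0, 25))

  # Writing dplyr::filter() always gives the dplyr version, even if a
  # package attached later defines its own filter().
  expressed <- dplyr::filter(df, count > 0)
  print(expressed)
}
```

The point for students is that pkg::fun() never depends on the order in which packages were attached, whereas a bare filter() does.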

Finally, I've seen a few cases where students using Windows have run into problems with installation/loading, getting that odd error message along the lines of Error : object 'as_factor' is not exported by 'namespace:forcats'.

My question is this: For beginners, is it better to teach installing/loading the tidyverse package, or installing/loading individual packages as needed?

7 Likes

I doubt it matters too much for installation one way or another since this is a one-time thing. However, for loading, my personal preference is definitely to teach loading them individually for a couple of reasons:

  • It's really not very hard or time consuming to do
  • It gets people in the habit that one will likely have to load multiple packages to do a project (and emphasizes you can use more than one)
  • It teaches parsimony -- why load things that aren't needed and might cause namespace conflicts?
  • It's important to know which functions come from which packages, particularly when debugging the conflicts that you explain
  • Most importantly, it develops greater appreciation for the individual laptop sticker symbols :smile:
13 Likes

I have had this exact same conversation with a few others recently; it's certainly an important consideration.

On one hand, if you use tidyverse you do not need to worry about which package each function belongs to; however, finding help online can be harder without this piece of information. So one suggestion would be to teach students how to look in the upper left-hand corner of a help page to see which package a particular function belongs to.
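Besides the help page, the same information can be pulled out in code. A small sketch using only base R; the helper name which_package is made up for illustration:

```r
# Which attached packages define something called "filter"?
# With several packages attached, this can return more than one entry.
print(find("filter"))

# Hypothetical helper: report the namespace a function object lives in.
which_package <- function(fn) {
  environmentName(environment(fn))
}

print(which_package(stats::filter))

# The help page shows the same thing in braces at the top left,
# e.g. "filter {stats}". Uncomment to open it interactively:
# ?filter
```

Note that this helper is only a rough guide: it works for ordinary functions defined in a package, but not for primitives such as sum(), which have no enclosing environment.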

On the other hand, introduction of tidyverse first requires discussion of a meta-package, and I find that burdensome on day one.

The approach I have been taking is:

  • For complete new useRs, I introduce the individual packages first (and this usually means that for a while they'll be using only a few of the packages within tidyverse, most likely only dplyr and ggplot2). We specifically load them and not tidyverse. Then, later (in the semester/workshop/whatever), when we need to add a few more of the packages to our workflow, I introduce tidyverse, almost as a shortcut, but also as an ecosystem.
  • For an audience that is familiar with what a package is, I'd go with introducing tidyverse from the beginning.

Alternatively: If their laptops are already covered in stickers, introduce tidyverse since it's only one sticker (h/t to @hadley for this very important consideration).

13 Likes

I agree with the other replies so far, but I would also add one more thought. For someone more experienced, tidyverse is convenient, but you only appreciate the convenience because you are tired of repeatedly loading the same packages one at a time. For a beginner, it's more confusing because it's not obvious what packages are being loaded and what they do. It introduces a solution to a problem they don't actually have yet.

4 Likes

I agree with Emily.

I wouldn't recommend teaching tidyverse for the same reason that I don't recommend using it: it loads too many things unnecessarily, and obscures where symbols come from.

I find it occasionally helpful to look at what other languages are doing. By doing so we find that some languages generate warnings, and some even errors, for references/libraries/modules that are loaded but never used. The consensus seems to be that you should only load what you actually use.

In teaching, specifically, we want to emphasise to students to be explicit and specific in what they tell the computer.

library(tidyverse) trades explicitness and specificity for a (very) minor convenience.

9 Likes

Does the guidance to teach students to be explicit and specific also imply that we should teach students to always disambiguate names by explicitly writing dplyr::filter, dplyr::select, etc., so they don't become confused later on if they import other packages and produce collisions in the global namespace?

1 Like

@jonathan-g I typically teach this when the issue first comes up. I'll make mention of it when loading a package and you get the mask warnings, but I don't typically show students how to do this until it's necessary.

1 Like

This is fantastic advice. I've been having students just load tidyverse but so far it's only been more confusing. It was helpful for installing everything at once, but I'm going to start having them load individual packages as needed now.

1 Like

Ideally, yes. Again, look at what other languages are doing; you'll find that this is a big focus, e.g. in Python, C++ or JavaScript. Of course R has its own share of idioms when it comes to loading packages, but it's no secret that I believe R has a lot to learn from other languages in this regard (see my package ‹modules› on GitHub).

2 Likes

The modules package looks potentially very useful. I have been frustrated for a long time that there was no simple R equivalent of Python's import ... as, and it looks like modules gives a good approximation.
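For comparison, base R can get part of the way there without an extra package. This is only a rough sketch of the idea, not how modules works; the alias names tl and file_extension are made up, and the tools package is used simply because it ships with R.

```r
# Bind a package namespace to a short alias, loosely like
# Python's `import tools as tl`. Caveat: a namespace environment also
# exposes non-exported internals, so this is less strict than `::`.
tl <- asNamespace("tools")
print(tl$file_ext("counts.csv"))

# Or bring in a single function under a name of your choosing,
# loosely like `from tools import file_ext as file_extension`.
file_extension <- tools::file_ext
print(file_extension("reads.fastq.gz"))
```

Neither form attaches the package, so nothing gets masked on the search path.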

1 Like

Plus, from a teacher's perspective, it's almost like "teach Riemann sums before integrals" -- once they've mastered the actual concepts, then you can surprise them with an easier approach but leave them with a better understanding of what's underneath.

(Although, personally, beyond being a teaching tool, I just prefer to take the "load what I need" route in my own work.)

1 Like

I really like all the thoughts written here, but don't you think that the answer depends on the students? If you have a quick workshop where you could probably see somebody who is not familiar with R, I'd propose using explicit library() calls. But for my students (who mostly use Python) in Data Science, it is better to show the nice tool and spend some time explaining all the possibilities.

1 Like

I think it depends on the students, the timeframe, and the tools being used. If you are teaching a short workshop for users new to R who installed R/RStudio just prior to class, and you are just using the tidyverse, then it may even be worthwhile to have them start any scripts with:

# Install tidyverse only if it isn't already available, then attach it.
if (!require("tidyverse"))
  install.packages("tidyverse")
library(tidyverse)

It can be explained as a "header to make the functions we'll want available." That way, even if they end up needing to reinstall, they won't get errors about not having the package available. That's probably only appropriate for very beginning users that you won't have a longer time to work with, however.
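If the header needs to cover several packages, the same idea can be wrapped in a small function. This is a sketch rather than a standard idiom: ensure_packages is a made-up name, and it assumes a CRAN mirror is configured whenever something actually has to be installed.

```r
# Hypothetical helper: install whichever packages are missing,
# then attach all of them in the order given.
ensure_packages <- function(pkgs) {
  # requireNamespace() checks availability without attaching anything.
  available <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  to_install <- pkgs[!available]
  if (length(to_install) > 0) {
    install.packages(to_install)
  }
  for (pkg in pkgs) {
    library(pkg, character.only = TRUE)
  }
  invisible(pkgs)
}

# Demonstrated with packages that ship with R, so nothing is installed.
# In a workshop script this might be c("dplyr", "ggplot2") instead.
ensure_packages(c("stats", "utils"))
```

Listing the packages by name in the call keeps the header short while still showing students exactly what is being loaded.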

1 Like

I think that's important if you're teaching R programming. However, most people are not teaching R as a programming language, but are instead teaching R as a tool for doing data analysis. I think you can teach explicitness and precision much later in the course - in the early part of the course you should be emphasising data science concerns, not programming concerns.

I don't have a strong opinion on whether or not you should teach library(tidyverse), but personally I find it much easier when I'm doing a data analysis to load the set of packages that I'm most likely to use in a single line of code. I think that at least suggests you should teach it at some point during the course.

13 Likes

I frequently find myself using this approach at work, where not everybody necessarily has a consistent set of packages (and it means anyone running the script from a clean setup doesn't see a whole bunch of errors, which puts them off - and I'm trying to encourage R use).

However, it's not very concise. Is there a single base function that does all these steps that I'm not aware of?

Take a look at the pacman package. Then, you can use a loading header like this:

if (!require("pacman")) install.packages("pacman"); library(pacman)
p_load(tidyverse, devtools, lme4)

As a "bonus" (for the types of users you are talking about), it will also suppress startup messages. In addition, it has easy methods for auto-installing/loading GitHub packages.

4 Likes

Thanks, that's exactly what I was looking for.

@klmr I just wanted to echo what hadley said (above) without getting too out of scope. I am not a programmer, though I write code.

Also, ~related, I studied science/scientists/systems of knowledge production at university (STS). And, I've become riveted by this ongoing discussion around the how (and target whom) of learning R. Paul E. Johnson's Rchaeology: Idioms of R Programming and the Win-Vector Blog come to mind, for example, as discourse around R Programming best practices.

Though I don't think the R-user vs. R-programmer dichotomy is an impenetrable boundary (many people, like me, become interested in the abstractions as a result of applied use), I do think the audiences/users are different, and have different "motivated skill gaps." This is totally fine! Packages, interfaces, and learning approaches are about design and communication.

Designers are not typical users, which can be an obstacle to good design (~loosely adapting ideas from Donald A. Norman's The Design of Everyday Things here). Successful user-centered design hinges on understanding who that user is. This is the point in meta discussions of R where it seems as though the disagreement is really a matter of ill-defined terms. So, though I don't have the answers, I wanted to pipe in early on in what I imagine will be a longer thread with a plea to do so! (If only because I find it so meta interesting :smirk:!)

Sorry, not as short as I thought I could be there…:flushed:

10 Likes

I find this a difficult question to answer because people new to R tend to have very different backgrounds.

For experienced programmers (or really anyone with any computer science coursework), I would not recommend using tidyverse for training (though I would use it as an installation helper).

For someone new to programming/scripting, but experienced in data analysis... it depends. If they have a background with other statistical tools like JMP or RapidMiner or even just Excel, it might be helpful. I've found those people do better with a soft learning curve, such as starting to use R through tools like Rcmdr or Rattle. Even though they provide the GUI, they also output the raw R source code for you so you can start to follow along. Everyone I know who started this way eventually ditched the GUI tools within a couple of months. For these people, adding in the tidyverse package seems appropriate.

For the person both new to programming and new to data analysis, I would side with most answers so far and teach from the ground up, including how to install and load packages only as needed.

1 Like

I think evidence suggests that bottom-up is generally a suboptimal way to teach. You're better off sketching out the big picture first, then filling in the details over time. To borrow an analogy from the excellent Making Learning Whole, when teaching baseball it's better to first teach little league - you don't want to start with a history of baseball, then the physics of bat-ball interaction, then three weeks of batting practice.

14 Likes