Teaching: install/load packages individually, or use tidyverse package?

To borrow from the principles of backward design in lesson planning, start with your end result in mind. At the end of the workshop, what do you want learners to know/do/understand? From there, build backwards.

There are cohorts of learners for whom understanding individual packages and their interactions is going to enhance their learning experience, just as there are cohorts for whom being able to load the tidyverse so that they can focus on another area of content is a better approach.

Knowing your learners and their backgrounds, and aligning that to the end goals of the workshop or course, will help to determine which approach to take.

Personally, when I'm working with learners new to R and data analysis, I start with the tidyverse. It's quick, easy, and allows us to jump into things like data visualization early on. It's been my experience that having learners be successful in creating something (even a simple line graph) early and often develops a sense of buy-in and interest, making later conversations around packages much easier.

I agree with @jessemaegan's thoughts. When I introduce students to data visualization in R, I initially teach them to begin all projects with library(tidyverse), because it is a quick and easy way to get right into things that are rewarding -- such as quickly filtering datasets and making aesthetically pleasing graphs. Another pedagogical advantage to library(tidyverse) is that it allows me to be more flexible during lectures. For example, I may plan to present a particular data analysis that uses only ggplot2 and dplyr, but during that lecture a student might suggest an alternative analysis that is most easily performed using functions from tidyr. If I've only asked students to install ggplot2 and dplyr, it is tough to go down that road together as a class, since we need to stop to download, install, and load an additional package. If they've loaded the entire tidyverse, an unplanned digression is easier.

It isn't until later in the course that we think explicitly about the constituent packages and whether it is necessary or desirable to load all of them for a given project. This approach is informed by my goal, which is to give students the tools and skills necessary to analyze and visualize data, not to turn them into master programmers. Other classes may have different goals, including teaching students to write maximally efficient scripts, which may motivate a different approach.
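
To make the trade-off concrete, here is a minimal sketch of the two setups discussed above. It assumes the tidyverse is already installed. The ggplot2 and dplyr pairing in the second half is just the example from this post, not a recommendation.

```r
# Option 1: attach the core tidyverse packages in one call.
library(tidyverse)

# List the packages that make up the tidyverse. This is a handy starting
# point for the later "what is actually in here?" conversation.
tidyverse_packages()

# Option 2: attach only what the planned analysis needs.
# An unplanned digression into tidyr would then need an extra
# library(tidyr) call (and an install, if it is missing).
library(ggplot2)
library(dplyr)
```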

This year I've taught R/RStudio/tidyverse to ~90 undergraduates and 30 graduate students. With the undergraduates, I just went with library(tidyverse) and walked them straight into making a figure. With the graduate students, I went bottom up. I have to say that the top-down "whole learning" approach works great. I love it when the first faceted plot with colored points pops up on their screens after 20 minutes and you hear "wooooowwww" echo around the room!

I like that analogy. I'm going to start calling the tidyverse package the T-ball of data analysis. :wink:

I'm going to have to rethink how I go about teaching R and data analysis now...

I tend to start with a more problem-solving approach: what problem does the user need to solve that learning R helps with? In my experience, the answer is usually dplyr + tidyr and the concepts in your paper on tidy data, so that's what I typically start with. Now that you mention it, I do think that implies an assumption on my part that learners already know what to do with data once it is in the right shape.

While it can be a long learning curve, the tidyverse package definitely makes it a shorter, more pleasant one. Thanks for everything you've contributed to make it so.

I'm in my second round of teaching R right now. Last year, all the students worked on either university lab computers or personal computers. We spent most of the first class setting up RStudio, installing packages, etc. And then for the entire semester we seemed to be plagued by issues with libraries (especially on the university computers).

This time, I set up an RStudio Server, gave each of the students an account, and loaded all of the libraries I anticipated needing. This let me skip even install.packages at the start, and most of the students have benefitted from not having to deal with understanding all of the mechanics yet.
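
For anyone preparing a shared server or lab image like this, a small check of what is still missing can save time before class. This is only a sketch using base R: `course_packages` is a made-up list and `missing_packages()` is a hypothetical helper, not part of any package.

```r
# Hypothetical list of packages the course will need.
course_packages <- c("ggplot2", "dplyr", "tidyr")

# Return the requested packages that are not installed in any library
# on this machine.
missing_packages <- function(pkgs) {
  setdiff(pkgs, rownames(installed.packages()))
}

to_install <- missing_packages(course_packages)

if (length(to_install) > 0) {
  message("Not installed yet: ", paste(to_install, collapse = ", "))
  # install.packages(to_install)  # uncomment to install them
} else {
  message("All course packages are installed.")
}
```

Running this once on the server (or asking students to run it on their own machines) shows up installation gaps before they interrupt a lesson.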

In that same vein, I haven't taught them the word "tidyverse" yet, even though they use all the tools. It's a level of abstraction I don't want them worrying about yet. I'll bring it up later.

I've debated going down the RStudio Server path; getting things set up at the start is super painful! I haven't done it yet, because the drawback is that the students don't walk away from the class with a working data analysis installation on their own computers.

However, I'm about to deliver a short (4 hour) workshop on getting started with R, and if I spend 2 hours getting people's computers working it'll be a huge waste. So my current plan is to set up RStudio Server on AWS, and then if anyone can't get things started up smoothly I can just give them a login and away we go.

To me this is a distraction that can be saved for later in the course. I like @stephenturner's approach of teaching it only when it first comes up; that's when it will be easiest for students to see why it is helpful.

Like any shortcut, it can help and hinder. I teach in the Software Carpentry sessions, alongside training grad students and other analysts who want to learn.

It is a fantastic way to get them to dive right in and not worry too much about the installation and set-up of R libraries, which can be daunting to people who at best are not programmers (myself included). Users can quickly learn the useful patterns and approaches that underlie the tidyverse, and do so in a way that produces humane code.

The hindrance comes in the middle, when learners start to branch out. They quickly encounter package conflicts and are at a loss, or they start to delve into packages and have a hard time identifying where the individual functions may come from. It introduces a potential stumbling block that I've seen frustrate some users to the point of walking away.

Part of teaching is removing those obstacles and minimizing potential frustrations to keep momentum, so to speak. I do this with the just-in-time instruction about what the tidyverse actually is, why it is useful and how to skirt the issues. Lots of good advice about teaching is in this thread (@hadley, @jessemaegan) and one of my most dog-eared references that I use to reflect on and improve my teaching is How Learning Works.

Thanks for the book reference. It helps that it is included in my Safari Books Online subscription.

It's also a good idea to teach tidyverse_conflicts(), which should help students if they're not finding the right functions:

suppressPackageStartupMessages({
  library(tidyverse)
  library(MASS)
  library(Hmisc)
})

tidyverse_conflicts()
#> ── Conflicts ─────────────────────────────
#> * combine(),    from dplyr, masks Hmisc::combine()
#> * filter(),     from dplyr, masks stats::filter()
#> * is_null(),    from purrr, masks testthat::is_null()
#> * lag(),        from dplyr, masks stats::lag()
#> * matches(),    from dplyr, masks testthat::matches()
#> * select(),     from dplyr, masks MASS::select()
#> * src(),        from dplyr, masks Hmisc::src()
#> * summarize(),  from dplyr, masks Hmisc::summarize()

Another option is to use the strict package:

suppressPackageStartupMessages({
  library(tidyverse)
  library(MASS)
  library(strict)
})

select
#> Error: [strict]
#> Multiple definitions found for `select`.
#> Please pick one:
#>  * MASS::select
#>  * dplyr::select

This pattern for resolving conflicts is so useful that I'll probably eventually pull it out into a separate package.

From the perspective of someone who is self-taught in R without prior programming experience, I can say that I found the shortcut of tidyverse to be very helpful. As a complete novice, it is really difficult to get into the console and try to think what to do. tidyverse gives a wide but knowable range. It also provides the tools that you need to do the work. For instance, it can seem overly burdensome to a newbie to have to load both readr and dplyr just because you want to read in some data.

My approach was to just load tidyverse and do everything possible with the tools there. Only more recently have I worried more about what each package does. Once you have been working with the functions, it is much less of a cognitive load to think about which functions go with which packages and why.

I think I'm with this. I teach an intro-to-R-and-SAS course, and I like being able to say "start with library(tidyverse)" and then no one has to worry about anything. With my students, I'm not planning to disambiguate much (though I'll be adding other packages later).

Yeah, I ran into the MASS-dplyr select conflict last year. I ended up explaining that dplyr::select was the way to get at the select we wanted (and why: "because MASS also has a select"). Which select I got depended on which package I had loaded first, which was all rather unsatisfactory (from a teaching point of view).
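
One way to show students why load order matters is to ask R where it finds a name. The sketch below uses only base R and utils. `first_definition()` is a hypothetical helper written for illustration, and `filter` stands in for `select` so the example runs without MASS or dplyr attached.

```r
# Return the first place on the search path that defines a function called
# `name`. A bare call such as filter(...) uses this definition, so anything
# attached later masks anything attached earlier.
first_definition <- function(name) {
  hits <- find(name, mode = "function")
  if (length(hits) == 0) NA_character_ else hits[1]
}

# In a fresh session this is "package:stats". After library(dplyr) it would
# be "package:dplyr", because dplyr sits earlier on the search path.
first_definition("filter")

# Every name that is defined in more than one attached place:
conflicts(detail = TRUE)

# A namespace-qualified call such as stats::filter(...) or dplyr::select(...)
# sidesteps the search path, so it does not depend on load order.
```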

I feel like this illustrates that there's a duty of care to alert new users that most everything can be done in base, even if it's sometimes more complicated. read.csv has a couple quirks, but it's still a very usable function for a new user. I'm not recommending teaching base first (I still have yet to get reshape to ever do what I want), but an openness to base R will make becoming an effective tidyverse user much easier.
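
As a small illustration of one such quirk, here is a sketch with made-up inline data. By default read.csv rewrites column names so they are syntactically valid, which can surprise new users. Setting check.names = FALSE keeps the names as they appear in the file.

```r
# Made-up CSV content, supplied inline so no file is needed.
csv_text <- "id,first name,score\n1,Ada,90\n2,Linus,85"

# Default behaviour: the space in "first name" is replaced with a dot.
scores <- read.csv(text = csv_text)
names(scores)
#> [1] "id"         "first.name" "score"

# Keep the column names exactly as written in the file.
scores_raw <- read.csv(text = csv_text, check.names = FALSE)
names(scores_raw)
#> [1] "id"         "first name" "score"
```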

Agreed. Base should not be seen as bad, though to a certain extent that is how I treated it. However, I think that being hemmed in by the tidyverse was useful. Then, as I did more with R and became more confident with the tidyverse, it became easier to go back and learn more about base.

For newbies especially, the plethora of packages is easily overwhelming. As a beginning, I essentially treated tidyverse as the extent of R. After I became more comfortable with tidyverse, I could more easily integrate other packages, including doing things through base.

This idea of a separation between base R and tidyverse is an idea I've seen Hadley push back on a few times. I think he's right: it's literally impossible to use the tidyverse without base. <-, c, %in%, as.character, list, seq, library, max, lm, etc. are all base R functions that everyone uses constantly, but a lot of new users mentally divorce them from their less-frequently used counterparts.

I'm not even saying students need to learn more base: the reality is that by the time a user gets a grasp of the tidyverse, she already knows a lot of base R. Helping students acknowledge that fact makes it much easier for them to see a new function like which.max and have the confidence to add it to their repertoire.
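
A tiny sketch of the point, using made-up scores: everything here is base R, and which.max slots in next to functions a tidyverse user already types every day.

```r
# Made-up named vector of scores.
scores <- c(ann = 72, bob = 91, cy = 85)

# Familiar base functions that show up in any tidyverse script.
max(scores)
#> [1] 91
c("bob", "dan") %in% names(scores)
#> [1]  TRUE FALSE

# A less familiar relative: the position of the largest value.
best <- which.max(scores)
best
#> bob 
#>   2
names(best)
#> [1] "bob"
```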

My view is that if you are starting to learn R, you should begin only with tidyverse.

I have been using R for some years, and today regularly perform various kinds of analyses. Everything is so simple with tidyverse, so why make it harder for people to get started? I can recall only one concrete example in the last few months when I had to google a base R alternative.

To tidyverse, add in the RStudio cheat sheets, and you can immediately do 80-90%+ of what you need.

As an instructor, you can talk about base R along with the idea of packages - it is not a complicated concept to explain. But move on to the practical stuff, quickly.

Of course you'll be using <-, c, %in% etc... but these are details. For the stuff you need to actually do, start with tidyverse verbs.

I pity people who I run into who are learning R and their instructors/courses teach them base verbs - tidyverse was invented to make common tasks simpler. The syntax is close to perfect (I would argue), it is readable, it is fast - and it is a joy to use.

If you are teaching/learning R to get stuff done, why not minimize the time to getting useful results?

I think you're missing my point. Those aren't the details, they're the core of R: regardless of grammar, everyone uses them.

I'm not saying you should teach how to use base R by itself or not teach the tidyverse, I'm saying that by teaching the tidyverse you are teaching base R, and it's helpful for students to understand that so as not to limit themselves to half of a false dichotomy.

I do several things when teaching beginners to try and make the learning curve as shallow as possible:

  • Insist everyone uses an RStudio project; that way issues about paths etc. are minimised. I reinforce the need for best-practice file naming and directory structure (see guides by Bryan and Broman). "Repeat after me: the desktop is not a good place to store files".

  • I now start with the "tidyverse" and use the tidyverse package to try and get all users moving forward as soon as possible at the same level. It's always a battle to ensure that everyone is using the latest version of R and RStudio.

  • I use the babynames, dplyr, and ggplot2 packages to show people how popular their own name is over time. They also seem to enjoy checking out celebrity names e.g. Kylie, Hilary.

  • Later the same packages can be used to ask more complex questions like "when I was born what were the most popular names - what might my parents have called me?".

  • I have found the 'flights' and 'titanic' datasets are also very good for engaging students (they know the 'story' and are interested in the interpretation).

  • There are, of course, still plenty of stumbling blocks:
    Script v. Function v. Package v. Library v. Repository
    Left assignment v. Right assignment v. =
    System library v. User library
    ...Add your favourite here.
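
The babynames exercises in the list above could be sketched roughly as follows. This assumes the babynames, dplyr (1.0.0 or later, for slice_max) and ggplot2 packages are installed. `plot_name()` is a hypothetical helper, and the name and year are arbitrary examples.

```r
library(babynames)
library(dplyr)
library(ggplot2)

# Hypothetical helper: plot how many babies were given `target_name` each
# year. `data` must have the columns year, sex, name and n, as babynames does.
plot_name <- function(target_name, data = babynames) {
  data %>%
    filter(name == target_name) %>%
    ggplot(aes(x = year, y = n, colour = sex)) +
    geom_line() +
    labs(title = paste("Babies named", target_name), y = "Number of babies")
}

plot_name("Kylie")

# Follow-up question: what were the most popular names in my birth year?
# 1990 is an arbitrary example year.
babynames %>%
  filter(year == 1990) %>%
  group_by(sex) %>%
  slice_max(order_by = n, n = 5)
```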
