I can’t decide which packages to use

Coming in to this late, but...

I've been teaching an applied statistics course using R (for students who know a bit of statistics but no coding), and I've gradually moved away from base R and toward a tidyverse approach. For me, this is because it makes more sense, and I think it's easier to learn (and to see what you're doing). I gather from what you write that the goal is to get students to do statistics, so I think you want coding that gets in the way as little as possible.

I don't even talk about tidyverse vs base (although I do point out later on that some things are "base R", so that people don't expect them to behave like tibbles).

My reaction to your list of topics is "that's an awful lot to put into one book". I would, e.g., limit clustering to one hierarchical method and one non-hierarchical method (say k-means), and give a reason why you would use one rather than the other. Otherwise your readers may get overwhelmed.
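To make the "one of each, with a reason" idea concrete, here is a minimal sketch using only functions that ship with R. The choice of the iris data, Ward linkage, and three clusters are my assumptions for illustration, not anything from the original list of topics. The contrast it shows: hierarchical clustering builds the whole tree first and lets you pick the number of clusters afterwards, while k-means needs the number of clusters up front.

```r
# Standardise the four numeric columns of the built-in iris data
# (data set chosen only for illustration).
x <- scale(iris[, 1:4])

# Hierarchical: build the full tree, then decide where to cut it.
hc <- hclust(dist(x), method = "ward.D2")
hc_groups <- cutree(hc, k = 3)

# k-means: the number of clusters must be chosen in advance.
set.seed(1)  # k-means uses random starting centres
km <- kmeans(x, centers = 3, nstart = 20)

# Cross-tabulate the two sets of cluster labels to compare them.
comparison <- table(hierarchical = hc_groups, kmeans = km$cluster)
print(comparison)
```

Calling plot(hc) draws the dendrogram, which is often the reason to prefer the hierarchical method when the number of clusters is not known beforehand.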

The packages I use have mostly been around a while. My recollection is that I use

  • tidyverse (without talking about the subpackages by name, at least until much later)
  • MASS
  • car (MASS and car between them do a lot of things like discriminant analysis and MANOVA)
  • ggrepel (for labelling points on plots)
  • survival and ggsurv (I do survival analysis later on)
  • broom

I bring in bits of the tidyverse as I need them: the read_ functions, ggplot, gather/spread, select/filter/mutate/slice, the map functions later on. There is inevitably some data handling that creeps in, and I think my students should see that (I have a lot of "extra" stuff that includes things like that).
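As a rough sketch of how a few of these pieces fit together, the example below reshapes a small made-up data frame and then filters, mutates, and selects. The scores data and its column names are hypothetical. I have used pivot_longer() here, which the tidyr documentation describes as the newer replacement for gather(); that swap is mine, not something from the course described above.

```r
library(dplyr)
library(tidyr)

# Hypothetical wide data: one row per student, one column per test.
scores <- tibble(
  student = c("A", "B", "C"),
  test1 = c(70, 85, 90),
  test2 = c(75, 80, 95)
)

# Reshape to long format: one row per student-test combination.
scores_long <- scores %>%
  pivot_longer(c(test1, test2), names_to = "test", values_to = "score")

# Keep scores of 80 or more, add a proportion column, and pick columns.
high_scores <- scores_long %>%
  filter(score >= 80) %>%
  mutate(score_prop = score / 100) %>%
  select(student, test, score_prop)

print(high_scores)
```

Each verb does one visible thing to the data frame, which is a large part of why the pipeline is easy for beginners to read aloud.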

You mention ggplot vs base for things like residual plots: I think there's nothing wrong with, e.g., asking students to do

ggplot(fit, aes(x=.fitted, y=.resid)) + geom_point()

because this is no more than a scatterplot of residuals vs fitted values, and I think it's good for students to know that this is what a residual plot is. (I even show broom::augment to make a single data frame containing the data and the regression stuff so that one can easily make a plot of the residuals against an explanatory variable.)
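A minimal sketch of the broom::augment idea, assuming a regression on the built-in mtcars data (the data set and the choice of predictors are mine, purely for illustration):

```r
library(ggplot2)
library(broom)

# Fit a regression; mtcars and these predictors are illustrative only.
fit <- lm(mpg ~ wt + hp, data = mtcars)

# augment() returns the model's data plus columns such as
# .fitted and .resid, all in one data frame.
fit_aug <- augment(fit)

# Residuals vs fitted values: just a scatterplot.
p_fitted <- ggplot(fit_aug, aes(x = .fitted, y = .resid)) +
  geom_point() +
  geom_hline(yintercept = 0, linetype = "dashed")

# Residuals vs an explanatory variable: same recipe, different x.
p_wt <- ggplot(fit_aug, aes(x = wt, y = .resid)) +
  geom_point() +
  geom_hline(yintercept = 0, linetype = "dashed")

print(p_fitted)
print(p_wt)
```

Because the residuals and the explanatory variables sit in the same data frame, the second plot needs nothing beyond what students already know about scatterplots.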

I guess my take is that I want my students to have some tools they can build with (e.g. "here is how you make a scatterplot", with "a residual plot is just a scatterplot", so that as long as they know what a residual plot is, they know how to make it).

My random thoughts.

I think this is a really interesting discussion! In general, my philosophy is to show only one syntax. Mixing and matching gets very confusing for novices. I am all-in on tidyverse, and agree with people like David Robinson who say "Don't teach built-in plotting to beginners (teach ggplot2)". This is the way I've gone with all of my courses, after a few years of being convinced by formula syntax (see my syntax cheatsheet for more of a comparison).

All that being said, I do think that it gets harder to make the case for non-base graphics if you're focused deeply on modeling, as it sounds like you are. There are so many models that come with base plotting methods, like you've mentioned. If you fit a model with lm(), you can plot() that model object to get the four basic diagnostic plots with almost no coding. If you build a tree model, you can just plot() that model object and get a base plot of the model. Of course, you can do those same plots in ggplot2, but it takes more work. I like Ken's comment about having students know what a residual plot is, and the fact that making the ggplot drives that home. But I think with other types of plots, there might not be the same payoff.
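For reference, this is roughly what the "almost no coding" route looks like. The mtcars data and the use of rpart for the tree are my assumptions for illustration; other tree packages have their own plot methods.

```r
library(rpart)  # a recommended package that ships with R

# Linear model: plot() on the fitted object draws four diagnostic plots.
fit_lm <- lm(mpg ~ wt + hp, data = mtcars)
op <- par(mfrow = c(2, 2))  # arrange the four panels in a 2 x 2 grid
plot(fit_lm)
par(op)                     # restore the previous plotting layout

# Tree model: plot() draws the branches, text() adds the split labels.
fit_tree <- rpart(mpg ~ wt + hp, data = mtcars)
plot(fit_tree, margin = 0.1)
text(fit_tree)
```

Two or three lines per model is hard to beat, which is the trade-off against building each plot by hand in ggplot2.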

I'll take this opportunity to plug this GitHub repo, where some colleagues and I have attempted to modernize some of the R code shown in the book An Introduction to Statistical Learning by James, Witten, Hastie, and Tibshirani. Our initial goal was to rewrite everything in tidyverse style, but I think we lost steam. If you look at the CART lab, it still uses base plotting. On the other hand, if you look at the first lab, you can see there's a long appendix of code that was included in the book but isn't relevant if you take the tidyverse approach (things like making your own vectors and matrices by hand, using c() and matrix(), for example). I think it's clear in that lab how the tidyverse version is more human-readable and much more interesting. Another place you might want to look is Baumer, Kaplan, and Horton's Modern Data Science with R. They have a chapter on statistical learning, and it is mostly tidyverse style. However, it looks like they're still using base graphics for trees.

All that is to say that I think writing a tidyverse-style modeling book would be extremely useful. But the tools might not be out there yet, and you'd want to make sure the payoff was there for students!
