Should I update all my R packages frequently? Yes/No? Why?

mauro_lepore · March 5, 2018, 7:06pm

With beginners in mind, what is the best resource you can think of to answer this question?

Or, phrased a little differently: what is a cost-effective way to manage package dependencies for users (not developers) of R packages?

I think @JennyBryan is a big supporter of frequent updates.

mishabalyasin · March 5, 2018, 7:17pm

With the availability of Docker and MRAN, there is little reason to not update your packages as frequently as possible.

In the end, it is always about tradeoffs. Either you don't update, and you have a relatively uneventful experience with your projects not breaking, but you might be reimplementing something that someone already did in an upstream package. Or you update all the time (e.g., whenever you start a new project/analysis) and risk breaking some older work, but you are not reimplementing something that people already solved upstream.

It is up to the user to decide which tradeoff is right for them.

jdblischak · March 5, 2018, 8:13pm

One consideration is how fast the packages you use are updated. I recently found a bug in some code that used purrr. I had followed the documentation and the script ran fine on most of my machines. One day I was using a computer that I update much less frequently, and I got an error. I spent a bunch of time trying to debug it and reading the documentation, but in the end the solution was that I needed to update the purrr package on that machine.
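A minimal sketch of how a script could guard against this kind of version mismatch. The helper name `has_min_version` and the version number "0.2.3" are illustrative assumptions, not something from the post above; `packageVersion()` and `requireNamespace()` are standard R functions.

```r
# Hypothetical helper: TRUE if `pkg` is installed at `min_version` or newer.
has_min_version <- function(pkg, min_version) {
  # requireNamespace() returns FALSE (instead of erroring) when pkg is missing.
  if (!requireNamespace(pkg, quietly = TRUE)) {
    return(FALSE)
  }
  utils::packageVersion(pkg) >= package_version(min_version)
}

# Example guard at the top of a script; "0.2.3" is an illustrative version only.
if (!has_min_version("purrr", "0.2.3")) {
  message("purrr is missing or older than expected; consider updating it.")
}
```

Putting a check like this at the top of a script turns a confusing error deep in the code into an early, readable message.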

jennybryan · March 5, 2018, 10:59pm

My motivation for recommending that people update R, RStudio, and their packages at the start of every course I teach is ... I see so many time capsules! As in, systems that appear to be frozen circa 2016 or thereabouts.

So I wouldn't say that everything needs to be bleeding edge, but some reasonable definition of current.

Why?

First, we don't want to bump into bugs that have already been fixed. I also show and use features in packages/functions that have come about more recently.

Second, even when we run into trouble (which happens with new stuff!), discovering and fixing bugs in current packages seems like a very worthwhile activity. Much more so than figuring out how to get an old version of X to work smoothly with an old version of Y. I like to spend my troubleshooting time on things that are likely to benefit the most people, going forward in time.

Finally, I think there's also an element of "if it hurts, do it more often". This post describes this idea in the context of software development:

You will always eventually have a reason that you must update. So you can either do that very infrequently, suffer with old versions in the middle, and experience great pain at update. Or admit that maintaining your system is a normal ongoing activity, and do it more often.

alistaire · March 6, 2018, 12:46am

Given a general mandate for at least some level of backwards compatibility among package maintainers, I've seen very few bugs introduced by updating. Most of the ones I can think of (by_row/by_slice/dmap moving to purrrlyr, dbplyr getting split out from dplyr) were really just relocations that didn't require significant changes to code (though the former was a sort of deprecation). Even when things are deprecated, they tend to stay that way for a good while before they ever get removed, so updating is more likely to generate warnings than errors.

On the other hand, I have seen a lot of bugs that have been fixed by updates. If you're in a production environment, and just want your code to run when called in two years, sure, lock down an environment in a VM. If you're doing interactive exploration, though, there's little reason not to update. Even GitHub versions of most packages are usually reliable, as significant changes are typically tested in branches first.

There are a few times you must update:

If you're going to open a GitHub issue, always install the current development version first to check the bug hasn't already been fixed.
If you're updating R itself, you should update your packages. Depending on your installation, R will make a new directory for packages anyway, so you either have to relink your old set or reinstall them. If you're going to relink, update and rebuild, or you'll likely get a lot of annoying warnings later to do so.
If you're installing a new package and it's not working right, make sure its dependencies are updated.
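As a rough sketch, the cases above map onto a few base R calls. The wrapper names `report_outdated` and `update_after_r_upgrade` are made up for illustration; `old.packages()` and `update.packages()` are from the utils package, and both need network access to a configured repository.

```r
# Hypothetical wrapper: show which installed packages have newer versions
# available on the configured repositories (needs network access).
report_outdated <- function(...) {
  outdated <- utils::old.packages(...)
  if (is.null(outdated)) {
    message("All packages are current.")
    return(invisible(NULL))
  }
  outdated[, c("Package", "Installed", "ReposVer"), drop = FALSE]
}

# Hypothetical wrapper for the "I just upgraded R" case: checkBuilt = TRUE
# also reinstalls packages that were built under an older R version.
update_after_r_upgrade <- function() {
  utils::update.packages(ask = FALSE, checkBuilt = TRUE)
}

# Typical interactive use (commented out because it contacts a repository):
# report_outdated()
# update_after_r_upgrade()
```

For the GitHub-issue case, the development version is usually installed with something like `remotes::install_github("owner/repo")`, assuming the remotes package is installed and the package's README doesn't say otherwise.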

The one caveat I'll add is that it's totally legitimate to wait to update a package that doesn't have a compiled version on CRAN yet for your OS. Sometimes building from source requires outside resources that can be a pain to wrangle (e.g. sf), so if you can afford to wait, your life will be a lot easier. When updating from CRAN, you should get a prompt that lets you say no before it tries to compile anything, though.
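If you'd rather not rely on answering the prompt, one possible approach is the `install.packages.compile.from.source` option documented in `?options`. This is a sketch under the assumption that you are on a platform where CRAN provides binaries (Windows or macOS); on other platforms the option has no effect.

```r
# Ask R never to compile a newer source-only version during install/update;
# the latest available binary is used instead. Keep the old setting to restore it.
old_opts <- options(install.packages.compile.from.source = "never")

# utils::update.packages(ask = FALSE)  # uncomment to run; needs network access

# Restore the previous setting afterwards.
options(old_opts)
```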

As for concrete numbers, I'd guess I update every week or so. I wouldn't look askance at updating every month or so. If it's been 6 months to a year for someone who uses R daily, I'd expect a good reason (and they do exist).

mauro_lepore · March 6, 2018, 1:38pm

Thank you all! Now I have a blog post to point my users to:

jennybryan · March 6, 2018, 7:22pm

Nice post!

There are definitely situations where people have good reason to not update or to do so infrequently or at specific milestones. But I personally encounter many more who suffer from the opposite: updating too infrequently, without any specific reason or because they fear disruption (which makes it worse in the end).

nutterb · March 6, 2018, 8:01pm

As for concrete numbers, I’d guess I update every week or so. I wouldn’t look askance at updating every month or so. If it’s been 6 months to a year for someone who uses R daily, I’d expect a good reason (and they do exist).

As an example of a good reason, my facility is under strict configuration control procedures. Given the toxicity of what we work with (and the nature of the public to freak out about mistakes) we won't tolerate anything unexpected happening due to a casual library update. To update my library requires the following steps:

Submitting an IT request and getting it approved
Updating the library in a development environment
Testing every report and application we use in production in the development environment and looking for new warnings, messages, and errors. (Sometimes, updating a package requires changes to the CSS on an app, which doesn't get picked up by any of those)
Once I'm happy with the behavior in the development environment, I load the new library into a testing environment where a group of end users tests everything to verify that it still functions to specification
The Quality department reviews all of the changes and verifies that all of the updates I've made are properly documented and installed.
Finally, I may release the new library into the production environment.
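For the documentation part of a process like this, a small sketch of how the before/after state of a library could be recorded. The helpers `snapshot_library` and `diff_snapshots` are hypothetical names, not part of any package or of the procedure described above; they only rely on `installed.packages()` and `merge()` from base R.

```r
# Hypothetical helper: record package names and versions from the current library.
snapshot_library <- function() {
  ip <- utils::installed.packages()
  # A package can appear in several library paths; keep the first occurrence.
  ip <- ip[!duplicated(ip[, "Package"]), , drop = FALSE]
  data.frame(
    Package = unname(ip[, "Package"]),
    Version = unname(ip[, "Version"]),
    stringsAsFactors = FALSE
  )
}

# Hypothetical helper: rows for packages that were added, removed, or changed
# version between two snapshots.
diff_snapshots <- function(before, after) {
  merged <- merge(before, after, by = "Package", all = TRUE,
                  suffixes = c("_before", "_after"))
  changed <- is.na(merged$Version_before) |
    is.na(merged$Version_after) |
    merged$Version_before != merged$Version_after
  merged[changed, , drop = FALSE]
}

# Usage: take one snapshot before updating and one after, then compare.
# before <- snapshot_library()
# ... update packages ...
# diff_snapshots(before, snapshot_library())
```

An `NA` in `Version_before` means the package is new, and an `NA` in `Version_after` means it was removed.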

This is a bit of a painful process, so I only do it if I am upgrading R itself. The last time I upgraded R and the library, it was a month-long process.

Incrementally, I may update a package (and sometimes dependencies) when I need a new feature. Updating a single package requires the same process, but is much less painful, since I'm usually only testing one feature at a time.

On the other hand, when I do my own personal package development, or when I've taught R classes, I always update to the most recent version of R and update all the packages.

chasec · March 7, 2018, 3:46am

I think the above points are super great!

I also wanted to add that not updating makes it more likely that your current projects will not be repeatable by someone else on another system (sans packrat or similar), especially by beginners, who will likely be working on a fresh install with all the latest package versions.

mauro_lepore · March 11, 2018, 2:24am

Good point! At any given time there is a practically infinite number of combinations of old package versions, but only one combination of the latest versions of those same packages.

oren · March 11, 2018, 1:24pm

Updating is easy. The tricky aspect of updating (for data scientists) is how to ensure the research you conduct remains reproducible in the future... Also, in a field rife with fudging "the technique", new adopters will only trust software that does not add more fudging of the results all by itself.

Each time you update anything, you run the risk of the code breaking or, worse, the results of the analysis changing in some subtle way. I don't know many people who rerun all their R code after each update, which is what it would take to discover which package broke the code. You also need crazy tests to ensure your analysis is still rock solid. (Even if you are the only coder.)
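The tests don't have to be crazy to catch the "results changed subtly" case. Here is a minimal sketch that compares a result against a stored reference; the helper name `check_against_reference`, the file name, and the tolerance are all assumptions chosen for illustration.

```r
# Hypothetical helper: compare a freshly computed result with a stored reference.
# On the first run (no reference file yet) the current result becomes the reference.
check_against_reference <- function(result, reference_file, tolerance = 1e-8) {
  if (!file.exists(reference_file)) {
    saveRDS(result, reference_file)
    message("No reference found; saved current result as the reference.")
    return(invisible(TRUE))
  }
  reference <- readRDS(reference_file)
  comparison <- all.equal(reference, result, tolerance = tolerance)
  if (!isTRUE(comparison)) {
    stop("Result differs from reference: ", paste(comparison, collapse = "; "))
  }
  invisible(TRUE)
}

# Example with a built-in dataset: store the coefficients once, then rerun
# the same check after every package update.
fit <- lm(mpg ~ wt, data = mtcars)
check_against_reference(coef(fit), file.path(tempdir(), "mpg_wt_coef.rds"))
```

In a real project the reference file would live alongside the analysis rather than in `tempdir()`, so that it survives between sessions.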

I think that at work (in contrast to the classroom) many packages are used as black boxes: we want dependable results to use in exploring data. As more of these are in use, a break can happen: an API is dropped or a function moves to its own package, and now you have to explore the package and not the environment.

So once you have "legacy" work, don't update unless you can go back, have the time/money to fix any breakages, have highly testable code...

That said, I love Docker, and it does let you start new projects with a fresh new environment.