How much should package dependencies be limited?

tidyverse
recommendations
packages

#1

I have just finished the first version of my first package and have a somewhat philosophical question about package dependencies, meaning the packages listed in Imports. I get the basic idea that limiting the number of packages in Imports is a good goal, but how far should one go with this? In particular, I am thinking about the tidyverse set of packages. On the one hand, I often see people saying that tidyverse packages such as dplyr or rlang introduce a lot of dependencies, but on the other hand it seems somewhat reasonable to think that users within the tidyverse will already have such packages installed. I would love to hear thoughts on this topic from the community.


#2

As usual, it depends :slight_smile:

But seriously, it really does depend on your set of users. I would say that your assumption about the tidyverse already being installed is going to be correct for most interactive use cases. But if you are planning for your package to be used in production environments, then fewer dependencies is definitely the goal.

My take on it is that some packages are too convenient to really try and eliminate them (like dplyr, for example), so I don't consider them to be "bad" dependencies that need to be removed at all costs.


#3

Try to minimize the number of packages listed in Depends. Telling people what they should load isn't friendly, so only do it when necessary. A good example: if your package uses dplyr inside its functions, then just list dplyr in Imports. If your package is primarily new features for ggplot2 graphics, then list ggplot2 in Depends.

Listing a package as a dependency (in Imports or Depends) means a few things:

  1. Anyone installing your package must also install the dependency (which you've written about).
  2. Anyone loading your package must also load all dependencies, even if only into "background" namespaces.
  3. Your package's code is affected by dependencies.

Points 1 and 2 can be problems if somebody uses your package in a session-specific Docker or VM situation: every time a user starts a session, R is installed, then all required packages are installed and loaded. The more packages there are, the more time this takes. And people are impatient.

But point 3 is, IMO, the most important one. As the package maintainer, you need to keep up with changes in all dependencies. Those can break your package. If the dependencies have dependencies (and so on), that's more chances for breakage. Also, if you import entire packages in your NAMESPACE file, you need to watch out for name clashes. Reducing dependencies can save you some trouble as the maintainer.
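
To make the name-clash point concrete, here is a minimal sketch of the two import styles in a NAMESPACE file. The choice of `filter` and `mutate` is just an illustration, not something from your package:

```r
# NAMESPACE (often generated by roxygen2 rather than written by hand)

# Selective import: only these two names enter your package's imports,
# so a new export added to dplyr later cannot collide with anything.
importFrom(dplyr, filter, mutate)

# Whole-package import: every exported dplyr function is imported.
# If a second fully imported package exports the same name, R warns
# about one import replacing the other when your package is loaded.
# import(dplyr)
```

Calling functions as `dplyr::filter()` in your code, with dplyr listed in Imports, avoids the NAMESPACE import altogether.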


#4

As in many things in computing, there is a tradeoff.

By having your package depend on another package you get

  • Features for free, because the code already exists
  • Bug testing, as the package already has (at least a few) users who have already caught some bugs

However, you also

  • Must adapt your code if the dependency changes
  • May not be using all (most) of the features of the dependency, so there is some overhead

If you do not have many dependencies your code will likely have

  • Fewer features
  • Potentially more bugs, as you will need to implement functionality yourself (that is not as well tested by users)

But it will be

  • More stable
  • Free of unused functionality

How this weighs out really depends on both the maintainer and the audience of the package. If your audience is other package developers and you are comfortable writing functionality from scratch, you can use fewer dependencies. If your audience is mainly users (who will already have most of the packages installed) and you have less experience, it is OK to have (relatively) more dependencies.

How you weigh a specific dependency also depends on how much functionality you are using from the package, how long the package takes to build, and how many dependencies the additional package has itself.
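
One rough way to check how many dependencies a package brings with it is `tools::package_dependencies()`. The helper name `count_recursive_deps` below is made up for this sketch, and it only looks at packages installed locally, so the count can be an undercount if some dependencies are missing from your library:

```r
# Count the recursive hard dependencies (Depends, Imports, LinkingTo)
# of a package, using only the locally installed package database.
count_recursive_deps <- function(pkg, db = installed.packages()) {
  deps <- tools::package_dependencies(
    pkg,
    db = db,
    which = c("Depends", "Imports", "LinkingTo"),
    recursive = TRUE
  )[[pkg]]
  length(deps)
}

# "stats" is used here only because it is always installed;
# substitute the package you are thinking of importing.
count_recursive_deps("stats")
```

The result includes base packages such as utils, so it is better used for comparing candidates than as an absolute number.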

If you are depending on dplyr and only really using it for subsetting data frames, maybe you could switch to base functionality without much trouble. On the other hand, if you are depending on dplyr and using it for the SQL translations for different databases, there really isn't any alternative.
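
As a small sketch of the subsetting case, here is a filter-and-select written in base R. The data frame and column names are invented for illustration:

```r
df <- data.frame(id = 1:5, score = c(10, 25, 7, 40, 18))

# dplyr version, shown for comparison only (needs dplyr in Imports):
#   dplyr::select(dplyr::filter(df, score > 15), id, score)

# Base R version. filter() drops rows where the condition is NA,
# so the logical index excludes NAs explicitly to match that.
keep <- !is.na(df$score) & df$score > 15
high <- df[keep, c("id", "score"), drop = FALSE]
```

One difference to be aware of: base subsetting keeps the original row names, while dplyr renumbers them.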

Also, some packages (such as rlang) really are meant to be easily used in packages. rlang is written in C (which is generally much faster to compile than complex C++) and has no additional dependencies. Other packages (like dplyr and stringi) are much heavier and take much longer to compile from source.


#5

Thanks @nwerth, @jimhester, and @mishabalyasin. This is really helpful and clarifies a couple of things I had heard about or thought about but not fully grasped. It is also good to know, and makes sense, that rlang is meant to be used in packages.


#6

Interesting thread! :clap:

@jessesadler You might also enjoy this recent blog post by Scott Chamberlain: https://recology.info/2018/10/limiting-dependencies/