How much should package dependencies be limited

jessesadler · September 23, 2018, 11:36pm

I have just finished the first version of my first package and have a somewhat philosophical question about package dependencies, that is, the packages listed in Imports. I get the basic idea that limiting the number of packages in Imports is a good goal, but how far should one go with this? In particular, I am thinking about the tidyverse set of packages. On the one hand, I often see people saying that tidyverse packages such as dplyr or rlang introduce a lot of dependencies, but on the other it seems somewhat reasonable to think that users within the tidyverse will already have such packages installed. Would love to hear thoughts on this topic from the community.

mishabalyasin · September 24, 2018, 7:58am

As usual, it depends

But seriously, it really does depend on your set of users. I would say that your assumption that users will already have the tidyverse installed is going to be correct for most interactive use cases. But if you are planning for your package to be used in production environments, then fewer dependencies is definitely the goal.

My take on it is that some packages are too convenient to really try and eliminate them (like dplyr, for example), so I don't consider them to be "bad" dependencies that need to be removed at all costs.

nwerth · September 24, 2018, 3:04pm

Try to minimize the number of packages listed in Depends. Telling people what they should load isn't friendly, so only do it when necessary. A good example: if your package uses dplyr inside its functions, then just list dplyr in Imports. If your package is primarily new features for ggplot2 graphics, then include ggplot2 in Depends.
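As a rough illustration of the two cases above, here is what the relevant part of a DESCRIPTION file might look like. The package name examplepkg is made up, this is only a fragment (a real DESCRIPTION needs more fields), and a single package would not necessarily use both fields.

```
Package: examplepkg
Depends:
    ggplot2
Imports:
    dplyr
```

With this fragment, library(examplepkg) would also attach ggplot2 for the user, while dplyr would only be loaded for the package's internal use.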

By listing a package as a dependency (Imports or Depends), it means a few things:

1. Anyone installing your package must also install the dependency (which you've written about).
2. Anyone loading your package must also load all dependencies, even if only into "background" namespaces.
3. Your package's code is affected by dependencies.

#1 and #2 can be problems if somebody uses your package in a session-specific Docker or VM situation: every time a user starts a session, R is installed, then all required packages are installed and loaded. The more packages there are, the more time this takes. And people are impatient.

But #3 is, IMO, the most important point. As the package maintainer, you need to keep up with changes in all dependencies. Those can break your package. If the dependencies have dependencies (and so on), that's more chances for breakage. Also, if you import entire packages in your NAMESPACE file, you need to watch out for name clashes. Reducing dependencies can save you some trouble as the maintainer.
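One common way to sidestep name clashes is to call dependency functions with an explicit namespace prefix instead of importing whole packages. A minimal sketch, assuming a made-up helper called moving_average(); it uses stats::filter() because that name is famously masked when dplyr is attached:

```r
# Hypothetical package function (the name is an illustration only).
# The external call is namespace-qualified, so it still reaches
# stats::filter() even if the user has attached a package that masks
# filter(), as dplyr does.
moving_average <- function(x, window = 3) {
  weights <- rep(1 / window, window)
  # sides = 2 centres the window; the ends of the result are NA.
  as.numeric(stats::filter(x, weights, sides = 2))
}

moving_average(c(1, 2, 3, 4, 5))
```

The same idea applies to any imported package: pkg::fun() in the code, or a selective importFrom() directive in NAMESPACE, keeps the set of names you need to watch small.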

jimhester · September 24, 2018, 3:37pm

As in many things in computing there is a tradeoff.

By having your package depend on another package you get

Features for free, because the code already exists
Bug testing, as the package already has (at least a few) users who have already caught some bugs

However you also

Must adapt your code if the dependency changes
May not be using all (most) of the features of the dependency, so there is some overhead

If you do not have many dependencies your code will likely have

Fewer features
Potentially more bugs, as you will need to implement functionality yourself (that is not as well tested by users)

But it will be

Stable
Free of unused functionality

How this weighs out really depends on both the maintainer and the audience of the package. If your audience is other package developers and you are more comfortable writing functionality from scratch, you can use fewer dependencies. If your audience is mainly users (who will already have most of the packages installed) and you have less experience, it is OK to have (relatively) more dependencies.

How you weigh a specific dependency also depends on how much functionality you are using from the package, how long the package takes to build, and how many dependencies the additional package has itself.
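To get a rough feel for how many packages a dependency pulls in, something like the following sketch may help. count_deps() is a hypothetical helper, and it assumes you only care about packages installed locally; passing utils::available.packages() as db would query a CRAN mirror instead and needs network access.

```r
# Hypothetical helper: count the recursive Depends/Imports/LinkingTo
# dependencies of each package, looking only at locally installed packages.
count_deps <- function(pkgs, db = utils::installed.packages()) {
  deps <- tools::package_dependencies(
    pkgs,
    db = db,
    which = c("Depends", "Imports", "LinkingTo"),
    recursive = TRUE
  )
  # One count per requested package, named by package.
  lengths(deps)
}

# Base packages are always installed, so this call runs anywhere.
count_deps(c("stats", "tools"))
```

Running count_deps("dplyr") next to count_deps("rlang") on a machine with both installed should show the difference discussed in this thread.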

If you are depending on dplyr and only really using it for subsetting data frames, maybe you could switch to base functionality without much trouble. On the other hand, if you are depending on dplyr and using it for the SQL translations for different databases, there really isn't any alternative.
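For the subsetting case, a base R replacement can be quite short. A minimal sketch with made-up data; the dplyr call is shown only as a comment for comparison:

```r
# Small example data frame (invented for illustration).
df <- data.frame(
  id    = 1:5,
  group = c("a", "b", "a", "b", "a"),
  value = c(10, 20, 30, 40, 50)
)

# dplyr version, for comparison (not run here):
# dplyr::select(dplyr::filter(df, group == "a", value > 10), id, value)

# Base R equivalent: a logical row index plus a vector of column names.
# which() drops rows where the condition is NA, as dplyr::filter() does.
keep <- which(df$group == "a" & df$value > 10)
result <- df[keep, c("id", "value"), drop = FALSE]
result
```

One visible difference is that the base version keeps the original row names, so reset them with rownames(result) <- NULL if that matters.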

Also, some packages (such as rlang) really are meant to be easily used in packages: rlang is written in C (which generally compiles much faster than complex C++) and has no additional dependencies. Other packages (like dplyr and stringi) are much heavier and take much longer to compile from source.

jessesadler · September 24, 2018, 4:56pm

Thanks @nwerth, @jimhester, and @mishabalyasin. This is really helpful and clarifies a couple of things I had heard about or thought about but not fully grasped. It is also good to know, and makes sense, that rlang is meant to be used in packages.

maelle · October 10, 2018, 10:43am

Interesting thread!

@jessesadler You might also enjoy this recent blog post by Scott Chamberlain: https://recology.info/2018/10/limiting-dependencies/