Same data in multiple packages or independent data package?

I'm writing to ask about some best practices in package development. I am developing a couple packages that are intended to be used for the same type of (linguistic) data. I think it makes sense to split the functions up into two packages because each contains a cohesive group of functions that perform a specific task. I've prepared some data to be used for vignettes and example code (as well as other teaching demonstrations outside of these two packages). Since it very well may be the case that both packages will be loaded within the same script, I don't want to create a clash in the datasets.

So my question is this: is it better to duplicate the data across the two packages, or would it make more sense to create a third, data-only package that the other two would import for vignettes? I'm inclined to make the data-only package, because I may create additional packages in the future. Does the data-only package need to be on CRAN for the others to import it in the DESCRIPTION file or can it just be on GitHub?

I've prepared some data to be used for vignettes and example code (as well as other teaching demonstrations outside of these two packages). Since it very well may be the case that both packages will be loaded within the same script, I don't want to create a clash in the datasets.

Consider creating the data package and listing it as a remote dependency of each of the two main packages. Within each package, you can then decide whether to re-export any data sets from the data package that you want loaded along with it.

Does the data-only package need to be on CRAN for the others to import it in the DESCRIPTION file or can it just be on GitHub?

Your package can have non-CRAN dependencies, such as packages hosted on GitHub. You can declare these dependencies in the Remotes: field of the DESCRIPTION file. See here for more about this. Note that you cannot submit a package to CRAN if the dependencies aren't also on CRAN.
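As a rough sketch of what that could look like, here is a DESCRIPTION fragment for one of the two main packages. The package name lingtools, the data package name lingdata, and the GitHub account yourname are all made up for illustration. The data package still has to appear in a regular dependency field; Remotes: only tells tools such as devtools, remotes, or pak where to find it, and plain install.packages() does not read it.

```
Package: lingtools
Imports:
    lingdata
Remotes:
    yourname/lingdata
```

If the data is only needed for vignettes and examples, listing lingdata under Suggests: instead of Imports: would work the same way with Remotes:.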

Thank you for your response! I think I might go ahead with the data package, get it up on CRAN eventually, but in the meantime, link the other packages to the GitHub version. Didn't know that was possible!

That is not quite correct. You can still submit to CRAN as long as you only list the non-CRAN package under Suggests, make it available from a CRAN-like repository, and make sure that everything works without the package being installed. See Hosting Data Packages via... The R Journal for a full write-up.
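A minimal sketch of the DESCRIPTION side of this, assuming a hypothetical data package called lingdata served from a hypothetical CRAN-like repository at https://example.org/r-repo. Additional_repositories is the standard DESCRIPTION field for pointing at repositories outside CRAN; the names and URL here are placeholders, not anything from the original poster's setup.

```
Package: lingtools
Suggests:
    lingdata
Additional_repositories: https://example.org/r-repo
```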

Yes, this is true and an important clarification!
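
For the "everything works without the package being installed" part, one possible pattern is to guard every use of the suggested package with requireNamespace(). The sketch below is only an illustration: load_example_data() is a hypothetical helper, and lingdata and vowels are made-up package and data set names.

```r
# Hypothetical helper: returns the requested data set if the optional
# data package is installed, and NULL (with a message) otherwise.
load_example_data <- function(pkg = "lingdata", name = "vowels") {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    message("Package '", pkg, "' is not installed; skipping this example.")
    return(NULL)
  }
  # Load the data set into a private environment so nothing is
  # written into the user's workspace.
  env <- new.env()
  utils::data(list = name, package = pkg, envir = env)
  env[[name]]
}

# Usage in an example or vignette chunk: only run the demo code
# when the data is actually available.
vowels <- load_example_data()
if (!is.null(vowels)) {
  print(head(vowels))
}
```

Because the data is always fetched with an explicit package name, it also avoids any clash if both main packages are loaded in the same script.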