Distributing data with a package

The standard method of distributing data with your package is in R/sysdata.rda. If the dataset is sufficiently large, that will cause problems with CRAN. Ideally, the package would download the data silently at install time, but I don't think that is possible. Is there a good way to distribute package data while still complying with CRAN guidelines?

For example, rnaturalearth stores some data in a separate package, rnaturalearthhires, and then asks the user to install rnaturalearthhires with devtools::install_github(). Technically that works, but it's somewhat cumbersome. You may also end up with two packages out of sync if the data changes over time. You may as well just have only the GitHub package.

Is there a good solution to this problem?

There are a number of options.

  1. You can put the data in its own package. This is a good option if the data is not very large and does not change frequently.
  2. You can provide a function in the package that downloads the data. E.g. the webdriver::install_phantomjs() function downloads software. Not the same as data, but very similar. I am sure that there are packages that download data.
  3. You can download the data at install time. There are various ways to do this. You could have a .R file in /data that downloads and bundles the data at install time, if you specify BuildResaveData: no in DESCRIPTION.
    You can also download and bundle from a configure script. This does not have to be an autoconf script; it can be a plain shell script. See e.g. the configure and configure.win files in the ps package: ps/configure at main · r-lib/ps · GitHub (They do not actually download any data, but they show you how to call R from ./configure.)
    The main issue with this approach is that many companies are behind firewalls and have their own CRAN mirror, but they'll not let you download arbitrary files from the internet.
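
As a rough illustration of option 2, here is a minimal sketch of a function that downloads a data file on first use and caches it. The function name get_pkg_data, the package name "mypkg", and the URL passed in are all hypothetical; tools::R_user_dir() requires R >= 4.0.0.

```r
# Hypothetical helper: download a data file on first use and cache it.
# "mypkg" and the function name are placeholders, not from a real package.
get_pkg_data <- function(url,
                         cache_dir = tools::R_user_dir("mypkg", which = "cache")) {
  dir.create(cache_dir, recursive = TRUE, showWarnings = FALSE)
  dest <- file.path(cache_dir, basename(url))
  if (!file.exists(dest)) {
    # mode = "wb" keeps binary files intact on Windows.
    status <- utils::download.file(url, dest, mode = "wb", quiet = TRUE)
    if (status != 0) {
      unlink(dest)  # do not leave a partial file in the cache
      stop("Download failed: ", url)
    }
  }
  dest  # path to the cached file
}
```

A caller would then read the file from the returned path, e.g. readRDS(get_pkg_data(url)). Because the download happens at run time rather than at install time, the firewall problem mentioned above shows up as a run-time error instead of a failed installation.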

Does that comply with CRAN standards? Also, do you have an example of that somewhere? I couldn't find any documentation for BuildResaveData.

Yes. In fact this is in the repository policy, see https://cran.r-project.org/web/packages/policies.html:

Downloads of additional software or data as part of package installation or startup should only use secure download mechanisms (e.g., ‘https’ or ‘ftps’). For downloads of more than a few MB, ensure that a sufficiently large timeout is set.

Well, here is an example that parses the data from a file at install time. I don't know of any example that actually downloads it: RSiena/allEffects.R at master · cran/RSiena · GitHub

If you choose this method, then make sure that you set options(timeout) and that you use download.file(..., mode = "wb").
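
A minimal sketch of what that could look like, assuming a hypothetical helper called download_binary; the default timeout value is an arbitrary choice, and in a real package the URL should be an https one, per the policy quoted above.

```r
# Hypothetical helper for an install-time script; the name is a placeholder.
download_binary <- function(url, destfile, timeout = 600) {
  # Raise the timeout for this call only and restore the old value on exit.
  old <- options(timeout = max(timeout, getOption("timeout")))
  on.exit(options(old), add = TRUE)
  # mode = "wb" avoids corrupting binary files (e.g. .rds) on Windows.
  status <- utils::download.file(url, destfile, mode = "wb", quiet = TRUE)
  if (status != 0) stop("Download failed with status ", status)
  invisible(destfile)
}
```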

It is documented in WRE:

A package can control how its data is resaved by supplying a ‘BuildResaveData’ field (with one of the values given earlier in this paragraph) in its DESCRIPTION file.

(Writing R Extensions)

You need it, otherwise R CMD build will download and bundle the data.

Btw. if you download the data at install time, then it will also be bundled into binary packages, for better or worse.

I just tested it and it does seem to work. Curiously, I had to set LazyData: true.

The resulting objects are exported (which makes sense for data/ content). Is there a way to control that?

AFAIR there isn't. You could put the data into sysdata.rda (see R-exts) or into /inst, from a configure file. Data from /data/*.R will always be exported.

It's nice that data/*.R is automatically executed, but if I want to prevent export I guess I'll have to try configure. It looked complicated for ps, but it probably can be much simpler if it's just executing an R file.

Here is a much simpler configure example for (dev) purrr, that only adds a new file to /man: purrr/configure at main · tidyverse/purrr · GitHub

Btw. you only need to download the data once per installation. ./configure will run twice for older R versions on Windows I believe, once for 64 bit R and once for 32 bit R. So you could check if the data file is already there in /inst and if it is then you skip the download.
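
The skip-if-present check could be sketched roughly like this, in an R script that ./configure runs. The function name ensure_data_file, the script location, the URL, and the destination path are all assumptions for illustration, and like the rest of this approach it is untested from a real configure script.

```r
# Hypothetical helper for a script such as tools/download-data.R.
# Returns TRUE if the file was downloaded, FALSE if it was already there.
ensure_data_file <- function(url, dest) {
  if (file.exists(dest)) {
    message("Data already present, skipping download: ", dest)
    return(invisible(FALSE))
  }
  dir.create(dirname(dest), recursive = TRUE, showWarnings = FALSE)
  status <- utils::download.file(url, dest, mode = "wb", quiet = TRUE)
  if (status != 0) stop("Download failed: ", url)
  invisible(TRUE)
}

# The script would then end with a call like this (placeholder URL and path):
# ensure_data_file("https://example.org/big-data.rds",
#                  "inst/extdata/big-data.rds")
```

With this check, a second ./configure run for the other architecture finds the file already in place and does nothing.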

TBH I never tried adding data from ./configure but I believe it should work.

Also, you can possibly put it in sysdata.rda, instead of /inst.

Thank you so much for the purrr example. It's a very popular package and the file was added in the last few months, so it should be based on the latest standards.

I modified the example and it seems to work. They put the executed R script under tools/, which it turns out is "the preferred place for auxiliary files needed during configuration". However, I got an R CMD check warning:

❯ checking top-level files ... WARNING
  A complete check needs the 'checkbashisms' script.
  See section ‘Configure and cleanup’ in the ‘Writing R Extensions’
  manual.

This was solved by adding checkbashisms to $PATH. Hopefully this won't be an issue with any remote checks (CRAN, R-hub, GitHub Actions, etc.).