Packages with a large amount of data

Hi experts,

I have a question about strategies for distributing a package whose data will exceed the size allowed by CRAN (the package has a large amount of spatial data).

So far, the only approach I have thought of is a pure data package on GitHub that is either a dependency or is installed as part of the installation instructions. However, remotes::install_github() times out, even though I can download the repo's zip file from the browser without any issues.

Does anyone have strategies to deal with this type of situation?

Thanks,
Carlos

When I was dealing with exactly the same use case (the package in question was RCzechia, which is on CRAN), I ended up storing the offending data remotely on AWS and downloading it via a generic downloader function.

Issues I had to address were:

  • failing gracefully when internet resources are not available - a non-negotiable CRAN requirement, and since CRAN's servers are not that powerful and are heavily loaded, they often time out
  • caching the downloaded files on user machines - I ended up caching in tempdir (i.e. once per session), but others have opted for more permanent caching (I believe {tigris} does). Again, this has implications with regard to CRAN policy
  • download methods - I work on Linux, so curl comes naturally to me, but a lot of users live on Windows, which is a world apart; there were some inconsistencies in download methods

I have a bunch of functions that look like kraje.R in the jlacko/RCzechia repository on GitHub (the low-res object is small enough to pass CRAN requirements and is internal; the high-res one is remote), plus a dot-prefixed function, downloader.R in the same repository, that does the downloading (and the graceful failing, as required).
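
To make the first two points above concrete (failing gracefully and caching once per session), a downloader could look roughly like the sketch below. This is not the RCzechia code: `fetch_remote` and its arguments are hypothetical names, and it assumes base R's `utils::download.file()` with its default method is adequate for the file in question.

```r
# Hypothetical helper: download a remote file once per session and
# return its local path, or NULL (with a message) if the download fails.
fetch_remote <- function(url, cache_dir = tempdir()) {
  dest <- file.path(cache_dir, basename(url))

  # Reuse the copy downloaded earlier in this session, if there is one
  if (file.exists(dest)) return(dest)

  # Treat both errors and warnings (timeouts, missing files, ...) as failure
  ok <- tryCatch({
    status <- utils::download.file(url, dest, mode = "wb", quiet = TRUE)
    status == 0
  }, error = function(e) FALSE, warning = function(w) FALSE)

  if (!ok) {
    unlink(dest)  # remove any partial download
    message("Remote data is not available right now; please try again later.")
    return(invisible(NULL))
  }
  dest
}
```

A caller would check for `NULL` before reading the file, so that examples, tests and vignettes can exit quietly instead of throwing an error when the remote resource is down.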


Following the R philosophy of lazy evaluation, I would put the data in an S3 bucket or other repository and then provide a function

get_sf_data <- function() read.csv("url")  # "url" stands in for the data's actual location
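
A slightly fuller sketch of the same idea, assuming it is acceptable to keep the data in memory for the rest of the session: the data is read only on the first call and reused afterwards. The names `get_sf_data_cached`, `.data_cache` and the `reader` argument are hypothetical and only there for illustration.

```r
# Session-level store; in a package this would live in the package namespace
.data_cache <- new.env(parent = emptyenv())

# Hypothetical loader: read the data at `url` on first use, then reuse it
get_sf_data_cached <- function(url, reader = utils::read.csv) {
  if (!exists(url, envir = .data_cache, inherits = FALSE)) {
    assign(url, reader(url), envir = .data_cache)
  }
  get(url, envir = .data_cache, inherits = FALSE)
}
```

Passing the reader as an argument keeps the sketch independent of the file format; for spatial data one might pass a function from a spatial package instead of `read.csv`.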

To build upon the previous answers, an R-hub blog post, "How to distribute data with your R package", presents some strategies for "data outside your package".

The example of rnaturalearth might be especially relevant for you, as it's a geospatial one; see https://twitter.com/southmapr/status/1262759210946682888

For #rnaturalearth I made 3 packages, 2 on CRAN, 1 not, rnaturalearth has methods and small example data, rnaturalearthdata has medium res data, rnaturalearthhires has hires data and is hosted by @rOpenSci because too big for CRAN.
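
On the code side, a split like this usually means the main package has to cope with the big data package not being installed. A minimal sketch of that check, assuming a hypothetical companion package called "mypkgdata" (the function name `get_hires` is also made up, and this is not how rnaturalearth itself is implemented):

```r
# Hypothetical accessor: return an object from an optional data package,
# or NULL with a hint if that package is not installed.
get_hires <- function(name, data_pkg = "mypkgdata") {
  if (!requireNamespace(data_pkg, quietly = TRUE)) {
    message("High-resolution data needs the '", data_pkg, "' package; ",
            "see the installation instructions.")
    return(invisible(NULL))
  }
  # Equivalent to data_pkg::name, but with the names held in variables
  getExportedValue(data_pkg, name)
}
```

The data package would then typically be listed under Suggests rather than Imports, so the main package still installs without it.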

Permanent caching is explained in another R-hub blog post, "Persistent config and data for R packages": you can use the rappdirs package or, if your package depends on R 4.0.0 or later, tools::R_user_dir().
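
A minimal sketch of picking such a directory, assuming a hypothetical package name "mypkg" and a made-up helper `pkg_cache_dir`. It falls back to rappdirs only on R versions older than 4.0.0, and it creates the directory only when explicitly asked to, since writing outside the session's temporary directory is something CRAN policy restricts.

```r
# Hypothetical helper: path of a persistent, per-user cache directory
pkg_cache_dir <- function(pkg = "mypkg", create = FALSE) {
  path <- if (getRversion() >= "4.0.0") {
    tools::R_user_dir(pkg, which = "cache")
  } else {
    rappdirs::user_cache_dir(pkg)  # needs the rappdirs package on older R
  }
  # Only touch the file system when the caller opts in
  if (create && !dir.exists(path)) {
    dir.create(path, recursive = TRUE)
  }
  path
}
```

The returned path could be passed as the cache location of a downloader function, so that files survive between sessions.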


Thanks everyone for your answers!

