Package with "big" data - how should I make data set available and conform to CRAN's requirements?

Hi everyone,

I am developing a package with a data set I scraped from a website and I would like to make it available on CRAN.

However, the compressed data is already larger than 30 MB.

I was wondering, what would be the recommended way to include the data in my package?

My idea was to host the data on GitHub, for example, and then have a function in the package to download the data file. However, I have no idea if this is a recommended practice in these types of situations.

Thank you for the help!

When facing a similar issue I resolved it in the way you describe (i.e. hosting the dataset online - Amazon S3 in my case - and having a function serve it).

In my case this was suggested by the CRAN maintainers themselves (in a comment along the lines of "take your huge data elsewhere" :slight_smile: )

Some things for your consideration:

  • the dataset is likely to be downloaded a lot, even for a niche package (a lot of CRAN traffic is CI testing); this may affect your choice of hosting provider
  • CRAN does not really care where your external data is stored, but expects you to access it via https
  • CRAN discourages you from caching the dataset locally (tempdir is OK, but that is the limit)
  • bear in mind that utils::download.file() can have platform-related issues that leave the downloaded file unreadable; I have found curl::curl_download() more reliable (your mileage may vary)
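
A minimal sketch of what such a download function could look like, following the points above (https only, tempdir as the only local storage, curl::curl_download() for the transfer). The function name `fetch_dataset()` and the URL are placeholders made up for illustration, and the file is assumed to be an .rds file; adapt both to your own hosting setup.

```r
# Sketch only: the function name and the URL are hypothetical placeholders.
fetch_dataset <- function(url = "https://example.com/mydata.rds") {
  # External data is expected to be accessed over https
  if (!startsWith(url, "https://")) {
    stop("The dataset URL must use https.", call. = FALSE)
  }
  if (!requireNamespace("curl", quietly = TRUE)) {
    stop("Package 'curl' is needed to download the dataset.", call. = FALSE)
  }

  # Store the file in the session's temporary directory only
  destfile <- file.path(tempdir(), basename(url))

  # Reuse the file if it was already downloaded in this R session
  if (!file.exists(destfile)) {
    curl::curl_download(url, destfile, quiet = TRUE)
  }

  # Assumption: the hosted file was written with saveRDS()
  readRDS(destfile)
}
```

Because the file lives in tempdir(), it is downloaded again in every new R session, which is the trade-off of staying within the limit mentioned above.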

Thank you very much for your response.

Also, thanks for the very useful advice. I had no idea that utils::download.file() could be problematic, for example.

Glad to be of service!

I only learned about the download.file issues the hard way, by having a download fail for a package that was already released (I am on Linux, and the specifics of Windows are sometimes lost on me).

The behavior can be tweaked by specifying the method and mode arguments in the function call (instead of relying on the automatic defaults), but having been burned I opted for curl instead.
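
For anyone who prefers to stay with download.file(), a sketch of setting both arguments explicitly might look like this. The thread does not say which values were used; `method = "libcurl"` and `mode = "wb"` are common choices and are assumptions here, and `download_binary()` is a made-up wrapper name.

```r
# Sketch only: download_binary() is a hypothetical wrapper name.
download_binary <- function(url, destfile) {
  status <- utils::download.file(
    url,
    destfile = destfile,
    method = "libcurl",  # assumption: this R build has libcurl support
    mode = "wb",         # write in binary mode so the file is not altered on Windows
    quiet = TRUE
  )
  # download.file() returns 0 (invisibly) on success
  if (status != 0) {
    stop("Download failed with status ", status, call. = FALSE)
  }
  invisible(destfile)
}
```

You can check `capabilities("libcurl")` to see whether that method is available in a given R installation.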

Two other ideas from a recent blog post I wrote on the R-hub blog.

Thank you for your response! Very useful! I ended up setting up a remote installation function that the user calls once the package is installed.
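
The post does not show the actual function, so the following is only a guess at the general shape of such a setup: one function the user calls explicitly to download the data, and one that loads it afterwards. All names (`install_data()`, `load_data()`, `data_path()`, "mypackage") and the URL are hypothetical, and storing the file under tools::R_user_dir() (available in R >= 4.0.0) is an assumption, not something stated in the thread.

```r
# Sketch only: every name and the URL below are hypothetical placeholders.
data_path <- function() {
  # Per-user data directory for the package (requires R >= 4.0.0)
  file.path(tools::R_user_dir("mypackage", which = "data"), "mydata.rds")
}

# Called explicitly by the user, never automatically at install or load time
install_data <- function(url = "https://example.com/mydata.rds",
                         dest = data_path()) {
  dir.create(dirname(dest), recursive = TRUE, showWarnings = FALSE)
  curl::curl_download(url, dest, quiet = FALSE)
  invisible(dest)
}

# Loads the data, with a pointer to install_data() if it is missing
load_data <- function(path = data_path()) {
  if (!file.exists(path)) {
    stop("Data not found. Run install_data() first.", call. = FALSE)
  }
  readRDS(path)
}
```

Since the download only happens when the user asks for it, writing outside tempdir is the user's own decision rather than a side effect of the package.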
