Data size limits for packages?


I’ve been developing a package for college sports and just came across a wealth of data on Reddit that I’d like to incorporate into it. The purpose of the package is to provide access to college statistics, either through various online resources or through data hosted within the package itself.

Are there any good practices for including data in a package? My main concern is the amount of data I want to provide. I’ve got data from 2001 to 2017, and I think it will total more than 1 GB.
Any suggestions on how to handle this?


I’m not sure what your exact concern is: too much data for the user, or too big a package for CRAN?

If it’s CRAN you’re worried about, this is an example of how to host (via GitHub) and use a data package separate from your main package:
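As a rough illustration of that pattern (a sketch under assumptions, not the linked example itself): the main package lists the data package under `Suggests` in its DESCRIPTION, points to the hosting repository with `Additional_repositories`, and checks for the data package at run time. The package name `collegedata`, the repository URL, and the helper names below are all hypothetical placeholders.

```r
# Hypothetical names: "collegedata" stands in for a data-only package and
# the repository URL is a placeholder, not a real location.
data_pkg  <- "collegedata"
data_repo <- "https://example.github.io/drat"

# TRUE when the optional data package is installed.
has_data_pkg <- function(pkg = data_pkg) {
  requireNamespace(pkg, quietly = TRUE)
}

# Return a dataset from the data package, or stop with install instructions.
get_dataset <- function(name, pkg = data_pkg, repo = data_repo) {
  if (!has_data_pkg(pkg)) {
    stop(
      "Dataset '", name, "' needs the '", pkg, "' package. Install it with:\n",
      "  install.packages('", pkg, "', repos = '", repo, "')",
      call. = FALSE
    )
  }
  # Load into a private environment so the user's workspace is untouched.
  env <- new.env()
  utils::data(list = name, package = pkg, envir = env)
  get(name, envir = env)
}
```

Because the data package is only suggested, the main package still installs and loads without it; functions that need the data fail with a message telling the user how to get it.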


From the CRAN Repository Policy:

Packages should be of the minimum necessary size. Reasonable compression should be used for data (not just .rda files) and PDF documentation: CRAN will if necessary pass the latter through qpdf.
As a general rule, neither data nor documentation should exceed 5MB (which covers several books). A CRAN package is not an appropriate way to distribute course notes, and authors will be asked to trim their documentation to a maximum of 5MB.

Where a large amount of data is required (even after compression), consideration should be given to a separate data-only package which can be updated only rarely (since older versions of packages are archived in perpetuity).

Similar considerations apply to other forms of “data”, e.g., .jar files.
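To see where a package stands against that 5MB guideline, something like the following could be run from the package root. It is only a sketch: it assumes the data lives as `.rda`/`.RData` files under `data/`, and the two helper names are made up for illustration. `tools::resaveRdaFiles()` is the base R tool for recompressing saved data files.

```r
# Total size, in MB, of the .rda/.RData files in a directory.
# "data" is an assumed path relative to the package root.
data_size_mb <- function(dir = "data") {
  files <- list.files(dir, pattern = "\\.(rda|RData)$", full.names = TRUE)
  sum(file.size(files)) / 1024^2
}

# Recompress every saved data file in `dir` and report the size
# before and after, so the result can be compared with the 5 MB guideline.
recompress_data <- function(dir = "data") {
  before <- data_size_mb(dir)
  # compress = "auto" lets R pick among gzip, bzip2 and xz for each file.
  tools::resaveRdaFiles(dir, compress = "auto")
  after <- data_size_mb(dir)
  c(before_mb = before, after_mb = after)
}
```

Note that this rewrites the files in place, so it is best run on a copy or under version control.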

1 GB may not be feasible for CRAN. At the very least, the data would have to go into a separate package, and you would probably have to come up with a plan for incremental updates. Otherwise, updating the existing package to include 2018 data next year would force every user to download another 1 GB and would more than double the storage CRAN mirrors need for the package, since the old version stays archived.
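One possible shape for incremental updates, sketched under assumptions: publish one file per season somewhere outside CRAN and have the package download and cache only the seasons a user asks for. Adding 2018 then means publishing one new file. The base URL, the `season_<year>.rds` naming scheme, the package name `collegestats`, and the function names are all hypothetical.

```r
# Hypothetical layout: one .rds file per season under this placeholder URL.
base_url <- "https://example.com/college-data"

# File name for a given season, e.g. "season_2017.rds".
season_file <- function(year) {
  sprintf("season_%d.rds", as.integer(year))
}

# Per-user cache directory for the (hypothetical) "collegestats" package.
# tools::R_user_dir() requires R >= 4.0.0.
cache_dir <- function() {
  dir <- tools::R_user_dir("collegestats", which = "cache")
  dir.create(dir, recursive = TRUE, showWarnings = FALSE)
  dir
}

# Load one season, downloading it only if it is not cached yet.
load_season <- function(year, dir = cache_dir(), url = base_url) {
  path <- file.path(dir, season_file(year))
  if (!file.exists(path)) {
    src <- paste0(url, "/", season_file(year))
    utils::download.file(src, path, mode = "wb")
  }
  readRDS(path)
}
```

With this layout a user who only wants 2017 never downloads the other sixteen seasons, and a call such as `load_season(2017)` hits the network at most once.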