Best practices for packages that need to download and store large (~1 GB) data sets

I am writing a package that uses publicly available data which is appended several times a year. In addition to downloading it, there is a cleaning process which takes a while to run. I don't think it is in the user's best interest to have to download and clean the data on each new session. There are about 45 individual data tables.

So my question is, what are the best practices for this scenario? The options I'm considering are:

  1. Download, clean, and store locally and permanently all the tables during package installation (or on first load)?

  2. Download, clean, and store locally and permanently each individual table as it is requested by the user, if it has not been done previously.

  3. (I'm not sure if this is possible) have a separate package that includes the cleaned data tables. Though this means that I would need to update it when the external source data is updated, and then users would need to update that package as well. If this is a good option, I could use some advice on how to implement it.

  4. Something better I'm not aware of?

For options 1 and 2, I could use some advice on best practices for where to locally and permanently store the data. Should it go into the package dir that is a subdir of the library directory?

Thanks much for advice!


This is a really great question!

I can't say I know definitively what the best practice here is, but it is almost certainly not requiring users to re-download all the data constantly.

That said, here are my thoughts on what I would try if it were my problem.

Is your package simply a data package (it exists solely to provide users with access to the data), or does it also include a host of custom functions for analyzing that data?

If it is just a data package, I would collect and clean the data myself every three months or so (more often if the data is updated more frequently, or if it is mission-critical to always have the most up-to-date data) and update the data frames included in the package.

If your package has a functional component to it, I would break the package into two components (code and data). That way users don't need to update huge datasets when they update the package.

One little-known thing you can do is actually alter the installed package files after installation. I am doing this in a package I am writing for an intro R class I am TAing.

Feel free to skip the section below; it's only provided for context.


My use case: I have a project template for students to use for their homework assignments, which builds an R Markdown template file for the homework, a .bib file for their citations, and a default project structure. In the new project wizard I ask for their name and university ID to populate the author field in the YAML header, but I wanted them to be able to save these fields and not need to re-enter them on each homework. Since these defaults are populated from a .dcf file and, to my knowledge, cannot be computed at runtime (someone correct me if I'm wrong), what I do instead is: if they decide to save their information, I locate the .dcf file in the package directory and edit the text.
You can see this in action in my build_homework_file.R.
I locate the package directory (at the time of this writing, on line 37) with:

```r
pkg_path <- system.file(package = "UCLAstats20")
```

The updating happens (again, at the time of this writing) on lines 134-143 with:

```r
if (dots[["save"]]) {
  dcf_file <- dir(pkg_path,
                  pattern = "\\.dcf",
                  recursive = TRUE,
                  full.names = TRUE)
  dcf <- readLines(dcf_file)
  dcf_updates <- grep("Label: Name|Label: UID|Label: Save as Defaults|Label: Bibliography File", dcf) + 1
  dcf[dcf_updates] <- paste("Default:", c(dots[["student"]], dots[["uid"]], dots[["bib_file"]], "On"))
  writeLines(dcf, dcf_file)
}
```

What I would try to do is write a function which can:

  1. Identify the last update time of the data set.
  2. Pull from online only new records since the latest update.
  3. Augment the existing internal data set with the new records.

You could set up a separate GitHub repo to house the data. I'm imagining you'd have, say, monthly patch files hosted there, which you would clean and upload yourself. Then when users update their data set, it would grab all of the new patch files, rbind() them, and overwrite the data file in the package directory...
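The three-step updater above could be sketched roughly like this. Note that `list_patches()` and `download_patch()` are hypothetical helpers standing in for whatever mechanism actually serves the patch files (a GitHub repo, a release asset, etc.); they are passed in as arguments here so the function stays testable:

```r
# Sketch of: (1) find last update time, (2) pull only newer patches,
# (3) rbind them onto the stored data set.
update_data <- function(data_dir, list_patches, download_patch) {
  dir.create(data_dir, recursive = TRUE, showWarnings = FALSE)
  stamp_file <- file.path(data_dir, "last_update")

  # 1. Identify the last update time of the data set
  last_update <- if (file.exists(stamp_file)) {
    as.Date(readLines(stamp_file))
  } else {
    as.Date("1900-01-01")  # sentinel: fetch everything on first run
  }

  # 2. Pull from online only records newer than the last update
  patches <- list_patches(since = last_update)
  if (length(patches) == 0L) return(invisible(FALSE))
  pieces <- lapply(patches, download_patch)

  # 3. Augment the existing data set with the new records
  data_file <- file.path(data_dir, "dataset.rds")
  old <- if (file.exists(data_file)) list(readRDS(data_file)) else list()
  saveRDS(do.call(rbind, c(old, pieces)), data_file)
  writeLines(format(Sys.Date()), stamp_file)
  invisible(TRUE)
}
```

The timestamp file is a simple way to make step 1 survive across sessions without keeping any state in memory.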

Thanks for the detailed response, this is helpful and has given me some things to think about. I think it is both a data package and a functional package, so perhaps splitting it up is the way to go.

On the other hand, there may be many tables that the user doesn't want and shouldn't be downloaded until needed in order to conserve disk space. I don't see an easy way to do that with a data package since it would have to have all the data.

If I decide to download and store permanently on demand, is it kosher to save it to a subdir of the package path?

Disclaimer: I find such questions so interesting that I wrote a whole blog post about it. :grin:

> Download, clean, and store locally and permanently all the tables during package installation (or on first load)?

I think option 1 is rare; I don't remember ever seeing it done, and I'm not sure you'd be allowed to (on CRAN, at least).

> Download, clean, and store locally and permanently each individual table as it is requested by the user if it has not been done previously

This sounds like a good idea. The data could be stored in an app dir (e.g. via the rappdirs package, or `tools::R_user_dir()` on R >= 4.0). However, you'll also need to think about storing the date at which each table was downloaded, to compare it against the date of the data's latest update, if I follow correctly? You might also find webmiddens, or at least the ideas behind it, relevant.
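A minimal sketch of that on-demand pattern: fetch and clean a table the first time it is requested, then serve the cached copy on every later call. `tools::R_user_dir()` (R >= 4.0) returns a platform-appropriate per-user directory; `fetch_and_clean` is a hypothetical stand-in for your download-and-clean step, and `"yourpkg"` is a placeholder package name:

```r
get_table <- function(name, fetch_and_clean,
                      cache_dir = tools::R_user_dir("yourpkg", "cache")) {
  dir.create(cache_dir, recursive = TRUE, showWarnings = FALSE)
  path <- file.path(cache_dir, paste0(name, ".rds"))
  if (!file.exists(path)) {
    # Slow path: runs once per table, then the cached .rds is reused
    saveRDS(fetch_and_clean(name), path)
  }
  readRDS(path)
}
```

With ~45 tables, this also solves the disk-space concern: only the tables a user actually asks for ever hit their machine.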

> (I'm not sure if this is possible) have a separate package that includes the cleaned data tables. Though this means that I would need to update it when the external source data is updated, and then users would need to update that package as well. If this is a good option, I could use some advice on how to implement it.

There is an R Journal article about this strategy: [Hosting Data Packages via drat: A Case Study with Hurricane Exposure Data](https://journal.r-project.org/archive/2017/RJ-2017-026/index.html), by Brooke Anderson and Dirk Eddelbuettel.

Good luck!

The ipeds package for college statistics and USCensus2010 suite take different approaches here, so either way is valid imho.

If the data will be transient, I'd put it in a temporary directory (e.g. `tempdir()`). Otherwise it will have to be stored somewhere longer-lived.

For desktop users, please consider checking the XDG env vars before creating a bespoke folder in their home directory - see also https://wiki.archlinux.org/index.php/XDG_Base_Directory
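One way to honour XDG before falling back to a default, per the Base Directory spec (a sketch; `"yourpkg"` is a placeholder, and on R >= 4.0 `tools::R_user_dir()` already performs this kind of lookup for you):

```r
# Prefer $XDG_DATA_HOME if set; otherwise fall back to the spec's
# default of ~/.local/share, then append a package-specific subdir.
xdg_data_dir <- function(pkg = "yourpkg") {
  base <- Sys.getenv("XDG_DATA_HOME")
  if (!nzchar(base)) {
    base <- file.path(Sys.getenv("HOME"), ".local", "share")
  }
  file.path(base, pkg)
}
```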

Saving the data into the package folder is also a good option.
