This is a really great question!
I can't say I know definitively what the best practice here is, but it is almost certainly not requiring them to re-download all the data constantly.
That said, here are my thoughts on what I would try if it were my problem.
Is your package simply a data package (it exists solely to give users access to the data), or does it also include a host of custom functions for analyzing that data?
If it is just a data package, I would collect and clean the data myself every three months or so (more often if the data is updated more frequently, or if having the most up-to-date data is mission critical) and update the data frames bundled in the package.
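The refresh itself can be a small script you re-run by hand; `usethis::use_data()` overwrites the bundled `.rda` file. A minimal sketch, where `fetch_raw_data()` and `clean_data()` are hypothetical stand-ins for whatever collection and cleaning steps you already have:

```r
# data-raw/update_data.R -- re-run by hand every quarter or so.
# fetch_raw_data() and clean_data() are hypothetical stand-ins for
# your own collection and cleaning steps.
raw <- fetch_raw_data()
my_data <- clean_data(raw)

# Overwrites data/my_data.rda in the package source; bump the
# package version and re-release afterwards.
usethis::use_data(my_data, overwrite = TRUE)
```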
If your package also has a functional component, I would break it into two packages (code and data), so users don't have to re-download huge datasets every time the code is updated.
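If you split things that way, the code package can check for the data package when it loads and tell users how to get it. A rough sketch with hypothetical package names (`myPkg` and `myPkgData`) and a made-up repository URL:

```r
# In the code package (hypothetical names throughout):
.onAttach <- function(libname, pkgname) {
  if (!requireNamespace("myPkgData", quietly = TRUE)) {
    packageStartupMessage(
      "The companion data package 'myPkgData' is not installed.\n",
      "Install it with:\n",
      "  install.packages(\"myPkgData\", repos = \"https://username.github.io/drat\")"
    )
  }
}
```

Since the data package rarely changes, users only re-download it when a new data release actually happens.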
One little-known thing you can do is alter the installed package files after installation. I am doing this in a package I am writing for an intro R class I am TAing.
Feel free to skip the section below; it is only provided for context.
My use case: I have a project template for their homework assignments, which builds an R Markdown template file for the homework, a .bib file for their citations, and a default project structure. In the new project wizard I ask for their name and university ID to populate the author field in the YAML header, but I wanted them to be able to save these fields rather than re-enter them on each homework. Those defaults are populated from a .dcf file and, to my knowledge, cannot be computed at runtime (someone correct me if I'm wrong), so if they decide to save their information, I locate the .dcf file in the package directory and edit the text directly.
You can see this in action in my build_homework_file.R.
I locate the package directory (at the time of this writing, on line 37) with,
```r
pkg_path <- system.file(package = "UCLAstats20")
```
The updating happens (again at the time of this writing) on lines 134-143 with,
```r
if (dots[["save"]]) {
  # Find the .dcf file inside the installed package directory
  dcf_file <- dir(pkg_path,
                  pattern = "\\.dcf",
                  recursive = TRUE,
                  full.names = TRUE)
  dcf <- readLines(dcf_file)

  # Each matched "Label:" line is followed by its "Default:" line,
  # so shift the indices by one; the replacement values must be in
  # the order the labels appear in the .dcf file
  dcf_updates <- grep("Label: Name|Label: UID|Label: Save as Defaults|Label: Bibliography File",
                      dcf) + 1
  dcf[dcf_updates] <- paste("Default:",
                            c(dots[["student"]], dots[["uid"]],
                              dots[["bib_file"]], "On"))
  writeLines(dcf, dcf_file)
}
```
What I would try to do is write a function which can (see the sketch after this list):
- Identify the last update time of the data set.
- Pull from online only new records since the latest update.
- Augment the existing internal data set with the new records.
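A rough sketch of such a function, assuming the package is called `myPkg`, the data ships as an `.rds` file under `extdata/`, the current data carries an attribute recording its last update, and `fetch_new_records()` stands in for your download-and-clean step (all hypothetical):

```r
update_data <- function() {
  # Path to the data file shipped inside the installed package
  # ("myPkg" and the extdata layout are assumptions)
  data_file <- file.path(system.file(package = "myPkg"),
                         "extdata", "my_data.rds")
  my_data <- readRDS(data_file)

  # 1. Identify the last update time of the data set
  last_update <- attr(my_data, "last_update")

  # 2. Pull only records newer than that; fetch_new_records() is a
  #    hypothetical stand-in for your download-and-clean step
  new_records <- fetch_new_records(since = last_update)

  # 3. Augment the existing data and write it back into the
  #    installed package directory
  my_data <- rbind(my_data, new_records)
  attr(my_data, "last_update") <- Sys.time()
  saveRDS(my_data, data_file)
  invisible(my_data)
}
```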
You could set up a separate GitHub repo to house the data. I'm imagining something like monthly patch files hosted there, which you would clean and upload yourself. Then when users update their data set, the function would grab all of the new patch files, rbind() them together, and overwrite the data file in the package directory...
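Roughly, the patch-grabbing step might look like this. It continues the sketch above (`my_data` and `last_update` as before), and the repository URL, index file, and monthly naming scheme are all invented for illustration:

```r
# index.txt lists one patch file per line, e.g. "2020-05.csv"
# (repo URL and file layout are made up for this sketch)
base_url <- "https://raw.githubusercontent.com/username/myPkg-data/master"
patches <- readLines(file.path(base_url, "index.txt"))

# Keep only patches newer than the local copy (zero-padded names
# sort correctly as plain strings)
new_patches <- patches[patches > format(last_update, "%Y-%m.csv")]

# Download each new patch and stack everything together
patch_data <- lapply(file.path(base_url, new_patches), read.csv)
my_data <- do.call(rbind, c(list(my_data), patch_data))
```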