I am developing a package with a data set I scraped from a website and I would like to make it available on CRAN.
However, the compressed data is already larger than 30 MB.
I was wondering, what would be the recommended way to include the data in my package?
My idea was to host the data on GitHub, for example, and then have a function in the package to download the data file. However, I have no idea if this is a recommended practice in these types of situations.
When facing a similar issue I resolved it in the way you describe (i.e. hosting the dataset online - Amazon S3 in my case - and having a function serve it).
In my case this was suggested by the CRAN maintainers themselves (in a comment along the lines of "take your huge data elsewhere").
Some things for your consideration:
it is likely the dataset will be downloaded a lot, even for niche packages (a lot of CRAN traffic is CI testing); this may impact your choice of hosting provider
CRAN does not really care where your external data is stored, but expects you to access it via https
CRAN discourages you from caching the dataset locally (tempdir is OK, but that is the limit)
bear in mind that utils::download.file() can have platform-related issues, resulting in your file being unreadable; I have found curl::curl_download() more reliable (your mileage may vary)
I only learned about the download.file() issues the hard way, by having a download fail for a package already released (I am on Linux and the specifics of Windows are sometimes lost on me). This can be mitigated by specifying the method and mode arguments of the function call (instead of relying on auto mode), but having got burned I opted for the curl package instead.
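A minimal sketch of such a serving function, preferring curl::curl_download() and falling back to utils::download.file() with mode = "wb" (the URL and file name are placeholders for wherever you end up hosting the data):

```r
# Download the dataset into tempdir() (CRAN-friendly) and read it in.
# The URL is an assumed placeholder - a GitHub release asset, S3 object, etc.
get_dataset <- function(url,
                        destfile = file.path(tempdir(), "dataset.rds")) {
  if (!file.exists(destfile)) {
    if (requireNamespace("curl", quietly = TRUE)) {
      # curl handles binary transfers consistently across platforms
      curl::curl_download(url, destfile, quiet = TRUE)
    } else {
      # base R fallback; mode = "wb" avoids Windows corrupting binary files
      utils::download.file(url, destfile, mode = "wb", quiet = TRUE)
    }
  }
  readRDS(destfile)
}
```

Because the file lands in tempdir(), repeated calls within one session reuse the download, and nothing persists on the user's machine without permission.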
you can cache data locally in an application directory, as long as the user is asked for permission; as for how to make the data available and download it, as mentioned earlier in this thread there are many options. Useful packages: rappdirs, hoardr. There is also a newer base R function, tools::R_user_dir() (available from R 4.0; for older R you would need the backports package).
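The permission-aware caching could be sketched like this with tools::R_user_dir() (R >= 4.0); "mypackage" is a hypothetical package name, and the fallback to tempdir() keeps things CRAN-compliant when the user declines:

```r
# Return a directory for cached data, asking the user before creating
# a persistent one; decline (or non-interactive use) falls back to tempdir().
cache_dir <- function(ask = interactive()) {
  dir <- tools::R_user_dir("mypackage", which = "cache")
  if (!dir.exists(dir)) {
    ok <- !ask || isTRUE(utils::askYesNo(
      sprintf("May mypackage cache downloaded data in %s?", dir)))
    if (!ok) return(tempdir())  # session-only storage, no permission needed
    dir.create(dir, recursive = TRUE)
  }
  dir
}
```

Your download function can then write into cache_dir() instead of a hard-coded path, so the data survives across sessions only when the user has agreed.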