Proper file management in package (textdata)

I have been working on a package, textdata, whose goal is to let datasets be downloaded, stored on disk, and loaded as needed, instead of being bundled inside packages. This makes it possible to provide larger files that are not easily hosted on CRAN, and to deal with license issues such as this one.

The main idea is based on the way keras handles datasets, such as with the function keras::dataset_cifar10(). The functions will prompt with information about the dataset, including its size and license, so the user can make an informed decision about whether to download it.
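A minimal sketch of such a prompt, not the actual textdata code. The function name confirm_download() and its arguments are made up for illustration; the `ask` argument exists only so the logic can be tried without a console.

```r
# Hypothetical helper: show dataset details and ask for consent.
# `ask` defaults to readline(); in a non-interactive session readline()
# returns "" immediately, so the answer defaults to "no".
confirm_download <- function(name, size, license, ask = readline) {
  message("Dataset: ", name, "\nSize:    ", size, "\nLicense: ", license)
  answer <- ask("Download this dataset? [y/N] ")
  tolower(trimws(answer)) %in% c("y", "yes")
}
```

Defaulting to "no" on anything other than an explicit yes keeps the package from downloading without the user's agreement.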

Most of the magic happens in load_dataset, which checks whether the file has already been downloaded. If it has, the file is loaded; if it hasn't, it is downloaded first.
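Roughly, the check-then-download logic looks like the sketch below. This is not the real load_dataset; the name load_or_download(), the `fetch` argument, and the assumption that the dataset is stored as an .rds file are all illustrative.

```r
# Hypothetical sketch of "load if present, otherwise download first".
# `fetch` stands in for whatever performs the download; it must write
# the file to `path`. Assumes the dataset is a single .rds file.
load_or_download <- function(path, url, fetch = utils::download.file) {
  if (!file.exists(path)) {
    # Create the parent folder(s) before writing into them.
    dir.create(dirname(path), recursive = TRUE, showWarnings = FALSE)
    fetch(url, path, mode = "wb")  # binary mode matters on Windows
  }
  readRDS(path)
}
```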

I have two main problems.

  1. I need help making sure the path creation is done correctly (it works on my macOS machine, but I don't have the knowledge or means to test on other operating systems). Specifically, I'm worried about these two lines.
  2. I would like the user to have access to information about where datasets have been saved on their computer. textdata allows the user to change the directory where the datasets are stored, and it would be nice to be able to do a full deletion of all datasets if you want to "uninstall".

CC @Max, @julia

For the first problem, I suggest you use the fs package, since this package contains file system functions that:

  • are internally consistent (function names, arguments, etc)
  • behave the same across platforms

You can get the fs package from CRAN, and the development version at https://github.com/r-lib/fs

The two lines in question are:

name_path <- paste0(dir, data_name, "/", name, collapse = "")
folder_path <- paste0(dir, data_name, "/", collapse = "")

Using fs, you can write these as:

name_path <- fs::path(dir, data_name, name)
folder_path <- fs::path(dir, data_name)
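If adding a dependency is not an option, base R's file.path() also inserts the separators for you. The sketch below shows why the paste0() version is fragile: it only works when dir already ends in a slash. The values of dir, data_name and name are made up for illustration.

```r
dir <- "~/.textdata"     # note: no trailing slash
data_name <- "imdb"      # hypothetical dataset name
name <- "train.rds"      # hypothetical file name

# paste0() glues the pieces together as-is, so the separator between
# dir and data_name is lost.
fragile_path <- paste0(dir, data_name, "/", name)

# file.path() puts "/" between every component, on every platform.
folder_path <- file.path(dir, data_name)
name_path <- file.path(dir, data_name, name)
```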

Your second question asks about deletion of files. With fs you can use one of the dir_ functions, e.g.

fs::dir_delete(...)
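As a rough sketch of how the "where is my data" and "uninstall" parts could look with fs: the helper names list_datasets() and remove_datasets() are hypothetical, and `dir` is whatever directory the user configured.

```r
# Hypothetical helper: list every downloaded file under `dir`.
list_datasets <- function(dir) {
  if (!fs::dir_exists(dir)) return(character())
  as.character(fs::dir_ls(dir, recurse = TRUE, type = "file"))
}

# Hypothetical helper: delete the whole data directory and its contents.
remove_datasets <- function(dir) {
  if (fs::dir_exists(dir)) fs::dir_delete(dir)
  invisible(dir)
}
```

Since the deletion is recursive and cannot be undone, it is probably worth showing the user the output of list_datasets() and asking for confirmation first.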

For a full comparison of the fs functions and base R functions, refer to the vignette at https://fs.r-lib.org/articles/function-comparisons.html

In terms of where to actually store the data, I agree that you should make this user-configurable, but you should also provide sensible OS-specific defaults.

Currently you are using "~/.textdata", which isn't unreasonable and certainly has plenty of precedent. That said, Linux, macOS, and Windows all have recommended locations for user data that differ from this. This is abstracted for you in the rappdirs package (https://cran.r-project.org/web/packages/rappdirs/index.html). For example, if you used the user_cache_dir() function, you'd get the following on various operating systems:

• Mac OS X: ‘~/Library/Caches/’
• Unix: ‘~/.cache/’, $XDG_CACHE_HOME if defined
• Win XP: ‘C:\Documents and Settings\<username>\Local Settings\Application Data\<AppAuthor>\<AppName>\Cache’
• Vista: ‘C:\Users\<username>\AppData\Local\<AppAuthor>\<AppName>\Cache’
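One way to combine a user-configurable location with an OS-specific default is sketched below. The function name textdata_dir() and the option name "textdata.dir" are assumptions for illustration, not something textdata currently defines.

```r
# Hypothetical helper: use the directory the user set via
# options(textdata.dir = ...), otherwise fall back to the
# platform-specific cache directory suggested by rappdirs.
textdata_dir <- function() {
  getOption("textdata.dir", default = rappdirs::user_cache_dir("textdata"))
}
```

A user who wants the data somewhere else can then call options(textdata.dir = "path/of/their/choice") once, for example in their .Rprofile.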

When I was researching a similar problem (datasets too big to fit in a CRAN-sized package), I ended up loading them via functions from an AWS S3 bucket (fronted by a CloudFront cache) and caching them locally in tempdir(). tempdir() provides a useful abstraction from OS specifics, but it lasts only for the duration of the R session.
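A minimal sketch of the session-only cache location, with a made-up helper name session_cache_path() and a made-up "textdata" subfolder:

```r
# Hypothetical helper: build a path inside the per-session temp directory.
# Everything under tempdir() is removed when the R session ends.
session_cache_path <- function(name) {
  cache_dir <- file.path(tempdir(), "textdata")
  dir.create(cache_dir, showWarnings = FALSE)
  file.path(cache_dir, name)
}
```

The trade-off is that each new R session has to download the data again.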

The reason was that the CRAN repository policy discourages writing to the user's filespace (with a possible exception in an interactive session, based on user agreement, but I did not want to get into discussing this with the CRAN maintainers).

Also note that the CRAN policy expects you to download additional data via https, not http.

Thank you, everyone!
I put everything to use.