How to manage common data files between multiple projects in R?


#1

So here is my case. I’ve been searching SO for several days now, but I haven’t found a proper answer.

TLDR: How to manage projects with “external” datasets? Not possible to use in-project datasets due to size + other things.

I usually follow the project “workflow” system, storing data, results and figures in the current project directory, more or less as R4DS suggests here and as many other people do. However, this only works well for small projects with small data files (at least in my case).

I find it very difficult to maintain such projects when the data…

  1. Is really big and shared between different projects (it is not efficient to store a copy in each project).
  2. Is on a different drive with more storage (code on the SSD, data on the HDD).
  3. Could be sensitive, so I want to keep it outside of a git project.
  4. I would also like to be able to define multiple “recipes” of datasets. Something like: use these data and run the analysis, check the figures, do the same with other data, check the figures, then reproduce the first figure with no pain.
  5. This is maybe the most complex thing, but it would be cool if the recipe definition could also say how to load each dataset, so that I can load multiple datasets using different approaches/functions/arguments (let’s say .csv and .tsv loaded with different arguments in read_delim()).
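A minimal sketch of what points 4 and 5 might look like in base R: a named list of "recipes", each holding a file name and its own loader function. Everything here is hypothetical (the `DATA_ROOT` environment variable, the recipe names, the file names), and base `read.table()` is used only to keep the sketch dependency-free; `readr::read_delim()` could be swapped in.

```r
# Hypothetical: the shared data location comes from an environment variable,
# so no absolute path is committed to the git project.
data_root <- Sys.getenv("DATA_ROOT", unset = NA)

# Hypothetical recipe registry: each dataset names its file and how to read it.
recipes <- list(
  experiment_a = list(
    file   = "experiment_a.csv",
    loader = function(path) utils::read.table(path, header = TRUE, sep = ",")
  ),
  experiment_b = list(
    file   = "experiment_b.tsv",
    # This file has one extra line above the header, so it needs skip = 1.
    loader = function(path) utils::read.table(path, header = TRUE, sep = "\t", skip = 1)
  )
)

# Look up a recipe by name, build the full path, and call its loader.
load_dataset <- function(name, root = data_root) {
  recipe <- recipes[[name]]
  if (is.null(recipe)) stop("Unknown dataset recipe: ", name)
  if (is.na(root)) stop("No data root given and DATA_ROOT is not set")
  path <- file.path(root, recipe$file)
  if (!file.exists(path)) stop("Data file not found: ", path)
  recipe$loader(path)
}

# Demo with throwaway files in a temporary directory standing in for the HDD.
demo_root <- tempfile("data_root_")
dir.create(demo_root)
writeLines(c("id,value", "1,10", "2,20"),
           file.path(demo_root, "experiment_a.csv"))
writeLines(c("exported by instrument", "id\tvalue", "1\t30"),
           file.path(demo_root, "experiment_b.tsv"))

a <- load_dataset("experiment_a", root = demo_root)
b <- load_dataset("experiment_b", root = demo_root)
```

Switching an analysis to other data then only means changing the recipe name passed to `load_dataset()`, while the loading details stay in one place.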

I understand an easy solution would be to move to smaller projects that use intermediate data from another smaller project, and so on. I am currently trying to do that, but I will eventually reach a point where I still need raw files directly from the disk. Plus, some of my “analysis pipelines” are in the very early stages, so there is a lot of development and therefore instability in most parts of the code.

So, my question is: has anyone found themselves in a similar situation? Do you have a magic solution, or at least some kind of workflow to manage these problems?

PS: The first thing I thought of was soft links to the data (still considering it). That works well for 1, 2 and 3; however, I need compatibility between Windows and Linux/OSX, and I am not sure how that would work at all. I guess I could somehow recreate the links on each platform.
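One way the links might be recreated per platform from R itself, as a rough sketch: base R has `file.symlink()`, and on Windows builds `Sys.junction()` for directory junctions. The directory names below are hypothetical and live in `tempdir()` purely for illustration; symlink creation on Windows can require extra privileges, which is why the sketch assumes a junction there.

```r
# Hypothetical locations: a shared data directory outside the project and
# the link inside the project that should point to it.
shared_dir <- file.path(tempdir(), "shared_data")
link_path  <- file.path(tempdir(), "project_data_link")

dir.create(shared_dir, showWarnings = FALSE)
writeLines("id,value", file.path(shared_dir, "raw.csv"))

# Create the link once; do nothing if it is already there.
link_data_dir <- function(target, link) {
  if (file.exists(link)) return(invisible(TRUE))
  ok <- if (.Platform$OS.type == "windows") {
    # Assumption: a directory junction is acceptable on Windows.
    Sys.junction(target, link)
  } else {
    file.symlink(target, link)
  }
  if (!isTRUE(ok)) stop("Could not create link: ", link)
  invisible(TRUE)
}

link_data_dir(shared_dir, link_path)

# The project can now read through the link as if the data were local.
linked_file <- file.path(link_path, "raw.csv")
```

The link itself would stay out of git (for example via `.gitignore`), and each machine would run the setup function once with its own target.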

PS2: My second option was to store absolute file paths in JSON format. This also solves 1, 2 and 3, and likely 4 (and 5 too if implemented well, with a bit of work). There are many things I don’t like about this solution though, such as the use of absolute paths (a link is similar in the sense that if you move the original data file you need to remap all the links). Compatibility is still an issue here, but I find it easy to migrate: using something like file.path() I could change only the drive part of the path and keep the rest as it is.
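A sketch of the JSON idea with the drive part separated from the rest, so that only the root differs per platform. The JSON layout, the root paths and the dataset names are all made up for illustration; the only library calls assumed are `jsonlite::fromJSON()` and base `file.path()`.

```r
library(jsonlite)

# Hypothetical config: one data root per operating system, plus dataset
# paths written relative to that root.
config_json <- '{
  "roots": {
    "Windows": "D:/data",
    "Linux": "/mnt/hdd/data",
    "Darwin": "/Volumes/hdd/data"
  },
  "datasets": {
    "raw_counts": "experiment_a/counts.csv",
    "samples": "experiment_a/samples.tsv"
  }
}'
config <- fromJSON(config_json)

# Combine the root for the current OS with the dataset's relative path.
dataset_path <- function(name, config, os = Sys.info()[["sysname"]]) {
  root <- config$roots[[os]]
  rel  <- config$datasets[[name]]
  if (is.null(root)) stop("No data root configured for: ", os)
  if (is.null(rel)) stop("Unknown dataset: ", name)
  file.path(root, rel)
}

linux_path   <- dataset_path("raw_counts", config, os = "Linux")
windows_path <- dataset_path("raw_counts", config, os = "Windows")
```

Moving the data to another drive then means editing one root entry rather than every path in the file.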


#2

If you want to be able to share a repeatable, well-defined, platform-agnostic environment, then I would strongly encourage you to look at Docker.

With regard to where your data is located, you can use the here package to help ease file path issues.

Otherwise, if the data is truly large, then I’d recommend hosting it on a provider like data.world, where you can choose to host it publicly or privately.

I hope this helps.


#3

My suggestion is to consider offloading your data to a database server.

You can connect to it via DBI / dplyr from different projects (there are a ton of walkthroughs on the net; this one is my favorite: https://db.rstudio.com/) and you can even offload some operations to the database. It will stay persistent across all your projects and can be accessed by other users.
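A small sketch of that workflow, using an in-memory SQLite database as a stand-in so it runs without a server (it assumes the DBI, RSQLite, dplyr and dbplyr packages are installed). The table and column names are invented. For a hosted Postgres instance the connection line would presumably use `RPostgres::Postgres()` with your own host and credentials instead.

```r
library(DBI)
library(dplyr)

# Stand-in for a shared database server: an in-memory SQLite database.
con <- dbConnect(RSQLite::SQLite(), ":memory:")

# Hypothetical table that several projects would query.
dbWriteTable(con, "measurements",
             data.frame(sample = c("a", "a", "b"),
                        value  = c(1.5, 2.5, 4)))

# The grouping and averaging are translated to SQL and run in the database;
# only the small summary is pulled into R by collect().
summary_tbl <- tbl(con, "measurements") %>%
  group_by(sample) %>%
  summarise(mean_value = mean(value, na.rm = TRUE)) %>%
  arrange(sample) %>%
  collect()

dbDisconnect(con)
```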

My personal favorite is Postgres hosted on AWS, but there are other options. They need not be expensive (most of the providers have “free tier” options).


#4

I could not agree more with this advice.

Depending on your use case, you could also use a caching mechanism such as storr.
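A minimal sketch of caching an expensive load with storr, assuming an on-disk RDS store. The cache directory, the key and the `slow_loader()` function are placeholders for illustration.

```r
library(storr)

# Hypothetical cache location; in practice this could sit on the data drive.
cache_dir <- file.path(tempdir(), "analysis_cache")
st <- storr_rds(cache_dir)

# Return the cached value for `key`, computing and storing it on first use.
load_cached <- function(key, loader) {
  if (!st$exists(key)) {
    st$set(key, loader())
  }
  st$get(key)
}

# Placeholder for an expensive read of a large raw file.
slow_loader <- function() {
  Sys.sleep(0.2)
  data.frame(id = 1:3, value = c(10, 20, 30))
}

first  <- load_cached("raw_counts", slow_loader)  # runs slow_loader()
second <- load_cached("raw_counts", slow_loader)  # served from the cache
```

Because the store lives on disk, a second project pointing at the same directory would see the same cached objects.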