So here is my case. I've been searching SO for several days now, but I can't find a proper answer.
TLDR: How do I manage projects with "external" datasets? Keeping the datasets inside the project is not possible due to their size, among other things.
I usually follow the project "workflow" system, storing data, results and figures in the current project directory, more or less as R4DS suggests here and as many other people do. However, this only works well for small projects with small data files (at least in my case).
I find it very difficult to maintain such projects when the data:
1. Is really big and shared between different projects (it is not efficient to store a copy in each project).
2. Lives on a different drive with more capacity (code stored on an SSD, data on an HDD).
3. Could be sensitive, so I want to keep it outside of the git project.
4. I would also like the ability to somehow define multiple "recipes" of datasets. Something like: use this data and run the analysis, check the figures, do the same with other data, see the figures, and reproduce the first figures with no pain.
5. This is maybe the most complex one, but it would be cool if each recipe definition also stated how to load its dataset, so I could load multiple datasets using different approaches/functions/arguments (let's say a .csv and a .tsv loaded with different readers and arguments).
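To make points 4 and 5 concrete, here is a minimal sketch in base R of what such a "recipe" could look like: a named list mapping each run to a file path, a reader function, and reader arguments. The recipe structure, names, and the load_recipe helper are my own invention, not an established convention; the temp files only make the example self-contained, and in practice the paths would point at the shared drive.

```r
# Self-contained illustration: write two small files in different formats.
csv_path <- tempfile(fileext = ".csv")
tsv_path <- tempfile(fileext = ".tsv")
write.csv(data.frame(x = 1:3, y = c("a", "b", "c")), csv_path, row.names = FALSE)
write.table(data.frame(x = 4:6, y = c("d", "e", "f")), tsv_path,
            sep = "\t", row.names = FALSE, quote = FALSE)

# Each recipe bundles: where the data lives, how to read it, and with what arguments.
recipes <- list(
  run1 = list(path = csv_path, reader = read.csv,   args = list(stringsAsFactors = FALSE)),
  run2 = list(path = tsv_path, reader = read.delim, args = list(stringsAsFactors = FALSE))
)

# Load any dataset by recipe name, dispatching to its declared reader.
load_recipe <- function(name, recipes) {
  r <- recipes[[name]]
  do.call(r$reader, c(list(r$path), r$args))
}

d1 <- load_recipe("run1", recipes)
d2 <- load_recipe("run2", recipes)
```

Rerunning the same analysis on other data then only means pointing the script at a different recipe name, which is roughly the "reproduce the first figures with no pain" part.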
I understand an easy solution would be to split the work into smaller projects, each using intermediate data produced by another smaller project, and so on. I am currently trying to do that, but I still reach a point where I need the raw files directly from the disk. Also, some of my "analysis pipelines" are at a very early stage, so there is a lot of development going on and most of the code is unstable.
So, my question is: has anyone found themselves in a similar situation? Do you have a magic solution, or at least some kind of workflow, to manage these problems?
PS: The first thing I thought of was data soft links (still considering it). They work well for 1, 2 and 3; however, I need compatibility between Windows and Linux/macOS, and I am not sure how well that would work. I guess I could somehow recreate the links on each platform.
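For the soft-link idea, a small sketch of recreating links from R itself, which sidesteps some of the cross-platform issue. file.symlink() and Sys.junction() are real base-R functions (the latter exists only on Windows builds, where junctions work for directories without admin rights); the fallback logic below is just an assumption about how one might combine them, not a tested cross-platform recipe.

```r
# Create a project-local link to shared data. On Windows, plain symlinks
# may require elevated privileges, so fall back to an NTFS junction.
link_data <- function(target, link) {
  ok <- isTRUE(suppressWarnings(file.symlink(target, link)))
  if (!ok && .Platform$OS.type == "windows" && exists("Sys.junction")) {
    ok <- isTRUE(Sys.junction(target, link))  # directories only, no admin needed
  }
  ok
}

# Illustration with a temp directory standing in for the shared data folder.
shared_dir <- tempfile("shared_data_")
dir.create(shared_dir)
local_link <- tempfile("data_link_")
linked <- link_data(shared_dir, local_link)
```

A small setup script calling something like this could rebuild the links on each machine, so the links themselves never need to be committed or copied.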
PS2: My second option was to store absolute file paths in JSON format. That also solves 1, 2 and 3, and likely 4 (and 5 too if well implemented, but with a bit of work). There are several things I don't like about this solution, though, such as the use of absolute paths (a link has a similar problem, in the sense that if you move the original data file you need to remap all the links). The cross-platform issue remains in this case too, but I find it easier to migrate (using something like file.path() I could change only the drive part of each path and keep the rest as it is).
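A sketch of that drive-remapping idea using only base R (the config shape and paths are made up for illustration; in practice the config would live in a JSON or YAML file read with a package like jsonlite, but a plain list shows the mechanics):

```r
# Per-platform roots; only this part differs between machines.
config <- list(
  windows = "D:/data",
  unix    = "/mnt/hdd/data"
)
root <- if (.Platform$OS.type == "windows") config$windows else config$unix

# The relative part of each path is shared across all machines,
# so only the root ever needs remapping.
dataset_path <- file.path(root, "projectA", "raw", "big_file.csv")
```

Since file.path() always joins with "/", which Windows accepts too, the resulting paths stay valid on every platform without any string surgery beyond swapping the root.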