Where to put a downloaded data file for external data

Hi,

I am currently trying to build an internal package for our lab. This includes creating an internal lookup table from three TSV files that I had to obtain manually. There is really no easy way to get that data, so I would like to store the files in the repo, create a script in data-raw/ that generates the internal data file, and work from there. However, I am a bit confused about where to put the raw data files themselves.

One idea is to put the files in a separate raw/ directory and ignore it in .Rbuildignore. In that case, I am unsure whether reading the data from there in an ordinary script follows best practices.

This is quite confusing for me and I want to build good habits from the start. Any help will be appreciated.

When looking for best practices the Data chapter of R Packages is as close to truth revealed as it gets; I recommend you give it a read (it is not too long).

The key questions you need to answer for yourself are:

  • do you intend to distribute the data as a TSV file or as an R object? I suggest R objects
  • do you intend to let users see the data in their environment pane, or will it only be used internally by the functions of your package?

My personal preference is to keep lookup tables internal, so I use something along the following lines in a script such as data-setup.R. This file would live in the data-raw directory and be sourced manually before building the package.

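# Read the raw TSV file (placeholder path)
lookup_table <- readr::read_tsv("./somewhere-on-filesystem/table.tsv")

# Store it as internal data (written to R/sysdata.rda)
usethis::use_data(lookup_table,
                  internal = TRUE,
                  overwrite = TRUE)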

Note that the preference for internal datasets is a personal one. You can omit internal = TRUE and let your users access the tables directly; they may need to call data(lookup_table) first, depending on the LazyData setting in your DESCRIPTION file.
If you decide not to make your data internal, you should seriously consider documenting it, though.
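
Since the question mentions three TSV files, here is a minimal sketch of how a data-raw script might combine them into one table. The file names, the columns (id, label) and the temporary directory are made up for illustration; it also assumes all three files share the same columns.

```r
# Hypothetical stand-ins for the three hand-collected TSV files.
# In a real package they would already sit in data-raw/.
raw_dir <- tempfile("data-raw-")
dir.create(raw_dir)
writeLines(c("id\tlabel", "a1\talpha"), file.path(raw_dir, "part1.tsv"))
writeLines(c("id\tlabel", "b2\tbeta"), file.path(raw_dir, "part2.tsv"))
writeLines(c("id\tlabel", "c3\tgamma"), file.path(raw_dir, "part3.tsv"))

# Read every TSV file in the directory and stack the rows.
tsv_files <- sort(list.files(raw_dir, pattern = "\\.tsv$", full.names = TRUE))
parts <- lapply(tsv_files, read.delim, stringsAsFactors = FALSE)
lookup_table <- do.call(rbind, parts)

# In the package itself, the script would finish with:
# usethis::use_data(lookup_table, internal = TRUE, overwrite = TRUE)
```

The sketch uses base R's read.delim() rather than readr::read_tsv() only to stay free of dependencies; either should work for plain tab-separated files.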

You are welcome!

And yes, your raw TSV file should definitely be 1) version controlled and 2) build ignored.

Apart from these two points there are no hard-and-fast rules here, but my recommendation would be to have both the lookup table (as TSV) and the R script that calls use_data() live in a data-raw directory, which in turn would be listed in .Rbuildignore.

Having raw data in data-raw is not a requirement but merely a convention. But if I were in your place this is what I would do.
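
To make the "build ignored" part concrete: each line of .Rbuildignore is a Perl-style regular expression matched case-insensitively against paths relative to the package root. The snippet below is only a rough illustration of that matching, not what R CMD build actually runs, and the pattern and path lists are examples I made up. As far as I know, usethis::use_build_ignore("data-raw") writes an anchored pattern of this form for you.

```r
# Example patterns as they might appear in .Rbuildignore (one per line there).
patterns <- c("^data-raw$", "^raw$")

# Example top-level paths in a package.
paths <- c("data-raw", "raw", "R", "DESCRIPTION")

# A path is ignored if at least one pattern matches it.
is_ignored <- vapply(
  paths,
  function(path) {
    any(vapply(patterns, grepl, logical(1),
               x = path, perl = TRUE, ignore.case = TRUE))
  },
  logical(1)
)

print(is_ignored)
```

The anchors (^ and $) matter: an unanchored pattern such as raw would also match data-raw and anything else containing those letters.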

Thank you for the reply. Reading my question again, I realise I was terribly unclear. However, your answer is definitely helpful.

Thank you for linking the R Packages book. I should have mentioned that my confusion did come from there.

Since it is a lookup table, it needs to be internal. It is of no direct use to the end-user. That's precisely why we are building the package.
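
For reference, a rough sketch of what using an internal table looks like from inside the package. Objects stored in R/sysdata.rda are visible to the package's own functions by name, so no data() call is needed. The table contents and the helper lookup_label() below are hypothetical; the data frame is defined inline only so the snippet runs on its own.

```r
# Stand-in for the object that would come from R/sysdata.rda.
lookup_table <- data.frame(
  id = c("a1", "b2"),
  label = c("alpha", "beta"),
  stringsAsFactors = FALSE
)

# Hypothetical package function that uses the internal table.
# Unknown ids yield NA rather than an error.
lookup_label <- function(ids) {
  lookup_table$label[match(ids, lookup_table$id)]
}

lookup_label(c("b2", "zz"))
```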

My problem is related to source control, as I would like to check that dataset into the git repo. My solution for now is to have a separate raw/ directory that is added to .Rbuildignore. Thank you.