Data Repositories


#1

I seem to remember reading about a repository of datasets that R users might find helpful for building apps or as learning tools.

Is there such a repository? I might occasionally have data to add to it.


#2

I don’t know of a central data repository, but you can create data packages and upload them to CRAN. The NHLData package is one example: it is a data package containing scores from every game between 1917 (the league's start) and the end of the 2015-2016 season.

They are relatively easy to create, so I would suggest that.
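For anyone wondering what "relatively easy" looks like, here is a minimal sketch of a data-only package skeleton using base R only. The package name `mydatapkg`, the `game_scores` data frame, and its values are all made-up placeholders, not taken from NHLData or any real package.

```r
# Minimal sketch of a data-only package skeleton, built in a temp directory.
# The package name, data set, and values are hypothetical placeholders.
pkg_dir <- file.path(tempdir(), "mydatapkg")
dir.create(file.path(pkg_dir, "data"), recursive = TRUE, showWarnings = FALSE)

# Made-up rows standing in for whatever data you want to share.
game_scores <- data.frame(
  season     = c("2015-2016", "2015-2016"),
  home_goals = c(3L, 1L),
  away_goals = c(2L, 4L)
)

# Data sets live in data/ as .rda files, one object per file by convention.
save(game_scores, file = file.path(pkg_dir, "data", "game_scores.rda"))

# DESCRIPTION holds the package metadata; LazyData makes the data set
# available by name once the package is attached.
writeLines(c(
  "Package: mydatapkg",
  "Title: Example Data Package",
  "Version: 0.0.1",
  "Description: A hypothetical package that only ships a data set.",
  "License: CC0",
  "LazyData: true"
), file.path(pkg_dir, "DESCRIPTION"))

# An empty NAMESPACE is enough because no functions are exported.
invisible(file.create(file.path(pkg_dir, "NAMESPACE")))

# Show the files that make up the skeleton.
list.files(pkg_dir, recursive = TRUE)
```

A CRAN submission would need more than this skeleton, for example documentation for each data set and author details in DESCRIPTION, so treat it as a starting point only.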


#3

The datasets package has more than one might imagine.
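A quick way to see what is in there, as a small sketch using only base R (`available` is just an illustrative variable name):

```r
# List every data set shipped with the base 'datasets' package.
available <- data(package = "datasets")$results

# Number of entries, then a peek at their names and one-line titles.
nrow(available)
head(available[, c("Item", "Title")])

# These data sets are lazy-loaded, so they can be used by name directly.
str(mtcars)
```

Swapping `"datasets"` for the name of any other installed package lists the data that package ships.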

I thought I remembered the Awesome R Data Packages section listing several more packages than it does (it only has two, engsoccerdata and gapminder). There are definitely many more than that (including Hadley’s 4 here: https://blog.rstudio.com/2014/07/23/new-data-packages/), so it could be a 👍 repo to contribute to if we round some up here.

Another neat view of datasets in various R packages here, too:
https://vincentarelbundock.github.io/Rdatasets/datasets.html


#4

I think that is a great idea!

I am exploring using BigQuery, Google’s serverless database, as a general data repository for a number of reasons:

  • It has a UI from which data can be stored or queried
  • Very fast: joining two PubChem files (100 million chemical structures and 70 million names) took less than 3 minutes without having to define an index
  • Very cheap. There is no fee for the server it is hosted on; instead there is a small fee for storing data (10 GB free, $0.02 for each additional GB, i.e. roughly $20 per month for 1 TB) and a fee for querying the data (1 TB free, $5 per additional TB)
  • It has a REST API and many clients, including one for R
  • Metadata can be used to describe the dataset
  • All datasets can be referenced with a unique URL
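To make the pricing point concrete, here is a rough sketch of the arithmetic in R. The function names are hypothetical, and the rates are simply the figures quoted in the list above, so treat them as a snapshot rather than current pricing.

```r
# Rates below are the ones quoted in this post, not verified current prices.

# Monthly storage cost in USD: the first `free_gb` gigabytes cost nothing.
storage_cost <- function(gb_stored, free_gb = 10, usd_per_gb = 0.02) {
  max(gb_stored - free_gb, 0) * usd_per_gb
}

# Query cost in USD: the first `free_tb` terabytes scanned cost nothing.
query_cost <- function(tb_scanned, free_tb = 1, usd_per_tb = 5) {
  max(tb_scanned - free_tb, 0) * usd_per_tb
}

# Storing 1 TB (taken as 1000 GB): (1000 - 10) * 0.02 USD per month.
storage_cost(1000)

# Scanning 3 TB of data: (3 - 1) * 5 USD.
query_cost(3)
```

With the free tier subtracted, 1 TB of storage comes out slightly under the round $20 figure, and anything that fits within the free allowances costs nothing.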

My initial code, available here, allows uploading, but at the moment there is nothing for searching or browsing.