Best database to work with R and Rstudio

database
sql
sparklyr

#1

I've been using R with RStudio for mac already for a while, and I wonder if there're any better ways of storing data than just having csv files lying around. I have a rather large dataset (3 million rows with 15 variables per quarter), spanning for 20 years.

Now with data.table, it's not that slow to have everything in csv, but I still need to import each file every time I need to query data. Meaning that when doing a temporal analysis things can get messy memory wise.

I thought of SQL, as it is the most known/used database, but I wonder if there's something simpler to use and quicker to interact with when using R. Probably something which uses Spark?


#2

You can use sparklyr to handle out-of-memory data. As for storage purpose, feather and fst formats are quite efficient in terms of being imported into R. If you really want to put your data in SQL database, take a look at Postgresql.


#3

It depends on what you need to do. But when I need to deal with lots of smaller files and don't want to deal with a database, I really like the feather package and format.

"Feather: A Fast On-Disk Format for Data Frames for R and Python, powered by Apache Arrow" Hadley Wickham 2016-03-29


#4

Now I have 80 csv files, organised per quarter. The idea would be to have it in a format, which would allow me querying the database using Spark or even dplyr, without having to load each file into the memory.
It works fine for a couple of quarters, but when looking at a 10 year period, it starts to get messy.


#5

sergeant :package: can be of help here

It brings Apache Drill to R. It allows to use SQL to deal with very files that do not fit in memory.

There is this topic about out-of-memory that could help you. It mentions sergeant too.


#6

I'm new to databases and have recently been playing around with RSQLite and have been pleasantly surprised at the ease of use in creating and using with dplyr.
Resources about that and others:
https://db.rstudio.com/databases/sqlite/
https://db.rstudio.com/dplyr/#connecting-to-the-database