Natural DB design/ choice with R

databases

#1

There are a number of powerful interfaces allowing R to interface with major database types, but has anyone written in an opinionated way about which DBs are more or less natural to use with R, maybe in a comparative way?

The question is general, motivated by a specific use-case: A small organization which owns some internet-connected industrial equipment is interested in some visualization/ dashboarding, w data from many pieces of equipment, updated at high freq time intervals (data available 10-100x per second). Therefore, I have not so wide (about 150 - 200 pieces of equipment, ie variables), but really long data (~300k observations per day).

For now, the data is nice and rectangular- I could just use csvs, though I’m intrigued with the fst and feather packages. I’m worried, however, that I need to take a longer view and anticipate growth in the system, and perhaps try to do this in a more “mature” way.

Then again, maybe simpler is better, and I should avoid the overhead of a a DB data repo.

More context: I’m an intermediate R user, and love to do analysis with the many powerful tidyverse tools, and sometimes Shiny, but I know little beyond the most basic SQL commands, and almost nothing about DB admin.

Any ideas are appreciated.


#2

This is a great question. I have to admit that I am quite biased on this topic, being an avid Postgres user. When I need a database, though, I find myself reaching for either SQLite or PostgreSQL, depending on the situation. SQLite resides in a single file and is by far the easiest to set up, but PostgreSQL is not that much harder (in my experience), has fantastic documentation, and is enormously powerful when necessary. PostgreSQL also has support for windowing functions and CTEs, which can be useful for more advanced querying and analysis. Plus, both are open source products.

Both also have solid database connectors, although the semantics between the two can vary at times (even though both use DBI). Also, the future connector for Postgres (RPostgres) is not on CRAN yet, but RPostgreSQL can normally get the job done.

Lastly, an article from the tidyverse that I find very helpful is here. At the very end is a tidbit on this topic in particular (which happens to agree with my bias :smile:). Lots more information there to enjoy though!

If you don’t already have a database, here’s some advice from my experiences setting up and running all of them. SQLite is by far the easiest to get started with, but the lack of window functions makes it limited for data analysis. PostgreSQL is not too much harder to use and has a wide range of built-in functions. In my opinion, you shouldn’t bother with MySQL/MariaDB: it’s a pain to set up, the documentation is subpar, and its less featureful than Postgres. Google BigQuery might be a good fit if you have very large data, or if you’re willing to pay (a small amount of) money to someone who’ll look after your database.

All of these databases follow a client-server model - a computer that connects to the database and the computer that is running the database (the two may be one and the same but usually isn’t). Getting one of these databases up and running is beyond the scope of this article, but there are plenty of tutorials available on the web.

EDIT: On your particular use case, I would recommend PostgreSQL, as it would prepare you for increasing data volume and scaling. It’s easy to get started with, but you would have plenty of room to grow (many companies use Terabyte PostgreSQL databases with efficiency, although that would probably suggest requiring a dedicated DBA).


#3

I don’t have extensive experience in connecting R with databases, but I have had no problem in connecting R to an Oracle server and to a Hadoop cluster via Hive and Impala.

Therefore you should probably concentrate on the factors mentioned by @cole, i.e. the ease of setting up the database, its functionality and performance.