Best practices for internal package database API

Trowic · January 23, 2019, 5:30pm

Hi,

I am looking for advices to create some helpfull functions in a corporate internal package to connect and query a MS SQLServer database and a MySQL server.

I have read many times the website https://db.rstudio.com/ but i'm unable to figure out a good way to start.

Should i dive into OO programming and create a class that have methods to connect and query the db ?
Should i just write functions like :

my_conn <- function(db, user, pass){
  con <- DBI::dbConnect(odbc::odbc(),
                      Driver   = "[driver's name]",
                      Server   = "[server's path]",
                      Database = db,
                      UID      = user,
                      PWD      = pass,
                      Port     = 1433)
  con
}

How should i handle frequently used queries ? I have thought of functions like

my_super_query <- function(con, param1){
 airport <- dbSendQuery(con, "SELECT * FROM airports WHERE faa = ?")
 dbBind(airport, list(param1))
 dbFetch(airport)
}

and documenting the query in the function documentation, but is there better ways to do it ?

jdlong · January 23, 2019, 6:27pm

This is a really good question and you probably won't be surprised to learn the answer is, "Well, it depends!"

Your example for a my_conn function is good. For my internal use I added the expectation that user names and passwords for the DB would be stored in environment variables. You also may have to deal with schemas inside your connection, depending on your situation.

I notice you're not using dplyr syntax in your examples. You really should read up on dplyr in databases

In my use case, I have two big star schemas I use over and over. They have something like 10 joins in each schema. So have a function called get_warehouse_data(con, start_schema_name). The get_warehouse_data takes a connection, builds all the joins using dplyr syntax, does the joins and returns a tibble back that's a pointer to the database. Then I can take the resulting tibble and run it through dplyr functions to get it down to what I want. The nice thing about this is that no data is returned to R until the very end. And it allows you and your other users to stay in dplyr and not have to context switch between R and SQL.

Trowic · January 24, 2019, 8:49am

Thank you for your advice.

I am using dplyr, but not for databases because i couldn't until now (i am considering to switch from {RODBC} to {odbc}).
I am also wondering if it would be better to use parametrized queries or glue::glue_sql(). I think i should be consistent and not use both syntax.