Storing user database connection information for package use

odbc
recommendations

#1

I'm creating an R package where the user will first have to connect to a database using the DBI and odbc packages. Subsequent functions will then need to query the database and conduct some sort of analysis. So the workflow for the user is (1) connect to the database; (2) do something that depends on the connection, e.g.

connect_to_db()
analyze_using_db_connection()

I'm not sure of the best setup to make things easy on the user and myself, the developer. As I see it, here are my options:
(1) for every function in the package have a con argument where the user passes the connection (this seems less than ideal for anyone using the package)
(2) Use some sort of class system like R6 and store the connection as a data field. All subsequent functions (methods) that require the connection are member of the class (this seems like an overkill use of OOP)
(3) Have an argument for the connection in each function that needs the connection. If the connection is missing, have the function look for a connection con in the parent environment and check that con is the right type with each function call (I don't love this option either; it seems fragile to me)
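For concreteness, option (3) as I picture it would look roughly like this (just a sketch; `con` and `analyze_using_db_connection()` are the placeholder names from above):

```r
# Sketch of option (3): fall back to a `con` object in the caller's
# environment when no connection is passed explicitly.
analyze_using_db_connection <- function(con = NULL, ...) {
  if (is.null(con)) {
    # look for `con` in the calling environment
    caller <- parent.frame()
    if (!exists("con", envir = caller, inherits = TRUE)) {
      stop("No connection supplied and no `con` found; run connect_to_db() first.")
    }
    con <- get("con", envir = caller, inherits = TRUE)
  }
  if (!DBI::dbIsValid(con)) {
    stop("`con` is not a valid, open DBI connection.")
  }
  # ... query the database via DBI::dbGetQuery(con, ...) and analyze ...
}
```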

Or is there another preferred method when writing a package?

Any advice would be greatly appreciated


#2

I came here looking for an answer to more or less the same question. So, I don’t have an answer for you, but I can share what I’m currently doing to get your feedback and perhaps it helps you too.

Inspired by the Quandl package, which stores the API key in options via a package config function, I’ve done the following:

  • create a config function at the package level that connects to the DB using JDBC or ODBC. Call this something like pkgDbConnect().
  • this config function stores the connection object in options, where it can be accessed using getOption("pkgConn").
  • create another utility or config function called something like checkConn() that checks for the presence of this "pkgConn" option and throws an error if it is not present or is set incorrectly.
  • Now in your analyze_using_db_connection() function, call checkConn() at the start so an error is thrown and execution stopped if you’re not connected to the DB, and the analysis continues if the check passes.
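The steps above might be sketched like this (the names `pkgDbConnect()`, `checkConn()`, and `pkgConn` are just the placeholders from this post, and I’m using odbc here; swap in JDBC as needed):

```r
# Sketch: store the connection in options() at connect time
pkgDbConnect <- function(...) {
  con <- DBI::dbConnect(odbc::odbc(), ...)
  options(pkgConn = con)
  invisible(con)
}

# Check that a usable connection is present before doing any work
checkConn <- function() {
  con <- getOption("pkgConn")
  if (is.null(con) || !DBI::dbIsValid(con)) {
    stop("Not connected to the DB; run pkgDbConnect() first.")
  }
  con
}

analyze_using_db_connection <- function(...) {
  con <- checkConn()  # errors out if not connected
  # ... run queries against `con` and do the analysis ...
}
```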

How does that sound to you? The user would need to run the pkgDbConnect() after loading the package using library() before they can use any of the package functions that depend on the database. If they don’t do this, they’ll see a message telling them to connect to DB first.


#3

Another point of inspiration might be the tidycensus package, which stores a user’s api key. Unlike what you describe, the user does not have to reinvoke this each time they open the package - once they have set up the key they can use the package across multiple R sessions.
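If I remember right, the tidycensus pattern (via `census_api_key(key, install = TRUE)`) amounts to writing a line to the user’s `.Renviron` so the key survives across sessions. A simplified sketch of that idea (`set_pkg_key()` and `PKG_API_KEY` are made-up names):

```r
# Sketch: persist a key to ~/.Renviron so it is available in future sessions
set_pkg_key <- function(key) {
  renviron <- file.path(Sys.getenv("HOME"), ".Renviron")
  # append; a real implementation should check for / replace an existing entry
  cat(sprintf("PKG_API_KEY=%s\n", key), file = renviron, append = TRUE)
  # also set it for the current session
  Sys.setenv(PKG_API_KEY = key)
  invisible(key)
}

# Later, in any session:
# key <- Sys.getenv("PKG_API_KEY")
```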


#4

I've been struggling with this on an internal package at work. Here's what I've settled on:

Storing user credentials in environment variables, so that scripts can pull the username and password without ever storing those in the scripts themselves.

I then have the functions that set up the connections which the user has to assign to a connection. e.g.:

pwd <- Sys.getenv('YOUR_PWD')
uid <- Sys.getenv('YOUR_UID')

con <- connect_our_db(uid, pwd)

connect_our_db has logic to figure out if it's on win, osx, or linux and choose the driver appropriately. It reads username and pwd from environment variables and barfs with a meaningful message if they are not there.
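A sketch of what such a constructor might look like (the driver names, host, and database below are illustrative placeholders, not from the actual package):

```r
# Sketch: pick an ODBC driver by OS and read credentials from env vars
connect_our_db <- function(uid = Sys.getenv("YOUR_UID"),
                           pwd = Sys.getenv("YOUR_PWD")) {
  if (uid == "" || pwd == "") {
    stop("Set YOUR_UID and YOUR_PWD in your environment (e.g. ~/.Renviron).")
  }
  driver <- switch(Sys.info()[["sysname"]],
    Windows = "SQL Server",                                  # placeholder
    Darwin  = "/usr/local/lib/libtdsodbc.so",                # placeholder
    Linux   = "/usr/lib/x86_64-linux-gnu/odbc/libtdsodbc.so" # placeholder
  )
  DBI::dbConnect(odbc::odbc(),
                 Driver   = driver,
                 Server   = "our-db-host",   # placeholder host/db
                 Database = "our_db",
                 UID      = uid,
                 PWD      = pwd)
}
```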

then each function that uses a connection depends on the connection being passed to it. e.g.:

out_table <- my_magic_function(con, param)

When I initially started the project I was doing stuff like setting a con object in the global environment from within functions. Then I would call the global con object from within a function without passing the object as a parameter. This approach blew up in my face when I wanted to start using RStudio's new "source as job" feature. It did not play well with my non-functional style. I secretly knew that calling into the global environment from within functions was a bad idea, I had just never had it bite me. So I refactored my code so that functions ONLY ever interact with things explicitly passed to them. This made everything so much easier to maintain and debug.

so basically I'm saying it's a good idea to do what @c_12345 felt was less than ideal:

(1) for every function in the package have a con argument where the user passes the connection (this seems less than ideal for anyone using the package)

one problem you may run into is that your code is being used along with other code and other connections. It's a really fragile arrangement to make your code implicitly dependent on an object name in the global environment.


#5

The package I mentioned, tidycensus, stores the api keys in .Renviron.


#6

Yeah that makes total sense to me. I was responding to you but also referencing things said above you. That was bad thread hygiene on my part! Sorry for the confusion


#7

No worries! Just wanted to clarify (and good excuse to pull up the source for that function).

Do you have any security concerns re storing a username and password in .Renviron @jdlong?


#8

I manage this issue using the secret package, which uses public/private key encryption to store sensitive data. This is much better than storing the credentials in plain text.

The exact implementation depends a bit on your use case. If it's always the same database I would do something like this:

When you call your function, it looks in the secrets vault to see whether credentials are saved for the current user. If not, the function either prompts you to enter the credentials and stores them encrypted somewhere (where exactly again depends on your environment), or it aborts and asks you to call something like set_credentials_function().
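Something along these lines, based on my reading of the secret package (the vault path, the `db_creds` secret name, and the `get_db_credentials()` wrapper are all illustrative; `get_secret()` decrypts with the user's private key, which defaults to `~/.ssh/id_rsa`):

```r
# Sketch: fetch DB credentials from a secret-package vault,
# prompting for and storing them on first use.
get_db_credentials <- function(vault = "~/.pkg_vault") {
  tryCatch(
    # normal case: the secret already exists and is decrypted here
    secret::get_secret("db_creds", vault = vault),
    error = function(e) {
      # first use: ask for credentials and store them encrypted
      creds <- list(uid = readline("DB username: "),
                    pwd = readline("DB password: "))
      secret::add_secret("db_creds", creds,
                         users = secret::list_users(vault), vault = vault)
      creds
    }
  )
}
```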


#9

maybe I'm missing something here and you or @hoelk can set me straight if I am. But if an attacker can read my .Renviron then my system is pwned already. The exception would be if we're on a system where home directories are publicly readable. So I have always treated .Renviron as a safe place. What do you guys think?


#10

I don’t have an opinion really @jdlong - question was out of genuine curiosity. Your logic makes sense to me. It is interesting to know about the secret package though. Tucking that away in my “someday this could be useful” file. Thanks @hoelk!

FWIW I only have api keys in there that don’t have serious security implications, so it isn’t something I’m particularly worried about.


#11

hmm, if an attacker has read access to your home directory, and the private key as well as the secrets are stored there, I guess you could just as well have put them in plaintext.