Should I translate legacy SQL to R/dbplyr?

At my new organization, I've inherited a mountain of complex SQL queries that we use for regular (weekly/monthly) reporting, and I'm thinking about the value of translating these queries into dplyr/dbplyr syntax. Basically, is it worth it to completely rewrite these queries to make my R brain happy, or should I just tidy up the preexisting SQL a bit and be on my way?

My general workflow is something like:

  1. Run SQL query with specified parameters
  2. Aggregate/filter/reshape/visualize data using R
  3. Export tables/graphs/reports to PDF/HTML/PPT

Ideally, all of this is wrapped in a single R Markdown document.
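For concreteness, here's roughly what that looks like today as plain R. This is just a sketch: the DSN, table, and column names (`my_dsn`, `sales`, `region`, `amount`, `report_date`) are placeholders, and the `?` parameter syntax depends on the driver.

```r
library(DBI)
library(dplyr)
library(ggplot2)

con <- dbConnect(odbc::odbc(), dsn = "my_dsn")  # placeholder connection

# 1. Run the legacy SQL with a specified parameter
raw <- dbGetQuery(
  con,
  "SELECT region, amount, report_date FROM sales WHERE report_date >= ?",
  params = list("2019-01-01")
)

# 2. Aggregate/filter/reshape in R
monthly <- raw %>%
  mutate(month = format(as.Date(report_date), "%Y-%m")) %>%
  group_by(region, month) %>%
  summarise(total = sum(amount))

# 3. Visualize and export
p <- ggplot(monthly, aes(month, total, group = region, colour = region)) +
  geom_line()
ggsave("monthly_totals.pdf", p)

dbDisconnect(con)
```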

I'm leaning towards no on converting the queries to dplyr syntax, but I'd love input from others who have encountered similar scenarios.


If the SQL queries are very complex, you are going to spend too much time looking for workarounds to make them fit dbplyr syntax. Since you are already using .Rmd files, I think you would be better off leaving the SQL queries alone in their own sql chunks and spending your time on more fun and productive things.
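For example, something like this in the .Rmd keeps the existing SQL untouched and still hands the result to R as a data frame. The DSN, schema/table name, and `start_date` parameter below are just placeholders:

```{r setup-connection}
# Placeholder connection and parameter; adjust for your own database
library(DBI)
con <- dbConnect(odbc::odbc(), dsn = "my_dsn")
start_date <- "2019-01-01"
```

```{sql weekly-extract, connection=con, output.var="weekly_data"}
-- The existing SQL can be pasted here untouched;
-- ?start_date pulls in the R variable of the same name
SELECT *
FROM reporting.weekly_sales
WHERE report_date >= ?start_date
```

```{r wrangle}
# weekly_data is now an ordinary data frame, ready for dplyr/ggplot2
dplyr::glimpse(weekly_data)
```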


I'd leave it be, although I bet you really really want to scratch that itch.
Aside from @andresrcs's valid point about the time spent translating complicated code that already works (an "if it ain't broke, don't fix it" kind of thing, I guess), the other thing I'd note is that SQL is more universal than R in general and dbplyr in particular. By translating everything into dbplyr, you run the risk of ending up with a process that only a chosen few can follow.


I don't know the right answer for you, but I've been through some similar processes and can share a few observations that I've made.

The way I've incrementally ported SQL into dplyr is to first make sure I understand the logic of the SQL before changing anything. Then, if the SQL has natural logical breaks, I put each "related lump" of SQL in its own file and call it in a parameterized way using glue_sql from the glue package. That leaves most of my extraction code feeding into tibbles via SQL, and I write my reporting bits in R Markdown.
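A minimal sketch of that glue_sql pattern; the DSN, file name, columns, and parameters here are all made up:

```r
library(DBI)
library(glue)

con <- dbConnect(odbc::odbc(), dsn = "my_dsn")  # placeholder connection

# Each "related lump" of SQL lives in its own file with {placeholders},
# e.g. sql/monthly_sales.sql containing:
#   SELECT * FROM sales WHERE report_date BETWEEN {start_date} AND {end_date}
sql_text <- paste(readLines("sql/monthly_sales.sql"), collapse = "\n")

# glue_sql() interpolates and safely quotes the parameter values
query <- glue_sql(sql_text,
                  start_date = "2019-01-01",
                  end_date   = "2019-01-31",
                  .con = con)

monthly_sales <- dbGetQuery(con, query)  # lands in a plain data frame
```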

So at first pass maybe my code is 90% SQL and 10% R. That's a good start. Then I look for design patterns in all my SQL. If there's a lot of repeated logic, I'd consider making a function that abstracts that logic away. One easy win I've repeated a few times is to have a function that does ALL the joins for a given star schema and returns a lazy tibble that's just a pointer to that joined-up stuff. That cuts all the preamble business from my queries. Then I pass that big lazy tibble through a number of filter statements to get it down to a reasonable size, maybe do some summarizing, and then collect. I'll incrementally refactor to move things into functions and then tie them together with dplyr data pipes.
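Here's a rough sketch of that star-schema helper idea; all the table and column names are invented, so treat it as a pattern rather than working code:

```r
library(DBI)
library(dplyr)
library(dbplyr)

con <- dbConnect(odbc::odbc(), dsn = "my_dsn")  # placeholder connection

# Hypothetical helper: do ALL the joins for one star schema once and
# return a lazy tbl -- just a pointer to the joined-up query; nothing
# is pulled from the database yet
sales_star <- function(con) {
  tbl(con, "fact_sales") %>%
    left_join(tbl(con, "dim_date"),    by = "date_key") %>%
    left_join(tbl(con, "dim_product"), by = "product_key") %>%
    left_join(tbl(con, "dim_store"),   by = "store_key")
}

# Downstream queries reuse the helper instead of repeating the join preamble
monthly_by_region <- sales_star(con) %>%
  filter(fiscal_year == 2019, region == "EMEA") %>%
  group_by(fiscal_month, product_line) %>%
  summarise(total_sales = sum(sales_amount, na.rm = TRUE)) %>%
  collect()   # the SQL is generated and run only at this point
```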

I used to have a "boil the ocean" approach where I'd try to fully refactor and solve everything. I've changed my approach to one of "incremental improvement" as I've aged. I think keeping code running but refactoring it bit by bit is usually (but not always) a better model.

Good luck!


@taras makes a really good point. Before you refactor anything, make sure you know what benefit you're getting for your efforts. In my example I'm getting the benefit of abstraction that makes the code more concise and easier to maintain. But if all you need to do is run existing code, slap that junk in a text file, do parameter substitution with glue, and don't look back!


Also, building on @taras's comment, keeping the SQL means you have a plan B in case your R environment or workflow fails. I like to have some SQL queries at hand for when my Shiny apps are offline for some annoying reason; that way I can at least take a look at my data using any SQL client, even on my phone.


I want to stress that I sympathize with you deeply. Refactoring old code is my favorite form of procrastination. But for me, at least, it's just that: procrastination. And so I'm trying to stay away from this guilty pleasure.


Let me throw out an idea that I sort of jumped past: think about whether some scheduled nightly ETL snapshots make sense in your workflow. I've gotten good mileage out of creating nightly extraction processes that do the "run SQL with specified parameters" step every night and cache the results in a table with some tagging.

The advantages of this for me are:

  1. I have a cached history of exactly the state of things at a point in time. This is handy if you're reporting from a system where junk changes underneath you.
  2. Even for systems where my DB is handling the snapshots, I use this method if the ETL takes longer than a few minutes. Basically, I'm caching query results.
  3. If I'm going to use the output of the query in multiple places, then having it cached in a table is very handy. I have situations where I generate Rmd reports from the data but some analysts want to look at the same data in a spreadsheet. I just have them hook Excel up to my cached results table and pull from there. It's the same data my reports use, so everyone has "one view of the truth".

I do this even for stuff I only use quarterly. The idea being that if I run it nightly and peek at it periodically to make sure it looks sane, it lowers my WTF count at the end of the reporting period. And we all want a lower WTF count during times of stress.
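For what it's worth, a bare-bones sketch of what such a nightly snapshot job might look like (the DSN, SQL file, and cache table name are all placeholders); the scheduling itself would come from cron, Task Scheduler, or similar:

```r
library(DBI)

con <- dbConnect(odbc::odbc(), dsn = "my_dsn")  # placeholder connection

# Run the existing extract (sql/weekly_extract.sql is a made-up file name)
snapshot <- dbGetQuery(
  con,
  paste(readLines("sql/weekly_extract.sql"), collapse = "\n")
)

# Tag each run so the cached history stays queryable later
snapshot$snapshot_date <- Sys.Date()

# Append tonight's results to the cache table (name is hypothetical);
# reports and Excel users both read from this table
dbWriteTable(con, "rpt_weekly_extract_cache", snapshot, append = TRUE)

dbDisconnect(con)
```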


@andresrcs, @taras, @jdlong -- thank you very much for the input. I really like the idea of incremental improvement here. One reason I was thinking about refactoring to d(b)plyr was that it would force me to develop a deep understanding of the data structures and queries I've been given.

But I think I'll be able to add more value in the short term by automating and prettifying the analysis and presentation components of the reports. Then, as a medium-term goal, I can return to the SQL and work on improving the queries and/or moving some of the process into dbplyr if and when it makes sense.

This is my first role that involves ongoing reporting as opposed to one-off analysis, so I have a lot to learn about these processes and workflows, and I'm very glad to have this place as an essential resource to hear from others who are much more experienced!
