When dbplyr does not translate functions to SQL

dbplyr

#1

I am new to databases and getting up to speed with dbplyr and the idea of conducting data transformation inside a cloud database and just pulling down the results. It makes sense to me in theory, but I'm getting stuck in cases where dbplyr does not (yet) translate some R code to an SQL statement. Here's a simple example:

  library(tidyverse)
  library(dbplyr)
  library(RPostgres)
  library(DBI)

# to make a reproducible example database
# does not work for me because postgresHasDefault() == FALSE
#   con <- dbConnect(RPostgres::Postgres(), dbname = "testdb")
#   copy_to(con, rownames_to_column(mtcars), name = "mtcars")

# works outside of a database
  mtcars %>% 
    tibble::rownames_to_column() %>%
    mutate(nWords = lengths(strsplit(rowname, "\\W+"))) %>% # problem for dbplyr
    filter(nWords < 2)

As I understand it, dbplyr does not translate strsplit():

Failed to prepare query: ERROR: function strsplit(text, unknown) does not exist

This is a specific example of a general problem for me. I want to do some manipulation on a large table before I collect the results, but sometimes the manipulation code I write in R does not translate to SQL via dbplyr.

I could use some advice about how to approach these problems.

Any function that dplyr doesn’t know how to convert is left as is.

Is the answer to figure out the right SQL equivalent and use translate_sql()?


#2

As far as I know, you can simply leave the SQL command in your code and the dbplyr translation leaves it as is. Therefore you don't need to use translate_sql(). This has worked for me so far.


#3

Maybe a dumb question here. Do I need to then figure out the SQL equivalent of strsplit() and just use that? Is that what you are saying?


#4

I think so, yes, but somebody else may have a better solution.


#5

Thanks again, @martin.R.

I'm trying to decide how to proceed in general based on this specific example. Maybe folks will have some thoughts about the approach. Some parameters to consider:

  • you have a cloud database of 500k rows that you expect to double in size in 6 or 12 months
  • you want to query the data, do some manipulations, filter/select based on manipulations, and pipe the results to a dashboard
  • ideally your query uses 100% live data, but an hourly cache would be fine

Is it worth figuring out how to translate all pre-dashboard R visualization code to SQL to run on the database if there is no dbplyr equivalent? Or would you just pull down the data every X minutes and run everything in R (query+manipulation+visualization)? Or maybe a third option that is a no brainer for software engineers (but novel to non-engineers)?


#6

Hi, this specific need is hard to solve for because if starts getting into the area of Text Analysis with SQL (and R). We essentially need to "tokenize" each word within the field (which is what the \\W+ is effectively doing) and then perform an analysis on the results (<2). Postgres seems to offer some solutions by creating an array based on the regex result, but I'm not certain how would those looks to dbplyr after converted. So, in essence, yes, the idea is to find an equal function to strsplit(), but my impression is that this will actually be a combination of multiple SQL calls within PostGres.