Summary() but on the database side of the wire

dplyr
dbplyr

#1

Hey y’all I’m sitting here on a Saturday night thinking about summarizing data, as one does. I regularly have a big table in Amazon Redshift which I’d love to get summary data from describing each field the way summary() in base R or skim() does. But my data is big enough that sucking it into R would take hours. So I’m thinking about using dplyr to cook up some code to effectively execute summary stuff, but do it over in the cloud and return only what’s needed for the output. (I was inspired by this blog post discussing all the different summary functions)

Before I start writing crappy code in anger, do any of you know of anyone or any packages already doing this sort of foolishness? I really hate reinventing the wheel if possible.

many thanks!


#2

You could create an issue in skimr. It uses dplyr so it shouldn’t be too difficult to implement something similar using dbplyr.


#3

good idea… I didn’t realize that skimr uses dplyr. I’ll investigate.

I think the big challenge with pushing the logic to the DB is much of the logic would have to be re-implemented using the limited syntax of SQL. So, for example, to calculate quantiles one would have to count all records, then get the value at different points in the distribution… e.g. get the value at 1/5 n then 2/5 n …


#4

Nice, I woudl also like to see if there’s any code or package to do this. Just for the specific case of quantile, I think you can use percentile_cont (available since Postgres 9.4) to compute quantile, not sure if it’s as precise as the R version of quantile though.

library(tidyverse)
library(dbplyr)
library(RPostgres)
library(DBI)

con <- dbConnect(RPostgres::Postgres(), dbname = "testdb")
copy_to(con, rownames_to_column(mtcars), name = "mtcars")

req <- "SELECT PERCENTILE_CONT(0.25) WITHIN GROUP (ORDER BY mpg) AS q1 FROM mtcars"
tbl(con, sql(req))
##      q1
##   <dbl>
## 1  15.4

mtcars %>%
  summarise(q1 = quantile(mpg, probs = 0.25))
##       q1
## 1 15.425

On a side note, it would be great if someone can point me to mecanism to add specific aggregate function to an already existing DBI backend. Using the help from sql_variant, I came up with this

pg_quantile <-  function(x, probs, ...) {
  stopifnot(length(probs) == 1) ## 1 parameters supported
  sql(str_glue("PERCENTILE_CONT({probs}) WITHIN GROUP (ORDER BY {x})", x = x, probs = probs))
}

postgres_agg <- sql_translator(.parent = base_agg,
                              quantile = pg_quantile)

postgres_var <- sql_variant(
  base_scalar,
  postgres_agg,
  base_no_win
)

conn <- structure(
  list(),
  class = c("TestCon", "DBITestConnection", "DBIConnection")
)

sql_translate_env.TestCon <- function(x) postgres_var
translate_sql(quantile(x, probs = 0.5), con = conn, window = FALSE)
## <SQL> PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY `x`)

It will be great to have a way to use this new translation directly with dplyr, any doc available ?

Thanks


#5

I love the idea of using percentile_cont because it simplifies things, and I’m lazy. But then the package would be dependant on postgres and not generally applicable to all SQL backends. I’m inclined to limit myself to the SQL functions supported in dblyr, just for compatibility. Although that will be more work to code up.


#6

Morning, for the histogram you could use dbplot . It provides a quick and easy ggplot2 output, or it can return the bin+count in a data.frame, all operated within the database. It works with any database that dbplyr works with.


#7

neat project! Thank you. Is it possible to get the bin values from the plot? Are those accessible to the user?


#8

Thanks! And yes, all of the underpinning calculations are availabe in their own discrete functions: https://github.com/edgararuiz/dbplot#calculation-functions


#9

Thanks Edgar. that’s very helpful. I’ll take a look at your code in the near future and see what I can cobble together.


#10

Integrating the dbplot-style calculations into skimr would be great! Getting skimr-style summaries via dbplyr against Redshift and the like would make EDA so much easier.


#11

that’s exactly the use case I had in mind, @kylelundstedt!