dbplyr not translating standard functions (eg sum) to sql successfully

Hi, all!

I am trying to summarise some data within duckdb using dplyr/dbplyr, and I am running into problems with functions not successfully translating into sql.

I understand that not everything is possible (I have read SQL translation • dbplyr), but I am simply asking for sums after filtering for a given condition -- this seems likely translatable to sql via dbplyr, I would think.

The crux of the code/issue is:

my_summary_2 <- con %>% 
  tbl("duck_df") %>% 
  summarise(across(
    .cols = everything(),
    .fns = list(
      n = ~ n(),
      n_na = ~ sum(is.na(.x)),
      mean = ~ mean(.x, na.rm = TRUE),
      n_lt_p = ~ sum(.x < p, na.rm = TRUE),
      pct_lt_p = ~ mean(.x < p, na.rm = TRUE) * 100
    )
  ))


my_summary_2 %>% collect() # boo!

With errors:

Error in `collect()`:
! Failed to collect lazy table.
Caused by error:
! rapi_prepare: Failed to prepare query SELECT
  COUNT(*) AS x_n,
  SUM((x IS NULL)) AS x_n_na,
  AVG(x) AS x_mean,
  SUM(x < p) AS x_n_lt_p,
  AVG(x < p) * 100.0 AS x_pct_lt_p,
  COUNT(*) AS y_n,
  SUM((y IS NULL)) AS y_n_na,
  AVG(y) AS y_mean,
  SUM(y < p) AS y_n_lt_p,
  AVG(y < p) * 100.0 AS y_pct_lt_p,
  COUNT(*) AS z_n,
  SUM((z IS NULL)) AS z_n_na,
  AVG(z) AS z_mean,
  SUM(z < p) AS z_n_lt_p,
  AVG(z < p) * 100.0 AS z_pct_lt_p
FROM duck_df
Error: Binder Error: No function matches the given name and argument types 'sum(BOOLEAN)'. You might need to add explicit type casts.
	Candidate functions:
	sum(DECIMAL) -> DECIMAL
	sum(SMALLINT) -> HUGEINT
	sum(INTEGER) -> HUGEINT
	sum(BIGINT) -> HUGEINT
	sum(HUGEINT) -> HUGEINT
	sum(DOUBLE) -> DOUBLE

LINE 3:   SUM((x IS NULL)) AS x_n_na,

I know this is possible in SQL somehow (COUNT case when...) but I am not veyr comfortable in SQL and would rather stay in dplyr-land if possible. Is there some way I need to rewrite my summary function that will enable appropriate translation?

Thank you!

Full code:

library(dplyr)
library(duckdb)
library(dbplyr)

possible_numbers <- na_if(c(1:4), 4) 

df <- data.frame(
  x = sample(possible_numbers, size = 10, replace = TRUE),
  y = sample(possible_numbers, size = 10, replace = TRUE),
  z = sample(possible_numbers, size = 10, replace = TRUE)
)

p <- 2

df %>% 
  summarise(across(
    .cols = everything(),
    .fns = list(
      n = ~ n(),
      n_na = ~ sum(is.na(.x)),
      mean = ~ mean(.x, na.rm = TRUE),
      n_lt_p = ~ sum(.x < p, na.rm = TRUE),
      pct_lt_p = ~ mean(.x < p, na.rm = TRUE) * 100
    )
  ))



con = dbConnect(duckdb(),
                dbdir = "test.duckdb",
                read_only = FALSE)
# duckdb_register(con, name = "duck_df", df = df)

dbWriteTable(con, "duck_df", df)

my_summary_1 <- con %>% 
  tbl("duck_df") %>% 
  summarise(across(.cols = everything(),
                   .fns = list(mean = ~ mean(.x, na.rm = TRUE))))
  

my_summary_1 %>% collect() # yay, works!



my_summary_2 <- con %>% 
  tbl("duck_df") %>% 
  summarise(across(
    .cols = everything(),
    .fns = list(
      n = ~ n(),
      n_na = ~ sum(is.na(.x)),
      mean = ~ mean(.x, na.rm = TRUE),
      n_lt_p = ~ sum(.x < p, na.rm = TRUE),
      pct_lt_p = ~ mean(.x < p, na.rm = TRUE) * 100
    )
  ))


my_summary_2 %>% collect() # boo, errors!

There are two challenges.

  1. duckdb doesnt know to sum(Booleans)
    I deal with it explicitly by saying TRUE = 1 and FALSE = 0 , and duckdb can sum those.
  2. it can't acces the environement variable p from within the depths of the dynamic function calls (across()) etc (so it seems to me) so I made it explicit by first adding p_ value to every observation so its like any other variable
my_summary_2 <- con %>% 
  tbl("duck_df") %>% mutate(
    p_ = .env$p
  ) %>% 
  summarise(across(
    .cols = everything(),
    .fns = list(
      n = ~ n(),
      n_na = ~ sum(case_match(is.na(.x),
                              TRUE ~ 1L,
                              FALSE ~0L)),
      mean = ~ mean(.x, na.rm = TRUE),
      n_lt_p = ~ sum(case_match(.x < p_,
                                TRUE ~ 1L,
                                FALSE ~0L), na.rm = TRUE),
      pct_lt_p = ~ mean(case_match(.x < p_,
                                   TRUE ~ 1L,
                                   FALSE ~0L), na.rm = TRUE) * 100
    )
  ))


my_summary_2 %>% collect()
1 Like

Thank you so much, Nir! That works beautifully and I now know a bit more about case_match/case_when and their sql counterparts, as well as data masking.

Thanks so much for taking the time to help me out

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.