I wanted to use dbplyr to push a compute down to a Postgres database. It will time out if I try to download the table it's a public cloud database, so this is my story.
I wanted to see which date columns were mostly empty. Thus, I wanted to calculate the percentage NA for each column with keywords in the column name.
Note: I am using lazy evaluation with a live database connection.
Googled Stackoverflow Common Solution
library(dplyr) x %>% summarise_each(funs(100*mean(is.na(.))))
The problem with this is the genereated Postgres sql
SELECT SUM(colname is NULL) ... is not a valid operation with the version of Postgres I am using. (v11)
data_tbl %>% select(contains("date"), contains("month"), contains("year")) %>% mutate_all( funs( case_when( is.na(.) ~ 1, TRUE ~ 0))) %>% summarise_all(mean)
This generates perfect SQL for my use.
I thought i'd share and save the world some trouble for the next person.