I wanted to use dbplyr to push a compute down to a Postgres database. It will time out if I try to download the table it's a public cloud database, so this is my story.
I wanted to see which date columns were mostly empty. Thus, I wanted to calculate the percentage NA for each column with keywords in the column name.
Note: I am using lazy evaluation with a live database connection.
Googled Stackoverflow Common Solution
library(dplyr)
x %>%
summarise_each(funs(100*mean(is.na(.))))
The problem with this is the genereated Postgres sql SELECT SUM(colname is NULL) ...
is not a valid operation with the version of Postgres I am using. (v11)
Final Solution
data_tbl %>%
select(contains("date"), contains("month"), contains("year")) %>%
mutate_all(
funs(
case_when(
is.na(.) ~ 1,
TRUE ~ 0))) %>%
summarise_all(mean)
This generates perfect SQL for my use.
I thought i'd share and save the world some trouble for the next person.