dbplyr Postgres na percentage count trick

I wanted to use dbplyr to push a compute down to a Postgres database. It will time out if I try to download the table it's a public cloud database, so this is my story.

I wanted to see which date columns were mostly empty. Thus, I wanted to calculate the percentage NA for each column with keywords in the column name.

Note: I am using lazy evaluation with a live database connection.

Googled Stackoverflow Common Solution

 library(dplyr)
 x %>% 
   summarise_each(funs(100*mean(is.na(.))))

source

The problem with this is the genereated Postgres sql SELECT SUM(colname is NULL) ... is not a valid operation with the version of Postgres I am using. (v11)

Final Solution

data_tbl %>%
  select(contains("date"), contains("month"), contains("year")) %>%
  mutate_all(
    funs(
      case_when(
        is.na(.) ~ 1,
        TRUE ~ 0))) %>%
  summarise_all(mean)

This generates perfect SQL for my use.

I thought i'd share and save the world some trouble for the next person.

2 Likes

This topic was automatically closed 21 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.