Using filter_at with multiple variable groups

dbplyr
sql

#1

Using filter_at() with a database is powerful since one call to this function can generate a lot of SQL code particularly if you need to filter on many variables. For example:

filter_at(flights_db, vars(contains("time")), all_vars(. > 0))

generates

SELECT *
FROM "flights"
WHERE (((((("dep_time" > 0.0) AND ("sched_dep_time" > 0.0)) AND ("arr_time" > 0.0)) AND ("sched_arr_time" > 0.0)) AND ("air_time" > 0.0)) AND ("time_hour" > 0.0))

If I have two groups of variables that I want to filter on I can combine them using AND by filtering twice.

flights_db %>% 
     filter_at(vars(contains("delay")), all_vars(. < 0)) %>% 
     filter_at(vars(contains("time")), all_vars(. > 0))

However I do not know how to combine these filters using OR logic. If the data are not in a database then I could get the desired result using this code:

bind_rows(
          filter_at(flights, vars(contains("delay")), all_vars(. < 0)),
          filter_at(flights, vars(contains("time")), all_vars(. > 0))
     ) %>%
distinct()

I’m wondering if there is a way to get this same result using a database and some dplyr syntax I don’t know about.

I’m imagining some kind of syntax like this

flights_db %>% 
     filter_at(all_vars(vars(contains("delay")), . < 0) | all_vars(vars(contains("time")), . > 0) )

Is there any way to get this result from a database using filter_at() and without creating two separate queries and combining the results?

Note: I’m using the nycflights13::flights data in a postgres database.


#2

I think the following work:

Setup:

#> sorry to be excessive with the setup
library(dbplyr);library(dplyr)
con <- DBI::dbConnect(RSQLite::SQLite(), path = ":memory:")
copy_to(con, nycflights13::flights, "flights",
        temporary = FALSE, 
        indexes = list(
          c("year", "month", "day"), 
          "carrier", 
          "tailnum",
          "dest"
        )
)
flights_db <- tbl(con, "flights")

# and here it is
filter_at(flights_db, vars(c(contains("delay"), contains("time"))), all_vars(. > 0)) %>% show_query()
#> <SQL>
#> SELECT *
#> FROM `flights`
#> WHERE ((((((((`dep_delay` > 0.0) AND (`arr_delay` > 0.0)) AND (`dep_time` > 0.0)) AND (`sched_dep_time` > 0.0)) AND (`arr_time` > 0.0)) AND (`sched_arr_time` > 0.0)) AND (`air_time` > 0.0)) AND (`time_hour` > 0.0))

The second argument .vars accepts “A list of columns generated by vars(), or a character vector of column names, or a numeric vector of column positions.


#3

Thanks for the suggestion. Unfortunately this does not solve the problem because it does not allow you to use a different condition with each group of variables. The desired result should include rows where all “delay” columns are less than zero OR all “time” columns are greater than zero.

I think your suggestion would result in rows where all “delay” and “time” variables are greater than zero.