Inequality constraints in dplyr join

jlacko · May 25, 2020, 7:28pm

Is there an elegant equivalent of the SQL BETWEEN operator in {dplyr}?

I need to pair two data frames: one with daily snapshots, and another historized in the SCD2 (slowly changing dimension, type 2) style, i.e. the left-hand side has a date_valid field, and the right-hand side has valid_from and valid_to fields.

The canonical SQL approach would be to inner join them on lhs.date_valid between rhs.valid_from and rhs.valid_to, or perhaps lhs.date_valid >= rhs.valid_from and lhs.date_valid < rhs.valid_to.

As familiar as I am with the SQL way of doing this routine task, I struggle to find a practical {dplyr} approach.

To further complicate things, I would prefer to do the task without taking on additional dependencies.

mfherman · May 25, 2020, 9:32pm

One option is the fuzzyjoin package. See this thread for a couple of examples:

Tidy way to range join tables, on an interval of dates (tidyverse)

You can do it with the fuzzyjoin package, which implements various not quite exact matching joins in dplyr syntax.

```r
library(tidyverse)
library(fuzzyjoin)

df1 <- tibble::tribble(
  ~id, ~category, ~date,
  1L, "a", "7/1/2000",
  2L, "b", "11/1/2000",
  3L, "c", "7/1/2002"
) %>%
  mutate(date = as.Date(date, format = "%m/%d/%Y"))

df2 <- tibble::tribble(
  ~category, ~other_info, ~start, ~end,
  "a", "x", "1/1/2000", "12/31/2000",
  "b", "y", "1/1/2001", "12/31/2001",
  "c", "z", "1/1/2002", "12/31/2002"
) %>%
  mutate_at(vars(start, end), as.Date, format = "%m/%d/%Y")

fuzzy_left_join(
  df1, df2,
  by = c(
    "category" = "c…
```

If you’re working with large datasets, you’re probably better off with foverlaps() from data.table.
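For reference, here is a minimal sketch of what the foverlaps() route might look like for the snapshot/SCD2 case from the question. The table and column names (snapshots, history, date_valid_end, status) and the sample data are made up for illustration. Note that foverlaps() treats both interval ends as inclusive, so this sketch assumes valid_to is the last valid day, which matches BETWEEN rather than the half-open `>=` / `<` variant.

```r
library(data.table)

# Hypothetical sample data: daily snapshots and an SCD2-style history table.
snapshots <- data.table(
  snapshot_id = 1:3,
  date_valid  = as.Date(c("2020-01-15", "2020-03-01", "2020-06-30"))
)
history <- data.table(
  valid_from = as.Date(c("2020-01-01", "2020-03-01")),
  valid_to   = as.Date(c("2020-02-29", "2020-12-31")),
  status     = c("trial", "paid")
)

# foverlaps() expects an interval on both sides, so each snapshot date
# is turned into a zero-length interval [date_valid, date_valid_end].
snapshots[, date_valid_end := date_valid]

# The lookup table must be keyed, with the interval columns last in the key.
setkey(history, valid_from, valid_to)

joined <- foverlaps(
  snapshots, history,
  by.x    = c("date_valid", "date_valid_end"),
  by.y    = c("valid_from", "valid_to"),
  type    = "within",  # snapshot interval must lie inside the validity interval
  nomatch = 0L         # drop snapshots without a match (inner join)
)

# Remove the helper column again.
joined[, date_valid_end := NULL]
```

This does add data.table as a dependency, so it only helps if that trade-off is acceptable.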

jlacko · May 26, 2020, 10:44pm

I was somewhat reluctant to introduce additional dependencies, so I have made do with a very convoluted cross join followed by a filter. Not pretty, but it works.

Having said that, I will be following dplyr::join_by() development closely...
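The post does not show the actual code, so the following is only a sketch of what a cross join followed by a filter could look like with {dplyr} alone. The data, the dummy_key column and the status column are hypothetical; the cross join is emulated by joining on a constant key, which works across dplyr versions.

```r
library(dplyr)

# Hypothetical sample data: daily snapshots and an SCD2-style history table.
snapshots <- tibble(
  snapshot_id = 1:3,
  date_valid  = as.Date(c("2020-01-15", "2020-03-01", "2020-06-30"))
)
history <- tibble(
  valid_from = as.Date(c("2020-01-01", "2020-03-01")),
  valid_to   = as.Date(c("2020-03-01", "2021-01-01")),
  status     = c("trial", "paid")
)

joined <- snapshots %>%
  # A constant key on both sides turns the inner join into a cross join.
  # Newer dplyr versions may warn about a many-to-many relationship here,
  # which is expected for a Cartesian product.
  mutate(dummy_key = 1L) %>%
  inner_join(mutate(history, dummy_key = 1L), by = "dummy_key") %>%
  # Keep only the pairs where the snapshot date falls in [valid_from, valid_to).
  filter(date_valid >= valid_from, date_valid < valid_to) %>%
  select(-dummy_key)
```

The intermediate result has nrow(snapshots) * nrow(history) rows, so this approach is only practical for small to medium tables.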

github.com/tidyverse/dplyr

join_by(): Syntax for generic joins

krlmlr · opened 08:46AM - 08 Nov 16 UTC · closed 07:09PM - 09 May 22 UTC

Labels: feature, tables 🧮

https://github.com/hadley/dplyr/issues/557#issuecomment-53483154 and https://github.com/hadley/dplyr/issues/378#issuecomment-212609049 propose a syntax for generic and rolling joins:

```r
left_join(
  FundMonths, Returns,
  join_by(FundID == FundID, yearmonth > gmonth + 3, yearmonth <= gmonth + 15)
)

left_join(
  events, days,
  join_by(collector_id == collector_id, event_timestamp >= largest(day))
)
```

As usual, this should be powered by an SE version `join_by_()`. We can pass this to the SQL engine (and perhaps to data tables) with relatively little work, the main challenge will be to implement this for data frames.
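For readers arriving later: join_by() with inequality conditions is available from dplyr 1.1.0 onwards, and the released syntax differs in places from the proposal quoted above. A minimal sketch of the join from the original question, assuming dplyr >= 1.1.0 and using made-up sample data (the status column and all values are hypothetical), might look like this:

```r
library(dplyr)  # assumes dplyr >= 1.1.0, where join_by() is available

# Hypothetical sample data: daily snapshots and an SCD2-style history table.
snapshots <- tibble(
  snapshot_id = 1:3,
  date_valid  = as.Date(c("2020-01-15", "2020-03-01", "2020-06-30"))
)
history <- tibble(
  valid_from = as.Date(c("2020-01-01", "2020-03-01")),
  valid_to   = as.Date(c("2020-03-01", "2021-01-01")),
  status     = c("trial", "paid")
)

# Half-open interval: valid_from <= date_valid < valid_to.
# In each condition the left side refers to snapshots, the right to history.
joined <- inner_join(
  snapshots, history,
  by = join_by(date_valid >= valid_from, date_valid < valid_to)
)
```

If both ends should be inclusive, as with SQL BETWEEN, join_by() also accepts a between() helper, e.g. `join_by(between(date_valid, valid_from, valid_to))`.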