Inequality constraints in dplyr join

Is there an elegant equivalent of the SQL between statement in {dplyr}?

I am facing a situation of pairing two data frames - one with daily snapshots, and another in SCD2 historization stereotype (i.e. left hand side has date_valid field, and right hand side has valid_from and valid_to fields).

The canonical SQL approach would be to inner join them on lhs.date_valid between rhs.valid_from and rhs.valid_to, or perhaps lhs.date_valid >= rhs.valid_from and lhs.date_valid < rhs.valid_to.

As familiar as I am with the SQL way of doing this routine task I struggle with finding a practical {dplyr} approach.

To further complicate things I would prefer to do the task without taking on additional dependencies.

One option is the fuzzyjoin package. See this thread for a couple examples:

If you’re working with large datasets, you’re probably better off with foverlaps() from data.table.

2 Likes

I was somewhat reluctant to introduce additional dependencies, so I have made do with a very convoluted cross join followed by a filter - not pretty, but works.

Having said that I will be following dplyr:: join_by() development closely...

2 Likes