Tidy way to range join tables, on an interval of dates

You can do it with the fuzzyjoin package, which implements various not quite exact matching joins in dplyr syntax.

library(tidyverse)
library(fuzzyjoin)

df1 <- tibble::tribble(
  ~id, ~category,       ~date,
  1L,       "a",  "7/1/2000",
  2L,       "b", "11/1/2000",
  3L,       "c",  "7/1/2002"
  ) %>%
  mutate(date = as.Date(date, format = "%m/%d/%Y"))

df2 <- tibble::tribble(
  ~category, ~other_info,     ~start,         ~end,
  "a",         "x", "1/1/2000", "12/31/2000",
  "b",         "y", "1/1/2001", "12/31/2001",
  "c",         "z", "1/1/2002", "12/31/2002"
  ) %>%
  mutate_at(vars(start, end), as.Date, format = "%m/%d/%Y")

fuzzy_left_join(
  df1, df2,
  by = c(
    "category" = "category",
    "date" = "start",
    "date" = "end"
    ),
  match_fun = list(`==`, `>=`, `<=`)
  ) %>%
  select(id, category = category.x, other_info, date, start, end)
#> # A tibble: 3 x 6
#>      id category other_info date       start      end       
#>   <int> <chr>    <chr>      <date>     <date>     <date>    
#> 1     1 a        x          2000-07-01 2000-01-01 2000-12-31
#> 2     2 b        <NA>       2000-11-01 NA         NA        
#> 3     3 c        z          2002-07-01 2002-01-01 2002-12-31

Created on 2018-04-30 by the reprex package (v0.2.0).

The extra argument, in the fuzzy_left_join() function, match_fun, allows you to define the matching criterion for each pair of columns as a function. In this case, we want category == category, date >= start, and date <= end.

22 Likes