Join against the most recent date (if it exists)

I'm struggling with this type of join in R, and I'm not quite sure how to do it.

My data frame is like this:

all_data <- structure(list(pk1 = c("1", "1", "1"),
                           pk2 = c("101", "101", "101"), 
                           pk3 = c(111111, 222222, 333333), 
                           date_created = structure(c(18494, 18527, 18564), class = "Date")), 
                      row.names = c(NA, -3L), class = c("tbl_df", "tbl", "data.frame"))

What I'd like to do is try to join it on itself, but only join if there is a most recent transaction that exists.

Ideally my output would look like this:

  pk1   pk2    pk3.x date_created.x  pk3.y date_created.y
  <chr> <chr>  <dbl> <date>          <dbl> <date>        
1 1     101   111111 2020-08-20         NA NA            
2 1     101   222222 2020-09-22     111111 2020-08-20    
3 1     101   333333 2020-10-29     222222 2020-09-22

I struggle with this because I can't indicate a conditional join (i.e. join with these primary keys only if the date_created.x > date_created.y). I would appreciate a dplyr approach, although at this point, I feel that data.table might be the route I need to pursue.

A coworker pointed out to me that I didn't even need to join it at all (probably a holdover of my SQL approaches). A grouped by lag() works perfectly:

all_data %>%
  group_by(pk1, pk2) %>%
  mutate(prior_pk3 = dplyr::lag(pk3, n = 1, default = NA),
         prior_date_created = dplyr::lag(date_created, n = 1, default = NA))

This topic was automatically closed 21 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.