I'm struggling with this type of join in R, and I'm not quite sure how to do it.

My data frame is like this:

all_data <- structure(list(pk1 = c("1", "1", "1"),
                           pk2 = c("101", "101", "101"), 
                           pk3 = c(111111, 222222, 333333), 
                           date_created = structure(c(18494, 18527, 18564), class = "Date")), 
                      row.names = c(NA, -3L), class = c("tbl_df", "tbl", "data.frame"))

What I'd like to do is try to join it on itself, but only join if there is a most recent transaction that exists.

Ideally my output would look like this:

  pk1   pk2    pk3.x date_created.x  pk3.y date_created.y
  <chr> <chr>  <dbl> <date>          <dbl> <date>        
1 1     101   111111 2020-08-20         NA NA            
2 1     101   222222 2020-09-22     111111 2020-08-20    
3 1     101   333333 2020-10-29     222222 2020-09-22

I struggle with this because I can't indicate a conditional join (i.e. join with these primary keys only if the date_created.x > date_created.y). I would appreciate a dplyr approach, although at this point, I feel that data.table might be the route I need to pursue.

A coworker pointed out to me that I didn't even need to join it at all (probably a holdover of my SQL approaches). A grouped by lag() works perfectly:

all_data %>%
  group_by(pk1, pk2) %>%
  mutate(prior_pk3 = dplyr::lag(pk3, n = 1, default = NA),
         prior_date_created = dplyr::lag(date_created, n = 1, default = NA))

This topic was automatically closed 21 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.