dplyr joins: dealing with multiple matches (duplicates in key column)

Hello,
I am trying to join two data frames using dplyr. Neither data frame has a unique key column. The closest equivalent of the key column is the dates variable of monthly data. Each df has multiple entries per month, so the dates column has lots of duplicates.

I was able to find a solution from Stack Overflow, but I am having a really difficult time understanding that solution. Can you help me find a simpler solution that is easier for beginner level users to understand?

Here is a simple reproducible example:

df_1 <- data_frame(
  dates = c(as.Date("2018-07-01"), as.Date("2018-06-01"), as.Date("2018-06-01"), as.Date("2018-05-01")),
  x1 = c(10L, 11L, 12L, 13L),
  text1 = c("text a", "text b", "text c", "text d")
)

df_2 <- data_frame(
  dates = c(as.Date("2018-07-01"), as.Date("2018-06-01"), as.Date("2018-05-01"), as.Date("2018-04-01")),
  x2 = c(20L, 21L, 22L, 23L),
  text2 = c("text aa", "text bb", "text cc", "text dd")
)

Notice that rows 2 & 3 in df_1 both refer to "2018-06-01" (i.e. a duplicate in the key column, other columns have different data)

# A tibble: 4 x 3
  dates         x1 text1 
  <date>     <int> <chr> 
1 2018-07-01    10 text a
2 2018-06-01    11 text b
3 2018-06-01    12 text c
4 2018-05-01    13 text d
> df_2
# A tibble: 4 x 3
  dates         x2 text2  
  <date>     <int> <chr>  
1 2018-07-01    20 text aa
2 2018-06-01    21 text bb
3 2018-05-01    22 text cc
4 2018-04-01    23 text dd

If I do a simple left_join, I get this:

> df_combined_1 <- left_join(df_1, df_2, by = "dates") 
> df_combined_1
# A tibble: 4 x 5
  dates         x1 text1     x2 text2  
  <date>     <int> <chr>  <int> <chr>  
1 2018-07-01    10 text a    20 text aa
2 2018-06-01    11 text b    21 text bb
3 2018-06-01    12 text c    21 text bb
4 2018-05-01    13 text d    22 text cc

rs

I want a joined data frame that is something like this:

df_need <- data_frame(
  dates = c(as.Date("2018-07-01"), as.Date("2018-06-01"), as.Date("2018-06-01"), as.Date("2018-05-01")),
  x1 = c(10L, 11L, 12L, 13L),
  text1 = c("text a", "text b", "text c", "text d"),
  x2 = c(20L, 21L, NA, 23L),
  text2 = c("text aa", "text bb", NA, "text dd")
)
df_need
> df_need
# A tibble: 4 x 5
  dates         x1 text1     x2 text2  
  <date>     <int> <chr>  <int> <chr>  
1 2018-07-01    10 text a    20 text aa
2 2018-06-01    11 text b    21 text bb
3 2018-06-01    12 text c    NA <NA>   
4 2018-05-01    13 text d    23 text dd

Here is the Stack Overflow solution that seems to match exactly what I am looking for:

Is it possible to create a solution that is (a) a bit easier to understand for beginners (b) uses the purr package or some other tidyverse solution?

Thanks in advance for any comments and guidance.

1 Like

Here's one approach. I add a row number within each group of dates, so that only the first June from df_2 will be joined to the first June entry from df_1.

left_join(df_1 %>% group_by(dates) %>% mutate(id = row_number()),
          df_2 %>% group_by(dates) %>% mutate(id = row_number()), 
          by = c("dates", "id"))

# A tibble: 4 x 6
# Groups:   dates [?]
  dates         x1 text1     id    x2 text2  
  <date>     <int> <chr>  <int> <int> <chr>  
1 2018-07-01    10 text a     1    20 text aa
2 2018-06-01    11 text b     1    21 text bb
3 2018-06-01    12 text c     2    NA NA     
4 2018-05-01    13 text d     1    22 text cc

Alternatively, if you want to capture every row that exists in either table, you could use full_join:

full_join(df_1 %>% group_by(dates) %>% mutate(id = row_number()),
          df_2 %>% group_by(dates) %>% mutate(id = row_number()), 
          by = c("dates", "id"))

# A tibble: 5 x 6
# Groups:   dates [?]
  dates         x1 text1     id    x2 text2  
  <date>     <int> <chr>  <int> <int> <chr>  
1 2018-07-01    10 text a     1    20 text aa
2 2018-06-01    11 text b     1    21 text bb
3 2018-06-01    12 text c     2    NA NA     
4 2018-05-01    13 text d     1    22 text cc
5 2018-04-01    NA NA         1    23 text dd
10 Likes

Thank you @jonspring ! This is perfect. Solved my problem :slight_smile:

Hi Piranha,

If jonspring solved your issue, you might want to accept his answer so that we know that this question is resolved.

Thanks! :slight_smile:

Done. Forgot to check that box earlier. Thanks :slight_smile:

1 Like