Joining a dataframe on multiple columns when dealing with missing values

I would like to use a left_join, but the data I'm working with suffers from missing data. This prevents me from choosing a suitable key to perform to join on. My original dataset is about 1800 rows and the dataset to join with is about 6000 rows.
In the example data below, you can see that a join should be possible because the information is all there. The problem is that there is not one column that exactly matches the to-be-joined dataframe.

Piping multiple left_joins also doesn't seem to work, because the left_join changes the dataframe columns.

Does someone have more experience with joining dataframes when dealing with missing or incomplete data? Or do I have to create different subsets and merge those together?

library(dplyr)

# base dataframe
df1 <- tibble(
  var1 = c("A", NA, "C"),
  var2 = c(1, 2, NA),
  var3 = c(NA, "bear", "zebra")
)
## Data to be joined
df2 <- tibble(
  var1 = c("A", "B", "C"),
  var2 = c(1, 2, 3),
  var3 = c("dog", "bear", "zebra"),
  var4 = c("new", "info", "joined")
)

## A normal join doesn't show the desired results
df1 %>%
  left_join(y = df2)

## selecting var1 as the key, will miss the information from the second row
df1 %>%
  left_join(y = df2, by = "var1")

## piping multiple dataframe gives an error
df1 %>%
  left_join(y = df2, by = "var1")  >%
  left_join(y = df2, by = "var2") 

Here is 'greedy' match on var1 or var2 approach

df1 %>%
  crossing(y = df2) %>%
  filter(var1 == y$var1 |
         var2 == y$var2) 

plus more cleanup

df1 %>%
  crossing(y = df2) %>%
  filter(var1 == y$var1 |
         var2 == y$var2) %>%
  mutate(var4 = y$var4) %>%
  select(-y)
1 Like

Thanks! This gets me in the right direction. My original dataset is not super large (about 1800 rows) and the dataset to check for join is about 6000 rows. So this approach may not work for very large datasets.

This topic was automatically closed 21 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.