I would like to use a left_join, but the data I'm working with suffers from missing data. This prevents me from choosing a suitable key to perform to join on. My original dataset is about 1800 rows and the dataset to join with is about 6000 rows.
In the example data below, you can see that a join should be possible because the information is all there. The problem is that there is not one column that exactly matches the to-be-joined dataframe.
Piping multiple left_joins also doesn't seem to work, because the left_join changes the dataframe columns.
Does someone have more experience with joining dataframes when dealing with missing or incomplete data? Or do I have to create different subsets and merge those together?
library(dplyr)
# base dataframe
df1 <- tibble(
var1 = c("A", NA, "C"),
var2 = c(1, 2, NA),
var3 = c(NA, "bear", "zebra")
)
## Data to be joined
df2 <- tibble(
var1 = c("A", "B", "C"),
var2 = c(1, 2, 3),
var3 = c("dog", "bear", "zebra"),
var4 = c("new", "info", "joined")
)
## A normal join doesn't show the desired results
df1 %>%
left_join(y = df2)
## selecting var1 as the key, will miss the information from the second row
df1 %>%
left_join(y = df2, by = "var1")
## piping multiple dataframe gives an error
df1 %>%
left_join(y = df2, by = "var1") >%
left_join(y = df2, by = "var2")