Combining two like data frames

jrdubbleu · April 26, 2023, 1:52pm

I have several data frames that I want to combine based on a common key value, "case_id". main_df has the original version of all the columns, some of the columns have missing values. df_a, df_b, df_c, df_d, have the case_id column, all of the original versions of the columns, but with the missing values filled in. I have tried to merge, join_left, reduce, bind_cols all of these data frames together but I keep ending up with new versions of the columns with some suffix denoting that the column already existed. I just want the columns with the missing values filled in with the new values from df_a thru df_d. Where am I going wrong?

AlexisW · April 26, 2023, 5:46pm

The standard function won't do that because it's not trivial: what if there are different non-NA values in two of the dataframes? What if there are no non-NA values in any of the dataframes? These require some form of decision, the package won't make it for you. So you need to do it in two steps: first, you join the dataframes, then, you build the updated column without NAs.

The exact way to do it will depend on what your actual data looks like, which we don't know since you didn't provide a reprex. So first, let's build one based on your description.

library(tidyverse)

# create two dataframes with different sets of NA
df_a <- tibble(case_id = 1:6,
               my_column = c("A", NA, "C", "D", NA, NA))

df_b <- tibble(case_id = 1:6,
               my_column = c(NA, "B", "C", NA, "E", NA))

df_a
#> # A tibble: 6 × 2
#>   case_id my_column
#>     <int> <chr>    
#> 1       1 A        
#> 2       2 <NA>     
#> 3       3 C        
#> 4       4 D        
#> 5       5 <NA>     
#> 6       6 <NA>
df_b
#> # A tibble: 6 × 2
#>   case_id my_column
#>     <int> <chr>    
#> 1       1 <NA>     
#> 2       2 B        
#> 3       3 C        
#> 4       4 <NA>     
#> 5       5 E        
#> 6       6 <NA>

# join them
left_join(df_a, df_b, by = "case_id")
#> # A tibble: 6 × 3
#>   case_id my_column.x my_column.y
#>     <int> <chr>       <chr>      
#> 1       1 A           <NA>       
#> 2       2 <NA>        B          
#> 3       3 C           C          
#> 4       4 D           <NA>       
#> 5       5 <NA>        E          
#> 6       6 <NA>        <NA>

# join them, then create a column that summarizes the two others
left_join(df_a, df_b, by = "case_id") |>
  mutate(my_column_updated = if_else(is.na(my_column.x),
                                     my_column.y,
                                     my_column.x))
#> # A tibble: 6 × 4
#>   case_id my_column.x my_column.y my_column_updated
#>     <int> <chr>       <chr>       <chr>            
#> 1       1 A           <NA>        A                
#> 2       2 <NA>        B           B                
#> 3       3 C           C           C                
#> 4       4 D           <NA>        D                
#> 5       5 <NA>        E           E                
#> 6       6 <NA>        <NA>        <NA>

# join them, then create a column that summarizes the two others,
# and keep only that column
left_join(df_a, df_b, by = "case_id") |>
  mutate(my_column = if_else(is.na(my_column.x),
                                     my_column.y,
                                     my_column.x)) |>
  select(case_id, my_column)
#> # A tibble: 6 × 2
#>   case_id my_column
#>     <int> <chr>    
#> 1       1 A        
#> 2       2 B        
#> 3       3 C        
#> 4       4 D        
#> 5       5 E        
#> 6       6 <NA>

^{Created on 2023-04-26 with reprex v2.0.2}

This is one example of a simple case, you might need to adapt it to your actual situation.

system · June 7, 2023, 5:46pm

This topic was automatically closed 42 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.