Merging 2 dataframes and replacing NA values

zhanhaopeh · May 31, 2019, 5:00pm

I have 2 dataframes (x and y) with similar column names, and I would like to merge the 2 dataframes by the "ID" column. Also, I would like to merge them based on the following conditions:

For columns that are present in both dataframes, replace NA values with the non-NA values in either dataframe.
If the ID row is absent in the original dataframe (x), then create a new record below.

x <- data.frame(
    ID = c(1,2,3),
    S1 = c(1, NA, NA),
    S2 = c(2, 2, 2)
)

y <- data.frame(
    ID = c(1, 2, 3, 4),
    S1 = c(1, 1, 1, 1),
    S3 = c(3, 3, 3, 3)
)

and I would like to merge these 2 dataframes into a final dataframe that looks like this:

final <- data.frame(
    ID = c(1, 2, 3, 4),
    S1 = c(1, 1, 1, 1),
    S2 = c(2, 2, 2, NA),
    S3 = c(3, 3, 3, 3)
)

I have tried using full_join and bind_rows but to no avail. It seems that doing a full_join would result in the creation of S1.x and S1.y as separate columns, but doesn't replace the missing values.

Is there a way that I can achieve this via tidyverse? If not, would appreciate any other solutions. Thanks!

JohnMount · May 31, 2019, 5:39pm

The concept is called "coalesce" in SQL. One idea is: land both the .x and .y columns and then pick between them. Here is some (admittedly non-tidyverse) code to do this.

library("wrapr")

x <- data.frame(
  ID = c(1,2,3),
  S1 = c(1, NA, NA),
  S2 = c(2, 2, 2)
)

y <- data.frame(
  ID = c(1, 2, 3, 4),
  S1 = c(1, 1, 1, 1),
  S3 = c(3, 3, 3, 3)
)

final <- merge(x, y, by = "ID", all = TRUE)
final$S1 <- final$S1.x %?% final$S1.y

knitr::kable(final)

ID	S1.x	S2	S1.y	S3	S1
1	1	2	1	3	1
2	NA	2	1	3	1
3	NA	2	1	3	1
4	NA	NA	1	3	1

The coalescing behavior is the designed behavior of rqdatatable's "natural join".

library("rqdatatable")
#> Loading required package: rquery

x <- data.frame(
  ID = c(1,2,3),
  S1 = c(1, NA, NA),
  S2 = c(2, 2, 2)
)

y <- data.frame(
  ID = c(1, 2, 3, 4),
  S1 = c(1, 1, 1, 1),
  S3 = c(3, 3, 3, 3)
)

final <- natural_join(x, y, 
                      by = "ID",
                      jointype = "FULL")

knitr::kable(final)

ID	S1	S2	S3
1	1	2	3
2	1	2	3
3	1	2	3
4	1	NA	3

zhanhaopeh · June 1, 2019, 11:55am

Thanks John, this is exactly what I was looking for. However, I have some colnames that has spaces in it which I have used `` to deal with. And I'm getting the following error message when I try to perform the natural_join function:

Error in prepareAlias(alias, strict = strict) : 
  wrapr:let alias value not a valid name: " My Colname with Spaces "

Any idea how I can deal with this error?

JohnMount · June 1, 2019, 1:53pm

Ah, sorry about that. wrapr/rqdatatable are designed only for "tame" column names (not requiring quotes). So these solutions don't apply to your data.

One can try dplyr::full_join() plus dplyr::coalesce() (but be aware dplyr::coalesce() has its own issues: dplyr::coalesce(c(NA,NA), c(2, 3)) errors out, and you have to successfully guess what the changed column names are going to be):

library("dplyr")
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union

x <- data.frame(
  ID = c(1,2,3),
  S1 = c(1, NA, NA),
  S2 = c(2, 2, 2)
)
colnames(x) <- c("ID", " S1 ", "S2")

y <- data.frame(
  ID = c(1, 2, 3, 4),
  S1 = c(1, 1, 1, 1),
  S3 = c(3, 3, 3, 3)
)
colnames(y) <-c("ID", " S1 ", "S3")

final <- full_join(x, y, by = "ID")
final$S1 <- dplyr::coalesce(final[[" S1 .x"]],  final[[" S1 .y"]])

knitr::kable(final)

ID	S1 .x	S2	S1 .y	S3	S1
1	1	2	1	3	1
2	NA	2	1	3	1
3	NA	2	1	3	1
4	NA	NA	1	3	1

zhanhaopeh · June 4, 2019, 6:03am

Noted, thanks for the clarification. I think I will stick to using natural_join function from rqdatatable and rename my colnames names instead. Cheers!

system · June 11, 2019, 6:03am

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.