Why does NA match NA when joining two dataframes?

ablack3 · April 16, 2019, 8:19pm

Since NA is not unique (i.e. one NA does not necessarily equal another NA) it seems to follow that when performing a join the rows should not match when both keys are NA. However when merging with dplyr NA will match NA when merging. What is the reasoning behind this behavior?

library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
df1 <- tibble(key = c("A", "A", NA), value = 1:3)
df2 <- tibble(key = c("B", "B", NA), value = 4:6)
inner_join(df1, df2, by = "key")
#> # A tibble: 1 x 3
#>   key   value.x value.y
#>   <chr>   <int>   <int>
#> 1 <NA>        3       6

^{Created on 2019-04-16 by the reprex package (v0.2.1)}

Yarnabrina · April 17, 2019, 12:12am

I don't know the answer, but I think it's not only for dplyr. Here's a part from the documentation of match:

Exactly what matches what is to some extent a matter of definition. For all types, NA matches NA and no other value. For real and complex values, NaN values are regarded as matching any other NaN value, but not matching NA , where for complex x , real and imaginary parts must match both (unless containing at least one NA ).

You may be interested in the following SO thread:

nwerth · April 17, 2019, 1:32pm

From the SO thread posted by Yarnabrina, you can also see a way base R allows preventing NA from matching itself. dplyr has something similar (detailed further below in this post). The match function has an incomparables = parameter, so we can do this:

match(NA, NA)
# [1] 1
match(NA, NA, incomparables = NA)
# [1] NA

The base function merge has the same argument:

d1 <- data.frame(x = c(1, NA), y = c("a", "b"))
d2 <- data.frame(x = c(1, NA), z = c("A", "B"))
merge(d1, d2)
#    x y z
# 1  1 a A
# 2 NA b B
merge(d1, d2, incomparables = NA)
#   x y z
# 1 1 a A

The problem is, incomparables isn't allowed for multi-column merges:

d1 <- data.frame(x1 = c(1, NA), x2 = c(TRUE, NA), y = c("a", "b"))
d2 <- data.frame(x1 = c(1, NA), x2 = c(TRUE, NA), z = c("A", "B"))
merge(d1, d2, incomparables = NA)
# Error in merge.data.frame(d1, d2, incomparables = NA) : 
#   'incomparables' is supported only for merging on a single column

For dplyr, the *_join functions can take the na_matches = parameter. From the join.tbl_df documentation:

na_matches

Use "never" to always treat two NA or NaN values as different, like joins for database sources, similarly to merge(incomparables = FALSE). The default, "na", always treats two NA or NaN values as equal, like merge(). Users and package authors can change the default behavior by calling pkgconfig::set_config("dplyr::na_matches" = "never").

sharlagelfand · April 17, 2019, 2:00pm

oh, this is AWESOME. thanks so much for pointing this out @nwerth.

Here's a look of how it looks/can be used in dplyr:

library(dplyr, warn.conflicts = FALSE)

x <- tribble(
  ~a, ~b,
  1, NA,
  2, 2
)

y <- tribble(
  ~b, ~c,
  NA, 5,
  2, 3
)

x %>% inner_join(y, by = "b")
#> # A tibble: 2 x 3
#>       a     b     c
#>   <dbl> <dbl> <dbl>
#> 1     1    NA     5
#> 2     2     2     3

x %>% inner_join(y, by = "b", na_matches = "never")
#> # A tibble: 1 x 3
#>       a     b     c
#>   <dbl> <dbl> <dbl>
#> 1     2     2     3

system · May 8, 2019, 2:00pm

This topic was automatically closed 21 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.