Should `_join()` functions match NA values by default?

alexzajichek · June 9, 2023, 3:05pm

I just realized that join functions match NA values by default. This was a surprise to me as I would have never expected this to be default behavior. Thoughts?

dplyr::left_join(
  x = tibble::tibble(a = c(1, 2, NA)),
  y = tibble::tibble(
    a = c(1, NA),
    b = c("val1", "val2")
  ),
  by = "a"
)
#> # A tibble: 3 × 2
#>       a b    
#>   <dbl> <chr>
#> 1     1 val1 
#> 2     2 <NA> 
#> 3    NA val2

dplyr::left_join(
  x = tibble::tibble(a = c(1, 2, NA)),
  y = tibble::tibble(
    a = c(1, NA),
    b = c("val1", "val2")
  ),
  by = "a",
  na_matches = "never"
)
#> # A tibble: 3 × 2
#>       a b    
#>   <dbl> <chr>
#> 1     1 val1 
#> 2     2 <NA> 
#> 3    NA <NA>

sessionInfo()
#> R version 4.2.1 (2022-06-23 ucrt)
#> Platform: x86_64-w64-mingw32/x64 (64-bit)
#> Running under: Windows 10 x64 (build 22000)
#> 
#> Matrix products: default
#> 
#> locale:
#> [1] LC_COLLATE=English_United States.utf8 
#> [2] LC_CTYPE=English_United States.utf8   
#> [3] LC_MONETARY=English_United States.utf8
#> [4] LC_NUMERIC=C                          
#> [5] LC_TIME=English_United States.utf8    
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> loaded via a namespace (and not attached):
#>  [1] fansi_1.0.4      utf8_1.2.3       dplyr_1.1.2      digest_0.6.31   
#>  [5] withr_2.5.0      R6_2.5.1         lifecycle_1.0.3  magrittr_2.0.3  
#>  [9] reprex_2.0.2     evaluate_0.20    pillar_1.9.0     rlang_1.1.1     
#> [13] cli_3.6.0        rstudioapi_0.14  fs_1.6.1         generics_0.1.3  
#> [17] vctrs_0.6.2      rmarkdown_2.20   tools_4.2.1      glue_1.6.2      
#> [21] xfun_0.37        yaml_2.3.7       fastmap_1.1.0    compiler_4.2.1  
#> [25] pkgconfig_2.0.3  htmltools_0.5.4  tidyselect_1.2.0 knitr_1.42      
#> [29] tibble_3.2.1

^{Created on 2023-06-09 with reprex v2.0.2}

nirgrahamuk · June 9, 2023, 3:11pm

whatever is documented is fine, and the user can control what should happen, I see no issues here.
The documentation itself partially justifies this decision;

"na", the default, treats two NA or two NaN values as equal, like %in%, match(), and merge().

"never" treats two NA or two NaN values as different, and will never match them together or to any other values. This is similar to joins for database sources and to base::merge(incomparables = NA).

alexzajichek · June 9, 2023, 3:47pm

Fair enough. I was just thinking it might make sense for the default behavior to match what would be expected, but maybe I’m the only one that would expect it.

I agree it’s all in the documentation, mostly just a convenience thing to only have to use an extra argument in the exception cases where I want to force NA values to be treated as the same, since in general they are not (necessarily) equal.

nirgrahamuk · June 9, 2023, 3:54pm

You can define your own functions and have them use different defaults. Here is an example of that.

library(purrr)
myleft_join <- partial(.f=dplyr::left_join,na_matches="never")

myleft_join(data.frame(a=NA),data.frame(a=NA,b="b"))
left_join(data.frame(a=NA),data.frame(a=NA,b="b"))

alexzajichek · June 9, 2023, 6:45pm

Thanks. Certainly agree there are technical work arounds , no problem there. Just caught up in the presumption that NA values are equal unless explicitly told otherwise, whereas the the opposite would be the behavior I’d expect. Less technical, more of a design thing

system · June 30, 2023, 6:46pm

This topic was automatically closed 21 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.