Merge with multiple reference tables

MyKo101 · September 29, 2021, 11:50am

I have two reference tables ,ref1 and ref2. and a dataset, df. All are data.tables

ref1 has variables "V1", "V2" and "x"
ref2 has variables "V1" and "x"
df has variables "V1" and "V2"

I want to extract the values of "x" from ref1 and ref2 based on the appropriate values for "V1" and "V2". There is no overlap between "V1" values in ref1 and ref2.

Some MWE data:

ref1 <- data.table(
  V1 = c("a","a","a","b","b","b","e","e"),
  V2 = c("1","2","3","4","5","6","10","11"),
  x = c(43,66,12,19,92,30,44,56)
)

ref2 <- data.table(
  V1 = c("c","d"),
  x = c(100,200)
)

df <- data.table(
  V1 = c("a","b","c","d","f"),
  V2 = c("2","4","7","9","14")
)

target <- data.table(
  V1 = c("a","b","c","d","f"),
  V2 = c("2","4","7","9","14"),
  x = c(66,19,100,200,NA)
)

jrmuirhead · September 29, 2021, 1:04pm

Hi @MyKo101 ,
Nice MWE. I'm sure there's a simpler way, but two joins of ref1 and ref2 with df and then finding the non-NA values via coalesce works.

library("dplyr")
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library("data.table")
#> 
#> Attaching package: 'data.table'
#> The following objects are masked from 'package:dplyr':
#> 
#>     between, first, last

ref1 <- data.table(
  V1 = c("a","a","a","b","b","b","e","e"),
  V2 = c("1","2","3","4","5","6","10","11"),
  x = c(43,66,12,19,92,30,44,56)
)

ref2 <- data.table(
  V1 = c("c","d"),
  x = c(100,200)
)

df <- data.table(
  V1 = c("a","b","c","d","f"),
  V2 = c("2","4","7","9","14")
)


df1 <- ref1[df, on = c("V1", "V2")]
df2 <- ref2[df, on = c("V1")]

(df3 <- df1[df2, on = c("V1", "V2")] %>%
  mutate(x = coalesce(x, i.x)) %>%
  select(V1, V2, x))
#>    V1 V2   x
#> 1:  a  2  66
#> 2:  b  4  19
#> 3:  c  7 100
#> 4:  d  9 200
#> 5:  f 14  NA

target <- data.table(
  V1 = c("a","b","c","d","f"),
  V2 = c("2","4","7","9","14"),
  x = c(66,19,100,200,NA)
)

identical(df3, target)
#> [1] TRUE

^{Created on 2021-09-29 by the reprex package (v2.0.1)}

system · October 20, 2021, 1:04pm

This topic was automatically closed 21 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.