find matches between two groups of variables

Espanino · June 12, 2022, 2:48pm

Hi everybody

I would like to find matches between two groups of variables and create a binary variable that indicates whether such matches exist or not.

Example dataset:

df <- data.frame(groupA_1 = c(1, 2, 1, 2, 1),
                 groupA_2 = c(2, 3, 2, 3, 2),
                 groupA_3 = c(3, 4, 5, 6, 7),
                 groupB_1 = c(1, 8, 3, 2, 1),
                 groupB_2 = c(10, 11, 12, 13, 14),
                 groupB_3 = c(15, 16, 17, 18, 19))

If any of the groupA-variables have a match in any of the groupB-Variables, the new column match is assigned the value 1, otherwise the value is 0.

The final dataset should look like this:

  groupA_1 groupA_2 groupA_3 groupB_1 groupB_2 groupB_3 match
1        1        2        3        1       10       15     1
2        2        3        4        8       11       16     0
3        1        2        5        3       12       17     0
4        2        3        6        2       13       18     1
5        1        2        7        1       14       19     1

My attempt to achieve this looks something like this:

df <- df %>%
  mutate(match = case_when((groupA_1 == groupB_1 |
                             groupA_1 == groupB_2 |
                             groupA_1 == groupB_3) |
                             (groupA_2 == groupB_1 |
                                groupA_2 == groupB_2 |
                                groupA_2 == groupB_3) |
                             (groupA_3 == groupB_1 |
                                groupA_3 == groupB_2 |
                                groupA_3 == groupB_3)) ~ 1,
         TRUE ~ 0)

Apart from this not working (I have not looked too much into why not, since I do not intend to keep the code in anyway close to this if at all possible), its also extremely susceptible to mistakes and terribly inefficient.

My actual dataset has a lot more variable groups and I would have to do this repeatedly for all of them. I feel like there should be an easier and more efficient way to achieve this. Does anyone have a solution?

Thank you in advance.

FJCC · June 12, 2022, 3:36pm

Here is one solution. It returns TRUE and FALSE instead of 1 and 0. You can wrap the any() in as.numeric if you need 1 and 0.

library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
df <- data.frame(groupA_1 = c(1, 2, 1, 2, 1),
                 groupA_2 = c(2, 3, 2, 3, 2),
                 groupA_3 = c(3, 4, 5, 6, 7),
                 groupB_1 = c(1, 8, 3, 2, 1),
                 groupB_2 = c(10, 11, 12, 13, 14),
                 groupB_3 = c(15, 16, 17, 18, 19))
df <- df |> rowwise() |> 
  mutate(match=any(c_across(groupA_1:groupA_3) %in% 
                     c_across(groupB_1:groupB_3)))
df
#> # A tibble: 5 x 7
#> # Rowwise: 
#>   groupA_1 groupA_2 groupA_3 groupB_1 groupB_2 groupB_3 match
#>      <dbl>    <dbl>    <dbl>    <dbl>    <dbl>    <dbl> <lgl>
#> 1        1        2        3        1       10       15 TRUE 
#> 2        2        3        4        8       11       16 FALSE
#> 3        1        2        5        3       12       17 FALSE
#> 4        2        3        6        2       13       18 TRUE 
#> 5        1        2        7        1       14       19 TRUE

^{Created on 2022-06-12 by the reprex package (v2.0.1)}

Espanino · June 20, 2022, 1:01pm

Thank you very much ! The problem with this solution is that it also matches NA. If there is an NA in any of the columns groupA_1:groupA_3 and an NA in any of the columns groupB_1:GroupB_3, the output will be TRUE. I want this to be FALSE instead, is there a simple fix?

FJCC · June 20, 2022, 1:48pm

I looked up the help section of the %in% function and wrote a modified version that ignores NA, I think.

library(dplyr)

df <- data.frame(groupA_1 = c(1, NA, 1, 2, 1),
                 groupA_2 = c(2, 3, 2, 3, 2),
                 groupA_3 = c(3, 4, 5, 6, 7),
                 groupB_1 = c(1, 8, 3, 2, 1),
                 groupB_2 = c(10, NA, 12, 13, 14),
                 groupB_3 = c(15, 16, 17, 18, 19))

MyMatch <- function(x, table) match(x, table, nomatch = 0,
                                    incomparables = NA) > 0

df <- df |> rowwise() |> 
  mutate(Match=any(MyMatch(c_across(groupA_1:groupA_3), 
                     c_across(groupB_1:groupB_3))))
df
#> # A tibble: 5 x 7
#> # Rowwise: 
#>   groupA_1 groupA_2 groupA_3 groupB_1 groupB_2 groupB_3 Match
#>      <dbl>    <dbl>    <dbl>    <dbl>    <dbl>    <dbl> <lgl>
#> 1        1        2        3        1       10       15 TRUE 
#> 2       NA        3        4        8       NA       16 FALSE
#> 3        1        2        5        3       12       17 FALSE
#> 4        2        3        6        2       13       18 TRUE 
#> 5        1        2        7        1       14       19 TRUE

^{Created on 2022-06-20 by the reprex package (v2.0.1)}

Espanino · June 21, 2022, 7:59am

You're fantastic, thank you so much!

system · June 28, 2022, 8:00am

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.