How to get common (vice versa) and uncommon rows from two df in R

,

I have 2 dataframe. Now, I want to get the common and uncommon rows of this two dataframe.

Here, common rows can be vice versa. This means I am searching for common between the two datasets that can have changed in the order of matching columns. For example, row number 2 of the df_1 is ID_2 ID_3 and df_2 is ID_3 ID_2. They are not common in terms of column value but they are common if we do not consider the column value.

After getting the common and uncommon I want to add another column label as 1 for common and 0 for uncommon.

The code I am using

result <- df_2 %>% 
  dplyr::inner_join(df_1, by = c("nodeA" = "query", "nodeB" = "target")) %>% 
  dplyr::mutate(GROUP = 1) %>% 
  dplyr::union(df_2 %>% 
                 dplyr::inner_join(df_1, by = c("nodeB" = "query", "nodeA" = "target")) %>% 
                 dplyr::mutate(GROUP = 2)) %>%
  mutate(label=1)

The output I am getting

  nodeA nodeB new_ssp  ssp GROUP label
1  ID_1  ID_2    0.50 0.50     1     1
2  ID_3  ID_4    0.80 0.80     1     1
3  ID_3  ID_2    0.90 0.90     2     1
4  ID_1  ID_6    0.09 0.09     2     1

Expected output (here, ? means I don't know. It can be 1 or 2 does not matter)

  nodeA nodeB new_ssp  ssp GROUP label
1  ID_1  ID_2    0.50 0.50     1     1
2  ID_3  ID_4    0.80 0.80     1     1
3  ID_3  ID_2    0.90 0.90     2     1
4  ID_1  ID_6    0.09 0.09     2     1
5  ID_4  ID_5    0.70 0.70     ?     0
6  ID_40  ID_50    0.70 0.70     ?     0
7  ID_5  ID_3    0.50 0.50     ?     0
8  ID_15  ID_31    0.50 0.50     ?     0

Reproducible Data

query <- c("ID_1", "ID_2", "ID_3", "ID_4", "ID_5", "ID_6")
target <- c("ID_2", "ID_3", "ID_4", "ID_5", "ID_3", "ID_1")
ssp <- c(0.5, 0.9, 0.8, 0.7, 0.5, 0.09)
df_1 <- data.frame(query, target, ssp)


nodeA <- c("ID_1", "ID_3", "ID_3", "ID_40", "ID_15", "ID_1")
nodeB <- c("ID_2", "ID_2", "ID_4", "ID_50", "ID_31", "ID_6")
new_ssp <- c(0.5, 0.9, 0.8, 0.7, 0.5, 0.09)
df_2 <- data.frame(nodeA, nodeB, new_ssp)

Here's an approach that get's you close. We create a combined ID of the sorted values from two nodes, and use that to group and then summarise. The warnings are due to the max function being used, with no values and returning -Inf. That feels ok for this situation, but you could use mutate(ssp = if_else(ssp == -Inf, new_ssp, ssp), new_ssp = if_else(new_ssp == -Inf, ssp, new_ssp )) to get your output.

library(tidyverse)

query <- c("ID_1", "ID_2", "ID_3", "ID_4", "ID_5", "ID_6")
target <- c("ID_2", "ID_3", "ID_4", "ID_5", "ID_3", "ID_1")
ssp <- c(0.5, 0.9, 0.8, 0.7, 0.5, 0.09)
df_1 <- tibble(query, target, ssp)

nodeA <- c("ID_1", "ID_3", "ID_3", "ID_40", "ID_15", "ID_1")
nodeB <- c("ID_2", "ID_2", "ID_4", "ID_50", "ID_31", "ID_6")
new_ssp <- c(0.5, 0.9, 0.8, 0.7, 0.5, 0.09)
df_2 <- tibble(nodeA, nodeB, new_ssp)

# rename df_1 columns to match df_2
df_1 <- df_1 %>% rename(nodeA = query, nodeB = target)

# combine all rows into one dataframe
df <- bind_rows(df_1, df_2)

# function to sort, then combine 2 strings
str_c_sort <- function(x, y){
  c(x, y) %>% 
    sort() %>% 
    str_c(collapse = "-")
}

# use function to create a nodes column.
# group by the nodes column, and then summarise
# take the first nodes for the node columns
df %>% 
  mutate(nodes = map2_chr(nodeA, nodeB, str_c_sort)) %>% 
  group_by(nodes) %>% 
  summarise(nodeA = first(nodeA),
            nodeB = first(nodeB),
            ssp = max(ssp, na.rm = TRUE),
            new_ssp = max(new_ssp, na.rm = TRUE),
            label = if_else(n() > 1, 1, 0))
#> Warning in max(ssp, na.rm = TRUE): no non-missing arguments to max; returning -
#> Inf

#> Warning in max(ssp, na.rm = TRUE): no non-missing arguments to max; returning -
#> Inf
#> Warning in max(new_ssp, na.rm = TRUE): no non-missing arguments to max;
#> returning -Inf

#> Warning in max(new_ssp, na.rm = TRUE): no non-missing arguments to max;
#> returning -Inf
#> # A tibble: 8 x 6
#>   nodes       nodeA nodeB     ssp new_ssp label
#>   <chr>       <chr> <chr>   <dbl>   <dbl> <dbl>
#> 1 ID_1-ID_2   ID_1  ID_2     0.5     0.5      1
#> 2 ID_1-ID_6   ID_6  ID_1     0.09    0.09     1
#> 3 ID_15-ID_31 ID_15 ID_31 -Inf       0.5      0
#> 4 ID_2-ID_3   ID_2  ID_3     0.9     0.9      1
#> 5 ID_3-ID_4   ID_3  ID_4     0.8     0.8      1
#> 6 ID_3-ID_5   ID_5  ID_3     0.5  -Inf        0
#> 7 ID_4-ID_5   ID_4  ID_5     0.7  -Inf        0
#> 8 ID_40-ID_50 ID_40 ID_50 -Inf       0.7      0

# or a more concise version using across
df %>% 
  mutate(nodes = map2_chr(nodeA, nodeB, str_c_sort)) %>% 
  group_by(nodes) %>% 
  summarise(across(c(nodeA, nodeB), first),
            across(c(ssp, new_ssp), ~max(.x, na.rm = TRUE)),
            label = if_else(n() > 1, 1, 0))
#> Warning in max(.x, na.rm = TRUE): no non-missing arguments to max; returning -
#> Inf
#> Warning in max(.x, na.rm = TRUE): no non-missing arguments to max; returning -
#> Inf

#> Warning in max(.x, na.rm = TRUE): no non-missing arguments to max; returning -
#> Inf

#> Warning in max(.x, na.rm = TRUE): no non-missing arguments to max; returning -
#> Inf
#> # A tibble: 8 x 6
#>   nodes       nodeA nodeB     ssp new_ssp label
#>   <chr>       <chr> <chr>   <dbl>   <dbl> <dbl>
#> 1 ID_1-ID_2   ID_1  ID_2     0.5     0.5      1
#> 2 ID_1-ID_6   ID_6  ID_1     0.09    0.09     1
#> 3 ID_15-ID_31 ID_15 ID_31 -Inf       0.5      0
#> 4 ID_2-ID_3   ID_2  ID_3     0.9     0.9      1
#> 5 ID_3-ID_4   ID_3  ID_4     0.8     0.8      1
#> 6 ID_3-ID_5   ID_5  ID_3     0.5  -Inf        0
#> 7 ID_4-ID_5   ID_4  ID_5     0.7  -Inf        0
#> 8 ID_40-ID_50 ID_40 ID_50 -Inf       0.7      0

Created on 2021-04-01 by the reprex package (v1.0.0)

1 Like

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.