How to find common rows (considering vice versa format) of 2 dataframe

Hello,

I want to find the common rows between 2 dataframe. To find the common rows, I can use inner_join(), semi_join(), and merge(). But, these operations are not fulfilling my purposes. Because my data in the dataframe is a little different!

Sometimes, the data in the dataframe can be vise versa. Like the 3rd and 5th rows of dataframe-1 and dataframe-2. Dataframe-1 contains A3 A1 0.75 but Dataframe-2 contains A1 A3 0.75 . I would like to take these 2 rows as the same.

My first dataframe looks like

  query target weight
1    A1     A2   0.60
2    A2     A5   0.50
3    A3     A1   0.75
4    A4     A5   0.88
5    A5     A3   0.99
6    (+)-1(10),4-Cadinadiene     Falcarinone-10     0.09
7    Leucodelphinidin-100    (+)-1(10),4-Cadinadiene     0.876
8    Lignin  (2E,7R,11R)-2-Phyten-1-ol   0.778
9    (2E,7R,11R)-2-Phyten-1-ol   Leucodelphinidin    0.55
10   Falcarinone     Lignin  1
11   A1  (+)-1(10),4-Cadinadiene     1
12   A2  Lignin-10  1
13   A3  (2E,7R,11R)-2-Phyten-1-ol   1
14   Falcarinone  A6    1
15   A4  Leucodelphinidin    1
16   A4  Leucodelphinidin    1
17   Falcarinone  A100    1
18   A4  Falcarinone     1

the second dataframe looks like

  query target
1    A1     A2   
2    A2     A5   
3    A1     A3  // Missing in the output
4    A4     A5   
5    A3     A5  // Missing in the output
6    A3  (2E,7R,11R)-2-Phyten-1-ol   
7    (+)-1(10),4-Cadinadiene     Falcarinone    
8    Leucodelphinidin    (+)-1(10),4-Cadinadiene-100    
9    Lignin-2  (2E,7R,11R)-2-Phyten-1-ol   
10   A11  (+)-1(10),4-Cadinadiene    
11   A2  Lignin  
12   A3  (2E,7R,11R)-2-Phyten-1-0l 
13   Falcarinone  A60    
14   A4  Leucodelphinidin  // Missing in the output

The code I am using
output <- semi_join(Dataframe-1, Dataframe-2) OR
output <- inner_join(df_only_dd, sample_data_dd_interaction)

The output I am getting

  query target weight
1    A1     A2   0.60
2    A2     A5   0.50

But, my expected output is like this

  query target weight
1    A1     A2   0.60
2    A2     A5   0.50
3    A3     A1   0.75
4    A4     A5   0.88
5    A5     A3   0.99
6    A4  Leucodelphinidin  1

Reproducible code is given below

df_1 <- read.table(text="query   target     weight
A1  A2  0.6
A2  A5  0.5
A3  A1  0.75
A4  A5  0.88
A5  A3  0.99
(+)-1(10),4-Cadinadiene     Falcarinone     0.09
Leucodelphinidin    (+)-1(10),4-Cadinadiene     0.876
Lignin  (2E,7R,11R)-2-Phyten-1-ol   0.778
(2E,7R,11R)-2-Phyten-1-ol   Leucodelphinidin    0.55
Falcarinone     Lignin  1
A1  (+)-1(10),4-Cadinadiene     1
A2  Lignin  1
A3  (2E,7R,11R)-2-Phyten-1-ol   1
Falcarinone  A6    1
A4  Leucodelphinidin    1
A4  Leucodelphinidin    1
Falcarinone  A100    1
A5  Falcarinone     1", header=TRUE)
df_2 <- read.table(text="query   target
A1  A2 
A2  A5
A1  A3  
A4  A5  
A3  A5  
(+)-1(10),4-Cadinadiene     Falcarinone    
Leucodelphinidin    (+)-1(10),4-Cadinadiene-100    
Lignin-2  (2E,7R,11R)-2-Phyten-1-ol   
A11  (+)-1(10),4-Cadinadiene    
A2  Lignin  
A3  (2E,7R,11R)-2-Phyten-1-0l 
Falcarinone  A6    
A4  Leucodelphinidin  ", header=TRUE)

Any kind of suggestion is appreciated.

suppressPackageStartupMessages({
  library(dplyr)
})
df_1 <- read.table(text="query   target     weight
A1  A2  0.6
A2  A5  0.5
A3  A1  0.75
A4  A5  0.88
A5  A3  0.99", header=TRUE)

df_2 <- read.table(text="query   target     weight
A1  A2  0.6
A2  A5  0.5
A1  A3  0.75
A4  A5  0.88
A3  A5  0.99", header=TRUE)

# target
df_3 <- read.table(text = "query target weight
A1     A2   0.60
A2     A5   0.50
A3     A1   0.75
A4     A5   0.88
A5     A3   0.99", header = TRUE)

header <- colnames(df_1)
df_4 <- inner_join(df_1,df_2,by = "weight") %>%
  select(-query.y,-target.y)
colnames(df_4) <- header

df_4
#>   query target weight
#> 1    A1     A2   0.60
#> 2    A2     A5   0.50
#> 3    A3     A1   0.75
#> 4    A4     A5   0.88
#> 5    A5     A3   0.99

df_4 == df_3
#>      query target weight
#> [1,]  TRUE   TRUE   TRUE
#> [2,]  TRUE   TRUE   TRUE
#> [3,]  TRUE   TRUE   TRUE
#> [4,]  TRUE   TRUE   TRUE
#> [5,]  TRUE   TRUE   TRUE

Created on 2021-01-02 by the reprex package (v0.3.0.9001)

Hi @akib62, can you please share data where some rows will drop, and your expected result in that case? For now, left table seems to be same as output table.

@technocrat joined on weight, and based on your inputs/outputs that seem to give correct results. But it seems a little odd to me that you want that. I may be completely wrong though, and if that's the case, I'm sorry.

1 Like

@technocrat thank you very much. But, your code is not working for me. This is actually my fault. Because my actual dataframe contains different kinds of data and I gave a little portion (more specifically) similar kinds of data in my previous post. However, I updated the dataframes and please look at it now and give your suggestion.

My second dataframe also contains Null for some weight. So, in my understanding I can not use weight in the code.

I have deleted the weight columns from the df_2.

@Yarnabrina Thanks for the comment. It seems that I made confused after using the same kind of data from the dataframes. However, now I made changes in my dataframe. Please, look now and give your suggestions.

Are you trying to conform df_2 (column marked query) to df_1 except for the weights in df_2, which should remain in the df_2_rev table?

> anti_join(df_2,std)
Joining, by = c("query", "target")
             query                      target
1               A1                          A3
2               A3                          A5
3 Leucodelphinidin (+)-1(10),4-Cadinadiene-100
4         Lignin-2   (2E,7R,11R)-2-Phyten-1-ol
5              A11     (+)-1(10),4-Cadinadiene
6               A3   (2E,7R,11R)-2-Phyten-1-0l

At some point Null should be either excluded or imputed if weight is to be used in calculations.

@technocrat thanks for the reply and sorry for my late reply.

I am trying to 'match' the query and target of df_1 with df_2. Most importantly, if I get any value in the df_1 which is reverse in the df_2 (like the 3rd and 5th row of both dataframe) I want to count them as a match pair. After matching, I want to save this pair from the df_1 including the weight value not from the df_2.

In the df_2_rev I just deleted the weight column. Now, we can think that df_1 contains 3 columns but df_2 contains 2 columns.

However, what is the std in your second solution?

Let me know your thoughts.

Perhaps its convenient, to stipulate an order for query / target names , at least for purposes of joining.
something like this:


library(tidyverse)
df_1_x <- mutate(rowwise(df_1),
       name_1 = head(sort(c(query,target)),1),
       name_2 = tail(sort(c(query,target)),1))%>% ungroup

df_2_x <- mutate(rowwise(df_2),
       name_1 = head(sort(c(query,target)),1),
       name_2 = tail(sort(c(query,target)),1)) %>% ungroup

#match regardless of order
inner_join(df_1_x,df_2_x,
           by=c("name_1","name_2"))

#unmatched 1
anti_join(df_1_x,df_2_x,
          by=c("name_1","name_2"))

#unmatched 2
anti_join(df_2_x,df_1_x,
          by=c("name_1","name_2"))

std in this case is df_1; not sure why I chose that name. See @nirgrahamuk's post for an implementation

This topic was automatically closed 21 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.