Matching data frames of different lenghts by rows of unique entries.

Hello, I have these 4 data frames below. I want to match the entries from the column genes_brown_GO to the entries from the column GeneSymbol in each of the other data frames. genes_brown_GO column is a small list present in the GeneSymbol columns of the other data frames, so it's like I want to sort out the genes_brown_GO column from the other data frames by rows (so keeping the <chr> column, so the GeneSymbol, with the <dbl> column, so the log2FoldChange values).

Here the sample data from 'genes_brown_GO':

    # A tibble: 159 x 1
   genes_brown_GO$`genes_brown_ GO`
   <chr>                           
 1 Actr2                           
 2 Adipor2                         
 3 Adrm1                           
 4 Agfg1                           
 5 Alcam                           
 6 Alg3                            
 7 Anxa2                           
 8 Aoah                            
 9 Ap1g1                           
10 Apod                            
# ... with 149 more rows

The sample data from 'IMQvsLAL_6h_protein_coding_log2FC':
     # A tibble: 16,341 x 1
   IvsL_6h_readable$GeneSymbol $log2FoldChange
   <chr>                                 <dbl>
 1 March1                               1.34  
 2 Marc1                               -2.03  
 3 March2                               0.148 
 4 Marc2                               -0.0372
 5 Ackr1                                2.30  
 6 Aldoa                                0.320 
 7 Aldoa                               -1.08  
 8 Ankhd1                               0.501 
 9 Arhgef2                              0.593 
10 Clec2d                               1.25  
# ... with 16,331 more rows

The sample data from 'IMQvsLAL_16h_protein_coding_log2FC':
    # A tibble: 28,361 x 1
   IvsL_16h_readable$GeneSymbol $log2FoldChange
   <chr>                                  <dbl>
 1 Marc1                                 -1.23 
 2 March1                                -1.91 
 3 Marc2                                  0.178
 4 March2                                -0.795
 5 1700030C10Rik                          0.794
 6 1700030C10Rik                         -0.413
 7 4930594M22Rik                         -0.659
 8 4930594M22Rik                         -1.37 
 9 Actr2                                  0.561
10 Aldoa                                  1.14 
# ... with 28,351 more rows

The sample data from 'IMQvsLAL_16h_protein_coding_log2FC':
    A tibble: 16,341 x 1
   IvsI_readable$GeneSymbol $log2FoldChange
   <chr>                              <dbl>
 1 March1                            0.421 
 2 Marc1                            -1.37  
 3 Marc2                            -0.129 
 4 March2                           -0.788 
 5 Actr2                            -0.957 
 6 Aldoa                             0.735 
 7 Aldoa                             0.0730
 8 Anxa2                            -1.54  
 9 Ap1g1                            -0.259 
10 Atad3a                            0.363 
# ... with 16,331 more rows

This is what I tried so far by matching 'genes_brown_GO' with each one of the above data frames (just an example):

    library(dplyr)
> IvsL_6h_merged <- merge(genes_brown_GO, IMQvsLAL_6h_protein_coding_log2FC)

This is what came out:

    # A tibble: 2,598,219 x 1
   IvsL_6h_merged$`genes_brown_ GO` $GeneSymbol $log2FoldChange
   <chr>                            <chr>                 <dbl>
 1 Actr2                            March1                 1.34
 2 Adipor2                          March1                 1.34
 3 Adrm1                            March1                 1.34
 4 Agfg1                            March1                 1.34
 5 Alcam                            March1                 1.34
 6 Alg3                             March1                 1.34
 7 Anxa2                            March1                 1.34
 8 Aoah                             March1                 1.34
 9 Ap1g1                            March1                 1.34
10 Apod                             March1                 1.34
# ... with 2,598,209 more rows

As you can see, this is not what I was expecting. How can I solve this?

Thank you so much for your willingness!

Welcome to the community!

I believe the function is behaving as expected. It creates the Cartesian product, as you are not using the by argument and there are no common column names. This argument (and its friends by.x and by.y) helps to specify the columns with respect to which you are combining.

If you want to get more specific help, can you please share a small part of your dataset in a copy paste friendly way? Sharing dput(head(<your_data_frame_name>)) should be sufficient in this case.

Hope this helps.

(A side note: You don't need dplyr to use merge.)

Thank you a lot @Yarnabrina for the reply and the welcome. I add below the parts of datasets.
Good tip for dplyr!

dput() for 'genes_brown_GO':

structure(list(`genes_brown_ GO` = c("Actr2", "Adipor2", "Adrm1", 
"Agfg1", "Alcam", "Alg3")), row.names = c(NA, 6L), class = "data.frame")

dput() for 'IMQvsLAL_6h_protein_coding_log2FC' (i think just one of the other three data frames is enough :wink: ):

structure(list(GeneSymbol = c("March1", "Marc1", "March2", "Marc2", 
"Ackr1", "Aldoa"), log2FoldChange = c(1.34127129458368, -2.02525311416417, 
0.148269792854153, -0.0372489280122111, 2.29790214816817, 0.320027338944237
)), row.names = c(NA, 6L), class = "data.frame")

Could you tell me then the syntax to use the argument by.x and by.y in the entire code of merge please?

I tried to use the argument by in this way (I changed the colname in genes_brown_GO with GeneSymbol, so the columns have the same header:
IvsL_6h_merged <- merge(genes_brown_GO, IMQvsLAL_6h_protein_coding_log2FC, by.GeneSymbol)

But it didn't work. It gives me this error:
Error in fix.by(by.x, x) : object 'by.GeneSymbol' not found

Really kind!

Thanks a lot @Yarnabrina, I think I solved! A random friend of mine popped up and looking at the code he suggested sth similar to yours! :rofl::rofl: (crazy guy!)

IvsL_6h_merged <- merge(genes_brown_GO, IvsL_6h_pc, by="GeneSymbol")
1 Like

With the data you shared, this code won't work. That's because "GeneSymbol" is not available in both the datasets.

I actually changed the GeneSymbol in all the datasets to make it work :wink:

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.