How can I find common features between multiple tables?

I have multiple tables with tab separated format (four attached) each with five columns. The first column is the features , the third is the sample_type (although not mentioned in the tables).

TABLE: a.csv

Streptococcaceae 4.17705641901 control 3.47827812404 0.337904019103
Firmicutes_unclassified 4.43447010448 control 3.9217759446 0.0348055639329
Actinomycetaceae 2.51369314622 control 2.16004135478 0.0243610387108
Eubacteriaceae 4.80377020941 control 4.27387393831 0.0845334692571
Tannerellaceae 4.14816223344 control 3.275816707 0.949059680886
Oxalobacteraceae 2.38299415571 test 2.0825523478 0.65669077294
Staphylococcaceae 1.7758703522 test 3.29562599424 0.317310507863
Propionibacteriaceae 1.491161512 test 2.37370336438 0.916511907864
Lactobacillaceae 3.24263012709 test 2.65310263606 0.654720846019
Clostridiales_unclassified 3.33143883197 test 2.68073144255 0.22479909925

TABLE b.csv

Streptococcaceae 4.14574536955 control 3.75857890864 0.0412500165939
Firmicutes_unclassified 4.54285192263 control 3.6456600168 0.545349668011
Eubacteriaceae 4.72053565932 control 4.06108447754 0.449691797969
Pseudomonadaceae 2.47280035866 control 2.20503368624 0.146793087309
Erysipelotrichaceae 3.71662926551 control 3.05668482543 0.939742989577
Victivallaceae 0.527629857442 control 2.74539202586 0.317310507863
Tannerellaceae 4.50079062345 test 3.58357798914 0.820595839755
Veillonellaceae 3.87333100809 test 3.19818181055 0.289192119229
Rikenellaceae 4.9250780149 test 4.35390701567 0.545349668011
Lactobacillaceae 3.0789422455 test 2.71335701105 0.412191111952
Helicobacteraceae 2.96591152623 test 2.78958989566 0.317310507863

TABLE: c.csv

Firmicutes_unclassified 4.3949137452 control 3.80320556213 0.202318749895
Eubacteriaceae 4.67973931478 control 3.77752868209 0.914742321683
Tannerellaceae 4.24480153834 control 3.57432876491 0.0575043321308
Veillonellaceae 4.31037823286 control 3.77586730047 0.488677469577
Burkholderiaceae 0 control 2.38754970657 0.356028985238
Streptococcaceae 3.70445898562 test 2.99074124872 0.976675416504
Proteobacteria_unclassified 3.54646811621 test 3.11297394996 0.0769750871296
Erysipelotrichaceae 4.0459816119 test 3.44639785866 0.526971732416
Sutterellaceae 3.52589117576 test 2.94530632209 0.173128322568
Acidaminococcaceae 4.43174037412 test 3.88725876468 0.174392135504

TABLE: d.csv

Firmicutes_unclassified 4.48268161942 control 4.08546779118 0.0378086594704
Sutterellaceae 3.72334956471 control 3.17665465678 0.369158808941
Proteobacteria_unclassified 3.41146437028 control 3.03389982079 0.195865532789
Tannerellaceae 4.51120731157 control 3.82196492147 0.143867728482
Streptococcaceae 3.14327432538 test 2.56433651799 0.89796929457
Eubacteriaceae 4.80334041206 test 4.25388289196 0.488744120387
Veillonellaceae 4.16830871806 test 3.64943977262 0.411122991069
Pseudomonadaceae 0.591920446836 test 2.86633204137 0.317310507863
Caulobacteraceae 0.848425939934 test 2.67734038571 0.317310507863

Now, I want some script for the following tasks:

  1. I want to find out the features with the same sample_type across all the four files and write them in an output file with their name and sample_type.
  2. I will also like to get the features with the same sample_type among three and two out of four files and write them separately in the same file.
  3. It would be great if the corresponding file name of the common features are mentioned.

Can anyone please help me?


Why bash? In R, it should be possible by reading the table, then using intersect() and table() to find the features of interest, and subsetting the tables accordingly. In bash, you might be able to get that with some awk magic, but that seems exceedingly complicated, R is just better suited for this. Also, this is an R forum...

1 Like

Then, just extract the feature columns of each data frame, make entries unique() if needed, c()oncatenate them together, and make a table(). You can very simply filter the table to keep entries that appear 4 times, or 3 or 2 times, and get the corresponding names() of the table: these are the features of interest. So finally you just have to select the corresponding rows in the original data frames.

Is there a step that doesn't work?

1 Like

Please give me some time and let you know. I am absolutely beginner in R.

Hi @AlexisW, I have used this:

data_frame_1 <- data.frame(read.table("~/Documents/A.csv"))
data_frame_2 <- data.frame(read.table("~/Documents/B.csv"))
data_frame_3 <- data.frame(read.table("~/Documents/C.csv"))
data_frame_4 <- data.frame(read.table("~/Documents/D.csv"))
intersect <- Reduce(intersect, list(data_frame_1$V1, data_frame_2$V1, data_frame_3$V1, data_frame_4$V1))

And this reverts:

> Reduce(intersect, list(data_frame_1$V1, data_frame_2$V1, data_frame_3$V1, data_frame_4$V1))
 [1] "Streptococcaceae"           "Firmicutes_unclassified"    "Eubacteriaceae"             "Tannerellaceae"            
 [5] "Veillonellaceae"            "Erysipelotrichaceae"        "Clostridiaceae"             "Rikenellaceae"             
 [9] "Ruminococcaceae"            "Akkermansiaceae"            "Eggerthellaceae"            "Enterobacteriaceae"        
[13] "Oscillospiraceae"           "Desulfovibrionaceae"        "Lactobacillaceae"           "Clostridiales_unclassified"
[17] "Odoribacteraceae"           "Barnesiellaceae"            "Lachnospiraceae"            "Sutterellaceae"            
[21] "Acidaminococcaceae"         "Prevotellaceae"             "Selenomonadaceae"           "Coriobacteriaceae"         
[25] "Bacteroidaceae"             "Bifidobacteriaceae"

it means, it reverts only the common features in the first column. But I need the common features across the four tables those match with the third column as well.

Oh, missed that. You can simply paste(V1, V3) together, do the filtering on these vectors, and at the end strsplit() it to recover columns V1 and V3.

Hi @AlexisW- I am trying a different approach to address the problem. Probably you may want to help me. I have extracted out the rows with "control" element from each dataframe. Then extracted out only the first column from it. Resulting into only a single column from each dataframe like:

> a_control
23           Oxalobacteraceae
24          Staphylococcaceae
25       Propionibacteriaceae
26           Lactobacillaceae
27 Clostridiales_unclassified
28        Methanobacteriaceae
29           Odoribacteraceae
30           Peptoniphilaceae
31            Barnesiellaceae
32            Lachnospiraceae
33           Leuconostocaceae
34             Sutterellaceae
35         Acidaminococcaceae
36             Prevotellaceae
37           Selenomonadaceae
38          Coriobacteriaceae
39             Bacteroidaceae
40         Bifidobacteriaceae

I have 3 more dataframes like the one above. They are b_control, c_control, d_control. Now I want to make a single table from these three. That means, a table with four columns will be formed. Now, from this table I will like to get the elements which are duplicated 4 times, 3 times and 2 times in the dataframe. I think this will make the problem much easier.
So, can you tell me how can I proceed from this step i.e., how to make a single table from these 4 dataframes and find the elements with those frequencies?

Thanks a lot for your replies,

This topic was automatically closed 21 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.

H sorry @AlexisW, I mistakenly wrote about bash. I just need some R solution only.