Hi,
I have a df with two columns:
Column 1 - gene code
Column 2 - gene description. The gene symbol appears within the description, in parentheses (). In addition, when two rows share the same symbol (two similar genes), the description includes the string " variant # ".
I've been asked to create a new df whose first column contains the gene symbol and whose second column contains the variant number. If the gene is unique, with no other rows sharing the same symbol, it should appear as variant 1.
For example:
001357882 cysteine and histidine rich 1 (Cyhr1), transcript variant 8
In the new df should appear as:
Cyhr1 variant 8
I've tried using regex functions, but I get many errors. I think a loop is the best way, but I'm a little stuck here. If anyone could assist...
Thank you
There might be better regular expressions than these, but they work on your example. I used look-ahead and look-behind expressions to find the content between parentheses.
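A minimal sketch of the look-behind/look-ahead idea, written in Python as an assumption, since the thread does not say which language is in use. The names `SYMBOL_RE` and `VARIANT_RE` are made up for illustration, and the symbol pattern assumes the closing parenthesis is immediately followed by a comma, as in the example row.

```python
import re

# Look-behind for "(" and look-ahead for ")," so only the text in between is matched.
SYMBOL_RE = re.compile(r"(?<=\()[^()]+(?=\),)")
# Digits that directly follow the literal text "variant ".
VARIANT_RE = re.compile(r"(?<=variant )\d+")

description = "cysteine and histidine rich 1 (Cyhr1), transcript variant 8"

symbol = SYMBOL_RE.search(description).group()
variant = VARIANT_RE.search(description).group()
print(symbol, "variant", variant)  # prints: Cyhr1 variant 8
```

Both patterns work on a single string, so the same expressions could be applied to every row of a description column rather than typed in per gene.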
Thank you! The thing is that I have a df of 4k+ rows, so I can't enter each symbol by hand. I need the code to run over the df, find the gene symbol in every row (it is written inside parentheses), and also determine which variant of the gene it is.
Then, it must create a new df based on that data:
First column: the gene symbol.
Second column: variant 1 if it is the only row matching the gene symbol; otherwise variant #, according to the variant number written in the description.
The data look as follows:
1 NM_013485 complement component 9 (C9), mRNA.
2 NM_001034878 dynein, axonemal, intermediate chain 2 (Dnaic2), mRNA.
3 NM_177364 SH3 and PX domains 2B (Sh3pxd2b), mRNA.
4 NM_009437 thiosulfate sulfurtransferase, mitochondrial (Tst), mRNA.
5 NM_001098810 peptidyl-tRNA hydrolase 2 (Ptrh2), transcript variant 2, mRNA.
6 NM_001039684 Meckel syndrome, type 1 (Mks1), mRNA.
7 NM_183034 pleckstrin homology domain containing, family M (with RUN domain) member 1 (Plekhm1), mRNA.
8 NM_133807 leucine rich repeat containing 59 (Lrrc59), mRNA.
9 NM_144824 WD repeat containing, antisense to Trp53 (Wrap53), transcript variant 1, mRNA.
10 NM_010271 glycerol-3-phosphate dehydrogenase 1 (soluble) (Gpd1), mRNA.
11 NM_001364769 WD repeat containing, antisense to Trp53 (Wrap53), transcript variant 2, mRNA.
12 NM_177752 essential meiotic structure-specific endonuclease 1 (Eme1), mRNA.
Thank you again
I manually entered the data in my first example because you did not provide any data structure to work with. I have taken the data from your last post and made a text file from it, separating the columns with semicolons. You may obtain your data in some other way. Starting from the point where the data have been read into a data frame named DF, my previous code provides most of what you want. I have added a step to put explicit "variant 1" values in rows where no variant is listed.
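The steps described above could look roughly like the following. This is a sketch in Python using only the standard library, not the code from the original reply; the semicolon-separated layout follows the description above, while the names `RAW`, `SYMBOL_RE`, `VARIANT_RE`, and `symbol_and_variant` are hypothetical. Rows with no variant listed get an explicit variant 1.

```python
import csv
import io
import re

# A few rows from the thread, stored as semicolon-separated text (assumed layout).
RAW = """\
NM_013485;complement component 9 (C9), mRNA.
NM_001098810;peptidyl-tRNA hydrolase 2 (Ptrh2), transcript variant 2, mRNA.
NM_183034;pleckstrin homology domain containing, family M (with RUN domain) member 1 (Plekhm1), mRNA.
NM_144824;WD repeat containing, antisense to Trp53 (Wrap53), transcript variant 1, mRNA.
NM_001364769;WD repeat containing, antisense to Trp53 (Wrap53), transcript variant 2, mRNA.
"""

# Symbol: parenthesized text directly followed by a comma (assumption from the sample).
SYMBOL_RE = re.compile(r"\(([^()]+)\),")
VARIANT_RE = re.compile(r"variant (\d+)")

def symbol_and_variant(description):
    """Return (symbol, variant); variant defaults to 1 when none is listed."""
    symbol_match = SYMBOL_RE.search(description)
    variant_match = VARIANT_RE.search(description)
    symbol = symbol_match.group(1) if symbol_match else None
    variant = int(variant_match.group(1)) if variant_match else 1
    return symbol, variant

# The description contains commas, so the semicolon delimiter keeps each row at two fields.
rows = csv.reader(io.StringIO(RAW), delimiter=";")
result = [symbol_and_variant(description) for _code, description in rows]
for symbol, variant in result:
    print(symbol, "variant", variant)
```

No loop over known gene names is needed: each description is processed independently, so the approach scales to a few thousand rows without changes.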
When I tried to do it myself, one thing that went wrong was that when I look for data inside parentheses, I don't get just the gene symbol; I also get other, unimportant data. I would like to get rid of this irrelevant data. Since most of it begins with a lowercase letter (because it's not a gene name or symbol), perhaps we can filter on that. So my regular expression would match any string inside parentheses that begins with a capital letter. Do you know how I can do that?
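One way to express "inside parentheses and starting with a capital letter" is sketched below, again in Python as an assumption about the language. The name `CAPITALIZED_RE` is made up for illustration. Note the limits of the heuristic: it would miss a symbol that starts with a lowercase letter or a digit, and it would keep any parenthetical remark that happens to start with a capital.

```python
import re

# "(" then one capital letter, then any characters other than parentheses, then ")".
CAPITALIZED_RE = re.compile(r"\(([A-Z][^()]*)\)")

descriptions = [
    "pleckstrin homology domain containing, family M (with RUN domain) member 1 (Plekhm1), mRNA.",
    "glycerol-3-phosphate dehydrogenase 1 (soluble) (Gpd1), mRNA.",
]

# findall returns only the captured group, i.e. the text without the parentheses.
symbols = [CAPITALIZED_RE.findall(d) for d in descriptions]
print(symbols)  # prints: [['Plekhm1'], ['Gpd1']]
```

In the sample rows, "(with RUN domain)" and "(soluble)" are skipped because they start with a lowercase letter, leaving only the gene symbols.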