Using regular expressions to create a data frame from an existing one

Hi,
I have a df with two columns:
Column 1 - gene code
Column 2 - gene description. The gene symbol appears within the description, in parentheses (). In addition, when two rows share the same symbol (two similar genes), the description includes the string " variant # ".
I've been asked to create a new df whose first column contains the gene symbol and whose second column contains the variant number. If the gene is unique and no other row shares its symbol, it should appear as variant 1.
For example:
001357882 cysteine and histidine rich 1 (Cyhr1), transcript variant 8
In the new df it should appear as:
Cyhr1 variant 8

I've tried using the regex function but I get many errors. I think creating a loop is the best way, but I'm a little stuck here. If anyone could assist...
Thank you

There might be better regular expressions than these, but they work on your example. I used look-ahead and look-behind expressions to find the content between parentheses.

library(stringr)
library(dplyr)

DF <- data.frame(Code = "001357882", Desc = "cysteine and histidine rich 1 (Cyhr1), transcript variant 8")
DF
#>        Code                                                        Desc
#> 1 001357882 cysteine and histidine rich 1 (Cyhr1), transcript variant 8
DF2 <- mutate(DF, Code = NULL, Symbol = str_extract(Desc, "(?<=\\().+(?=\\))"),
              Variant = str_extract(Desc, "variant \\d+"), Desc = NULL)
DF2
#>   Symbol   Variant
#> 1  Cyhr1 variant 8

Created on 2020-05-29 by the reprex package (v0.2.1)
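
On the idea of writing a loop: as far as I know, str_extract() is vectorised over its input, so the same call processes a whole column at once. A minimal sketch, assuming the descriptions sit in a character vector (the name `desc` is only for illustration) and using descriptions quoted in this thread:

```r
library(stringr)

# Hypothetical vector standing in for the description column of the real data
desc <- c(
  "cysteine and histidine rich 1 (Cyhr1), transcript variant 8",
  "complement component 9 (C9), mRNA.",
  "peptidyl-tRNA hydrolase 2 (Ptrh2), transcript variant 2, mRNA."
)

# One call per pattern; each returns one result per element of desc
symbol  <- str_extract(desc, "(?<=\\().+(?=\\))")
variant <- str_extract(desc, "variant \\d+")

symbol
#> [1] "Cyhr1" "C9"    "Ptrh2"
variant
#> [1] "variant 8" NA          "variant 2"
```

Elements with no match come back as NA, which is why a description without a variant needs a separate step if it should read "variant 1".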

Thank you! The thing is that I have a df of 4k+ rows, so I can't enter each symbol by hand. I need the code to run over the df, find the gene symbol in every row (it is written inside parentheses), and also determine which variant of the gene it is.
Then, it must create a new df based on that data:
First column, the symbol of the gene.
Second column, variant 1 if it is the only row matching the gene symbol. Otherwise, variant #, according to the variant number written in the description.
The data look as follows:
1 NM_013485 complement component 9 (C9), mRNA.
2 NM_001034878 dynein, axonemal, intermediate chain 2 (Dnaic2), mRNA.
3 NM_177364 SH3 and PX domains 2B (Sh3pxd2b), mRNA.
4 NM_009437 thiosulfate sulfurtransferase, mitochondrial (Tst), mRNA.
5 NM_001098810 peptidyl-tRNA hydrolase 2 (Ptrh2), transcript variant 2, mRNA.
6 NM_001039684 Meckel syndrome, type 1 (Mks1), mRNA.
7 NM_183034 pleckstrin homology domain containing, family M (with RUN domain) member 1 (Plekhm1), mRNA.
8 NM_133807 leucine rich repeat containing 59 (Lrrc59), mRNA.
9 NM_144824 WD repeat containing, antisense to Trp53 (Wrap53), transcript variant 1, mRNA.
10 NM_010271 glycerol-3-phosphate dehydrogenase 1 (soluble) (Gpd1), mRNA.
11 NM_001364769 WD repeat containing, antisense to Trp53 (Wrap53), transcript variant 2, mRNA.
12 NM_177752 essential meiotic structure-specific endonuclease 1 (Eme1), mRNA.
Thank you again :)

I manually entered the data in my first example because you did not provide any data structure to work with. I have taken the data from your last post and made a text file from it, separating the columns with semicolons. You may obtain your data in some other way. Starting from the point where the data have been read into a data frame named DF, my previous code provides most of what you want. I have added a step to put explicit "variant 1" values in rows where no variant is listed.

library(stringr)
library(dplyr)

DF <- read.csv("~/R/Play/Dummy.csv", sep = ";", stringsAsFactors = FALSE)
DF
#>            Code
#> 1     NM_013485
#> 2  NM_001034878
#> 3     NM_177364
#> 4     NM_009437
#> 5  NM_001098810
#> 6  NM_001039684
#> 7     NM_183034
#> 8     NM_133807
#> 9     NM_144824
#> 10    NM_010271
#> 11 NM_001364769
#> 12    NM_177752
#>                                                                                            Desc
#> 1                                                            complement component 9 (C9), mRNA.
#> 2                                        dynein, axonemal, intermediate chain 2 (Dnaic2), mRNA.
#> 3                                                       SH3 and PX domains 2B (Sh3pxd2b), mRNA.
#> 4                                     thiosulfate sulfurtransferase, mitochondrial (Tst), mRNA.
#> 5                                peptidyl-tRNA hydrolase 2 (Ptrh2), transcript variant 2, mRNA.
#> 6                                                         Meckel syndrome, type 1 (Mks1), mRNA.
#> 7   pleckstrin homology domain containing, family M (with RUN domain) member 1 (Plekhm1), mRNA.
#> 8                                             leucine rich repeat containing 59 (Lrrc59), mRNA.
#> 9                WD repeat containing, antisense to Trp53 (Wrap53), transcript variant 1, mRNA.
#> 10                                 glycerol-3-phosphate dehydrogenase 1 (soluble) (Gpd1), mRNA.
#> 11               WD repeat containing, antisense to Trp53 (Wrap53), transcript variant 2, mRNA.
#> 12                            essential meiotic structure-specific endonuclease 1 (Eme1), mRNA.
DF2 <- mutate(DF, Code = NULL, Symbol = str_extract(Desc, "(?<=\\().+(?=\\))"),
                Variant = str_extract(Desc, "variant \\d+"), Desc = NULL)
DF2 <- mutate(DF2, Variant = ifelse(is.na(Variant), "variant 1", Variant))
DF2
#>                                Symbol   Variant
#> 1                                  C9 variant 1
#> 2                              Dnaic2 variant 1
#> 3                            Sh3pxd2b variant 1
#> 4                                 Tst variant 1
#> 5                               Ptrh2 variant 2
#> 6                                Mks1 variant 1
#> 7  with RUN domain) member 1 (Plekhm1 variant 1
#> 8                              Lrrc59 variant 1
#> 9                              Wrap53 variant 1
#> 10                     soluble) (Gpd1 variant 1
#> 11                             Wrap53 variant 2
#> 12                               Eme1 variant 1

Created on 2020-05-30 by the reprex package (v0.3.0)
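
Rows 7 and 10 in the output above pick up extra text because `.+` is greedy: it starts after the first "(" and runs to the last ")" in the description. The sketch below reproduces that on one description and tries one possible alternative pattern. It assumes the symbol never contains parentheses and is always followed directly by "),", which holds for the twelve rows posted here but is not checked against the full data set.

```r
library(stringr)

desc <- "glycerol-3-phosphate dehydrogenase 1 (soluble) (Gpd1), mRNA."

# Greedy .+ spans from the first "(" to the last ")"
greedy <- str_extract(desc, "(?<=\\().+(?=\\))")
greedy
#> [1] "soluble) (Gpd1"

# [^()]+ cannot cross a parenthesis, and the look-ahead (?=\\),) only
# accepts a group whose closing ")" is immediately followed by a comma
# (assumption: the symbol is the parenthesised group right before a comma)
symbol <- str_extract(desc, "(?<=\\()[^()]+(?=\\),)")
symbol
#> [1] "Gpd1"
```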

When entering the code above I get the following error:

DF2 <- mutate(DF, Code = NULL, Symbol = str_extract(Desc, "(?<=\\().+(?=\\))"),
              Variant = str_extract(Desc, "variant \\d+"), Desc = NULL)
Error in stri_extract_first_regex(string, pattern, opts_regex = opts(pattern)) :
  object 'Desc' not found

DF2 <- mutate(DF2, Variant = ifelse(is.na(Variant), "variant 1", Variant))
Error in mutate(DF2, Variant = ifelse(is.na(Variant), "variant 1", Variant)) :
  object 'DF2' not found

DF2
Error: object 'DF2' not found

What is the meaning of Desc?

When I tried to do it myself, one thing that went wrong was that searching for text inside parentheses didn't return only the gene symbol; I also got other, unimportant text. I would like to get rid of this irrelevant text. Since most of it begins with a lowercase letter (because it's not a gene name or symbol), maybe we can filter on that. So my regular expression would match any string inside parentheses that begins with a capital letter. Do you know how I can do that?

@Yarnabrina - Thanks for catching the bad matches.

@ninirois - I added column names Code and Desc to the data set you posted. You will have to edit my code to use the column names of your data set.
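
As a rough illustration of that renaming step: the column names `V1` and `V2` below are hypothetical placeholders, since the real names in your data set were not posted. Check yours with names(DF) first.

```r
# Hypothetical data frame whose columns have not yet been named Code/Desc
DF <- data.frame(
  V1 = c("NM_013485", "NM_001098810"),
  V2 = c("complement component 9 (C9), mRNA.",
         "peptidyl-tRNA hydrolase 2 (Ptrh2), transcript variant 2, mRNA."),
  stringsAsFactors = FALSE
)

# See what the columns are actually called
names(DF)
#> [1] "V1" "V2"

# Rename them so that code referring to Code and Desc can find them
names(DF) <- c("Code", "Desc")
names(DF)
#> [1] "Code" "Desc"
```

Alternatively, leave the names alone and replace `Desc` in the mutate() calls with whatever your description column is called.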

Here is my latest version, searching for text that begins with an upper case letter inside of parentheses.

library(stringr)
library(dplyr)
DF <- read.csv("~/R/Play/Dummy.csv", sep = ";", stringsAsFactors = FALSE)

DF2 <- mutate(DF, Code = NULL, Symbol = str_extract(Desc, "(?<=\\()[A-Z][^\\)]+(?=\\))"),
                Variant = str_extract(Desc, "variant \\d+"), Desc = NULL)
DF2 <- mutate(DF2, Variant = ifelse(is.na(Variant), "variant 1", Variant))
DF2
#>      Symbol   Variant
#> 1        C9 variant 1
#> 2    Dnaic2 variant 1
#> 3  Sh3pxd2b variant 1
#> 4       Tst variant 1
#> 5     Ptrh2 variant 2
#> 6      Mks1 variant 1
#> 7   Plekhm1 variant 1
#> 8    Lrrc59 variant 1
#> 9    Wrap53 variant 1
#> 10     Gpd1 variant 1
#> 11   Wrap53 variant 2
#> 12     Eme1 variant 1

Created on 2020-05-30 by the reprex package (v0.3.0)
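
If the variant is needed as a number, or if you want to see which symbols occur on more than one row, something along these lines might help. It is only a sketch on a three-row stand-in for DF2; the column names `VariantNum` and `n_rows` are made up for the example.

```r
library(dplyr)
library(stringr)

# Small stand-in for the DF2 produced above
DF2 <- data.frame(
  Symbol  = c("C9", "Wrap53", "Wrap53"),
  Variant = c("variant 1", "variant 1", "variant 2"),
  stringsAsFactors = FALSE
)

DF3 <- DF2 %>%
  # Pull the digits out of "variant N" and store them as an integer
  mutate(VariantNum = as.integer(str_extract(Variant, "\\d+"))) %>%
  # Add the number of rows that share each symbol
  add_count(Symbol, name = "n_rows")

DF3
#>   Symbol   Variant VariantNum n_rows
#> 1     C9 variant 1          1      1
#> 2 Wrap53 variant 1          1      2
#> 3 Wrap53 variant 2          2      2
```

The count makes it easy to spot a symbol that appears on several rows while some of those rows list no variant in the description.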

Ok it worked!
Thank you very much!
