How to select substring from a string?


#1

Suppose my string is a DNA sequence: TATACATGATCGGGTCATAAGCTATAATGGGGCAATAA

I want to extract the substrings that run from ATG to TAA. How should I do that?

For this string there should be two such substrings:

  1. ATGATCGGGTCATAA
  2. ATGGGGCAATAA

#2

I recommend the stringi package. You will probably also need to learn some basics of regular expressions.

library(stringi)
x = "TATACATGATCGGGTCATAAGCTATAATGGGGCAATAA"
stri_extract(x, regex = "ATG.*?TAA")
[1] "ATGATCGGGTCATAA"
# compare without the ? (a.k.a. greedy matching)
stri_extract(x, regex = "ATG.*TAA")
[1] "ATGATCGGGTCATAAGCTATAATGGGGCAATAA"

#3

What if the string contains multiple matching substrings? Would this still work?


#4

Just use:

stri_extract_all(x, regex = "ATG.*?TAA")
[[1]]
[1] "ATGATCGGGTCATAA" "ATGGGGCAATAA"  

Is that what you want?
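
If the positions of the matches are needed as well, stri_locate_all_regex reports where each match starts and ends. A minimal sketch, assuming stringi is installed and using the same sequence as above; the variable names loc, starts, ends and pieces are only for illustration:

```r
library(stringi)

x <- "TATACATGATCGGGTCATAAGCTATAATGGGGCAATAA"

# One two-column matrix (start, end) per input string; positions are 1-based
loc <- stri_locate_all_regex(x, "ATG.*?TAA")
starts <- loc[[1]][, "start"]
ends   <- loc[[1]][, "end"]

# Cutting the string at those positions gives back the same matches
pieces <- stri_sub(x, from = starts, to = ends)
print(pieces)
```

Here the two matches span positions 6 to 20 and 27 to 38, so pieces holds the same two substrings that stri_extract_all returns.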


#5

Also, if I want to stop not only at TAA but also at TAG and TGA, what should I do?


#6

We could go on like this forever. I showed you an approach, and I believe you can learn enough basic regular expressions to solve the problem yourself.
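
As a hint, regular expressions support alternation with |, so several stop codons can be listed in one group. A minimal sketch, assuming stringi is installed; the sequence y is made up for illustration and is not from this thread:

```r
library(stringi)

# Hypothetical test sequence with one ATG...TAG and one ATG...TGA stretch
y <- "CCATGAAATAGGGATGCCCTGATT"

# Lazy match from ATG up to the nearest TAA, TAG or TGA
any_stop <- stri_extract_all_regex(y, "ATG.*?(?:TAA|TAG|TGA)")[[1]]

# Variant that only steps in whole codons (three bases at a time) after ATG,
# assuming the sequence contains only the letters A, C, G and T
in_frame <- stri_extract_all_regex(y, "ATG(?:[ACGT]{3})*?(?:TAA|TAG|TGA)")[[1]]

print(any_stop)
print(in_frame)
```

The first pattern ignores the reading frame and stops at the first stop codon it meets at any offset. The second is probably closer to what is wanted for open reading frames. For this particular y both patterns give the same two substrings, ATGAAATAG and ATGCCCTGA.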


#7

Thank you so much. I have one more question:
If I have a string of length 1000 to 10000 and I want to extract the substring between two particular positions, for example from 368 to 897, what approach should I use?


#8

You mean positions?

stri_sub(x, from = 3, to = 5)
[1] "TAC"

It's worth learning the stringi package. It has lots of useful functions.
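
The same call works for the 368 to 897 case. A minimal sketch, assuming stringi is installed; long_seq is a made-up 1000-character sequence standing in for the real data:

```r
library(stringi)

# Hypothetical 1000-character sequence: "ACGT" repeated 250 times
long_seq <- stri_dup("ACGT", 250)

# Positions are 1-based and both ends are included
piece <- stri_sub(long_seq, from = 368, to = 897)

# Same substring, given as a start position plus a length (897 - 368 + 1 = 530)
piece_by_length <- stri_sub(long_seq, from = 368, length = 530)

# Base R equivalent, no extra package needed
piece_base <- substr(long_seq, 368, 897)

print(stri_length(piece))
```

All three calls return the same 530-character substring, so which one to use is mostly a matter of taste.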


#9

Thank you, sir! I will learn this package.