substring with conditions

I need to extract date information from many strings. My strings are like:

"1001171_S-1_2013-11-20_0001104659-13-086087.txt"
"100_S-1_2014-1-10_000659-13-086087.txt"
...

Basically, for each string I need to extract the component between the second and third underscores, but I am not sure how to do that.

Does this regular expression work?

library(stringr)

text <- c("1001171_S-1_2013-11-20_0001104659-13-086087.txt",
          "100_S-1_2014-1-10_000659-13-086087.txt")

str_extract(text, "(?<=_)\\d{4}-\\d{1,2}-\\d{1,2}(?=_)")
#> [1] "2013-11-20" "2014-1-10"

Created on 2019-10-18 by the reprex package (v0.3.0.9000)
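
If the date is always the third underscore-separated field, as the question describes, another option that avoids lookarounds is to split each name on "_" and take the third piece. This is only a sketch in base R: `parts`, `third_field` and `dates` are names made up for this illustration, and it assumes every file name has at least three fields.

```r
text <- c("1001171_S-1_2013-11-20_0001104659-13-086087.txt",
          "100_S-1_2014-1-10_000659-13-086087.txt")

# Split each name on a literal underscore and keep the third field
parts <- strsplit(text, "_", fixed = TRUE)
third_field <- vapply(parts, function(p) p[3], character(1))
third_field
#> [1] "2013-11-20" "2014-1-10"

# Optionally parse the strings into Date objects for sorting or arithmetic
dates <- as.Date(third_field, format = "%Y-%m-%d")
```

Unlike the regular expression, this does not check that the field looks like a date, so it suits cases where the position is reliable and the format may vary.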


Thanks so much!

One more question:

"/run/media/bb/cc/GA/DrRao/JOBS/Edgar filings_full text/Form S-1/1001171/1001171_S-1_2013-11-20_0001104659-13-086087.txt"

"/run/media/bb/cc/GA/DrRao/JOBS/Edgar filings_full text/Form S-1/1001172/1001172_S-1_2013-01-20_0001104659-13-086087.txt"

For these two strings, how could I extract the substrings 1001171 and 1001172?

Assuming you are working only with files inside the "Form S-1" folder, this should work:

library(stringr)

text <- c("/run/media/bb/cc/GA/DrRao/JOBS/Edgar filings_full text/Form S-1/1001171/1001171_S-1_2013-11-20_0001104659-13-086087.txt",
          "/run/media/bb/cc/GA/DrRao/JOBS/Edgar filings_full text/Form S-1/1001172/1001172_S-1_2013-01-20_0001104659-13-086087.txt")

str_extract(text, "(?<=S-1/)\\d{7}")
#> [1] "1001171" "1001172"

You can extract any substring as long as it follows a pattern and you can describe it with a regular expression.
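
For this particular case the pattern can also be described by the path structure instead of the text: the number is the name of the directory holding the file, and it also starts the file name. A small sketch with base R's `basename()` and `dirname()`, assuming the folder layout shown in the question; `ids` is just an illustrative name.

```r
text <- c("/run/media/bb/cc/GA/DrRao/JOBS/Edgar filings_full text/Form S-1/1001171/1001171_S-1_2013-11-20_0001104659-13-086087.txt",
          "/run/media/bb/cc/GA/DrRao/JOBS/Edgar filings_full text/Form S-1/1001172/1001172_S-1_2013-01-20_0001104659-13-086087.txt")

# dirname() drops the file name; basename() then keeps the last directory
ids <- basename(dirname(text))
ids
#> [1] "1001171" "1001172"

# The same number is everything before the first underscore in the file name
sub("_.*$", "", basename(text))
#> [1] "1001171" "1001172"
```

Neither variant depends on the number being exactly seven digits long or on the parent folder being named "Form S-1".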
