Help with str_extract

Suppose I have a string like

text <- "1234.Lorem, lpsum, etc."

I need to split it in two:
x1 = "1234"
x2 = "Lorem, lpsum, etc."

I was able to get x1 with:

x1 <- str_extract(string = text,
                  pattern = "[^\\.]+")

But I can't get x2. The best I could find was:

x2 <- str_extract(string = text,
                  pattern = "\\..*")

But I get: x2 = ".Lorem, lpsum, etc."

How can I get x2 correctly?

I would use a lookbehind assertion.

library(stringr)
text <- "1234.Lorem, lpsum, etc."
str_extract(text, "[^\\.]+")
str_extract(text, "(?<=\\.).+")

This looks like it will work. @willmjr, I just thought I'd translate the regex here, in case it's unclear what each part does.

\\. = a literal period .
[^\\.] = any character besides a literal period .
+ = at least one time
[^\\.]+ = one or more non-period characters in a row

Note that [^\\.]+ happens to work for your particular case because str_extract() returns only the first match when there are several. But [^\\.]+ also matches the second half of your string (everything between the two periods). I would change [^\\.]+ to ^[^\\.]+. Confusingly, the ^ outside of the [] "anchors" the regex so that it only looks for a match starting at the beginning of the string. That's a different meaning from ^ inside the [], where it means "any character except the following".
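To see the difference, here is a small sketch using str_extract_all(), which returns every match instead of just the first. The variable names all_matches and anchored are only for illustration.

```r
library(stringr)

text <- "1234.Lorem, lpsum, etc."

# Unanchored: every run of non-period characters is a match
all_matches <- str_extract_all(text, "[^\\.]+")[[1]]
print(all_matches)

# Anchored with ^: only a run starting at the beginning of the string can match
anchored <- str_extract_all(text, "^[^\\.]+")[[1]]
print(anchored)
```

The unanchored pattern finds two matches, while the anchored one can only ever find the piece at the start of the string.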

For the second part, we have "(?<=\\.).+". Breaking that down:
The (?<=) is the "lookbehind assertion". The characters after the = in the parentheses will be matched as "what comes before the pattern we're actually interested in". So those characters won't actually be returned in the match.

(?<=\\.) means "there is a period immediately before the desired match"
. outside the parentheses means "any character except \n"
+, as before, means "at least one time"
So putting it together, "(?<=\\.).+" means "at least one character immediately following a period"
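Putting both patterns together, a minimal sketch of the full split might look like this. The str_split_fixed() call at the end is an alternative I'm adding for comparison, not something from the original answer; it assumes you always want to split on the first period only.

```r
library(stringr)

text <- "1234.Lorem, lpsum, etc."

# Everything before the first period
x1 <- str_extract(text, "^[^\\.]+")

# Everything after the first period (the period itself is not returned)
x2 <- str_extract(text, "(?<=\\.).+")

# Alternative: split on a literal period into at most 2 pieces
parts <- str_split_fixed(text, "\\.", n = 2)

print(x1)
print(x2)
print(parts)
```

Because .+ is greedy, x2 runs to the end of the string, so the final period in "etc." is kept.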

I hope that's helpful!


Thank you so much, excellent explanation!

