I am struggling to write the correct regex pattern to match the following condition: (contains the word other) OR (contains both us AND car).
This code works as expected:
str_detect(c('us cars',
'u.s. cars',
'us and bikes',
'other'),
regex('other|((?=.*us)(?=.*car))',
ignore_case = TRUE))
[1] TRUE FALSE FALSE TRUE
However, if I try to include variations of us (united states) such as u.s. and u.s, then the pattern does not work anymore.
str_detect(c('us cars',
'u.s. cars',
'us and bikes',
'other'),
regex('other|((?=.*us)(?=.*u.s.)(?=.*u.s)(?=.*car))',
ignore_case = TRUE))
[1] FALSE FALSE FALSE TRUE
What is the issue here? Why is the AND pattern working in the first case but not in the second?
Thanks!
The problem is your use of ".": in a regex this special character matches any character. If you want a literal dot, you have to escape it with \\.
library(stringr)
str_detect(c('us cars',
'u.s. cars',
'us and bikes',
'other'),
regex('other|((?=.*u\\.*s\\.*)(?=.*car))',
ignore_case = TRUE))
#> [1] TRUE TRUE FALSE TRUE
Created on 2019-02-12 by the reprex package (v0.2.1)
Thanks, but then I am confused. Isn't . a character as well? It should match the regex dot (any character), right?
For instance, here I am escaping and it still does not work:
> str_detect(c('us cars',
+ 'u.s. cars',
+ 'us and bikes',
+ 'other'),
+ regex('other|((?=.*us)(?=.*u\\.s\\.)(?=.*u\\.s)(?=.*car).*)',
+ ignore_case = TRUE))
[1] FALSE FALSE FALSE TRUE
Nope, because a bare . only says "any character", exactly once, not how many times. .* means any character, 0 or more times.
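To make the difference concrete, here is a small sketch comparing the four forms on a few test strings. The vector x and the result names are made up for illustration; only str_detect() comes from the thread.

```r
library(stringr)

# Hypothetical test strings, chosen to separate the four cases
x <- c("us", "u.s", "uxs", "u.s.")

# "." matches exactly one character of any kind, so "us" (nothing between) fails
one_any <- str_detect(x, "u.s")        # FALSE TRUE TRUE TRUE

# ".*" matches zero or more characters of any kind
many_any <- str_detect(x, "u.*s")      # TRUE TRUE TRUE TRUE

# "\\." matches exactly one literal dot
literal_dot <- str_detect(x, "u\\.s")  # FALSE TRUE FALSE TRUE

# "\\.?" matches an optional literal dot
optional_dot <- str_detect(x, "u\\.?s") # TRUE TRUE FALSE TRUE
```

The last form is probably the closest to "us with or without dots", since it accepts both spellings without also accepting arbitrary characters in between.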
This works:
str_detect(c('us cars',
'u.s. cars',
'us and bikes',
'other'),
regex('other|((?=.*u\\.s\\.)(?=.*car).*)',
ignore_case = TRUE))
It seems the issue is having more than one AND. Am I completely crazy here?!
The idea of a regular expression is to define a pattern, not to list each variation of the text as you are trying to do.
Absolutely. But on purely logical grounds, I don't get why combining four logical conditions with AND does not work, while it works with only two. Perhaps there is a limit on the number of lookaheads that can be used?
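You are getting the logic wrong: none of your strings match the pattern you were specifying
((?=.*us)(?=.*u\\.s\\.)(?=.*u\\.s)(?=.*car))
A string like this one would match it: US U.S. U.S car
Do you see the reason? Each lookahead is one more AND condition, so a string has to contain every variant at once. What you want between the variants is OR, not AND.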
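As a sketch of what that could look like in practice, the variants of us can be folded into a single lookahead (one AND condition), with car kept as the second one. The variable names below are illustrative, and the pattern u\\.?s\\.? is one possible way to write the OR, not the only one.

```r
library(stringr)

x <- c("us cars", "u.s. cars", "us and bikes", "other")

# Four AND-ed lookaheads: the string must contain us AND u.s. AND u.s AND car
all_variants <- regex("other|((?=.*us)(?=.*u\\.s\\.)(?=.*u\\.s)(?=.*car))",
                      ignore_case = TRUE)
and_result  <- str_detect(x, all_variants)                  # FALSE FALSE FALSE TRUE
and_example <- str_detect("US U.S. U.S car", all_variants)  # TRUE

# Two AND-ed lookaheads: (us OR u.s OR u.s.) AND car
# u\\.?s\\.? makes each dot optional, which covers all three spellings
any_variant <- regex("other|((?=.*u\\.?s\\.?)(?=.*car))",
                     ignore_case = TRUE)
or_result <- str_detect(x, any_variant)                     # TRUE TRUE FALSE TRUE
```

Note that us is matched as a substring here, so a string such as "bus and car" would also pass. If that matters, adding word boundaries (\\b) around the us part is one option to consider.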
damn you are right... time to go to bed I guess.... Thanks for helping out!!