Bug with AND and OR operators in regex?

I am struggling to write the correct regex pattern to match the following condition:

(contains the word other) OR (contains both us AND car)

This code works as expected:

    str_detect(c('us cars',
                 'u.s. cars',
                 'us and bikes',
                 'other'),
               regex('other|((?=.*us)(?=.*car))',
                     ignore_case = TRUE))
    [1]  TRUE FALSE FALSE  TRUE

However, if I try to include variations of "us" (United States), such as u.s. and u.s, the pattern no longer works.

    str_detect(c('us cars',
                 'u.s. cars',
                 'us and bikes',
                 'other'),
               regex('other|((?=.*us)(?=.*u.s.)(?=.*u.s)(?=.*car))',
                     ignore_case = TRUE))
    [1] FALSE FALSE FALSE  TRUE

What is the issue here? Why is the AND pattern working in the first case but not in the second?
Thanks!

The problem is your use of ".": this special character means "any character" in a regex. If you want a literal dot, you have to escape it, which is written \\. in an R string.

    library(stringr)
    str_detect(c('us cars',
                 'u.s. cars',
                 'us and bikes',
                 'other'),
               regex('other|((?=.*u\\.*s\\.*)(?=.*car))',
                     ignore_case = TRUE))
    #> [1]  TRUE  TRUE FALSE  TRUE

Created on 2019-02-12 by the reprex package (v0.2.1)
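
To make the escaping point concrete, here is a minimal sketch comparing an unescaped dot, an escaped dot, and a fixed (non-regex) pattern. The test vector `x` and the variable names are made up for illustration; only `str_detect()` and `fixed()` come from stringr.

```r
library(stringr)

# Hypothetical test strings, chosen only to show the difference.
x <- c("u.s", "uxs", "us")

# Unescaped: "." stands for any single character, so "uxs" matches as well.
unescaped <- str_detect(x, "u.s")

# Escaped: "\\." in an R string is the regex \. and matches a literal dot only.
escaped <- str_detect(x, "u\\.s")

# fixed() switches off regex interpretation, so the dot is taken literally.
literal <- str_detect(x, fixed("u.s"))

print(rbind(unescaped, escaped, literal))
```

Note that the unescaped pattern still matches "u.s" itself; escaping only stops it from matching other characters in that position.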


Thanks, but then I am confused. Isn't . a character as well? It should match the regex dot (any character), right?

For instance, here I am escaping and it still does not work:

    str_detect(c('us cars',
                 'u.s. cars',
                 'us and bikes',
                 'other'),
               regex('other|((?=.*us)(?=.*u\\.s\\.)(?=.*u\\.s)(?=.*car).*)',
                     ignore_case = TRUE))
    [1] FALSE FALSE FALSE  TRUE

Nope, because a bare . only says "any character", exactly once; it does not say how many times. .* means any character, zero or more times.
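
A small sketch of how the quantifier changes what matches, assuming a made-up vector `x`; the variable names are illustrative only.

```r
library(stringr)

# Hypothetical test strings with zero, one, and two dots, plus a non-dot character.
x <- c("us", "u.s", "u..s", "uxs")

# "." with no quantifier: exactly one character of any kind between u and s.
one_any <- str_detect(x, "u.s")

# "\\.?" : zero or one literal dot between u and s.
opt_dot <- str_detect(x, "u\\.?s")

# "\\.*" : zero or more literal dots between u and s.
many_dots <- str_detect(x, "u\\.*s")

print(rbind(one_any, opt_dot, many_dots))
```

If the goal is just "us with optional dots", the `?` quantifier is probably the tightest fit, since `*` would also accept something like "u...s".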


This works:

    str_detect(c('us cars',
                 'u.s. cars',
                 'us and bikes',
                 'other'),
               regex('other|((?=.*u\\.s\\.)(?=.*car).*)',
                     ignore_case = TRUE))

It seems the issue is having more than one AND. Am I completely crazy here?!

The idea of a regular expression is to define a pattern, not to list each variation of the text as you are trying to do.


Absolutely. But on purely logical grounds, I don't get why combining four logical conditions with AND does not work, while it works with only two. Perhaps there is a limit on the number of lookaheads that can be used?

You are getting the logic wrong: none of your strings match the pattern you were specifying,
((?=.*us)(?=.*u\\.s\\.)(?=.*u\\.s)(?=.*car))
A string like "US U.S. U.S car" would match it. Do you see the reason? You are using AND, not OR: every lookahead has to succeed on the same string.
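
One way to see this is to evaluate each lookahead condition on its own and combine the results by hand. The sketch below assumes the four test strings from the question; the helper `ic()` and the variable names are hypothetical.

```r
library(stringr)

x <- c("us cars", "u.s. cars", "us and bikes", "other")

# Hypothetical helper: wrap a pattern as a case-insensitive regex.
ic <- function(p) regex(p, ignore_case = TRUE)

# Each lookahead (?=.*foo) amounts to "the string contains foo".
has_us    <- str_detect(x, ic("us"))
has_u.s.  <- str_detect(x, ic("u\\.s\\."))
has_u.s   <- str_detect(x, ic("u\\.s"))
has_car   <- str_detect(x, ic("car"))
has_other <- str_detect(x, ic("other"))

# Chained lookaheads are AND: no string contains both "us" and "u.s".
all_and <- has_other | (has_us & has_u.s. & has_u.s & has_car)

# What was intended: (any spelling of us) AND car.
intended <- has_other | ((has_us | has_u.s) & has_car)

# The same intent as one regex, with optional dots instead of listed variants.
single <- str_detect(x, ic("other|((?=.*u\\.?s\\.?)(?=.*car))"))

print(rbind(all_and, intended, single))
```

So there is no limit on the number of lookaheads; each additional one is simply another condition that the same string has to satisfy.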


damn you are right... time to go to bed I guess.... Thanks for helping out!!

