Weird behaviour of Regular Expressions

Hi,
I have a data frame with model names starting with the number 2, 3 or 6.

source <- data.frame(
  stringsAsFactors = FALSE,
            Model3 = c("1.2","1.2 C","2","2-1.3",
                       "2-1.4","2-1.5","2-GT-","2-WHI","2 (10","2 (11",
                       "2 .","2 ..","2 1","2 1.","2 1..","2 1.2","2 1.3",
                       "2 1.4","2 1.5","3 2-3","2 2 (","3 2 1","2 2 3","6 2 5",
                       "2 2 A","2 3DR","2 4DR","2 5","2 5-D","2 5 D",
                       "2 55","2 5DO","2 5DR","2 75P","2 76","2 90","2 90P",
                       "2.2TM","2.3","2DR C","2TS","3","3-1.6","3-2.0",
                       "3 (12","6","6-1.8","6-2..","6-2.0","6 (14","6 (15"),
             Score = c(6,1,30252,19,1,18,3,2,1,
                       10,4,1,1,21,1,128,23938,1660,39640,1,6,972,
                       41,1034,5,2644,2,1,5,175,2,28,54054,227,6,1,
                       1464,1225,1,224,2,25922,26,42,183,30808,4,10,
                       14,9,41)
)

source

I want to recode Model3 into ModelCat, where model names starting with 2 should be coded as '2', those starting with 3 as '3' and those starting with 6 as '6'. I use this code:

library(dplyr)
result <- source %>% 
  mutate(Model3=as.character(Model3)) %>%
  mutate(ModelCat = case_when(
    grepl(x = Model3, pattern = '^20|^21|^23|^25', ignore.case = TRUE) ~ 'Other',
    grepl(x = Model3, pattern = '^2.1|^2.2|^2.3|^2.4|^2.5|^2.6', ignore.case = TRUE) ~ 'Other',
    grepl(x = Model3, pattern = '^2|^2\\s|^2-', ignore.case = TRUE) ~ '2',
    grepl(x = Model3, pattern = '2\\s1.2|2\\s1.3|2\\s1.4|2\\s1.5|2\\s1.6|2\\s2.7|2\\s2.8|2\\s3dr|2\\s5dr', ignore.case = TRUE) ~ '2',
    grepl(x = Model3, pattern = '^30|^32|^33|^35', ignore.case = TRUE) ~ 'Other',
    grepl(x = Model3, pattern = '^3|^3\\s|^3-', ignore.case = TRUE) ~ '3',
    grepl(x = Model3, pattern = '^62|^65', ignore.case = TRUE) ~ 'Other',
    grepl(x = Model3, pattern = '^6|^6\\s|^6-', ignore.case = TRUE) ~ '6',
    TRUE ~ "Other"
  ))

result

I have tried multiple options (that is why the code is a bit messy) but, for some reason, 3s and 6s are properly recoded whereas I have problems with recoding 2s. I really do not know why. Can anyone help?

You haven't told us exactly what the problem is.
You say you are satisfied with every allocation into 3 and 6, and that everything else is 2 or Other, but that part is sometimes 'wrong'.
You seem to have taken care to convert all sorts of numbers that would otherwise be classed as 2s to 'Other'.
Your first two grepl calls do exactly that. If you remove them, those 'Other' rows become '2'.

Consider that if we only have your non-working code, we have no more insight into your intentions than what you tell us, so please describe your problem as clearly as you can.

best of luck.

Hi, sorry. The issue is with names of the form "2", then a space, then a digit, which are allocated as "Other". They should be "2". This works for 3s and 6s but not for 2s, so "2 1", "2 1.", ..., "2 5DR" should be "2", but they are not, regardless of my efforts.

Also, the first two grepl calls exclude names like 2021, 210, 2.1, 2.2 etc., as I have these in my real data file.
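
One likely explanation, offered as a sketch rather than a confirmed diagnosis: in a regular expression an unescaped `.` matches any single character, so `^2.1` also matches "2 1" and "2-1". The 3 and 6 rules have no comparable dotted pattern, which would explain why only the 2s are affected. The sample vector `x` below is made up for illustration.

```r
x <- c("2 1", "2-1.3", "2 5DR", "2.3", "2021")

# Unescaped: "." matches any single character, including a space or a hyphen.
unescaped <- grepl("^2.1|^2.2|^2.3|^2.4|^2.5|^2.6", x)

# Escaped: "\\." matches only a literal dot.
escaped <- grepl("^2\\.1|^2\\.2|^2\\.3|^2\\.4|^2\\.5|^2\\.6", x)

data.frame(x, unescaped, escaped)
```

With the unescaped version every entry in `x` matches (so all would become 'Other' in the case_when above); with the escaped version only "2.3" matches.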

To match an entry whose first character is 2, followed by a space and then a digit, the pattern would be ^2\\s\\d

Does this help?

result <- source %>% 
  mutate(Model3=as.character(Model3)) %>%
  mutate(ModelCata = case_when(
    grepl(x = Model3, pattern = '^20|^21|^23|^25', ignore.case = TRUE) ~ 'Other',
    grepl(x = Model3, pattern = '^2.1|^2.2|^2.3|^2.4|^2.5|^2.6', ignore.case = TRUE) ~ 'Other',
    grepl(x = Model3, pattern = '^2|^2\\s|^2-', ignore.case = TRUE) ~ '2',
    grepl(x = Model3, pattern = '2\\s1.2|2\\s1.3|2\\s1.4|2\\s1.5|2\\s1.6|2\\s2.7|2\\s2.8|2\\s3dr|2\\s5dr', ignore.case = TRUE) ~ '2',
    grepl(x = Model3, pattern = '^30|^32|^33|^35', ignore.case = TRUE) ~ 'Other',
    grepl(x = Model3, pattern = '^3|^3\\s|^3-', ignore.case = TRUE) ~ '3',
    grepl(x = Model3, pattern = '^62|^65', ignore.case = TRUE) ~ 'Other',
    grepl(x = Model3, pattern = '^6|^6\\s|^6-', ignore.case = TRUE) ~ '6',
    TRUE ~ "Other"), 
    
  ModelCatb = case_when(
    grepl(x = Model3, pattern = '^2\\s\\d', ignore.case = TRUE) ~ '2',
    grepl(x = Model3, pattern = '^20|^21|^23|^25', ignore.case = TRUE) ~ 'Other',
    grepl(x = Model3, pattern = '^2.1|^2.2|^2.3|^2.4|^2.5|^2.6', ignore.case = TRUE) ~ 'Other',
    grepl(x = Model3, pattern = '^2|^2\\s|^2-', ignore.case = TRUE) ~ '2',
    grepl(x = Model3, pattern = '2\\s1.2|2\\s1.3|2\\s1.4|2\\s1.5|2\\s1.6|2\\s2.7|2\\s2.8|2\\s3dr|2\\s5dr', ignore.case = TRUE) ~ '2',
    grepl(x = Model3, pattern = '^30|^32|^33|^35', ignore.case = TRUE) ~ 'Other',
    grepl(x = Model3, pattern = '^3|^3\\s|^3-', ignore.case = TRUE) ~ '3',
    grepl(x = Model3, pattern = '^62|^65', ignore.case = TRUE) ~ 'Other',
    grepl(x = Model3, pattern = '^6|^6\\s|^6-', ignore.case = TRUE) ~ '6',
    TRUE ~ "Other"
  )) 

result %>% filter(ModelCata != ModelCatb)
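
The `^2\\s\\d` pattern used as the first rule of ModelCatb can be checked in isolation. This is a small sketch on a made-up vector `x`, not part of the reply's code:

```r
x <- c("2 1", "2 1.2", "2 5DR", "2 (10", "2-1.3", "2.3", "2021", "3 2 1")

# TRUE only where the string starts with "2", then whitespace, then a digit.
starts_2_space_digit <- grepl("^2\\s\\d", x)

data.frame(x, starts_2_space_digit)
```

Because case_when() uses the first rule that matches, placing this rule first assigns these names to '2' before the later 'Other' rules are tried. Names such as "2-1.3" are not matched by it, so they still fall through to the later rules.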

Thank you.
Can these be replaced by something similar:

grepl(x = Model3, pattern = '^2.1|^2.2|^2.3|^2.4|^2.5|^2.6', ignore.case = TRUE) ~ 'Other', 

grepl(x = Model3, pattern = '2\\s1.2|2\\s1.3|2\\s1.4|2\\s1.5|2\\s1.6|2\\s2.7|2\\s2.8|2\\s3dr|2\\s5dr', ignore.case = TRUE) ~ '2',

Basically, all records starting with 2 followed by a dot should be ignored, and names like "2 1.2" or "2 1.3" should be 2s.

Also, names starting with "2-", such as "2-1.3", are not picked up as 2 (but the equivalent names are picked up for 3 and 6).
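
A possible way to shorten the two long patterns is to use character classes and escape the dot. This is a minimal sketch assuming every `.` in the original patterns was meant as a literal dot. The names `pat_dotted`, `pat_spaced` and `recode_model` are hypothetical, and the helper assumes that only names starting with 2, 3 or 6 are of interest.

```r
# Compact forms of the two long patterns (assumes "." was meant literally).
pat_dotted <- "^2\\.[1-6]"                       # "2.1" ... "2.6"; "^2\\." would cover any dot
pat_spaced <- "2\\s(1\\.[2-6]|2\\.[78]|[35]dr)"  # "2 1.2", "2 2.7", "2 3DR", ...

# Hypothetical helper: exclusions first, then the leading digit decides.
recode_model <- function(x) {
  out <- rep("Other", length(x))
  first <- substr(x, 1, 1)
  # Two-digit prefixes from the original question, plus the dotted 2.x names.
  excluded <- grepl("^2[0135]|^3[0235]|^6[25]", x) | grepl(pat_dotted, x)
  keep <- first %in% c("2", "3", "6") & !excluded
  out[keep] <- first[keep]
  out
}

x <- c("2", "2-1.3", "2 1", "2 1.2", "2 5DR", "2.3", "2.2TM",
       "2021", "210", "3-1.6", "6 (14", "1.2")
recoded <- recode_model(x)
data.frame(x, recoded)

# The spaced pattern on its own; ignore.case = TRUE lets "dr" match "DR".
spaced_hits <- grepl(pat_spaced, c("2 1.2", "2 2.7", "2 3DR", "2 1", "2 4DR"),
                     ignore.case = TRUE)
```

Note that `'^2|^2\\s|^2-'` is equivalent to plain `'^2'`, since the first alternative already matches anything the other two would. For names that start with 2, the spaced pattern is therefore already covered once the dotted exclusion no longer swallows them.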

source <- data.frame(
  stringsAsFactors = FALSE,
  Model3 = c("1.2","1.2 C","2","2-1.3",
             "2-1.4","2-1.5","2-GT-","2-WHI","2 (10","2 (11",
             "2 .","2 ..","2 1","2 1.","2 1..","2 1.2","2 1.3",
             "2 1.4","2 1.5","3 2-3","2 2 (","3 2 1","2 2 3","6 2 5",
             "2 2 A","2 3DR","2 4DR","2 5","2 5-D","2 5 D",
             "2 55","2 5DO","2 5DR","2 75P","2 76","2 90","2 90P",
             "2.2TM","2.3","2DR C","2TS","3","3-1.6","3-2.0",
             "3 (12","6","6-1.8","6-2..","6-2.0","6 (14","6 (15"),
  Score = c(6,1,30252,19,1,18,3,2,1,
            10,4,1,1,21,1,128,23938,1660,39640,1,6,972,
            41,1034,5,2644,2,1,5,175,2,28,54054,227,6,1,
            1464,1225,1,224,2,25922,26,42,183,30808,4,10,
            14,9,41)
)

pick_models <- function(x) {# provided on request with a complete `reprex`} and a data frame of the desired results if different; the problem is underspecified.
pick_models(source$Model3)
#>  [1] "2-1.3" "2-1.4" "2-1.5" "2-GT-" "2-WHI" "2 (10" "2 (11" "2 ."   "2 .." 
#> [10] "2 1"   "2 1."  "2 1.." "2 1.2" "2 1.3" "2 1.4" "2 1.5" "2 2 (" "2 2 3"
#> [19] "2 2 A" "2 3DR" "2 4DR" "2 5"   "2 5-D" "2 5 D" "2 55"  "2 5DO" "2 5DR"
#> [28] "2 75P" "2 76"  "2 90"  "2 90P" "2DR C" "2TS"   "2"     "2-1.3" "2-1.4"
#> [37] "2-1.5" "2-GT-" "2-WHI" "2 (10" "2 (11" "2 ."   "2 .."  "2 1"   "2 1." 
#> [46] "2 1.." "2 1.2" "2 1.3" "2 1.4" "2 1.5" "3 2-3" "2 2 (" "3 2 1" "2 2 3"
#> [55] "6 2 5" "2 2 A" "2 3DR" "2 4DR" "2 5"   "2 5-D" "2 5 D" "2 55"  "2 5DO"
#> [64] "2 5DR" "2 75P" "2 76"  "2 90"  "2 90P" "2.2TM" "2.3"   "2DR C" "2TS"  
#> [73] "3"     "3-1.6" "3-2.0" "3 (12" "6"     "6-1.8" "6-2.." "6-2.0" "6 (14"
#> [82] "6 (15"

Created on 2023-01-06 with reprex v2.0.2

Hi @Slavek,
My approach was to separate out the two key characters (first and second) from each string, and then use simple ifelse() to generate the ModelCat column. This, hopefully, avoids a lot of (potentially) error-prone grepping.

suppressPackageStartupMessages(library(tidyverse))

source <- data.frame(
  stringsAsFactors = FALSE,
  Model3 = c("1.2","1.2 C","2","2-1.3",
             "2-1.4","2-1.5","2-GT-","2-WHI","2 (10","2 (11",
             "2 .","2 ..","2 1","2 1.","2 1..","2 1.2","2 1.3",
             "2 1.4","2 1.5","3 2-3","2 2 (","3 2 1","2 2 3","6 2 5",
             "2 2 A","2 3DR","2 4DR","2 5","2 5-D","2 5 D",
             "2 55","2 5DO","2 5DR","2 75P","2 76","2 90","2 90P",
             "2.2TM","2.3","2DR C","2TS","3","3-1.6","3-2.0",
             "3 (12","6","6-1.8","6-2..","6-2.0","6 (14","6 (15"),
  Score = c(6,1,30252,19,1,18,3,2,1,
            10,4,1,1,21,1,128,23938,1660,39640,1,6,972,
            41,1034,5,2644,2,1,5,175,2,28,54054,227,6,1,
            1464,1225,1,224,2,25922,26,42,183,30808,4,10,
            14,9,41)
)

source %>% 
  mutate(char_1 = str_sub(Model3, start=1, end=1),
         char_2 = str_sub(Model3, start=2, end=2),
         ModelCat = ifelse(char_2 == ".", NA, char_1)) %>%
  drop_na(ModelCat)
#>    Model3 Score char_1 char_2 ModelCat
#> 1       2 30252      2               2
#> 2   2-1.3    19      2      -        2
#> 3   2-1.4     1      2      -        2
#> 4   2-1.5    18      2      -        2
#> 5   2-GT-     3      2      -        2
#> 6   2-WHI     2      2      -        2
#> 7   2 (10     1      2               2
#> 8   2 (11    10      2               2
#> 9     2 .     4      2               2
#> 10   2 ..     1      2               2
#> 11    2 1     1      2               2
#> 12   2 1.    21      2               2
#> 13  2 1..     1      2               2
#> 14  2 1.2   128      2               2
#> 15  2 1.3 23938      2               2
#> 16  2 1.4  1660      2               2
#> 17  2 1.5 39640      2               2
#> 18  3 2-3     1      3               3
#> 19  2 2 (     6      2               2
#> 20  3 2 1   972      3               3
#> 21  2 2 3    41      2               2
#> 22  6 2 5  1034      6               6
#> 23  2 2 A     5      2               2
#> 24  2 3DR  2644      2               2
#> 25  2 4DR     2      2               2
#> 26    2 5     1      2               2
#> 27  2 5-D     5      2               2
#> 28  2 5 D   175      2               2
#> 29   2 55     2      2               2
#> 30  2 5DO    28      2               2
#> 31  2 5DR 54054      2               2
#> 32  2 75P   227      2               2
#> 33   2 76     6      2               2
#> 34   2 90     1      2               2
#> 35  2 90P  1464      2               2
#> 36  2DR C   224      2      D        2
#> 37    2TS     2      2      T        2
#> 38      3 25922      3               3
#> 39  3-1.6    26      3      -        3
#> 40  3-2.0    42      3      -        3
#> 41  3 (12   183      3               3
#> 42      6 30808      6               6
#> 43  6-1.8     4      6      -        6
#> 44  6-2..    10      6      -        6
#> 45  6-2.0    14      6      -        6
#> 46  6 (14     9      6               6
#> 47  6 (15    41      6               6

Created on 2023-01-08 with reprex v2.0.2
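
The two-character idea above does not by itself exclude names such as 2021 or 210 that were mentioned earlier in the thread. A base R sketch of one way to extend it follows; the `excluded` prefixes are taken from the patterns in the original question, while the vector `x` and the restriction to leading 2, 3 or 6 are assumptions made for illustration.

```r
x <- c("2", "2-1.3", "2 1.2", "2.3", "2DR C", "2021", "25",
       "3", "6 (14", "1.2", "1 2")

char_1 <- substr(x, 1, 1)
char_2 <- substr(x, 2, 2)
prefix <- substr(x, 1, 2)

# Two-character prefixes that the original question recodes as 'Other'.
excluded <- c("20", "21", "23", "25", "30", "32", "33", "35", "62", "65")

# Keep the first character only for 2/3/6 names with no dot and no excluded prefix.
model_cat <- ifelse(char_1 %in% c("2", "3", "6") &
                      char_2 != "." &
                      !(prefix %in% excluded),
                    char_1, NA)

data.frame(x, model_cat)
```

Rows with `NA` can then be dropped or relabelled as 'Other', depending on what the real data needs.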