Issues with regular expressions

Hi,
I hope I have improved my regular expression skills a bit, but I still have issues with some patterns (like brackets).
I have the following df with Model names, and I would like to recode this list into ModelCat:

source <- 
  data.frame(
    stringsAsFactors = FALSE,
    Resp = c("aaa",
             "bbb","ccc","ddd","eee","fff","ggg","hhh",
             "iii","jjj","kkk","lll","mmm","nnn","ooo","ppp","qqq"),
    Model = c("3008",
              "3008 (2016)","308","308 (2013)",
              "3008 Hybride Diesel","207","3008 Hatchback","Crossland x",
              "crossland","corsa","corsa-e","corsa e",
              "4007","New c4","c4","corsa 307 electric","crossland diesel hatchback")
  )
source

library(dplyr)
result <- source %>% 
  mutate(ModelCat = case_when(
    grepl(x = Model, pattern = '308\\s(2013)', ignore.case = TRUE) ~ '308 (2013)',
    grepl(x = Model, pattern = '308', ignore.case = TRUE) ~ '308',
    grepl(x = Model, pattern = '2008', ignore.case = TRUE) ~ 'Peugeot 2008',
    grepl(x = Model, pattern = '3008\\shybride', ignore.case = TRUE) ~ 'Other',
    grepl(x = Model, pattern = '3008\\shatchback', ignore.case = TRUE) ~ 'Other',
    grepl(x = Model, pattern = '3008\\s(2016)', ignore.case = TRUE) ~ 'Peugeot 3008 (2016)',
    grepl(x = Model, pattern = '3008', ignore.case = TRUE) ~ 'Peugeot 3008',
    grepl(x = Model, pattern = 'Corsa-e\\sELECTRIC\\sHATCHBACK', ignore.case = TRUE) ~ 'Vauxhall Corsa E',
    grepl(x = Model, pattern = 'Corsa-e', ignore.case = TRUE) ~ 'Vauxhall Corsa E',
    grepl(x = Model, pattern = 'Corsa\\sE', ignore.case = TRUE) ~ 'Vauxhall Corsa E',
    grepl(x = Model, pattern = 'Corsa', ignore.case = TRUE) ~ 'Vauxhall Corsa',
    grepl(x = Model, pattern = 'Nuovo\\sC4|NEUER\\sC4|New\\sC4|Nuevo\\sC4|Nuova\\sC4|NV\\sC4|C4\\sNeu|C4\\sNlle', ignore.case = TRUE) ~ 'New c4',
    grepl(x = Model, pattern = 'Crossland\\sHatchback', ignore.case = TRUE) ~ 'Other',
    grepl(x = Model, pattern = 'Crossland\\sX\\sHatchback', ignore.case = TRUE) ~ 'Other',
    grepl(x = Model, pattern = 'Crossland\\sX', ignore.case = TRUE) ~ 'Other',
    grepl(x = Model, pattern = 'Crossland', ignore.case = TRUE) ~ 'Crossland',
    TRUE ~ "Other"
  ))
result

My objective is to:

  1. keep Peugeot 3008 and Peugeot 3008 (2016) and recode all other versions of 3008 into "Other".
  2. keep Peugeot 308 and 308 (2013).
  3. simplify the recoding of models with "New" in multiple languages (New c4), electric models such as Corsa (names containing "-e", " E" or "Electric"), and models whose names contain extra words (keep Crossland, but not crossland x, crossland hatchback or crossland diesel hatchback).

Can you help please?

Do you actually mean "regular expressions"? If so, please correct your topic title; if not, can you clarify what you mean by "general impressions"?

Sorry. I must have been tired :joy: writing this post.
Of course I meant "regular expressions".

I have tried multiple options (incl. wildcards) but brackets are very tricky! I cannot get my head around them :frowning:
Maybe I could simply replace problematic brackets (and other weird characters) from Model information prior to my recoding?

Even this post, "Demystifying Regular Expressions in R" (Rsquared Academy Blog), did not solve my problem...

Maybe I could add this:

library(stringr)  # str_replace_all() comes from stringr
source$Model <- str_replace_all(source$Model, "[()]", "")  # drop both "(" and ")"

and then fix regular expressions with years?

There is no point in removing the parentheses; you can match a four-digit number enclosed in parentheses with this regex: \\(\\d{4}\\)
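
As a minimal sketch of that suggestion, the escaped pattern can be tried on its own before putting it into case_when(). The vector `models` is made up for illustration and simply copies a few values from the Model column in the question:

```r
# Illustrative values taken from the question's Model column
models <- c("3008", "3008 (2016)", "308", "308 (2013)")

# "\\(" and "\\)" match literal parentheses; "\\d{4}" matches exactly four digits
has_year <- grepl(pattern = "\\(\\d{4}\\)", x = models)

# has_year is FALSE, TRUE, FALSE, TRUE
print(has_year)
```

Without the backslashes, "(" and ")" are grouping operators rather than literal characters, which is why the original patterns never matched the bracketed years.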

Thank you but it is not working. What am I doing wrong?

library(dplyr)
result <- source %>% 
  mutate(ModelCat = case_when(
    grepl(x = Model, pattern = '308\\(\\d{4}\\)', ignore.case = TRUE) ~ '308 (2013)',
    grepl(x = Model, pattern = '308', ignore.case = TRUE) ~ '308',
    grepl(x = Model, pattern = '2008', ignore.case = TRUE) ~ 'Peugeot 2008',
    grepl(x = Model, pattern = '3008\\shybride', ignore.case = TRUE) ~ 'Other',
    grepl(x = Model, pattern = '3008\\shatchback', ignore.case = TRUE) ~ 'Other',
    grepl(x = Model, pattern = '3008\\(\\d{4}\\)', ignore.case = TRUE) ~ 'Peugeot 3008 (2016)',
    grepl(x = Model, pattern = '3008', ignore.case = TRUE) ~ 'Peugeot 3008',
    grepl(x = Model, pattern = 'Corsa-e\\sELECTRIC\\sHATCHBACK', ignore.case = TRUE) ~ 'Vauxhall Corsa E',
    grepl(x = Model, pattern = 'Corsa-e', ignore.case = TRUE) ~ 'Vauxhall Corsa E',
    grepl(x = Model, pattern = 'Corsa\\sE', ignore.case = TRUE) ~ 'Vauxhall Corsa E',
    grepl(x = Model, pattern = 'Corsa', ignore.case = TRUE) ~ 'Vauxhall Corsa',
    grepl(x = Model, pattern = 'Nuovo\\sC4|NEUER\\sC4|New\\sC4|Nuevo\\sC4|Nuova\\sC4|NV\\sC4|C4\\sNeu|C4\\sNlle', ignore.case = TRUE) ~ 'New c4',
    grepl(x = Model, pattern = 'Crossland\\sHatchback', ignore.case = TRUE) ~ 'Other',
    grepl(x = Model, pattern = 'Crossland\\sX\\sHatchback', ignore.case = TRUE) ~ 'Other',
    grepl(x = Model, pattern = 'Crossland\\sX', ignore.case = TRUE) ~ 'Other',
    grepl(x = Model, pattern = 'Crossland', ignore.case = TRUE) ~ 'Crossland',
    TRUE ~ "Other"
  ))
result

Can you help?
Is it easy to improve other elements from my code?
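
On the question of improving the other patterns, one possible way to shorten them is to merge near-duplicate alternatives with a character class or a group. This is only a sketch: the vector `models` and the names `is_corsa_e` and `is_new_c4` are made up for illustration, and the patterns are assumed to cover the same spellings as the original code, nothing more.

```r
# Illustrative values taken from the question's Model column
models <- c("corsa", "corsa-e", "corsa e", "New c4", "c4")

# "[- ]" matches either a hyphen or a space, so this covers "corsa-e" and "corsa e"
is_corsa_e <- grepl("corsa[- ]e", models, ignore.case = TRUE)

# Group the translated words for "new" instead of repeating "\\sC4" each time
new_c4_pattern <- "(nuovo|neuer|new|nuevo|nuova|nv)\\sc4|c4\\s(neu|nlle)"
is_new_c4 <- grepl(new_c4_pattern, models, ignore.case = TRUE)

print(data.frame(models, is_corsa_e, is_new_c4))
```

Because case_when() uses the first condition that is TRUE, the more specific pattern (Corsa E) still has to come before the general one (Corsa), as in the original code.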

... add a space...

  mutate(ModelCat = case_when(
    grepl(x = Model, pattern = '308 \\(\\d{4}\\)', ignore.case = TRUE) ~ '308 (2013)',
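
To see why the space matters: the values in the data have a space between the model number and the opening parenthesis, so a pattern without it cannot match. A small comparison of the three attempts from this thread, using two values from the question (the variable names are made up for illustration):

```r
x <- c("308 (2013)", "3008 (2016)")

# Unescaped parentheses form a group, so this looks for the text "308 2013"
no_escape <- grepl("308\\s(2013)", x)

# Escaped parentheses but no space: "308" would have to be directly followed by "("
no_space <- grepl("308\\(\\d{4}\\)", x)

# Escaped parentheses and a space: matches "308 (2013)" but not "3008 (2016)"
with_space <- grepl("308 \\(\\d{4}\\)", x)

print(rbind(no_escape, no_space, with_space))
```

Only the last pattern returns TRUE for "308 (2013)". The same change (a space or "\\s" before "\\(") would apply to the 3008 line.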

Thank you. Now I need to simplify other codes.
Can we use "pattern !=" in grepl?
Something like pattern=="Crossland" but pattern!="Crossland\s?"
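
As far as I know, grepl() has no "pattern !=" argument, but the same effect can be had in a few ways. The sketch below shows three options on made-up values copied from the question; which one fits depends on what else appears in the real Model column, so treat it as a starting point rather than a definitive answer.

```r
# Illustrative values taken from the question's Model column
models <- c("crossland", "Crossland x", "crossland diesel hatchback")

# Option 1: anchors, so the whole string must be exactly "crossland"
exact <- grepl("^crossland$", models, ignore.case = TRUE)

# Option 2: combine a positive match with a negated one using !grepl()
not_variant <- grepl("crossland", models, ignore.case = TRUE) &
  !grepl("crossland\\s+\\S", models, ignore.case = TRUE)

# Option 3: negative lookahead (needs perl = TRUE): "crossland" not followed by whitespace
lookahead <- grepl("crossland(?!\\s)", models, ignore.case = TRUE, perl = TRUE)

print(data.frame(models, exact, not_variant, lookahead))
```

All three return TRUE only for the plain "crossland" entry here. Inside case_when(), any of them could replace the separate "Crossland X" and "Crossland Hatchback" lines, since everything that is not matched falls through to "Other".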
