Removing specific phrases from string values

Slavek · September 16, 2021, 6:51am

Hi,
I have this list of train stations:

source <- data.frame(
  stringsAsFactors = FALSE,
                                Station = c("Acton Central",
                                            "Aldgate East","Allens West",
                                            "Armadale (West Lothian)","Bicester North",
                                            "Birkenhead Central","Birkenhead North",
                                            "Blackpool North",
                                            "Bloxwich North","Box Hill and Westhumble",
                                            "Bramley (West Yorkshire)",
                                            "Bromley North","Burnley Central",
                                            "Cambridge North","Canterbury East",
                                            "Canterbury West","Cardiff Central","Centrale"),
                               PostCode = c("W","E","TS","EH",
                                            "MK","CH","CH","FY","WS","KT",
                                            "BD","BR","BB","CB","CT","CT",
                                            "CF","CR")
                     )

source

and I would like to generate towns they are in by removing North, South, West, East or Central from their names with the exception of "Centrale" and stations with locations in brackets.

As a result I would need this:

result <- data.frame(
  stringsAsFactors = FALSE,
  Station = c("Acton Central",
              "Aldgate East","Allens West",
              "Armadale (West Lothian)","Bicester North",
              "Birkenhead Central","Birkenhead North",
              "Blackpool North",
              "Bloxwich North","Box Hill and Westhumble",
              "Bramley (West Yorkshire)",
              "Bromley North","Burnley Central",
              "Cambridge North","Canterbury East",
              "Canterbury West","Cardiff Central","Centrale"),
  PostCode = c("W","E","TS","EH",
               "MK","CH","CH","FY","WS","KT",
               "BD","BR","BB","CB","CT","CT",
               "CF","CR"),
  Town = c("Acton","Aldgate",
           "Allens","Armadale (West Lothian)",
           "Bicester","Birkenhead","Birkenhead",
           "Blackpool","Bloxwich",
           "Box Hill and Westhumble",
           "Bramley (West Yorkshire)","Bromley","Burnley",
           "Cambridge","Canterbury","Canterbury",
           "Cardiff","Centrale")
)

result

Shall I use a modified version of something like that to remove the words?

blank_statements <- regex("^(North|South|West|East|Central)$", ignore_case = TRUE)

result <- source %>% 
  mutate(across(contains("Station"), ~str_trim(.x))) %>%
  mutate(across(contains("Station"), ~str_remove_all(.x, blank_statements))
  )

How can I keep the exceptions?

Can you help?

samer_elsheikh · September 16, 2021, 7:50am

Hi,
your regex is not working the ^ and $ are limiting search to start AND end of string, so it finds no matches. To look for whole words use \b instead

blank_statements <- regex("\\b(North|South|West|East|Central)\\b", ignore_case = TRUE)

result <- source %>% 
    mutate(
        across(contains("Station"), ~str_trim(.x)),
        across(contains("Station"), ~str_remove_all(.x, "\\b(North|South|West|East|Central)\\b"))
    )


result

Slavek · September 16, 2021, 8:49am

Thank you but still no effect...
Also, I prefer creating a new variable called Town...
I forgot to add one more exception "Box Hill and Westhumble"...

HanOostdijk · September 16, 2021, 10:15am

@Slavek

what do you mean by "no effect" ?
The solution by @samer_elsheikh works for me.
It also handles well your additional exception.

Edit:
mutate(Town = str_remove_all(Station, blank_statements))
creates the new variable 'Town'

samer_elsheikh · September 16, 2021, 11:26am

mutate across() returns the same column name by default. Either assign new names as @HanOostdijk described or if you have multiple columns (i. e. you have to use across) then you can use the .names = argument.

source %>% 
  mutate(across(contains("Station"), ~str_trim(.x))) %>%
  mutate(across(contains("Station"), ~str_remove_all(.x, blank_statements), .names = "{.col}_Town")
  )

Slavek · September 16, 2021, 12:42pm

Yes, sorry. I had an issue with my R.
It is working
Can you advise how I could fix the exceptions:
Armadale (West Lothian)
Bramley (West Yorkshire)
Can we, somehow, exclude names in brackets?

xvalda · September 16, 2021, 3:27pm

@Slavek

I hacked a fairly inelegant solution, but it works on your reprex. (it assumes that text in brackets is always at the end of each string across the whole dataset, I also modified your original code and this won't run on multiple columns).
I'll think on a better (shorter) way.

blank_statements <- regex("\\(.*\\)|\\b(North|South|West|East|Central)\\b", ignore_case = TRUE)
source %>% mutate(brackets = str_extract(Station, "\\(.*\\)"), 
                  brackets = case_when(is.na(brackets) ~ "", TRUE ~ brackets), 
                  Town = str_remove(Station, blank_statements), 
                  Town = str_c(Town, brackets), 
                  Town = str_squish(Town)) %>% 
  select(-brackets)

xvalda · September 16, 2021, 4:41pm

Actually better and more robust, you can use this (building from the former contributions):

blank_statements <- regex("[^\\(.*\\)]\\b(North|South|West|East|Central)\\b", ignore_case = TRUE)
source %>% 
  mutate(across(contains("Station"), ~str_trim(.x))) %>%
  mutate(across(contains("Station"), ~str_remove_all(.x, blank_statements), .names = "{.col}_Town")
  )

Slavek · September 17, 2021, 9:04am

Perfect! I would add another trim in the end:

source %>% 
  mutate(across(contains("Station"), ~str_trim(.x))) %>%
  mutate(across(contains("Station"), ~str_remove_all(.x, blank_statements), .names = "{.col}_Town")) %>% 
  mutate(across(contains("Station"), ~str_trim(.x)
  ))

otherwise, perfect solution

Thank you.

xvalda · September 17, 2021, 9:23am

Actually I don't think you need the first str_trim.
And instead of trim, rather use str_squish(.x), like trim it removes first and last white space, but also removes when there are more than two consecutive blanks in the middle of the string.

kenbutler · September 21, 2021, 12:58am

I think this is too difficult a problem to solve in this way with any generality. For example, Allens West station is not in the west of a place called Allens, and there is too much hand-coding of exceptions for this to generalize to a longer list of stations.

I would first find the postcodes of the stations, and then try to find the towns/cities in which those postcodes are. This is sort-of like geocoding. There are also reverse geocoders that take a lat and a long and return what town-city-county they are in. I think this is a lot more direct than what you are trying to do. There are also any number of stations whose location is nothing to do with their name (Allens West, again, is in Stockton, and stations with an And in them are usually nearer one than the other but serve both.

Slavek · September 21, 2021, 8:06pm

You are right!
Unfortunately, all I have in one data set is town/city variable. The other data set contains Station, Postcodearea (like AB) and Postcodedistrict (like AB1).
I gave up looking for information about towns and their postcodes so I thought I could link two databases using town information (even if extracted from Station name).
Do you know any way of linking Postcodearea (like AB) and Postcodedistrict (like AB1) with towns in the UK?

system · September 28, 2021, 8:07pm

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.