Removing specific phrases from string values

Hi,
I have this list of train stations:

source <- data.frame(
  stringsAsFactors = FALSE,
                                Station = c("Acton Central",
                                            "Aldgate East","Allens West",
                                            "Armadale (West Lothian)","Bicester North",
                                            "Birkenhead Central","Birkenhead North",
                                            "Blackpool North",
                                            "Bloxwich North","Box Hill and Westhumble",
                                            "Bramley (West Yorkshire)",
                                            "Bromley North","Burnley Central",
                                            "Cambridge North","Canterbury East",
                                            "Canterbury West","Cardiff Central","Centrale"),
                               PostCode = c("W","E","TS","EH",
                                            "MK","CH","CH","FY","WS","KT",
                                            "BD","BR","BB","CB","CT","CT",
                                            "CF","CR")
                     )

source

and I would like to generate towns they are in by removing North, South, West, East or Central from their names with the exception of "Centrale" and stations with locations in brackets.

As a result I would need this:

result <- data.frame(
  stringsAsFactors = FALSE,
  Station = c("Acton Central",
              "Aldgate East","Allens West",
              "Armadale (West Lothian)","Bicester North",
              "Birkenhead Central","Birkenhead North",
              "Blackpool North",
              "Bloxwich North","Box Hill and Westhumble",
              "Bramley (West Yorkshire)",
              "Bromley North","Burnley Central",
              "Cambridge North","Canterbury East",
              "Canterbury West","Cardiff Central","Centrale"),
  PostCode = c("W","E","TS","EH",
               "MK","CH","CH","FY","WS","KT",
               "BD","BR","BB","CB","CT","CT",
               "CF","CR"),
  Town = c("Acton","Aldgate",
           "Allens","Armadale (West Lothian)",
           "Bicester","Birkenhead","Birkenhead",
           "Blackpool","Bloxwich",
           "Box Hill and Westhumble",
           "Bramley (West Yorkshire)","Bromley","Burnley",
           "Cambridge","Canterbury","Canterbury",
           "Cardiff","Centrale")
)

result

Shall I use a modified version of something like that to remove the words?

blank_statements <- regex("^(North|South|West|East|Central)$", ignore_case = TRUE)

result <- source %>% 
  mutate(across(contains("Station"), ~str_trim(.x))) %>%
  mutate(across(contains("Station"), ~str_remove_all(.x, blank_statements))
  )

How can I keep the exceptions?

Can you help?

Hi,
your regex is not working the ^ and $ are limiting search to start AND end of string, so it finds no matches. To look for whole words use \b instead

blank_statements <- regex("\\b(North|South|West|East|Central)\\b", ignore_case = TRUE)

result <- source %>% 
    mutate(
        across(contains("Station"), ~str_trim(.x)),
        across(contains("Station"), ~str_remove_all(.x, "\\b(North|South|West|East|Central)\\b"))
    )


result

Thank you but still no effect...
Also, I prefer creating a new variable called Town...
I forgot to add one more exception "Box Hill and Westhumble"...

@Slavek

what do you mean by "no effect" ?
The solution by @samer_elsheikh works for me.
It also handles well your additional exception.

Edit:
mutate(Town = str_remove_all(Station, blank_statements))
creates the new variable 'Town'

mutate across() returns the same column name by default. Either assign new names as @HanOostdijk described or if you have multiple columns (i. e. you have to use across) then you can use the .names = argument.

source %>% 
  mutate(across(contains("Station"), ~str_trim(.x))) %>%
  mutate(across(contains("Station"), ~str_remove_all(.x, blank_statements), .names = "{.col}_Town")
  )

Yes, sorry. I had an issue with my R.
It is working :slight_smile:
Can you advise how I could fix the exceptions:
Armadale (West Lothian)
Bramley (West Yorkshire)
Can we, somehow, exclude names in brackets?

@Slavek

I hacked a fairly inelegant solution, but it works on your reprex. (it assumes that text in brackets is always at the end of each string across the whole dataset, I also modified your original code and this won't run on multiple columns).
I'll think on a better (shorter) way.

blank_statements <- regex("\\(.*\\)|\\b(North|South|West|East|Central)\\b", ignore_case = TRUE)
source %>% mutate(brackets = str_extract(Station, "\\(.*\\)"), 
                  brackets = case_when(is.na(brackets) ~ "", TRUE ~ brackets), 
                  Town = str_remove(Station, blank_statements), 
                  Town = str_c(Town, brackets), 
                  Town = str_squish(Town)) %>% 
  select(-brackets)

Actually better and more robust, you can use this (building from the former contributions):

blank_statements <- regex("[^\\(.*\\)]\\b(North|South|West|East|Central)\\b", ignore_case = TRUE)
source %>% 
  mutate(across(contains("Station"), ~str_trim(.x))) %>%
  mutate(across(contains("Station"), ~str_remove_all(.x, blank_statements), .names = "{.col}_Town")
  )
2 Likes

Perfect! I would add another trim in the end:

source %>% 
  mutate(across(contains("Station"), ~str_trim(.x))) %>%
  mutate(across(contains("Station"), ~str_remove_all(.x, blank_statements), .names = "{.col}_Town")) %>% 
  mutate(across(contains("Station"), ~str_trim(.x)
  ))

otherwise, perfect solution :grinning:

Thank you.

Actually I don't think you need the first str_trim.
And instead of trim, rather use str_squish(.x), like trim it removes first and last white space, but also removes when there are more than two consecutive blanks in the middle of the string.

1 Like

I think this is too difficult a problem to solve in this way with any generality. For example, Allens West station is not in the west of a place called Allens, and there is too much hand-coding of exceptions for this to generalize to a longer list of stations.

I would first find the postcodes of the stations, and then try to find the towns/cities in which those postcodes are. This is sort-of like geocoding. There are also reverse geocoders that take a lat and a long and return what town-city-county they are in. I think this is a lot more direct than what you are trying to do. There are also any number of stations whose location is nothing to do with their name (Allens West, again, is in Stockton, and stations with an And in them are usually nearer one than the other but serve both.

You are right!
Unfortunately, all I have in one data set is town/city variable. The other data set contains Station, Postcodearea (like AB) and Postcodedistrict (like AB1).
I gave up looking for information about towns and their postcodes so I thought I could link two databases using town information (even if extracted from Station name).
Do you know any way of linking Postcodearea (like AB) and Postcodedistrict (like AB1) with towns in the UK?

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.