Correcting string names by removing numbers and changing case

Hi,
I have imported my data to R but it has some errors. Some town info includes weird numbers:

source <- data.frame(
  stringsAsFactors = FALSE,
                                    URN = c("aaa","bb","cc","dd",
                                            "ee","ff","gg","hh","ii","jjj",
                                            "kkk","ll","mm","nn","oo","pp",
                                            "qq","rr"),
                                   Town = c("Aberdeen","ABERDEEN",
                                            "ABERDEEN      0498",
                                            "ABERGAVENNY    15656","Aberystwyth",
                                            "ABERYSTWYTH    15574","St.austell",
                                            "Aldershot","ALDERSHOT      10306",
                                            "Barrow-in-Furness","BARROWINFURNESS15405",
                                            "BRISTOL","Westonsupermare",
                                            "BRISTOL        15590",
                                            "BRISTOL     1390","Bristol","Burton-on-Trent",
                                            "BURTON-ON-TRENT15660")
                     )

Is it possible to remove all numbers (with prefix gaps if necessery) and changing all names to sentence case in the "Town" variable? I need to get this as a result:

result <- data.frame(
  stringsAsFactors = FALSE,
  URN = c("aaa","bb","cc","dd",
          "ee","ff","gg","hh","ii","jjj",
          "kkk","ll","mm","nn","oo","pp",
          "qq","rr"),
  Town = c("Aberdeen","ABERDEEN",
           "ABERDEEN      0498",
           "ABERGAVENNY    15656","Aberystwyth",
           "ABERYSTWYTH    15574","St.austell",
           "Aldershot","ALDERSHOT      10306",
           "Barrow-in-Furness","BARROWINFURNESS15405",
           "BRISTOL","Westonsupermare",
           "BRISTOL        15590",
           "BRISTOL     1390","Bristol","Burton-on-Trent",
           "BURTON-ON-TRENT15660"),
  CorrectTown = c("Aberdeen","Aberdeen",
                  "Aberdeen","Abergavenny",
                  "Aberystwyth","Aberystwyth","St Austell",
                  "Aldershot","Aldershot",
                  "Barrow-in-Furness","Barrow-in-Furness",
                  "Bristol","Weston-Super-Mare","Bristol","Bristol",
                  "Bristol","Burton-on-Trent",
                  "Burton-on-Trent")
)

How can I do that?
I believe, errors like "BARROWINFURNESS15405" should be changed to "Barrow-in-Furness" mannually (unless there is a clever way as there are many UK towns with a structure: name.with.3+characters"-in-"name.with.3+characters).

Can you help?

I am still struggling with this "trimming" request.
Is it possible to do that in R?

Hi Slavek - I find that a lot of these data cleaning things require running str_replace multiple times on the same data. I think you can get a sense of what would be involved by the following code...

library(dplyr)
library(stringr)

source %>%
	mutate(CorrectTown = str_replace(Town, "\\s*\\d+", ""),
		CorrectTown = str_replace(CorrectTown, "BARROWINFURNESS", "Barrow-in-furness"),
		CorrectTown = str_to_title(CorrectTown),
		CorrectTown = str_replace(CorrectTown, "-In-", "-in-"),
		CorrectTown = str_replace(CorrectTown, "-On-", "-on-")
	)
  • In the first mutate line, str_replace removes the spaces (zero or more) followed by digits
  • The second demonstrates the brute force approach
  • The third applies title case to the text. The one thing this doesn't solve is the problem of the "-On-" and "-In-" towns. Title case capitalizes the first letter of each word.
  • To solve that problem, we have the 4th and 5th lines. There may be a more elegant solution, but I couldn't figure it out

Note that the first str_replace operates on Town and then subsequent lines manipulate CorrectTown

Hope this helps! It's a messy process, but is a lot more fun than brute forcing a bunch of strings.
Pat

Hi there!

I went with a “lookup” approach, which may be complementary. It’s about flagging towns with potential alternative forms (case and special characters).
My assumption from the example data is that a same town may appear several times, sometimes in the correct form ("Barrow-in-furness") sometimes in the incorrect form ("BARROWINFURNESS").
The idea is to create a dataframe with two columns:

  • Names of towns with special characters (-, …)
  • Same name but “normalized”: lowercased and with special characters removed.

Then in our result df, I add a temporary column where I lowercase the town name, and if it matches a normalized name of the previous “lookup” df, then it returns a suggestion in a new column.

This approach is not failsafe, this is why I prefer to flag a possible suggestion in an extra column.

# create result2 df, removing digits and str_to_title
result2 <- source %>% 
  mutate(CorrectTown2 = str_remove(Town, "\\d+"), 
         CorrectTown2 = str_squish(str_to_title(CorrectTown2))
  )
# create a "lookup" df where only cities with non-word characters (such as Barrow-In-Furness) are shown
# + take unique (distinct) towns + lowercase + remove special characters
town_special_char <- result2 %>% 
  filter(str_detect(CorrectTown2, "\\W")) %>% 
  distinct(CorrectTown2) %>% 
  mutate(rm_special_char = str_remove_all(CorrectTown2, "\\W"), 
         rm_special_char = str_to_lower(rm_special_char)) %>% 
  rename(suggested_town = CorrectTown2)
# left join the two dfs to show suggested towns
result2 <- result2 %>% mutate(rm_special_char = str_to_lower(CorrectTown2)) %>% 
  left_join(town_special_char, by = "rm_special_char") %>% 
  select(-rm_special_char)
1 Like

And another tip in case it would help, you can download a list of UK towns & cities and with a similar lookup procedure be able to find the right name.
Here's an example of where you can find such data: Place name data for UK Towns, Cities and Villages with Counties — O'Brien Media Ltd

Last little note, the dataset I found online doesn't give unique identifiers like the postal code. It can be a problem for several towns with the same name but in different counties. For instance there 3 towns called "Well" in Linconshire, North Yorkshire and Hampshire.
I didn't see that info in your example data but maybe worth to keep researching a dataset that has postal codes as well.
But between the examples from Pat and a normalization and merging/joining approach, you're probably getting closer to your goal :slight_smile:

I know the problem is solved but I applied that to my real data set and I have to to extra corrections.
I am applying both solutions above using step 4:

result4 <- result3 %>%
  mutate(suggested_town = str_replace(CorrectTown2, "Barrowinfurness", "Barrow-In-Furness"),
         suggested_town = str_replace(CorrectTown2, "East Kilbirde", "East Kilbride"),
         suggested_town = str_replace(CorrectTown2, "Isle Ofwight", "Isle Of Wight"),
         suggested_town = str_replace(CorrectTown2, "Kings Lynn", "King's Lynn"),
         suggested_town = str_replace(CorrectTown2, "Miltonkeynes", "Milton Keynes"),
         suggested_town = str_replace(CorrectTown2, "St.austell", "St Austell"),
         suggested_town = str_replace(CorrectTown2, "Westonmare", "Weston-Super-Mare"),
         suggested_town = str_replace(CorrectTown2, "Westonsupermare", "Weston-Super-Mare"),
         suggested_town = str_replace(CorrectTown2, "-Upon-", " Upon "),
         suggested_town = str_replace(CorrectTown2, "St.", "St "),
         suggested_town = str_replace(CorrectTown2, "-On-", " on "),
         suggested_town = str_replace(CorrectTown2, " on-", " on ")
  )

but no effect.
What am I doing wrong?

When I review suggested_town, can I replace old, incorrect entries in CorrectTown2 by corrected entries in the suggested_town?

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.