Correcting string names by removing numbers and changing case

Slavek · September 15, 2021, 3:34pm

Hi,
I have imported my data to R but it has some errors. Some town info includes weird numbers:

source <- data.frame(
  stringsAsFactors = FALSE,
  URN = c("aaa","bb","cc","dd",
          "ee","ff","gg","hh","ii","jjj",
          "kkk","ll","mm","nn","oo","pp",
          "qq","rr"),
  Town = c("Aberdeen","ABERDEEN",
           "ABERDEEN      0498",
           "ABERGAVENNY    15656","Aberystwyth",
           "ABERYSTWYTH    15574","St.austell",
           "Aldershot","ALDERSHOT      10306",
           "Barrow-in-Furness","BARROWINFURNESS15405",
           "BRISTOL","Westonsupermare",
           "BRISTOL        15590",
           "BRISTOL     1390","Bristol","Burton-on-Trent",
           "BURTON-ON-TRENT15660")
)

Is it possible to remove all numbers (along with any spaces in front of them, if necessary) and change all names to sentence case in the "Town" variable? I need to get this as a result:

result <- data.frame(
  stringsAsFactors = FALSE,
  URN = c("aaa","bb","cc","dd",
          "ee","ff","gg","hh","ii","jjj",
          "kkk","ll","mm","nn","oo","pp",
          "qq","rr"),
  Town = c("Aberdeen","ABERDEEN",
           "ABERDEEN      0498",
           "ABERGAVENNY    15656","Aberystwyth",
           "ABERYSTWYTH    15574","St.austell",
           "Aldershot","ALDERSHOT      10306",
           "Barrow-in-Furness","BARROWINFURNESS15405",
           "BRISTOL","Westonsupermare",
           "BRISTOL        15590",
           "BRISTOL     1390","Bristol","Burton-on-Trent",
           "BURTON-ON-TRENT15660"),
  CorrectTown = c("Aberdeen","Aberdeen",
                  "Aberdeen","Abergavenny",
                  "Aberystwyth","Aberystwyth","St Austell",
                  "Aldershot","Aldershot",
                  "Barrow-in-Furness","Barrow-in-Furness",
                  "Bristol","Weston-Super-Mare","Bristol","Bristol",
                  "Bristol","Burton-on-Trent",
                  "Burton-on-Trent")
)

How can I do that?
I believe errors like "BARROWINFURNESS15405" will have to be changed to "Barrow-in-Furness" manually (unless there is a clever way, as there are many UK towns with the structure: name.with.3+characters"-in-"name.with.3+characters).

Can you help?

Slavek · September 20, 2021, 3:21pm

I am still struggling with this "trimming" request.
Is it possible to do that in R?

Riffomonas · September 20, 2021, 4:14pm

Hi Slavek - I find that a lot of these data cleaning things require running str_replace multiple times on the same data. I think you can get a sense of what would be involved by the following code...

library(dplyr)
library(stringr)

source %>%
	mutate(CorrectTown = str_replace(Town, "\\s*\\d+", ""),
		CorrectTown = str_replace(CorrectTown, "BARROWINFURNESS", "Barrow-in-furness"),
		CorrectTown = str_to_title(CorrectTown),
		CorrectTown = str_replace(CorrectTown, "-In-", "-in-"),
		CorrectTown = str_replace(CorrectTown, "-On-", "-on-")
	)

In the first mutate line, str_replace removes the spaces (zero or more) followed by digits.
The second demonstrates the brute force approach.
The third applies title case to the text. Title case capitalizes the first letter of each word, so the one thing this doesn't solve is the problem of the "-On-" and "-In-" towns.
To solve that problem, we have the 4th and 5th lines. There may be a more elegant solution, but I couldn't figure it out.

Note that the first str_replace operates on Town and then subsequent lines manipulate CorrectTown
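
As one possibly tidier alternative to the 4th and 5th lines, str_replace_all accepts a named character vector of pattern = replacement pairs, so both fixes can go in a single call. This is only a minimal sketch on a made-up vector called towns (not the full source data frame), and it anchors the digit pattern to the end of the string, which assumes the stray numbers always come last:

```r
library(stringr)

# Hypothetical sample taken from the Town values in the question
towns <- c("BURTON-ON-TRENT15660", "Barrow-in-Furness", "ABERDEEN      0498")

# Remove optional whitespace plus trailing digits, then apply title case
cleaned <- str_to_title(str_remove(towns, "\\s*\\d+$"))

# A named vector applies several pattern = replacement pairs in one call
cleaned <- str_replace_all(cleaned, c("-In-" = "-in-", "-On-" = "-on-"))
cleaned
```

Further pairs such as "-Upon-" = "-upon-" could be added to the same vector if those towns appear in the real data.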

Hope this helps! It's a messy process, but is a lot more fun than brute forcing a bunch of strings.
Pat

xvalda · September 20, 2021, 4:52pm

Hi there!

I went with a “lookup” approach, which may be complementary. It’s about flagging towns with potential alternative forms (case and special characters).
My assumption from the example data is that the same town may appear several times, sometimes in the correct form ("Barrow-in-furness") and sometimes in the incorrect form ("BARROWINFURNESS").
The idea is to create a dataframe with two columns:

Names of towns with special characters (-, …)
Same name but “normalized”: lowercased and with special characters removed.

Then in our result df, I add a temporary column where I lowercase the town name, and if it matches a normalized name of the previous “lookup” df, then it returns a suggestion in a new column.

This approach is not failsafe, which is why I prefer to flag a possible suggestion in an extra column rather than overwrite the town name.

# create result2 df, removing digits and str_to_title
result2 <- source %>% 
  mutate(CorrectTown2 = str_remove(Town, "\\d+"), 
         CorrectTown2 = str_squish(str_to_title(CorrectTown2))
  )
# create a "lookup" df where only cities with non-word characters (such as Barrow-In-Furness) are shown
# + take unique (distinct) towns + lowercase + remove special characters
town_special_char <- result2 %>% 
  filter(str_detect(CorrectTown2, "\\W")) %>% 
  distinct(CorrectTown2) %>% 
  mutate(rm_special_char = str_remove_all(CorrectTown2, "\\W"), 
         rm_special_char = str_to_lower(rm_special_char)) %>% 
  rename(suggested_town = CorrectTown2)
# left join the two dfs to show suggested towns
result2 <- result2 %>% mutate(rm_special_char = str_to_lower(CorrectTown2)) %>% 
  left_join(town_special_char, by = "rm_special_char") %>% 
  select(-rm_special_char)

xvalda · September 20, 2021, 5:07pm

Another tip in case it helps: you can download a list of UK towns and cities and use a similar lookup procedure to find the right name.
Here's an example of where you can find such data: Place name data for UK Towns, Cities and Villages with Counties — O'Brien Media Ltd
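
A rough sketch of what that lookup could look like. The reference data frame below is a hand-typed stand-in for a downloaded list, and the column names official_name and key are made up for illustration. Both sides are reduced to lowercase letters only, so digits, spaces, dots and hyphens no longer get in the way of the join:

```r
library(dplyr)
library(stringr)

# Hypothetical reference list; in practice this would be read from a downloaded file
reference <- data.frame(
  stringsAsFactors = FALSE,
  official_name = c("Barrow-in-Furness", "Weston-super-Mare", "St Austell", "Bristol")
) %>%
  # Normalized key: keep letters only, then lowercase
  mutate(key = str_to_lower(str_remove_all(official_name, "[^A-Za-z]")))

# A few messy values from the question
messy <- data.frame(
  stringsAsFactors = FALSE,
  Town = c("BARROWINFURNESS15405", "Westonsupermare", "St.austell", "BRISTOL     1390")
)

# Build the same key on the messy side and join to pick up the official spelling
matched <- messy %>%
  mutate(key = str_to_lower(str_remove_all(Town, "[^A-Za-z]"))) %>%
  left_join(reference, by = "key") %>%
  select(-key)
matched
```

Towns that are misspelled rather than just badly formatted (a transposed letter, for example) would come back as NA here and would still need a manual fix.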

xvalda · September 20, 2021, 5:25pm

Last little note: the dataset I found online doesn't give unique identifiers like the postal code. That can be a problem for several towns with the same name but in different counties. For instance, there are 3 towns called "Well", in Lincolnshire, North Yorkshire and Hampshire.
I didn't see that info in your example data, but it may be worth looking for a dataset that has postal codes as well.
But between the examples from Pat and a normalization and merging/joining approach, you're probably getting closer to your goal.

Slavek · September 21, 2021, 11:51am

I know the problem is solved, but I applied that to my real data set and I have to do extra corrections.
I am applying both solutions above using step 4:

result4 <- result3 %>%
  mutate(suggested_town = str_replace(CorrectTown2, "Barrowinfurness", "Barrow-In-Furness"),
         suggested_town = str_replace(CorrectTown2, "East Kilbirde", "East Kilbride"),
         suggested_town = str_replace(CorrectTown2, "Isle Ofwight", "Isle Of Wight"),
         suggested_town = str_replace(CorrectTown2, "Kings Lynn", "King's Lynn"),
         suggested_town = str_replace(CorrectTown2, "Miltonkeynes", "Milton Keynes"),
         suggested_town = str_replace(CorrectTown2, "St.austell", "St Austell"),
         suggested_town = str_replace(CorrectTown2, "Westonmare", "Weston-Super-Mare"),
         suggested_town = str_replace(CorrectTown2, "Westonsupermare", "Weston-Super-Mare"),
         suggested_town = str_replace(CorrectTown2, "-Upon-", " Upon "),
         suggested_town = str_replace(CorrectTown2, "St.", "St "),
         suggested_town = str_replace(CorrectTown2, "-On-", " on "),
         suggested_town = str_replace(CorrectTown2, " on-", " on ")
  )

but no effect.
What am I doing wrong?
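
One likely cause, offered as a sketch rather than a confirmed diagnosis since the real data isn't shown: every line in the mutate above reads from CorrectTown2, so each line overwrites suggested_town from scratch and only the last replacement survives. Chaining needs the first call to read CorrectTown2 and the later ones to read suggested_town. Separately, "St." is treated as a regular expression in which the dot matches any character, so it would also hit the "Sto" in "Stoke"; wrapping the pattern in fixed() avoids that. The small df below is made up for illustration:

```r
library(dplyr)
library(stringr)

# Hypothetical sample of already title-cased town names
df <- data.frame(
  stringsAsFactors = FALSE,
  CorrectTown2 = c("Barrowinfurness", "Miltonkeynes", "St.austell", "Stoke-On-Trent")
)

fixed_df <- df %>%
  mutate(
    # First step reads the original column...
    suggested_town = str_replace(CorrectTown2, "Barrowinfurness", "Barrow-In-Furness"),
    # ...every later step reads the column being built up
    suggested_town = str_replace(suggested_town, "Miltonkeynes", "Milton Keynes"),
    suggested_town = str_replace(suggested_town, fixed("St.austell"), "St Austell"),
    # fixed() makes the dot literal, so "Stoke" is left alone
    suggested_town = str_replace(suggested_town, fixed("St."), "St ")
  )
fixed_df
```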

Slavek · September 21, 2021, 12:06pm

When I review suggested_town, can I replace old, incorrect entries in CorrectTown2 by corrected entries in the suggested_town?
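
A possible way to do this, assuming suggested_town is NA wherever no correction was proposed (as it would be after the left join earlier in the thread): dplyr's coalesce takes the first non-missing value from its arguments, so the suggestion wins where one exists and the original value is kept otherwise. The df here is a made-up three-row example:

```r
library(dplyr)

# Hypothetical reviewed data: NA means "no suggestion, keep the original"
df <- data.frame(
  stringsAsFactors = FALSE,
  CorrectTown2 = c("Barrowinfurness", "Bristol", "Westonsupermare"),
  suggested_town = c("Barrow-In-Furness", NA, "Weston-Super-Mare")
)

reviewed <- df %>%
  # Take the suggestion when present, otherwise fall back to the existing value
  mutate(CorrectTown2 = coalesce(suggested_town, CorrectTown2)) %>%
  select(-suggested_town)
reviewed
```

Keeping a copy of the original column until the results have been checked may be safer than overwriting it straight away.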
