You can address this using regular expressions. Here you can use the capital letters to find the separations (but you'll need to take a good look at the results to make sure there's no inconsistency).
So we want our regexp to be as strict as possible to detect inconsistencies, we'll start with "^"
and end with "$"
.
- The first thing we expect is a single digit:
"[0-9]"
- Then we want a word that starts with a capital then has only lowercase:
"[A-Z][a-z]+"
- The same for the second word:
"[A-Z][a-z]+"
- Then we have 4 digits:
"[0-9]{4}"
- and finally
"CA"
(assuming you have only Canadian cities here).
You want to recover the matches for each of these expressions, so you need to enclose them in parentheses, and you obtain the following regex:
"^([0-9])([A-Z][a-z]+)([A-Z][a-z]+)([0-9]{4})CA$"
Now it's just a matter of using it in a function that accepts a regex. In the tidyverse, you can use it on a vector with str_match()
:
> text <- c("2WinnipegCanada2972CA", "3OttawaCanada3369CA", "4QuebecCanada3444CA", "5MontrealCanada3534CA")
> stringr::str_match(text, "^([0-9])([A-Z][a-z]+)([A-Z][a-z]+)([0-9]{4})CA$")
[,1] [,2] [,3] [,4] [,5]
[1,] "2WinnipegCanada2972CA" "2" "Winnipeg" "Canada" "2972"
[2,] "3OttawaCanada3369CA" "3" "Ottawa" "Canada" "3369"
[3,] "4QuebecCanada3444CA" "4" "Quebec" "Canada" "3444"
[4,] "5MontrealCanada3534CA" "5" "Montreal" "Canada" "3534"
Or if it's a column in a tibble you can use extract()
:
> library(tidyverse)
> tibble(my_col = text) %>%
extract(my_col, into=c("n", "City", "Country", "code"), "^([0-9])([A-Z][a-z]+)([A-Z][a-z]+)([0-9]{4})CA$")
# A tibble: 4 x 4
n City Country code
<chr> <chr> <chr> <chr>
1 2 Winnipeg Canada 2972
2 3 Ottawa Canada 3369
3 4 Quebec Canada 3444
4 5 Montreal Canada 3534
Of course, you may want to capture a larger initial "n"
using [0-9]{1,3}
and capture the country code with additional parentheses at the end: "([0-9]{1,3})([A-Z][a-z]+)([A-Z][a-z]+)([0-9]{4})([A-Z]{2,3})$"
, for that part you really need to carefully examine your data.