Separate o split?

mab9913 · August 4, 2020, 7:08pm

Hello, Iam tryngg to scrap some text out of an url, but I got stuck in this.
The text i get does not hav a separation or anything that allows me to separate, so I get this structure of data, and dont know how to convert this into a a 4 var data frame, with names("ID", "Country", "Zip code", "ID country")
Its my first post at the community,
tanks!

tibble [466 x 1] (S3: tbl_df/tbl/data.frame)
$ .: chr [1:466] "2WinnipegCanada2972CA" "3OttawaCanada3369CA" "4QuebecCanada3444CA" "5MontrealCanada3534CA" ...

AlexisW · August 4, 2020, 8:04pm

You can address this using regular expressions. Here you can use the capital letters to find the separations (but you'll need to take a good look at the results to make sure there's no inconsistency).

So we want our regexp to be as strict as possible to detect inconsistencies, we'll start with "^" and end with "$".

The first thing we expect is a single digit: "[0-9]"
Then we want a word that starts with a capital then has only lowercase: "[A-Z][a-z]+"
The same for the second word: "[A-Z][a-z]+"
Then we have 4 digits: "[0-9]{4}"
and finally "CA" (assuming you have only Canadian cities here).

You want to recover the matches for each of these expressions, so you need to enclose them in parentheses, and you obtain the following regex:

"^([0-9])([A-Z][a-z]+)([A-Z][a-z]+)([0-9]{4})CA$"

Now it's just a matter of using it in a function that accepts a regex. In the tidyverse, you can use it on a vector with str_match():

> text <- c("2WinnipegCanada2972CA", "3OttawaCanada3369CA", "4QuebecCanada3444CA", "5MontrealCanada3534CA")
> stringr::str_match(text, "^([0-9])([A-Z][a-z]+)([A-Z][a-z]+)([0-9]{4})CA$")

     [,1]                    [,2] [,3]       [,4]     [,5]  
[1,] "2WinnipegCanada2972CA" "2"  "Winnipeg" "Canada" "2972"
[2,] "3OttawaCanada3369CA"   "3"  "Ottawa"   "Canada" "3369"
[3,] "4QuebecCanada3444CA"   "4"  "Quebec"   "Canada" "3444"
[4,] "5MontrealCanada3534CA" "5"  "Montreal" "Canada" "3534"

Or if it's a column in a tibble you can use extract():

> library(tidyverse)
> tibble(my_col = text) %>%
    extract(my_col, into=c("n", "City", "Country", "code"), "^([0-9])([A-Z][a-z]+)([A-Z][a-z]+)([0-9]{4})CA$")

# A tibble: 4 x 4
  n     City     Country code 
  <chr> <chr>    <chr>   <chr>
1 2     Winnipeg Canada  2972 
2 3     Ottawa   Canada  3369 
3 4     Quebec   Canada  3444 
4 5     Montreal Canada  3534

Of course, you may want to capture a larger initial "n" using [0-9]{1,3} and capture the country code with additional parentheses at the end: "([0-9]{1,3})([A-Z][a-z]+)([A-Z][a-z]+)([0-9]{4})([A-Z]{2,3})$", for that part you really need to carefully examine your data.

system · August 25, 2020, 8:04pm

This topic was automatically closed 21 days after the last reply. New replies are no longer allowed.