Regular expressions with str_match

bobby · February 15, 2020, 5:05pm

Hi,

I have a simple table (df) with just 1 column (col_1). Each row is a string of varying length that describes a school kid, such as:

"8YOB that has gone for multiple detentions"
"12 Year old girl that has been diagnosed with vision impairment"
"10YO boy from a single-mom family"

I'd like to pick up each row's gender and add it as a new column. I know that the gender will always come after the age (which is either a 1- or 2-digit number), and the gender is always stated as "boy", "Boy", "girl", or "Girl". The number of characters that sit between the age and the gender is variable, although it should be fewer than 20.

So, I'd like to identify where the age is in each row, then pick out the first "b", "B", "g", or "G" that appears after the age, and then put that in the new column, as either capital M or capital F. So far, this is what I have:

pattern = optional(DGT) %R% DGT  #this line is my main problem, not sure how to code this
gender = str_match(df$col_1, pattern) #how to convert the gender to capital?
df$Gender = gender

Any help? Thanks.

FJCC · February 15, 2020, 5:34pm

Here is a variation on what you have asked for using str_extract and where the new column says either BOY or GIRL. This works if every case contains either boy or girl with any mixture of upper and lower case. Your example does not actually meet that requirement, as shown by the NA in the first row.

library(dplyr)
library(stringr)
DF <- data.frame(Info = c("8YOB that has gone for multiple detentions",
                          "12 Year old girl that has been diagnosed with vision impairment",
                          "10YO boy from a single-mom family",
                          "6 YO Boy", "9 year old Girl"))
DF
#>                                                              Info
#> 1                      8YOB that has gone for multiple detentions
#> 2 12 Year old girl that has been diagnosed with vision impairment
#> 3                               10YO boy from a single-mom family
#> 4                                                        6 YO Boy
#> 5                                                 9 year old Girl
DF <- mutate(DF, Sex = toupper(str_extract(Info, regex("boy|girl", ignore_case = TRUE))))
DF
#>                                                              Info  Sex
#> 1                      8YOB that has gone for multiple detentions <NA>
#> 2 12 Year old girl that has been diagnosed with vision impairment GIRL
#> 3                               10YO boy from a single-mom family  BOY
#> 4                                                        6 YO Boy  BOY
#> 5                                                 9 year old Girl GIRL

Created on 2020-02-15 by the reprex package (v0.3.0)
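
One way to cover the NA in the first row is to let the same alternation also accept a lone B or G that directly follows "YO". This is only a sketch: the `(?<=YO)` lookbehind and the `info`, `hit`, and `sex` names are assumptions made for illustration, and the pattern would misfire on any other word containing "yob" or "yog" (for example "yoga") that appears before the gender.

```r
library(stringr)

info <- c("8YOB that has gone for multiple detentions",
          "12 Year old girl that has been diagnosed with vision impairment",
          "10YO boy from a single-mom family",
          "6 YO Boy", "9 year old Girl")

# Match "boy" or "girl" as a word, or a single B/G that directly
# follows "YO" (as in "8YOB"). Case is ignored throughout.
pattern <- regex("boy|girl|(?<=YO)[bg]", ignore_case = TRUE)

# str_extract() returns the first match in each string (NA if none).
hit <- str_extract(info, pattern)

# Keep only the first letter of the match and upper-case it.
sex <- str_to_upper(str_sub(hit, 1, 1))
sex
```

Reducing every match to its first letter keeps the new column uniform whether the row said "boy", "Girl", or just "B".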

bobby · February 16, 2020, 4:30am

Thanks for sharing.

As you pointed out, that doesn't fully solve my problem, since most of the rows state the gender in formats like xYOB or xYOG. I cannot extract just "B" or "G", since these letters are too generic.

If I go with str_extract, one workaround is to extract the first "b", "g", "boy", or "girl" that appears in a row. How should I code that?

andresrcs · February 16, 2020, 5:05am

Here are a couple of options. The problem with regular expressions is that it is hard to give a fine-tuned solution without a deeper understanding of the data. For example, if all your lines look like the ones in your example, you could use the second solution, which is much simpler but less robust.

library(tidyverse)

DF <- data.frame(Info = c("8YOB that has gone for multiple detentions",
                          "12 Year old girl that has been diagnosed with vision impairment",
                          "10YO boy from a single-mom family",
                          "6 YO Boy", "9 year old Girl"))

DF %>% 
    mutate(sex = str_extract(Info, "(?<=^\\d{1,2})[^BGbg]+[BGbg]") %>% 
               str_extract(".{1}$") %>% 
               str_to_upper()
           )
#>                                                              Info sex
#> 1                      8YOB that has gone for multiple detentions   B
#> 2 12 Year old girl that has been diagnosed with vision impairment   G
#> 3                               10YO boy from a single-mom family   B
#> 4                                                        6 YO Boy   B
#> 5                                                 9 year old Girl   G

DF %>% 
    mutate(sex = str_extract(Info, "[BGbg]") %>% 
               str_to_upper()
    )
#>                                                              Info sex
#> 1                      8YOB that has gone for multiple detentions   B
#> 2 12 Year old girl that has been diagnosed with vision impairment   G
#> 3                               10YO boy from a single-mom family   B
#> 4                                                        6 YO Boy   B
#> 5                                                 9 year old Girl   G
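
Since the thread title asks about str_match, the first option above could also be written with capture groups instead of a lookbehind plus a second extraction. The following is a sketch under the assumptions stated in the question (the age is a 1 or 2-digit number at the start of the string, and the gender letter follows within 20 characters); the `info`, `m`, `age`, and `sex` names are made up for illustration. Unlike `[^BGbg]+`, the `{0,20}` quantifier also allows the letter to follow the age immediately.

```r
library(stringr)

info <- c("8YOB that has gone for multiple detentions",
          "12 Year old girl that has been diagnosed with vision impairment",
          "10YO boy from a single-mom family",
          "6 YO Boy", "9 year old Girl")

# Group 1: the 1-2 digit age at the start of the string.
# Group 2: the first B/G (any case) that appears within
#          20 non-B/G characters after the age.
m <- str_match(info, "^(\\d{1,2})[^BGbg]{0,20}([BGbg])")

# str_match() returns a character matrix: column 1 is the full match,
# columns 2 and 3 hold the two capture groups.
age <- as.integer(m[, 2])
sex <- str_to_upper(m[, 3])

data.frame(age, sex)
```

Rows that do not start with an age, or where no B/G appears within the 20-character window, come back as NA in both columns, which makes unexpected formats easy to spot.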

bobby · February 16, 2020, 5:14am

Thanks everyone for your input.

I realised that str_extract() already extracts the first occurrence of the regex that I need, so the following works for me:

gender = str_extract(DF$col_1, "[bBgG]") %>% str_to_upper()  # first b/B/g/G in each row
DF$Gender = gender
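
The original question asked for a capital M or F in the new column rather than B or G. A minimal sketch of that last step, assuming the first b/g letter in each row is the gender marker (as in the examples in this thread); the `letter` and `lookup` names are hypothetical:

```r
library(stringr)

df <- data.frame(
  col_1 = c("8YOB that has gone for multiple detentions",
            "12 Year old girl that has been diagnosed with vision impairment",
            "10YO boy from a single-mom family"),
  stringsAsFactors = FALSE
)

# First b/B/g/G in each row, upper-cased (NA if the row has none).
letter <- str_to_upper(str_extract(df$col_1, "[BGbg]"))

# Named-vector lookup: B (boy) becomes M, G (girl) becomes F.
lookup <- c(B = "M", G = "F")
df$Gender <- unname(lookup[letter])

df$Gender
```

Rows without any b or g stay NA after the lookup, so they can be reviewed by hand.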
