I have a simple table (df) with just 1 column (col_1). Each row is a string of varying length that describes a school kid, such as:
"8YOB that has gone for multiple detentions"
"12 Year old girl that has been diagnosed with vision impairment"
"10YO boy from a single-mom family"
I'd like to pick up each row's gender and add it as a new column. I know that the gender will always come after the age (which is either a 1- or 2-digit number), and the gender is always stated as "boy", "Boy", "girl", or "Girl". The number of characters that sit between the age and the gender is variable, although it should be fewer than 20.
So, I'd like to identify where the age is in each row, then pick out the first "b", "B", "g", or "G" that appears after the age, and then put that in the new column, as either capital M or capital F. So far, this is what I have:
pattern = optional(DGT) %R% DGT #this line is my main problem, not sure how to code this
gender = str_match(df$col_1, pattern) #how to convert the gender to capital?
df$Gender = gender
Here is a variation on what you asked for, using str_extract, where the new column says either BOY or GIRL. This works only if every row contains the word boy or girl, in any mixture of upper and lower case. Your example does not actually meet that requirement, as shown by the NA in the first row.
library(dplyr)
library(stringr)
DF <- data.frame(Info = c("8YOB that has gone for multiple detentions",
                          "12 Year old girl that has been diagnosed with vision impairment",
                          "10YO boy from a single-mom family",
                          "6 YO Boy", "9 year old Girl"))
DF
#> Info
#> 1 8YOB that has gone for multiple detentions
#> 2 12 Year old girl that has been diagnosed with vision impairment
#> 3 10YO boy from a single-mom family
#> 4 6 YO Boy
#> 5 9 year old Girl
DF <- mutate(DF, Sex = toupper(str_extract(Info, regex("boy|girl", ignore_case = TRUE))))
DF
#> Info Sex
#> 1 8YOB that has gone for multiple detentions <NA>
#> 2 12 Year old girl that has been diagnosed with vision impairment GIRL
#> 3 10YO boy from a single-mom family BOY
#> 4 6 YO Boy BOY
#> 5 9 year old Girl GIRL
As you pointed out, that doesn't fully solve my problem, since most of the rows state the gender in formats like xYOB or xYOG. I can't simply extract "B" or "G", since those letters are too generic.
If I go with str_extract, one workaround is to extract the first "b", "g", "boy", or "girl" that appears in a row. How would I code that?
Here are a couple of options. With regular expressions, though, it is hard to give a fine-tuned solution without a deeper understanding of the data. For example, if all your rows look like the ones in your example, you could use the second solution, which is much simpler but less robust.
library(tidyverse)
DF <- data.frame(Info = c("8YOB that has gone for multiple detentions",
                          "12 Year old girl that has been diagnosed with vision impairment",
                          "10YO boy from a single-mom family",
                          "6 YO Boy", "9 year old Girl"))
DF %>%
  mutate(sex = str_extract(Info, "(?<=^\\d{1,2})[^BGbg]+[BGbg]") %>%
           str_extract(".{1}$") %>%
           str_to_upper()
  )
#> Info sex
#> 1 8YOB that has gone for multiple detentions B
#> 2 12 Year old girl that has been diagnosed with vision impairment G
#> 3 10YO boy from a single-mom family B
#> 4 6 YO Boy B
#> 5 9 year old Girl G
DF %>%
  mutate(sex = str_extract(Info, "[BGbg]") %>%
           str_to_upper()
  )
#> Info sex
#> 1 8YOB that has gone for multiple detentions B
#> 2 12 Year old girl that has been diagnosed with vision impairment G
#> 3 10YO boy from a single-mom family B
#> 4 6 YO Boy B
#> 5 9 year old Girl G
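Since the original question asked for a capital M or F rather than B or G, here is a minimal sketch of one way to get there. It assumes, as the lookbehind in the first solution does, that the age is at the very start of each string, and it uses the "fewer than 20 characters" limit from the question as an upper bound on the gap. The names info, m and sex are just illustrative.

```r
library(stringr)

info <- c("8YOB that has gone for multiple detentions",
          "12 Year old girl that has been diagnosed with vision impairment",
          "10YO boy from a single-mom family",
          "6 YO Boy", "9 year old Girl")

# Capture the first B/G/b/g that follows the leading 1- or 2-digit age,
# allowing at most 20 other characters in between (assumed limit).
m <- str_match(info, "^\\d{1,2}[^BGbg]{0,20}([BGbg])")

# Column 2 of the match matrix holds the capture group.
# Map B -> M and G -> F; rows with no match stay NA.
sex <- ifelse(str_to_upper(m[, 2]) == "B", "M", "F")
sex
#> [1] "M" "F" "M" "M" "F"
```

Using str_match with a capture group avoids the second str_extract step, and any row that doesn't fit the pattern comes back as NA instead of a wrong letter, which makes odd rows easier to spot.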