Separating integers in a column

Hi There,
I have a column with a birth date and a unique identification number that I want to split into two seperate columns giving me the information "age" and "gender"
The cell values look like this:
Ex: 031096-3460

Whereby the first six integers are in the format ddmmyy and the remaining four integers represent a unique identification number. If this number is an even number i want to code it 'female' and if it is uneven I want to code it 'Male' in a seperate column.

In this way, i need to turn the column into a birth day column and convert that to age and the other column into a gender column.
Can anyone help me with some code to do this?
Many Thanks,
Naja

Hi,

First of all, welcome to the RStudio Community!

Find below a way to implement your request

library(stringr)
library(dplyr)
library(lubridate)

#Create a fake dataset with the original column
myData = data.frame(original = c("031096-3460", "220506-2567"))

#Create two new columns birthday and id
myData = myData %>% mutate(birthday = as.Date(str_extract(original, "^\\d+"), "%d%m%y"),
                           id = as.numeric(str_extract(original, "\\d+$")))

#Convert birthday to age and id to gender
myData = myData %>% mutate(age = floor(interval(birthday, Sys.Date()) / years(1)),
                           gender = ifelse(id %% 2 == 0, 'Female', "Male"))

The first part uses the str_extract function from stringr to get the right data and convert it to the type needed

The second part then converts the birthday to age using the interval function from lubridate and a simple if-else to see if the id is even or not to assign the gender

NOTE: you could actually combine step 1 and 2 into just a single step, but I split it to make it more readable.

#All in one (just get age and gender)
myData = myData %>% 
  mutate(age = floor(interval(
                as.Date(original, "%d%m%y"), Sys.Date()) 
                / years(1)),
         gender = ifelse(
           as.numeric(str_extract(original, "\\d+$")) %% 2 == 0, 
           'Female', "Male"))

Hope this helps,
PJ

2 Likes

Hi Pieterjanvc!

Thank you so much for your useful guide and your welcome to the community. As you can hear I'm still fairly new at using R. When I run your code many of the age integers now appear in a negative format. Ages range only from +50 to -50.

I'm having a problem with the first bit of code as the manual entering of the numbers (c("031096-3460", "220506-2567") wouldn't work since it is a dataset of roughly 30.000 dataentries. I made the code as follows, but that results in these negative integers, probably because I cannot seem to get the first step right. My second bit of code looks as follows:

###############################################

donations = donations %>%
mutate(age = floor(interval(
as.Date(CPR, "%d%m%y"), Sys.Date())
/ years(1)),
gender = ifelse(
as.numeric(str_extract(CPR, "\d+$")) %% 2 == 0,
'Female', "Male"))

######################################################
Whereby 'donations' is the name of the dataset and 'CPR' is the name of the column I wish to split.

Many Thanks,
Naja

You are not supposed to do any manual data entry, that is just made up data used for exemplifying purposes, since you haven't provided any sample data for us to work with.

To help us help you, could you please prepare a reproducible example (reprex) illustrating your issue? Please have a look at this guide, to see how to create one:

Your code is OK, but it's the strings that do not contain enough information. You only provide the last two digits of the year, so lubridate assumes that from the current year 2019 or 19, 50 years before and after are displayed as two numbers (i.e. the year closest to the current year). So 94 is interpreted as 1994, but 42 is interpreted as 2042 (since 2042 is closer to 2019 than 1942).

There is no fool proof way to fix this (apart from having the full year), as you'll never know for sure what the two digits mean. Of course you know for birthdays that they can't be in the future, but someone who was born in 1915, and now is 104 years old will be interpreted as 4 years old now.

I hope you see what the problem is.

I created one workaround, but it is not perfect.

library(stringr)
library(dplyr)
library(lubridate)

myData = data.frame(original = c("030919-3460", "220522-2567"))

myData = myData %>% mutate(birthday = as.integer(str_extract(original, "..(?=-)")),
                           birthday = ifelse(birthday > 19, 
                                             str_replace(original, "(....)(..(?=-))()", "\\119\\2"),
                                             str_replace(original, "(....)(..(?=-))()", "\\120\\2")),
                           birthday = as.Date(str_extract(birthday, "^\\d+"), "%d%m%Y"),
                           id = as.numeric(str_extract(original, "\\d+$")))

myData = myData %>% mutate(age = floor(interval(birthday, Sys.Date()) / years(1)),
                           gender = ifelse(id %% 2 == 0, 'Female', "Male"))

> print(myData)
     original   birthday   id age gender
1 030919-3460 2019-09-03 3460   0 Female
2 220522-2567 1922-05-22 2567  97   Male

I first extract the two digit year from the original string, then look if it's > 19 (current year). If so, I put 19 in from of it, else 20. This changes 15 into 2015 but 35 into 1935. It still will make errors for people > 100 years old.

PJ

There is also a lower level lubridate function that allows you to change the default cutoff value (cutoff_2000 = 68L)

library(stringr)
library(dplyr)
library(lubridate)

myData <- data.frame(original = c("030919-3460", "220522-2567"))

myData %>%
    mutate(birthday = parse_date_time2(str_extract(original, "\\d+(?=-)"),
                                       orders = "dmy",
                                       cutoff_2000 = 19),
           id = as.numeric(str_extract(original, "\\d+$")),
           age = floor(interval(birthday, Sys.Date()) / years(1)),
           gender = ifelse(id %% 2 == 0, 'Female', "Male"))
#>      original   birthday   id age gender
#> 1 030919-3460 2019-09-03 3460   0 Female
#> 2 220522-2567 1922-05-22 2567  97   Male
1 Like

Thank you so much to both of you. This seems to work now although there appears to be a small number of NA's with both your codes. Still a huge help though as I can just eliminate the NA's from the dataset.
Andresrcs, I have tried for the first time to do a reproducable example in another question on here and really appreciated your FAQ for that. Thank you!

  • Naja
1 Like

If your question's been answered (even by you!), would you mind choosing a solution? It helps other people see which questions still need help, or find solutions if they have similar problems. Here’s how to do it: