Assistance with Data Cleanup

Hi,
I am trying to clean up some data and have failed every attempt in Excel. I have used R in the past but am very new. The data has a date column but has a dozen of formats in it. I only need to extract the year form the data. I was able to do some of this in Excel, but not able to get all of my 4,000 rows. Does anyone have any suggestions on how I will be able to extract this data? I am attaching an image of what I am working with. Thanks in advance!

If the year is always the first occurrence of four consecutive numerics, then you can use

library(stringr)
str_extract("Thu,Jul,30,2015,14:18:26GMT,0500,CDT", "\\d{4}")

More complete code might be like this, if your data frame is called DF.

library(dplyr)
library(stringr)
DF %>% mutate(Year = str_extract(Date, "\\d{4}"))

If you need the year to be numeric, you can wrap that str_extract() in an as.numeric() function.

3 Likes

I was going to suggest lubridate::dmy_hms but it seems it can't handle your data very well. @FJCC's answer is better.

Thank you for the suggestions. I will look into using Stringer for this. The years are always a 4 digit year but aren't always at the of the text string. Thank you both for the assistance!

you can use `str_extract(Date, "20\d\d")1 which will return the first thing that looks like 20xx in the string.

You can get more sophisticated with the regex (regular expression) if this doesn't quite work.

1 Like

Thanks, the " mutate(year = str_extract(date, "\d{4}")) " code worked and extracted the years.

2 Likes

My experience with dates in Excel spreadsheets in the wild has been that they are often formatted inconsistently, as your screenshots show. Converting to strings and using regex tools can be a pain in the neck, but has been the most reliable way for me to scrub date fields from Excel.

BTW: For the benefit of those to follow, please mark @FJCC's answer as the solution. Thanks.

3 Likes

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.