Hi @oli.m
As @nirgrahamuk mentioned, the inconsistent date formats cannot be fixed entirely with stringr, and I guess you will still need a single consistent date format for your analytics.
I suggest a solution below, with 3 basic scenarios:
- do you have at least one clear dmy format? --> meaning the first number in a date is an integer > 12
- do you have at least one clear mdy format? --> meaning the second number in a date is an integer > 12
- is it undefined? --> all other cases, where both of the first two numbers in every date are <= 12, so we can't tell which format was used. I default to mdy here, but you can decide otherwise.
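To make the three scenarios concrete, here is a minimal sketch that only labels a string of dates, without parsing it. The helper name `classify_format` is hypothetical (it is not part of stringr or lubridate), and it assumes dates are written with dots as separators.

```r
library(stringr)

# Hypothetical helper: label a string holding one or more dot-separated dates.
classify_format <- function(pasted) {
  ifelse(
    # first number of any date is 13-39 --> must be a day, so dmy
    str_detect(pasted, "(^|\\s)(1[3-9]|[23][0-9])\\."), "dmy",
    ifelse(
      # second number of any date is 13-39 --> must be a day, so mdy
      str_detect(pasted, "\\.(1[3-9]|[23][0-9])\\."), "mdy",
      # both numbers <= 12 --> ambiguous
      "undefined"
    )
  )
}

result <- classify_format(c("1.17.1990", "24.09.1865", "03.12.2000"))
result
```

For these three inputs the labels come out as "mdy", "dmy" and "undefined", in that order.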
The options below don't cover every possible case, nor every possible wrong entry.
I ran the pattern detection row by row on all dates of a row pasted together, on the assumption that the person who entered b.date for Jasi used the same format for a.date.
So these are some "naive" assumptions about your data, but when you apply the code to your whole dataset, you can take a look at the dates that were not parsed correctly and adapt the code accordingly.
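library(tidyverse)
library(lubridate)

data <- tibble(
  name   = c("Josh", "Jasi", "Sophie", "Leni"),
  b.date = c("1.17.1990", "24.09.1865", "03.12.2000", "10.04.2000"),
  a.date = c("4.13.1990", "02.03.1865", "03.04.2000", "11.04.2000"))

data %>%
  # paste all date columns of a row into one string for pattern detection
  unite("pasted", contains("dat"), sep = " ", remove = FALSE) %>%
  mutate(across(contains("dat"),
    ~ case_when(
      # a first number > 12 somewhere in the row --> dmy
      str_detect(pasted, "(^|\\s+)(1[3-9]|[23][0-9])\\.") ~ dmy(.x),
      # a second number > 12 somewhere in the row --> mdy
      str_detect(pasted, "\\.(1[3-9]|[23][0-9])\\.") ~ mdy(.x),
      # undefined --> mdy by default, change this if you prefer dmy
      TRUE ~ mdy(.x)))) %>%
  select(-pasted) %>%
  # dmy() and mdy() both run on every row, so failed parses warn even when unused
  suppressWarnings()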
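
To look at dates that were not parsed right, one option is to keep the original strings next to the parsed columns and filter on `NA`. This is only a sketch: the "Typo" row is a made-up entry added for illustration, and the `.parsed` suffix is just a naming choice.

```r
library(tidyverse)
library(lubridate)

# Same example data, plus one hypothetical malformed row ("Typo")
data <- tibble(
  name   = c("Josh", "Jasi", "Sophie", "Leni", "Typo"),
  b.date = c("1.17.1990", "24.09.1865", "03.12.2000", "10.04.2000", "31.31.2000"),
  a.date = c("4.13.1990", "02.03.1865", "03.04.2000", "11.04.2000", "01.02.2000"))

parsed <- suppressWarnings(
  data %>%
    unite("pasted", b.date, a.date, sep = " ", remove = FALSE) %>%
    # write results to new columns so the raw strings stay available
    mutate(across(c(b.date, a.date),
      ~ case_when(
        str_detect(pasted, "(^|\\s+)(1[3-9]|[23][0-9])\\.") ~ dmy(.x),
        str_detect(pasted, "\\.(1[3-9]|[23][0-9])\\.") ~ mdy(.x),
        TRUE ~ mdy(.x)),
      .names = "{.col}.parsed")) %>%
    select(-pasted)
)

# rows where at least one date could not be parsed
unparsed <- parsed %>% filter(if_any(ends_with(".parsed"), is.na))
unparsed
```

Here only the "Typo" row is returned, because "31.31.2000" is not a valid date in either format.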