Here are @Ibrahim's and @Dobrokhotov1989's snippets together:
suppressPackageStartupMessages({
library(dplyr)
library(stringr)
library(tidyr)
})
pattern1 <- "(?<=(S\\/DATE:)).*"
pattern2 <- "(?<=(S\\/DATE: )).*"
dates <- data.frame(col1 = c("customer", "customer2", "customer3"),
Notes = c("DOB: 12/10/62
START: 09/01/2019
END: 09/01/2020", "
S/DATE: 28/08/19
R/DATE: 27/08/20", "DOB: 13/01/1980
Start:04/12/2018"),
End_date = NA,
Start_Date = NA )
# avoid naming objects after functions in the namespace to prevent
# collisions; some operations will treat extract as a closure,
# rather than a data frame; same with df, data, etc.
xtract <- extract(
  dates,
  col = "Notes",
  into = "Start_date",
  regex = pattern1
)
xtract
#> col1 Start_date End_date Start_Date
#> 1 customer <NA> NA NA
#> 2 customer2 S/DATE: NA NA
#> 3 customer3 <NA> NA NA
xtract2 <- dates %>%
  mutate(Start_date = str_extract(string = Notes, pattern1)) %>%
  select(col1, Start_date, End_date)
xtract2
#> col1 Start_date End_date
#> 1 customer <NA> NA
#> 2 customer2 28/08/19 NA
#> 3 customer3 <NA> NA
There's a slight difference in the regex in the second case (the space following the colon). The reason that one fails and the other doesn't, however, is subtle. tidyr::extract requires grouped expressions, but stringr::str_extract doesn't. From help(extract):
regex a regular expression used to extract the desired values. There should be one group (defined by ()) for each element of into.
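For comparison, here is a minimal sketch of what a grouped regex for tidyr::extract could look like. The names notes and grouped are made up for illustration, and the pattern assumes the date always follows S/DATE: in dd/mm/yy or dd/mm/yyyy form. The single pair of parentheses is the one group that corresponds to the one element of into.

```r
library(tidyr)

# Hypothetical cut-down version of the data used above
notes <- data.frame(
  col1 = c("customer", "customer2"),
  Notes = c(
    "DOB: 12/10/62\nSTART: 09/01/2019",
    "\nS/DATE: 28/08/19\nR/DATE: 27/08/20"
  )
)

# One capture group, wrapped around the date only (an assumed format)
grouped <- "S/DATE:\\s*(\\d{2}/\\d{2}/\\d{2,4})"

res <- extract(
  notes,
  col = "Notes",
  into = "Start_date",
  regex = grouped
)
res
```

Rows without an S/DATE: entry come back as NA in Start_date, the same as in the str_extract version.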
I agree with @Dobrokhotov1989 that str_extract is preferable to writing a grouped regex that will pick out the right date string. I would also convert the result to a Date object:
suppressPackageStartupMessages({
library(dplyr)
library(lubridate)
library(stringr)
library(tidyr)
})
pattern1 <- "(?<=(S\\/DATE:)).*"
pattern2 <- "(?<=(S\\/DATE: )).*"
pattern3 <- "(^.*START:.)(\\d+/\\d+/\\d+)"
dates <- data.frame(col1 = c("customer", "customer2", "customer3"),
Notes = c("DOB: 12/10/62
START: 09/01/2019
END: 09/01/2020", "
S/DATE: 28/08/19
R/DATE: 27/08/20", "DOB: 13/01/1980
Start:04/12/2018"),
End_date = NA,
Start_Date = NA )
xtract3 <- dates %>%
  mutate(
    Start_date = str_extract(string = Notes, pattern2),
    Start_date = dmy(Start_date)
  ) %>%
  select(col1, Start_date, End_date)
xtract3
#> col1 Start_date End_date
#> 1 customer <NA> NA
#> 2 customer2 2019-08-28 NA
#> 3 customer3 <NA> NA
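The effect of the space after the colon is easy to miss in printed data frames, because character columns are right-aligned. A small sketch that checks it directly on one note string (the variable names note, with_space and without_space are only for illustration):

```r
library(stringr)

note <- "\nS/DATE: 28/08/19\nR/DATE: 27/08/20"

pattern1 <- "(?<=(S\\/DATE:)).*"   # lookbehind stops at the colon
pattern2 <- "(?<=(S\\/DATE: )).*"  # lookbehind also consumes the space

with_space <- str_extract(note, pattern1)
without_space <- str_extract(note, pattern2)

# The first result carries the leading space, so it is one character longer
nchar(with_space)
nchar(without_space)

# Trimming whitespace makes the two results the same
identical(str_trim(with_space), without_space)
```

If pattern1 is kept, wrapping the result in str_trim before any further processing is one way to avoid carrying the stray space along.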
As for using regex101.com and other checkers, there's no guarantee that the same regular expression will work identically across all implementations of the parsing engine. My recommendation is to use the facilities of {stringr} for the basic and simple cases, because it's simple and hard to get lost in. For the difficult cases, it's preferable to learn one regular expression language in depth and to use it exclusively through a system call. I've done this with bespoke bison/flex code, for example, and had a much easier time.
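To make the point about engines concrete, here is a rough sketch that runs one lookbehind pattern through three engines available from R: the base default (TRE), base with perl = TRUE (PCRE), and {stringr} (ICU). The names x and lookbehind are illustrative. The TRE call is wrapped in tryCatch because I would expect that engine to reject the lookbehind syntax, but I'm treating that as an assumption rather than asserting it.

```r
library(stringr)

x <- "S/DATE: 28/08/19"
lookbehind <- "(?<=S/DATE: )\\d{2}/\\d{2}/\\d{2}"

# Base R default engine (TRE); keep the error message if the pattern is rejected
tre <- tryCatch(
  grepl(lookbehind, x),
  error = function(e) conditionMessage(e)
)

# Base R with the PCRE engine
pcre <- grepl(lookbehind, x, perl = TRUE)

# stringr, which uses the ICU engine through stringi
icu <- str_detect(x, lookbehind)

tre
pcre
icu
```

A pattern tested on a web checker is in the same position: it only tells you how that checker's engine behaves, so it's worth re-running it in the engine you'll actually use.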