Extract date after string using tidyr's extract function

Ibrahim · April 13, 2021, 5:26am

I'm trying to extract dates from a Notes column using tidyr's extract function. The data I'm working on looks like this:

dates <- data.frame(col1 = c("customer", "customer2", "customer3"),
                    Notes = c("DOB: 12/10/62
START: 09/01/2019
END: 09/01/2020", "
S/DATE: 28/08/19
R/DATE: 27/08/20", "DOB: 13/01/1980
Start:04/12/2018"),
                    End_date = NA,
                    Start_Date = NA )

I tried extracting the date following the string "S/DATE" like this:

extract <- extract(
  dates,
  col = "Notes",
  into = "Start_date",
  regex = "(?<=(S\\/DATE:)).*"  # Using regex lookahead
)

However, this only extracts the string "S/DATE:", not the date after it. When I tried this on regex101.com, it works as expected.

Thanks. Ibrahim

Dobrokhotov1989 · April 13, 2021, 6:20am

I don't know what's wrong with the tidy::extract() but with srtinger::str_extract() it works well.

  suppressMessages(library(tidyverse))
  
  dates <- data.frame(col1 = c("customer", "customer2", "customer3"),
                      Notes = c("DOB: 12/10/62
START: 09/01/2019
END: 09/01/2020", "
S/DATE: 28/08/19
R/DATE: 27/08/20", "DOB: 13/01/1980
Start:04/12/2018"),
                      End_date = NA,
                      Start_Date = NA )
  
  extract <- dates %>%
    mutate(Start_date = 
             str_extract(string = Notes, pattern = "(?<=(S\\/DATE: )).*")) %>%
    select(col1, Start_date, End_date)
  
  extract
#>        col1 Start_date End_date
#> 1  customer       <NA>       NA
#> 2 customer2   28/08/19       NA
#> 3 customer3       <NA>       NA

technocrat · April 13, 2021, 7:06am

Here are @Ibrahim and @Dobrokhotov1989 snippets together

suppressPackageStartupMessages({
  library(dplyr)
  library(stringr)
  library(tidyr)
})

pattern1 <- "(?<=(S\\/DATE:)).*"
pattern2 <- "(?<=(S\\/DATE: )).*"

dates <- data.frame(col1 = c("customer", "customer2", "customer3"),
                    Notes = c("DOB: 12/10/62
START: 09/01/2019
END: 09/01/2020", "
S/DATE: 28/08/19
R/DATE: 27/08/20", "DOB: 13/01/1980
Start:04/12/2018"),
End_date = NA,
Start_Date = NA )

# avoid naming objects after functions in name space to prevent
# collison; some operations will treat extract as a closure, 
# rather than a data frame; same with df, data, etc.

xtract <- extract(
  dates,
  col = "Notes",
  into = "Start_date",
  regex = pattern1 
)

xtract
#>        col1 Start_date End_date Start_Date
#> 1  customer       <NA>       NA         NA
#> 2 customer2    S/DATE:       NA         NA
#> 3 customer3       <NA>       NA         NA

xtract2 <- dates %>%
  mutate(Start_date = 
           str_extract(string = Notes, pattern1)) %>%
  select(col1, Start_date, End_date)

xtract2
#>        col1 Start_date End_date
#> 1  customer       <NA>       NA
#> 2 customer2   28/08/19       NA
#> 3 customer3       <NA>       NA

There's a slight different in the regex in the second case—the space following the colon. The reason that the one fails and the other doesn't however, is subtle.

tidyr::extract requires grouped expressions but stringr::stsr_extract doesn't. From help(extract)

regex a regular expression used to extract the desired values. There should be one group (defined by ()) for each element of into.

I agree with @Dobrokhotov1989 that str_extract is preferable to writing a grouped regex that will pick out the right date string. I would also convert to a datetime object:

suppressPackageStartupMessages({
  library(dplyr)
  library(lubridate)
  library(stringr)
  library(tidyr)
})

pattern1 <- "(?<=(S\\/DATE:)).*"
pattern2 <- "(?<=(S\\/DATE: )).*"
pattern3 <- "(^.*START:.)(\\d+/\\d+//d+)"

dates <- data.frame(col1 = c("customer", "customer2", "customer3"),
                    Notes = c("DOB: 12/10/62
START: 09/01/2019
END: 09/01/2020", "
S/DATE: 28/08/19
R/DATE: 27/08/20", "DOB: 13/01/1980
Start:04/12/2018"),
End_date = NA,
Start_Date = NA )

xtract3 <- dates %>%
  mutate(Start_date = 
           str_extract(string = Notes, pattern2),
         Start_date = dmy(Start_date)) %>%
  select(col1, Start_date, End_date)

xtract3
#>        col1 Start_date End_date
#> 1  customer       <NA>       NA
#> 2 customer2 2019-08-28       NA
#> 3 customer3       <NA>       NA

As far as using regex101.com and other checkers, there's no guarantee that the same regular expression will work identically across all implementations of the parsing engine. My recommendation is to use the facilities of {stringr} for the basic and simple cases, because it's simple and hard to get lost in. For the difficult cases, it's preferable to learn one regular expression language in depth and to use it exclusively through a system call. I've done this with bespoke bison/flex code, for example and had a much easier time.

system · April 20, 2021, 7:07am

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.