Regex to extract the year and month from file names?

AbhishekHP · July 19, 2019, 1:32pm

This question is very similar to these but finding the regex is the challenge:

Suppose we have 1000 files and
want to use the file names to add columns that differentiate them.

library(tidyverse)
library(rebus)
# Sample file names
filenames <- c("abcd_xcv_pl_2019_01.csv", "abcd_xcv_pl_2019_02_vb_df.csv")

df <- c()
for (x in filenames) {
  u <- read_csv2(x) # read the files one by one
  u$Year <- str_extract(x, "\\d{4}") # this selects the year easily
  u$Month <- str_extract(x, ANY_CHAR %R% one_or_more(DGT) %R% optional(ANY_CHAR) %R% ".csv") # attempt to select the month
  df <- rbind(df, u)
  cat(x, "\n")
}

# I tried many regexes, for example:
str_extract(x, ANY_CHAR %R% one_or_more(DGT) %R% optional(ANY_CHAR) %R% ".csv")

str_extract(x, one_or_more(DGT) %R% ANY_CHAR %R% ".csv")

but none gives exactly 01 or 02 (the months) from the file names.

nwerth · July 19, 2019, 2:08pm

No matter the solution, stringr functions (like base R's gsub) are vectorized, so you can pull them out of the loop.
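As a quick illustration of the vectorization point, here is a minimal sketch that extracts the year and the month for all file names in one call each, with no loop. The lookaround patterns are my own suggestion and assume the month is always two digits that directly follow the four-digit year and an underscore.

```r
library(stringr)

fnames <- c("abcd_xcv_pl_2019_01.csv", "abcd_xcv_pl_2019_02_vb_df.csv")

# Year: four digits that are not part of a longer run of digits
years <- str_extract(fnames, "(?<!\\d)\\d{4}(?!\\d)")

# Month: two digits right after "<four digits>_", not followed by another digit
# (assumes an underscore separates year and month)
months <- str_extract(fnames, "(?<=\\d{4}_)\\d{2}(?!\\d)")

years   # "2019" "2019"
months  # "01" "02"
```

Both results are character vectors of the same length as fnames, so they can be indexed by position or by name inside whatever reads the files.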

For extracting multiple parts of a string, you can use str_match:

library(stringr)
fnames <- c("abcd_xcv_pl_2019_01.csv","abcd_xcv_pl_2019_02_vb_df.csv")
pattern <- paste0(
  "(?<!\\d)", # not preceded by a digit

  "(",        # start defining group 1
    "\\d{4}", # match 4 digits in a row
  ")",        # done defining group 1

  "\\D",      # match a non-digit character

  "(",        # start defining group 2
    "\\d{2}", # match 2 digits in a row
  ")",        # done defining group 2

  "(?!\\d)"   # not followed by a digit
)
pattern
# [1] "(?<!\\d)(\\d{4})\\D(\\d{2})(?!\\d)"
date_parts <- str_match(fnames, pattern)
date_parts
#      [,1]      [,2]   [,3]
# [1,] "2019_01" "2019" "01"
# [2,] "2019_02" "2019" "02"

As you can see, str_match returns a matrix where the first column is part of the string matching the entire pattern, and the rest of the columns are the parts matching the groups. So date_parts has two extra columns, because pattern had two groups. With row and column names, this matrix will be easy to use in the loop.

rownames(date_parts) <- fnames
colnames(date_parts) <- c("matched", "year", "month")
date_parts
#                               matched   year   month
# abcd_xcv_pl_2019_01.csv       "2019_01" "2019" "01" 
# abcd_xcv_pl_2019_02_vb_df.csv "2019_02" "2019" "02"
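With 1000 files it may be worth checking for names that don't fit the pattern before reading anything: str_match returns a row of NA for any string that does not match. A small sketch of that check, where "notes.csv" is a made-up file name used only to show a non-match:

```r
library(stringr)

pattern <- "(?<!\\d)(\\d{4})\\D(\\d{2})(?!\\d)"

# "notes.csv" is a hypothetical name that contains no year/month
fnames_check <- c("abcd_xcv_pl_2019_01.csv", "notes.csv")
parts <- str_match(fnames_check, pattern)

# Rows where the whole match is NA belong to names without a year/month
unmatched <- fnames_check[is.na(parts[, 1])]
unmatched
# [1] "notes.csv"
```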

library(readr) # provides read_csv2

df <- NULL
for (x in fnames) { # fnames as defined above
  u <- read_csv2(x) # read the files one by one
  u$Year <- date_parts[x, "year"]
  u$Month <- date_parts[x, "month"]
  df <- rbind(df, u)
  cat(x, "\n")
}
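The loop above grows df with rbind on every iteration, which can get slow with 1000 files. One possible alternative is to collect the pieces in a list and bind them once at the end. The sketch below is only an illustration: it writes two tiny demo files to a temporary directory so that it runs anywhere, uses base R's read.csv2 as a stand-in for readr's read_csv2, and the names demo_dir and pieces are made up for the example.

```r
library(stringr)

# Hypothetical demo files in a temporary directory, so the sketch is runnable
demo_dir <- file.path(tempdir(), "regex_demo")
dir.create(demo_dir, showWarnings = FALSE)
fnames <- c("abcd_xcv_pl_2019_01.csv", "abcd_xcv_pl_2019_02_vb_df.csv")
for (f in fnames) {
  write.csv2(data.frame(value = 1:2), file.path(demo_dir, f), row.names = FALSE)
}

# Match on the bare file names, not the full paths
pattern <- "(?<!\\d)(\\d{4})\\D(\\d{2})(?!\\d)"
date_parts <- str_match(fnames, pattern)

# Read each file, tag it with its year and month, and keep the pieces in a list
pieces <- lapply(seq_along(fnames), function(i) {
  u <- read.csv2(file.path(demo_dir, fnames[i])) # base R stand-in for read_csv2
  u$Year <- date_parts[i, 2]
  u$Month <- date_parts[i, 3]
  u
})

# Bind once at the end instead of growing df inside the loop
df <- do.call(rbind, pieces)
```

Matching on the file names rather than the full paths matters here, because a directory name could itself contain digits that look like a date.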

AbhishekHP · July 19, 2019, 2:24pm

Thanks a ton for explaining each and every step.
Regex is sometimes hard for me to even understand.
Any recommendations?
rebus was a temporary solution.

nwerth · July 19, 2019, 2:42pm

Regex is definitely its own language, so everyone struggles with it at first. Luckily, it's one you can learn as you need.

Some resources:

Ian Kopacka's regex cheat sheet: https://github.com/rstudio/cheatsheets/raw/master/regex.pdf
Reference table: http://userguide.icu-project.org/strings/regexp
An online tutorial I just found: https://regexone.com/ (I haven't done it, but it looks promising)

AbhishekHP · July 19, 2019, 3:03pm

Thanks, I've already printed the cheat sheet.
