Regex to extract the year and month from file names?

This question is very similar to these but finding the regex is the challenge:




Suppose we have 1000 files,
and we want to use the file names to add columns that differentiate them.

library(tidyverse)
library(rebus)
# Sample
filenames <- c("abcd_xcv_pl_2019_01.csv", "abcd_xcv_pl_2019_02_vb_df.csv")

df <- c()
for (x in filenames) {
  u <- read_csv2(x) # actually reading the files one by one
  u$Year <- str_extract(x, "\\d{4}") # this selects the year easily
  u$Month <- str_extract(x, ANY_CHAR %R% one_or_more(DGT) %R% optional(ANY_CHAR) %R% ".csv") # attempt to select the month !!!
  df <- rbind(df, u)
  cat(x, "\n ")
}

# Tried many regexes, for example:
str_extract(x, ANY_CHAR %R% one_or_more(DGT) %R% optional(ANY_CHAR) %R% ".csv")

str_extract(x, one_or_more(DGT) %R% ANY_CHAR %R% ".csv")

but none of them returns exactly 01 or 02 (the month) from the file names.

No matter the solution, stringr functions (like base R's gsub) are vectorized, so you can pull them out of the loop.
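To make the "vectorized" point concrete, here is a minimal sketch using the two sample names from the question. It only shows the year extraction; nothing here is specific to the final solution.

```r
library(stringr)

# Same sample names as in the question
fnames <- c("abcd_xcv_pl_2019_01.csv", "abcd_xcv_pl_2019_02_vb_df.csv")

# One call handles the whole vector, so no loop is needed
years <- str_extract(fnames, "\\d{4}")
years
# [1] "2019" "2019"
```

The result has one element per file name, in the same order as the input.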

To extract multiple parts of a string, you can use str_match:

library(stringr)
fnames <- c("abcd_xcv_pl_2019_01.csv","abcd_xcv_pl_2019_02_vb_df.csv")
pattern <- paste0(
  "(?<!\\d)", # not preceded by a digit

  "(",        # start defining group 1
    "\\d{4}", # match 4 digits in a row
  ")",        # done defining group 1

  "\\D",      # match a non-digit character

  "(",        # start defining group 2
    "\\d{2}", # match 2 digits in a row
  ")",        # done defining group 2

  "(?!\\d)"   # not followed by a digit
)
pattern
# [1] "(?<!\\d)(\\d{4})\\D(\\d{2})(?!\\d)"
date_parts <- str_match(fnames, pattern)
date_parts
#      [,1]      [,2]   [,3]
# [1,] "2019_01" "2019" "01"
# [2,] "2019_02" "2019" "02"
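A side note on the two lookarounds in the pattern: the sample names would match without them, so here is a small sketch of a case where they matter. The file name `tricky` is made up for illustration and is not one of the files from the question.

```r
library(stringr)

# Hypothetical file name with an extra run of digits before the date
tricky <- "id_12345_2019_01.csv"

# Without lookarounds, the match starts inside the longer digit run
loose <- str_match(tricky, "(\\d{4})\\D(\\d{2})")
loose
#      [,1]      [,2]   [,3]
# [1,] "2345_20" "2345" "20"

# With lookarounds, only a standalone 4-digit / 2-digit pair matches
strict <- str_match(tricky, "(?<!\\d)(\\d{4})\\D(\\d{2})(?!\\d)")
strict
#      [,1]      [,2]   [,3]
# [1,] "2019_01" "2019" "01"
```

If your real file names never contain other digit runs, the simpler pattern may be enough, but the lookarounds are cheap insurance.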

As you can see, str_match returns a matrix where the first column is the part of the string matching the entire pattern, and the remaining columns are the parts matching each group. So date_parts has two extra columns, because pattern has two groups. Once it has row and column names, this matrix is easy to use in the loop.

rownames(date_parts) <- fnames
colnames(date_parts) <- c("matched", "year", "month")
date_parts
#                               matched   year   month
# abcd_xcv_pl_2019_01.csv       "2019_01" "2019" "01" 
# abcd_xcv_pl_2019_02_vb_df.csv "2019_02" "2019" "02"

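df <- c() # start empty, as in the question
for (x in fnames) {
  u <- read_csv2(x) # read_csv2() comes from readr; reads the files one by one
  u$Year <- date_parts[x, "year"]   # look up the row by file name
  u$Month <- date_parts[x, "month"]
  df <- rbind(df, u)
  cat(x, "\n ")
}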
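If you want to drop the growing rbind as well, one possible variation is to collect the pieces in a list and bind them once at the end. This is only a sketch under a few assumptions: it writes two tiny made-up files to a temporary directory so it can run on its own, and it uses base R's read.csv2 as a stand-in for read_csv2 to keep the dependencies small.

```r
library(stringr)

# Hypothetical setup: two tiny semicolon-separated files in a temp directory,
# so the sketch runs without the real data
tmp_dir <- tempfile("regex_demo_")
dir.create(tmp_dir)
fnames <- c("abcd_xcv_pl_2019_01.csv", "abcd_xcv_pl_2019_02_vb_df.csv")
paths <- file.path(tmp_dir, fnames)
for (p in paths) writeLines(c("a;b", "1;2", "3;4"), p)

# Extract year and month for all files in one vectorized call
# (basename() keeps digits in the directory path out of the match)
date_parts <- str_match(basename(paths), "(?<!\\d)(\\d{4})\\D(\\d{2})(?!\\d)")

# Read each file, tag it with its year and month, and keep the pieces in a list
pieces <- lapply(seq_along(paths), function(i) {
  u <- read.csv2(paths[i]) # base R stand-in for readr's read_csv2()
  u$Year <- date_parts[i, 2]
  u$Month <- date_parts[i, 3]
  u
})

# Bind once at the end instead of growing df inside a loop
df <- do.call(rbind, pieces)
df
```

With 1000 files, binding once is usually faster than calling rbind on every iteration, because the accumulated data is not copied each time.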

Thanks a ton for explaining each and every step :slight_smile:
Regex is sometimes hard for me to even understand :frowning:
Any recommendations?
rebus was a temporary solution.

Regex is definitely its own language, so everyone struggles with it at first. Luckily, it's one you can learn as you need it.

Some resources:


Thanks, I have already printed the cheatsheet :slight_smile:
