This question is very similar to these but finding the regex is the challenge:
Suppose we have 1000 files and want to use the file names to add columns that differentiate them:
library(tidyverse)
library(rebus)
# Sample
filenames = c("abcd_xcv_pl_2019_01.csv","abcd_xcv_pl_2019_02_vb_df.csv")
df<- c()
for (x in filenames) {
u<-read_csv2(x) # actually reading file 1 by 1
u$Year = str_extract(x,"\\d{4}") # this selects the year easily
u$Month = str_extract(x, ANY_CHAR %R% one_or_more(DGT) %R% optional(ANY_CHAR) %R% ".csv") # in order to select the month !!!
df <- rbind(df, u)
cat(x, "\n ")
}
# I tried many regexes, for example:
str_extract(x, ANY_CHAR %R% one_or_more(DGT) %R% optional(ANY_CHAR) %R% ".csv")
str_extract(x, one_or_more(DGT) %R% ANY_CHAR %R% ".csv")
but none of them returns exactly 01 or 02 (the month) from the file names.
No matter the solution, stringr functions (like base R's gsub) are vectorized, so you can pull them out of the loop.
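To illustrate the vectorization point, here is a minimal sketch that extracts the year from every file name in one call, with no loop. It reuses the two sample file names from the question; the variable name `years` is only for illustration.

```r
library(stringr)

fnames <- c("abcd_xcv_pl_2019_01.csv", "abcd_xcv_pl_2019_02_vb_df.csv")

# str_extract() takes the whole character vector and returns
# one result per element, so no for loop is needed here.
years <- str_extract(fnames, "\\d{4}")
years
# [1] "2019" "2019"
```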
For extracting multiple parts of a string, you can use str_match:
library(stringr)
fnames <- c("abcd_xcv_pl_2019_01.csv","abcd_xcv_pl_2019_02_vb_df.csv")
pattern <- paste0(
"(?<!\\d)", # not preceded by a digit
"(", # start defining group 1
"\\d{4}", # match 4 digits in a row
")", # done defining group 1
"\\D", # match a non-digit character
"(", # start defining group 2
"\\d{2}", # match 2 digits in a row
")", # done defining group 2
"(?!\\d)" # not followed by a digit
)
pattern
# [1] "(?<!\\d)(\\d{4})\\D(\\d{2})(?!\\d)"
date_parts <- str_match(fnames, pattern)
date_parts
#      [,1]      [,2]   [,3]
# [1,] "2019_01" "2019" "01"
# [2,] "2019_02" "2019" "02"
As you can see, str_match returns a matrix where the first column is part of the string matching the entire pattern, and the rest of the columns are the parts matching the groups. So date_parts has two extra columns, because pattern had two groups. With row and column names, this matrix will be easy to use in the loop.
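As a rough sketch of that last step, the matrix can be given row names (the file names) and column names, then indexed by name while processing each file. The column labels "match", "Year" and "Month" are my own choice, and the small data.frame is a hypothetical stand-in for `read_csv2(x)` so that the example runs without any files on disk.

```r
library(stringr)

fnames <- c("abcd_xcv_pl_2019_01.csv", "abcd_xcv_pl_2019_02_vb_df.csv")
pattern <- "(?<!\\d)(\\d{4})\\D(\\d{2})(?!\\d)"

# Extract all date parts once, outside the loop
date_parts <- str_match(fnames, pattern)

# Label rows by file name and columns by meaning (labels are illustrative)
dimnames(date_parts) <- list(fnames, c("match", "Year", "Month"))

pieces <- lapply(fnames, function(x) {
  # Stand-in for the real data; replace with read_csv2(x) in practice
  u <- data.frame(value = 1:2)
  # Look up this file's year and month by name
  u$Year  <- date_parts[x, "Year"]
  u$Month <- date_parts[x, "Month"]
  u
})

# Bind once at the end instead of growing df inside the loop
df <- do.call(rbind, pieces)
```

Collecting the pieces in a list and binding them once at the end avoids repeatedly copying a growing data frame, which can matter with 1000 files.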