Thanks @grosscol for the creative approach to speeding this up (5x).
Is %<-% from magrittr?
If we divide the work further, e.g. into 4 parts each running its own for loop, are there any trade-offs?
I ask because, at the end of the day, this data has to be loaded into Shiny and rendered.
The actual size is 30 million rows.
> print(elapsed) [without future]
user system elapsed
541.731 51.299 610.976
> print(elapsed) [with future]
user system elapsed
178.281 2.553 182.613
#########################
@aosmith, @andresrcs
It seems that map_dfr has trouble binding the results when colClasses is supplied.
I had to change one column's type from integer to character (column F below),
even though the column's entries were purely integers.
The error was:
Column F can't be converted from integer to character
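For anyone hitting the same message, here is a minimal sketch of what I think is going on. It assumes the cause is that column F ends up as integer in some pieces and character in others. The two toy data frames are made up and stand in for the real files, and the exact wording of the error depends on the dplyr version.

```r
library(dplyr)

# Hypothetical stand-ins for two files whose column F was read with different types
a <- data.frame(F = 1:2)                                    # integer
b <- data.frame(F = c("3", "x"), stringsAsFactors = FALSE)  # character

# Binding integer and character versions of the same column raises an error
res <- tryCatch(bind_rows(a, b), error = function(e) e)
inherits(res, "error")

# Forcing a single type up front (which is what colClasses does in fread)
# removes the mismatch, so the bind succeeds
a$F <- as.character(a$F)
combined <- bind_rows(a, b)
str(combined)
```

If that is indeed the cause, declaring F as "character" for every file is a reasonable workaround, since all pieces then agree on the type before they are bound.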
Also, @andresrcs,
the actual function call looks like this.
I struggled to adopt purrr and mutate with so many things happening within each loop iteration. Is there a better way using purrr?
library(data.table) # fread, setnames
library(stringr)    # str_match
library(purrr)      # map_dfr (needs dplyr installed for the row binding)

column_names <- c("A", "B", "C", "D", "E", "F", "G")
column_classes <- c("factor", "factor", "factor", "factor", "factor", "character", "character")

# Read one file from data/ and add the year and month parsed from its file name
read_and_add <- function(fl_nm) {
  dt <- fread(file.path("data", fl_nm), header = FALSE, skip = 2, colClasses = column_classes)
  setnames(dt, column_names)
  pattern <- paste0(
    "(?<!\\d)", # not preceded by a digit
    "(",        # start defining group 1
    "\\d{4}",   # match 4 digits in a row
    ")",        # done defining group 1
    "\\D",      # match a non-digit character
    "(",        # start defining group 2
    "\\d{2}",   # match 2 digits in a row
    ")",        # done defining group 2
    "(?!\\d)"   # not followed by a digit
  )
  date_parts <- str_match(fl_nm, pattern)
  rownames(date_parts) <- fl_nm
  colnames(date_parts) <- c("matched", "year", "month")
  dt$year <- date_parts[fl_nm, "year"]
  dt$month <- date_parts[fl_nm, "month"]
  dt
}

# "\\.csv$" is a regular expression matching names that end in .csv
df_func <- map_dfr(.x = list.files(path = "data", pattern = "\\.csv$"),
                   .f = read_and_add)
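As a quick check of what the pattern pulls out of a file name, here is a small sketch. The file name is made up for illustration (my real names may differ), and the pattern is the same one as above written on a single line.

```r
library(stringr)

# Same pattern as in read_and_add, collapsed into one string
pattern <- "(?<!\\d)(\\d{4})\\D(\\d{2})(?!\\d)"

# Hypothetical file name, not one of the real files
fl_nm <- "sales_2017-03.csv"

# str_match returns a matrix: column 1 is the full match, then one column per group
date_parts <- str_match(fl_nm, pattern)
year  <- date_parts[1, 2]  # "2017"
month <- date_parts[1, 3]  # "03"
print(c(year = year, month = month))
```

Both values come back as character strings, so year and month would need as.integer() if numeric columns are wanted.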