Using purrr::map for looping while performing an operation on each iteration

This question is very similar to existing ones, but I am trying to use purrr::map:

Let's say we have 1000 files,
and we want to use the file names to add columns that differentiate them.

library(tidyverse)
library(rebus)
# Sample 
# list.files() results in c("abcd_xcv_pl_2019_01.csv", "abcd_xcv_pl_2019_02_vb_df.csv")

df <- c()
for (x in list.files(pattern = "\\.csv$")) {
  u <- read_csv2(x) # actually reading files one by one (tried fread as well)
  u$Year <- str_extract(x, "\\d{4}") # this extracts the year easily
  df <- rbind(df, u)
}
To speed things up I tried fread, but because of the for loop it still takes minutes. With map I can read the files:
map(.x = list.files(pattern = "\\.csv$"),
    .f = ~ read_csv2(.x))

but how can we add columns to each file?

I would likely approach this sort of problem by making a function that reads in an individual dataset, modifies it/adds columns as needed, and then looping over the files. However, I don't know if a map() loop will end up being faster than a for() loop.

Something like (untested)

read_and_add = function(file) {
     dat = read_csv2(file)
     dat$Year = str_extract(file,"\\d{4}")
     dat
}

Then I could use map_dfr() to read all the datasets in and row bind them together into a single dataset.

map_dfr(.x = list.files(pattern = "\\.csv$"),
        .f = read_and_add)

I go through a (more complicated) example that I worked through with a student on my blog here. Since it's a "teaching" post I go into a lot of detail, so you may want to skip some of it. You can see where I first make the function for reading and modifying each dataset here.


Also, you can read all the files directly into a single dataframe, using the .id parameter to capture the file name, and create the year column afterwards.

library(tidyverse)
library(stringr)

list_of_files <- list.files(pattern = ".csv$",
                            full.names = TRUE)
df <- list_of_files %>%
    setNames(nm = .) %>% 
    map_dfr(read.csv2, .id = "file_name") %>% 
    mutate(Year = str_extract(file_name, "\\d{4}"))

This might not be something you can speed up.

This problem sounds like you're reading many files from storage. If that is the case, you may be I/O bound, and looping more cleverly is unlikely to improve performance. Attempting multiple reads in parallel may even result in worse performance, depending on the underlying storage system.
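One rough way to check is to time a single read and extrapolate (a sketch; files stands for your vector of file paths):

# Sketch: if per-file read time times file count is close to the elapsed
# time of your full loop, the loop overhead is negligible and you are
# likely I/O bound.
files <- list.files(pattern = "\\.csv$")
one <- system.time(readr::read_csv2(files[1]))
one[["elapsed"]] * length(files) # compare with your loop's elapsed time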

Reading from disk is slow

For example, this write-up on the relative speed of processing and I/O tries to scale latency to put it in a human context: latency scaled

Try with futures

Split the files into two lists and have one list read asynchronously by a non-blocking process. This should result in two different processes reading from storage at once. You may have to run something like this from the command line (Rscript), as there are caveats when trying to run futures from RStudio.

library(readr)
library(stringr) # for str_extract()
library(future)
plan(multiprocess)

all_file_names <- list.files(pattern = "\\.csv$")
mid <- floor(length(all_file_names)/2)
# split list in half-ish
file_names_set_1 <- all_file_names[1:mid]
file_names_set_2 <- all_file_names[(mid+1):length(all_file_names)]

# Time how long this takes
start_time <- proc.time()

# Use future to read first half of list
df_1 %<-% {
  async_df <- c()
  for (x in file_names_set_1) {
    u <- read_csv2(x)
    u$Year <- str_extract(x, "\\d{4}")
    async_df <- rbind(async_df, u)
  }
  async_df
}

# Read second half of list in the current process
df_2 <- c()
for (x in file_names_set_2) {
  u <- read_csv2(x)
  u$Year <- str_extract(x, "\\d{4}") # this extracts the year easily
  df_2 <- rbind(df_2, u)
}

df <- rbind(df_1, df_2)

elapsed <- proc.time() - start_time
print(elapsed)

Thanks @grosscol for the creative approach to speeding this up (5x).
Is %<-% from magrittr?
If we divide further, i.e. run the for loops over 4 parts, are there any trade-offs?
Because at the end of the day this data has to be loaded into Shiny and rendered: about 30 million rows.

> print(elapsed) [without future]
   user  system elapsed 
541.731  51.299 610.976 

> print(elapsed) [with future]
   user  system elapsed 
178.281   2.553 182.613 

#########################
@aosmith, @andresrcs
It seems that map_dfr has an issue binding rows when colClasses is given.
I had to change a column's type from integer to character (column F below),
even though the column entries were purely integers.
The error was:
Column F can't be converted from integer to character
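For reference, a minimal sketch that reproduces the error with made-up data frames, plus the coercion workaround:

# Sketch: the same column read as different types from different files
# makes the bind fail; coercing to a common class first avoids it.
library(purrr)
library(dplyr)
df1 <- data.frame(F = 1:2)          # F detected as integer in one file
df2 <- data.frame(F = c("3", "4"))  # F detected as character in another
# map_dfr(list(df1, df2), identity) # errors with the conversion message
df_all <- map_dfr(list(df1, df2), ~ mutate(.x, F = as.character(F)))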

Also, @andresrcs,
the actual function call looks like this.
I struggled to adapt purrr and mutate when so much is happening within each loop. Is there a better way using purrr?

library(data.table) # for fread() and setnames()
library(stringr)    # for str_match()

column_names <- c("A","B","C","D","E","F","G")
column_classes <- c("factor", "factor", "factor", "factor", "factor", "character", "character") 
read_and_add = function(fl_nm) {
  file <- fread(paste0("data/",fl_nm), header = F, skip = 2, colClasses = column_classes)
  setnames(file, column_names)
  pattern = paste0(
    "(?<!\\d)", # not preceded by a digit
    "(",        # start defining group 1
    "\\d{4}", # match 4 digits in a row
    ")",        # done defining group 1
    "\\D",      # match a non-digit character
    "(",        # start defining group 2
    "\\d{2}", # match 2 digits in a row
    ")",        # done defining group 2
    "(?!\\d)"   # not followed by a digit
  )
  date_parts <- str_match(fl_nm, pattern)
  rownames(date_parts) <- fl_nm
  colnames(date_parts) <- c("matched", "year", "month")
  file$year <- date_parts[fl_nm, "year"]
  file$month <- date_parts[fl_nm, "month"]
  file
}
df_func = map_dfr(.x = list.files(path = "data", pattern = "\\.csv$"),
        .f = read_and_add)

Unless you have a good reason for editing each file's column names before merging them together, I would do it this way

library(tidyverse)
library(stringr)
library(data.table)

column_names <- c("file_name","A","B","C","D","E","F","G","year","month")
column_classes <- c("factor", "factor", "factor", "factor", "factor", "character", "character") 

list_of_files <- list.files(path = "data",
                            pattern = ".csv$",
                            full.names = TRUE)
df <- list_of_files %>%
    setNames(nm = .) %>% 
    map_dfr(~fread(., header = F, skip = 2, colClasses = column_classes),
            .id = "file_name") %>% 
    mutate(year = str_extract(file_name, "(?<!\\d)\\d{4}(?!\\d)"),
           month = str_extract(file_name, "(?<!\\d)\\d{2}(?!\\d)")) %>% 
    rename_all(~column_names)

Although I can't be sure this would fit your real data; if you need more specific help, consider sharing a link to a couple of sample .csv files so we can test our code.

The operator %<-% is from future. It's the assignment operator that delegates evaluation to a future block. With a multiprocess plan, the code runs in a non-blocking process. It should only block if and when the assigned value is referenced before the code has finished.
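A minimal sketch of the operator in isolation (using a multisession plan here):

# Sketch: %<-% assigns the result of a block that is evaluated as a future.
library(future)
plan(multisession) # evaluate futures in background R sessions

x %<-% {
  Sys.sleep(2) # stand-in for slow work such as reading files
  42
}
message("this prints immediately; the block runs in the background")
print(x) # blocks here, if needed, until the future has resolved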

Glad it resulted in a performance increase for you. You'd have to experiment to determine the optimal number of reading processes for your hardware and workload; the factors involved in retrieval from storage go beyond just the R code.
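As for dividing further: rather than hand-splitting the file list into 4 parts, one option is the future.apply package (a sketch, untested against your data; the worker count is just an example):

# Sketch: generalize the two-way split to N workers with future_lapply().
library(future.apply)
library(readr)
library(stringr)
plan(multisession, workers = 4) # experiment with the worker count

all_file_names <- list.files(pattern = "\\.csv$")
res <- future_lapply(all_file_names, function(x) {
  u <- read_csv2(x)
  u$Year <- str_extract(x, "\\d{4}")
  u
})
df <- do.call(rbind, res)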

fread should have the best performance for reading files (you mentioned fread, but I only see read_csv2 in your post).
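If you want to verify that on your own data, a quick sketch (the file name is just the example from earlier in the thread):

# Sketch: compare read times on one file; fread auto-detects the
# semicolon separator that read_csv2 assumes.
system.time(readr::read_csv2("abcd_xcv_pl_2019_01.csv"))
system.time(data.table::fread("abcd_xcv_pl_2019_01.csv"))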

purrr::map will not give much performance gain over a for loop.

  • The usual advice to avoid for loops is meant to point you toward the right vectorized alternatives, which often implement the loop in C and so are faster.
  • Another common bad practice to avoid is growing a list/vector inside a loop (see the sketch below).

Other than these two cases, for loops, lapply, and map are similar in performance.
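To make the second point concrete, a toy sketch:

# Sketch: growing a data frame inside a loop copies all previous rows on
# every iteration; collecting pieces in a list and binding once avoids that.
n <- 1000
bad <- c()
for (i in 1:n) bad <- rbind(bad, data.frame(x = i)) # quadratic copying

good <- do.call(rbind, lapply(1:n, function(i) data.frame(x = i)))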

Using data.table to add a column is also very fast because it happens in place, without creating a new copy. I have used this kind of code several times (I don't have access to my previous code now and don't have time to test the code below).

If you still need to improve performance and your file structure is well known, you can set colClasses, which should reduce the time fread spends on detecting column types.
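For example (a sketch; the classes are the ones from your function above, and the file name is assumed):

# Sketch: passing colClasses lets fread skip column-type detection.
library(data.table)
dt <- fread("data/abcd_xcv_pl_2019_01.csv",
            colClasses = c("factor", "factor", "factor", "factor",
                           "factor", "character", "character"))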

If you need to read lots of files or big files, learning data.table will be well worth your effort (readr is one big step up from base R, but still not as good as fread for reading csv files).

library(data.table)
library(stringr) # for str_extract()

res_list <- lapply(list.files(pattern = "\\.csv$"), function(x) {
  dt <- fread(x)
  dt[, Year := str_extract(x, "\\d{4}")] # adds the column in place, no copy
  dt
})

res_dt <- rbindlist(res_list)

