Using purrr::map for looping while performing an operation on each iteration

This question is very similar to existing ones, but I am trying to use purrr::map:

Let's say we have 1000 files,
and we want to use the file names to add columns that differentiate them.

library(tidyverse)
library(rebus)
# Sample 
# list.files() results in c("abcd_xcv_pl_2019_01.csv", "abcd_xcv_pl_2019_02_vb_df.csv")

df <- c()
for (x in list.files(pattern = "\\.csv$")) {
  u <- read_csv2(x) # actually reading files one by one (tried fread as well)
  u$Year <- str_extract(x, "\\d{4}") # this extracts the year easily
  df <- rbind(df, u)
}
To speed things up I tried fread, but because of the for loop it still takes minutes. With map I can read the files:
map(.x = list.files(pattern = "\\.csv$"),
    .f = ~ read_csv2(.x))

but how can we add columns to each file?

I would likely approach this sort of problem by making a function that reads in an individual dataset, modifies it/adds columns as needed, and then looping over the files. However, I don't know if a map() loop will end up being faster than a for() loop.

Something like (untested)

read_and_add = function(file) {
     dat = read_csv2(file)
     dat$Year = str_extract(file,"\\d{4}")
     dat
}

Then I could use map_dfr() to read all the datasets in and row bind them together into a single dataset.

map_dfr(.x = list.files(pattern = "\\.csv$"),
        .f = read_and_add)

I go through a (more complicated) example that I worked through with a student on my blog here. Since it's a "teaching" post I go into a lot of detail, so you may want to skip some of it. You can see where I first make the function for reading and modifying each dataset here.


Also, you can read all the files directly into a single dataframe, using the .id parameter to capture the file name, and create the year column afterwards.

library(tidyverse)
library(stringr)

list_of_files <- list.files(pattern = ".csv$",
                            full.names = TRUE)
df <- list_of_files %>%
    setNames(nm = .) %>% 
    map_dfr(read.csv2, .id = "file_name") %>% 
    mutate(Year = str_extract(file_name, "\\d{4}"))

This might not be something you can speed up.

This problem sounds like you're reading many files from storage. If that is the case, you may be I/O bound, and looping more cleverly is unlikely to improve performance. Attempting multiple reads in parallel may even result in worse performance, depending on the underlying storage system.
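One rough way to check is to time a single read and extrapolate (a sketch; files stands for your vector of file paths):

# Sketch: if per-file read time times file count is close to the elapsed
# time of your full loop, the loop overhead is negligible and you are
# likely I/O bound.
files <- list.files(pattern = "\\.csv$")
one <- system.time(readr::read_csv2(files[1]))
one[["elapsed"]] * length(files) # compare with your loop's elapsed time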

Reading from disk is slow

For example, this write-up on the relative speed of processing and I/O tries to scale latency to put it in a human context: latency scaled

Try with futures

Split the files into two lists and have one list read asynchronously by a non-blocking process. This should result in two different processes reading from storage at once. You may have to run something like this from the command line (Rscript), as there are caveats when trying to run futures from RStudio.

library(readr)
library(stringr) # for str_extract()
library(future)
plan(multiprocess)

all_file_names <- list.files(pattern = "\\.csv$")
mid <- floor(length(all_file_names)/2)
# split list in half-ish
file_names_set_1 <- all_file_names[1:mid]
file_names_set_2 <- all_file_names[(mid+1):length(all_file_names)]

# Time how long this takes
start_time <- proc.time()

# Use future to read first half of list
df_1 %<-% {
  async_df <- c()
  for (x in file_names_set_1) {
    u <- read_csv2(x)
    u$Year <- str_extract(x, "\\d{4}")
    async_df <- rbind(async_df, u)
  }
  async_df
}

# Read second half of list in the current process
df_2 <- c()
for (x in file_names_set_2) {
  u <- read_csv2(x)
  u$Year <- str_extract(x, "\\d{4}") # this extracts the year easily
  df_2 <- rbind(df_2, u)
}

df <- rbind(df_1, df_2)

elapsed <- proc.time() - start_time
print(elapsed)

Thanks @grosscol for the creative approach to speeding this up (5x).
Is %<-% from magrittr?
If we divide further, i.e. run the for loops over 4 parts, are there any trade-offs?
Because at the end of the day this data has to be loaded into Shiny and rendered: about 30 million rows.

> print(elapsed) [without future]
   user  system elapsed 
541.731  51.299 610.976 

> print(elapsed) [with future]
   user  system elapsed 
178.281   2.553 182.613 

#########################
@aosmith, @andresrcs
It seems that map_dfr has an issue binding rows when colClasses is given.
I had to change a column's type from integer to character (column F below),
even though the column entries were purely integers.
The error was:
Column F can't be converted from integer to character
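For reference, a minimal sketch that reproduces the error with made-up data frames, plus the coercion workaround:

# Sketch: the same column read as different types from different files
# makes the bind fail; coercing to a common class first avoids it.
library(purrr)
library(dplyr)
df1 <- data.frame(F = 1:2)          # F detected as integer in one file
df2 <- data.frame(F = c("3", "4"))  # F detected as character in another
# map_dfr(list(df1, df2), identity) # errors with the conversion message
df_all <- map_dfr(list(df1, df2), ~ mutate(.x, F = as.character(F)))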

Also, @andresrcs,
the actual function call looks like this.
I struggled to adapt purrr and mutate when so much is happening within each loop. Is there a better way using purrr?

library(data.table) # for fread() and setnames()
library(stringr)    # for str_match()

column_names <- c("A","B","C","D","E","F","G")
column_classes <- c("factor", "factor", "factor", "factor", "factor", "character", "character") 
read_and_add = function(fl_nm) {
  file <- fread(paste0("data/",fl_nm), header = F, skip = 2, colClasses = column_classes)
  setnames(file, column_names)
  pattern = paste0(
    "(?<!\\d)", # not preceded by a digit
    "(",        # start defining group 1
    "\\d{4}", # match 4 digits in a row
    ")",        # done defining group 1
    "\\D",      # match a non-digit character
    "(",        # start defining group 2
    "\\d{2}", # match 2 digits in a row
    ")",        # done defining group 2
    "(?!\\d)"   # not followed by a digit
  )
  date_parts <- str_match(fl_nm, pattern)
  rownames(date_parts) <- fl_nm
  colnames(date_parts) <- c("matched", "year", "month")
  file$year <- date_parts[fl_nm, "year"]
  file$month <- date_parts[fl_nm, "month"]
  file
}
df_func = map_dfr(.x = list.files(path = "data", pattern = "\\.csv$"),
        .f = read_and_add)

Unless you have a good reason for editing each file's column names before merging them together, I would do it this way

library(tidyverse)
library(stringr)
library(data.table)

column_names <- c("file_name","A","B","C","D","E","F","G","year","month")
column_classes <- c("factor", "factor", "factor", "factor", "factor", "character", "character") 

list_of_files <- list.files(path = "data",
                            pattern = ".csv$",
                            full.names = TRUE)
df <- list_of_files %>%
    setNames(nm = .) %>% 
    map_dfr(~fread(., header = F, skip = 2, colClasses = column_classes),
            .id = "file_name") %>% 
    mutate(year = str_extract(file_name, "(?<!\\d)\\d{4}(?!\\d)"),
           month = str_extract(file_name, "(?<!\\d)\\d{2}(?!\\d)")) %>% 
    rename_all(~column_names)

Although I can't be sure this would fit your real data; if you need more specific help, consider sharing a link to a couple of sample .csv files so we can test our code.

The operator %<-% is from future. It's the assignment operator that delegates evaluation to a future block. With a multiprocess plan, the code runs in a non-blocking process. It should only block if and when the assigned value is referenced before the code has finished.
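A minimal sketch of the operator in isolation (using a multisession plan here):

# Sketch: %<-% assigns the result of a block that is evaluated as a future.
library(future)
plan(multisession) # evaluate futures in background R sessions

x %<-% {
  Sys.sleep(2) # stand-in for slow work such as reading files
  42
}
message("this prints immediately; the block runs in the background")
print(x) # blocks here, if needed, until the future has resolved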

Glad it resulted in a performance increase for you. You'd have to experiment to determine the optimal number of reading processes for your hardware and workload; the factors involved in retrieval from storage go beyond just the R code.
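As for dividing further: rather than hand-splitting the file list into 4 parts, one option is the future.apply package (a sketch, untested against your data; the worker count is just an example):

# Sketch: generalize the two-way split to N workers with future_lapply().
library(future.apply)
library(readr)
library(stringr)
plan(multisession, workers = 4) # experiment with the worker count

all_file_names <- list.files(pattern = "\\.csv$")
res <- future_lapply(all_file_names, function(x) {
  u <- read_csv2(x)
  u$Year <- str_extract(x, "\\d{4}")
  u
})
df <- do.call(rbind, res)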

fread should have the best performance for reading files (you mentioned fread, but I only see read_csv2 in your post).
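If you want to verify that on your own data, a quick sketch (the file name is just the example from earlier in the thread):

# Sketch: compare read times on one file; fread auto-detects the
# semicolon separator that read_csv2 assumes.
system.time(readr::read_csv2("abcd_xcv_pl_2019_01.csv"))
system.time(data.table::fread("abcd_xcv_pl_2019_01.csv"))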

purrr::map will not give much performance gain over a for loop.

  • The usual advice to avoid for loops is meant to point you toward the right vectorized alternatives, which often implement the loop in C and so are faster.
  • Another common bad practice to avoid is growing a list/vector inside a loop (see the sketch below).

Other than these two cases, for loops, lapply, and map are similar in performance.
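To make the second point concrete, a toy sketch:

# Sketch: growing a data frame inside a loop copies all previous rows on
# every iteration; collecting pieces in a list and binding once avoids that.
n <- 1000
bad <- c()
for (i in 1:n) bad <- rbind(bad, data.frame(x = i)) # quadratic copying

good <- do.call(rbind, lapply(1:n, function(i) data.frame(x = i)))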

Using data.table to add a column is also very fast because it happens in place, without creating a new copy. I have used this kind of code several times (I don't have access to my previous code now and don't have time to test the code below).

If you still need to improve performance and your file structure is well known, you can set colClasses, which should reduce the time fread spends on detecting column types.
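For example (a sketch; the classes are the ones from your function above, and the file name is assumed):

# Sketch: passing colClasses lets fread skip column-type detection.
library(data.table)
dt <- fread("data/abcd_xcv_pl_2019_01.csv",
            colClasses = c("factor", "factor", "factor", "factor",
                           "factor", "character", "character"))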

If you need to read lots of files or big files, learning data.table will be well worth your effort (readr is one big step up from base R, but still not as good as fread for reading csv files).

library(data.table)
library(stringr) # for str_extract()

res_list <- lapply(list.files(pattern = "\\.csv$"), function(x) {
  dt <- fread(x)
  dt[, Year := str_extract(x, "\\d{4}")] # adds the column in place, no copy
  dt
})

res_dt <- rbindlist(res_list)

