Using `purrr` to read multiple files

durraniu · October 9, 2018, 7:52pm

Objective

I have 100 .hdf5 files in a folder. I want to read them, extract some data from each, and then combine the results from all 100 files into one data frame. .hdf5 files can be read using the rhdf5 package in R.

My current code

Using the for-loop I can achieve my objective as follows:

library(rhdf5)

temp = list.files(pattern="*.hdf5")
df_list = list()  # initialize a list

# Read all files into a list of data frames
for (i in unique(temp)){
  
  ## read the "data" group from the given file
  data <- h5read(file = i, name = "data")
  
  ### extract the SCC_Follow_Info dataset 
  df <- data$SCC_Follow_Info
  df <- as.data.frame(df)
  
  ## assign to the list
  df_list[[i]] <- df
} 

# Combining all data to 1 data frame ---------------------------
library(data.table)
df_sim <- data.table::rbindlist(df_list, idcol = "file.ID")

Question

Is there a purrr way to achieve my objective here? I read somewhere about this topic but can't seem to find it. It would be great if you could share a blog post doing something similar.

ksavin · October 9, 2018, 8:06pm

Do you really need purrr for this?

library(magrittr)
library(rhdf5)
library(data.table)
lapply(list.files(pattern="*.hdf5"), function(x) {
  h5read(file = x, name = 'data')$SCC_Follow_Info %>% as.data.table
}) %>% rbindlist

martin.R · October 9, 2018, 8:25pm

There are some examples using purrr here which you could adapt:
https://readxl.tidyverse.org/articles/articles/readxl-workflows.html

durraniu · October 9, 2018, 8:25pm

Thanks for your answer. This is very useful. However, I am wondering if this code can be extended for extracting multiple data objects from the .hdf5 file. For example, if I want to extract 5 more objects like SCC_Follow_Info and then finally combine them. I don't want to use h5read multiple times.

durraniu · October 9, 2018, 8:28pm

Thanks! This looks like something I can adapt.

ksavin · October 9, 2018, 8:44pm

If you are combining them with rbind, it's just as easy. Something like this

library(magrittr)
library(rhdf5)
library(data.table)

objects <- c('object1', 'object2', 'object3')
lapply(list.files(pattern="*.hdf5"), function(x) {
  h5read(file = x, name = 'data')[objects] %>% lapply(as.data.table)
}) %>% Reduce(c, .) %>% rbindlist

BTW, do.call(rbind) can be used instead of rbindlist, but data.table's solution is a bit faster and the package is overall wonderful.
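
As a rough illustration of that equivalence, here is a minimal sketch on two toy data frames. The list names and the `id` and `speed` columns are made up for the example and do not come from the original files. One practical difference is that `rbindlist()` can record the list names in an ID column through `idcol`, while `do.call(rbind, ...)` folds them into the row names.

```r
library(data.table)

# Toy stand-ins for the per-file tables (values are invented for illustration).
df_list <- list(
  a.hdf5 = data.frame(id = 1:2, speed = c(10.5, 11.0)),
  b.hdf5 = data.frame(id = 3:4, speed = c(9.8, 12.1))
)

# Base R: stack the data frames; the list names end up in the row names.
base_way <- do.call(rbind, df_list)

# data.table: stack the data frames and keep the list names in a "file.ID" column.
dt_way <- rbindlist(df_list, idcol = "file.ID")
```

Both results hold the same rows in the same order; only the bookkeeping of where each row came from differs.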

rensa · October 9, 2018, 9:15pm

I do exactly this with NetCDF files (which, as of version 4, are actually interoperable with HDF5 files).

library(tidyverse)
library(rhdf5)

# first, here's our extractor function. you can use it anonymously
# inside map_dfr; i'm separating it out here for clarity (and so you
# can reuse it). the extractor needs to accept a filename and
# return a data frame.
hdf5_extractor = function(fname) {

  data = h5read(file = fname, name = "data")

  # what you do here depends on how objects inside
  # the file are structured. if they're just vectors, you can
  # create and return a data frame like this:
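  # (tibble() replaces the deprecated data_frame(); naming the
  # columns keeps the output readable)
  return(tibble(
    SCC_Follow_Info = data$SCC_Follow_Info,
    something_else = data$something_else,
    another_thing = data$another_thing))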

  # if they aren't vectors, you'll have to think about another way
  # to combine them into a data frame...
}

# get the file list and pipe it into our extractor function
df_sim =
  list.files(pattern="*.hdf5") %>%
  set_names(.) %>%
  map_dfr(hdf5_extractor, .id = "file.ID")

If you have large HDF5 files and don't need everything from a particular column, you can also modify this function to filter the contents before you return them.
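
A minimal sketch of that filtering idea follows. To keep it runnable without rhdf5, it writes two small CSV files as stand-ins for the HDF5 files and reads them with `read.csv()`. The `time` and `speed` columns, the `min_speed` threshold, and the name `filtering_extractor` are all assumptions made for the example; with real files the reading step would be the `h5read()` call from the extractor above.

```r
library(purrr)

# Stand-in input: two small CSV files in a temporary folder (invented data).
tmp_dir <- file.path(tempdir(), "filter_demo")
dir.create(tmp_dir, showWarnings = FALSE)
write.csv(data.frame(time = 1:4, speed = c(0, 5, 12, 20)),
          file.path(tmp_dir, "run_01.csv"), row.names = FALSE)
write.csv(data.frame(time = 1:3, speed = c(30, 2, 15)),
          file.path(tmp_dir, "run_02.csv"), row.names = FALSE)

# Hypothetical extractor: read one file, keep only the rows of interest,
# and return the reduced data frame.
filtering_extractor <- function(fname, min_speed = 10) {
  df <- read.csv(fname)  # with HDF5 input this would be the h5read() step
  df[df$speed >= min_speed, , drop = FALSE]
}

# Full paths are needed for reading; base names make tidier IDs.
files <- list.files(tmp_dir, pattern = "\\.csv$", full.names = TRUE)
filtered <- map_dfr(set_names(files, basename(files)),
                    filtering_extractor, .id = "file.ID")
```

Filtering inside the extractor means each file is reduced before the pieces are bound together, so the combined data frame never has to hold the rows you were going to drop anyway.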

durraniu · October 10, 2018, 3:24pm

Thanks a lot! This is very easy to understand. However, I am running into another problem now. I know it is different from the original question, but am posting here as the code is the same.

Error with here package

I want to use the here package to locate my files:

> df_sim <- list.files(path = here("data", "raw_data"),
+                      pattern="*.hdf5") %>%
+   set_names(.) %>%
+   map_dfr(hdf5_extractor, .id = "file.ID")
 Error in h5checktypeOrOpenLoc(file, readonly = TRUE) : 
  Error in h5checktypeOrOpenLoc(). Cannot open file. File 'C:\Users\durraniu\Google Drive\Dissertation\Cars_20160601_01.hdf5' does not exist.

This is not what I expected. If I run just the first 2 lines, I get the correct output:

> list.files(path = here("data", "raw_data"),
+                      pattern="*.hdf5") %>%
+   set_names(.)
    Cars_20160601_01.hdf5     Cars_20160601_02.hdf5     Cars_20160601_03.hdf5     Cars_20160601_04.hdf5 
  "Cars_20160601_01.hdf5"   "Cars_20160601_02.hdf5"   "Cars_20160601_03.hdf5"   "Cars_20160601_04.hdf5"  .... <continued>

What am I doing wrong?

nwerth · October 10, 2018, 3:29pm

list.files gives just the base names by default (which, IMO, is odd and rarely useful). You need to ask for the whole paths.

list.files(
  path = here("data", "raw_data"),
  pattern = "*.hdf5",
  full.names = TRUE
)
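
To see the difference in isolation, here is a small sketch using a throwaway folder. The folder name `raw_data_demo` and the empty files in it are hypothetical and exist only for the demonstration.

```r
# Create a temporary folder with two empty placeholder files.
demo_dir <- file.path(tempdir(), "raw_data_demo")
dir.create(demo_dir, showWarnings = FALSE)
file.create(file.path(demo_dir, c("Cars_01.hdf5", "Cars_02.hdf5")))

# Default: base names only, relative to `demo_dir`, not to the working directory.
# Note that `pattern` is a regular expression, hence the escaped dot.
base_names <- list.files(demo_dir, pattern = "\\.hdf5$")

# With full.names = TRUE the folder is prepended, so the paths can be opened
# from any working directory.
full_paths <- list.files(demo_dir, pattern = "\\.hdf5$", full.names = TRUE)

# The base names only resolve if the working directory happens to be demo_dir;
# the full paths resolve from anywhere.
file.exists(base_names)
file.exists(full_paths)
```

This is why the reader looked for the file in the project root: it was handed a bare file name and resolved it against the working directory.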

durraniu · October 10, 2018, 6:25pm

Thank you. I learned a lot in this thread.

rensa · October 10, 2018, 10:15pm

/facepalm Yep, I forgot that. If you want to isolate the file name later (in order to extract metadata from it), you can pipe the full names through basename() to remove the path and then use tidyr::separate() to turn the delimited filename column into several columns.
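
A minimal sketch of that last step, assuming file names shaped like the ones shown earlier in the thread (`Cars_20160601_01.hdf5`). The folder in the paths and the column names `prefix`, `date`, and `run` are guesses at what the underscore-delimited parts mean, so rename them to suit your data.

```r
library(tibble)
library(tidyr)

# Example full paths (hypothetical folder; file names follow the thread's pattern).
paths <- c(
  "data/raw_data/Cars_20160601_01.hdf5",
  "data/raw_data/Cars_20160601_02.hdf5"
)

file_info <- tibble(path = paths)

# Drop the folder and the extension so only the delimited name remains.
file_info$file.ID <- tools::file_path_sans_ext(basename(file_info$path))

# Split the name on underscores into one column per part,
# keeping the original file.ID column alongside the new ones.
file_info <- separate(file_info, file.ID,
                      into = c("prefix", "date", "run"),
                      sep = "_", remove = FALSE)
```

The new columns are character by default; convert `date` with `as.Date(date, format = "%Y%m%d")` if it really is a date.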