Using `purrr` to read multiple files

#1

Objective

I have 100 .hdf5 files in a folder. I want to read them, extract some data from each, and combine the results from all 100 files into one data frame. The .hdf5 files can be read with the rhdf5 package in R.

My current code

Using the for-loop I can achieve my objective as follows:

library(rhdf5)

temp = list.files(pattern="*.hdf5")
df_list = list()  # initialize a list

# Read all files into a list of data frames
for (i in unique(temp)){
  
  ## read the "data" group from the given file
  data <- h5read(file = i, name = "data")
  
  ### extract the SCC_Follow_Info dataset 
  df <- data$SCC_Follow_Info
  df <- as.data.frame(df)
  
  ## assign to the list
  df_list[[i]] <- df
} 

# Combining all data to 1 data frame ---------------------------
library(data.table)
df_sim <- data.table::rbindlist(df_list, idcol = "file.ID")

Question

Is there a purrr way to achieve my objective here? I read somewhere about this topic but can't seem to find it. It would be great if you could share a blog post doing something similar.


#2

Do you really need purrr for this?

library(magrittr)
library(rhdf5)
library(data.table)
lapply(list.files(pattern="*.hdf5"), function(x) {
  h5read(file = x, name = 'data')$SCC_Follow_Info %>% as.data.table
}) %>% rbindlist

#3

There are some examples using purrr here which you could adapt:
https://readxl.tidyverse.org/articles/articles/readxl-workflows.html


#4

Thanks for your answer. This is very useful. However, I am wondering if this code can be extended for extracting multiple data objects from the .hdf5 file. For example, if I want to extract 5 more objects like SCC_Follow_Info and then finally combine them. I don't want to use h5read multiple times.


#5

Thanks! This looks like something I can adapt.


#6

If you are combining them with rbind, it's just as easy. Something like this:

library(magrittr)
library(rhdf5)
library(data.table)

objects <- c('object1', 'object2', 'object3')
lapply(list.files(pattern="*.hdf5"), function(x) {
  h5read(file = x, name = 'data')[objects] %>% lapply(as.data.table)
}) %>% Reduce(c, .) %>% rbindlist

BTW, do.call(rbind, ...) can be used instead of rbindlist, but data.table's version is a bit faster, and the package is wonderful overall.
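
For anyone comparing the two approaches, here is a minimal sketch on toy data. The list names and columns are made up for illustration and are not taken from the HDF5 files in this thread; it only assumes data.table is installed.

```r
library(data.table)

# Toy stand-ins for the per-file data frames; names and values are made up.
df_list <- list(
  "a.hdf5" = data.frame(id = 1:2, speed = c(10.5, 11.0)),
  "b.hdf5" = data.frame(id = 3:4, speed = c(9.8, 12.1))
)

# Base R: rbind() all list elements in a single call.
combined_base <- do.call(rbind, df_list)

# data.table: same rows, plus an optional column recording which
# list element each row came from.
combined_dt <- rbindlist(df_list, idcol = "file.ID")
```

Both results hold the same rows. The difference is that rbindlist keeps the list names in a proper column (here file.ID), whereas do.call(rbind, ...) folds them into the row names.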


#7

I do exactly this with NetCDF files (which, as of version 4, are actually interoperable with HDF5 files) :slight_smile:

library(tidyverse)
library(rhdf5)

# first, here's our extractor function. you can use it anonymously
# inside map_dfr; i'm separating it out here for clarity (and so you
# can reuse it). the extractor needs to accept a filename and
# return a data frame.
hdf5_extractor = function(fname) {

  data = h5read(file = fname, name = "data")

  # what you do here depends on how objects inside
  # the file are structured. if they're just vectors, you can
  # create and return a data frame like this:
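  # (tibble() replaces the deprecated data_frame(); naming the
  # columns keeps the output tidy)
  return(tibble(
    SCC_Follow_Info = data$SCC_Follow_Info,
    something_else = data$something_else,
    another_thing = data$another_thing))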

  # if they aren't vectors, you'll have to think about another way
  # to combine them into a data frame...
}

# get the file list and pipe it into our extractor function
df_sim =
  list.files(pattern="*.hdf5") %>%
  set_names(.) %>%
  map_dfr(hdf5_extractor, .id = "file.ID")

If you have large HDF5 files and don't need everything from a particular column, you can also modify this function to filter the contents before you return them :slight_smile:
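
A rough sketch of that filtering step, run on an in-memory stand-in for what h5read returns so it works without any files. The columns frame and speed, the threshold, and the helper name filter_follow_info are all hypothetical; swap in whatever your SCC_Follow_Info actually contains.

```r
library(dplyr)

# Stand-in for the list returned by h5read(file = fname, name = "data").
# The columns `frame` and `speed` are hypothetical.
data <- list(
  SCC_Follow_Info = data.frame(frame = 1:5, speed = c(0, 3.2, 0, 7.5, 8.1))
)

# Hypothetical helper: keep only the rows and columns of interest.
# Inside an extractor, this would run just before the data frame is returned.
filter_follow_info <- function(data, min_speed = 1) {
  data$SCC_Follow_Info %>%
    as.data.frame() %>%
    filter(speed >= min_speed) %>%
    select(frame, speed)
}

filtered <- filter_follow_info(data)
```

Filtering inside the extractor means each file is trimmed before the rows are bound together, so the combined data frame never holds rows you would discard anyway.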


#8

Thanks a lot! This is very easy to understand. However, I am running into another problem now. I know it is different from the original question, but I am posting it here because the code is the same.

Error with here package

I want to use the here package to locate my files:

> df_sim <- list.files(path = here("data", "raw_data"),
+                      pattern="*.hdf5") %>%
+   set_names(.) %>%
+   map_dfr(hdf5_extractor, .id = "file.ID")
 Error in h5checktypeOrOpenLoc(file, readonly = TRUE) : 
  Error in h5checktypeOrOpenLoc(). Cannot open file. File 'C:\Users\durraniu\Google Drive\Dissertation\Cars_20160601_01.hdf5' does not exist. 

This is not what I expected. If I run just the first 2 lines, I get the correct output:

> list.files(path = here("data", "raw_data"),
+                      pattern="*.hdf5") %>%
+   set_names(.)
    Cars_20160601_01.hdf5     Cars_20160601_02.hdf5     Cars_20160601_03.hdf5     Cars_20160601_04.hdf5 
  "Cars_20160601_01.hdf5"   "Cars_20160601_02.hdf5"   "Cars_20160601_03.hdf5"   "Cars_20160601_04.hdf5"  .... <continued>

What am I doing wrong?


#9

list.files gives just the base names by default (which, IMO, is odd and rarely useful). You need to ask for the whole paths.

list.files(
  path = here("data", "raw_data"),
  pattern = "*.hdf5",
  full.names = TRUE
)
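
To see the difference without touching real data, here is a small self-contained sketch that uses a scratch folder under tempdir(). The folder and file names are made up. It also uses the pattern "\\.hdf5$", since pattern is a regular expression rather than a glob; that is my own tightening, not something the code above requires.

```r
# Create a scratch folder with two empty files (names are made up).
scratch <- file.path(tempdir(), "raw_data_demo")
dir.create(scratch, showWarnings = FALSE)
invisible(file.create(file.path(scratch, c("Cars_01.hdf5", "Cars_02.hdf5"))))

# Default: base names only. These resolve only if the working
# directory happens to be `scratch`.
short_names <- list.files(path = scratch, pattern = "\\.hdf5$")

# full.names = TRUE: each result is prefixed with `scratch`, so the
# paths can be opened from any working directory.
full_paths <- list.files(path = scratch, pattern = "\\.hdf5$", full.names = TRUE)

# Check that every full path points at an existing file.
file.exists(full_paths)
```

This is the same situation as the error above: h5read received a bare file name and looked for it relative to the working directory instead of inside data/raw_data.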

#10

Thank you. I learned a lot in this thread.


#11

/facepalm Yep, I forgot that :laughing: If you want to isolate the file name later (in order to extract metadata from it), you can pipe the full names through basename() to remove the path and then tidyr::separate() to turn the delimited filename column into several columns :slightly_smiling_face:
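
A quick sketch of that idea, assuming file names shaped like the Cars_20160601_01.hdf5 ones shown earlier. The folder in the paths and the column names vehicle, date and run are guesses for illustration only.

```r
library(tidyr)

# Made-up full paths shaped like the file names shown earlier in the thread.
full_paths <- c(
  "C:/some/folder/Cars_20160601_01.hdf5",
  "C:/some/folder/Cars_20160601_02.hdf5"
)

# Drop the directory, then the extension, leaving e.g. "Cars_20160601_01".
file_id <- tools::file_path_sans_ext(basename(full_paths))

# Split on "_" into three columns; the column names are illustrative guesses.
# remove = FALSE keeps the original file.ID column alongside the new ones.
meta <- separate(
  data.frame(file.ID = file_id),
  col = file.ID,
  into = c("vehicle", "date", "run"),
  sep = "_",
  remove = FALSE
)
```

The new columns come back as character, so the date and run number would still need converting (for example with as.Date(date, format = "%Y%m%d")) before you use them as anything other than labels.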