Using `purrr` to read multiple files

rensa · October 9, 2018, 9:15pm

I do exactly this with NetCDF files (which, as of version 4, are actually interoperable with HDF5 files):

library(tidyverse)
library(rhdf5)

# first, here's our extractor function. you can use it anonymously
# inside map_dfr; i'm separating it out here for clarity (and so you
# can reuse it). the extractor needs to accept a filename and
# return a data frame.
hdf5_extractor = function(fname) {

  data = h5read(file = fname, name = "data")

  # what you do here depends on how objects inside
  # the file are structured. if they're just vectors, you can
  # create and return a data frame like this:
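  # (tibble() replaces the deprecated data_frame(); naming the
  # columns explicitly keeps the output names clean)
  return(tibble(
    SCC_Follow_Info = data$SCC_Follow_Info,
    something_else = data$something_else,
    another_thing = data$another_thing))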

  # if they aren't vectors, you'll have to think about another way
  # to combine them into a data frame...
}

# get the file list and pipe it into our extractor function.
# note that `pattern` is a regular expression, not a glob.
df_dim =
  list.files(pattern = "\\.hdf5$") %>%
  set_names(.) %>%
  map_dfr(hdf5_extractor, .id = "file.ID")

If you have large HDF5 files and don't need everything from a particular column, you can also modify this function to filter the contents before you return them.
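
As a rough sketch of that kind of filtering, here's a variant of the extractor that drops rows before returning. The dataset names, the `min_value` argument and the threshold are placeholders I've made up for illustration, and it assumes the objects under `data` are equal-length vectors:

```r
library(tidyverse)
library(rhdf5)

# hypothetical variant of the extractor: keeps only the rows where
# `something_else` is above a threshold. the column names and
# `min_value` are assumptions; swap in whatever your files contain.
hdf5_extractor_filtered = function(fname, min_value = 0) {

  data = h5read(file = fname, name = "data")

  # as.vector() drops the 1-d array dimensions that h5read returns
  tibble(
    SCC_Follow_Info = as.vector(data$SCC_Follow_Info),
    something_else = as.vector(data$something_else)) %>%
    filter(something_else > min_value)
}

# extra arguments after the function name are passed on to it
df_filtered =
  list.files(pattern = "\\.hdf5$") %>%
  set_names() %>%
  map_dfr(hdf5_extractor_filtered, min_value = 10, .id = "file.ID")
```

This still reads each whole group into memory before filtering. If that is too much, `h5read` also has an `index` argument for reading only part of a dataset, though how you'd use it depends on the layout of your files.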