How to declare hdf5r attribute data types: Transferring data between R and Python

Issue

I have a data set that I am using to create an H5 file to pass to a Python package. I am populating the attributes with the h5attr() <- value replacement function. However, when I load the attributes for each object in the group within my h5 file, the data type does not appear to be preserved. When I open the h5 file in Python with {h5py} and look at the attributes, every one is reported with dtype=object. Does anyone know if this is the default behaviour of h5attr()? If so, should I use create_attr() instead?

Thanks for any and all help! I recommend running this in R Markdown so you can make one R chunk and one Python chunk for each of my blocks of code.

Reprex - edited

I am providing a simplified version of the code with sample data for three objects/events within the first group.

Each event holds a 3x6000 matrix.

Each matrix should have 3 attributes: a numeric, a character string, and a list.

Edits

  1. The reticulate package is used at the end of the R chunk to pass the path of the h5 file created in R to the Python chunk.

  2. Removed the list format from the purrr functions; the functions now run cleanly.

R Code for Creating the File

library(hdf5r)
library(dplyr)
library(reticulate)

h5_file_path = here::here() # path to where you are creating the h5 file 

# This line creates the empty file
NMTSO_trainer.h5 <- H5File$new(filename = sprintf("%s/NMTSO_trainer.h5", h5_file_path), mode = "a")

# This creates a group within the file; think of the file as a directory tree where each group is like a folder
data.grp <- NMTSO_trainer.h5$create_group("data")

# Items to populate attributes
trace_name = c("sample_event1", "sample_event2", "sample_event3")
col_names = c("att1", "att2", "att3")
value = list(runif(n = 1, -100, 100), "SC", list(c(0,0,runif(n = 1, 0, 5))))

# Place holder for the matrices per event
x = list()

events = length(trace_name)

# Populates the event matrices
for (i in 1:events){
  x[[i]] <- runif(n = 6000, -1, 1) %>% matrix(nrow = 1)
  x[[i]] <- rep(0,(2 * 6000)) %>% matrix(nrow = 2) %>% rbind(x[[i]])
}

# Puts each matrix within the corresponding "folder" in the h5 file
purrr::map2(trace_name, x, function(trace_name, x){
  data.grp[[trace_name]] <- x
})

# Puts the corresponding attributes with each matrix - there should be 3 per matrix. 
# This is where I am wondering if I should use create_attr() rather than h5attr()
purrr::walk(trace_name, function(trace_name){
  purrr::walk2(col_names, value, function(col_names, value){
    h5attr(data.grp[[trace_name]], col_names) <- value
  })
})

# Shows the class of the objects populated in the h5 file according to R
h5attr(data.grp[[trace_name[1]]], col_names[1]) %>% class()
h5attr(data.grp[[trace_name[1]]], col_names[2]) %>% class()
h5attr(data.grp[[trace_name[1]]], col_names[3]) %>% class()

# The file must be closed for all data to be written to the file
NMTSO_trainer.h5$close_all(close_self = TRUE)

py$file_path = sprintf("%s/NMTSO_trainer.h5", h5_file_path)

Python Chunk for Evaluating Attr Format

Edits

  1. Passed the file_path object created in R into the Python environment using the reticulate package. The user no longer needs to edit file_path, as long as the code is run in an RStudio R Markdown file to take advantage of R's Python engine.

import h5py
import pandas as pd
import numpy as np

e = h5py.File(file_path, 'r')

# Shows the users what groups are in the file 
list(e.keys())

group = e['data']

# Shows the user what events are in the group
list(group.keys())

# Shows the user what is in the attributes 
group['sample_event1'].attrs['att1']
group['sample_event1'].attrs['att2']
group['sample_event1'].attrs['att3']

# Shows the user what format the data is in
type(group['sample_event1'].attrs['att1'])
type(group['sample_event1'].attrs['att2'])
type(group['sample_event1'].attrs['att3'])
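
Calling type() on an attribute mostly reports the wrapper (often numpy.ndarray), which hides the two things that matter here: the element dtype and the shape. The following is a minimal sketch, assuming h5py and NumPy are installed. describe_attrs is a hypothetical helper, and the file it reads is a small stand-in written with h5py so the block runs on its own; it is not the file produced by the R code above.

```python
import os
import tempfile

import h5py
import numpy as np


def describe_attrs(obj):
    """Map each attribute name to (Python type name, NumPy dtype, shape)."""
    summary = {}
    for name, value in obj.attrs.items():
        arr = np.asarray(value)
        summary[name] = (type(value).__name__, arr.dtype, arr.shape)
    return summary


# Stand-in file (assumption): written here only so the sketch is self-contained.
path = os.path.join(tempfile.mkdtemp(), "describe_demo.h5")
with h5py.File(path, "w") as f:
    dset = f.create_dataset("data/sample_event1", data=np.zeros((3, 4)))
    dset.attrs["att1"] = np.array([42.0])           # 1-element float array
    dset.attrs["att3"] = np.array([0.0, 0.0, 1.5])  # 3-element float array

with h5py.File(path, "r") as f:
    summary = describe_attrs(f["data/sample_event1"])

for name, (py_type, dtype, shape) in summary.items():
    print(name, py_type, dtype, shape)
```

Pointing describe_attrs at group['sample_event1'] from the chunk above should make it easier to see whether an attribute is a true scalar, a 1-element array, or an object-dtype array of strings.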

Half Solution/Work Around

So I found that if I use {reticulate} to create the HDF5 file in Python rather than R, the attributes I create retain their formatting after the file is closed and reopened. As I have built my pipeline in R, this is not an ideal solution. If anyone knows how to do this with {hdf5r} rather than {h5py}, I would love to learn.

Remaining Question

The first attribute's type comes back as float64, which I believe is R's default for numerics. Is there a simple way to convert it to float32, and is the conversion even necessary? I have an example file that goes to the program, and its float attributes are float32.
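
On the h5py side, one way to control the stored precision is AttributeManager.create(), which accepts an explicit dtype. The following is a minimal sketch, assuming h5py and NumPy are installed; the file name, the value 12.5, and the attribute names att1_default and att1_float32 are made up for illustration. Whether the conversion is actually required depends on how strictly the consuming program checks dtypes, which is not known here.

```python
import os
import tempfile

import h5py
import numpy as np

# Throwaway file (assumption): stands in for the real training file.
path = os.path.join(tempfile.mkdtemp(), "float32_attr_demo.h5")

with h5py.File(path, "w") as f:
    dset = f.create_dataset(
        "data/sample_event1", data=np.zeros((3, 4), dtype=np.float32)
    )
    # Plain assignment of a Python float stores a 64-bit float.
    dset.attrs["att1_default"] = 12.5
    # attrs.create() lets the stored dtype be chosen explicitly.
    dset.attrs.create("att1_float32", 12.5, dtype=np.float32)

with h5py.File(path, "r") as f:
    attrs = f["data/sample_event1"].attrs
    default_dtype = np.asarray(attrs["att1_default"]).dtype
    float32_dtype = np.asarray(attrs["att1_float32"]).dtype

print(default_dtype, float32_dtype)
```

Assigning np.float32(12.5) directly to the attribute should have the same effect as attrs.create() with dtype=np.float32.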

Theory

Could the reason that every object written from R and opened in Python through an HDF5 file is treated as an array have something to do with how R treats all of its objects as vectors?
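
If that theory is right, a single number from R would reach the file as a 1-element array rather than as a scalar, and h5py would then return an array of shape (1,). The following is a minimal sketch of that difference, assuming h5py and NumPy are installed. It reproduces both layouts from Python only and does not confirm what hdf5r actually writes; unwrap_scalar and the attribute names are hypothetical.

```python
import os
import tempfile

import h5py
import numpy as np


def unwrap_scalar(value):
    """Return a plain Python scalar for 0-d or length-1 values; leave others unchanged."""
    arr = np.asarray(value)
    if arr.size == 1:
        return arr.reshape(()).item()
    return value


# Throwaway file (assumption): used only to compare the two layouts.
path = os.path.join(tempfile.mkdtemp(), "length_one_demo.h5")

with h5py.File(path, "w") as f:
    grp = f.create_group("data")
    grp.attrs["as_scalar"] = 2.5              # stored with a scalar dataspace
    grp.attrs["as_vector"] = np.array([2.5])  # stored as a 1-element array

with h5py.File(path, "r") as f:
    scalar_value = f["data"].attrs["as_scalar"]
    vector_value = f["data"].attrs["as_vector"]

scalar_shape = np.shape(scalar_value)
vector_shape = np.shape(vector_value)
unwrapped = unwrap_scalar(vector_value)
print(scalar_shape, vector_shape, unwrapped)
```

A helper like unwrap_scalar is only a reader-side convenience. If the consuming package expects true scalar attributes in the file itself, the attributes would need to be written with a scalar dataspace instead.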

library(dplyr)
library(reticulate)

h5_file_path = sprintf("%s/NMTSO_test.h5", here::here())

# Items to populate attributes
trace_name = c("sample_event1", "sample_event2", "sample_event3")
col_names = c("att1", "att2", "att3")
value = list(runif(n = 1, -100, 100), "SC", list(c(0,0,runif(n = 1, 0, 5))))

# Place holder for the matrices per event
x = list()

events = length(trace_name)

# Populates the event matrices
for (i in 1:events){
  x[[i]] <- runif(n = 6000, -1, 1) %>% matrix(nrow = 1)
  x[[i]] <- rep(0,(2 * 6000)) %>% matrix(nrow = 2) %>% rbind(x[[i]])
}

# Clean up the loop helpers once the loop has finished
rm(events)
rm(i)

reticulate::repl_python()

import h5py
import pandas as pd
import numpy as np

# insert your own path in place of mine here
e = h5py.File(r.h5_file_path, 'w')

# loop to create and fill each event in the group with values from the R object 'x'
for i in np.r_[0:len(r.trace_name)]:
  e.create_dataset("data/"+ r.trace_name[i], r.x[i].shape, data=r.x[i], dtype=np.float32)

# loops to populate the attributes in each sample_event from the R object 'value'
for i in np.r_[0:len(r.trace_name)]:
  for j in np.r_[0:len(r.col_names)]:
    e['data/'+r.trace_name[i]].attrs[''+r.col_names[j]] = r.value[j]

# Shows the user what is in the attributes 
e['data/sample_event1'].attrs['att1']
e['data/sample_event1'].attrs['att2']
e['data/sample_event1'].attrs['att3']

# Shows the user what format the data is in
type(e['data/sample_event1'].attrs['att1'])
type(e['data/sample_event1'].attrs['att2'])
type(e['data/sample_event1'].attrs['att3'])

e.close()

exit
