Tidync: split-apply-combine with tidyverse vs. array operations

dplyr
tidyr
tidync

#1

I don't work with gridded data very often, by climate science standards, but a recent analysis had me doing some small operations on some gridded NetCDF data: monthly occurrences of records in grid cells over Australia.

I wanted to derive from this the age of records for each grid cell and timestep. I chose to use the super-handy tidync package to pull out a data frame using hyper_tibble, group by grid cell (longitude + latitude) and use tidyr::fill to work it out:

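# packages used below (lubridate supplies months() for numeric input)
library(tidync)
library(dplyr)
library(tidyr)
library(magrittr)   # for the %T>% tee pipe
library(lubridate)
library(here)

rec_txa =
  # get the netcdf into a data frame
  tidync(
    here('analysis', 'data', 'record-ts', 'monthly-TXmean-records.nc')) %>%
  activate('hot_record_time') %>%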
  hyper_tibble() %T>%
  # tidy up a little
  print() %>%
  mutate(record = as.logical(hot_record_time)) %>%
  select(-hot_record_time) %>%
  arrange(time) %>%
  # cell-by-cell: find the time since the last record
  group_by(longitude, latitude) %>%
  mutate(last_record = if_else(record, true = time, false = NA_integer_)) %>%
  fill(last_record) %>%
  mutate(
    record_age = time - last_record,
    record_age_clamped = case_when(
      record_age > 120 ~ NA_integer_,
      TRUE ~ record_age
    )) %>%
  select(-last_record) %>%
  ungroup() %>%
  # convert time column to actual dates
  mutate(date = as.Date('1950-01-15') + months(time - 1)) %T>%
  print()
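To make the fill step concrete, here is a minimal toy version of the same idea. The `toy` data frame, its `cell` column and its values are made up for illustration and stand in for the longitude/latitude grouping above; the NetCDF file itself isn't touched.

```r
library(dplyr)
library(tidyr)

# Hypothetical data: two "cells", five timesteps each
toy <- tibble(
  cell   = rep(c("a", "b"), each = 5),
  time   = rep(1:5, times = 2),
  record = c(TRUE, FALSE, FALSE, TRUE, FALSE,
             FALSE, FALSE, TRUE, FALSE, FALSE)
)

toy_age <- toy %>%
  arrange(time) %>%
  group_by(cell) %>%
  # keep the timestep only where a record was set...
  mutate(last_record = if_else(record, time, NA_integer_)) %>%
  # ...then carry it forward within each cell
  fill(last_record) %>%
  mutate(record_age = time - last_record) %>%
  ungroup() %>%
  arrange(cell, time)

print(toy_age)
```

Timesteps before a cell's first record stay NA, because there is nothing for fill to carry forward yet.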

In this case it was fast enough for my needs: on the order of 10 seconds on my 2014 MBP for what was a ~2M row data frame (2000 cells * 1000 months). However, I'm aware that tidync also has a hyper_array function, and I'm not very familiar with R's matrix tools. Would it be quicker to dump to an array and use other tools to do this work? Or is there some possible performance benefit from hyper_tibble accepting a grouping argument to skip dumping the whole thing into a single data frame before splitting it up again?

I mostly ask because it's pretty typical for me to do embarrassingly parallelisable work on gridded data (either by grid cell, working on time series one at a time, or grouping by timestep to aggregate spatially), and in my PhD work I've mostly outsourced this kinda thing with bigger datasets to dedicated tools like CDO, at the expense of readability. But for many of my colleagues who're Python users, xarray's similar split-apply-combine workflow is often used for analysis on model output.


#2

Emily Robinson did a neat post on making R code faster a while back, and (not knowing much about your use case), I'd venture a guess that matrices would be faster…



#3

Thanks @mara! I learned to use profvis recently, so if I get some time to spare I might have a look at where the holdup is and see what I can do about it :slight_smile: I particularly appreciate that Emily's article has a section on weighing up the speed boost from switching to matrix operations against the readability of tidy tools :smiley:


#4

Missed this, sorry! The hyper_array output is a list of arrays, one for each variable in the grid (use the select_var argument to limit which ones), and their dimensions will reflect any hyper_filter expressions. It's definitely quicker not to expand to a data frame, and the tidync object does have a "transforms" component with the axis values of each dimension; again, those have a selected column to reflect filters. Everything needed is in the tidync object's tables, but not all of the helper functions exist yet.
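As a rough sketch of what the array route could look like for the record-age question, here is a base R version that works on a plain 3D array. The `record` array is simulated, and its longitude x latitude x time layout is an assumption; in practice it would be one element of the list that hyper_array returns, and the dimension order should be checked against the file. The `record_age` helper is hypothetical and not part of tidync.

```r
# Simulated stand-in for one variable from hyper_array():
# a lon x lat x time array of record flags (sizes and values made up)
set.seed(1)
n_lon <- 3; n_lat <- 2; n_time <- 6
record <- array(rbinom(n_lon * n_lat * n_time, 1, 0.3) == 1,
                dim = c(n_lon, n_lat, n_time))

# Hypothetical helper: age of the latest record along one cell's time series
record_age <- function(is_record) {
  step <- seq_along(is_record)
  # index of the most recent record so far; 0 means "no record yet"
  last_record <- cummax(ifelse(is_record, step, 0L))
  ifelse(last_record == 0L, NA_integer_, step - last_record)
}

# apply() over lon and lat puts the time axis first, so permute it back
age <- aperm(apply(record, c(1, 2), record_age), c(2, 3, 1))

dim(age)
record_age(c(FALSE, TRUE, FALSE, FALSE, TRUE, FALSE))
```

The cummax trick plays the same role as tidyr::fill in the data frame version: it carries the index of the last record forward through time. Whether this is actually faster for a given dataset is something to measure rather than assume.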


#5

Thanks @mdsumner! I didn't find hyper_tbl_cube until later, and tbl_cube looks like an exciting way to combine the strengths of leaving the data as arrays with those of the tidyverse :slight_smile:


#6

tbl_cube is pretty good but can't deal with many real-world cases. I think keeping two tables is best, and multiple ones if the dimensions are not simplistic. I've tried a few things and am keen to hear any use cases you have. I'd really benefit from hearing how you'd like a system to work. Feel free to discuss on GitHub if you like.