How to recursively apply a function to a nested list of data frames?

I have my data organized into nested lists of data frames. I'd like to be able to apply a function to each of the data frames and return the updated data frames in the same nested list structure. Currently I am using nested calls to lapply(). This works but is difficult to read. I was hopeful that rapply() could solve my problem by recursively applying a function to all list elements. But since data frames are technically lists, rapply() wants to operate at the level of the data frame columns instead of the entire data frame.

I did some searching online, but I couldn't find anything that exactly addressed the problem in a succinct way. For example, here's a StackOverflow answer that recommends the repeated lapply() strategy that I am already using.

Is there any way to get rapply() to apply functions to the entire data frame? Or is there another option for recursively applying functions?

Below is a reprex for a doubly nested list of data frames. The situation is of course worse for more highly nested lists.


# Create example data frame
newDF <- function() {
  observations <- rpois(1, lambda = 10)
  data.frame(group = sample(letters[1:3], observations, replace = TRUE),
             measure = rnorm(observations))
}
newDF()
#>   group     measure
#> 1     a  0.67623031
#> 2     c -0.97297851
#> 3     c -1.35261998
#> 4     b  0.69287941
#> 5     a -0.15351746
#> 6     c -0.62919316
#> 7     a  0.02225157
#> 8     c -1.22776690

# Example nested list
ex1 <- list(
  a1 = list(
    b1 = newDF()
  ),
  a2 = list(
    b2 = newDF(),
    b3 = newDF(),
    b4 = newDF()
  ),
  a3 = list(
    b5 = newDF()
  )
)
str(ex1)
#> List of 3
#>  $ a1:List of 1
#>   ..$ b1:'data.frame':   8 obs. of  2 variables:
#>   .. ..$ group  : chr [1:8] "a" "a" "a" "b" ...
#>   .. ..$ measure: num [1:8] -0.537 0.458 -2.297 -0.248 -1.514 ...
#>  $ a2:List of 3
#>   ..$ b2:'data.frame':   12 obs. of  2 variables:
#>   .. ..$ group  : chr [1:12] "a" "b" "a" "b" ...
#>   .. ..$ measure: num [1:12] -0.234 -0.871 -0.882 -1.164 -0.999 ...
#>   ..$ b3:'data.frame':   19 obs. of  2 variables:
#>   .. ..$ group  : chr [1:19] "a" "c" "a" "a" ...
#>   .. ..$ measure: num [1:19] 1.607 -1.653 -0.79 0.389 -1.645 ...
#>   ..$ b4:'data.frame':   7 obs. of  2 variables:
#>   .. ..$ group  : chr [1:7] "c" "c" "b" "b" ...
#>   .. ..$ measure: num [1:7] 0.503 1.325 0.468 -0.542 -0.504 ...
#>  $ a3:List of 1
#>   ..$ b5:'data.frame':   9 obs. of  2 variables:
#>   .. ..$ group  : chr [1:9] "c" "c" "b" "a" ...
#>   .. ..$ measure: num [1:9] -1.278 -0.136 0.295 -0.217 -0.771 ...

# Example function that needs to be applied to the entire data frame, not its columns
removeNegative <- function(x) x[x[["measure"]] >= 0, ]

# Double lapply() works but is unwieldy
result <- lapply(ex1, function(x) lapply(x, removeNegative))
result[[1]][[1]]
#>   group   measure
#> 2     a 0.4579189
#> 6     c 0.6518109
#> 7     b 0.9167038

# This throws an error because it attempts to apply the function to a single column
rapply(ex1, removeNegative, how = "replace")
#> Error in x[["measure"]]: subscript out of bounds

# This does nothing because none of the columns have class "data.frame"
result <- rapply(ex1, removeNegative, classes = "data.frame", how = "replace")
#>   group    measure
#> 1     a -0.5373990
#> 2     a  0.4579189
#> 3     a -2.2969573
#> 4     b -0.2479370
#> 5     c -1.5138069
#> 6     c  0.6518109
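
To make the column-level behaviour concrete, here is a minimal sketch with a small hand-made data frame (the names df, nested, and leaf_classes are illustrative and not part of the reprex above). Because a data frame is stored as a list of columns, rapply() keeps descending and calls the function once per column rather than once per data frame.

```r
# Illustrative data frame, not taken from the reprex
df <- data.frame(group = c("a", "b"), measure = c(1.5, -0.5))

# A data frame is a list under the hood
is.list(df)

nested <- list(a1 = list(b1 = df))

# rapply() visits each column as a leaf, so this returns one entry
# per column, not one entry per data frame
leaf_classes <- rapply(nested, function(x) class(x)[1], how = "unlist")
print(leaf_classes)
```

This is also why classes = "data.frame" never matches: the leaves that rapply() tests are the individual columns.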

Is this what you want?

myrapply = function (x, myfun) {
  if ("data.frame" %in% class(x)) return(myfun(x))
  if ("list" %in% class(x)) return (purrr::map(x,~myrapply(.,myfun)))
  stop('myrapply: argument is neither data.frame nor list')
}

removeNegative <- function(x) x[x[["measure"]] >= 0, ]

ex2 = myrapply(ex1,removeNegative)

Here is a similar idea to @HanOostdijk's response.

TheFunc <- function(D) {
  removeNegative <- function(x) x[x[["measure"]] >= 0, ]
  if (is.data.frame(D)) {
    removeNegative(D)
  } else {
    lapply(D, TheFunc)
  }
}
OUT <- lapply(ex1, TheFunc)

@HanOostdijk and @FJCC Thanks to you both for your quick replies! I learned a lot by reading both of your proposed solutions.

@HanOostdijk I selected your response as the solution since you were the first to suggest the idea of combining recursion with checking if the element is a data frame.

Here's the final solution I ended up using:

# Recursively apply function to all data frames in a nested list
dfrapply <- function(object, f, ...) {
  if (inherits(object, "data.frame")) {
    return(f(object, ...))
  }
  if (inherits(object, "list")) {
    return(lapply(object, function(x) dfrapply(x, f, ...)))
  }
  stop("List element must be either a data frame or another list")
}

And here is the result when it is applied to my original reprex:

> result <- dfrapply(ex1, removeNegative)
> result[[1]][[1]]
  group   measure
2     a 0.4579189
6     c 0.6518109
7     b 0.9167038
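
For anyone reusing this, here is a rough sketch of two things the recursive version handles that the double lapply() does not: extra arguments passed through `...`, and data frames sitting at different nesting depths. The helper keepAbove, its threshold argument, and the small nested list are hypothetical and only serve the illustration; dfrapply is repeated so the snippet runs on its own.

```r
# Recursively apply f to every data frame in a nested list
dfrapply <- function(object, f, ...) {
  if (inherits(object, "data.frame")) {
    return(f(object, ...))
  }
  if (inherits(object, "list")) {
    return(lapply(object, function(x) dfrapply(x, f, ...)))
  }
  stop("List element must be either a data frame or another list")
}

# Hypothetical helper: keep rows whose measure is at least `threshold`
keepAbove <- function(x, threshold = 0) x[x[["measure"]] >= threshold, ]

# Hand-made example with data frames at depths 1, 2, and 3
nested <- list(
  a1 = list(b1 = data.frame(group = c("a", "b", "c"), measure = c(-1, 0.5, 2))),
  a2 = list(b2 = list(c1 = data.frame(group = c("a", "c"), measure = c(1.5, -0.2)))),
  a3 = data.frame(group = "b", measure = 0.3)
)

# `threshold` is forwarded to keepAbove() through `...`
res <- dfrapply(nested, keepAbove, threshold = 1)
str(res)

# Anything that is neither a data frame nor a list triggers the stop() call
err <- tryCatch(dfrapply(list(1), identity), error = function(e) e)
```

The nesting and the names of the input are preserved, since lapply() keeps list names at every level.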

