How to recursively apply a function to a nested list of data frames?

I have my data organized into nested lists of data frames. I'd like to be able to apply a function to each of the data frames and return the updated data frames in the same nested list structure. Currently I am using nested calls to lapply(). This works but is difficult to read. I was hopeful that rapply() could solve my problem by recursively applying a function to all list elements. But since data frames are technically lists, rapply() wants to operate at the level of the data frame columns instead of the entire data frame.

I did some searching online, but I couldn't find anything that exactly addressed the problem in a succinct way. For example, here's a StackOverflow answer that recommends the repeated lapply() strategy that I am already using.

Is there any way to get rapply() to apply functions to the entire data frame? Or is there another option for recursively applying functions?

Below is a reprex for a doubly nested list of data frames. The situation is of course worse for more highly nested lists.

set.seed(20200618)

# Create example data frame
newDF <- function() {
  observations <- rpois(1, lambda = 10)
  data.frame(group = sample(letters[1:3], observations, replace = TRUE),
             measure= rnorm(observations))
}
newDF()
#>   group     measure
#> 1     a  0.67623031
#> 2     c -0.97297851
#> 3     c -1.35261998
#> 4     b  0.69287941
#> 5     a -0.15351746
#> 6     c -0.62919316
#> 7     a  0.02225157
#> 8     c -1.22776690

# Example nested list
ex1 <- list(
  a1 = list(
    b1 = newDF()
  ),
  a2 = list(
    b2 = newDF(),
    b3 = newDF(),
    b4 = newDF()
  ),
  a3 = list(
    b5 = newDF()
  )
)
str(ex1)
#> List of 3
#>  $ a1:List of 1
#>   ..$ b1:'data.frame':   8 obs. of  2 variables:
#>   .. ..$ group  : chr [1:8] "a" "a" "a" "b" ...
#>   .. ..$ measure: num [1:8] -0.537 0.458 -2.297 -0.248 -1.514 ...
#>  $ a2:List of 3
#>   ..$ b2:'data.frame':   12 obs. of  2 variables:
#>   .. ..$ group  : chr [1:12] "a" "b" "a" "b" ...
#>   .. ..$ measure: num [1:12] -0.234 -0.871 -0.882 -1.164 -0.999 ...
#>   ..$ b3:'data.frame':   19 obs. of  2 variables:
#>   .. ..$ group  : chr [1:19] "a" "c" "a" "a" ...
#>   .. ..$ measure: num [1:19] 1.607 -1.653 -0.79 0.389 -1.645 ...
#>   ..$ b4:'data.frame':   7 obs. of  2 variables:
#>   .. ..$ group  : chr [1:7] "c" "c" "b" "b" ...
#>   .. ..$ measure: num [1:7] 0.503 1.325 0.468 -0.542 -0.504 ...
#>  $ a3:List of 1
#>   ..$ b5:'data.frame':   9 obs. of  2 variables:
#>   .. ..$ group  : chr [1:9] "c" "c" "b" "a" ...
#>   .. ..$ measure: num [1:9] -1.278 -0.136 0.295 -0.217 -0.771 ...

# Example function that needs to be applied to the entire data frame, not its columns
removeNegative <- function(x) x[x[["measure"]] >= 0, ]

# Double lapply() works but is unwieldy
result <- lapply(ex1, function(x) lapply(x, removeNegative))
head(result[[1]][[1]])
#>   group   measure
#> 2     a 0.4579189
#> 6     c 0.6518109
#> 7     b 0.9167038

# This throws an error because it attempts to apply the function to a single column
rapply(ex1, removeNegative, how = "replace")
#> Error in x[["measure"]]: subscript out of bounds

# This does nothing because none of the columns have class "data.frame"
result <- rapply(ex1, removeNegative, classes = "data.frame", how = "replace")
head(result[[1]][[1]])
#>   group    measure
#> 1     a -0.5373990
#> 2     a  0.4579189
#> 3     a -2.2969573
#> 4     b -0.2479370
#> 5     c -1.5138069
#> 6     c  0.6518109

Created on 2020-06-18 by the reprex package (v0.3.0)

Session info
devtools::session_info()
#> - Session info ---------------------------------------------------------------
#>  setting  value                       
#>  version  R version 4.0.1 (2020-06-06)
#>  os       Windows 10 x64              
#>  system   x86_64, mingw32             
#>  ui       RTerm                       
#>  language (EN)                        
#>  collate  English_United States.1252  
#>  ctype    English_United States.1252  
#>  tz       America/New_York            
#>  date     2020-06-18                  
#> 
#> - Packages -------------------------------------------------------------------
#>  package     * version date       lib source        
#>  assertthat    0.2.1   2019-03-21 [1] CRAN (R 4.0.0)
#>  backports     1.1.7   2020-05-13 [1] CRAN (R 4.0.0)
#>  callr         3.4.3   2020-03-28 [1] CRAN (R 4.0.0)
#>  cli           2.0.2   2020-02-28 [1] CRAN (R 4.0.0)
#>  crayon        1.3.4   2017-09-16 [1] CRAN (R 4.0.0)
#>  desc          1.2.0   2018-05-01 [1] CRAN (R 4.0.0)
#>  devtools      2.3.0   2020-04-10 [1] CRAN (R 4.0.0)
#>  digest        0.6.25  2020-02-23 [1] CRAN (R 4.0.0)
#>  ellipsis      0.3.1   2020-05-15 [1] CRAN (R 4.0.0)
#>  evaluate      0.14    2019-05-28 [1] CRAN (R 4.0.0)
#>  fansi         0.4.1   2020-01-08 [1] CRAN (R 4.0.0)
#>  fs            1.4.1   2020-04-04 [1] CRAN (R 4.0.0)
#>  glue          1.4.1   2020-05-13 [1] CRAN (R 4.0.0)
#>  highr         0.8     2019-03-20 [1] CRAN (R 4.0.0)
#>  htmltools     0.4.0   2019-10-04 [1] CRAN (R 4.0.0)
#>  knitr         1.28    2020-02-06 [1] CRAN (R 4.0.0)
#>  magrittr      1.5     2014-11-22 [1] CRAN (R 4.0.0)
#>  memoise       1.1.0   2017-04-21 [1] CRAN (R 4.0.0)
#>  pkgbuild      1.0.8   2020-05-07 [1] CRAN (R 4.0.0)
#>  pkgload       1.1.0   2020-05-29 [1] CRAN (R 4.0.0)
#>  prettyunits   1.1.1   2020-01-24 [1] CRAN (R 4.0.0)
#>  processx      3.4.2   2020-02-09 [1] CRAN (R 4.0.0)
#>  ps            1.3.3   2020-05-08 [1] CRAN (R 4.0.0)
#>  R6            2.4.1   2019-11-12 [1] CRAN (R 4.0.0)
#>  Rcpp          1.0.4.6 2020-04-09 [1] CRAN (R 4.0.0)
#>  remotes       2.1.1   2020-02-15 [1] CRAN (R 4.0.0)
#>  rlang         0.4.6   2020-05-02 [1] CRAN (R 4.0.0)
#>  rmarkdown     2.2     2020-05-31 [1] CRAN (R 4.0.0)
#>  rprojroot     1.3-2   2018-01-03 [1] CRAN (R 4.0.0)
#>  sessioninfo   1.1.1   2018-11-05 [1] CRAN (R 4.0.0)
#>  stringi       1.4.6   2020-02-17 [1] CRAN (R 4.0.0)
#>  stringr       1.4.0   2019-02-10 [1] CRAN (R 4.0.0)
#>  testthat      2.3.2   2020-03-02 [1] CRAN (R 4.0.0)
#>  usethis       1.6.1   2020-04-29 [1] CRAN (R 4.0.0)
#>  withr         2.2.0   2020-04-20 [1] CRAN (R 4.0.0)
#>  xfun          0.14    2020-05-20 [1] CRAN (R 4.0.0)
#>  yaml          2.2.1   2020-02-01 [1] CRAN (R 4.0.0)
#> 
#> [1] C:/Users/BLISCJX/R/win-library/4.0
#> [2] C:/Program Files/R/R-4.0.1/library

Is this what you want:

myrapply = function (x, myfun) {
  if ("data.frame" %in% class(x)) return(myfun(x))
  if ("list" %in% class(x)) return (purrr::map(x,~myrapply(.,myfun)))
  stop('myrapply: argument is neither data.frame or list')
}

removeNegative <- function(x) x[x[["measure"]] >= 0, ]

ex2 = myrapply(ex1,removeNegative)
myrapply("a",removeNegative)
1 Like

Here is a similar idea to @HanOostdijk's response.

TheFunc <- function(D) {
  removeNegative <- function(x) x[x[["measure"]] >= 0, ]
  if (is.data.frame(D)) { 
    return(removeNegative(D))
} else {
  lapply(D, TheFunc)
}
}
OUT <- lapply(ex1, TheFunc)
1 Like

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.

@HanOostdijk and @FJCC Thanks to you both for your quick replies! I learned a lot by reading both of your proposed solutions.

@HanOostdijk I selected your response as the solution since you were the first to suggest the idea of combining recursion with checking if the element is a data frame.

Here's the final solution I ended up using:

# Recursively apply function to all data frames in a nested list
dfrapply <- function(object, f, ...) {
  if (inherits(object, "data.frame")) {
    return(f(object, ...))
  }
  if (inherits(object, "list")) {
    return(lapply(object, function(x) dfrapply(x, f, ...)))
  }
  stop("List element must be either a data frame or another list")
}

And here is the result when it is applied to my original reprex:

> result <- dfrapply(ex1, removeNegative)
> result[[1]][[1]]
  group   measure
2     a 0.4579189
6     c 0.6518109
7     b 0.9167038