Is it a faux pas to nest functions?


#1

I haven't seen any info on this anywhere, but is it considered good practice to nest functions, with the assumption that I only use the inner function within the outer function? And if it isn't good practice, is there a better way?

Here's an example. I use get_betweenness() multiple times within get_data()

get_data <- function(my_season, my_team, my_league) {
  
  get_betweenness <- function(mydata) {
    mydata %>%
      filter(league == my_league) %>%
      filter(season == my_season) %>%
      filter(team == my_team) %>%
      select(-c(team, season, league, game_strength)) %>%
      as.matrix() %>%
      graph.edgelist(directed = TRUE) %>%
      betweenness(directed = TRUE, normalized = TRUE) %>%
      as.data.frame() %>%
      rownames_to_column() %>%
      rename(name = "rowname", betweenness = ".") %>%
      mutate(team = my_team, season = my_season, league = my_league)
    
  }
  
  data_all_situations <- goal_data %>%
    get_betweenness()

  data_five_on_five <- goal_data %>%
    filter(game_strength == "5v5") %>%
    get_betweenness()
  
  data_power_play <- goal_data %>%
    filter(game_strength == "5v4" | game_strength == "5v3" | game_strength == "4v3") %>%
    get_betweenness()
  
  data_even_strength <- goal_data %>%
    filter(game_strength == "5v5" | game_strength == "4v4" | game_strength == "3v3") %>%
    get_betweenness()
  
  full_data <- data_all_situations %>%
    left_join(data_five_on_five, by = c("name", "team", "season", "league")) %>%
    left_join(data_power_play, by = c("name", "team", "season", "league")) %>%
    left_join(data_even_strength, by = c("name", "team", "season", "league")) %>%
    select(name, team, season, league, betweenness_all = betweenness.x, betweenness_5v5 = betweenness.y, betweenness_pp = betweenness.x.x, betweenness_ev = betweenness.y.y) %>%
    as_tibble() %>%
    arrange(desc(betweenness_all))
    
  return(full_data)

}  

#2

On the contrary—I think it's a good idea!

In fact, you may eventually decide that you want pipes inside functions to work in a more arbitrary way, allowing you to pass column names and other expressions to the function (and on to the inner pipe), rather than just passing the data frame and using fixed column names. tidyeval tools can help you do that! Check out this section of Advanced R if the sound of that floats your boat :slightly_smiling_face:


#3

I agree with @rensa that this can be a good idea. As a reader of lots of tidyverse and r-lib code, it is something I have seen in multiple places. So I don't think you should worry that it is an anti-pattern.


#4

Thanks for the replies, @rensa & @jennybryan! I really appreciate it.

And @rensa, I'll take a look at that. Thanks for sharing.


#5

Keep in mind that if you are doing it with large datasets and/or lots of times, your data will stick around (more info here). So you might have a situation where you won't have enough RAM for what you are doing.


#6

Could I avoid RAM issues by running rm() within the function for all the data that I don't return() from the function?

Like this:

get_data <- function(my_season, my_team, my_league) {
  
  get_betweenness <- function(mydata) {
    mydata %>%
      filter(league == my_league) %>%
      filter(season == my_season) %>%
      filter(team == my_team) %>%
      select(-c(team, season, league, game_strength)) %>%
      as.matrix() %>%
      graph.edgelist(directed = TRUE) %>%
      betweenness(directed = TRUE, normalized = TRUE) %>%
      as.data.frame() %>%
      rownames_to_column() %>%
      rename(name = "rowname", betweenness = ".") %>%
      mutate(team = my_team, season = my_season, league = my_league)
    
  }
  
  data_all_situations <- goal_data %>%
    get_betweenness()

  data_five_on_five <- goal_data %>%
    filter(game_strength == "5v5") %>%
    get_betweenness()
  
  data_power_play <- goal_data %>%
    filter(game_strength == "5v4" | game_strength == "5v3" | game_strength == "4v3") %>%
    get_betweenness()
  
  data_even_strength <- goal_data %>%
    filter(game_strength == "5v5" | game_strength == "4v4" | game_strength == "3v3") %>%
    get_betweenness()
  
  full_data <- data_all_situations %>%
    left_join(data_five_on_five, by = c("name", "team", "season", "league")) %>%
    left_join(data_power_play, by = c("name", "team", "season", "league")) %>%
    left_join(data_even_strength, by = c("name", "team", "season", "league")) %>%
    select(name, team, season, league, betweenness_all = betweenness.x, betweenness_5v5 = betweenness.y, betweenness_pp = betweenness.x.x, betweenness_ev = betweenness.y.y) %>%
    as_tibble() %>%
    arrange(desc(betweenness_all))
    
  return(full_data)

  rm(data_all_situations, data_five_on_five, data_power_play, data_even_strength)

}  

#7

No that data is automatically destroyed - thats one of the good things about functions :slight_smile:


#8

I don't know if that helps.

My rule of thumb is that whenever I have a problem like this I ask myself if I need to solve it at all. Garbage collection is not simple, so if you are seeing that your current solution doesn't work then you can try changing solution. For example, I don't see why you need to have get_betweenness inside of the get_data at all. May be I'm missing something.

But keep in mind that R internally is not going to duplicate data all over the place. So, for example, if you run your function 50 times against the same data you are not going to get 50 copies of your dataset in memory. As one concrete example, I remember reading about cross-validation using recipes package and creating 50 datasets for CV only tripled amount of memory used.

So, bottom line, try to do it with your data and see if it works. If it doesn't, then you know at least where to look to fix it. If it works - great, you don't need to do anything :slight_smile:.


#9

My recommendation would be to limit nested functions to smaller functions. This is because nested functions are harder to debug. For example, you could run debug(get_data) or debugonce(get_data), but it's not possible to run debug(get_betweenness). Since this is your own custom code, you could instead use browser() inside of get_betweenness to enter the symobolic debugger when it is called. However, if you plan to share this script or put it inside a package, then it would be more important to avoid nested functions.

There is one package I use a lot that has large nested functions, and it is painful for me to quickly debug it. From ?debug:

To debug a function which is defined inside another function, single-step though to the end of its definition, and then call debug on its name.

Instead I have to resort to cloning the Git repo, manually adding a call to browser() into the source code, and re-installing the package locally.