Thoughts on internally mapping over a function within itself?

purrr

#1

A few weeks ago, I saw an example of someone using purrr::map() within their function to map over that same function. This way, the function can automatically handle vectors of whatever size without the user having to work some map() solution out by themselves.

Do you think this is a good practice to keep? I think it makes sense, and I do it in this function below. See the bottom of the function where I use map_dfr().

library(rvest)
library(tidyverse)

get_player_urls <- function(...) {
  
  .get_player_urls <- function(player_letter, ...) {
  
    if (tolower(player_letter) == "x") {
      
      warning("Uh, no X-Men in the NHL. Sorry :/")
      
      return(tibble())
      
      }
    
    else {
      
      page <- str_c("https://www.hockey-reference.com/players/", player_letter, "/") %>% read_html()
      
      player_url <- page %>%
        html_nodes("#div_players a") %>%
        html_attr("href") %>%
        str_c("https://www.hockey-reference.com", .) %>%
        as_tibble() %>%
        set_names("url")
      
      player_name <- page %>%
        html_nodes("#div_players a") %>%
        html_text() %>%
        as_tibble() %>%
        set_names("name")
      
      player_data <- bind_cols(player_name, player_url)
      
      return(player_data)
      
    }
    
  }
  
  all_player_data <- map_dfr(..., .get_player_urls)
  
  return(all_player_data)
  
}


#2

That sounds like it risks infinite loops.

But this is just fine, loop-wise. It's a good practice to explicitly define and name a non-trivial function for mapping, so cheers.

I do suggest replacing the "..." argument with a single variable, since anything beyond the first argument has no use in the code.


#3

Thanks for the response, @nwerth (and sorry for not responding sooner).

Just curious,

What do you mean by this?


#4

Names are great and helpful. When people have to assign a variable, they'll usually pick a good name. But when they don't have to assign one, like when parameters accept functions, there's a tendency to avoid naming. They'll use an "anonymous" function because "it's obvious and simple what's being done."

I've seen and, regrettably, written a lot of code that looks like this:

clean_data <- function(raw_data) {
  numeric_cols <- vapply(
    raw_data,
    function(x) {
      all(grepl('^-?\\d+(,\\d{3})*\\.\\d+$', x[!is.na(x)], perl = TRUE))
    },
    logical(1L)
  )
  integer_cols <- vapply(
    raw_data,
    function(x) {
      all(grepl('^-?\\d+(,\\d{3})*$', x[!is.na(x)], perl = TRUE))
    },
    logical(1L)
  )
  fips_cols <- grepl('FIPS', colnames(raw_data), fixed = TRUE)
  integer_cols <- integer_cols & !fips_cols
  raw_data[numeric_cols] <- lapply(
    raw_data[numeric_cols],
    function(x) {
      as.numeric(gsub(',', '', x, fixed = TRUE))
    }
  )
  raw_data[integer_cols] <- lapply(
    raw_data[integer_cols],
    function(x) {
      as.integer(gsub(',', '', x, fixed = TRUE))
    }
  )
  return(raw_data)
}

This function just takes data frames which store everything as characters and does common-sense-conversion for the numbers. But it looks more complicated than that. And anyone reading the code needs to keep 5 environments in mind: one for the main function and one for each of the anonymous functions. These anonymous functions are only a single line; imagine how much worse it'd be if they were each 5 lines with their own assignments or deeper mapping functions!

But here's the same function without anonymous inner functions:

is_numeric_column <- function(x) {
  all(grepl('^-?\\d+(,\\d{3})*\\.\\d+$', x[!is.na(x)], perl = TRUE))
}

is_integer_column <- function(x) {
  all(grepl('^-?\\d+(,\\d{3})*$', x[!is.na(x)], perl = TRUE))
}

fix_numeric <- function(x) {
  as.numeric(gsub(',', '', x, fixed = TRUE))
}

fix_integer <- function(x) {
  as.integer(gsub(',', '', x, fixed = TRUE))
}

clean_data <- function(raw_data) {
  numeric_cols <- vapply(raw_data, is_numeric_column, logical(1L))
  integer_cols <- vapply(raw_data, is_integer_column, logical(1L))
  fips_cols <- grepl('FIPS', colnames(raw_data), fixed = TRUE)
  integer_cols <- integer_cols & !fips_cols
  raw_data[numeric_cols] <- lapply(raw_data[numeric_cols], fix_numeric)
  raw_data[integer_cols] <- lapply(raw_data[integer_cols], fix_integer)
  return(raw_data)
}

Now it's understandable even without comments.