Anyone have a good way to retry a function?


#1

I'm looking for a way to retry some web-scraping functions in a package I'm writing. Right now, the simplest method I've found is warrenr::persistently(), which works fine, but I'm trying to reduce my package's dependencies.

Any ideas?

If you want to see a reprex for whatever reason, here's a function that sometimes poses issues:

library(tidyverse)
library(rvest)     # read_html(), html_nodes(), etc.; not attached by library(tidyverse)
library(progress)

get_teams <- function(.league, .season, .progress = FALSE, ...) {
  
  leagues <- .league %>% 
    as_tibble() %>% 
    set_names(".league") %>% 
    mutate(.league = str_replace_all(.league, " ", "-"))
  
  seasons <- .season %>%
    as_tibble() %>%
    set_names(".season")
  
  mydata <- tidyr::crossing(leagues, seasons)
  
  if (.progress) {pb <- progress::progress_bar$new(format = ":what [:bar] :percent eta: :eta", clear = FALSE, total = nrow(mydata), width = 60)}
  
  league_team_data <- map2_dfr(mydata[[".league"]], mydata[[".season"]], function(.league, .season, ...) {
    
    if (.progress) {pb$tick(tokens = list(what = "get_teams()"))}
    
    seq(5, 10, by = 0.001) %>%
      sample(1) %>%
      Sys.sleep()
    
    page <- str_c("https://www.eliteprospects.com/league/", .league, "/", .season) %>% read_html()
    
    team_url <- page %>% 
      html_nodes("#standings .team a") %>% 
      html_attr("href") %>%
      str_c(., "?tab=stats") %>%
      as_tibble() %>%
      set_names("team_url")
    
    team <- page %>%
      html_nodes("#standings .team a") %>%
      html_text() %>%
      str_trim(side = "both") %>%
      as_tibble() %>%
      set_names("team")
    
    league <- page %>%
      html_nodes("small") %>%
      html_text() %>%
      str_trim(side = "both")
    
    season <- str_split(.season, "-", simplify = TRUE, n = 2)[,2] %>%
      str_sub(3, 4) %>%
      str_c(str_split(.season, "-", simplify = TRUE, n = 2)[,1], ., sep = "-")
    
    all_data <- team %>%
      bind_cols(team_url) %>% 
      mutate(league = league) %>%
      mutate(season = season)
    
    return(all_data)
    
  })
  
  return(league_team_data)
  
}

#2

You can try purrr::possibly(). I wrote a blog post that details this: http://www.brodrigues.co/blog/2018-03-12-keep_trying/

I'm not sure it's a better solution than using warrenr::persistently() though (it does reduce the number of dependencies since you're already using the tidyverse).
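
For what it's worth, here is a minimal sketch of the idea, not the code from the blog post. The helper name `retry()` and its arguments `max_attempts` and `wait` are made up for illustration; the only real purrr call is `possibly()`. It treats a `NULL` result as a failure, so it assumes the wrapped function never legitimately returns `NULL`.

```r
# Hypothetical helper: call `.f` until it succeeds or `max_attempts` is reached.
retry <- function(.f, ..., max_attempts = 3, wait = 1) {
  # possibly() returns `otherwise` (here NULL) instead of throwing an error
  safe_f <- purrr::possibly(.f, otherwise = NULL)

  for (attempt in seq_len(max_attempts)) {
    result <- safe_f(...)
    if (!is.null(result)) {
      return(result)
    }
    # Pause before the next attempt, but not after the last one
    if (attempt < max_attempts) {
      Sys.sleep(wait)
    }
  }

  stop("All ", max_attempts, " attempts failed.", call. = FALSE)
}
```

In the function above you could then write something like `page <- retry(read_html, url, max_attempts = 5, wait = 30)`, where `url` is the string you currently pipe into `read_html()`.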

However, keep in mind that you should not overload their servers with calls. Also take a look at {polite} to scrape politely: https://github.com/dmi3kno/polite


#3

That site has a crawl-delay of 30s, so set parameters accordingly if scraping multiple pages.

httr::RETRY may also be useful for intermittently functional pages. httr is a dependency of rvest, so it won't add to your dependency tree.
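
A rough sketch of how that could look, assuming you fetch the page with httr and then hand the body to `read_html()`. The wrapper name `read_html_retry()` and its `min_pause` argument are hypothetical; `RETRY()`, `stop_for_status()` and `content()` are the actual httr functions, and `read_html()` comes from xml2 (which rvest re-exports).

```r
# Hypothetical helper (not part of httr or rvest): GET a URL with retries,
# then parse the response body as HTML.
read_html_retry <- function(url, times = 3, min_pause = 30) {
  resp <- httr::RETRY(
    "GET", url,
    times = times,         # maximum number of attempts
    pause_min = min_pause  # wait at least this many seconds between attempts
  )

  # RETRY() returns the last response even if it is an HTTP error,
  # so turn a failing status code into an R error here
  httr::stop_for_status(resp)

  xml2::read_html(httr::content(resp, as = "text", encoding = "UTF-8"))
}
```

The default `min_pause = 30` is only there to match the 30s crawl-delay mentioned above. Note that it spaces out retries of the same page; you would still need your own `Sys.sleep()` between different pages.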


#4

Thanks for the reply. I actually hadn't been familiar with robots.txt before this. Does that crawl-delay mean that there will be a forced delay of 30 seconds? Or does it mean that my request won't be fulfilled unless I set a manual delay of 30 seconds?


#5

Awesome. I'm gonna look into all of that. Nice blog post, by the way. I really enjoyed reading it.


#6

Neither, necessarily. robots.txt is purely advisory—a standardized way for sites to set suggested limits on scraping. That said, site admins can absolutely block your IP if you cause undue stress on their website. Scraping a few dozen pages is unlikely to catch anyone's notice, but scraping thousands of pages in parallel is much more likely to cause a problem. Obeying robots.txt lets scrapers get what they need without causing problems.
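
Since nothing enforces the delay, your own code has to honor it. Here is a small base-R sketch, under some simplifying assumptions: the helper name `get_crawl_delay()` and its `default` argument are made up, it ignores which User-agent group a directive belongs to, and it simply takes the largest `Crawl-delay` it finds as the conservative choice.

```r
# Hypothetical helper: pull the Crawl-delay (in seconds) out of robots.txt lines.
# Simplification: User-agent groups are ignored and the largest delay wins.
get_crawl_delay <- function(robots_lines, default = 5) {
  hits <- grep("^\\s*crawl-delay\\s*:", robots_lines,
               ignore.case = TRUE, value = TRUE)
  if (length(hits) == 0) {
    return(default)
  }

  values <- sub("^[^:]*:", "", hits)   # drop the "Crawl-delay:" part
  values <- sub("#.*$", "", values)    # drop trailing comments
  delays <- suppressWarnings(as.numeric(trimws(values)))
  delays <- delays[!is.na(delays)]

  if (length(delays) == 0) default else max(delays)
}

# Example with inline text; for a real site you would read the file first,
# e.g. robots <- readLines("https://www.eliteprospects.com/robots.txt")
robots <- c("User-agent: *", "Crawl-delay: 30", "Disallow: /admin")
delay <- get_crawl_delay(robots)

# Then, between page requests inside the scraping loop:
# Sys.sleep(delay)
```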


#7

Thanks for the info! Definitely good to know while making a web-scraping package :)