Difficulties with updating/adding/removing objects from within a function

splashr
web-scraping

#1

I'm trying to scrape a webpage using splashr, and I think that I need to reset my splashr container every time I run the function to scrape each link.

To do this, I call start_splash() and stop_splash() inside the function, so each call should get a fresh container. However, this always fails, and I think it's because the container isn't getting reset: when I run splash_active() after the function finishes, it returns TRUE, meaning that I still have an active container.

Here's the function in question:

get_box_score <- function(my_url) {
  
  progress_bar$tick()$print()

  sp <- start_splash()
  
  Sys.sleep(sample(seq(0, 0.1, by = 0.001), 1))
  
  render_html(url = my_url) %>%
    html_node("#boxgoals") %>%
    html_table() %>%
    as_tibble()

  stop_splash(sp)
}

Anyone know how to go about this? I attached some reproducible code below. Thanks!

library(tidyverse)
library(rvest)
library(splashr)

url <- "https://www.uscho.com/scoreboard/michigan/mens-hockey/"  

get_data <- function(myurl) {
  
  link_data <- myurl %>%
    read_html() %>%
    html_nodes("td:nth-child(13) a") %>%
    html_attr("href") %>%
    str_c("https://www.uscho.com", .) %>%
    as_tibble() %>%
    set_names("url")
  
  game_type <- myurl %>%
    read_html() %>%
    html_nodes("td:nth-child(12)") %>%
    html_text() %>%
    as_tibble() %>%
    filter(between(row_number(), 2, n())) %>%
    set_names("game_type")

  as_tibble(data.frame(link_data, game_type))
  
}

link_list <- get_data(url)


urls <- link_list %>%
  filter(game_type != "EX") %>%
  pull(url)

get_box_score <- function(my_url) {
  
  progress_bar$tick()$print()

  sp <- start_splash()
  
  Sys.sleep(sample(seq(0, 0.1, by = 0.001), 1))
  
  render_html(url = my_url) %>%
    html_node("#boxgoals") %>%
    html_table() %>%
    as_tibble()

  stop_splash(sp)
}

persistently_get_box_score <- warrenr::persistently(get_box_score, max_attempts = 15, wait_seconds = 0.001)

try_get_box_score <- function(url) {
  tryCatch(persistently_get_box_score(url), error = function(e) {data.frame()})
}

progress_bar <- link_list %>%
  filter(game_type != "EX") %>%
  tally() %>%
  progress_estimated(min_time = 0)


mydata <- pmap_df(list(urls), try_get_box_score)

#2

Do you think this is an issue with splashr? If not, could you point to the place in your code where you think tidyverse stuff is going awry? (Knowing zero about splashr, I'm having trouble identifying the location of the problem).

Also, it looks like you filed an issue in the splashr repo:


but you might have better luck getting help from Bob (the maintainer), if you include a more built-out example (such as the one here).

Lastly (again, knowing little of splashr) I'm not sure how similar your two questions are:

However, either way, you might consider posting on StackOverflow if you don't get an answer here. This is still a relatively small community, so it might be a good idea to widen your audience, especially since you're dealing with a specific framework. If you do post on SO, please link back to your question here with the answer etc. so there's no inadvertent duplication of effort! 🙂


#3

Thanks for the response, Mara. Honestly ... I don't know. At first, I thought this was an issue with splashr, but then I thought I fixed it, and then I got HTTP 504 errors. Then I thought I fixed those, and after that I seemed to be getting general R errors. So I really don't know what's going on.

Yeah, I've been trying to contact Bob. We'll see how that goes. Thanks!


#4

If you wrap the cleanup in on.exit(), that section will run whenever the function ends, even if there's an error.

get_box_score <- function(my_url) {
  
  progress_bar$tick()$print()

  sp <- start_splash()
  on.exit(stop_splash(sp))
  
  Sys.sleep(sample(seq(0, 0.1, by = 0.001), 1))
  
  render_html(url = my_url) %>%
    html_node("#boxgoals") %>%
    html_table() %>%
    as_tibble()
}

If that still doesn't work, try replacing stop_splash(sp) with killall_splash() (of course, this kills all containers, so be sure it's what you want).
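
To see what on.exit() buys you without touching Docker, here is a minimal base-R sketch. fake_start(), fake_stop(), and with_cleanup() are hypothetical stand-ins for start_splash(), stop_splash(), and get_box_score(); the sketch only illustrates R's on.exit() semantics and makes no claim about how splashr itself behaves.

```r
# Hypothetical stand-ins for start_splash()/stop_splash(); no Docker involved.
state <- new.env()
assign("active", FALSE, envir = state)

fake_start <- function() {
  assign("active", TRUE, envir = state)
  invisible("container")
}

fake_stop <- function(x) {
  assign("active", FALSE, envir = state)
  invisible(NULL)
}

with_cleanup <- function(fail = FALSE) {
  sp <- fake_start()
  # Registered here, run when the function exits, normally or via an error.
  on.exit(fake_stop(sp), add = TRUE)

  if (fail) stop("simulated scraping error")

  # Last expression, so this is what the function returns.
  data.frame(goals = 3)
}

ok <- with_cleanup()
ok_active <- get("active", envir = state)    # cleanup ran after a normal return

err <- tryCatch(with_cleanup(fail = TRUE), error = function(e) "failed")
err_active <- get("active", envir = state)   # cleanup also ran after the error
```

Both ok_active and err_active end up FALSE, and ok holds the data frame. That second point matters for the original get_box_score(): there, stop_splash(sp) was the last expression, so the function returned its value rather than the scraped tibble. Moving the cleanup into on.exit() makes the pipeline the last expression again.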

Forgive me for not running your example, but I'm reluctant to run anyone else's internet-connecting code.


#5

Oh, these are great ideas. Thanks! I'll let you know how they work. Right now, I think my biggest issue is HTTP 504 Gateway Timeout errors, so I'm trying to work through those first. I think the splashr issues are stemming from that.


#6

So I got this to work. I'm not sure what exactly did it, but here are a few things that I'm doing now that may or may not be contributing to its success. I figured I'd write this here in case anybody has similar problems in the future.

  1. Huge sleep times.
    I'm talking ~30 seconds of Sys.sleep() per scrape. This makes the whole process take forever, but I've gotten way fewer errors in general while doing this.

  2. Using on.exit(stop_splash(sp))
    Following @nwerth's suggestion, I added this, and it seems to work, so I've been sticking with it.

  3. Using the long-form splashr structure rather than just render_html()
    So, instead of what I did above, I now use this code below. Honestly, I don't know if this is doing anything, but it works, so I'm sticking with it.

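get_box_score <- function(my_url) {
  
  progress_bar$tick()$print()
  
  # Start a fresh container and make sure it is stopped when the function
  # exits, even if the scrape below throws an error (point 2).
  splash_container <- start_splash()
  on.exit(stop_splash(splash_container))
  
  # Long random pause before each scrape (point 1).
  Sys.sleep(runif(1, 20, 35))
  
  # Build the request step by step instead of calling render_html() (point 3).
  mydata <- splash_local %>%
    splash_response_body(TRUE) %>%
    splash_enable_javascript(TRUE) %>%
    splash_plugins(TRUE) %>%
    splash_user_agent(ua_win10_chrome) %>%
    splash_go(my_url) %>%
    splash_wait(runif(1, 10, 15)) %>%
    splash_html() %>%
    html_node("#boxgoals") %>%
    html_table(fill = TRUE) %>%
    as_tibble() %>%
    mutate_all(as.character)  # convert every column to character

  return(mydata)
}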
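
For anyone who wants the retry-and-wait behaviour without an extra package, here is a minimal base-R sketch that combines the long pauses from point 1 with the retry idea from the first post. retry_with_pause() and its arguments are hypothetical names, not part of splashr or warrenr, and the tiny wait times are only there so the example runs quickly; the scrape above used 20 to 35 seconds.

```r
# Hypothetical helper: call f(...) up to max_attempts times, sleeping a
# random interval between failed attempts. Not part of splashr or warrenr.
retry_with_pause <- function(f, max_attempts = 3, min_wait = 0, max_wait = 0.01) {
  function(...) {
    last_error <- NULL
    for (attempt in seq_len(max_attempts)) {
      result <- tryCatch(f(...), error = function(e) e)
      if (!inherits(result, "error")) return(result)
      last_error <- result
      # Pause before retrying; skip the pause after the final attempt.
      if (attempt < max_attempts) Sys.sleep(runif(1, min_wait, max_wait))
    }
    # Every attempt failed: re-signal the last error.
    stop(last_error)
  }
}

# Toy function that fails twice, then succeeds, to exercise the helper.
calls <- 0
flaky <- function(x) {
  calls <<- calls + 1
  if (calls < 3) stop("simulated 504")
  x * 2
}

safe_flaky <- retry_with_pause(flaky, max_attempts = 5)
value <- safe_flaky(21)
```

In a real scrape, one could wrap get_box_score() the same way with much larger min_wait and max_wait values, and still keep an outer tryCatch() that returns an empty data frame when every attempt fails.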