Scraping table weirdness with rvest (undesired {xml_nodeset (0)})

rvest

#1

I'm trying to scrape a table (I think it's in HTML?), but I can't seem to find the right CSS selector to pull out the table of goals scored. I just get {xml_nodeset (0)}.

Any ideas? (Also, please let me know if this is the type of question that I shouldn't be asking here.)

Here's the code:

library(tidyverse)
library(rvest)

url <- "http://www.uscho.com/scoreboard/division-i-men/20172018/composite-schedule/"

link_list <- url %>%
  read_html() %>%
  html_nodes("td:nth-child(13) a") %>%
  html_attr("href") %>%
  {paste0("http://www.uscho.com", .)}
  
link_list[200] %>%
  read_html() %>%
  html_nodes("#boxgoals td")
#> {xml_nodeset (0)}

Created on 2018-04-02 by the reprex package (v0.2.0).


#2

What are you trying to select with html_nodes("td:nth-child(13) a")?


#3

I'm trying to go down the list to each game's Game Summary and scrape the goal data in #boxgoals.

Everything seems to be working for me up until the last chunk of code. Maybe the table is generated with JS; I'm not really sure.


#4

Oh, ok. I couldn't see that far into the table and didn't know there was more (I was thinking maybe you'd inadvertently been trying to select a column that wasn't there).


#5

The data is being loaded with JavaScript. If you try to select tables in the scraped HTML, there aren't any:

library(rvest)
#> Loading required package: xml2

h <- read_html('http://www.uscho.com/recaplink.php?gid=1_970_20172018')

h %>% html_nodes('table')
#> {xml_nodeset (0)}

If you load it in a browser, depending on how fast your connection is, you'll also see a brief "Loading" message for each table, which also tells you the data isn't baked into the HTML originally. On the R side, you can scan through h %>% html_structure(), and you'll see that it looks different than the live page rendered in a browser, and doesn't contain the information you need.
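
To see the difference without hitting the site, here is a minimal sketch using two made-up HTML strings. The markup is an assumption for illustration, not copied from uscho.com: one string stands in for what the server sends (just a placeholder), the other for what the element might look like after the browser has run the JavaScript.

```r
library(rvest)

# Hypothetical markup: what read_html() sees before any JavaScript runs
static_page <- read_html(
  '<html><body><div id="boxgoals">Loading...</div></body></html>'
)

# Hypothetical markup: the same element after a browser has run the scripts
rendered_page <- read_html(
  '<html><body><table id="boxgoals">
     <tr><th>Per</th><th>Scorer</th></tr>
     <tr><td>1</td><td>Connor Moore</td></tr>
   </table></body></html>'
)

static_cells   <- html_nodes(static_page, "#boxgoals td")
rendered_cells <- html_nodes(rendered_page, "#boxgoals td")

length(static_cells)    # no cells: the selector is fine, the data just isn't there yet
length(rendered_cells)  # cells are found once the table exists in the HTML
```

So an empty nodeset here doesn't necessarily mean the CSS selector is wrong; it can just mean the element hasn't been filled in at the point rvest reads the page.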

The most direct way to get the data is to run the JavaScript just like your browser would, e.g. by scraping with RSelenium or splashr, and then grab the HTML. (After you scrape the source, you can still parse the HTML with rvest.)

There are sometimes clever ways around such an approach (RSelenium and splashr are decidedly heavier than rvest), but they require looking deeper into how the data is loaded.
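
As a rough sketch of what "looking deeper" can mean, one starting point is listing the scripts a page loads and reading any inline JavaScript for hints about where the data comes from. The page source below is invented for the example (the script path, the loadTable call, and the data URL are all hypothetical), so treat it as a pattern rather than what uscho.com actually does.

```r
library(rvest)

# Hypothetical page source; the script path and the data URL are invented
page <- read_html(
  '<html><head><script src="/js/recap.js"></script></head>
   <body><div id="boxgoals">Loading...</div>
   <script>loadTable("#boxgoals", "/data/boxgoals?gid=1");</script>
   </body></html>'
)

scripts <- html_nodes(page, "script")

# External script files the page pulls in (NA means the tag has no src attribute)
external <- html_attr(scripts, "src")
external <- external[!is.na(external)]

# Inline script bodies, which sometimes name the URL the data is fetched from
inline <- html_text(scripts)
inline <- inline[nzchar(trimws(inline))]

external
inline
```

If the inline code or one of the external files points at a plain data URL, that URL can sometimes be read directly and the headless browser skipped entirely. The Network tab in your browser's developer tools shows the same requests.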


#6

Thanks for the response! Makes sense. I'm trying to use RSelenium, and the whole download and set-up process is making me rip my hair out...


#7

Yeah, it's a bit of a bear. The examples in the docs are helpful, though; you can often adapt them to what you need. The package is object-oriented in a way that most in R aren't; a lot of the functions you need will be methods of the remote driver. What works for me (but may or may not for you, annoyingly):

library(RSelenium)
library(rvest)

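# Start a Selenium server and open a browser session
rd <- rsDriver()

# Load the page in the browser so its JavaScript runs
rd$client$navigate('http://www.uscho.com/recaplink.php?gid=1_970_20172018')
# getPageSource() returns a list; its first element is the rendered HTML as a string
h <- rd$client$getPageSource()
h <- h[[1]] %>% read_html()

# Close the browser and shut down the server
rd$client$close()
rd$server$stop()
rm(rd)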

boxgoals <- h %>% 
    html_node('#boxgoals') %>% 
    html_table()

boxgoals
#>   Per             Team       Scorer      Assist 1      Assist 2   Goal Type  Time
#> 1   1 Boston College-1 Connor Moore    Mike Booth Casey Carreau             15:30
#> 2   2     Providence-1   Erik Foley Spenser Young                       4x4 08:22
#> 3   2     Providence-2 Ben Mirageas  Scott Conway Spenser Young GWG PPG 5x4 19:14

This works, but is sort of a pain. splashr is a newer alternative that is built to contain a lot of the messiness in Docker. Conveniently, its render_html function returns an xml2 object like the ones rvest uses, so it integrates directly. Note that you'll need to install and start Docker before the following will work.

library(splashr)
library(rvest)

# install_splash()    # run this once to install the docker image
sp <- start_splash()

pg <- render_html(url = 'http://www.uscho.com/recaplink.php?gid=1_970_20172018')

stop_splash(sp)

boxgoals <- pg %>% 
    html_node('#boxgoals') %>% 
    html_table()

boxgoals
#>   Per             Team       Scorer      Assist 1      Assist 2   Goal Type  Time
#> 1   1 Boston College-1 Connor Moore    Mike Booth Casey Carreau             15:30
#> 2   2     Providence-1   Erik Foley Spenser Young                       4x4 08:22
#> 3   2     Providence-2 Ben Mirageas  Scott Conway Spenser Young GWG PPG 5x4 19:14

There's much more to using Docker fully, of course. Here's a nice tutorial to get you started. In this case, you don't really need to know much, but it is important to realize that install_splash will download a 1.2 GB Docker image to your machine. The above tutorial explains how to delete it afterwards if you want your disk space back.
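
Since the goal was to go through every game in link_list, here is a minimal sketch of wrapping the steps above in a function. scrape_boxgoals and its render argument are names I made up for this example; render defaults to splashr's render_html, but taking it as an argument lets you swap in something else (RSelenium, or a fake for testing) without touching the rest.

```r
library(rvest)

# Scrape #boxgoals from several game pages.
# `render` is any function that takes a URL and returns parsed HTML;
# the default calls splashr::render_html(), which needs a running Splash container.
scrape_boxgoals <- function(urls,
                            render = function(u) splashr::render_html(url = u)) {
  lapply(urls, function(u) {
    pg <- render(u)
    node <- html_node(pg, "#boxgoals")
    # html_node() returns a missing node when nothing matches; skip those pages
    if (inherits(node, "xml_missing")) return(NULL)
    html_table(node)
  })
}

# Usage with link_list from the original post (requires Docker and network access):
# sp <- splashr::start_splash()
# goals <- scrape_boxgoals(link_list[1:5])
# splashr::stop_splash(sp)
```

The result is a list with one data frame per game, or NULL where no #boxgoals element was found. If tables come back empty, the page may not have finished loading when it was captured; render_html has a wait argument that might be worth trying in that case.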


#8

Woaaaaaaaaaaaaah. Thanks so much! This is fantastic. Now I just gotta get Docker figured out

At this point, I wouldn't be shocked if I already deleted System32...


#9

For future reference, I just spent hours going insane, trying to figure out why I couldn't use Docker. I'm on Windows 10 and I needed to enable virtualization. So if you need to do that, google "enable virtualization windows 10" and it should help you.


#10

@eoppe1022 If your question has been answered, please mark the response as the solution! Thanks!


#11

You might have a look at PhantomJS. It's a headless browser that should allow you to render and then save pages, then scrape the saved page, with tables now in HTML.

You can see an example at https://www.r-bloggers.com/web-scraping-javascript-rendered-sites/.


#12

I've had success with splashr (and figured out the whole Docker thing). I found this helpful: https://rud.is/b/2017/02/09/diving-into-dynamic-website-content-with-splashr/


#13

Thanks for the info! I've looked into PhantomJS, but it looks like a huge pain in the ass