Scraping table weirdness with rvest (undesired {xml_nodeset (0)})

eoppe1022 · April 2, 2018, 8:31pm

I'm trying to scrape a table (I think in HTML?), and I can't seem to find the right code with CSS Selector to scrape the table for goals scored -- I just get a {xml_nodeset (0)}

Any ideas? (also, please let me know if this is the type of question that I shouldn't be asking here)

Here's the code:

library(tidyverse)
library(rvest)

url <- "http://www.uscho.com/scoreboard/division-i-men/20172018/composite-schedule/"

link_list <- url %>%
  read_html() %>%
  html_nodes("td:nth-child(13) a") %>%
  html_attr("href") %>%
  {paste0("http://www.uscho.com", .)}
  
link_list[200] %>%
  read_html() %>%
  html_nodes("#boxgoals td")
#> {xml_nodeset (0)}

Created on 2018-04-02 by the reprex package (v0.2.0).

mara · April 3, 2018, 4:36pm

What are you trying to select with html_nodes("td:nth-child(13) a")?

eoppe1022 · April 3, 2018, 4:43pm

I'm trying to go down the list to each game's Game Summary and scrape the goal data in #boxgoals

Everything seems to be working for me up until the last chunk of code. Maybe it's in JS, I'm not really sure.

mara · April 3, 2018, 4:57pm

Oh, ok. I couldn't see that far into the table and didn't know there was more (I was thinking maybe you'd inadvertently been trying to select a column that wasn't there).

alistaire · April 3, 2018, 8:07pm

The data is being loaded with JavaScript. If you try to select tables in the scraped HTML, there aren't any:

library(rvest)
#> Loading required package: xml2

h <- read_html('http://www.uscho.com/recaplink.php?gid=1_970_20172018')

h %>% html_nodes('table')
#> {xml_nodeset (0)}

If you load it in a browser, depending on how fast your connection is, you'll also see a brief "Loading" message for each table, which also tells you the data isn't baked into the HTML originally. On the R side, you can scan through h %>% html_structure(), and you'll see that it looks different than the live page rendered in a browser, and doesn't contain the information you need.
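As a rough illustration of that difference, here is a small sketch that runs offline. Both HTML strings are made up for the example (they are not copied from uscho.com), and count_tables is a hypothetical helper: the first string stands in for what read_html() receives before any JavaScript runs, the second for the DOM once the browser has filled in the table.

```r
library(rvest)

# Hypothetical stand-in for the raw HTML: only a "Loading" placeholder
static_html <- '<html><body><div id="boxgoals">Loading...</div></body></html>'

# Hypothetical stand-in for the page after JavaScript has inserted the table
rendered_html <- '<html><body><table id="boxgoals">
  <tr><th>Per</th><th>Scorer</th></tr>
  <tr><td>1</td><td>Connor Moore</td></tr>
</table></body></html>'

# Hypothetical helper: number of <table> nodes in a piece of HTML
count_tables <- function(html) {
  length(html_nodes(read_html(html), "table"))
}

count_tables(static_html)
#> [1] 0
count_tables(rendered_html)
#> [1] 1
```

If the same check on your real page gives 0 while the browser clearly shows a table, the table is almost certainly being added after the initial page load.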

The most direct way to get the data is to run the JavaScript just like your browser would, e.g. by scraping with RSelenium or splashr, and then grab the HTML. (After you scrape the source, you can still parse the HTML with rvest.)

There are sometimes clever ways around such an approach (RSelenium and splashr are decidedly heavier than rvest), but they require looking deeper into how the data is loaded.

eoppe1022 · April 3, 2018, 11:11pm

thanks for the response! makes sense -- I'm trying to use RSelenium, and the whole download and set-up process is making me rip my hair out...

alistaire · April 4, 2018, 12:36am

Yeah, it's a bit of a bear. The examples in the docs are helpful, though; you can often adapt them to what you need. The package is object-oriented in a way that most in R aren't; a lot of the functions you need will be methods of the remote driver. What works for me (but may or may not for you, annoyingly):

library(RSelenium)
library(rvest)

rd <- rsDriver()

rd$client$navigate('http://www.uscho.com/recaplink.php?gid=1_970_20172018')
h <- rd$client$getPageSource()
h <- h[[1]] %>% read_html()

rd$client$close()
rd$server$stop()
rm(rd)

boxgoals <- h %>% 
    html_node('#boxgoals') %>% 
    html_table()

boxgoals
#>   Per             Team       Scorer      Assist 1      Assist 2   Goal Type  Time
#> 1   1 Boston College-1 Connor Moore    Mike Booth Casey Carreau             15:30
#> 2   2     Providence-1   Erik Foley Spenser Young                       4x4 08:22
#> 3   2     Providence-2 Ben Mirageas  Scott Conway Spenser Young GWG PPG 5x4 19:14

This works, but is sort of a pain. splashr is a newer alternative that is built to contain a lot of the messiness in docker. Also nicely, its render_html function returns an xml2 object like rvest uses, so it can integrate directly. Note you'll need to install and start docker before the following will work.

library(splashr)
library(rvest)

# install_splash()    # run this once to install the docker image
sp <- start_splash()

pg <- render_html(url = 'http://www.uscho.com/recaplink.php?gid=1_970_20172018')

stop_splash(sp)

boxgoals <- pg %>% 
    html_node('#boxgoals') %>% 
    html_table()

boxgoals
#>   Per             Team       Scorer      Assist 1      Assist 2   Goal Type  Time
#> 1   1 Boston College-1 Connor Moore    Mike Booth Casey Carreau             15:30
#> 2   2     Providence-1   Erik Foley Spenser Young                       4x4 08:22
#> 3   2     Providence-2 Ben Mirageas  Scott Conway Spenser Young GWG PPG 5x4 19:14

There's much more to using docker fully, of course. Here's a nice tutorial to get you started. In this case, you don't really need to know much, but it is important to realize that install_splash will download a 1.2 GB docker image to your machine. The above tutorial explains how to delete it afterwards if you want your disk space back.
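To go from one game to everything in link_list, one option is to keep the parsing separate from whatever renders the page. The sketch below is only that, a sketch: scrape_boxgoals, fetch_rendered, and pause are names invented for this example, and the one-second pause is an arbitrary choice. fetch_rendered can be any function that takes a URL and returns either the rendered HTML as a string or an xml2 document.

```r
library(rvest)

# Hypothetical helper: scrape the #boxgoals table from each URL.
# fetch_rendered is whatever renders the JavaScript for you, e.g.
#   RSelenium: function(u) { rd$client$navigate(u); rd$client$getPageSource()[[1]] }
#   splashr:   function(u) render_html(url = u)
scrape_boxgoals <- function(urls, fetch_rendered, pause = 1) {
  lapply(urls, function(u) {
    page <- fetch_rendered(u)
    # RSelenium hands back a string; splashr hands back a parsed document
    if (is.character(page)) page <- read_html(page)
    node <- html_node(page, "#boxgoals")
    Sys.sleep(pause)  # arbitrary delay so the site isn't hammered
    # html_node() returns an xml_missing object when nothing matches
    if (inherits(node, "xml_missing")) return(NULL)
    html_table(node)
  })
}
```

Games where the table never rendered come back as NULL, so they can be spotted and retried instead of stopping the whole loop. With RSelenium you may also need to wait a moment between navigate() and getPageSource() so the scripts have time to finish.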

eoppe1022 · April 4, 2018, 12:39am

Woaaaaaaaaaaaaah. Thanks so much! This is fantastic. Now I just gotta get Docker figured out

At this point, I wouldn't be shocked if I already deleted System32...

eoppe1022 · April 4, 2018, 1:37am

For future reference, I just spent hours going insane, trying to figure out why I couldn't use Docker. I'm on Windows 10 and I needed to enable virtualization. So if you need to do that, google "enable virtualization windows 10" and it should help you.

tbradley · April 4, 2018, 12:46pm

@eoppe1022 If your question has been answered, please mark the response as the solution! Thanks!

tomhopper · April 7, 2018, 3:03pm

You might have a look at PhantomJS. It's a headless browser that should allow you to render and then save pages, then scrape the saved page, with tables now in HTML.

You can see an example at https://www.r-bloggers.com/web-scraping-javascript-rendered-sites/.

kenbutler · April 8, 2018, 5:00pm

I've had success with splashr (and figured out the whole docker thing). I found this helpful: https://rud.is/b/2017/02/09/diving-into-dynamic-website-content-with-splashr/

eoppe1022 · April 8, 2018, 5:13pm

Thanks for the info! I've looked into PhantomJS, but it looks like a huge pain in the ass

hrbrmstr · September 24, 2018, 11:49pm

Take a look at decapitated via gitlab.com/hrbrmstr/decapitated (or github for legacy code sharing service users). It's much less complex than splashr and may get you what you need.

PhantomJS is also in "perhaps the community will keep it going" mode ever since headless Chrome (what decapitated uses) came on the scene.

eoppe1022 · September 25, 2018, 12:24am

oh my god this is awesome! Any idea why a new Chrome window pops up every time I run chrome_read_html()?

hrbrmstr · September 25, 2018, 11:10am

I'd strongly suggest (for a number of reasons) using the decapitated::download_chromium() function. After doing so, it will tell you the environment variable setting you need to add to ~/.Renviron. That way the browser automation ops are kept separate from your main Chrome binary, so there's no possible corruption of your own Chrome profile and no chance it will ever not be "headless" (it also means you can ditch the Google-spying Chrome and use the far superior Firefox Developer Edition).

eoppe1022 · September 26, 2018, 5:32pm

Ah okay. By the way, I've had some issues with download_chromium(), so I raised an issue on GitHub, if you wouldn't mind taking a look

nimaaax · June 24, 2019, 5:54pm

Hey Eoppe,

My issue looks like yours. Did you manage to solve it? What did your code look like in the end?

Many thanks,
Nima

alistaire · June 30, 2019, 2:24am

At this point I'd probably recommend using hrbrmstr's decapitated package he linked above, which is less of a pain than the other options. Install the package, configure it (meaning probably use the helper to install Chromium, set the environment variable in ~/.Renviron, and restart R), and then you can use chrome_read_html to grab an xml2 object you can parse normally with rvest.
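Roughly, that could look like the sketch below. Treat it as untested against the live site: chrome_configured and read_boxgoals are names made up for this example, and it assumes the environment variable is called HEADLESS_CHROME, so check that against whatever download_chromium() prints on your machine.

```r
library(rvest)

# Assumption: decapitated locates the browser through HEADLESS_CHROME;
# use the variable name that download_chromium() reports if it differs.
chrome_configured <- function() {
  nzchar(Sys.getenv("HEADLESS_CHROME"))
}

# Hypothetical wrapper: render one game page and pull out the goals table
read_boxgoals <- function(url) {
  if (!requireNamespace("decapitated", quietly = TRUE) || !chrome_configured()) {
    stop("Install decapitated and set HEADLESS_CHROME in ~/.Renviron first",
         call. = FALSE)
  }
  page <- decapitated::chrome_read_html(url)  # returns an xml2 document
  html_table(html_node(page, "#boxgoals"))
}

# Usage, once everything is configured and R has been restarted:
# read_boxgoals("http://www.uscho.com/recaplink.php?gid=1_970_20172018")
```

Failing early with a clear message is deliberate here: a missing environment variable otherwise tends to show up as a confusing browser error much later.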