Scraping past html comments with rvest

salamander · July 10, 2018, 1:23am

For background, I am trying to get the list of channels each NBA game was televised on. When I access the HTML node below, it should yield 14 divs (I counted them in the browser inspector) but it only gives 2. I am somewhat confident this is because there is a comment after the second child in the list on the inspect page. I think rvest stops reading once it hits a comment. Is there a fix for this? Please advise, and thank you in advance.

library("rvest")
#> Loading required package: xml2
library("magrittr")

#rm(list = ls())

"https://stats.nba.com/scores/04/11/2018" %>%
  read_html() -> scrape

scrape %>%
  html_nodes("#scoresPage > div.row.collapse > div.scores__inner.large-9.medium-8.columns > div > div") %>%
  html_children()
#> {xml_nodeset (3)}
#> [1] <div stats-loader="isLoading"></div>
#> [2] <div stats-no-data-msg="noGames &amp;&amp; !isLoading" text="No Game ...
#> [3] <div class="game" ng-show="!isLoading" ng-repeat="(i, game) in games ...

dcruvolo · July 10, 2018, 3:55am

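The returned values are blank because the site's content (games, scores, channels, etc.) is generated client side by JavaScript after the page loads, so it is not present in the HTML that read_html() downloads. The ng-repeat and ng-show attributes in your output are the unrendered template.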

I verified this by:

Changing html_children() to html_structure().
Checking the content through the browser's JS console (I modified the path slightly to specifically select the elements that contain the broadcast channel). This returns a list of 12 nodes, one for each game broadcast on that date.


// updated selector path
const el = document.querySelectorAll('#scoresPage > .row:nth-child(2) > .scores__inner > div:nth-child(1) > .linescores-container > .game > .row > .large-12 > .linescore-header > .scores__inner__broadcaster > span:nth-child(2)');

// view all
for (let i = 0; i < el.length; i++) {
    console.log(el[i]);
}
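
As a side check on the original hypothesis, here is a minimal sketch suggesting that a comment does not stop rvest from reading later siblings. The HTML string is made up for illustration (it is not the NBA page); it only mimics the shape of the output above, with a comment between siblings and a single ng-repeat template div.

```r
library(rvest)

# Made-up HTML: two plain divs, a comment, then one Angular-style template div
html <- '<div id="scores">
  <div class="a"></div>
  <div class="b"></div>
  <!-- a comment between siblings -->
  <div class="game" ng-repeat="game in games"></div>
</div>'

page   <- read_html(html)
parent <- html_node(page, "#scores")

# Element children: the div after the comment is still returned
kids <- html_children(parent)
length(kids)
#> [1] 3
html_attr(kids, "class")
#> [1] "a"    "b"    "game"

# The comment itself is present in the parsed tree, it is just not an element
comments <- xml2::xml_find_all(parent, "./comment()")
length(comments)
#> [1] 1
```

The template div appears once no matter how many games the browser would eventually render, which is consistent with the short node set in the question.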

RSelenium is a better option in this situation because it drives a real browser, which runs the page's JavaScript. It would look something like this:

# install (first-time setup only)
devtools::install_github("johndharrison/binman")
devtools::install_github("johndharrison/wdman")
devtools::install_github("ropensci/RSelenium")

# set up: start a Selenium server and open a browser session
library(RSelenium)
rsd <- RSelenium::rsDriver(browser = "chrome")  # or other browser
rsc <- rsd$client

# navigate to page
rsc$navigate("https://stats.nba.com/scores/04/11/2018")


# set path
path <- "#scoresPage > .row:nth-child(2) > .scores__inner > div:nth-child(1) > .linescores-container > .game > .row > .large-12 > .linescore-header > .scores__inner__broadcaster > span:nth-child(2)"

# scrape elements
el <- rsc$findElements(using = "css", value = path)

# extract text
out <- sapply(el, function(x) x$getElementText())
channels <- data.matrix(out)

# continue transformations here

You can find more information here: http://ropensci.github.io/RSelenium/

Hope that helps!

salamander · July 11, 2018, 8:11pm

Thank you!! This is excellent and almost exactly what I needed. However, I have two follow-up questions:

Can I ask how you obtained that path variable?
Also, if I change your line:

path <- "#scoresPage > .row:nth-child(2) > .scores__inner > div:nth-child(1) > .linescores-container > .game > .row > .large-12 > .linescore-header > .scores__inner__broadcaster > span:nth-child(2)"

to:

path <- "#scoresPage > .row:nth-child(2) > .scores__inner > div:nth-child(1) > .linescores-container > .game > .row > .large-12 > .linescore-header > .scores__inner__broadcaster"

I can obtain all channels and not just the first one. However, it is difficult to get ESPN or NBATV because they are images/divs rather than spans. It seems that nationally televised games are marked up this way. Would you know of an easy way to pull those down in addition to the text versions?

Thank you again by the way, this was a huge help.

dcruvolo · July 13, 2018, 11:56am

Glad it worked out! I used Inspect Element and typed out the CSS path by reading the source code. Sorry, it looks like the previous version dropped some elements (I'm not sure what I was thinking by using span:nth-child(2)). I like your changes to the CSS path. The data is in a better format too.

Where there are images instead of text, you can extract the value in the bc attribute located in <stats-broadcaster-logo>. This path is defined below.

# set paths: for <span> and for <stats-broadcaster-logo>
path <- "#scoresPage > .row:nth-child(2) > .scores__inner > div:nth-child(1) > .linescores-container > .game > .row > .large-12 > .linescore-header > .scores__inner__broadcaster"
img.path <- paste0(path," > stats-broadcaster-logo")

Then, use the getElementAttribute function to extract the text in the attribute bc.

# scrape elements
logo <- rsc$findElements(using = "css",value = img.path)

# extract text
imgs <- sapply(logo, function(x){ x$getElementAttribute("bc") })
imgs <- data.matrix(imgs)

Here's the full R code.

# set up
library(RSelenium)
rsd <- RSelenium::rsDriver(browser = "chrome")
rsc <- rsd$client

# navigate to page
rsc$navigate("https://stats.nba.com/scores/04/11/2018")

# set paths: for <span> and for <stats-broadcaster-logo>

path <- "#scoresPage > .row:nth-child(2) > .scores__inner > div:nth-child(1) > .linescores-container > .game > .row > .large-12 > .linescore-header > .scores__inner__broadcaster"

img.path <- paste0(path," > stats-broadcaster-logo")

# scrape elements
el <- rsc$findElements(using = "css", value = path)
logo <- rsc$findElements(using = "css",value = img.path)

# extract text
out <- sapply(el, function(x){x$getElementText()})
channels <- data.matrix(out)

# extract attributes
imgs <- sapply(logo, function(x){ x$getElementAttribute("bc") })
imgs <- data.matrix(imgs)

# view
channels
imgs

# continue with transformations

# close the browser session and stop the Selenium server
rsc$close()
rsd$server$stop()
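
For the "continue with transformations" step, one possible sketch is to flatten the results into plain character vectors. The values below are made-up stand-ins, and the list-of-strings shape is my assumption about what the sapply() calls return; flatten_text is a hypothetical helper name.

```r
# Made-up stand-ins for `out` and `imgs` (assumed shape: a list of strings)
out  <- list("FSOH / FSFL", "", "NBCSP / FSWI")
imgs <- list("espn", "nbatv")

# Hypothetical helper: flatten to a character vector, trim whitespace,
# and drop empty entries
flatten_text <- function(x) {
  values <- trimws(unlist(x, use.names = FALSE))
  values[nzchar(values)]
}

channels <- flatten_text(out)
logos    <- toupper(flatten_text(imgs))
```

Dropping empty strings loses the link between a channel and its game, so if you need one row per game it may be better to keep the empty entries and match the two vectors by position instead.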

Hope that helps!