Extracting "href" when html_attr("href") doesn't work

rvest

#1

I'm trying to extract the "href" from this xml_nodeset, but html_attr("href") -- which usually works -- won't work here. Any idea how I can extract the "href" part of this? Thanks!

library(rvest)
library(splashr)

sp <- start_splash()

page <- splashr::render_html(url = "https://www.nhl.com/gamecenter/phi-vs-bos/1974/05/07/1973030311#game=1973030311,game_state=final")

stop_splash(sp)

page %>% html_nodes('[class="name"]')

# {xml_nodeset (5)}
# [1] <div class="name"><strong><a href="https://www.nhl.com/player/wayne-cashman-8446002" data-player-link="8446002" ...
# [2] <div class="name"><strong><a href="https://www.nhl.com/player/gregg-sheppard-8451335" data-player-link="8451335 ...
# [3] <div class="name"><strong><a href="https://www.nhl.com/player/orest-kindrachuk-8448495" data-player-link="84484 ...
# [4] <div class="name"><strong><a href="https://www.nhl.com/player/bobby-clarke-8446098" data-player-link="8446098"> ...
# [5] <div class="name"><strong><a href="https://www.nhl.com/player/bobby-orr-8450070" data-player-link="8450070">Bob ...

page %>% html_nodes('[class="name"]') %>% html_attr("href")

# [1] NA NA NA NA NA

#2

From your example, it looks like you need to go two levels below the current node to reach the <a> node and read its href attribute. Right now you are asking for href on the <div> of class "name", and that <div> has no href attribute, which is why you get NA.

You should use XPath or CSS selectors to get to these nodes, or navigate the XML structure with xml_children() and friends.

I can't make an example because I don't have my computer right now. Hope it is clear enough.


#3

I can now :smile:

With the selector "div.name > strong > a", it works:

  • select every <a> that is a direct child of a <strong> that is itself a direct child of a <div> of class "name"
library(splashr)
#> Warning: package 'splashr' was built under R version 3.4.4
sp <- splash("192.168.99.100")
page <- render_html(sp, url = "https://www.nhl.com/gamecenter/phi-vs-bos/1974/05/07/1973030311#game=1973030311,game_state=final")

library(rvest)
#> Loading required package: xml2
page %>%
  html_nodes("div.name > strong > a") %>%
  html_attr("href")
#> [1] "https://www.nhl.com/player/wayne-cashman-8446002"   
#> [2] "https://www.nhl.com/player/gregg-sheppard-8451335"  
#> [3] "https://www.nhl.com/player/orest-kindrachuk-8448495"
#> [4] "https://www.nhl.com/player/bobby-clarke-8446098"    
#> [5] "https://www.nhl.com/player/bobby-orr-8450070"

Created on 2018-08-10 by the reprex package (v0.2.0).
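
For completeness, here is a rough sketch of the other routes mentioned earlier (XPath and xml_children()), which can be tried without Splash. The HTML string is a hypothetical stand-in for one entry of the rendered page: the href is taken from the output above, but the link text "player name" is a placeholder, not what the real page contains. Treat it as an illustration of the idea rather than the one right way to do it.

```r
library(rvest)

# Hypothetical one-entry stand-in for the rendered page (link text is a placeholder)
html <- read_html(
  '<div class="name"><strong><a href="https://www.nhl.com/player/bobby-orr-8450070">player name</a></strong></div>'
)

# href lives on <a>, not on the <div>, so asking the <div> for it gives NA
div_href <- html %>%
  html_nodes('[class="name"]') %>%
  html_attr("href")

# CSS selector: direct-child chain div.name > strong > a
css_href <- html %>%
  html_nodes("div.name > strong > a") %>%
  html_attr("href")

# XPath: same path expressed as an XPath query
xpath_href <- html %>%
  html_nodes(xpath = '//div[@class="name"]/strong/a') %>%
  html_attr("href")

# Manual navigation: step down two levels (div -> strong -> a) with xml_children()
nav_href <- html %>%
  html_nodes("div.name") %>%
  xml2::xml_children() %>%
  xml2::xml_children() %>%
  xml2::xml_attr("href")
```

All three of css_href, xpath_href and nav_href should hold the same URL, while div_href is NA. The selector versions are usually easier to read; xml_children() mostly helps when the structure is too irregular to describe with a selector.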


#4

Goddamn, you're always so helpful :). Thanks, @cderv.