Web scraping: rvest output is a list of 0

Good morning,
I am trying to scrape some data from SoFifa.com, and I ran into a problem parsing a button that contains a list of hyperlinks.
My goal is to capture the values from this button's drop-down menu and then parse some objects from each hyperlink. I have no problem with the first item on the menu (the button label), but I can't find any way to get the values I am interested in. Does anyone have ideas?

EXAMPLE and TEST:
The web page is the player page used in the code below.

[Screenshot: the button, circled in red.]

If I use a CSS selector or XPath on the individual values of the button list, I only get a value for the button label. For the values I am interested in, R gives me:
{xml_missing}

Here is some simple code to test:

library(rvest)

# URL of the player page
url <- "https://sofifa.com/player/230621"

# parse the HTML
html <- xml2::read_html(url)

# test the history button label
test_label <- html %>% html_node("#version-jump > option:nth-child(1)") %>% 
                       html_text()

# test the history button values
test_value <- html %>% html_node("#version-jump > option:nth-child(2)")

I tried to inspect the object, but I don't understand how to extract the individual values so that I can write a function that collects all the hyperlinks.

Thank you so much for any help.
on hold

MC

The page you are trying to scrape is loaded dynamically by a JavaScript script.
You can see this because, in the HTML you get, there is only one option node under #version-jump, so you get nothing when you ask for the second one:

library(rvest)
#> Loading required package: xml2
url <- paste0("https://sofifa.com//player/230621")
html <- xml2::read_html(url)

html %>% html_nodes("#version-jump")
#> {xml_nodeset (1)}
#> [1] <select id="version-jump" class="form-select redirect"><option value ...
html %>% html_nodes("#version-jump > option")
#> {xml_nodeset (1)}
#> [1] <option value="">History Version</option>

Created on 2019-05-01 by the reprex package (v0.2.1.9000)
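
To illustrate why the second option comes back as {xml_missing}, here is a minimal offline sketch. The HTML string is a hand-written stand-in for what the server sends before any JavaScript runs, not the actual page source, and the variable names are only for illustration.

```r
library(rvest)

# Assumption: a simplified copy of the static HTML, where only the
# placeholder option exists because the others are added later by JS.
static_html <- read_html(
  '<select id="version-jump"><option value="">History Version</option></select>'
)

# html_nodes() returns every match, so here a node set of length 1
opts <- html_nodes(static_html, "#version-jump > option")
length(opts)

# html_node() returns a single match, or a missing node when nothing matches
second <- html_node(static_html, "#version-jump > option:nth-child(2)")
inherits(second, "xml_missing")

# Extracting text from a missing node gives NA rather than an error
html_text(second)
```

Checking the length of the node set first is a quick way to tell "my selector is wrong" apart from "the content is not in the static HTML at all".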

You need to use a package that can scrape JS-rendered websites. There are several options; two of them, decapitated and crrri, are shown below.

Not all of these options will necessarily work, but some will.

Example with decapitated:

library(decapitated)
library(rvest)
#> Loading required package: xml2
url <- "https://sofifa.com/player/230621"
html <- chrome_read_html(url)
html %>% 
  html_nodes("#version-jump > option") %>%
  length()
#> [1] 295

html %>% html_node("#version-jump > option:nth-child(1)") %>% html_text()
#> [1] "History Version"
html %>% html_node("#version-jump > option:nth-child(2)") %>% html_text()
#> [1] "Apr 25, 2019"

Created on 2019-05-01 by the reprex package (v0.2.1.9000)

Example with crrri:

It is a low-level package for now and still in development, so it may change quickly, but it lets you control the Chrome browser directly from R.
A dump_DOM function needs to be created to get the JS-rendered HTML, which can then be read with rvest. A new package should provide such functions soon.

library(crrri)

dump_DOM <- function(url) {
  # requires crrri to be configured so that it can find Chrome
  chrome <- Chrome$new()
  on.exit(chrome$close())
  client <- hold(chrome$connect())
  Network <- client$Network
  Page <- client$Page
  Runtime <- client$Runtime
  Page$enable() %...>% {
    Network$enable()
  } %...>% {
    Network$setCacheDisabled(cacheDisabled = TRUE)
  } %...>% {
    Page$navigate(url)
  } %...>% {
    Page$loadEventFired()
  } %...>% {
    Runtime$evaluate(
      expression = 'document.documentElement.outerHTML'
    )
  } %>% {
    hold(.)$result$value
  }
}

dom <- dump_DOM(url = "https://sofifa.com/player/230621")
#> Running "C:/Users/chris/Documents/Chrome/chrome-win32/chrome.exe" \
#>   --no-first-run --headless \
#>   "--user-data-dir=C:\Users\chris\AppData\Local\r-crrri\r-crrri\chrome-data-dir-rouneflg" \
#>   "--remote-debugging-port=9222" --disable-gpu --no-sandbox
library(rvest)
#> Loading required package: xml2
html <- read_html(dom)
html %>% 
  html_nodes("#version-jump > option") %>%
  length()
#> [1] 295

html %>% html_node("#version-jump > option:nth-child(1)") %>% html_text()
#> [1] "History Version"
html %>% html_node("#version-jump > option:nth-child(2)") %>% html_text()
#> [1] "Apr 25, 2019"

Created on 2019-05-01 by the reprex package (v0.2.1.9000)
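
Once the rendered HTML is available, collecting every entry of the drop-down could look like the sketch below. It runs on a hand-written snippet so that it works offline; the `value` attributes in it are hypothetical placeholders (the real ones on the site may look different), and `extract_versions` is an illustrative helper name, not part of any package.

```r
library(rvest)

# Assumption: a small stand-in for the rendered page. The value
# attributes are made up for illustration only.
rendered <- read_html('
<select id="version-jump">
  <option value="">History Version</option>
  <option value="/player/230621/example-a">Apr 25, 2019</option>
  <option value="/player/230621/example-b">Apr 22, 2019</option>
</select>')

# Hypothetical helper: one row per menu entry, with label, raw value and full URL
extract_versions <- function(html, base_url = "https://sofifa.com/") {
  opts <- html_nodes(html, "#version-jump > option")
  out <- data.frame(
    label = html_text(opts),
    value = html_attr(opts, "value"),
    stringsAsFactors = FALSE
  )
  # Drop the placeholder entry, whose value attribute is empty
  out <- out[!is.na(out$value) & nzchar(out$value), , drop = FALSE]
  # Turn relative links into absolute ones
  out$url <- xml2::url_absolute(out$value, base_url)
  rownames(out) <- NULL
  out
}

versions <- extract_versions(rendered)
versions
```

With a real page, `rendered` would be replaced by the `html` object obtained from one of the examples above, and each row of `versions$url` could then be fetched in turn.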


Another package you may want to try is webdriver: https://cran.r-project.org/web/packages/webdriver/

webdriver is a great package. It works well with PhantomJS, but the problem is that the PhantomJS project has been discontinued.

