WEBSCRAPING: RVEST output List of 0

The page you are trying to scrape is loaded dynamically by JavaScript.
You can see this in the HTML you get back: the #version-jump node contains only one option node, so you get nothing when you ask for the second one.

library(rvest)
#> Loading required package: xml2
url <- "https://sofifa.com/player/230621"
html <- xml2::read_html(url)

html %>% html_nodes("#version-jump")
#> {xml_nodeset (1)}
#> [1] <select id="version-jump" class="form-select redirect"><option value ...
html %>% html_nodes("#version-jump > option")
#> {xml_nodeset (1)}
#> [1] <option value="">History Version</option>

Created on 2019-05-01 by the reprex package (v0.2.1.9000)
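
A quick way to check for this situation is to count how many nodes a selector matches in the static HTML. The sketch below is only an illustration: it runs on an inline copy of the markup printed above instead of the live site, and the helper name `count_nodes` is made up for this example.

```r
library(rvest)

# Static markup as returned without JavaScript (copied from the output above,
# with the closing </select> tag added so the snippet is complete).
static_html <- '<select id="version-jump" class="form-select redirect">
  <option value="">History Version</option>
</select>'

# Hypothetical helper: number of nodes matching a CSS selector.
count_nodes <- function(html, css) {
  length(html_nodes(html, css))
}

html <- xml2::read_html(static_html)
count_nodes(html, "#version-jump > option")
#> [1] 1
count_nodes(html, "#version-jump > option:nth-child(2)")
#> [1] 0
```

If the count is lower than what you see in the browser's inspector, the missing nodes are most likely added by JavaScript after the page loads.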

You need to use a package that can scrape JS-rendered websites. There are several options; two of them, decapitated and crrri, are shown below.

Not all of these options will necessarily work for a given page, but some will.
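
Since not every backend works for every page, one possible pattern is to try them in turn and keep the first one that returns enough nodes. This is only a sketch under assumptions: `first_working`, the `fetchers` list, and the inline HTML strings are hypothetical stand-ins so that the example runs offline. In real use each fetcher would be a function such as `decapitated::chrome_read_html`.

```r
library(rvest)

# Hypothetical helper: try each fetcher in order and return the first parsed
# document in which `css` matches at least `min_nodes` nodes.
first_working <- function(url, fetchers, css, min_nodes = 2) {
  for (name in names(fetchers)) {
    # A backend that errors (e.g. not installed, no browser) is skipped.
    doc <- tryCatch(fetchers[[name]](url), error = function(e) NULL)
    if (is.null(doc)) next
    if (length(html_nodes(doc, css)) >= min_nodes) {
      return(list(backend = name, html = doc))
    }
  }
  NULL
}

# Offline stand-ins for real backends (the option value "x" is made up).
static_html <- '<select id="version-jump"><option value="">History Version</option></select>'
rendered_html <- '<select id="version-jump"><option value="">History Version</option><option value="x">Apr 25, 2019</option></select>'
fetchers <- list(
  broken   = function(url) stop("backend not available"),
  static   = function(url) xml2::read_html(static_html),
  rendered = function(url) xml2::read_html(rendered_html)
)

res <- first_working("https://sofifa.com/player/230621", fetchers, "#version-jump > option")
res$backend
#> [1] "rendered"
```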

Example with decapitated

library(decapitated)
library(rvest)
#> Loading required package: xml2
url <- "https://sofifa.com/player/230621"
html <- chrome_read_html(url)
html %>% 
  html_nodes("#version-jump > option") %>%
  length()
#> [1] 295

html %>% html_node("#version-jump > option:nth-child(1)") %>% html_text()
#> [1] "History Version"
html %>% html_node("#version-jump > option:nth-child(2)") %>% html_text()
#> [1] "Apr 25, 2019"


Example with crrri

It is a low-level package for now and still in development, so it may evolve quickly, but it lets you control the Chrome browser directly from R.
A dump_DOM function needs to be created to get the HTML rendered by JS, which can then be read with rvest. A new package should contain such functions soon.

library(crrri)

dump_DOM <- function(url) {
  # requires crrri to be configured so that it can find Chrome
  chrome <- Chrome$new()
  on.exit(chrome$close())
  client <- hold(chrome$connect())
  Network <- client$Network
  Page <- client$Page
  Runtime <- client$Runtime
  Page$enable() %...>% {
    Network$enable()
  } %...>% {
    Network$setCacheDisabled(cacheDisabled = TRUE)
  } %...>% {
    Page$navigate(url)
  } %...>% {
    Page$loadEventFired()
  } %...>% {
    Runtime$evaluate(
      expression = 'document.documentElement.outerHTML'
    )
  } %>% {
    hold(.)$result$value
  }
}

dom <- dump_DOM(url = "https://sofifa.com/player/230621")
#> Running "C:/Users/chris/Documents/Chrome/chrome-win32/chrome.exe" \
#>   --no-first-run --headless \
#>   "--user-data-dir=C:\Users\chris\AppData\Local\r-crrri\r-crrri\chrome-data-dir-rouneflg" \
#>   "--remote-debugging-port=9222" --disable-gpu --no-sandbox
library(rvest)
#> Loading required package: xml2
html <- read_html(dom)
html %>% 
  html_nodes("#version-jump > option") %>%
  length()
#> [1] 295

html %>% html_node("#version-jump > option:nth-child(1)") %>% html_text()
#> [1] "History Version"
html %>% html_node("#version-jump > option:nth-child(2)") %>% html_text()
#> [1] "Apr 25, 2019"

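
Once the rendered HTML is available from either approach, all the options can be collected at once instead of one nth-child at a time. The following is a minimal sketch that uses an inline stand-in for the rendered page so it runs offline; the `value` of the second option is hypothetical, since the real attribute values are not shown above.

```r
library(rvest)

# Stand-in for the rendered page; the second option's value is hypothetical.
rendered_html <- '<select id="version-jump">
  <option value="">History Version</option>
  <option value="hypothetical-version-url">Apr 25, 2019</option>
</select>'

html <- xml2::read_html(rendered_html)
opts <- html %>% html_nodes("#version-jump > option")

# One row per option: visible label and value attribute.
versions <- data.frame(
  label = html_text(opts),
  value = html_attr(opts, "value"),
  stringsAsFactors = FALSE
)

# Drop the "History Version" placeholder, whose value is empty.
versions <- versions[!is.na(versions$value) & versions$value != "", ]
versions$label
#> [1] "Apr 25, 2019"
```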
