Scraping web table with rvest

I'm new to web scraping with R. I'm trying to scrape the table generated by this link: https://gd.eppo.int/search?k=saperda+tridentata. In this specific case there is just one record in the table, but there could be more (I am mainly interested in the first column, but the whole table is fine).
I already looked for this issue elsewhere and tried to apply the tips found here, but with no success (maybe because of my limited knowledge of how web pages work): https://stackoverflow.com/questions/59312399/rvest-table-with-thead-and-tbody-tags. Maybe I am not following the suggested step "# Get the JSON as plain text from the link generated by Javascript on the page" correctly. Where can I get this link? In this specific case I used the JSON link shown in the script below... is that the correct one for the page I am exploring?

Below is my code. Thank you in advance!

library(httr)
library(rlist)
library(rvest)
library(jsonlite)
library(dplyr)
library(magrittr)  # needed for extract() below

pest.name <- "saperda+tridentata"

url <- paste("https://gd.eppo.int/search?k=",pest.name, sep="")
resp <- GET(url) %>% content("text") 

json_url <- "https://gd.eppo.int/media/js/application/zzsearch.js?7"
JSON <- GET(json_url) %>% content("text", encoding = "utf8") 

table_contents <- JSON      %>%
  {gsub("\\\\n", "\n", .)}  %>%   # unescape newlines
  {gsub("\\\\/", "/", .)}   %>%   # unescape forward slashes
  {gsub("\\\\\"", "\"", .)} %>%   # unescape quotes
  strsplit("html\":\"")     %>%   # split on the "html": field
  unlist                    %>%
  extract(2)                %>%   # magrittr::extract, i.e. take the 2nd piece
  substr(1, nchar(.) - 2)   %>%   # drop the trailing characters
  paste0("</tbody>")              # close the tbody tag again

new_page <- gsub("</tbody>", table_contents, resp)

read_html(new_page)   %>%
  html_nodes("table") %>%
  html_table()
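As a quick sanity check (my own assumption, not something from the linked answer): if that zzsearch.js link were the right one, its content should contain the table rows that get spliced into the page, so something like this should return TRUE:

# sanity check (assumption): the right link should contain the <tbody> rows
# and the species name I searched for
grepl("<tbody", JSON, fixed = TRUE)
grepl("tridentata", JSON, ignore.case = TRUE)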

I have a solution with the RSelenium package. If you only need the data ...

library(RSelenium)
library(rvest)

# start a Selenium server and open a Firefox session
rD <- rsDriver(browser = 'firefox')
remDr <- rD[["client"]]

pest.name <- "saperda+tridentata"
url <- paste("https://gd.eppo.int/search?k=", pest.name, sep = "")
remDr$navigate(url)

# make sure we are in the top-level frame, then grab the rendered page source
remDr$switchToFrame(NULL)
doc <- xml2::read_html(remDr$getPageSource()[[1]])

# the table has now been filled in by the browser, so html_table() can read it
df <- rvest::html_table(doc)[[1]]

gives

[screenshot of the resulting data frame with the Saperda tridentata record]

Maybe this helps


Hi Han!
Thanks for your reply!!

I had already explored the RSelenium option... but I ran into so many problems that I gave up.

  1. On my work laptop there may be problems related to corporate restrictions (I am not an administrator of my machine), so I cannot even establish a connection. Error message: "Undefined error in httr call. httr output: Failed to connect to localhost port 444: Connection refused". The same happens with port 4567.

  2. On my personal PC I copied and pasted your script and tried it with both Chrome and Firefox, but I always get an empty table (0 obs. of 5 variables). This is really weird... I have no idea why it is happening. I am wondering whether it is an HTML issue, i.e. whether I am reading the HTML nodes correctly, but I am not at all an expert on HTML, so I cannot even explain properly what I mean :smiley:

Any suggestions?

Thanks again
kind regards

Hello Maio,

  1. No suggestions for 1. If you get it to work on your personal device, you could ask your administrator for advice.

  2. The second time I run the script I get errors saying that the port is already in use (even when I close with

remDr$close()
# stop the selenium server
rD[["server"]]$stop() 
rm(rD)
gc(verbose=F)

I have to close RStudio before proceeding, or use another port number. But you don't get errors, only an empty table (?). Could you include the relevant portion of the code that you use? Or is it exactly the same as the code that I used?
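One thing you could try (just a sketch, not tested on your setup): start the driver on another port so the "already in use" error goes away, and give the page a moment to finish its JavaScript before reading the source, in case the empty table is a timing issue.

library(RSelenium)
library(rvest)

# use a port that is not already taken (4445L is just an example)
rD <- rsDriver(browser = "firefox", port = 4445L)
remDr <- rD[["client"]]

remDr$navigate("https://gd.eppo.int/search?k=saperda+tridentata")
Sys.sleep(3)  # crude wait so the JavaScript can fill the table

doc <- xml2::read_html(remDr$getPageSource()[[1]])
rvest::html_table(doc)[[1]]

# clean up afterwards
remDr$close()
rD[["server"]]$stop()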

Good luck, Han

I am having the same port issue: I am not able to close it.

When the port is working, I use exactly the same code that you use... and this is why it's weird that you can get the data and I can't...

I am afraid that there is some issue (on my machine) with the interpretation of the HTML tag tbody, which determines where the data in an HTML table are (as far as I understood...).
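(Just an idea for checking this, under the assumption that the problem is in the raw HTML rather than in my machine: counting how many rows are actually present inside the tbody of the page as downloaded, without any JavaScript.)

library(rvest)

# how many rows does the static page actually contain inside <tbody>?
static_page <- read_html("https://gd.eppo.int/search?k=saperda+tridentata")
length(html_nodes(static_page, "table tbody tr"))  # 0 would mean the rows are added later by JavaScript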

Can you please try to use the following code and tell me if you can get the table?

library(XML)
library(RCurl)
library(rlist)
library(rvest)

pest.name <- "saperda+tridentata"
pest.html <- read_html(paste("https://gd.eppo.int/search?k=", pest.name, sep = "")) %>%
  html_nodes("table") %>%
  html_table(fill = FALSE)
pest.html[[1]]

Why would this work? You will not pick up the effects of the JavaScript.
Anyway, when I run your code I get

> pest.html[[1]]
[1] EPPOCode  Name      Type      Language  Preferred
<0 rows> (or 0-length row.names)
>

I work with a new version of RStudio (version 1.3.904) on Windows 10:

> sessionInfo() 
R version 3.6.0 (2019-04-26)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 18363)

Regards Han

Hi Han,
checking the code on my personal PC more carefully, I realized I had left out a character in the website string.
Your code is working! Thank you!!

Now I have to understand how to run it on my work machine.

Kind regards

Andrea
