Scraping web table with rvest

I'm new to web scraping with R. I'm trying to scrape the table generated by this link: https://gd.eppo.int/search?k=saperda+tridentata. In this specific case there is just one record in the table, but there could be more (I am mainly interested in the first column, but the whole table is fine).
I already looked for this issue elsewhere and tried to apply the tips found here, but with no success (maybe because of my limited knowledge of how web pages work): https://stackoverflow.com/questions/59312399/rvest-table-with-thead-and-tbody-tags. Maybe I am not following the suggested step "# Get the JSON as plain text from the link generated by Javascript on the page" correctly. Where can I get this link? In this specific case I used the JSON link shown in the script below... is that the correct one for the page I am exploring?

Below is my code. Thank you in advance!

library(httr)
library(rlist)
library(rvest)
library(jsonlite)
library(dplyr)
library(magrittr)  # needed for extract() below

pest.name <- "saperda+tridentata"

url <- paste("https://gd.eppo.int/search?k=",pest.name, sep="")
resp <- GET(url) %>% content("text") 

json_url <- "https://gd.eppo.int/media/js/application/zzsearch.js?7"
JSON <- GET(json_url) %>% content("text", encoding = "utf8") 

table_contents <- JSON      %>%
  {gsub("\\\\n", "\n", .)}  %>%   # unescape newlines
  {gsub("\\\\/", "/", .)}   %>%   # unescape forward slashes
  {gsub("\\\\\"", "\"", .)} %>%   # unescape quotes
  strsplit("html\":\"")     %>%   # split on the "html": field
  unlist                    %>%
  extract(2)                %>%   # magrittr::extract, i.e. take the 2nd piece
  substr(1, nchar(.) - 2)   %>%   # drop the trailing characters
  paste0("</tbody>")              # close the tbody tag again

new_page <- gsub("</tbody>", table_contents, resp)

read_html(new_page)   %>%
  html_nodes("table") %>%
  html_table()
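As a quick sanity check (my own assumption, not something from the linked answer): if that zzsearch.js link were the right one, its content should contain the table rows that get spliced into the page, so something like this should return TRUE:

# sanity check (assumption): the right link should contain the <tbody> rows
# and the species name I searched for
grepl("<tbody", JSON, fixed = TRUE)
grepl("tridentata", JSON, ignore.case = TRUE)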

I have a solution with the RSelenium package. If you only need the data ...

library(RSelenium)
library(rvest)

# start a Selenium server and open a Firefox session
rD <- rsDriver(browser = 'firefox')
remDr <- rD[["client"]]

pest.name <- "saperda+tridentata"
url <- paste("https://gd.eppo.int/search?k=", pest.name, sep = "")
remDr$navigate(url)

# make sure we are in the top-level frame, then grab the rendered page source
remDr$switchToFrame(NULL)
doc <- xml2::read_html(remDr$getPageSource()[[1]])

# the table has now been filled in by the browser, so html_table() can read it
df <- rvest::html_table(doc)[[1]]

gives

[screenshot of the resulting data frame with the Saperda tridentata record]

Maybe this helps


Hi Han!
Thanks for your reply!!

I had already explored the RSelenium option... but I ran into so many problems that I gave up.

  1. On my work laptop there may be problems related to corporate restrictions (I am not an administrator of my machine), so I cannot even establish a connection. Error message: "Undefined error in httr call. httr output: Failed to connect to localhost port 444: Connection refused". The same happens with port 4567.

  2. On my personal PC I copied and pasted your script and tried it with both Chrome and Firefox, but I always get an empty table (0 obs. of 5 variables). This is really weird... I have no idea why it is happening. I am wondering whether it is an HTML issue, i.e. whether I am reading the HTML nodes correctly, but I am not at all an expert on HTML, so I cannot even explain properly what I mean :smiley:

Any suggestions?

Thanks again
kind regards

Hello Maio,

  1. No suggestions for 1. If you get it to work on your personal device, you could ask your administrator for advice.

  2. The second time I run the script I get errors saying that the port is already in use (even when I close with

remDr$close()
# stop the selenium server
rD[["server"]]$stop() 
rm(rD)
gc(verbose=F)

I have to close RStudio before proceeding, or use another port number. But you don't get errors, only an empty table (?). Could you include the relevant portion of the code that you use? Or is it exactly the same as the code that I used?
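One thing you could try (just a sketch, not tested on your setup): start the driver on another port so the "already in use" error goes away, and give the page a moment to finish its JavaScript before reading the source, in case the empty table is a timing issue.

library(RSelenium)
library(rvest)

# use a port that is not already taken (4445L is just an example)
rD <- rsDriver(browser = "firefox", port = 4445L)
remDr <- rD[["client"]]

remDr$navigate("https://gd.eppo.int/search?k=saperda+tridentata")
Sys.sleep(3)  # crude wait so the JavaScript can fill the table

doc <- xml2::read_html(remDr$getPageSource()[[1]])
rvest::html_table(doc)[[1]]

# clean up afterwards
remDr$close()
rD[["server"]]$stop()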

Good luck, Han

I am having the same port issue: I am not able to close it.

When the port is working, I use exactly the same code that you use... and this is why it's weird that you can get the data and I can't...

I am afraid that there is some issue (on my machine) with the interpretation of the HTML tag tbody, which determines where the data in an HTML table are (as far as I understood...).
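(Just an idea for checking this, under the assumption that the problem is in the raw HTML rather than in my machine: counting how many rows are actually present inside the tbody of the page as downloaded, without any JavaScript.)

library(rvest)

# how many rows does the static page actually contain inside <tbody>?
static_page <- read_html("https://gd.eppo.int/search?k=saperda+tridentata")
length(html_nodes(static_page, "table tbody tr"))  # 0 would mean the rows are added later by JavaScript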

Can you please try to use the following code and tell me if you can get the table?

library(XML)
library(RCurl)
library(rlist)
library(rvest)

pest.name <- "saperda+tridentata"
pest.html <- read_html(paste("https://gd.eppo.int/search?k=", pest.name, sep = "")) %>%
  html_nodes("table") %>%
  html_table(fill = FALSE)
pest.html[[1]]

Why would this work? You will not pick up the effects of the JavaScript.
Anyway, when I run your code I get

> pest.html[[1]]
[1] EPPOCode  Name      Type      Language  Preferred
<0 rows> (or 0-length row.names)
>

I work with a new version of RStudio (version 1.3.904) on Windows 10:

> sessionInfo() 
R version 3.6.0 (2019-04-26)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 18363)

Regards Han

Hi Han,
checking the code on my personal PC more carefully, I realized I had left out a character in the website string.
Your code is working! Thank you!!

Now I have to understand how to run it on my work machine.

Kind regards

Andrea
