Web-scraping tables using rvest

How do I use rvest to extract the table on this page? "https://fundf10.eastmoney.com/F10DataApi.aspx?type=lsjz&code=510300&page=1&sdate=2019-01-01&edate=2021-02-13&per=40"

My code (I tried all three ways):

library(rvest)

fund_link <- "https://fundf10.eastmoney.com/F10DataApi.aspx?type=lsjz&code=510300&page=1&sdate=2019-01-01&edate=2021-02-13&per=40"

#1
fund_table <- read_html(fund_link) %>% html_node(".lsjz") %>% html_table()
#2 (the old rvest html() is deprecated, so read_html() is used here)
fund_link %>% read_html() %>% html_nodes(xpath = '/html/body/table') %>% html_table()
#3 (css_selector held the selectors mentioned below)
fund_link %>% read_html() %>% html_element(css = css_selector) %>% html_table()

For css_selector I tried table.w782.comm.lsjz and table#jztable as my nodes, but no dice so far; every attempt just returns an empty list.

The relevant HTML seems to be:

<table class="w782 comm lsjz selectorgadget_selected">
   ...
</table>

Would really appreciate some help on this, please!

In my archive :grinning: I found a piece of code that, adapted, could read (part of) the table.
Maybe it works for you. I am working on Windows 10.

library(RSelenium)
library(rvest)

fund_link <- "https://fundf10.eastmoney.com/jjjz_510300.html"

# start a Selenium server plus a Firefox browser instance
rD <- rsDriver(browser = "firefox", port = 4567L, verbose = FALSE)
remDr <- rD[["client"]]

remDr$navigate(fund_link)

# make sure we are working with the top-level document, not an iframe
remDr$switchToFrame(NULL)
# read the fully rendered page source into xml2/rvest
fund_page <- xml2::read_html(remDr$getPageSource()[[1]])

fund_page %>% html_elements("#jztable") %>% html_table() %>% print(width = 5000)


remDr$close()
# stop the selenium server
rD[["server"]]$stop() 
#> [1] TRUE
rm(rD)

Hmm, doesn't work for me; it says I don't have Java installed haha. Anyway, I really appreciate you taking the time to help, and even though I couldn't get it to work yet, I hope someone else can make use of this, so thanks a lot!

Hi @Krim and welcome to RStudio Community :partying_face: :partying_face: :partying_face: :partying_face: :partying_face:

The table of data on the webpage is loaded via JavaScript, and this is why {rvest} is not ideal for scraping it. The table takes a few seconds to load after you visit the url: https://fundf10.eastmoney.com/jjjz_510300.html, so rvest::read_html() cannot capture it: it only captures what is available immediately after the site loads (i.e. the static HTML).
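
You can verify this yourself. Since the table is injected client-side, the static HTML contains no table node at all; here is a minimal check, using the .lsjz class from the question:

library(rvest)

# The static HTML has no rendered <table> yet, so html_elements()
# finds nothing and html_table() returns an empty list --
# exactly the symptom described in the question
page <- read_html("https://fundf10.eastmoney.com/jjjz_510300.html")
page %>% html_elements("table.lsjz") %>% html_table()
#> list()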

So, we are going to use the {RSelenium} package for this task. It is a package that allows you to manipulate your browser right from your code, via a Selenium server. The only downside is that it takes a few steps to set up; not everything is ready out of the box when you install the package with install.packages("RSelenium"), but I'll do my best to walk you through all the steps with as many details as possible. Also, it is important to mention that I am a Windows user.

Setup

  1. Install the latest version of Java: https://java.com/en/download/. Restart your computer when the installation process is over. (A quick way to check the install from R is shown right after this list.)
  2. Install Firefox: https://www.mozilla.org/en-US/firefox/new/. In my personal experience, Firefox is the easiest browser to manipulate from Selenium.
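
A small sanity check for step 1, assuming java ends up on your PATH after installation:

# This should print the installed Java version to the console;
# if it errors, the Selenium server will not be able to start either
system("java -version")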

Connect to Selenium server from R

# Load packages (p_load installs any missing packages, then loads them) ----
pacman::p_load(RSelenium, purrr, rvest, glue)

# Start a Selenium server
driver <- rsDriver(port = 4444L, browser = "firefox")
remote_driver <- driver$client

Hopefully, everything has worked for you so far.
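
If you want to double-check the connection first, an optional sanity check:

# Returns a list describing the Selenium server (build, OS, ...);
# if this errors, the connection was not established
remote_driver$getStatus()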

Mini tutorial

Now, let me walk you through a quick tutorial in which we will scrape the second page of the table on the webpage.

# Open browser ----
remote_driver$open() # This code will actually open the firefox browser

# Navigate to URL ----
url <- "https://fundf10.eastmoney.com/jjjz_510300.html"
remote_driver$navigate(url) # This code will actually open the website in the browser that opened up earlier

# Navigate to page 2 of the table ----

# ** Find page 2 button
page2_btn <- remote_driver$findElement(using = "css", value = ".pagebtns > label[value='2']")

# **  Move pointer to button
remote_driver$mouseMoveToLocation(webElement = page2_btn) 

# ** Click on page 2 button
page2_btn$click() # Notice how the browser goes to page 2 of the table
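
# ** Give the table a moment to refresh before reading it (this pause is
# ** my own addition; it mirrors the Sys.sleep() used in the function below)
Sys.sleep(1)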

#  Find table element in HTML page ----
table_el <- remote_driver$findElement(using = "css", value = "#jztable")

#  Scrape table ----
table_page2 <- table_el$getElementAttribute("innerHTML") %>%
  .[[1]] %>%
  read_html() %>%
  html_table() %>%
  .[[1]]

table_page2

# A tibble: 20 x 7
   净值日期   单位净值 累计净值 日增长率 申购状态 赎回状态 分红送配
   <chr>         <dbl>    <dbl> <chr>    <chr>    <chr>    <lgl>
 1 2021-07-07     5.19     2.09 1.17%    场内买入 场内卖出 NA
 2 2021-07-06     5.13     2.07 0.02%    场内买入 场内卖出 NA
 3 2021-07-05     5.13     2.07 0.09%    场内买入 场内卖出 NA
 4 2021-07-02     5.12     2.07 -2.82%   场内买入 场内卖出 NA
 5 2021-07-01     5.27     2.13 0.09%    场内买入 场内卖出 NA
 6 2021-06-30     5.26     2.12 0.69%    场内买入 场内卖出 NA
 7 2021-06-29     5.23     2.11 -1.10%   场内买入 场内卖出 NA
 8 2021-06-28     5.29     2.13 0.22%    场内买入 场内卖出 NA
 9 2021-06-25     5.28     2.13 1.70%    场内买入 场内卖出 NA
10 2021-06-24     5.19     2.10 0.19%    场内买入 场内卖出 NA
11 2021-06-23     5.18     2.09 0.52%    场内买入 场内卖出 NA
12 2021-06-22     5.15     2.08 0.63%    场内买入 场内卖出 NA
13 2021-06-21     5.12     2.07 -0.25%   场内买入 场内卖出 NA
14 2021-06-18     5.13     2.07 0.05%    场内买入 场内卖出 NA
15 2021-06-17     5.13     2.07 0.47%    场内买入 场内卖出 NA
16 2021-06-16     5.10     2.06 -1.63%   场内买入 场内卖出 NA
17 2021-06-15     5.19     2.10 -1.12%   场内买入 场内卖出 NA
18 2021-06-11     5.25     2.12 -0.81%   场内买入 场内卖出 NA
19 2021-06-10     5.29     2.13 0.69%    场内买入 场内卖出 NA
20 2021-06-09     5.25     2.12 0.11%    场内买入 场内卖出 NA
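
The column names are in Chinese: 净值日期 = NAV date, 单位净值 = unit NAV, 累计净值 = cumulative NAV, 日增长率 = daily growth rate, 申购状态 = subscription status, 赎回状态 = redemption status, 分红送配 = dividends/distributions. The repeated cell values 场内买入 / 场内卖出 mean exchange buy / exchange sell. If you prefer English names, here is a small optional sketch (the English names are my own suggestion, and the column order is assumed to match the output above):

# Rename the scraped columns; adjust if the site changes its layout
table_page2 <- setNames(
  table_page2,
  c("nav_date", "unit_nav", "cumulative_nav", "daily_growth",
    "subscription_status", "redemption_status", "dividend")
)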

Scrape the full table

The mini tutorial shows all the steps needed to scrape the data from a specific page of the table. Now we will package all these steps into a function and automate the scraping of all pages.

# Find total number of pages ----

div_page_btns <- remote_driver$findElements(using = "css", value = "div.pagebtns")

# The page buttons are <label value='n'> elements; the highest numeric
# label text is the total number of pages
n_pages <- div_page_btns[[1]]$findChildElements(using = "css", value = "label[value]") %>%
  map_chr(~ unlist(.x$getElementText())) %>%
  as.numeric() %>%
  max(na.rm = TRUE)

# Create function (it uses all the steps in the mini tutorial) ----

scrape_table_page <- function(page){
  
  message(glue::glue("Scraping data on page {page}."))
  
  page_btn <- remote_driver$findElement(using = "css", value = glue::glue("div.pagebtns > label[value = '{page}']"))
  remote_driver$mouseMoveToLocation(webElement = page_btn)
  page_btn$click()
  
  Sys.sleep(1) # Give browser a second to load the data on the new page
  
  table_el <- remote_driver$findElement(using = "css", value = "#jztable")
  
  table_el$getElementAttribute("innerHTML") %>%
    .[[1]] %>%
    read_html() %>%
    html_table() %>%
    .[[1]]
  
}

Now we can apply the function to all pages:

mydata <- map_dfr(seq_len(n_pages), scrape_table_page)
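
If a page occasionally fails to load, a slightly more defensive variant (my own addition, not part of the core answer) skips failing pages instead of aborting the whole run:

# possibly() returns NULL for pages that error, and map_dfr()
# simply ignores NULL results when row-binding
safe_scrape <- purrr::possibly(scrape_table_page, otherwise = NULL)
mydata <- purrr::map_dfr(seq_len(n_pages), safe_scrape)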

Let me know if you have questions.

Never ask if we have questions!
So here they come:

  • Your mini tutorial just shows page 1???
  • In the full-table part you introduce client, where the mini tutorial used remote_driver.
  • After changing that, the full-table part works fine, provided you can see the page numbers on your Firefox screen.
    Is it possible to force this with an RSelenium command? Otherwise you have to do this manipulation manually; if not, the code runs into an error:
Error: 	 Summary: MoveTargetOutOfBounds
 	 Detail: Target provided for a move action is out of bounds.
 	 class: org.openqa.selenium.interactions.MoveTargetOutOfBoundsException
	 Further Details: run errorDetails method

This may sound a bit critical, but it is the first example I have encountered of really navigating such a series of table pages, so your example is very much appreciated. Where did you find the commands that you used?

Here are your answers, @HanOostdijk:

  1. The mini tutorial scrapes page 2 instead. I did that on purpose, just to show how to manipulate the browser with code (i.e. clicking on the page 2 button and so on). This is exactly what is done in the function.

  2. You are right; it took me a long time to put everything together and I did not clean up the code entirely before posting (I was tired). Good catch! I just made the necessary changes in my response.

  3. Think about it this way: if you cannot see a page number on your Firefox screen, then you cannot click on it (with your mouse), so that is not something you can do with RSelenium either. However, you can reload the webpage, which takes the table back to its first page (remote_driver$navigate(url); see the sketch after this list). But... the mini tutorial was just a way to provide a detailed guide on what happens inside the function. You don't have to run the mini tutorial code before scraping the full table: you can just connect to the Selenium server (shown before the mini tutorial) and then scrape the table with the custom function I provided.

  4. I found them in several places online (forums, Stackoverflow, ...). This video really helped me a lot as well: RSelenium test - understat.com Ligue 1 - YouTube
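
For completeness, here is the reset mentioned in point 3 as a small sketch (the Sys.sleep() pause is my own addition, to give the JavaScript table time to render):

# Reload the page; the table starts over at page 1
remote_driver$navigate(url)
Sys.sleep(2)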

Thank you! This is great, though I do need some time to digest all of this...

Aha, take your time! I tried to make the response as detailed as I could, with a mini tutorial, @Krim.
