Web Scraping PHP tables

Hello everyone. I am new to R and am trying to scrape a data table from the following site:

http://www.css.gob.pa/p/grid_defensoria/

I want to download or access it, clean it, and save it as CSV, XML, etc. I have tried the following options, but neither did what I need.

#1

content <- read_html("http://www.css.gob.pa/p/grid_defensoria/")
table <- content %>% html_table(fill= TRUE)

#2
fileurl_CSS <- "http://www.css.gob.pa/p/grid_defensoria/"
planilla_CSS <- readHTMLTable(fileurl_CSS, header = TRUE, which = 2, stringsAsFactors = TRUE)

Thanks for your help.

I've been loving {chromote} for web scraping projects lately, so here's what I would do. Let me know if this works for you, or if you'd like any more explanation about the different pieces.

library(rvest)
library(chromote)
library(tidyverse)
library(tictoc)

# Scrapes the table in the website's current status
scrape_table <- function(chromote_obj) {
  chromote_obj$Runtime$evaluate('document.querySelector("#sc-ui-grid-body-c4716e4a").outerHTML')$result$value %>% 
    read_html() %>% 
    html_nodes("#sc-ui-grid-body-c4716e4a") %>% 
    html_table()
}

# Clicks through to the next page of the table
click_next <- function(chromote_obj) {
  js_click <- '$("#sc_grid_toobar_bot > table > tbody > tr > td:nth-child(2) > a:nth-child(8)")[0].click()'
  chromote_obj$Runtime$evaluate(js_click)
}

# Checks whether the "next" button is enabled or disabled
next_enabled <- function(chromote_obj) {
  img_html <- chromote_obj$
    Runtime$
    evaluate('$("#sc_grid_toobar_bot > table > tbody > tr > td:nth-child(2) > a:nth-child(8)")[0].innerHTML')$
    result$
    value
  str_detect(img_html, "enabled")
}

# Initialize a `chromote` session
consult <- ChromoteSession$new()

# Show the session (Not necessary for running, but it shows what's happening)
consult$view()

# Navigate to the page
consult$Page$navigate("http://www.css.gob.pa/p/grid_defensoria/")

# Initialize a tibble to store results
t <- tibble()
i <- 0

# While the next button is clickable, scrape the table and click the "next"
# button. Wait 4 seconds between requests to be polite
while (next_enabled(consult)) {
  t <- bind_rows(t, scrape_table(consult))
  click_next(consult)
  Sys.sleep(4)
  
  ################ I didn't actually want to run it for all 3500 pages of the
  ################ table, so these lines break after 5 iterations. When you want
  ################ to scrape everything, remove these 2 lines.
  i <- i + 1
  if (i > 4) break
  ################
}

# Scrape the table on the last page
t <- bind_rows(t, scrape_table(consult))

Created on 2021-05-18 by the reprex package (v2.0.0)

Caveat that I am very new to web scraping, and this is not production-grade code. The 4 second pause was more than enough time to wait when I was on the site earlier, but if your connection is slower, or the site is moving more slowly, you could end up scraping the same page multiple times.
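If the same page does get scraped more than once, the repeated rows can be dropped afterwards. Here is a minimal sketch in base R, assuming that a repeated page shows up as fully identical rows. The data frame `scraped` is a made-up stand-in for `t`, not real data from the site. Note that this would also drop rows that are legitimately identical in the real table, so it is a cleanup step rather than a guarantee.

```r
# Hypothetical stand-in for the scraped tibble `t`: rows 3 and 4 appear twice,
# as they would if one page had been scraped two times in a row
scraped <- data.frame(
  id   = c(1, 2, 3, 4, 3, 4),
  name = c("a", "b", "c", "d", "c", "d"),
  stringsAsFactors = FALSE
)

# Keep only the first copy of each fully identical row
deduped <- scraped[!duplicated(scraped), ]

# Number of duplicate rows that were dropped
n_dropped <- nrow(scraped) - nrow(deduped)
```

With the tidyverse loaded, `dplyr::distinct(t)` should do the same job on the real tibble.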

Ok. Thanks for your response. I will try it and let you know how it goes. Thanks again.

Ok. I understand. Let me try it and let you know.

An error came up. Remember that I am very new to R and still learning...

# Initialize a `chromote` session

consult <- ChromoteSession$new()
Error: object 'ChromoteSession' not found

Please let me know if something is missing...

I think I didn't have the chromote package installed. Let me try this and let you know.
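A quick way to confirm this kind of problem is to check which of the packages the script loads are actually installed. This is only a sketch: `missing_packages()` is a hypothetical helper, not part of any package, and it checks for an installed copy without loading anything.

```r
# Hypothetical helper: returns the names in `pkgs` that are not installed.
# system.file() returns "" when a package cannot be found.
missing_packages <- function(pkgs) {
  installed <- vapply(
    pkgs,
    function(p) nzchar(system.file(package = p)),
    logical(1)
  )
  pkgs[!installed]
}

# Packages loaded at the top of the scraping script
needed <- c("rvest", "chromote", "tidyverse", "tictoc")
missing_packages(needed)
```

Anything it returns needs to be installed before `library()` will work. For example, an uninstalled {chromote} makes `library(chromote)` fail, which then leads to the "object 'ChromoteSession' not found" error.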

My fear was correct. I didn't have the chromote package installed, and now I am having trouble installing it. Another error came up:

Error: Failed to install 'chromote' from GitHub:
(converted from warning) cannot remove prior installation of package ‘rlang’

If you could guide me in solving this, I would appreciate it...

Oh, this is a common pain point, unfortunately. I don't actually remember how I solved it, but try some of these:

  • Reset your R session (Ctrl + Shift + F10 in RStudio IDE) and try to reinstall {chromote}
  • If that doesn't work, reinstall {rlang} with install.packages("rlang") and try to reinstall {chromote}
  • If that still doesn't work, install the dev version of {rlang} with remotes::install_github("r-lib/rlang") and try to reinstall {chromote}

If none of these works, I know of a couple people I could contact who might be able to help, but one of these should do it. I'm sorry you're having trouble with this, but know that you are very much not alone, and it is fixable!

Good morning... Thanks for the help... I restarted R, reinstalled both packages, and it worked fine. The only thing is that I didn't get results from the last part of the code. I left R running all night and the last piece of the code never finished. I had to interrupt it and only got a 21 x 15 tibble instead of the long one (37k+ x 15) that I was supposed to have at the end.

I appreciate your help again. Have a nice one!

How frustrating! I'm sorry it never finished, I'm not sure why that would be. Did you see what the state of the webpage was when you interrupted?

It stopped, I think at page 397 (it was highlighted in yellow). There was a message saying that DevTools was interrupted. I lost the screenshot I made, so I can't show it. Is there a way to show 50 rows per page so there are fewer pages to go through? Would that make the process shorter? Could the 4-second wait be shorter? Thank you very much for your attention to this matter. See the following screen capture of the CSS site.

If you add this, it will do 50 records per page:

# Navigate to the page
consult$Page$navigate("http://www.css.gob.pa/p/grid_defensoria/")

################################
# New section
# Set the number of records to 50
consult$Runtime$evaluate('document.querySelector("#quant_linhas_f0_bot").value = 50')
consult$Runtime$evaluate('document.querySelector("#quant_linhas_f0_bot").dispatchEvent(new Event("change"))')
################################

# Initialize a tibble to store results
t <- tibble()

That will cut down the time quite a bit, so hopefully it helps. I would have expected the prior version to finish after running all night, but :man_shrugging:
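To put rough numbers on that, here is a back-of-the-envelope sketch. It assumes about 37,000 rows in total (the figure mentioned above) and counts only the `Sys.sleep()` time, ignoring how long each page takes to load. `estimate_minutes()` is a made-up helper for illustration.

```r
# Hypothetical helper: minutes spent sleeping for a full run
estimate_minutes <- function(n_rows, rows_per_page, wait_seconds) {
  n_pages <- ceiling(n_rows / rows_per_page)  # number of pages to click through
  n_pages * wait_seconds / 60                 # total sleep time in minutes
}

# Assumed ~37,000 rows, 50 rows per page, 3 seconds per page: 740 pages
estimate_minutes(37000, 50, 3)   # 37 minutes
```

Lowering the wait time helps much less than raising the number of rows per page, since the page count is what dominates.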

You could probably cut the wait time down to 3 seconds. Maaaaaaaybe 2, but at that point I think you really risk duplicating data you've pulled and risk putting too much traffic on their server, which is rude and could get you blocked. If I knew more javascript I could write something that waits until the site responds, which is the safest, but alas, I don't.
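For what it's worth, the "wait until the site responds" idea can also be approximated from the R side by polling rather than with more JavaScript. The sketch below is untested against the real site. `wait_until_changed()` is a hypothetical helper that repeatedly calls a function you supply and returns as soon as its value differs from the old one, or gives up after a timeout.

```r
# Hypothetical helper: polls `get_value()` until it no longer equals
# `old_value`. Returns TRUE on a change, FALSE if `timeout` seconds pass first.
wait_until_changed <- function(get_value, old_value, timeout = 30, interval = 0.5) {
  deadline <- Sys.time() + timeout
  repeat {
    if (!identical(get_value(), old_value)) return(TRUE)
    if (Sys.time() >= deadline) return(FALSE)
    Sys.sleep(interval)
  }
}

# Offline demo with a fake getter that "changes page" on its third call
calls <- 0
fake_pager <- function() {
  calls <<- calls + 1
  if (calls < 3) "page 1" else "page 2"
}
changed <- wait_until_changed(fake_pager, "page 1", timeout = 5, interval = 0.01)

# With chromote, the getter could read the pager text (assumption: that text
# changes whenever the page changes), e.g.
#   get_pager <- function() consult$Runtime$evaluate(
#     '$("#sc_grid_toobar_bot > table > tbody > tr > td:nth-child(2)")[0].innerText'
#   )$result$value
#   old <- get_pager(); click_next(consult); wait_until_changed(get_pager, old)
```

A fixed polite pause can still be kept on top of this so the server is not hit as fast as it can respond.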

Put this in for the loop section, and it will write out a log as it goes. It also adjusts the wait time to 3 seconds:

# Initialize a tibble to store results
t <- tibble()
i <- 0
cat("", file = "log.csv", append = FALSE)

# While the next button is clickable, scrape the table and click the "next"
# button. Wait 3 seconds between requests to be polite
while (next_enabled(consult)) {
  t <- bind_rows(t, scrape_table(consult))
  click_next(consult)
  Sys.sleep(3)
  i <- i + 1
  cat(i, ",", consult$Runtime$evaluate('$("#sc_grid_toobar_bot > table > tbody > tr > td:nth-child(2)")[0].innerText')$result$value, "\n",
      sep = "", file = "log.csv", append = TRUE)
}

Ok. I'll try it tonight and let you know how it goes. Thanks again.

Oh, a couple other things: For running it overnight, you could cut these lines. They may slow things down, and the error about DevTools makes me wonder if it's related to viewing it.

# Show the session (Not necessary for running, but it shows what's happening)
consult$view()

And you're probably already doing this, but it might cause issues if your computer goes to sleep mid-run. Not really sure about that one.

Good morning. I tried your modifications to the code and, unfortunately, after running all night it produced no results. I don't know if I did something wrong. I would appreciate it if you could check again, please.

Well, that sounds frustrating. I'm sorry this has been such a thorny problem. :face_with_raised_eyebrow: Would you be able to upload the file "log.csv" that should have been created in your working directory when you ran it?

Also, did t have any records in it, or was it still an empty tibble?

I am trying to run the full script right now and will update with any changes.

I am attaching two images of the "log.csv" file, head and tail, so you can see them. t did not have any records at all.


Thank you, those are helpful! So the first thing I note is the number of duplicate rows in column B. That means that something is hanging up while trying to click through to the next page. Are you able to attach to a wired internet connection, like ethernet? If you are on Wi-Fi, that could make it run more slowly.
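If you want to count those repeats rather than eyeball them, here is a small sketch. It assumes the second column of the log holds the pager text, and that two consecutive identical values mean the click did not move to a new page. `count_stalls()` and the example values are made up for illustration.

```r
# Hypothetical helper: counts entries equal to the entry right before them
count_stalls <- function(pager) {
  n <- length(pager)
  if (n < 2) return(0L)
  sum(pager[-1] == pager[-n])
}

# Made-up pager values standing in for column B of log.csv
example_pager <- c("1 2 3", "1 2 3", "2 3 4", "2 3 4", "2 3 4", "3 4 5")
count_stalls(example_pager)   # 3

# For the real file, something like this may work, assuming it parses cleanly:
#   log_df <- read.csv("log.csv", header = FALSE, stringsAsFactors = FALSE)
#   count_stalls(log_df[[2]])
```

A count close to the number of log rows would mean the page almost never advanced.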

I've also edited the code to give some more detailed output. It also does wait longer, but this is necessary in order to make sure you're not scraping the same page multiple times. I will work on a version that is more stable and simply waits until the page has actually changed before moving on. When I ran the prior version, it completed in ~35 minutes (I have pretty fast internet). Run this for 15-20 minutes, then kill the R session and post log.csv and t.csv here. Hopefully that will help us iterate a little faster and get this problem solved for you!

library(rvest)
library(chromote)
library(tidyverse)
library(tictoc)

# Scrapes the table in the website's current status
scrape_table <- function(chromote_obj) {
    chromote_obj$Runtime$evaluate('document.querySelector("#sc-ui-grid-body-c4716e4a").outerHTML')$result$value %>% 
        read_html() %>% 
        html_nodes("#sc-ui-grid-body-c4716e4a") %>% 
        html_table()
}

# Clicks through to the next page of the table
click_next <- function(chromote_obj) {
    js_click <- '$("#sc_grid_toobar_bot > table > tbody > tr > td:nth-child(2) > a:nth-child(8)")[0].click()'
    chromote_obj$Runtime$evaluate(js_click)
}

# Checks whether the "next" button is enabled or disabled
next_enabled <- function(chromote_obj) {
    img_html <- chromote_obj$
        Runtime$
        evaluate('$("#sc_grid_toobar_bot > table > tbody > tr > td:nth-child(2) > a:nth-child(8)")[0].innerHTML')$
        result$
        value
    str_detect(img_html, "enabled")
}

# Initialize a `chromote` session
consult <- ChromoteSession$new()

# Show the session (Not necessary for running, but it shows what's happening)
# consult$view()

# Navigate to the page
message(consult$Page$navigate("http://www.css.gob.pa/p/grid_defensoria/"))
Sys.sleep(3)

# Set the number of records to 50
message(consult$Runtime$evaluate('document.querySelector("#quant_linhas_f0_bot").value = 50'))
Sys.sleep(3)
message(consult$Runtime$evaluate('document.querySelector("#quant_linhas_f0_bot").dispatchEvent(new Event("change"))'))
Sys.sleep(3)

# Initialize a tibble to store results
t <- tibble()
i <- 0
cat("", file = "log.csv", append = FALSE)

# While the next button is clickable, scrape the table and click the "next"
# button. Wait 4 seconds before and 10 seconds after each click to be polite
while (next_enabled(consult)) {
  t <- bind_rows(t, scrape_table(consult))
  Sys.sleep(4)
  click_next(consult)
  Sys.sleep(10)
  i <- i + 1
  cat(i, ",",
      consult$
        Runtime$
        evaluate('$("#sc_grid_toobar_bot > table > tbody > tr > td:nth-child(2)")[0].innerText')$
        result$
        value,
      ",",
      consult$
        Runtime$
        evaluate('$("#sc_grid_toobar_bot > table > tbody > tr > td:nth-child(2) > a:nth-child(8)")[0].innerHTML')$
        result$
        value,
      "\n",
      sep = "", file = "log.csv", append = TRUE)
  write_csv(t, "t.csv")
}

# Scrape the table on the last page
t <- bind_rows(t, scrape_table(consult))

Ok. I will try it tonight, and I am going to plug my laptop directly into my switch. I am also increasing my internet bandwidth to 250 Mbps. It should work better. I will let you know tomorrow how it went. Thanks again for your time and dedication to this.
