Web Scraping PHP tables

I am back. Sorry, I was busy with other tasks at my job. I was not able to get positive results. I will attach some images so you can see what I did. I don't have the tictoc package installed, so maybe that is part of the problem. The log and t files were downloaded; the only thing is that the t file repeated the first page over and over. Could you check and tell me about it? I will also attach the t and log files.

Thanks for your persistence, this seems incredibly frustrating! 🤨 I have a version in mind that tests whether a page has loaded before moving on, and I will post it here as soon as I have time to write it up and test it.

I really appreciate it... But let me know if I have to get the tictoc package, because I would have to update my R version.

That's just a timer for me, so you shouldn't need it. Meant to cut that line before posting!
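(For what it's worth, tictoc is nothing but a convenience timer; here's a minimal sketch of how it's used, unrelated to the scraping itself:)

library(tictoc)

tic("scrape")  # start a named timer
Sys.sleep(1)   # stand-in for the work being timed
toc()          # prints the elapsed time, e.g. "scrape: 1 sec elapsed"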

Ok. That's better for me.

Good morning... any progress on this topic? Please let me know. I really appreciate it...

No, I'm sorry, I haven't been able to look at it unfortunately. Work has been pretty busy lately. But it is on my bookmarks, and I'm hoping to get to it soon when I have some time!

Hey, getting back here. Please let me know if you could help me. I really appreciate it and need it. Thanks in advance.

Yes, thanks for checking in! I'm hoping to work on this a bit this afternoon, hopefully I'll have an update soon!

Okay, I think I have something safer to run! I don't know if this will solve whatever problems were occurring before, but it should at least be certain to not repeat a page. Hope it helps!

This time, I left the call to view() in there, so that you can watch what's happening as it goes. While the code itself ran fine every time, on one run the page just wasn't loading, so you should be able to see that if it happens.

library(rvest)
library(chromote)
library(tidyverse)

# Scrapes the table in the website's current status
scrape_table <- function(chromote_obj) {
  chromote_obj$Runtime$evaluate('document.querySelector("#sc-ui-grid-body-c4716e4a").outerHTML')$result$value %>% 
    read_html() %>% 
    html_nodes("#sc-ui-grid-body-c4716e4a") %>% 
    html_table()
}

# Clicks through to the next page of the table. The `repeat` block waits to make
# sure the new page has actually loaded
click_next <- function(chromote_obj) {
  cur_nav <- nav_text(chromote_obj)
  cur_null <- null_loc(chromote_obj)
  
  js_click <- '$("#sc_grid_toobar_bot > table > tbody > tr > td:nth-child(2) > a:nth-child(8)")[0].click()'
  chromote_obj$Runtime$evaluate(js_click)
  
  repeat {
    # identical() because null_loc() may not return a length-one vector
    if (cur_nav != nav_text(chromote_obj) || !identical(cur_null, null_loc(chromote_obj))) {
      break
    }
    Sys.sleep(0.1)  # brief pause to avoid a tight busy-wait on the browser
  }
}

# Checks whether the "next" button is enabled or disabled
next_enabled <- function(chromote_obj) {
  img_html <- chromote_obj$
    Runtime$
    evaluate('$("#sc_grid_toobar_bot > table > tbody > tr > td:nth-child(2) > a:nth-child(8)")[0].innerHTML')$
    result$
    value
  str_detect(img_html, "enabled")
}


# Get the page's bottom text
nav_text <- function(chromote_obj) {
  chromote_obj$
    Runtime$
    evaluate('$("#sc_grid_toobar_bot > table > tbody > tr > td:nth-child(2)")[0].innerText')$
    result$
    value
}

# Get the page's bottom HTML
nav_html <- function(chromote_obj, child) {
  chromote_obj$
    Runtime$
    evaluate(str_glue('$("#sc_grid_toobar_bot > table > tbody > tr > td:nth-child(2) > a:nth-child({child})")[0].innerHTML'))$
    result$
    value
}

# Get the index of the null entry (current page marker)
null_loc <- function(chromote_obj) {
  map(1:9, ~nav_html(chromote_obj, .)) %>% 
    set_names(1:9) %>% 
    keep(is.null) %>% 
    names()
}


# Initialize a `chromote` session
consult <- ChromoteSession$new()

# Show the session (Not necessary for running, but it shows what's happening)
consult$view()

# Navigate to the page
message(consult$Page$navigate("http://www.css.gob.pa/p/grid_defensoria/"))
Sys.sleep(30)

# Set the number of records to 50
message(consult$Runtime$evaluate('document.querySelector("#quant_linhas_f0_bot").value = 50'))
Sys.sleep(10)
message(consult$Runtime$evaluate('document.querySelector("#quant_linhas_f0_bot").dispatchEvent(new Event("change"))'))
Sys.sleep(10)

# Initialize a tibble to store results, a page counter, and an empty log file
t <- tibble()
i <- 0
cat("", file = "log.csv", append = FALSE)

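# Scrape the first two pages and advance past them before entering the loop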
t <- bind_rows(t, scrape_table(consult))
click_next(consult)

t <- bind_rows(t, scrape_table(consult))
click_next(consult)

# While the "next" button is clickable, scrape the table and click the "next"
# button. Wait a few seconds between requests to be polite
while (next_enabled(consult)) {
  t <- bind_rows(t, scrape_table(consult))
  click_next(consult)
  i <- i + 1
  cat(i, ",",
      consult$
        Runtime$
        evaluate('$("#sc_grid_toobar_bot > table > tbody > tr > td:nth-child(2)")[0].innerText')$
        result$
        value,
      ",",
      consult$
        Runtime$
        evaluate('$("#sc_grid_toobar_bot > table > tbody > tr > td:nth-child(2) > a:nth-child(8)")[0].innerHTML')$
        result$
        value,
      "\n",
      sep = "", file = "log.csv", append = TRUE)
  write_csv(t, "t.csv")
  Sys.sleep(4)
}

# Scrape the table on the last page, then save the final result (the loop's
# write_csv() only runs before this last scrape)
t <- bind_rows(t, scrape_table(consult))
write_csv(t, "t.csv")

Ok, thanks very much. I will run it tonight to see how it works, and I will for sure let you know how it went. Thanks again!

I ran it and it seems to have worked. Anyhow, I will check it tonight. Thanks very much.

I did run the code and it almost completed the process. Somehow it stopped running at the 60k+ row mark out of 70k. I checked the DevTools page and it seems there were errors on it (see the bottom right). I made a capture to show it here in case it helps. I also attach a screen capture of the t.csv file. I don't know, in the end, if there could be more straightforward code to get this done more easily. I really appreciate your help on this matter. Any further advice will be very welcome. Thanks again.

Okay, thanks for the update. That's really encouraging that it almost completed! I don't know much about POST requests, but from the little I do know, that would be a glitch in the site itself. So would you mind running it one more time? Perhaps, to be safe, even make that last Sys.sleep() pause for 5 or 6 seconds.

For the record, I don't know if increasing the wait time will make any difference, but it just seems safer if the error that happened was on the site's side.
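Concretely, that would just mean changing the last line inside the while loop, something like:

  # was Sys.sleep(4); a longer pause is gentler on the server
  Sys.sleep(6)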

Sure, I will run it tonight and let you know. The other thing I want to ask you is whether there is an option to pick up from where it stopped. That could help a lot too. I'll let you know tomorrow. Thanks again for your time and effort.

Regards,

Olmedo

That's such a good idea, thanks for suggesting it! I can work on that if it fails again tonight.

Hi again. I ran the code last night and unfortunately it stopped after processing about 10,900 rows. However, the R session didn't stop; I had to interrupt it. Please check the option I mentioned yesterday, modifying the code so it can pick up (I don't know what to call it) and continue from where it left off. Thanks again.

Hey, good morning. Have you had the chance to check the code so we can finish it? I really appreciate it.

Hey, I haven't had a chance to look over it, but hopefully soon! I'm not sure yet about how to add an option to pick up at a certain point, but I'll see what I can do.
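One rough, untested idea for that, assuming the t.csv saved so far is intact and the table is still showing 50 rows per page: read t.csv back in, work out how many full pages it already covers, and click past them before re-entering the while loop.

# Untested sketch of a "resume" option. Assumes t.csv holds the rows
# scraped so far and that the table is showing 50 rows per page.
t <- read_csv("t.csv")
pages_done <- floor(nrow(t) / 50)

# Skip ahead: click "next" once per page already scraped, without scraping
for (p in seq_len(pages_done)) {
  click_next(consult)
  Sys.sleep(1)
}

# ...then re-run the while loop from the main script to continue appending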
