Scraping xlsx tables

Hi everyone.
I'm new at rvest and web scraping data. I've been trying automate data colect retrieved at http://www6.sistemaindustria.org.br/gpc/externo/listaResultados.faces?codPesquisa=100.
The data is in excel format and we download it at the page.
What I have been done is to save all the table nodes in a object named "tables" (see bellow) but I can't find where the desired informations (excel tables) are retrieved. They must be inside the "table nodes" object but after investigating carefully I could not find them.
Does anyone coul gently help me?

#Code:

URL base

www <- "http://www6.sistemaindustria.org.br/gpc/externo/listaResultados.faces?codPesquisa=100."

pg <- rvest::read_html(www)
tables <- pg%>%html_nodes('table')

1 Like

Hi @IVS , you need check the specific node or xpath for better download the table.

The final table show 178 rows, the same of the page.

library(rvest)
library(tidyverse)
library(stringr)
url_data <- "http://www6.sistemaindustria.org.br/gpc/externo/listaResultados.faces?codPesquisa=100"

## Read URL, and select the node. In this case xpath of table.
url_data2 <- url_data %>%
  read_html() %>% 
  html_nodes(xpath='//*[@id="listaResultadoForm:dtResult:n"]') %>% 
  html_table()

url_data2 <- data.frame(url_data2)

url_data2 <- url_data2[ -1, ] # delete first row

names(url_data2)[1] <- 'Pesquisa' 
names(url_data2)[2] <- 'Variabel' 
names(url_data2)[3] <- 'Segmentacion'
names(url_data2)[4] <- 'Tema'
names(url_data2)[5] <- 'Tipo Indice' 
names(url_data2)[6] <- 'Pond' 
names(url_data2)[7] <- 'Defl' 
names(url_data2)[8] <- 'Dess'
names(url_data2)[9] <- 'Intervalo'

Dear @M_AcostaCH , thank you for the answer. Actually I've been trying to downlad some data bases that are inside of page not that table itself (please, see image attached).
In fact, the table you showed me I was able to download without any problems.
I have no idea how to automate this data download. Any help will be appretiated.

1 Like

Maybe with Rselenium is better. Im going to find some help for this question.

1 Like

Maybe RSelenium is an option:

But you need install a specific java version and chrome drivers.

Im use xpath for make click in a element. In the page the xpath change in numbers Result:0:checkValores for select other items. You could replace this number if you want others.

library(RSelenium)

# open cmd for set the drivers folders.
# run .jar file by cmd
# java -jar selenium-server-standalone-3.2.0.jar -port 4444

# Make conection--------------------------------------------------
remDr <- remoteDriver(remoteServerAddr = "localhost", port = 4444, browserName = "chrome")
remDr$open()
Sys.sleep(5)


url_data <- "http://www6.sistemaindustria.org.br/gpc/externo/listaResultados.faces?codPesquisa=100"

remDr$navigate(url = "http://www6.sistemaindustria.org.br/gpc/externo/listaResultados.faces?codPesquisa=100")


# put click in [Valores] -----------------------------------------------------
Sys.sleep(1)
remDr$findElement(using = 'xpath', "//*[@id='listaResultadoForm:dtResult:0:checkValores']")$clickElement()

# put click in [Valores2] -----------------------------------------------------
Sys.sleep(1)
remDr$findElement(using = 'xpath', "//*[@id='listaResultadoForm:dtResult:1:checkValores']")$clickElement()

# download the select files                                   
Sys.sleep(1)
remDr$findElement(using = 'xpath', "//*[@id='listaResultadoForm:btnExportarExcel2']")$clickElement()

1 Like

@M_AcostaCH This is really great. Your code checks the checkboxes before finally pressing the "Export" button. However, the user still needs to interact manually with the browser by pressing the "OK" button on the download popup. Do you know if there is any way to do this programmatically?

1 Like

The user should not interact since after selecting the boxes to download, it tells Rselenium that it can download the selected files. In the last part of the script download the select files this is done automatically.

1 Like

@M_AcostaCH , thank you so much. Your solution was very helpful and clear.
Once again, thanks.

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.