Cleaning up scraping code

I have this code:

library(rvest)

# Page 1: read the page and keep its first table
url <- "http://www.example.com/ranking/unit?pagina=1"
url_html <- read_html(url)
whole_table <- url_html %>%
  html_nodes('table') %>%
  html_table(fill = TRUE) %>%
  .[[1]]

# Page 2: the same steps with only the URL changed
url2 <- "http://www.example.com/ranking/unit?pagina=2"
url_html2 <- read_html(url2)
whole_table2 <- url_html2 %>%
  html_nodes('table') %>%
  html_table(fill = TRUE) %>%
  .[[1]]

I run this code up to a dozen times, changing the URL each time, and finally join the different whole_table objects into a single data frame. I wonder if there's a more elegant solution that avoids repeating these six lines of code twelve times with different numbering.
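For reference, the final join is just stacking the tables, something like this (base rbind shown; dplyr::bind_rows would also work, assuming every table has the same columns):

# Stack the per-page tables into one data frame
final_table <- rbind(whole_table, whole_table2)  # ... and so on for the remaining tables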

Try this on a site that actually has pages in this form:

library(rvest)
paginas <- 1:2
baseurl <- "http://www.example.com/ranking/unit?pagina="
mk_url  <- function(x) paste0(baseurl, x)    # build the URL for page x
get_url <- function(x) read_html(mk_url(x))  # fetch and parse page x
sapply(paginas, get_url)
#> Error in open.connection(x, "rb"): HTTP error 404.

Created on 2023-03-14 with reprex v2.0.2

Thanks technocrat, but I only get this:

sapply(paginas,get_url)
     [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13] [,14]
node ?    ?    ?    ?    ?    ?    ?    ?    ?    ?     ?     ?     ?     ?    
doc  ?    ?    ?    ?    ?    ?    ?    ?    ?    ?     ?     ?     ?     ?   

I added more pages, 1:14, which is why there are fourteen columns in the output.
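That output is sapply() simplifying the list of xml_document objects into a 2 x 14 matrix of external pointers (the node and doc slots of each document), so the pages were read, just in an awkward shape. A minimal sketch of carrying the same idea through to a single data frame, assuming each page's first table has the same columns as in the original code (lapply() keeps the pages in a plain list instead of a matrix):

library(rvest)

paginas <- 1:14
baseurl <- "http://www.example.com/ranking/unit?pagina="

# Scrape one page: build its URL, read it, and keep the first table
get_table <- function(x) {
  read_html(paste0(baseurl, x)) %>%
    html_nodes('table') %>%
    html_table(fill = TRUE) %>%
    .[[1]]
}

# lapply() returns a plain list of data frames (no matrix simplification)
tables <- lapply(paginas, get_table)

# Stack the per-page tables into the final data frame
final_table <- do.call(rbind, tables)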
