Error using rvest to scrape websites

olivetti03 · May 9, 2022, 11:11am

Hi RStudio community,

I am trying to scrape the emails of all the deputies (MEPs) of the European Parliament, together with their names and Parliament URLs.

For this I created two functions to apply to each of the deputies' URLs.


## eurodiputados function

eurodiputados_funcion <- function(page_url){
  
  page_html <- read_html(page_url)
  
  topic_names <- page_html %>% 
    html_nodes(css = ".t-y-block") %>% 
    html_text() %>% 
    str_squish()
  
  topic_urls <- page_html %>% 
    html_nodes(css=".t-y-block") %>% 
    html_attr(name = "href")
    
  tibble(topic=topic_names, topic_url=topic_urls)
}

## emails function


scrape_mail <- function(topic_url) {
    
    topic_html <- read_html(topic_url)

      topic_html %>% 
      html_nodes(css="link_email mr-2") %>% 
      html_text() %>%
      str_squish()
      close(scrape_mail)
}

page_ulrs <- c("https://www.europarl.europa.eu/meps/es/full-list/all",paste0("https://www.europarl.europa.eu/meps/es", 0:200000))

master <- map_dfr(page_ulrs, eurodiputados_funcion) %>% 
  mutate(content = map_chr(topic_url, scrape_mail))

The problem is that I get this error that I can't fix.

no loop for break/next, jumping to top level

I am stuck

thanks

nirgrahamuk · May 9, 2022, 12:01pm

What is this close function? Will it return a value that you would want to be the return value of your scrape_mail function? And what is the scrape_mail argument being passed into close()? Is it a self-reference to the function being defined? It doesn't seem right at all.
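
To illustrate the point about return values, here is a minimal sketch with hypothetical helper names (they are not from the original code). An R function returns the value of its last evaluated expression, so anything placed after the scraping pipeline replaces the scraped text as the result.

```r
# Hypothetical stand-ins for scrape_mail(); no network access needed.

# The cleaned text is computed but then discarded, because another
# expression follows it and becomes the return value.
with_trailing_call <- function(x) {
  trimws(x)
  invisible(NULL)  # stands in for any trailing call placed after the pipeline
}

# Here the cleaned text is the last expression, so it is what gets returned.
without_trailing_call <- function(x) {
  trimws(x)
}

is.null(with_trailing_call("  someone@example.org  "))
without_trailing_call("  someone@example.org  ")
```

If a trailing call is really needed, storing the pipeline result in a variable and returning that variable at the end keeps the scraped value.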

olivetti03 · May 9, 2022, 12:13pm

Hi @nirgrahamuk,

First of all, thanks for the help.

With this call I was trying to close scrape_mail every time it finishes reading one URL. Without it I can't get any result and the code gets stuck loading.


## eurodiputados function


eurodiputados_funcion <- function(page_url){
  
  page_html <- read_html(page_url)
  
  topic_names <- page_html %>% 
    html_nodes(css = ".t-y-block") %>% 
    html_text() %>% 
    str_squish()
  
  topic_urls <- page_html %>% 
    html_nodes(css=".t-y-block") %>% 
    html_attr(name = "href")
    
  tibble(topic=topic_names, topic_url=topic_urls)
}

## emails function


scrape_mail <- function(topic_url) {
    
    topic_html <- read_html(topic_url)

      topic_html %>% 
      html_nodes(css="link_email mr-2") %>% 
      html_text() %>%
      str_squish()
}

page_ulrs <- c("https://www.europarl.europa.eu/meps/es/full-list/all",paste0("https://www.europarl.europa.eu/meps/es", 0:200000))

master <- map_dfr(page_ulrs, eurodiputados_funcion) %>% 
  mutate(content = map_chr(topic_url, scrape_mail))

The outcome is the same


no loop for break/next, jumping to top level
14. open.connection(x, "rb")
13. open(x, "rb")
12. read_xml.connection(con, encoding = encoding, ..., as_html = as_html, base_url = x, options = options)
11. read_xml.character(x, encoding = encoding, ..., as_html = TRUE, options = options)
10. read_xml(x, encoding = encoding, ..., as_html = TRUE, options = options)
9. withCallingHandlers(expr, warning = function(w) if (inherits(w, classes)) tryInvokeRestart("muffleWarning"))
8. suppressWarnings(read_xml(x, encoding = encoding, ..., as_html = TRUE, options = options))
7. read_html.default(page_url)
6. read_html(page_url)
5. .f(.x[[i]], ...)
4. map(.x, .f, ...)
3. map_dfr(page_ulrs, eurodiputados_funcion)
2. mutate(., content = map_chr(topic_url, scrape_mail))
1. map_dfr(page_ulrs, eurodiputados_funcion) %>% mutate(content = map_chr(topic_url, scrape_mail))

nirgrahamuk · May 9, 2022, 3:08pm

My advice would be to get a solution working on, let's say, the first URL before running it 200,000 times.

With that in mind, when I run it from 0:0 (rather than 0:200000), I get a clear mutate error: /news/es does not exist. This is what is being passed as the argument to scrape_mail, and there doesn't appear to be an HTML page there for read_html to read.
Maybe eurodiputados_funcion should store not only topic and topic_url but also some base URL that the topic URL extends?
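
One possible way to attach a base URL, sketched under assumptions: xml2::url_absolute() resolves relative links against the page they were scraped from. The hrefs vector here is made up for illustration; only the base page and the /news/es path come from the thread.

```r
library(xml2)

# Base page the links were scraped from (taken from the thread).
base_url <- "https://www.europarl.europa.eu/meps/es/full-list/all"

# Hypothetical hrefs: one relative (like the failing /news/es), one absolute.
hrefs <- c("/news/es", "https://www.europarl.europa.eu/meps/es/197490")

# url_absolute() resolves relative links against the base and leaves
# links that are already absolute unchanged.
full_urls <- url_absolute(hrefs, base_url)
full_urls
```

Inside eurodiputados_funcion, the same call could be applied to topic_urls with page_url as the base, so that read_html always receives a complete URL. Whether every resolved link is a page worth scraping is a separate question.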

olivetti03 · May 9, 2022, 3:57pm

Running this code I get the email of the first MEP, but the email comes out upside down.


url_eurodiputados_2 <- read_html("https://www.europarl.europa.eu/meps/es/197490/MAGDALENA_ADAMOWICZ/home")
  
primera <- url_eurodiputados_2 %>%
  html_nodes(".link_email.mr-2") %>% 
  html_attr('href') %>% 
  as.data.frame()
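
If "upside down" means the address in the href is stored reversed character by character, a small base R helper could turn it back. This is only a sketch: the example string is hypothetical, and the real href may contain other obfuscation that would need extra cleaning.

```r
# Reverse each string in a character vector, character by character.
reverse_string <- function(x) {
  vapply(
    strsplit(x, ""),
    function(chars) paste(rev(chars), collapse = ""),
    character(1)
  )
}

# Hypothetical reversed address, not taken from the real page.
obfuscated <- "gro.elpmaxe@enoemos"
reverse_string(obfuscated)
```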

It is true that when I run eurodiputados_funcion, the first 7 rows contain information with no HTML page behind it, but I can't find a way to skip these rows inside the function.

Running this code I get the URL of each MEP:

pagina_web_redes <- read_html(x ="https://www.europarl.europa.eu/meps/es/full-list/all")

urls_22 <- pagina_web_redes %>%
  html_nodes(".t-y-block")%>%
  html_attr("href")%>%
  as.data.frame()

Texto_Europarlamentarios_URL <- urls_22[-1:-7,]%>%
  as.data.frame()

It could be, but I tried with several URLs and I get the same error.

Thanks for everything, @nirgrahamuk
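
As an alternative to dropping a fixed number of rows, the links could be filtered by what they look like. The sketch below assumes that profile links contain /meps/, a two-letter language code and a numeric ID; both the pattern and the sample hrefs are assumptions, not confirmed against the real page.

```r
# Hypothetical hrefs standing in for the scraped column.
hrefs <- c(
  "/news/es",
  "https://www.europarl.europa.eu/meps/es/home",
  "https://www.europarl.europa.eu/meps/es/197490",
  "https://www.europarl.europa.eu/meps/es/197491"
)

# Assumed pattern: /meps/<two-letter language>/<numeric id>.
is_profile <- grepl("/meps/[a-z]{2}/[0-9]+", hrefs)

# Keep only the links that look like MEP profiles.
profile_urls <- hrefs[is_profile]
profile_urls
```

Filtering this way does not depend on the navigation links always being the first 7 rows, which may change if the page layout changes.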

nirgrahamuk · May 9, 2022, 3:59pm

If you have a function that sometimes works and sometimes doesn't, and you don't want the failures to stop the successes but want the run to keep going, you should look into the purrr::safely() or purrr::possibly() wrapper functions.
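
A rough sketch of how the two wrappers behave, using a made-up function (fetch_length is hypothetical and does no scraping) so it runs without network access. Both wrappers take a function and return a new function; they are not called on the data itself.

```r
library(purrr)

# Hypothetical function that fails on some inputs, standing in for a scraper.
fetch_length <- function(x) {
  if (!startsWith(x, "https://")) stop("not a URL: ", x)
  nchar(x)
}

inputs <- c("not even a url", "https://www.europarl.europa.eu/")

# safely() returns a function whose output is always list(result, error).
safe_fetch <- safely(fetch_length)
results <- map(inputs, safe_fetch)

# possibly() returns a function that substitutes a default value on error.
possibly_fetch <- possibly(fetch_length, otherwise = NA_integer_)
lengths_or_na <- map_int(inputs, possibly_fetch)
```

With safely(), the error objects are kept for inspection; with possibly(), failures are replaced by the `otherwise` value, which is convenient when the results go straight into a column.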

olivetti03 · May 9, 2022, 4:48pm

I tried before but I don't know how to do it.

## emails function


scrape_mail <- function(topic_url) {
    
      topic_html <- read_html(topic_url)
      purrr::safely(topic_url, otherwise = NULL, )

      topic_html %>% 
      html_nodes(css="link_email mr-2") %>% 
      html_attr(name = "href", default = NA_character_) 

}

page_ulrs <- c("https://www.europarl.europa.eu/", paste0("https://www.europarl.europa.eu/meps/es/", 197490:197491))

master <- map(page_ulrs, scrape_mail)

  purrr::safely(topic_url,, otherwise = NULL, )

Something like this?

olivetti03 · May 9, 2022, 10:27pm

I tried what you said but it doesn't work.

I tried to do it another way.



info_de_eurodiputados <- function(infou){
  
  result <- tryCatch({
  
  infoo <- read_html(infou) 
  
  email <- infoo %>% 
    html_nodes(".link_email .mr-2") %>% 
    html_attr("href") %>% 
    paste(., collapse = "")
  
  twitter <- infoo %>% 
    html_nodes(".link_twitt .mr-2") %>% 
    html_attr("href") %>%  
    paste(., collapse = "")
  
  youtube <- infoo %>% 
    html_nodes(".link_youtube .mr-2") %>% 
    html_attr("href") %>%     
    paste(., collapse = "")
  
  Instagram <- infoo %>% 
    html_nodes(".link_instagram .mr-2") %>% 
    html_attr("href") %>%  
    paste(., collapse = "")
  
  facebook <- infoo %>% 
    html_nodes(".link_fb .mr-2") %>% 
    html_attr("href") %>%  
    paste(., collapse = "")
  
  paginaweb <- infoo %>%
    html_nodes(".link_website") %>% 
    html_attr("href") %>%  
    paste(., collapse = "")

tibble(Correos = email, Perfiles_Twitter = twitter, Perfiles_Youtube = youtube, Perfiles_Instagram = Instagram, Perfiles_Facebook = facebook, Pagina_Web_Personal = paginaweb)

  }, error = function(e) data.frame(Correos = NA, Perfiles_Twitter = NA, Perfiles_Youtube = NA, Perfiles_Instagram = NA, Perfiles_Facebook = NA, Pagina_Web_Personal = NA))
  
  return(result)

}

result <- purrr::map_df(url_europarlamentarios, info_de_eurodiputados)

Same problem, it doesn't work for me.

olivetti03 · May 10, 2022, 8:20am

Here is my function:

 scrape_mail <- function(topic_url) {

  topic_html <- read_html(topic_url)
  purrr::safely(topic_url, otherwise = NULL, )

  topic_html %>% 
  html_nodes(css="link_email mr-2") %>% 
  html_attr(name = "href", default = NA_character_) }

Here are the pages and the master:

```{r}

page_ulrs <- c("https://www.europarl.europa.eu/", paste0("https://www.europarl.europa.eu/meps/es/", 197490:197491))

master <- map(page_ulrs, scrape_mail)

```

nirgrahamuk · May 10, 2022, 8:55am

You aren't using safely as intended; it is used to make a new function from your old function. I don't think I can improve on the examples in the documentation, so I'm loath to try, but I will show you how you might integrate it with the last code snippet you shared. The only thing is that when I run it (with the safely line omitted) for the two URLs you suggest, in both cases there is no error (so safely won't be put to the test by recovering from one), and yet the result is character(0), i.e. no text is actually returned despite there being no runtime errors.
Because there is no error in the example to show the value of safely, I have added 'not even a url' as an entry in page_ulrs. This would normally cause a 'does not exist' error, but with safely the rest will run.

library(rvest)
library(purrr)
scrape_mail <- function(topic_url) {
  topic_html <- read_html(topic_url)

  topic_html %>% 
    html_nodes(css="link_email mr-2") %>% 
    html_attr(name = "href", default = NA_character_)
}


page_ulrs <- c("not even a url","https://www.europarl.europa.eu/", paste0("https://www.europarl.europa.eu/meps/es/", 197490:197491))

#unsafe
master <- map(page_ulrs, scrape_mail)

# made safe
master2 <- map(page_ulrs,purrr::safely(scrape_mail))
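
On the character(0) result: one likely cause is the CSS selector. The sketch below uses a made-up HTML fragment (the real page structure is an assumption) to compare the selector without dots, the one with both classes joined, and the one with a space between the classes, all of which appear in this thread.

```r
library(rvest)

# Made-up fragment; the markup of the real profile page is an assumption.
page <- read_html('<html><body>
  <a class="link_email mr-2" href="mailto:someone@example.org">Email</a>
</body></html>')

# No dots: looks for <mr-2> elements inside <link_email> elements,
# which do not exist here, so nothing is matched.
no_match <- page %>%
  html_nodes(css = "link_email mr-2") %>%
  html_attr("href")

# Both classes joined: matches one element carrying both classes.
both_classes <- page %>%
  html_nodes(css = ".link_email.mr-2") %>%
  html_attr("href")

# Space between classes: looks for a .mr-2 element nested inside
# a .link_email element, which this fragment does not have.
descendant <- page %>%
  html_nodes(css = ".link_email .mr-2") %>%
  html_attr("href")
```

An empty match is not an error, which would explain why safely has nothing to catch for those URLs.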
