Scraping 800 pages: Error in open.connection, HTTP error 403

I have a list of 800 URLs. I want to scrape the element .breadcrumb from these pages. When I test with 50 pages everything goes well, but when I run the full list of 800 URLs I get the error: "Error in open.connection(x, "rb") : HTTP error 403."

This is my code:

# Load packages #
library(rvest)
library(purrr)

# Read csv #
websites <- read.csv("websites.csv", sep = ";")
View(websites)

# Make list (avoid calling it `list`, which shadows base::list) #
url_list <- as.list(websites$URL)

# Scrape all pages #
breadcrumbs <- url_list %>% 
  map(read_html) %>% 
  map(html_node, ".breadcrumb") %>% 
  map_chr(html_text)

Error in open.connection(x, "rb") : HTTP error 403.

How can I fix this?

UPDATE:
There are some Drupal pages with restricted access, and those are what cause the 403 error. How can I set this up in R so those pages are ignored?

You need a way to intercept the error and ignore it. The function tryCatch() lets you do that, for example with something like tryCatch(read_html(url), error = function(e) NA) — note that read_html must actually be called inside tryCatch() for the error to be caught.
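A minimal sketch of that approach, using purrr::possibly() (a convenience wrapper around the same tryCatch idea) so that a page that returns 403 yields NA instead of aborting the whole pipeline. The URLs shown are placeholders for your own list, and read_html_safe is a made-up name for illustration:

```r
library(rvest)
library(purrr)

# Placeholder for your 800 URLs (e.g. as.list(websites$URL))
urls <- list("https://example.com/a", "https://example.com/b")

# possibly() wraps read_html() so a failed request (e.g. HTTP 403)
# returns NULL instead of stopping the whole map() call.
read_html_safe <- possibly(read_html, otherwise = NULL)

breadcrumbs <- urls %>%
  map(read_html_safe) %>%
  map_chr(function(page) {
    if (is.null(page)) return(NA_character_)  # page could not be fetched
    # html_text() already returns NA if .breadcrumb is missing on the page
    html_text(html_node(page, ".breadcrumb"))
  })
```

Afterwards you can filter out the failures with something like breadcrumbs[!is.na(breadcrumbs)], or keep the NAs so the results stay aligned with your original URL list.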

You can find more complete explanations here.
