I've been trying to scrape the title, description, and keywords from a large list of websites using rvest in a loop, but R keeps giving me a connection timeout error:
Error in open.connection(x, "rb") : Timeout was reached: Connection timed out after 10000 milliseconds
I found an alternative approach with RSelenium, but it takes forever to run through the list. Does anyone know a workaround for the timeout error? I tried options(timeout = 9999999), but it doesn't work. Here is my code with rvest:
library(rvest)
library(dplyr)
webpages <- data.frame(name = c("amazon", "apple", "usps", "yahoo", "bbc", "ted", "surveymonkey", "forbes", "imdb", "hp"),
                       url = c("http://www.amazon.com",
                               "http://www.apple.com",
                               "http://www.usps.com",
                               "http://www.yahoo.com",
                               "http://www.bbc.com",
                               "http://www.ted.com",
                               "http://www.surveymonkey.com",
                               "http://www.forbes.com",
                               "http://www.imdb.com",
                               "http://www.hp.com"))
webpages <- apply(webpages, 1, function(x){
  URL <- read_html(x['url'], encoding = "UTF-8")
  results <- URL %>% html_nodes("head")
  records <- vector("list", length = length(results))
  for (i in seq_along(records)) {
    title <- xml_contents(results[i] %>% html_nodes("title"))[1] %>% html_text(trim = TRUE)
    desc <- html_nodes(results[i], "meta[name=description]") %>% html_attr("content")
    kw <- html_nodes(results[i], "meta[name=keywords]") %>% html_attr("content")
  }
  return(data.frame(name = x['name'],
                    url = x['url'],
                    title = ifelse(length(title) > 0, title, NA),
                    description = ifelse(length(desc) > 0, desc, NA),
                    keywords = ifelse(length(kw) > 0, kw, NA)))
})
webpages <- do.call(rbind, webpages)
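One direction that might at least keep a single timeout from aborting the whole run is to wrap each read in tryCatch and retry a couple of times. This is only a rough sketch under assumptions: fetch_with_retry, its retries and wait arguments, and flaky_reader are hypothetical names that are not part of rvest, and the stand-in reader simulates a timeout so the snippet runs without a network connection. It does not make a slow site respond any faster.

```r
# Hypothetical helper (not part of rvest): try `reader(url)` up to
# `retries + 1` times and return NULL if every attempt raises an error.
fetch_with_retry <- function(url, reader, retries = 2, wait = 0) {
  for (attempt in seq_len(retries + 1)) {
    page <- tryCatch(reader(url), error = function(e) NULL)
    if (!is.null(page)) return(page)
    if (attempt <= retries) Sys.sleep(wait)  # pause before the next attempt
  }
  NULL
}

# Stand-in reader so the sketch runs offline: it fails on the first call,
# the way a timed-out connection would, and succeeds on the second.
calls <- 0
flaky_reader <- function(url) {
  calls <<- calls + 1
  if (calls < 2) stop("Timeout was reached")
  paste("page for", url)
}

# One URL that recovers on retry, and one whose reader always fails.
recovered <- fetch_with_retry("http://www.apple.com", flaky_reader)
gave_up <- fetch_with_retry("http://www.usps.com",
                            function(url) stop("Timeout was reached"))
```

In the real loop the reader would presumably be something like function(u) read_html(u, encoding = "UTF-8"), and a NULL result could be turned into a row of NA values rather than stopping apply().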