Web Scraping Wikipedia Help

Hi,
I'm new to web scraping and am running into some issues scraping pages for the name Liam on Wikipedia. I'm searching the Liam-related Wikipedia pages for the words Irish, Ireland, and Catholic. I think the code works up to Liam_urls <- paste0("https://en.wikipedia.org",Liam_urls), but I could be wrong. I get one of these error messages: Error in function (type, msg, asError = TRUE) : error:1407742E:SSL routines:SSL23_GET_SERVER_HELLO:tlsv1 alert protocol version or Error in function (type, msg, asError = TRUE) : <url> malformed
How should I adjust my code?
Thanks for your help.

library(RCurl)
library(rvest)
library(stringr)
html_attr(html_nodes(read_html("https://en.wikipedia.org/wiki/Liam"), "a[title^=Liam]"),"href")
Liam_urls <- html_attr(html_nodes(read_html("https://en.wikipedia.org/wiki/Liam"), "a[title^=Liam]"),"href")
Liam_urls <- Liam_urls[which(!str_detect(Liam_urls, "https"))]
Liam_urls
Liam_urls <- paste0("https://en.wikipedia.org",Liam_urls)
scraped_Liam <- sapply(Liam_urls, function(x) getURL(x))
results_Liam <- sapply(scraped_Liam, function(x) str_detect(x,"Irish|Ireland|Catholic"))
results_Liam.df <- data.frame("Hit"=results_Liam, stringsAsFactors = FALSE)
length(results_Liam.df$Hit[which(results_Liam.df$Hit==TRUE)])/length(results_Liam.df$Hit)

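I tried your code and got the same SSL error. The message (tlsv1 alert protocol version) indicates that the server rejected the TLS version offered by the client, so my guess is that RCurl, as installed, cannot negotiate a protocol that the Wikipedia domain accepts when getURL is called. Happy to be corrected by anyone on that point.
I switched to the httr library and made the follow-on changes that required: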

# Replace RCurl with httr
library(httr)
library(rvest)
library(stringr)
# html_attr(html_nodes(read_html("https://en.wikipedia.org/wiki/Liam"), "a[title^=Liam]"),"href")
Liam_urls <- html_attr(html_nodes(read_html("https://en.wikipedia.org/wiki/Liam"), "a[title^=Liam]"),"href")
Liam_urls <- Liam_urls[which(!str_detect(Liam_urls, "https"))]
Liam_urls
Liam_urls <- paste0("https://en.wikipedia.org",Liam_urls)
# GET returns response objects, so collect them in a list with lapply
scraped_Liam <- lapply(Liam_urls, function(x) GET(x))
# Each response object needs its content extracted as text
results_Liam <- sapply(scraped_Liam, function(x) str_detect(content(x,as="text"),"Irish|Ireland|Catholic"))
results_Liam.df <- data.frame("Hit"=results_Liam, stringsAsFactors = FALSE)
length(results_Liam.df$Hit[which(results_Liam.df$Hit==TRUE)])/length(results_Liam.df$Hit)
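
As a possible tidy-up, the URL-building and hit-counting steps could be pulled into two small helpers. This is only a sketch in base R: build_urls and hit_rate are hypothetical names that do not appear in the code above, grepl stands in for str_detect, the "/wiki/" prefix filter is an assumption about which links are wanted, and the sample inputs are made up so the snippet runs without a network connection.

```r
# Hypothetical helpers for illustration; not part of the original code.

# Turn relative Wikipedia hrefs into absolute URLs. Missing values and
# links that do not start with "/wiki/" (an assumed filter) are dropped,
# and duplicates are removed.
build_urls <- function(hrefs, base = "https://en.wikipedia.org") {
  hrefs <- hrefs[!is.na(hrefs)]
  relative <- hrefs[startsWith(hrefs, "/wiki/")]
  unique(paste0(base, relative))
}

# Share of page texts that match the pattern. mean() of a logical
# vector is the proportion of TRUE values, which replaces the
# length(...)/length(...) expression.
hit_rate <- function(texts, pattern = "Irish|Ireland|Catholic") {
  mean(grepl(pattern, texts))
}

# Made-up inputs, for illustration only
hrefs <- c("/wiki/Liam_Example", "https://example.org/x", NA, "/wiki/Liam_Example")
build_urls(hrefs)

pages <- c("An Irish actor", "A footballer", "Born in Ireland", "A singer")
hit_rate(pages)
```

With the httr version above, the texts passed to hit_rate would be the content(x, as="text") values from each response.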