I am new to R. I am trying to build a dataset of newspaper editorials so that I can run a tidytext analysis on it.
Using some online help, I have gathered links to 400 of the paper's editorials. Now I want to extract the text from those editorial pages, but I am getting the following error: Error in open.connection(x, "rb") : HTTP error 403.
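One workaround I am considering (an untested sketch; the helper name `read_html_ua` and the user-agent string are my own, and it assumes the 403 is the site rejecting R's default user agent) is to fetch pages through httr with an explicit User-Agent header:

```r
library(httr)
library(xml2)

# Hypothetical helper: fetch with a browser-like User-Agent header,
# since some sites return 403 for the default R/libcurl agent.
read_html_ua <- function(url) {
  resp <- GET(url, user_agent("Mozilla/5.0 (compatible; R scraper)"))
  stop_for_status(resp)  # fail loudly on any 4xx/5xx response
  read_html(content(resp, as = "text", encoding = "UTF-8"))
}
```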
Here is how I scraped the links to all 400 editorials published online:
library(rvest)
library(purrr)

dawnedlnk <- lapply(
  paste0("https://www.dawn.com/authors/2677/editorial/", 1:20),
  function(url) {
    read_html(url) %>%
      html_nodes(".m-2") %>%
      html_nodes(".story__link") %>%
      html_attr("href")
  }
)

edlk <- unlist(dawnedlnk)
elpages <- edlk %>% map(read_html)
I want to map all 400 pages and then get the title and text out of each one. This approach worked when I tried it on five links, but I am unable to map all 400 pages.
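For scaling up, I am thinking of something like the following (a sketch; `safe_read_html`, `scrape_pages`, and the 2-second pause are my own choices), so that one failing link does not abort the whole map and the requests are spaced out, in case the 403 comes from firing 400 requests in quick succession:

```r
library(purrr)
library(rvest)

# Wrap read_html so a failing page yields NULL instead of an error
safe_read_html <- possibly(read_html, otherwise = NULL)

scrape_pages <- function(links, pause = 2) {
  map(links, function(link) {
    Sys.sleep(pause)  # pause between requests to avoid rate limiting
    safe_read_html(link)
  })
}

# elpages <- scrape_pages(edlk)
# failed  <- edlk[map_lgl(elpages, is.null)]  # links to retry later
```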
Once I can map all 400 pages, I would expect to replicate the following code, which got me about 20 editorial posts:
library(rvest)
library(purrr)
library(tidyverse)

url <- "https://www.dawn.com/authors/2677/editorial"
edit <- read_html(url)

el <- edit %>%
  html_nodes(".m-2") %>%
  html_nodes(".story__link") %>%
  html_attr("href") %>%
  xml2::url_absolute("https://www.dawn.com")

elpages <- el %>% map(read_html)
eltitle <- elpages %>%
  map_chr(. %>%
            html_node(".story__title") %>%
            html_text())

elauthor <- elpages %>%
  map_chr(. %>%
            html_node(".story__byline") %>%
            html_text())

elpubtime <- elpages %>%
  map_chr(. %>%
            html_node(".story__time") %>%
            html_text())

eltext <- elpages %>%
  map_chr(. %>%
            html_node(".modifier--blockquote-center-narrow p") %>%
            html_text())

dawned <- tibble(
  author  = elauthor,
  title   = eltitle,
  text    = eltext,
  pubtime = elpubtime,
  links   = el,
  paper   = "Dawn"
)
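If the 400 pages do load, I assume the field-by-field extraction above could be collapsed into one function mapped over every parsed page (a sketch; `extract_fields` and `grab` are my own names, and it assumes the same CSS selectors hold on every editorial page):

```r
library(rvest)
library(tibble)

# Pull every field from one parsed page into a one-row tibble
extract_fields <- function(page) {
  grab <- function(css) page %>% html_node(css) %>% html_text()
  tibble(
    author  = grab(".story__byline"),
    title   = grab(".story__title"),
    text    = grab(".modifier--blockquote-center-narrow p"),
    pubtime = grab(".story__time"),
    paper   = "Dawn"
  )
}

# dawned <- purrr::map_dfr(elpages, extract_fields)
# dawned$links <- el  # assumes elpages and el are in the same order
```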