Scraping 400 pages using rvest and purrr

I am new to R. I am trying to build a dataset of a newspaper's articles so that I can perform tidytext analysis on it.

Using some online help, I have gathered links to 400 of their editorials. Now I want to get the text out of the editorial pages, but I am getting the following error: Error in open.connection(x, "rb") : HTTP error 403.

I have scraped links to all the editorials published online; there are 400 of them.

library(rvest)
library(purrr)

# collect the editorial links from the first 20 listing pages
dawnedlnk <- lapply(paste0('https://www.dawn.com/authors/2677/editorial/', 1:20), 
                    function(url){
                      read_html(url) %>% 
                        html_nodes(".m-2") %>%
                        html_nodes(".story__link") %>%
                        html_attr("href")
                    })

edlk <- unlist(dawnedlnk)

elpages <- edlk %>% map(read_html)  # parse each editorial page (the HTTP 403 error appears here)

I want to map over all 400 pages and then pull the title and text out of each one. I tried this approach on five links and it worked, but I am unable to map all 400 pages.

If I can map all 400 pages, I would expect to replicate the following code, which got me twenty-some editorial posts.

library(rvest)
library(purrr)
library(tidyverse)

url <- "https://www.dawn.com/authors/2677/editorial"

edit <- read_html(url)

# links to the editorials on the listing page, made absolute
el <- edit %>%
  html_nodes(".m-2") %>%
  html_nodes(".story__link") %>%
  html_attr("href") %>% 
  xml2::url_absolute("http://dawn.com") 

# parse each editorial page
elpages <- el %>% map(read_html)

# pull the title, byline, publication time and body text out of each page
eltitle <- elpages %>% 
  map_chr(. %>% 
            html_node(".story__title") %>% 
            html_text()
  )

elauthor <- elpages %>% 
  map_chr(. %>% 
            html_node(".story__byline") %>% 
            html_text()
  )

elpubtime <- elpages %>% 
  map_chr(. %>% 
            html_node(".story__time") %>% 
            html_text()
  )

eltext <- elpages %>% 
  map_chr(. %>% 
            html_node(".modifier--blockquote-center-narrow p") %>% 
            html_text()
  )

dawned <- tibble(author = elauthor, title = eltitle, text = eltext, pubtime = elpubtime, links = el, paper = "Dawn")

I don't know whether it is linked to the error or not, but I would advise being polite when scraping web pages, specifically with regard to crawl delay.

There are some great articles on this topic, and some packages that help you do the right thing.

I would advise adding a crawl delay of a few seconds when scraping, either manually or using the polite package, a wrapper around httr that uses robots.txt information.
Scraping will take more time, but I think it is the right behavior, and the error will, I think, disappear.

library(rvest)
library(purrr)

dawnedlnk <- purrr::map(
  paste0('https://www.dawn.com/authors/2677/editorial/', 1:20), 
  function(url){
    read_html(url) %>% 
      html_nodes(".m-2") %>%
      html_nodes(".story__link") %>%
      html_attr("href")
  })

edlk <- unlist(dawnedlnk)

elpages <- edlk[1:2] %>% 
  map(~ {
    message(glue::glue("* parsing: {.x}"))
    Sys.sleep(5)             # manual crawl delay of 5 seconds between requests
    safely(read_html)(.x)    # safely() keeps going even if one page fails
  })
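
Since safely() wraps each call, the successful pages end up under $result and the failures under $error, so you can keep only the pages that actually parsed. A minimal sketch, continuing from the code above:

# keep only the pages that parsed successfully
pages_ok <- elpages %>% 
  purrr::map("result") %>%   # $result is NULL when read_html() failed
  purrr::compact()           # drop the NULLs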

Or, using polite, something like this, I think:

library(polite)
library(rvest)

# bow() reads the site's robots.txt and sets up a polite session
session <- bow("https://www.dawn.com")
session

dawnedlnk <- purrr::map(
  paste0('authors/2677/editorial/', 1:20), 
  ~ { nod(session, .x) %>%   # nod() points the session at a new path
      scrape() %>%
      html_nodes(".m-2") %>%
      html_nodes(".story__link") %>%
      html_attr("href")
  })

edlk <- unlist(dawnedlnk)

elpages <- edlk %>% 
  purrr::map(~ {
    nod(session, urltools::path(.x)) %>%   # keep only the path part of the href
      scrape()
  })

Hope it helps


Thanks for the help. I am trying to apply sentiment analysis to the editorials, which is why I want to download that many articles. No scraping for the sake of scraping.

The polite option worked well. With plain rvest, even when I use Sys.sleep() I keep getting the 403 error after a while. Both approaches are slow, and polite is the slower of the two, but it is still good to have.

One more thing I am trying to grapple with: how can I supply a two-digit number to the paste0() function? For some websites I need to build the missing part of the link myself, e.g. the date at the end of this URL: https://www.dawn.com/archive/2019-05-22.

I would want to supply the year, month and day parts separately.

You mean left pad with a zero?

stringr::str_pad(1:31, 2, "left", "0")

#>  [1] "01" "02" "03" "04" "05" "06" "07" "08" "09" "10" "11" "12" "13" "14"
#> [15] "15" "16" "17" "18" "19" "20" "21" "22" "23" "24" "25" "26" "27" "28"
#> [29] "29" "30" "31"

You can also print dates like so:

seq(as.Date("2011-12-15"), as.Date("2012-01-15"), by="days")

#>  [1] "2011-12-15" "2011-12-16" "2011-12-17" "2011-12-18" "2011-12-19"
#>  [6] "2011-12-20" "2011-12-21" "2011-12-22" "2011-12-23" "2011-12-24"
#> [11] "2011-12-25" "2011-12-26" "2011-12-27" "2011-12-28" "2011-12-29"
#> [16] "2011-12-30" "2011-12-31" "2012-01-01" "2012-01-02" "2012-01-03"
#> [21] "2012-01-04" "2012-01-05" "2012-01-06" "2012-01-07" "2012-01-08"
#> [26] "2012-01-09" "2012-01-10" "2012-01-11" "2012-01-12" "2012-01-13"
#> [31] "2012-01-14" "2012-01-15"

Created on 2019-05-22 by the reprex package (v0.2.0).
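
Putting those two together, here is a small sketch (assuming the archive URLs follow the pattern in your example) that supplies the year, month and day separately and pastes them into full links:

year  <- 2019
month <- 5
days  <- 1:31

archive_urls <- paste0(
  "https://www.dawn.com/archive/",
  year, "-",
  stringr::str_pad(month, 2, "left", "0"), "-",
  stringr::str_pad(days, 2, "left", "0")
)
# e.g. "https://www.dawn.com/archive/2019-05-01", "https://www.dawn.com/archive/2019-05-02", ...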


Thanks a million, this is really helpful. The more I learn R, the more I love it and the R community, which is always there to help.

So I tried to apply seq(as.Date(...)) to generate the dates and then paste them using the paste0() function below, but I am getting the following error when running the code... (hopefully this will be the last question in the series).

Error:

Error in UseMethod("xml_find_all") :
no applicable method for 'xml_find_all' applied to an object of class "NULL"


library(polite)
library(rvest)
library(purrr)

dates <- seq(as.Date("2019-04-01"), as.Date("2019-04-05"), by = "days")

dawnsession <- bow("https://www.dawn.com")
dawnsession

dawnlks <- purrr::map(
  paste0('archive/', dates), 
  ~ { nod(dawnsession, .x) %>%
      scrape() %>%
      html_nodes(".mb-4") %>%
      html_nodes(".story__link") %>%
      html_attr("href")
  })

OK, found the solution. The following worked:


library(polite)
library(rvest)
library(purrr)

dawnsession <- bow("https://www.dawn.com")
dawnsession

dates <- seq(as.Date("2019-04-01"), as.Date("2019-04-30"), by = "days")

fulllinks <- map(dates, ~ scrape(dawnsession, params = paste0("archive/", .x)))

links <- map(fulllinks, ~ html_nodes(.x, ".mb-4") %>%
               html_nodes(".story__link") %>%
               html_attr("href"))
