Scraping dates and links from a PDF listing using functions

Hi community,

I have been trying to scrape links and published dates for PDFs listed on a website using rvest, but the function I wrote keeps returning itself without giving any result.

# Load Packages ------

# Data Wrangling
tidyverse, lubridate, magrittr,

# Web scraping
rvest, xopen,

# Text data mining
readtext, tidytext,
quanteda, textclean

search_pages <- c("", "") %>% tibble(page = .) %>%

# Create a function to grab the links
get_links <- function(page){

  # page <- search_pages %>% pull(page) %>% .[1] %>% read_html()
  page <- search_pages %>% read_html()

  # Create a table of extracted data
  page_tbl <- tibble(
    # Get Title
    title = page %>% 
      html_nodes('.dbasetable a') %>% 
      html_text2() %>% 
      # NOTE: the function name is missing from the post; str_remove_all() is assumed
      str_remove_all(
        "(CBN )|(Economic Report)|(for )|(the )|(Published\\s\\d+/\\d+/\\d+)|(of)"),
    # Get Published Date
    date = page %>% 
      html_nodes('#publishedDt') %>% 
      html_text2() %>% 
      str_squish() %>% 
      str_replace("Published ", "") %>% 
      str_extract("\\d+/\\d+/\\d+") %>% 
      mdy() %>% 
      format(., format = "%Y%m%d"),
    # Get the download links
    links = page %>%
      html_nodes('.dbasetable a') %>%
      html_attr("href") %>% 
      str_replace("^(\\.\\.)", "") %>% 
      str_c("", .)
  )
}
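
One thing that stands out in the function above is that its body reads the global `search_pages` rather than the `page` argument it receives. A minimal sketch of an alternative is shown here, assuming the selectors `.dbasetable a` and `#publishedDt` from the post match one link and one date per row. The helper name `extract_links`, the `base_url` argument, and the inline demo HTML are hypothetical and only there so the example runs without a network connection; this is not a tested fix for the real site.

```r
library(rvest)      # html_elements(), html_text2(), html_attr(), minimal_html()
library(stringr)    # str_squish(), str_extract(), str_replace(), str_c()
library(lubridate)  # mdy()
library(tibble)     # tibble()

# Hypothetical helper: takes an already parsed page and returns one row per link.
# Assumes the page has as many '#publishedDt' nodes as '.dbasetable a' links;
# tibble() will raise an error if the two lengths differ.
extract_links <- function(doc, base_url = "") {
  anchors <- html_elements(doc, ".dbasetable a")
  stamps  <- html_elements(doc, "#publishedDt")

  # Drop a leading ".." from relative hrefs, as in the original post
  hrefs <- str_replace(html_attr(anchors, "href"), "^\\.\\.", "")

  # Parse "Published m/d/yyyy" into a Date, then format it as yyyymmdd
  published <- mdy(str_extract(html_text2(stamps), "\\d+/\\d+/\\d+"))

  tibble(
    title = str_squish(html_text2(anchors)),
    date  = format(published, format = "%Y%m%d"),
    links = str_c(base_url, hrefs)
  )
}

# Made-up HTML standing in for one listing page (not the real site)
demo_page <- minimal_html('
  <table class="dbasetable">
    <tr><td><a href="../Out/2021/report1.pdf">Report One</a></td>
        <td><span id="publishedDt">Published 3/15/2021</span></td></tr>
    <tr><td><a href="../Out/2021/report2.pdf">Report Two</a></td>
        <td><span id="publishedDt">Published 4/20/2021</span></td></tr>
  </table>')

links_tbl <- extract_links(demo_page, base_url = "https://example.org")
print(links_tbl)
```

For a live page, the parsed document would come from `read_html()` called on a single URL string, and the helper could then be applied to each URL in `search_pages` in turn.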


Hi @Techzill, I tried to open the links but they return an HTTP 404 error. Could you post the original links so I can check?

Thank you @M_AcostaCH, the original link is

I want to scrape the PDF files on both pages and extract the executive summary from each.

Maybe you don't need web scraping for this; you may need OCR (Optical Character Recognition) to get the executive summary.

I have an easy example:

Thank you @M_AcostaCH. Actually, I have already been able to get the PDF files into R; the only problem now is extracting data from the downloaded PDFs.
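
For text-based PDFs (as opposed to scanned images), one possible approach is to pull the text between two headings with a regular expression. The sketch below assumes the page text is already in a character vector with one element per page, which is the shape that `pdftools::pdf_text()` returns. The helper name `extract_section`, the heading names "Executive Summary" and "Introduction", and the demo text are all assumptions, since the actual layout of the reports is not shown in this thread.

```r
library(stringr)  # str_c(), regex(), str_match(), str_squish()

# Hypothetical helper: returns the text between a start heading and an end heading.
# `pages` is a character vector, one element per PDF page.
# `start` and `end` are treated as regular expressions (plain words here).
extract_section <- function(pages,
                            start = "Executive Summary",
                            end   = "Introduction") {
  # Join the pages so a section that spans a page break is still matched
  full_text <- str_c(pages, collapse = "\n")

  # Lazy match across line breaks, ignoring case
  pattern <- regex(str_c(start, "(.*?)", end),
                   ignore_case = TRUE, dotall = TRUE)

  hit <- str_match(full_text, pattern)

  # Column 2 holds the captured group; it is NA when nothing matched
  str_squish(hit[, 2])
}

# Made-up page text standing in for the output of a PDF text extractor
demo_pages <- c(
  "Economic Report\nExecutive Summary\nOutput grew in the\nreview period.",
  "Inflation eased.\nIntroduction\nMore detail follows."
)

summary_text <- extract_section(demo_pages)
print(summary_text)
```

If a report has a table of contents, the first "Executive Summary" hit may be the contents entry rather than the section itself, so the headings passed in would need adjusting per document.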
