scraping of dates and links from a pdf using functions

Techzill · November 29, 2022, 2:13am

Hi community,

I have been trying to scrape the links and published dates of the PDFs listed on a website using rvest, but the function I wrote keeps returning itself without giving a result.

# Load Packages ------
pacman::p_load(
  # Data Wrangling
  tidyverse, lubridate, magrittr,
  # Web scraping
  rvest, xopen,
  # Text data mining
  readtext, tidytext,
  quanteda, textclean
)

search_pages <- c(
  "https://www.cbn.gov.ng/Documents/quarterlyecoreports.aspbeginrec=1&endrec=20&keyword=&from=&tod=",
  "https://www.cbn.gov.ng/Documents/quarterlyecoreports.aspbeginrec=21&endrec=40&keyword=&from=&tod="
) %>%
  tibble(page = .) %>%
  print()

# Create a function to grab the links
get_links <- function(page){
# page <- search_pages %>% pull(page) %>% .[1] %>% read_html()
page <- search_pages %>% read_html()

# Create a table of extracted data
page_tbl <- tibble(
  # Get Title

  title = page %>% 
    html_nodes('.dbasetable a') %>% 
    html_text2() %>% 
    str_remove_all(
      "(CBN )|(Economic Report)|(for )|(the )|(Published\\s\\d+/\\d+/\\d+)|(of)") %>% 
    str_squish(),
  
  # Get Published Date
  date = page %>% 
    html_nodes('#publishedDt') %>% #
    html_text2() %>% 
    str_squish() %>% 
    str_replace("Published ", "") %>% 
    str_extract("\\d+/\\d+/\\d+") %>% 
    mdy() %>% 
    format(., format = "%Y%m%d"),
  
  # Get the download links
  links = page %>%
    html_nodes('.dbasetable a') %>%
    html_attr("href") %>% 
    str_replace("^(\\.\\.)", "") %>% 
    str_c("https://www.cbn.gov.ng", .)

)
return(links)
}
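
A few things in the function above would likely stop it from returning a table, whatever the page contains. Typing `get_links` without parentheses only prints the function's source, so it has to be called with a URL. The body reads `search_pages` instead of its `page` argument, and `return(links)` refers to an object that does not exist inside the function (the tibble is named `page_tbl`). The URLs also appear to be missing the `?` that normally separates `quarterlyecoreports.asp` from the query string. The sketch below is one possible arrangement, not a tested fix for the live site: it keeps the selectors from the post, splits parsing from downloading, and uses made-up HTML in `sample_doc` (including the file name) so the parsing step can be checked offline.

```r
library(rvest)
library(stringr)
library(lubridate)
library(tibble)

# Parse one already-downloaded listing page into a tibble.
# The selectors '.dbasetable a' and '#publishedDt' are taken from the post
# and assume one published-date element per link.
parse_links <- function(doc) {
  anchors <- html_elements(doc, ".dbasetable a")
  dates   <- html_elements(doc, "#publishedDt")

  tibble(
    title = str_squish(html_text2(anchors)),
    date  = format(
      mdy(str_extract(html_text2(dates), "\\d+/\\d+/\\d+")),
      "%Y%m%d"
    ),
    link  = str_c(
      "https://www.cbn.gov.ng",
      str_remove(html_attr(anchors, "href"), "^\\.\\.")
    )
  )
}

# Download a page from its URL, then parse it and return the whole table.
get_links <- function(url) {
  parse_links(read_html(url))
}

# Offline check with made-up HTML; the real page layout may differ.
sample_doc <- minimal_html('
  <div class="dbasetable">
    <a href="../Out/2022/RSD/example.pdf">CBN Economic Report</a>
    <span id="publishedDt">Published 11/29/2022</span>
  </div>')

result <- parse_links(sample_doc)
print(result)
```

To run it over both pages, something like `purrr::map(search_pages$page, get_links) %>% dplyr::bind_rows()` should work once the URLs resolve. If the number of `#publishedDt` elements differs from the number of links, `tibble()` will stop with a size error and the selectors will need adjusting.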

M_AcostaCH · November 29, 2022, 4:06am

Hi @Techzill, I tried to open the links but they return an HTTP 404 error. Could you post the original links so I can check?

Techzill · November 29, 2022, 5:08am

Thank you @M_AcostaCH, the original link is www.cbn.gov.ng/Documents/quarterlyecoreports.asp

I want to scrape the PDF files on both pages and extract the executive summary from each.

M_AcostaCH · November 29, 2022, 5:19am

Maybe you don't need web scraping; you may need OCR (Optical Character Recognition) to get the executive summary.

I have an easy example:

https://rpubs.com/miaacostach/OCR

Techzill · November 29, 2022, 5:33am

Thank you @M_AcostaCH. Actually, I have been able to get the PDF files into R; the only problem now is extracting the data from the downloaded PDFs.
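
If the downloaded PDFs contain a text layer, `pdftools::pdf_text()` returns a character vector with one element per page, and the executive summary can then be cut out with a regular expression; OCR would only be needed for scanned reports. The sketch below assumes the summary sits between a heading `EXECUTIVE SUMMARY` and a following heading `1.0 INTRODUCTION`. Both headings, the helper name `extract_section`, and the sample text are assumptions for illustration and would need to be matched to the real reports.

```r
library(stringr)

# Return the text between a start heading and an end heading.
# 'start' and 'end' are regular expressions; the defaults are assumptions.
# Returns NA when the two headings are not both found.
extract_section <- function(pages,
                            start = "EXECUTIVE SUMMARY",
                            end   = "1\\.0 INTRODUCTION") {
  # Join the pages so a section that spans a page break stays in one piece.
  full_text <- str_c(pages, collapse = "\n")

  # dotall = TRUE lets '.' match newlines; '.*?' stops at the first end heading.
  pattern <- regex(str_c(start, "(.*?)", end), dotall = TRUE)

  str_squish(str_match(full_text, pattern)[, 2])
}

# Made-up text standing in for the per-page output of a PDF reader.
pages <- c(
  "Title page\nEXECUTIVE SUMMARY\nFirst made-up sentence.",
  "Second made-up sentence.\n1.0 INTRODUCTION\nBody text."
)

summary_text <- extract_section(pages)
print(summary_text)
```

With a real file, `pages` would come from a call such as `pdftools::pdf_text("path/to/report.pdf")`, where the path is a placeholder.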

system · January 10, 2023, 5:33am

This topic was automatically closed 42 days after the last reply. New replies are no longer allowed.
