Recently, to pass the time, I decided to create a code to map the Covid situation in my area.
In the end I decided to try to make a fully automated program to build the matrices from the pdf files posted every day. I found a way to download all the .pdf and I almost managed to extract the data I want from the .pdf with regular expressions.
I'm currently stuck on a really simple issue, I haven't manage to use a function (str_match() is highly suspected) to convert strings into an exploitable dataframe.
Does anyone have any idea ?
library(dplyr)
library(pdftools)
library(stringr)
# Code R
Wd<-"D:/Fichiers R/stats de la région/Bulletin_occitanie_Covid"
setwd(Wd)
# Example with one PDF from the website
download.file("https://www.occitanie.ars.sante.fr/system/files/2020-05/%40ARSOC_%23COVID-19_BulletinInfo54_20200501.pdf",
destfile = paste0(Wd,"/Bulletin.pdf"),
mode="wb"
)
# List of the "départements" in the area
Liste_departement<-c("Ari.ge","Aude","Aveyron","Gard","Gers","Haute.Garonne","Hautes.Pyr.n.es","H.rault","Lot","Loz.re","Pyr.n.es-Orientales","Tarn","Tarn.et.Garonne")
# Extraction of the data from the PDF
Data_extraction<-function(X){
text <- pdf_text("./Bulletin.pdf")
Liste_brute <- strsplit(text, "\r\n")[[1]] %>%
str_subset(paste0(X,"[:blank:]*\\([:digit:]{2}\\)")) %>%
str_extract(paste0(X,"[:blank:]*\\([:digit:]{2}\\)([:blank:]*[:digit:]{1,5}){4}"))
return(Liste_brute)
}
sapply(Liste_departement,Data_extraction)
While I don't have a solution to your specific question, you might consider parsing the PDF using the tabulizer package. It does especially well extracting tables from PDFs.
Often when using tabulizer I find I have to manually define the areas of the tables I want to extract. A great way to do this is with the locate_areas() function that will let you interactively drag a box around a table and give you the coordinates to plug into the area argument of extract_tables.
It's a very good way to solve my problem, thank you.
I'm still reading the extract_table instructions. Since you seem to be used to this function, can't I use the guess parameter (or any other function) to have the right definition of the area no matter the little fluctuations in the font of each PDF ?
I assume that's what you are advising me to do with the locate_areas() function but this is a "manual" function and I'd like an automated function that does the extraction for every pdf (one report is added every day, we are at 53).
You can try to use guess = TRUE in extract_table(), but depending on the structure of the PDF, it may or may not be able to find the table automatically. When possible, specifying the area is much more reliable, but if the layout of all the PDFs are slightly different, it might be tough.
Since you've done a lot of the hard regex work already extracting rows from the table, here's an approach to turn that text in rows in a data frame. I'm sure there's more elegant ways to do this, but it seems to work!
It seems to work properly, I have a lot to learn about map() ! Thank you for your time and dedication, the two answers already lead me in the right direction.