Recently, to pass the time, I decided to write some code to map the Covid-19 situation in my area.
In the end I decided to try to build a fully automated program that constructs the matrices from the PDF bulletins posted every day. I found a way to download all the PDFs, and I have almost managed to extract the data I want from them with regular expressions.
I'm currently stuck on what should be a really simple issue: I haven't managed to use a function (str_match() is my main suspect) to turn the extracted strings into a usable data frame. There is a sketch of what I'm aiming for at the end of the code below.
Does anyone have any idea?
library(dplyr)
library(pdftools)
library(stringr)
# Working directory where the bulletins are stored
Wd <- "D:/Fichiers R/stats de la région/Bulletin_occitanie_Covid"
setwd(Wd)
# Example with one PDF from the website
download.file("https://www.occitanie.ars.sante.fr/system/files/2020-05/%40ARSOC_%23COVID-19_BulletinInfo54_20200501.pdf",
destfile = paste0(Wd,"/Bulletin.pdf"),
mode="wb"
)
# List of the "départements" in the area
Liste_departement<-c("Ari.ge","Aude","Aveyron","Gard","Gers","Haute.Garonne","Hautes.Pyr.n.es","H.rault","Lot","Loz.re","Pyr.n.es-Orientales","Tarn","Tarn.et.Garonne")
# Extraction of the data from the PDF
Data_extraction <- function(X) {
  text <- pdf_text("./Bulletin.pdf")
  # Split the first page into lines, keep the line mentioning département X,
  # then extract "name (code)" followed by the four figures
  Liste_brute <- strsplit(text, "\r?\n")[[1]] %>%
    str_subset(paste0(X, "[:blank:]*\\([:digit:]{2}\\)")) %>%
    str_extract(paste0(X, "[:blank:]*\\([:digit:]{2}\\)([:blank:]*[:digit:]{1,5}){4}"))
  return(Liste_brute)
}
# Raw extracted strings, named by département
Resultats <- sapply(Liste_departement, Data_extraction)
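For reference, here is roughly what I am aiming for with str_match(), assuming each entry of Resultats looks something like "Aude (11)  123  45  6  7" and that every département matched exactly one line. The column names Valeur1 to Valeur4 are placeholders, I still need to check which headings they correspond to in the bulletin.

# Sketch only: parse each extracted string with capture groups, then build a data frame
Matches <- str_match(
  Resultats,
  paste0(
    "(.+?)[:blank:]*",              # département name as printed in the PDF
    "\\(([:digit:]{2})\\)",         # two-digit code in parentheses
    "[:blank:]*([:digit:]{1,5})",   # first figure
    "[:blank:]+([:digit:]{1,5})",   # second figure
    "[:blank:]+([:digit:]{1,5})",   # third figure
    "[:blank:]+([:digit:]{1,5})"    # fourth figure
  )
)
Donnees <- data.frame(
  Departement = Matches[, 2],
  Code        = Matches[, 3],
  Valeur1     = as.integer(Matches[, 4]),   # placeholder column names
  Valeur2     = as.integer(Matches[, 5]),
  Valeur3     = as.integer(Matches[, 6]),
  Valeur4     = as.integer(Matches[, 7]),
  stringsAsFactors = FALSE
)

This only works if every call to Data_extraction() returns exactly one string; if a département matches zero or several lines, sapply() returns a list instead of a character vector and str_match() fails, which may well be part of my problem.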