How to create a data frame/tibble from numerics strings?

Recently, to pass the time, I decided to create a code to map the Covid situation in my area.
In the end I decided to try to make a fully automated program to build the matrices from the pdf files posted every day. I found a way to download all the .pdf and I almost managed to extract the data I want from the .pdf with regular expressions.
I'm currently stuck on a really simple issue, I haven't manage to use a function (str_match() is highly suspected) to convert strings into an exploitable dataframe.
Does anyone have any idea ?

library(dplyr)
library(pdftools)
library(stringr)

# Code R
Wd<-"D:/Fichiers R/stats de la région/Bulletin_occitanie_Covid"
setwd(Wd)
# Example with one PDF from the website 
download.file("https://www.occitanie.ars.sante.fr/system/files/2020-05/%40ARSOC_%23COVID-19_BulletinInfo54_20200501.pdf",
  destfile = paste0(Wd,"/Bulletin.pdf"),
  mode="wb"
)
# List of the "départements" in the area
Liste_departement<-c("Ari.ge","Aude","Aveyron","Gard","Gers","Haute.Garonne","Hautes.Pyr.n.es","H.rault","Lot","Loz.re","Pyr.n.es-Orientales","Tarn","Tarn.et.Garonne")

# Extraction of the data from the PDF
Data_extraction<-function(X){
  text <- pdf_text("./Bulletin.pdf")
  Liste_brute <- strsplit(text, "\r\n")[[1]] %>% 
    str_subset(paste0(X,"[:blank:]*\\([:digit:]{2}\\)")) %>% 
    str_extract(paste0(X,"[:blank:]*\\([:digit:]{2}\\)([:blank:]*[:digit:]{1,5}){4}"))
  return(Liste_brute)
}

sapply(Liste_departement,Data_extraction)

While I don't have a solution to your specific question, you might consider parsing the PDF using the tabulizer package. It does especially well extracting tables from PDFs.

Often when using tabulizer I find I have to manually define the areas of the tables I want to extract. A great way to do this is with the locate_areas() function that will let you interactively drag a box around a table and give you the coordinates to plug into the area argument of extract_tables.

This is how I would go about parsing your table:

library(dplyr, warn.conflicts = FALSE)
library(janitor, warn.conflicts = FALSE)
library(tabulizer)

t <- tempfile()
download.file(
  "https://www.occitanie.ars.sante.fr/system/files/2020-05/%40ARSOC_%23COVID-19_BulletinInfo54_20200501.pdf",
  destfile = t, mode = "wb"
  )

covid_table <- extract_tables(
  t,
  output = "data.frame",
  area = list(c(156, 13, 354, 346)),
  guess = FALSE
  ) %>% 
  as.data.frame()

covid_table %>% 
  as_tibble() %>%
  slice(4:17) %>% 
  mutate_at(2:5, as.numeric) %>% 
  clean_names()
#> # A tibble: 14 x 5
#>    suivi_de_la_activite_hospit~ hospitalisations  dont total_retours total_deces
#>    <chr>                                   <dbl> <dbl>         <dbl>       <dbl>
#>  1 Ariège (09)                                7     1            29           2
#>  2 Aude (11)                                  40     3           171          51
#>  3 Aveyron (12)                               30     2           113          22
#>  4 Gard (30)                                 132    28           186          61
#>  5 Haute-Garonne (31)                        120    38           481          50
#>  6 Gers (32)                                  26     4            52          17
#>  7 Hérault (34)                             110    33           563         104
#>  8 Lot (46)                                   16     1            50          10
#>  9 Lozère (48)                                1     0            18           1
#> 10 Hautes-Pyrénées (65)                     65     5            96          20
#> 11 Pyrénées-Orientales (66)                 15     8           260          33
#> 12 Tarn (81)                                  38    10            79          19
#> 13 Tarn-et-Garonne (82)                       12     6            33           4
#> 14 OCCITANIE                                 612   139          2131         394

Created on 2020-05-01 by the reprex package (v0.3.0)

Clearly there are a few encoding issue in some of the names, but you should be able to clean that up!

1 Like

You can try to use guess = TRUE in extract_table(), but depending on the structure of the PDF, it may or may not be able to find the table automatically. When possible, specifying the area is much more reliable, but if the layout of all the PDFs are slightly different, it might be tough.

Since you've done a lot of the hard regex work already extracting rows from the table, here's an approach to turn that text in rows in a data frame. I'm sure there's more elegant ways to do this, but it seems to work!

library(tidyverse)
library(pdftools)

t <- tempfile()
download.file(
  "https://www.occitanie.ars.sante.fr/system/files/2020-05/%40ARSOC_%23COVID-19_BulletinInfo54_20200501.pdf",
  destfile = t, mode = "wb"
)

text <- pdf_text(t)
departement <- c("Ari.ge", "Aude", "Aveyron", "Gard", "Gers", "Haute.Garonne",
                 "Hautes.Pyr.n.es", "H.rault", "Lot", "Loz.re", "Pyr.n.es-Orientales",
                 "Tarn", "Tarn.et.Garonne")


extract_text <- function(x, y) {
  strsplit(x, "\r\n")[[1]] %>% 
    str_subset(paste0(y,"[:blank:]*\\([:digit:]{2}\\)")) %>% 
    str_extract(paste0(y,"[:blank:]*\\([:digit:]{2}\\)([:blank:]*[:digit:]{1,5}){4}")) %>% 
    str_split("\\s+") %>% 
    as_vector() %>% 
    set_names(paste0("X", 1:6))
}

map2_dfr(text, departement, extract_text)
#> # A tibble: 13 x 6
#>    X1                  X2    X3    X4    X5    X6   
#>  * <chr>               <chr> <chr> <chr> <chr> <chr>
#>  1 Ariège              (09)  7     1     29    2    
#>  2 Aude                (11)  40    3     171   51   
#>  3 Aveyron             (12)  30    2     113   22   
#>  4 Gard                (30)  132   28    186   61   
#>  5 Gers                (32)  26    4     52    17   
#>  6 Haute-Garonne       (31)  120   38    481   50   
#>  7 Hautes-Pyrénées     (65)  65    5     96    20   
#>  8 Hérault             (34)  110   33    563   104  
#>  9 Lot                 (46)  16    1     50    10   
#> 10 Lozère              (48)  1     0     18    1    
#> 11 Pyrénées-Orientales (66)  15    8     260   33   
#> 12 Tarn                (81)  38    10    79    19   
#> 13 Tarn-et-Garonne     (82)  12    6     33    4

Created on 2020-05-02 by the reprex package (v0.3.0)

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.

It's a very good way to solve my problem, thank you.
I'm still reading the extract_table instructions. Since you seem to be used to this function, can't I use the guess parameter (or any other function) to have the right definition of the area no matter the little fluctuations in the font of each PDF ?
I assume that's what you are advising me to do with the locate_areas() function but this is a "manual" function and I'd like an automated function that does the extraction for every pdf (one report is added every day, we are at 53).

It seems to work properly, I have a lot to learn about map() ! Thank you for your time and dedication, the two answers already lead me in the right direction.

1 Like