Extracting text fields from a list of PDFs

petermacp · March 26, 2019, 8:08am

I have a folder of about 2000 .pdf files containing laboratory results. All files are in a similar format and layout.

I have been trying to read all of the .pdf files into R and then extract data from the relevant fields for analysis, but I am really struggling with the regex.

Here is an example:

library(tidyverse)
library(pdftools)

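# read in all the .pdf files in the working directory
# (the trailing $ anchors the match to the file extension)
file.list <- list.files(pattern = "\\.pdf$")
# pdf_text() returns a character vector with one element per page
x <- map(file.list, ~ pdf_text(.x))
# name each list element after its file, minus the extension
names(x) <- gsub("\\.pdf$", "", file.list)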

Created on 2019-03-26 by the reprex package (v0.2.1)

Once imported, the .pdf files look roughly like this (the real ones are much longer):

x <- list(file1 = "OTHER TEXT\nSample ID:                      PRO22884Z-       OTHER TEXT   \nTest Result:              NOT DETECTED\n OTHER TEXT End Time:          21/01/19 17:21:10\n       OTHER TEXT       ",
          file2 = "OTHER TEXT\nSample ID:                      PRO33443M-       OTHER TEXT   \nTest Result:              DETECTED\n OTHER TEXT End Time:          22/01/19 18:04:34\n       OTHER TEXT       ",
          file3 = "OTHER TEXT\nSample ID:                      PRO112236-       OTHER TEXT   \nTest Result:              DETECTED\n OTHER TEXT End Time:          14/02/19 09:34:17\n       OTHER TEXT       ")


Now, I guess I need to map over each item in the list and extract the fields I need (Sample ID, Test Result and End Time), but this is where I am getting lost in regex symbols.
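
For what it's worth, the map-and-extract step described above might be sketched roughly like this. It is only a sketch under some assumptions: that the labels "Sample ID:", "Test Result:" and "End Time:" appear exactly as in the toy example, that the sample ID is made up of capital letters and digits only, and that the result runs to the end of its line. The names `txt` and `results` are made up for illustration.

```r
library(tidyverse)

# Toy input copied from the example above; the real input would come from pdf_text()
x <- list(file1 = "OTHER TEXT\nSample ID:                      PRO22884Z-       OTHER TEXT   \nTest Result:              NOT DETECTED\n OTHER TEXT End Time:          21/01/19 17:21:10\n       OTHER TEXT       ",
          file2 = "OTHER TEXT\nSample ID:                      PRO33443M-       OTHER TEXT   \nTest Result:              DETECTED\n OTHER TEXT End Time:          22/01/19 18:04:34\n       OTHER TEXT       ",
          file3 = "OTHER TEXT\nSample ID:                      PRO112236-       OTHER TEXT   \nTest Result:              DETECTED\n OTHER TEXT End Time:          14/02/19 09:34:17\n       OTHER TEXT       ")

# pdf_text() gives one string per page, so collapse each file into a single string
txt <- map_chr(unname(x), ~ paste(.x, collapse = "\n"))

# Each pattern has one capture group; str_match() returns that group in column 2
results <- tibble(
  # assumption: IDs contain only capital letters and digits, so the trailing "-" is excluded
  sample_id   = str_match(txt, "Sample ID:\\s*([A-Z0-9]+)")[, 2],
  # assumption: timestamps are always dd/mm/yy hh:mm:ss
  end_time    = str_match(txt, "End Time:\\s*(\\d{2}/\\d{2}/\\d{2} \\d{2}:\\d{2}:\\d{2})")[, 2],
  # assumption: the result is everything after the label up to the end of the line
  test_result = str_trim(str_match(txt, "Test Result:[ \\t]*([^\\n]*)")[, 2])
)

results
```

Any field whose pattern does not match comes back as NA, which should make it easy to spot files with a different layout. If a proper date-time column is wanted, `lubridate::dmy_hms()` ought to handle this day/month/year format, though I have not checked that against the real files.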

The end result should look like this:

library(tidyverse)
output <- tribble(
  ~sample_id, ~end_time, ~test_result,
  "PRO22884Z", "21/01/19 17:21:10", "NOT DETECTED",
  "PRO33443M", "22/01/19 18:04:34", "DETECTED",
  "PRO112236", "14/02/19 09:34:17", "DETECTED"
)

output
#> # A tibble: 3 x 3
#>   sample_id end_time          test_result 
#>   <chr>     <chr>             <chr>       
#> 1 PRO22884Z 21/01/19 17:21:10 NOT DETECTED
#> 2 PRO33443M 22/01/19 18:04:34 DETECTED    
#> 3 PRO112236 14/02/19 09:34:17 DETECTED


Any help gratefully received.