Extracting text fields from a list of pdfs

petermacp · March 26, 2019, 8:08am

I have a folder of about 2000 .pdf files containing laboratory results. All files are in a similar format and layout.

I have been trying to read all of the .pdf files into R and then extract data from the relevant fields for analysis, but I am really struggling with the regex.

Here is an example:

library(tidyverse)
library(pdftools)

# Read in all the .pdf files in the working directory
file.list <- list.files(pattern = "\\.pdf$")
x <- map(file.list, pdf_text)
names(x) <- gsub("\\.pdf$", "", file.list)

Created on 2019-03-26 by the reprex package (v0.2.1)
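
A note on the import step: `pdf_text()` returns a character vector with one element per page, so a multi-page report becomes several strings rather than one. The sketch below is one possible way to collapse the pages so that each file is a single string, which is the shape the example data in this thread assumes. The helper names `collapse_pages` and `read_pdf_as_one_string` are hypothetical, and whether joining pages with a newline suits the real reports is an assumption.

```r
# Hypothetical helper: join the per-page strings of one document
# into a single string, separated by newlines.
collapse_pages <- function(pages) {
  paste(pages, collapse = "\n")
}

# Hypothetical wrapper around pdftools::pdf_text(), which returns
# one string per page. Not called here because it needs a real file.
read_pdf_as_one_string <- function(path) {
  collapse_pages(pdftools::pdf_text(path))
}

# Simulated two-page document, standing in for pdf_text() output
pages <- c("OTHER TEXT\nSample ID:   PRO22884Z-   OTHER TEXT",
           "Test Result:   NOT DETECTED\nOTHER TEXT")
one_string <- collapse_pages(pages)
length(one_string)
```

With real files, something like `map(file.list, read_pdf_as_one_string)` would then give one string per file.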

The imported .pdf files basically look like this (but much longer in reality)

x <- list(file1 = "OTHER TEXT\nSample ID:                      PRO22884Z-       OTHER TEXT   \nTest Result:              NOT DETECTED\n OTHER TEXT End Time:          21/01/19 17:21:10\n       OTHER TEXT       ",
          file2 = "OTHER TEXT\nSample ID:                      PRO33443M-       OTHER TEXT   \nTest Result:              DETECTED\n OTHER TEXT End Time:          22/01/19 18:04:34\n       OTHER TEXT       ",
          file3 = "OTHER TEXT\nSample ID:                      PRO112236-       OTHER TEXT   \nTest Result:              DETECTED\n OTHER TEXT End Time:          14/02/19 09:34:17\n       OTHER TEXT       ")


Now, I guess I need to map through each item in the list, and extract the fields I need... but this is where I am getting lost in regex symbols...

The end result should look like this:

library(tidyverse)
output <- tribble(
  ~sample_id, ~end_time, ~test_result,
  "PRO22884Z", "21/01/19 17:21:10", "NOT DETECTED",
  "PRO33443M", "22/01/19 18:04:34", "DETECTED",
  "PRO112236", "14/02/19 09:34:17", "DETECTED"
)

output
#> # A tibble: 3 x 3
#>   sample_id end_time          test_result 
#>   <chr>     <chr>             <chr>       
#> 1 PRO22884Z 21/01/19 17:21:10 NOT DETECTED
#> 2 PRO33443M 22/01/19 18:04:34 DETECTED    
#> 3 PRO112236 14/02/19 09:34:17 DETECTED


Any help gratefully received.

FJCC · March 26, 2019, 2:33pm

I was writing this up while andresrcs was responding, so I will post it. Working with the example you provided:

library(stringr)
library(purrr)
#> Warning: package 'purrr' was built under R version 3.5.3
x <- list(file1 = "OTHER TEXT\nSample ID:                      PRO22884Z-       OTHER TEXT   \nTest Result:              NOT DETECTED\n OTHER TEXT End Time:          21/01/19 17:21:10\n       OTHER TEXT       ",
          file2 = "OTHER TEXT\nSample ID:                      PRO33443M-       OTHER TEXT   \nTest Result:              DETECTED\n OTHER TEXT End Time:          22/01/19 18:04:34\n       OTHER TEXT       ",
          file3 = "OTHER TEXT\nSample ID:                      PRO112236-       OTHER TEXT   \nTest Result:              DETECTED\n OTHER TEXT End Time:          14/02/19 09:34:17\n       OTHER TEXT       ")


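# Extract the three fields from the text of one report.
# Each (?<=Label:) lookbehind starts the match just after the label.
GetData <- function(dat){
  list(Sample = str_trim(str_extract(dat, "(?<=Sample ID:)[^-]+")),    # up to the trailing "-"
       EndTime = str_trim(str_extract(dat, "(?<=End Time:)[^\\n]+")),   # up to the end of the line
       Result = str_trim(str_extract(dat, "(?<=Test Result:)[^\\n]+"))) # up to the end of the line
}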

Out <- map_dfr(x, GetData)
Out
#> # A tibble: 3 x 3
#>   Sample    EndTime           Result      
#>   <chr>     <chr>             <chr>       
#> 1 PRO22884Z 21/01/19 17:21:10 NOT DETECTED
#> 2 PRO33443M 22/01/19 18:04:34 DETECTED    
#> 3 PRO112236 14/02/19 09:34:17 DETECTED

Created on 2019-03-26 by the reprex package (v0.2.1)
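
For anyone finding the patterns in `GetData` hard to read, here is a minimal sketch that applies them to a single string step by step. The last pattern is an alternative that is not part of the solution above: it uses `str_match()` with a capture group and assumes that sample IDs consist only of uppercase letters and digits, which may not hold for the real reports.

```r
library(stringr)

# One record in the same shape as the example data in this thread
txt <- "OTHER TEXT\nSample ID:                      PRO22884Z-       OTHER TEXT   \nTest Result:              NOT DETECTED\n OTHER TEXT End Time:          21/01/19 17:21:10\n       OTHER TEXT       "

# (?<=Sample ID:) is a lookbehind: the match must be preceded by
# "Sample ID:", but the label itself is not part of the match.
# [^-]+ then takes every character up to (not including) the first "-".
raw_id <- str_extract(txt, "(?<=Sample ID:)[^-]+")

# The raw match still carries the padding spaces, so trim it.
sample_id <- str_trim(raw_id)

# [^\\n]+ takes every character up to the end of the line.
result   <- str_trim(str_extract(txt, "(?<=Test Result:)[^\\n]+"))
end_time <- str_trim(str_extract(txt, "(?<=End Time:)[^\\n]+"))

# Alternative (assumption: IDs are uppercase letters and digits only).
# str_match() returns a matrix; column 2 holds the first capture group.
strict_id <- str_match(txt, "Sample ID:\\s*([A-Z0-9]+)")[, 2]
```

The capture-group version does not depend on the ID being followed by a hyphen, at the cost of assuming which characters an ID may contain.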

petermacp · March 26, 2019, 5:33pm

@FJCC Thank you so much! This works perfectly.

Regex seems like black magic to me...

petermacp · March 26, 2019, 5:42pm

@andresrcs Thank you very much for the reply, and apologies for not getting this perfect...

The source pdfs are laboratory reports with personal information, so I obviously can't copy the exact text here... I did think that the reprexes that I constructed (using the reprex package) were close enough though.

For my future learning, I would be very grateful if you could give an example of how you would have structured this question differently to get the best quality input and support.

Having said all that, I am extremely grateful to all in the RStudio Community for assistance, and in particular to @FJCC for the solution in this case.

andresrcs · March 26, 2019, 8:11pm

Well, if you are working with sensitive information and the solution you have is generalizing well with the rest of your data, then your reprex has proven to be good enough.
My request for minimal but complete sample data was because, very often when answering regex-related questions with overly simplified sample data, the solution doesn't scale well to the real data because of unseen patterns, and that generates unnecessary back-and-forth with new questions like "how do I make this solution work with this other text?". Luckily, this was not the case and you already have a working solution.
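
One way to check for such unseen patterns across many files is to look for missing values after extraction, since `str_extract()` returns `NA` when a pattern does not match. The sketch below reuses the `GetData` function from the solution above. The `odd_file` entry and its "Sample No:" label are invented purely for illustration.

```r
library(stringr)
library(purrr)

# Same extraction function as in the solution above
GetData <- function(dat) {
  list(Sample = str_trim(str_extract(dat, "(?<=Sample ID:)[^-]+")),
       EndTime = str_trim(str_extract(dat, "(?<=End Time:)[^\\n]+")),
       Result = str_trim(str_extract(dat, "(?<=Test Result:)[^\\n]+")))
}

x <- list(
  file1 = "OTHER TEXT\nSample ID:                      PRO22884Z-       OTHER TEXT   \nTest Result:              NOT DETECTED\n OTHER TEXT End Time:          21/01/19 17:21:10\n       OTHER TEXT       ",
  # Invented record whose label differs, so the Sample pattern fails
  odd_file = "OTHER TEXT\nSample No:   PRO99999X-   OTHER TEXT\nTest Result:   DETECTED\n OTHER TEXT End Time:   01/03/19 10:00:00\n"
)

# .id = "file" keeps the list names, so each row can be traced to its file
Out <- map_dfr(x, GetData, .id = "file")

# Rows with at least one NA point to files where a pattern did not match
failed <- Out[!complete.cases(Out), ]
failed
```

Any files listed in `failed` can then be inspected by hand to see which layout the patterns missed.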
