Automate reading multiple PDF files and creating data frame

This is a follow-up to my previous question.

I was trying to read from a PDF file and create a data frame with specific fields from the file.
I have 1,000+ reports generated weekly. This is how far I have gotten:

I am reading the PDF file using

text = extract_text(file = "C:/Work/R/text extraction/Extration_tests/WO-09017974A.pdf",
         pages = c(1,6)) ## The two pages with the data of importance

and then putting it into a text file

sink("extract_text.txt")
cat(text, sep = "\n")
sink()

Next I am reading this text file (output pasted below) into a character vector

ICRE = readLines("extract_text.txt")

and creating an output data frame with the fields I want

extracted = data.frame("date" = ICRE[[6]],
                       "WO" = ICRE[[10]],
                       "Incident description" = paste(ICRE[[42]], ICRE[[43]]), ### using paste because the related data is in two lines
                       "Impact on customer" = ICRE[[45]],
                       "Condition of system" = ICRE[[47]]
                       )

Note: This is just a sample dataframe and I will be adding more fields from the text file.

This is how the text file I am using to extract the above data looks.

Text file from output (data has been anonymized for some entries)

Internal

Repair Report

10/20/2021


site1 - Site

Our reference: 0 / WO-09017974 Customer reference:

Report prepared by ABC Customer contact: DEF
Lines 11-38 below can be skipped, hence deleted

*NOTE: All bold text is a column header (sometimes ending with a period); non-bold text is the row content (it follows the period in some cases)
Incident Description (line 39 in the report)

_

Incident description:

Troubleshoot the issue with

ABC and BBB

Impact on customer. No disturbance

Condition of system upon arrival. ALL is OK

_

Investigation and Analysis

_

Action to protect the block.

It was discussed to not transfer the block since the other block is
already in PPP.

Other

Circumstance of the fault. No particular circumstance

Premises and environment visual check.

The premises are clean / The

premises are well-ventilated

I have a few questions:

  1. Is this the best way to do this?
  2. I would like to automate this process for the 1,000+ files coming in every week.
  3. The primary key for the data is the field WO-09017974, which is the first entry in the dataset, but as of now the DF gets the whole line "Our reference: 0 / WO-09017974 Customer reference:". I only need to extract WO-09017974. How can I do this?
  4. Is there a way to automate the extraction process (which I am basing on page numbers right now: text = extract_text(file = "C:/Work/R/text extraction/Extration_tests/WO-09017974A.pdf", pages = c(1,6))) without using page numbers, just key phrases?

Sorry for the overly long question and thanks in advance for any help.

Starting from a text file (mine was a guess from the text given, because of the interleaved comments)

library(lubridate)
#> 
#> Attaching package: 'lubridate'
#> The following objects are masked from 'package:base':
#> 
#>     date, intersect, setdiff, union
library(stringr)
input <- c("Internal", "", "Repair Report", "", "10/20/2021", "", "site1 - Site", "", "Our reference: 0 / WO-09017974 Customer reference:", "", "Report prepared by ABC Customer contact: DEF", "", "Incident Description (line 39 in the report)", "", "_", "", "Incident description:", "", "Troubleshoot the issue with", "", "ABC and BBB", "", "Impact on customer. No disturbance", "", "Condition of system upon arrival. ALL is OK", "", "_", "", "Investigation and Analysis", "", "_", "", "Action to protect the block.", "", "It was discussed to not transfer the block since the other block is", "already in PPP.", "", "Other", "", "Circumstance of the fault. No particular circumstance", "", "Premises and environment visual check.", "", "The premises are clean / The", "", "premises are well-ventilated")
pattern1 <- "Impact on customer. "
pattern2 <- "Condition of system upon arrival. "
extract <- data.frame(
  dated = mdy(input[5]),
    wo = str_extract(input[9],"WO-\\d+"),
    desc = str_flatten(input[19:21][c(1,3)]," "),
    impact = str_remove(input[23], pattern1),
    condition = str_remove(input[25],pattern2))
extract
#>        dated          wo                                    desc         impact
#> 1 2021-10-20 WO-09017974 Troubleshoot the issue with ABC and BBB No disturbance
#>   condition
#> 1 ALL is OK

The success of automation depends on the consistency of the 1K files. Assuming they have identical processing needs, a script that iterates over the PDF files and produces a data frame is feasible. Running that script on a schedule is system dependent; on Unix-like systems it could be done with a cron job.
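A minimal sketch of that iteration, not your exact setup: it assumes the same extract_text() you are already using, a placeholder folder path, and that every report lays out exactly like the sample, so the fixed line indices from my example still apply.

library(stringr)
library(lubridate)
## extract_text() comes from whatever pdf package you are already using

pdf_dir   <- "C:/Work/R/text extraction/Extration_tests"   ## placeholder folder
pdf_files <- list.files(pdf_dir, pattern = "\\.pdf$", full.names = TRUE)

rows <- lapply(pdf_files, function(f) {
  ## split the extracted text into lines, like readLines() on the txt file
  lines <- unlist(strsplit(extract_text(f), "\n"))
  data.frame(
    dated     = mdy(lines[5]),
    wo        = str_extract(lines[9], "WO-\\d+"),
    desc      = str_flatten(lines[c(19, 21)], " "),
    impact    = str_remove(lines[23], "Impact on customer. "),
    condition = str_remove(lines[25], "Condition of system upon arrival. ")
  )
})
extracted <- do.call(rbind, rows)

If the layout drifts between reports, locating lines by keyword, as sketched below, is safer.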

To avoid page references: assuming the PDF files are not too unwieldy and the content begins on the same line in each, it would just be a matter of adjusting the index references. That's probably too much to hope for, however. To trim by a start keyword and an end keyword instead, you could use the facilities of {stringr}; I would probably write a sed script myself.
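For instance, a sketch with {stringr}, reusing the input vector from above and assuming the header phrases are worded identically in every report:

library(stringr)

# find the rows by header phrase rather than by fixed position or page number
wo_line  <- str_which(input, "Our reference:")[1]
imp_line <- str_which(input, "^Impact on customer")[1]

str_extract(input[wo_line], "WO-\\d+")
#> [1] "WO-09017974"
str_remove(input[imp_line], "Impact on customer\\.\\s*")
#> [1] "No disturbance"

The indices returned by str_which() can then stand in for the hard-coded positions (5, 9, 19, ...).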

@technocrat thank you so much for your response. As you can see from my question, I am very new to string and text processing.
Yes, the 1K files have identical processing needs and are fairly consistent. I am only automating the extraction of the data; the files are downloaded, collated and stored automatically by the backend system, so I do not need to automate that step. The issue is that these are templated files with entries made manually in the field by different people, so there is some inconsistency at times.

Just to clarify, if I understand correctly: the index values in input[x] are line numbers (blank lines included), so for a given pattern the required row entries are found at those positions in input.

Also, to loop through the files, will the for loop shown below work? (Sorry for the simple question.)


input <- c("Internal", "", "Repair Report", "", "10/20/2021", "", "site1 - Site", "", "Our reference: 0 / WO-09017974 Customer reference:", "", "Report prepared by ABC Customer contact: DEF", "", "Incident Description (line 39 in the report)", "", "_", "", "Incident description:", "", "Troubleshoot the issue with", "", "ABC and BBB", "", "Impact on customer. No disturbance", "", "Condition of system upon arrival. ALL is OK", "", "_", "", "Investigation and Analysis", "", "_", "", "Action to protect the block.", "", "It was discussed to not transfer the block since the other block is", "already in PPP.", "", "Other", "", "Circumstance of the fault. No particular circumstance", "", "Premises and environment visual check.", "", "The premises are clean / The", "", "premises are well-ventilated")
pattern1 <- "Impact on customer. "
pattern2 <- "Condition of system upon arrival. "
.
.
.
patternNN = " "

pdf_files = "folder with pdf files"
for( i in pdf_files)
{

text = extract_text(i)) 

sink("extract_text.txt")
cat(text, sep = "/n")
sink()
ICRE = readLines("extract_text.txt") ##**is this the correct variable to use as show below**

    extracted[nrow(extracted)+1,] = (dated = ICRE[mdy(input[5])],
    wo = ICRE[str_extract(input[9],"WO-\\d+")],
    desc = ICRE[str_flatten(input[19:21][c(1,3)]," ")],
    impact = ICRE[str_remove(input[23], pattern1)],
    condition = ICRE[str_remove(input[25],pattern2))]
}

Or should I pass "extract_text.txt" instead of ICRE?
Sorry for asking more questions, but I am very new to text processing.
Again, thank you for your help.

Or is it that the input list you have created is the list I need to loop over and add to the DF, as shown below?


pattern1
...
patternNN

pdf_files = list.files("folder with pdf files", full.names = TRUE)
for (i in pdf_files)
{

  text = c(extract_text(i))  #### This is my list with the text, the same as input

  extracted[nrow(extracted) + 1, ] = data.frame(
    dated = mdy(input[5]),
    wo = str_extract(input[9], "WO-\\d+"),
    desc = str_flatten(input[19:21][c(1, 3)], " "),
    impact = str_remove(input[23], pattern1),
    condition = str_remove(input[25], pattern2))

}

The community would have only gossip if there were no questions and if there were only sophisticated questions, beginners wouldn't find much here to help them along. So never apologize for questions—that's why we're all here.

It will help to reframe the questions using an idea from school algebra: f(x) = y.

x is a directory of PDF files.
y is a data frame with five variables: the date, a work order, a description, an impact and a condition.

To create the data frame from pdf files, the content of each pdf file must be converted to text and the text must be parsed to identify and collect the desired content to populate the variables in the data frame.

Given the text, called input in my example, pieces of it can be selected with the index value. For example input[1] is "Internal". Some of the pieces aren't usable and require modification, which I did with functions from {stringr}.

You appear to be doing the pdf conversion with a function extract_text to produce a file extract_text.txt, which is equivalent to my input.

To automate this your loop code would work using extract_text.txt as the disk source and creating the R object ICRE with readLines. You would want to start with an empty data frame extracted.
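A minimal sketch of that empty starting frame, assuming the five fields from the example above (the column types are a guess):

extracted <- data.frame(
  dated     = as.Date(character()),   # empty Date column to start
  wo        = character(),
  desc      = character(),
  impact    = character(),
  condition = character()
)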

Eyeballing this, it should work. Come back with any other questions.

Thank you for the elaborate response.
I will get back to you with the results or questions, whichever comes first!! :slight_smile: :slight_smile:


@technocrat Hi, so I have put together a rough solution that goes something like this.
I am stuck at the final step.

The comments in the code explain each section:

for (i in 1:length(LOF))
{
  # print(LOF[i])

  ### Reading the pdf files one by one
  text_extracted = extract_text(paste0("C:/Work/R/text extraction/Extration_tests/WO/", LOF[i]))

  ### Writing the extracted text to a txt file and reading it back line by line with readLines
  sink("Looped_WO_OP_Txt.txt")
  cat(PDFTOTXT, sep = c("\n"))
  sink()

  LoopedTxtLineNos = readLines("Looped_WO_OP_Txt.txt")

  ## I have a global list of patterns (numbered 1 to 10, from the reports,
  ## similar to what was listed in the previous discussion).
  ## Getting the line number of each of those patterns
  LnoWO    = grep(WOpattern, LoopedTxtLineNos)
  LnoDate  = grep(Datepattern, LoopedTxtLineNos)
  LnoPat1  = grep(pattern1, LoopedTxtLineNos)
  LnoPat2  = grep(pattern2, LoopedTxtLineNos)
  LnoPat3  = grep(pattern3, LoopedTxtLineNos)
  LnoPat4  = grep(pattern4, LoopedTxtLineNos)
  LnoPat5  = grep(pattern5, LoopedTxtLineNos)
  LnoPat6  = grep(pattern6, LoopedTxtLineNos)
  LnoPat7  = grep(pattern7, LoopedTxtLineNos)
  LnoPat8  = grep(pattern8, LoopedTxtLineNos)
  LnoPat9  = grep(pattern9, LoopedTxtLineNos)
  LnoPat10 = grep(pattern10, LoopedTxtLineNos)
  LnoPat11 = grep(pattern11, LoopedTxtLineNos)

  ##### The reports are not all the same and the required fields
  ##### do not always span the same number of lines,
  ##### so the number of lines to extract is derived
  ##### from the index difference between two patterns.
  ##### EX: if pattern 1 starts at line 1 and pattern 2 at line 5,
  ##### then seqpat1 contains the offsets of the lines to cover.

  ###### Extracting text to add to the data frame
  seqpat1  = seq(0, (LnoPat2[1] - 1) - LnoPat1[1], 1)
  seqpat2  = seq(0, (LnoPat3[1] - 1) - LnoPat2[1], 1)
  seqpat3  = seq(0, (LnoPat4[1] - 1) - LnoPat3[1], 1)
  seqpat4  = seq(0, (LnoPat5[1] - 2) - LnoPat4[1], 1)
  seqpat5  = seq(0, LnoPat5[1] - LnoPat4[1], 1)
  seqpat6  = seq(0, (LnoPat6[1] - 1) - LnoPat5[1], 1)
  seqpat7  = seq(0, (LnoPat8[1] - 1) - LnoPat7[1], 1)
  seqpat8  = seq(0, (LnoPat9[1] - 1) - LnoPat8[1], 1)
  seqpat9  = seq(0, (LnoPat10[1] - 1) - LnoPat9[1], 1)
  seqpat10 = seq(0, (LnoPat11[1] - 1) - LnoPat10[1], 1)

  ##### Based on this, adding the text to a list, to be appended to the DF afterwards
  temp_list[[i]] = c(
    str_extract(LoopedTxtLineNos[[LnoWO]], "WO-\\d+"),
    LoopedTxtLineNos[[LnoDate + 1]],
    str_remove(str_flatten(LoopedTxtLineNos[LnoPat1[1] + seqpat1]), pattern1),
    str_remove(str_flatten(LoopedTxtLineNos[LnoPat2[1] + seqpat2]), pattern2),
    str_remove(str_flatten(LoopedTxtLineNos[(LnoPat3[1] + seqpat3)][-c(4)]), pattern3),
    str_remove(str_flatten(LoopedTxtLineNos[LnoPat4[1] + seqpat4]), pattern4),
    str_remove(str_flatten(LoopedTxtLineNos[LnoPat5[1] + seqpat5]), pattern5),
    str_remove(str_flatten(LoopedTxtLineNos[LnoPat6[1] + seqpat6 - 1]), pattern6),
    str_remove(str_flatten(LoopedTxtLineNos[LnoPat7 + seqpat7]), pattern7),
    str_remove(str_flatten(LoopedTxtLineNos[LnoPat8 + seqpat8]), pattern8),
    str_remove(str_flatten(LoopedTxtLineNos[LnoPat9 + seqpat9 - 1]), pattern9),
    str_remove(str_flatten(LoopedTxtLineNos[LnoPat10 + seqpat10 - 1]), pattern10)
  )
  # ## Loop_extracted_WO = rbind(Loop_extracted_WO, temp_extracted)
}
Loop_extracted_WO <- do.call(rbind, temp_list)

When I run this loop, only the last PDF extracted is entered into the DF. So if I am extracting from 15 files, all 15 entries in the DF are from the last PDF extracted.
The loop itself runs: when I print the file names to the console, all 15 file names are printed. But when appending the data to the DF, only the last PDF is appended.
Even the list temp_list has stored only the last PDF that was extracted.

Can you please point out what the error is? (Basically my loop does not work; how can I fix it? I am sure it is a simple mistake!)

I think I found the mistake as well; it was silly:

          sink("Looped_WO_OP_Txt.txt")
           cat(PDFTOTXT, sep = c("\n"))
           sink()

I was passing the wrong object here; I should have been passing text_extracted instead of PDFTOTXT.
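That is, the corrected lines inside the loop become:

sink("Looped_WO_OP_Txt.txt")
cat(text_extracted, sep = "\n")   ## write this iteration's extracted text, not the stale PDFTOTXT object
sink()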

I think the solution is working.
Thanks for your help @technocrat
