@technocrat Hi, I have put together a rough solution that goes something like this.
I am stuck at the final step.
The comments in the code explain each section in detail.
for (i in 1:length(LOF))
{
#print((LOF[i]))
### Reading pdf files one by one
text_extracted =(extract_text(paste(file = "C:/Work/R/text extraction/Extration_tests/WO/",LOF[i], sep = '')))
### Creating a txt file to extract text and add line numbers to it using **readLines**
sink("Looped_WO_OP_Txt.txt")
cat(PDFTOTXT, sep = c("\n"))
sink()
LoopedTxtLineNos = readLines("Looped_WO_OP_Txt.txt")
## I have a global list of patterns (numbered 1 to 10, from the reports,
## similar to what was listed in the previous discussion).
## Getting the line numbers for each of those patterns:
########### getting line number of pattern
LnoWO = grep(WOpattern, LoopedTxtLineNos)
LnoDate = grep(Datepattern, LoopedTxtLineNos)
LnoPat1 = grep(pattern1, LoopedTxtLineNos)
LnoPat2 = grep(pattern2, LoopedTxtLineNos)
LnoPat3 = grep(pattern3, LoopedTxtLineNos)
LnoPat4 = grep(pattern4, LoopedTxtLineNos)
LnoPat5 = grep(pattern5, LoopedTxtLineNos)
LnoPat6 = grep(pattern6, LoopedTxtLineNos)
LnoPat7 = grep(pattern7, LoopedTxtLineNos)
LnoPat8 = grep(pattern8, LoopedTxtLineNos)
LnoPat9 = grep(pattern9, LoopedTxtLineNos)
LnoPat10 = grep(pattern10, LoopedTxtLineNos)
LnoPat11 = grep(pattern11, LoopedTxtLineNos)
##### The reports are not all the same, and the required fields
##### do not always span the same number of lines.
###### So I work out how many lines to extract
###### from the index numbers between two patterns.
###### EX: if pattern 1 starts at line 1 and pattern 2 at line 5,
###### then seqpat will contain the line offsets to cover (0 to 3).
###### Extracting text to add to data frame
seqpat1 = seq(0, (((LnoPat2[1]-1)-(LnoPat1[1]))),1)
seqpat2 = seq(0, (((LnoPat3[1]-1)-(LnoPat2[1]))),1)
seqpat3 = seq(0, (((LnoPat4[1]-1)-(LnoPat3[1]))),1)
seqpat4 = seq(0, (((LnoPat5[1]-2)-(LnoPat4[1]))),1)
seqpat5 = seq(0, (((LnoPat5[1])-(LnoPat4[1]))),1)
seqpat6 = seq(0, (((LnoPat6[1]-1)-(LnoPat5[1]))),1)
seqpat7 = seq(0, (((LnoPat8[1]-1)-(LnoPat7[1]))),1)
seqpat8 = seq(0, (((LnoPat9[1]-1)-(LnoPat8[1]))),1)
seqpat9 = seq(0, (((LnoPat10[1]-1)-(LnoPat9[1]))),1)
seqpat10 = seq(0, (((LnoPat11[1]-1)-(LnoPat10[1]))),1)
#####Based on this I am trying to add the text to a list, and then append to the DF
temp_list[[i]]= c( str_extract(LoopedTxtLineNos[[LnoWO]],"WO-\\d+"),
(LoopedTxtLineNos[[LnoDate+1]]),
str_remove(str_flatten(LoopedTxtLineNos[(LnoPat1[1]+seqpat1)]),pattern1),
str_remove(str_flatten(LoopedTxtLineNos[(LnoPat2[1]+seqpat2)]),pattern2),
str_remove(str_flatten(LoopedTxtLineNos[(LnoPat3[1])+seqpat3][-c(4)]),pattern3),
str_remove((str_flatten(LoopedTxtLineNos[(LnoPat4[1])+seqpat4])), pattern4),
str_remove((str_flatten(LoopedTxtLineNos[(LnoPat5[1]+seqpat5)])), (pattern5)),
str_remove((str_flatten(LoopedTxtLineNos[(LnoPat6[1]+seqpat6-1)])), pattern6),
str_remove((str_flatten(LoopedTxtLineNos[(LnoPat7)+seqpat7])), (pattern7)),
str_remove((str_flatten(LoopedTxtLineNos[(LnoPat8)+seqpat8])), (pattern8)),
str_remove((str_flatten(LoopedTxtLineNos[(LnoPat9)+seqpat9-1])), (pattern9)),
str_remove((str_flatten(LoopedTxtLineNos[(LnoPat10)+seqpat10-1])), (pattern10))
)
# ##Loop_extracted_WO = rbind(Loop_extracted_WO, temp_extracted)
}
Loop_extracted_WO <- do.call(rbind, temp_list)
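The "lines between two patterns" step in the comments above can be sketched as a small helper. This is only an illustration under assumptions: `lines_between` and the sample `report` lines are hypothetical names and data, not part of the original script, and the lines are joined with a space rather than flattened with no separator. It uses base R only and returns `NA` when a pattern is missing or out of order, a case where `seq(0, n, 1)` would otherwise fail.

```r
# Hypothetical helper: return the text from the first match of `start_pat`
# up to (but not including) the first match of `end_pat`.
lines_between <- function(lines, start_pat, end_pat) {
  start <- grep(start_pat, lines)[1]
  end   <- grep(end_pat, lines)[1]
  # Guard against a missing pattern or patterns in the wrong order.
  if (is.na(start) || is.na(end) || end <= start) {
    return(NA_character_)
  }
  block <- lines[start:(end - 1)]
  # Collapse the block into one string and drop the label itself.
  trimws(sub(start_pat, "", paste(block, collapse = " ")))
}

# Made-up report lines, for illustration only.
report <- c("Customer: ACME", "Ltd.", "Address: 1 Main St",
            "Springfield", "Notes: none")

customer <- lines_between(report, "Customer:", "Address:")
address  <- lines_between(report, "Address:", "Notes:")
missing  <- lines_between(report, "Phone:", "Notes:")
```

With a helper like this, each field becomes one call instead of a separate `seqpat` vector plus an indexing expression, which may make per-report differences easier to spot.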
When I run this loop, only the last PDF extracted ends up in the DF. So if I am extracting from 15 files, all 15 entries in the DF come from the last PDF.
The loop itself runs: when I print the file names to the console, all 15 file names are printed. But when the data is appended to the DF, only the last PDF is appended.
Even the list temp_list holds only the last PDF that was extracted.
Can you please point out what the error is? Basically my loop does not work; how can I fix it? I am sure it is a simple mistake!
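For reference, here is a minimal sketch of the collect-then-bind pattern the loop is aiming for, reduced to base R. Everything in it is an assumption made for illustration: `fake_files` stands in for the PDF extraction step, and `results` and `extracted` are hypothetical names. Two things it does that may be worth comparing against the loop above: every step inside the loop uses the value read in that same iteration (the loop above assigns `text_extracted` but writes `PDFTOTXT` to the text file), and `do.call()` receives the function first and the list second.

```r
# Stand-in for the PDF step: each "file" is just a character vector of lines.
# These inputs are made up for illustration.
fake_files <- list(
  a.pdf = c("WO-101", "Date", "2020-01-01"),
  b.pdf = c("WO-202", "Date", "2020-02-02")
)

# Create the result list before the loop, one slot per file.
results <- vector("list", length(fake_files))

for (i in seq_along(fake_files)) {
  # Use the value read in THIS iteration for everything that follows.
  lines_i <- fake_files[[i]]
  wo   <- regmatches(lines_i, regexpr("WO-\\d+", lines_i))[1]
  date <- lines_i[grep("Date", lines_i)[1] + 1]
  results[[i]] <- c(file = names(fake_files)[i], wo = wo, date = date)
}

# do.call() takes the function first and the list of arguments second.
extracted <- as.data.frame(do.call(rbind, results), stringsAsFactors = FALSE)
print(extracted)
```

If a sketch like this yields one distinct row per input while the real loop does not, the difference is likely in how the per-file text reaches `readLines()`, not in the list or the final `rbind`.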