I need to extract a specific section from 10-K filings in PDF format. The documents have a table of contents that lists the section names.
I am interested in the text between "Item 1A. Risk Factors" and "Item 1B. Unresolved Staff Comments":
......Parts not interested in.......
Item 1A. Risk Factors
.....text I am interested in.......
Item 1B. Unresolved Staff Comments
..........Parts not interested in............
I have the following code:
CapitalOne_files <- c("https://investor.capitalone.com/static-files/881eed87-5cc8-4950-bc33-7324f61b6dfe",
                      "https://investor.capitalone.com/static-files/03bd6db7-5ddc-4711-a307-efc7fe1b9748",
                      "https://investor.capitalone.com/static-files/d6af3768-9a98-4b25-8a2e-21922eec370e",
                      "https://investor.capitalone.com/static-files/1fccb8d3-10db-48e5-abea-f44179724b49")
CapitalOne_l <- lapply(CapitalOne_files, function(url) {
  # print status message
  message("processing: ", basename(url))
  # read the PDF (one string per page) and split it into individual lines
  lines <- unlist(stringr::str_split(pdftools::pdf_text(url), "\n"))
  # line numbers matching the section headings (dots escaped so they match literally)
  start <- stringr::str_which(lines, "Item 1A\\.\\s+Risk Factors")
  end <- stringr::str_which(lines, "Item 1B\\.\\s+Unresolved Staff Comments")
  # cover a few different outcomes depending on what was found
  if (length(start) == 1 && length(end) == 1) {
    relevant <- lines[start:end]
  } else if (length(start) == 0 || length(end) == 0) {
    relevant <- "Pattern not found"
  } else {
    relevant <- "Problems found"
  }
  return(relevant)
})
names(CapitalOne_l) <- basename(CapitalOne_files)
sapply(CapitalOne_l, head)
Unfortunately, the output is "Problems found", which means that at least one of the two patterns matched more than one line.
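To illustrate what I suspect is happening, here is a small sketch on made-up lines, not the real filing. The `toy_lines` vector is purely my assumption about the rough layout (table of contents first, then the body), and I use base R `grep()` instead of `stringr::str_which()` only so that it runs without extra packages.

```r
# Hypothetical lines mimicking a filing: table of contents, then the body.
toy_lines <- c(
  "Table of Contents",
  "Item 1A.   Risk Factors ........ 20",
  "Item 1B.   Unresolved Staff Comments ........ 35",
  "Item 1. Business",
  "Item 1A. Risk Factors",
  "We face intense competition.",
  "Item 1B. Unresolved Staff Comments",
  "None."
)

# Same patterns as in my code; grep() returns the indices of matching lines.
start <- grep("Item 1A\\.\\s+Risk Factors", toy_lines)
end <- grep("Item 1B\\.\\s+Unresolved Staff Comments", toy_lines)

# Show how many lines matched and what they look like.
print(length(start))
print(toy_lines[start])
print(length(end))
print(toy_lines[end])
```

With more than one match for each pattern, neither of the first two branches applies, so the code falls through to the final `else` branch.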
One of the needed files can be found here: https://investor.capitalone.com/static-files/d6af3768-9a98-4b25-8a2e-21922eec370e.
I think my start pattern produces multiple matches, probably one in the table of contents and one in the body of the document. Do you know how I can specify which of those matches is the correct one?
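One idea I had, as a rough sketch only: pair every start match with the first end match that follows it and keep the pair that spans the most lines, on the assumption that the table of contents entries sit right next to each other while the real section is long. The helper `pick_section()` and the `toy_lines` data are hypothetical names I made up for illustration, and I have not checked this heuristic against the real filings.

```r
# Hypothetical helper: choose the start/end pair with the longest span.
# `start` and `end` are line indices as returned by grep() or str_which().
pick_section <- function(lines, start, end) {
  best <- NULL
  for (s in start) {
    later_ends <- end[end > s]          # end matches located after this start
    if (length(later_ends) == 0) next   # no closing heading for this start
    e <- min(later_ends)                # nearest closing heading
    if (is.null(best) || (e - s) > (best[2] - best[1])) {
      best <- c(s, e)
    }
  }
  if (is.null(best)) return("Pattern not found")
  lines[best[1]:best[2]]
}

# Made-up lines: table of contents entries first, then the actual section.
toy_lines <- c(
  "Table of Contents",
  "Item 1A.   Risk Factors ........ 20",
  "Item 1B.   Unresolved Staff Comments ........ 35",
  "Item 1. Business",
  "Item 1A. Risk Factors",
  "We face intense competition.",
  "Item 1B. Unresolved Staff Comments",
  "None."
)

start <- grep("Item 1A\\.\\s+Risk Factors", toy_lines)
end <- grep("Item 1B\\.\\s+Unresolved Staff Comments", toy_lines)
relevant <- pick_section(toy_lines, start, end)
print(relevant)
```

I am not sure this is robust, though: a cross-reference to "Item 1A. Risk Factors" somewhere before the real heading would produce an even longer span and would be picked by mistake.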
Could someone help me with that?