How to extract specific parts of PDFs in R?

I need to extract a specific part of 10-K filings that are stored as PDF documents. Each document has a table of contents that lists the section names.

I am interested in the section between Item 1A. Risk Factors and Item 1B. Unresolved Staff Comments:

......Parts not interested in.......

Item 1A. Risk Factors

.....text I am interested in.......

Item 1B. Unresolved Staff Comments

..........Parts not interested in............

I have the following code:

CapitalOne_files <- c("https://investor.capitalone.com/static-files/881eed87-5cc8-4950-bc33-7324f61b6dfe",
                      "https://investor.capitalone.com/static-files/03bd6db7-5ddc-4711-a307-efc7fe1b9748",
                      "https://investor.capitalone.com/static-files/d6af3768-9a98-4b25-8a2e-21922eec370e",
                      "https://investor.capitalone.com/static-files/1fccb8d3-10db-48e5-abea-f44179724b49")

CapitalOne_l <- lapply(CapitalOne_files, function(file) {

  # print status message
  message("processing: ", basename(file))

  # pdf_text() returns one string per page; split the pages into single lines
  lines <- unlist(stringr::str_split(pdftools::pdf_text(file), "\n"))
  start <- stringr::str_which(lines, "Item 1A.\\s+Risk Factors")
  end <- stringr::str_which(lines, "Item 1B.\\s+Unresolved Staff Comments")

  # cover a few different outcomes depending on what was found
  if (length(start) == 1 && length(end) == 1) {
    relevant <- lines[start:end]
  } else if (length(start) == 0 || length(end) == 0) {
    relevant <- "Pattern not found"
  } else {
    relevant <- "Problems found"
  }

  return(relevant)
})

names(CapitalOne_l) <- basename(CapitalOne_files)
sapply(CapitalOne_l, head)

Unfortunately, the output is "Problems found".

One of the needed files can be found here: https://investor.capitalone.com/static-files/d6af3768-9a98-4b25-8a2e-21922eec370e.

I think my start pattern produces multiple matches. Do you know how I can specify which of those matches is the correct one?

Could someone help me with that?

Hi @marcia,

I think the problem is that Item 1A. Risk Factors and Item 1B. Unresolved Staff Comments appear several times within each document, at least in the table of contents and in the body of the document, and sometimes within a sentence.
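
To see why the original code ends up in the "Problems found" branch, it can help to print every hit together with its line number. The following is a minimal sketch on a made-up `lines` vector (hypothetical text, not taken from the actual filings) that stands in for the result of splitting `pdf_text()` output:

```r
library(stringr)

# Hypothetical lines standing in for unlist(str_split(pdf_text(file), "\n"))
lines <- c(
  "Item 1A. Risk Factors 19",                        # table of contents entry
  "Item 1B. Unresolved Staff Comments 33",           # table of contents entry
  "See \"Item 1A. Risk Factors\" for more detail.",  # mention inside a sentence
  "Item 1A. Risk Factors",                           # actual section heading
  "Our business is subject to many risks.",
  "Item 1B. Unresolved Staff Comments",              # actual section heading
  "None."
)

# Same unanchored patterns as in the question
start_hits <- str_which(lines, "Item 1A.\\s+Risk Factors")
end_hits   <- str_which(lines, "Item 1B.\\s+Unresolved Staff Comments")

# List every hit with its line number to see which one is the real heading
data.frame(line_no = start_hits, text = lines[start_hits])
```

With more than one hit for either pattern, `length(start) == 1 && length(end) == 1` is false, so the code falls through to "Problems found".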

We can assume that what makes these expressions a section title (as opposed to a mention within a sentence) is that they are the only thing on their line, meaning the line starts and ends with the expression. In regex, you express this with the caret (^) and the dollar sign ($):

  start <- stringr::str_which(lines, "^Item 1A\\.\\s+Risk Factors$")
  end <- stringr::str_which(lines, "^Item 1B\\.\\s+Unresolved Staff Comments$")
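
As a rough illustration of the anchoring idea, here is a sketch on a made-up `lines` vector (hypothetical text, not from the filings). It also assumes that extracted lines may carry leading or trailing spaces, so the pattern allows optional whitespace next to the anchors; that tolerance is my addition and may not be needed for every file:

```r
library(stringr)

# Hypothetical lines; the third one is the heading, padded with spaces
lines <- c(
  "Item 1A. Risk Factors 19",
  "See \"Item 1A. Risk Factors\" for more detail.",
  "   Item 1A. Risk Factors   ",
  "Our business is subject to many risks.",
  "Competition could hurt our results.",
  "Item 1B. Unresolved Staff Comments",
  "None."
)

# Anchored patterns: the heading must be the only content on its line
start <- str_which(lines, "^\\s*Item 1A\\.\\s+Risk Factors\\s*$")
end   <- str_which(lines, "^\\s*Item 1B\\.\\s+Unresolved Staff Comments\\s*$")

if (length(start) == 1 && length(end) == 1 && start < end) {
  # Keep only the lines strictly between the two headings
  relevant <- lines[seq(start + 1, length.out = end - start - 1)]
} else {
  relevant <- character(0)
}
relevant
```

The table of contents entry and the in-sentence mention no longer match, because neither of them is a line consisting of the heading alone.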

@marcia, I think one issue may be the split performed right after the document is read in. Another may be the regex used to extract the section, especially since the section spans multiple paragraphs.

Below is my attempt at a solution.

CapitalOne_files <- as.list(
  c("https://investor.capitalone.com/static-files/881eed87-5cc8-4950-bc33-7324f61b6dfe",
    "https://investor.capitalone.com/static-files/03bd6db7-5ddc-4711-a307-efc7fe1b9748",
    "https://investor.capitalone.com/static-files/d6af3768-9a98-4b25-8a2e-21922eec370e",
    "https://investor.capitalone.com/static-files/1fccb8d3-10db-48e5-abea-f44179724b49"))

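extract_section <- function(file_path) {
  # pdf_text() returns one character string per page
  a <- pdftools::pdf_text(file_path)
  # per page, keep the text from the Item 1A heading to the end of the page;
  # pages without the heading yield NA and are dropped
  b <- data.frame(item = stringr::str_extract(a, "Item 1A\\. Risk Factors(\\n+.+)+")) |>
    na.omit()
  # remove the Item 1B heading and everything after it
  b$item <- stringr::str_replace(b$item,
                                 pattern = "Item 1B\\. Unresolved Staff Comments(\\n+.+)+",
                                 replacement = "")
  # split the remaining text into one row per line
  b <- tidytext::unnest_lines(b, input = item, output = perline)
  return(b)
}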

Matched <- lapply(CapitalOne_files,extract_section)

Essentially, the workflow is similar to yours, with a few exceptions. The files are read in page by page, and each page is reduced to the text that follows Item 1A. We then remove everything from Item 1B onward. Finally, the data frame is unnested into one row per line using the tidytext package. Let us know if it works.
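
To make the steps of this workflow easier to follow, here is a small sketch on made-up `pages` (hypothetical text standing in for the `pdf_text()` output). It swaps `tidytext::unnest_lines()` for a plain `str_split()` so that it only needs stringr; the rest follows the same extract-then-replace idea:

```r
library(stringr)

# Hypothetical pages standing in for pdftools::pdf_text(file_path)
pages <- c(
  "Cover page\nAnnual report",
  "Item 1A. Risk Factors\nRisk one.\nRisk two.\nItem 1B. Unresolved Staff Comments\nNone.",
  "Item 2. Properties\nOffices."
)

# Per page, keep the text from the Item 1A heading to the end of that page
item <- str_extract(pages, "Item 1A\\. Risk Factors(\\n+.+)+")
item <- item[!is.na(item)]  # pages without the heading return NA

# Drop the Item 1B heading and everything after it on the same page
item <- str_replace(item, "Item 1B\\. Unresolved Staff Comments(\\n+.+)+", "")

# One element per line (stand-in for tidytext::unnest_lines)
per_line <- unlist(str_split(str_trim(item), "\\n+"))
per_line
```

In this toy input the whole section sits on a single page. Because the extraction works page by page, pages that continue the section without repeating the Item 1A heading would return NA and be dropped, so the result is worth checking against a real filing.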