Parsing parts of 10-K reports with rm_between regex

Hey there!
I'm trying to extract a specific part of the text from a 10-K report using rm_between, but I can't get the pattern right. The problem is that the title of the part is also mentioned inside other parts of the text, so rm_between extracts the wrong data. The edgar package does have a command for this, but I'd like to use rm_between.
Example:

Item 7. Management s Discussion and Analysis of Financial Condition and Results of Operations

Text I want to extract

Item 7A. Quantitative and Qualitative Disclosures About Market Risk

Different Text referring to Item 7. Management's

Hi!

To help us help you, could you please prepare a reproducible example (reprex) illustrating your issue? Please have a look at this guide to see how to create one:


Short Version

You can share your data in a forum friendly way by passing the data to share to the dput() function.
If your data is too large you can use standard methods to reduce it before sending to dput().
When you share the dput() text that represents your data, please put triple backticks on the line before your code begins and on the line after it ends, so that it is formatted appropriately.

```
( example_df <- structure(list(Sepal.Length = c(5.1, 4.9, 4.7, 4.6, 5, 5.4, 4.6, 
5, 4.4, 4.9), Sepal.Width = c(3.5, 3, 3.2, 3.1, 3.6, 3.9, 3.4, 
3.4, 2.9, 3.1), Petal.Length = c(1.4, 1.4, 1.3, 1.5, 1.4, 1.7, 
1.4, 1.5, 1.4, 1.5), Petal.Width = c(0.2, 0.2, 0.2, 0.2, 0.2, 
0.4, 0.3, 0.2, 0.2, 0.1), Species = structure(c(1L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, 1L), .Label = c("setosa", "versicolor", "virginica"
), class = "factor")), row.names = c(NA, -10L), class = c("tbl_df", 
"tbl", "data.frame")))
```

Sorry for the inconvenience, I thought it was obvious from the example. A complete report can be downloaded via the link; showing only an excerpt would be confusing. I'm trying to extract the Management Discussion specifically. The problem is that the texts often contain references to this section, so rm_between extracts multiple passages when given the boundaries "Item 7." and "Item 8.". My code so far, including the pattern used, is below.

library(dplyr)
library(rvest)
library(tidyverse)   # attaches stringr, needed for str_extract()
library(qdapRegex)   # provides rm_between()

setwd("C:/Users/richard dobler/OneDrive/Desktop/QTR1/neu")


files <- list.files(path = ".", recursive = TRUE,pattern = "\\.txt$", full.names = TRUE)

#create d.f.
df1 <- data.frame(document=files, 
                  accession.number=str_extract(files, pattern = "[^_]+(?=\\..+$)"),
                  text = sapply(files, FUN = function(x)readChar(x, file.info(x)$size)),
                  stringsAsFactors=FALSE)


#First cleaning (all to lower cases)
df1$text <- tolower(df1$text)
df1$text <- gsub("\r?\n|\r", " ", df1$text)

# Extraction of item 7
df1$item_7_sub <- rm_between(df1$text,
                             "item\\s7\\.\\s",
                             "item\\s8\\.\\s",
                             fixed = FALSE, trim = TRUE, clean = TRUE, extract = TRUE,
                             include.markers = FALSE, merge = TRUE)
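
The over-matching can also be reproduced on a toy string, without downloading a filing. This is only a minimal sketch in base R: the text is made up, and the lazy look-around pattern is an assumption standing in for what rm_between does with these two markers (I have not checked its internals).

```r
# Toy text, invented for illustration (not taken from a real filing)
toy <- paste(
  "table of contents item 7. management's discussion 20 item 8. financial statements 40",
  "part ii item 7. management's discussion and analysis text i want to extract",
  "item 7a. market risk see item 7. management's discussion above",
  "item 8. financial statements",
  sep = " "
)

# Everything between "item 7. " and the next "item 8. ", matched lazily
# (assumed equivalent of the two rm_between markers; perl = TRUE enables look-arounds)
pattern <- "(?<=item\\s7\\.\\s).*?(?=item\\s8\\.\\s)"
hits <- regmatches(toy, gregexpr(pattern, toy, perl = TRUE))[[1]]

length(hits)  # more than one passage comes back
hits          # the table-of-contents entry, then the real section plus item 7a
```

On this toy text the first passage is the table-of-contents entry, and the second one contains the section I want but runs on through item 7a until it reaches "item 8.".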

Even using your initial example as you presented it, I don't see conceptually how you could split out what you say you want, except perhaps if the structure is interpretable via the line breaks, à la:

library(stringr)

somexampletext <- "Item 7. Management s Discussion and Analysis of Financial Condition and Results of Operations

Text I want to extract

Item 7A. Quantitative and Qualitative Disclosures About Market Risk"

str_split_fixed(somexampletext,"\n",str_count(somexampletext,"\n"))[3]

I glanced at one of your .txt files, and it seems entirely unstructured though.
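
If some filings do keep their headings on separate lines, one possible heuristic is to trust "Item 7." only when it starts a line, and to keep the longest candidate so that table-of-contents entries lose out. The following is a rough sketch under those assumptions, not a tested solution for real 10-K files: extract_item7 is a hypothetical helper name, the sample text is invented, and it has to run on the raw text before the line breaks are replaced by spaces.

```r
# Hypothetical helper: returns the text between a line starting with "Item 7."
# and the next line starting with "Item 7A." or "Item 8.".
# Assumes section headings sit at the start of their own lines.
extract_item7 <- function(text) {
  lines <- trimws(strsplit(text, "\r?\n|\r")[[1]])
  low <- tolower(lines)
  starts <- grep("^item\\s+7\\.", low)        # "item 7a." does not match here
  ends <- grep("^item\\s+(7a|8)\\.", low)
  best <- ""
  for (s in starts) {
    e <- ends[ends > s]
    if (length(e) == 0) next
    body <- lines[seq_len(e[1] - s - 1) + s]  # lines strictly between the headings
    candidate <- paste(body[nzchar(body)], collapse = " ")
    # Keep the longest candidate; table-of-contents entries are usually short
    if (nchar(candidate) > nchar(best)) best <- candidate
  }
  best
}

# Invented sample with a table of contents and an in-text cross-reference
filing <- "Table of Contents
Item 7. Management's Discussion and Analysis 20
Item 7A. Quantitative and Qualitative Disclosures About Market Risk 35
Item 8. Financial Statements 40

Item 7. Management's Discussion and Analysis of Financial Condition and Results of Operations

Text I want to extract

Item 7A. Quantitative and Qualitative Disclosures About Market Risk

Different text referring to Item 7. Management's Discussion

Item 8. Financial Statements"

extract_item7(filing)
```

The cross-reference in the second-to-last paragraph is ignored because it does not start its line. For a file without usable line structure this heuristic has nothing to hold on to, so it would need a different signal.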
