Parsing parts of 10-K reports with rm_between regex

ricdob · September 26, 2022, 1:23pm

Hey there!
I'm trying to extract a specific part of the text from a 10-K report using rm_between but can't get the pattern right. The problem is that the title of the section is also mentioned inside other sections of the text, so rm_between extracts the wrong data. The edgar package does have a command for this, but I'd like to use rm_between.
Example:

Item 7. Management s Discussion and Analysis of Financial Condition and Results of Operations

Text I want to extract

Item 7A. Quantitative and Qualitative Disclosures About Market Risk

Different Text referring to Item 7. Management's

nirgrahamuk · September 26, 2022, 5:30pm

Hi!

To help us help you, could you please prepare a reproducible example (reprex) illustrating your issue? Please have a look at this guide, to see how to create one:

FAQ: How to do a minimal reproducible example (reprex) for beginners


Short Version

You can share your data in a forum-friendly way by passing the data you want to share to the dput() function.
If your data is too large, you can use standard methods to reduce it before sending it to dput().
When you share the dput() output that represents your data, please put triple backticks on the line before and the line after it so that the post is formatted properly.

```
( example_df <- structure(list(Sepal.Length = c(5.1, 4.9, 4.7, 4.6, 5, 5.4, 4.6, 
5, 4.4, 4.9), Sepal.Width = c(3.5, 3, 3.2, 3.1, 3.6, 3.9, 3.4, 
3.4, 2.9, 3.1), Petal.Length = c(1.4, 1.4, 1.3, 1.5, 1.4, 1.7, 
1.4, 1.5, 1.4, 1.5), Petal.Width = c(0.2, 0.2, 0.2, 0.2, 0.2, 
0.4, 0.3, 0.2, 0.2, 0.1), Species = structure(c(1L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, 1L), .Label = c("setosa", "versicolor", "virginica"
), class = "factor")), row.names = c(NA, -10L), class = c("tbl_df", 
"tbl", "data.frame")))
```
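As a rough sketch of the workflow described above, the snippet below shrinks a data frame and prints its dput() representation. The built-in iris data and the 10-row cutoff are only illustrative choices, not anything taken from the original question.

```r
# Take a small subset that is still large enough to reproduce the issue
small_df <- head(iris, 10)

# Print an R expression that recreates small_df; paste this output into the post
dput(small_df)

# A reader can rebuild the data by evaluating the pasted text
dput_text  <- paste(capture.output(dput(small_df)), collapse = "\n")
rebuilt_df <- eval(parse(text = dput_text))
```

Evaluating the pasted text should give readers the same data frame without them needing access to the original files.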

ricdob · September 27, 2022, 8:47am

Sorry for the inconvenience, I thought it was obvious from the example. A complete report can be downloaded via the link below; showing only an excerpt would be confusing. I'm trying to extract specifically the Management Discussion. The problem is that the texts often contain references to this section, so rm_between extracts multiple text passages between the boundaries "Item 7." and "Item 8.". Attached is my code so far, including the pattern used.
https://seafile.zfn.uni-bremen.de/d/4c589adfd818423a930f/

library(dplyr)
library(rvest)
library(tidyverse)
library(qdapRegex)  # provides rm_between()

setwd("C:/Users/richard dobler/OneDrive/Desktop/QTR1/neu")

# Collect all .txt filings below the working directory
files <- list.files(path = ".", recursive = TRUE, pattern = "\\.txt$", full.names = TRUE)

# Create the data frame: one row per filing
df1 <- data.frame(document = files,
                  accession.number = str_extract(files, pattern = "[^_]+(?=\\..+$)"),
                  text = sapply(files, FUN = function(x) readChar(x, file.info(x)$size)),
                  stringsAsFactors = FALSE)

# First cleaning: lower-case everything and replace line breaks with spaces
df1$text <- tolower(df1$text)
df1$text <- gsub("\r?\n|\r", " ", df1$text)

# Extraction of item 7: everything between "item 7. " and "item 8. "
df1$item_7_sub <- rm_between(df1$text,
                             "item\\s7\\.\\s",
                             "item\\s8\\.\\s",
                             fixed = FALSE, trim = TRUE, clean = TRUE, extract = TRUE,
                             include.markers = FALSE, merge = TRUE)
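
To see why a pattern like the one above can return more than one passage, here is a small sketch on a made-up string. The toy text, including the table-of-contents entries, is an assumption and not taken from a real filing. The lazy lookaround regex is a base-R stand-in for the rm_between call, not rm_between itself, so details such as trimming may differ.

```r
# Hypothetical toy text, already lower-cased and flattened to one line like df1$text
toy <- paste(
  "table of contents item 7. management s discussion 25 item 8. financial statements 40",
  "item 7. management s discussion and analysis text i want to extract",
  "item 7a. quantitative disclosures item 8. financial statements"
)

# Base-R stand-in: lazily capture everything between "item 7. " and "item 8. "
pattern  <- "(?<=item\\s7\\.\\s).*?(?=item\\s8\\.\\s)"
passages <- regmatches(toy, gregexpr(pattern, toy, perl = TRUE))[[1]]

# Every "item 7. " ... "item 8. " pair yields its own passage
length(passages)
passages
```

In this toy case the first passage comes from the table of contents and only the second one contains the wanted text, which matches the behaviour described in the post.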

nirgrahamuk · September 27, 2022, 9:40am

Even with your initial example as you presented it, I don't see conceptually how you could split the text the way you say you want to, except perhaps if the structure can be inferred from the line breaks, à la

library(stringr)

somexampletext <- "Item 7. Management s Discussion and Analysis of Financial Condition and Results of Operations

Text I want to extract

Item 7A. Quantitative and Qualitative Disclosures About Market Risk"

str_split_fixed(somexampletext,"\n",str_count(somexampletext,"\n"))[3]

I glanced at one of your .txt files, and it seems entirely unstructured though.
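
For files where each item heading does sit at the start of its own line (an assumption that may not hold for these filings), one possible sketch is to keep the line breaks and treat only lines that begin with the item label as section boundaries. It uses base R on the example text from the first post; the variable names are hypothetical, and it would have to run before any step that replaces line breaks with spaces.

```r
example_text <- "Item 7. Management s Discussion and Analysis of Financial Condition and Results of Operations

Text I want to extract

Item 7A. Quantitative and Qualitative Disclosures About Market Risk

Different Text referring to Item 7. Management s"

# Work line by line so that headings can be told apart from in-text references
lines <- strsplit(example_text, "\n", fixed = TRUE)[[1]]

# A heading is a line that starts with the item label; references inside a sentence do not
start <- grep("^\\s*item\\s+7\\.", lines, ignore.case = TRUE)[1]
end   <- grep("^\\s*item\\s+7a\\.", lines, ignore.case = TRUE)
end   <- end[end > start][1]

# Keep the non-empty lines strictly between the two headings
section <- lines[(start + 1):(end - 1)]
section <- trimws(paste(section[nzchar(trimws(section))], collapse = " "))
section
```

The last line of the example text mentions "Item 7." mid-sentence and is ignored because it does not start with the label. This sketch does not handle a heading that is immediately followed by the next heading, or files with no usable line structure.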

system · November 8, 2022, 9:41am

This topic was automatically closed 42 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.