Package edgar: Scraping exhibits of 10-K / 8-K filings

The {edgar} package excludes the exhibits of SEC filings. Is there a way, or an additional package (edgarWebR does not work either), to scrape the exhibits of these filings in R?

Thanks a lot for your help!

Yes, but it requires constructing some queries by hand. Take as an example Sabine Oil, CIK 38079, as filed here. BTW: that CIK is an example in the {edgar} docs, but it returned errors for both 2005 and 2015.

The URL is https://www.sec.gov/Archives/edgar/data/0000038079/000119312515016321/0001193125-15-016321-index.htm and you can see that the filing consists of a sequence of seven documents: the amended Form 8-K itself and six exhibits, each in a separate htm file.

However, there is also a link, https://www.sec.gov/Archives/edgar/data/38079/000119312515016321/0001193125-15-016321.txt, that contains the complete filing in a single file. That's the good news.

The bad news is that most of the file consists of HTML markup that has to be parsed. The reasons are historical. In the early 1990s all filings were in ASCII plain text, limited to 80 characters per line. In the following decade the door was opened to HTML and the per-line limit was dropped, since it had no effect on the display. Eventually, supplemental filings could also be made in PDF. These departures from the initial format were all made in the name of on-screen and print-out readability, at the expense of automated text processing.

Some of the ugliest filings came from software that converts Microsoft Word to HTML; these produced markup bloat 10 or more times larger in byte count than the actual content.
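Whatever the quality of the markup, the complete submission file does wrap each component in a predictable envelope: a <DOCUMENT> ... </DOCUMENT> block with <TYPE>, <SEQUENCE> and <FILENAME> header lines, followed by a <TEXT> body. Here is a minimal base-R sketch for separating the exhibits along those lines. The function name split_submission() and the toy input are my own, purely for illustration, and the sketch assumes each tag starts its own line, which is worth checking against a real filing before relying on it.

```r
# Split a complete-submission file (as returned by readLines()) into one row
# per <DOCUMENT> block. Hypothetical helper, not part of {edgar} or edgarWebR.
split_submission <- function(lines) {
  starts <- grep("^<DOCUMENT>", lines)
  ends   <- grep("^</DOCUMENT>", lines)
  stopifnot(length(starts) == length(ends), all(starts < ends))

  # Value of a header line such as "<TYPE>EX-99.1"; NA if the tag is absent
  field <- function(chunk, tag) {
    pattern <- paste0("^<", tag, ">")
    hit <- grep(pattern, chunk, value = TRUE)
    if (length(hit) == 0) NA_character_ else trimws(sub(pattern, "", hit[1]))
  }

  docs <- Map(function(s, e) {
    chunk <- lines[s:e]
    t0 <- grep("^<TEXT>", chunk)[1]          # first opening tag (NA if none)
    t1 <- grep("^</TEXT>", chunk)
    t1 <- t1[length(t1)]                     # last closing tag
    body <- if (is.na(t0) || length(t1) == 0 || t1 - t0 < 2) {
      character(0)
    } else {
      chunk[(t0 + 1):(t1 - 1)]
    }
    data.frame(
      sequence = field(chunk, "SEQUENCE"),
      type     = field(chunk, "TYPE"),
      filename = field(chunk, "FILENAME"),
      text     = paste(body, collapse = "\n"),
      stringsAsFactors = FALSE
    )
  }, starts, ends)

  do.call(rbind, docs)
}

# Toy input standing in for readLines("0001193125-15-016321.txt")
toy <- c(
  "<SEC-DOCUMENT>toy example",
  "<DOCUMENT>", "<TYPE>8-K/A", "<SEQUENCE>1", "<FILENAME>main.htm",
  "<TEXT>", "<html>main body</html>", "</TEXT>", "</DOCUMENT>",
  "<DOCUMENT>", "<TYPE>EX-99.1", "<SEQUENCE>2", "<FILENAME>ex991.htm",
  "<TEXT>", "<html>exhibit body</html>", "</TEXT>", "</DOCUMENT>",
  "</SEC-DOCUMENT>"
)

docs <- split_submission(toy)
exhibits <- docs[grepl("^EX-", docs$type), ]   # keep only the exhibits
exhibits$filename
#> [1] "ex991.htm"
```

The text column still holds raw HTML, so this only separates the exhibits; it does not clean them. A line-based split is used rather than one big regular expression because these files can be large.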

What to do depends on whether the use case is interactive or intended to be automated for bulk processing. I'll set aside de-HTMLification and focus on the retrieval aspect with this code outline:

forepart <- "https://www.sec.gov/Archives/edgar/data/"
CIK <- "38079/"
# 18-digit accession number without dashes; this is also the filing's directory name
accession <- "000119312515016321/"
# The file name is the same number in dashed form: 10 digits, 2 digits, 6 digits
filename <- glue::glue(substr(accession,1,10),substr(accession,11,12),substr(accession,13,18),.sep="-")
ext <- ".txt"
glue::glue(forepart,CIK,accession,filename,ext)
#> https://www.sec.gov/Archives/edgar/data/38079/000119312515016321/0001193125-15-016321.txt

This leaves "only" the accession number to be extracted from some other search. There's no reliable way of anticipating what it will be, so a search is necessary for each filing.
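One place to run that search, as far as I know, is the per-filer submissions JSON that the SEC publishes on data.sec.gov. The sketch below is untested against the live service and rests on assumptions: that the feed lives at the URL pattern shown, that it carries parallel accessionNumber, form and filingDate vectors under filings$recent, and that accession numbers arrive in dashed form. The three function names are hypothetical helpers of mine; httr and jsonlite are required only for the download step.

```r
# Build the submissions URL; the CIK is zero-padded to 10 digits.
submissions_url <- function(cik) {
  sprintf("https://data.sec.gov/submissions/CIK%010d.json", as.integer(cik))
}

# Download and parse the JSON. The SEC asks automated clients to identify
# themselves, so pass your own contact string as `user_agent`.
fetch_submissions <- function(cik, user_agent) {
  resp <- httr::GET(submissions_url(cik), httr::user_agent(user_agent))
  httr::stop_for_status(resp)
  jsonlite::fromJSON(httr::content(resp, as = "text", encoding = "UTF-8"))
}

# Reduce the parsed JSON to a table of filings, optionally for one form type.
# Assumes the structure described above; dashes are stripped so the result
# matches the directory-style accession number.
recent_filings <- function(parsed, form = NULL) {
  r <- parsed$filings$recent
  out <- data.frame(
    accession = gsub("-", "", r$accessionNumber),
    form      = r$form,
    filed     = r$filingDate,
    stringsAsFactors = FALSE
  )
  if (!is.null(form)) out <- out[out$form == form, ]
  out
}

# Usage (needs network access, so not run here):
# parsed <- fetch_submissions(38079, "Your Name your.email@example.com")
# recent_filings(parsed, form = "8-K/A")
```

Note that "recent" may not reach back far enough for an older filing such as the 2015 example; if I recall correctly, the same JSON points to additional files for older history, which this sketch does not follow.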
