SEC EDGAR 10-K files download

Dear all,

I am writing my master's thesis and am trying to download the 10-K filings for specific companies from the EDGAR database in HTML format. I tried the getFilingsHTML function. It is supposed to read the downloaded filings, scrape the filing text excluding exhibits, and save the contents in HTML format in the 'Edgar filings_HTML view' directory. Instead, it creates a Master Indexes directory whose contents are not in HTML format.

Example code:

output <- getFilingsHTML(cik.no = c(1000180, 38079), form.type = "10-K", filing.year = 2006, useragent = "MyName@gmail.com")

This is the output (the German "existiert bereits" means "already exists"):
Master Indexes' existiert bereits
Downloading Master Indexes from SEC server for 2016 ...

It creates a new folder, Master Indexes, in my working directory, and its contents are not in HTML format.
Can somebody help me with that?

Thank you in advance!
Jenny

Not sure which EDGAR package you're using, but let's see if we can get you on the right path.

SanDisk Corp.'s CIK is a string, not an integer, because it is left-padded with zeros: 0001000180, not 1000180. There are search contexts in which the difference matters.
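To make the padding concrete, here is a minimal R sketch that converts between the two forms. The helper names pad_cik and unpad_cik are made up for illustration and do not come from any EDGAR package.

```r
# Hypothetical helpers (not part of any EDGAR package).

# Left-pad a CIK with zeros to the 10-character form.
pad_cik <- function(cik) {
  # format = "d" avoids scientific notation; flag = "0" pads with zeros
  formatC(as.numeric(cik), width = 10, format = "d", flag = "0")
}

# Drop the padding again, e.g. for directory-style paths.
unpad_cik <- function(cik) {
  sub("^0+", "", pad_cik(cik))
}

pad_cik(c(1000180, 38079))
unpad_cik("0001000180")
```

Both helpers accept either numbers or strings, so a vector of CIKs read from a spreadsheet can be normalized in one call.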

In 2006, and perhaps in other years, a 10-K/A was filed to amend the previous version. It turned out to correct a minor tick-the-box error, but it could have been important, so some check for amendments is needed. Depending on how many filings are involved, it's probably best to do this manually.

The bulk of the 10-K filing is already in HTML. There can be a large number of exhibits, some or none of which may be of interest (see the exhibit index at the end of this message). The complete submission displays as text, but after 55 lines of non-HTML header the rest is HTML, so that is the file to target.

All 10-K filings in txt form for this issuer will have URLs of the form

https://www.sec.gov/Archives/edgar/data/1000180/000089161806000116/0000891618-06-000116.txt

Notice that the directory name 000089161806000116 and the filename without its extension differ only by the two hyphens. The full string 0000891618-06-000116 is the "accession number": 0000891618 identifies the entity that submitted the filing, 06 signifies 2006, and 000116 is a sequence number. See SEC explanation. Unfortunately, there doesn't appear to be an easy way of fetching the accession number directly.
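Once an accession number is known, the txt URL can be assembled mechanically. The following is a minimal R sketch that follows the URL pattern shown above; the function names filing_txt_url and split_accession are hypothetical.

```r
# Hypothetical helpers; the URL layout is copied from the example URL above.

# Build the URL of the complete txt submission.
filing_txt_url <- function(cik, accession) {
  cik_dir <- sprintf("%.0f", as.numeric(cik))        # unpadded CIK, e.g. "1000180"
  acc_dir <- gsub("-", "", accession, fixed = TRUE)   # accession without hyphens
  sprintf("https://www.sec.gov/Archives/edgar/data/%s/%s/%s.txt",
          cik_dir, acc_dir, accession)
}

# Split an accession number into its three hyphen-separated parts.
split_accession <- function(accession) {
  parts <- strsplit(accession, "-", fixed = TRUE)[[1]]
  list(filer_id = parts[1], year = parts[2], sequence = parts[3])
}

filing_txt_url("0001000180", "0000891618-06-000116")
split_accession("0000891618-06-000116")
```

The two-digit year is left as a string on purpose, since mapping it to a four-digit year needs a rule for filings from the 1990s.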

That's why I usually start with a query to generate:

https://www.sec.gov/cgi-bin/browse-edgar?action=getcompany&CIK=0001000180&type=10-K&dateb=20060401&owner=include&count=10&search_text=

and then parse the resulting HTML table for the top entry (in this case a 10-K/A) to get its link, parse that page to fetch the txt version, and use a script to strip the file down to the opening HTML tag.
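The last step, stripping the txt submission down to its HTML part, might look roughly like this in R. This is only a sketch: the function name extract_first_html is made up, the sample lines are a toy input rather than a real filing, and it assumes the main 10-K is the first HTML document in the file and that its closing tag sits on its own line.

```r
# Hypothetical helper: keep the first <html> ... </html> block of a txt submission.
# Assumption: the main 10-K document comes first; exhibits follow it.
extract_first_html <- function(lines) {
  start <- grep("<html", lines, ignore.case = TRUE)[1]
  if (is.na(start)) return(character(0))   # no HTML found at all
  ends <- grep("</html>", lines, ignore.case = TRUE)
  end <- ends[ends >= start][1]
  if (is.na(end)) end <- length(lines)     # unterminated: keep the rest
  lines[start:end]
}

# Toy input standing in for readLines() on a downloaded .txt file
sample_lines <- c(
  "header line 1",
  "header line 2",
  "<HTML>",
  "<BODY>Annual report text</BODY>",
  "</HTML>",
  "<HTML>",
  "<BODY>Exhibit text</BODY>",
  "</HTML>"
)
extract_first_html(sample_lines)
```

On a real file the input would come from readLines() on the downloaded txt version, and the result can be written back out with writeLines() under an .html name.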

Notice the dateb=20060401 part, which restricts the results to filings made before that date; for a company reporting on a calendar year, the 10-K is always due in March. But, of course, not all companies have the calendar year as their fiscal year, which is the measuring period.
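The query URL above can be generated for any CIK and cutoff date. Here is a minimal R sketch; the function name edgar_query_url is hypothetical, and the parameter names are simply the ones visible in the URL above. Values are pasted in as-is without URL encoding, which is enough for plain values like these.

```r
# Hypothetical helper that reproduces the browse-edgar query string shown above.
# Note: values are not URL-encoded.
edgar_query_url <- function(cik, form = "10-K", dateb = "", count = 10) {
  params <- c(
    action      = "getcompany",
    CIK         = cik,
    type        = form,
    dateb       = dateb,       # cutoff date as YYYYMMDD
    owner       = "include",
    count       = count,
    search_text = ""
  )
  query <- paste(names(params), params, sep = "=", collapse = "&")
  paste0("https://www.sec.gov/cgi-bin/browse-edgar?", query)
}

edgar_query_url("0001000180", dateb = "20060401")
```

For companies whose fiscal year does not end in December, dateb would need to be shifted accordingly for each company.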


INDEX TO EXHIBITS

Exhibit Number	Exhibit Title
3.1	Restated Certificate of Incorporation of the Registrant.(2)
3.2	Certificate of Amendment of the Restated Certificate of Incorporation of the Registrant dated December 9, 1999.(4)
3.3	Certificate of Amendment of the Restated Certificate of Incorporation of the Registrant dated May 11, 2000.(6)
3.4	Certificate of Amendment to the Amended Restated Certificate of Incorporation of the Registrant dated May 26, 2006.(24)
3.5	Amended and Restated Bylaws of the Registrant dated July 25, 2007.(19)
3.6	Certificate of Designations for the Series A Junior Participating Preferred Stock, as filed with the Delaware Secretary of State on April 24, 1997.(3)
3.7	Amendment to Certificate of Designations for the Series A Junior Participating Preferred Stock, as filed with the Delaware Secretary of State on September 24, 2003.(11)
4.1	Reference is made to Exhibits 3.1, 3.2, 3.3, and 3.4.
4.2	Rights Agreement, dated as of September 15, 2003, between the Registrant and Computershare Trust Company, Inc.(11)
4.3	Amendment No. 1 to Rights Agreement by and between the Registrant and Computershare Trust Company, Inc., dated as of November 6, 2006.(27)
4.4	SanDisk Corporation Form of Indenture (including notes).(20)
4.5	Indenture (including form of Notes) with respect to the Registrant’s 1.00% Convertible Senior Notes due 2013 dated as of May 15, 2006 by and between the Registrant and The Bank of New York.(21)
10.1	Form of Indemnification Agreement entered into between the Registrant and its directors and officers.(2)
10.2	License Agreement between the Registrant and Dr. Eli Harari, dated September 6, 1988.(2)
10.3	SanDisk Corporation 1995 Stock Option Plan, as Amended and Restated January 2, 2002.(9), (*)
10.4	SanDisk Corporation 1995 Non-Employee Directors Stock Option Plan, as Amended and Restated as of January 2, 2004.(10), (*)
10.5	Registration Rights Agreement, dated as of January 18, 2001, by and between the Registrant, The Israel Corporation, Alliance Semiconductor Ltd., Macronix International Co., Ltd. and Quick Logic Corporation.(5)
10.6	Consolidated Shareholders Agreement, dated as of January 18, 2001, by and among the Registrant, The Israel Corporation, Alliance Semiconductor Ltd. and Macronix International Co., Ltd.(5)
10.7	Agreement, dated as of September 28, 2006, by and among the Registrant, Bank Leumi Le Israel B.M., a banking corporation organized under the laws of the State of Israel, The Israel Corporation Ltd., Alliance Semiconductor Corporation and Macronix International Co. Ltd.(26)
10.8	Agreement, dated as of September 28, 2006, by and among the Registrant, Bank Hapoalim B.M., a banking corporation organized under the laws of the State of Israel, The Israel Corporation Ltd., Alliance Semiconductor Corporation and Macronix International Co. Ltd.(26)
10.9	Amendment No. 3 to Payment Schedule of Series A-5 Additional Purchase Obligations, Waiver of Series A-5 Conditions, Conversion of Series A-4 Wafer Credits and Other Provisions, dated as of November 11, 2003, by and between the Registrant, Tower Semiconductor Ltd. and the other parties thereto.(12)
10.10	New Master Agreement, dated as of April 10, 2002, by and between the Registrant and Toshiba Corporation.(7), (1)
10.11	Amendment to New Master Agreement, dated and effective as of August 13, 2002 by and between the Registrant and Toshiba Corporation.(8), (1)
10.12	New Operating Agreement, dated as of April 10, 2002, by and between the Registrant and Toshiba Corporation.(7), (1)
10.13	Indemnification and Reimbursement Agreement, dated as of April 10, 2002, by and between the Registrant and Toshiba Corporation.(7), (1)
10.14	Amendment to Indemnification and Reimbursement Agreement, dated as of May 29, 2002 by and between the Registrant and Toshiba Corporation.(7)
10.15	Amendment No. 2 to Indemnification and Reimbursement Agreement, dated as of May 29, 2002 by and between the Registrant and Toshiba Corporation.(25)
10.16	Form of Amended and Restated Change of Control Benefits Agreement entered into by and between the Registrant and its named executive officers.(13), (*)
10.17	Form of Option Agreement Amendment (13), (*)
10.18	Flash Partners Master Agreement, dated as of September 10, 2004, by and among the Registrant and the other parties thereto.(14), (1)
10.19	Flash Alliance Master Agreement, dated as of July 7, 2006, by and among the Registrant, Toshiba Corporation and SanDisk (Ireland) Limited.(23), (+)
10.20	Operating Agreement of Flash Partners Ltd., dated as of September 10, 2004, by and between SanDisk International Limited and Toshiba Corporation.(14), (1)
10.21	Operating Agreement of Flash Alliance, Ltd., dated as of July 7, 2006, by and between Toshiba Corporation and SanDisk (Ireland) Limited.(23), (+)
10.22	Mutual Contribution and Environmental Indemnification Agreement, dated as of September 10, 2004, by and among the Registrant and the other parties thereto.(14), (1)
10.23	Flash Alliance Mutual Contribution and Environmental Indemnification Agreement, dated as of July 7, 2006, by and between Toshiba Corporation and SanDisk (Ireland) Limited.(23), (+)
10.24	Patent Indemnification Agreement, dated as of September 10, 2004 by and among the Registrant and the other parties thereto.(14), (1)
10.25	Patent Indemnification Agreement, dated as of July 7, 2006, by and among the Registrant and the other parties thereto.(23), (+)
10.26	Master Lease Agreement, dated as of December 24, 2004, by and among Mitsui Leasing & Development, Ltd., IBJ Leasing Co., Ltd., and Sumisho Lease Co., Ltd. and Flash Partners Ltd.(15), (1)
10.27	Master Lease Agreement, dated as of September 22, 2006, by and among Flash Partners Limited Company, SMBC Leasing Company, Limited, Toshiba Finance Corporation, Sumisho Lease Co., Ltd., Fuyo General Lease Co., Ltd., Tokyo Leasing Co., Ltd., STB Leasing Co., Ltd. and IBJ Leasing Co., Ltd.(23), (+)
10.28	Guarantee Agreement, dated as of December 24, 2004, by and between the Registrant and Mitsui Leasing & Development, Ltd.(15)
10.29	Guarantee Agreement, dated as of September 22, 2006, by and among the Registrant, SMBC Leasing Company, Limited and Toshiba Finance Corporation.(23)
10.30	Amended and Restated SanDisk Corporation 2005 Incentive Plan.(25), (*)
10.31	SanDisk Corporation Form of Notice of Grant of Stock Option.(16), (*)
10.32	SanDisk Corporation Form of Notice of Grant of Non-Employee Director Automatic Stock Option (Initial Grant).(16), (*)
10.33	SanDisk Corporation Form of Notice of Grant of Non-Employee Director Automatic Stock Option (Annual Grant).(16), (*)
10.34	SanDisk Corporation Form of Stock Option Agreement.(16), (*)
10.35	SanDisk Corporation Form of Automatic Stock Option Agreement.(16), (*)
10.36	SanDisk Corporation Form of Restricted Stock Unit Issuance Agreement.(17), (*)
10.37	SanDisk Corporation Form of Restricted Stock Unit Issuance Agreement (Director Grant).(16), (*)
10.38	SanDisk Corporation Form of Restricted Stock Award Agreement.(16), (*)
10.39	SanDisk Corporation Form of Restricted Stock Award Agreement (Director Grant).(16), (*)
10.40	SanDisk Corporation Form of Performance Stock Unit Issuance Agreement.(**), (*)
10.41	Guarantee Agreement between the Registrant, IBJ Leasing Co., Ltd., Sumisho Lease Co., Ltd., and Toshiba Finance Corporation.(18)
10.42	Guarantee Agreement, dated as of June 20, 2006, by and between the Registrant, IBJ Leasing Co., Ltd., Sumisho Lease Co., Ltd. and Toshiba Finance Corporation.(25)
10.43	Basic Lease Contract between Flash Partners Yugen Kaisha, IBJ Leasing Co., Ltd., Sumisho Lease Co., Ltd., and Toshiba Finance Corporation.(18), (+)
10.44	Basic Lease Contract, dated as of June 20, 2006, by and between Flash Partners Yugen Kaisha, IBJ Leasing Co., Ltd., Sumisho Lease Co., Ltd. and Toshiba Finance Corporation.(25), (+)
10.45	Sublease (Building 3), dated as of December 21, 2005 by and between Maxtor Corporation and the Registrant.(25)
10.46	Sublease (Building 4), dated as of December 21, 2005 by and between Maxtor Corporation and the Registrant.(25)
10.47	Sublease (Building 5), dated as of December 21, 2005 by and between Maxtor Corporation and the Registrant.(28)
10.48	Sublease (Building 6), dated as of December 21, 2005 by and between Maxtor Corporation and the Registrant.(25)
10.49	Confidential Separation Agreement and General Release of Claims.(17)
10.50	3D Collaboration Agreement.(22), (1)
12.1	Computation of ratio of earnings to fixed charges. (**)
21.1	Subsidiaries of the Registrant(**)
23.1	Consent of Independent Registered Public Accounting Firm(**)
31.1	Certification of Chief Executive Officer Pursuant to Section 302 of the Sarbanes-Oxley Act of 2002(**)
31.2	Certification of Chief Financial Officer Pursuant to Section 302 of the Sarbanes-Oxley Act of 2002(**)
32.1	Certification of Chief Executive Officer Pursuant to 18 U.S.C. Section 1350, as adopted pursuant to Section 906 of the Sarbanes-Oxley Act of 2002(**)
32.2	Certification of Chief Financial Officer Pursuant to 18 U.S.C. Section 1350, as adopted pursuant to Section 906 of the Sarbanes-Oxley Act of 2002(**)

*	Indicates management contract or compensatory plan or arrangement.
**	Filed herewith.
***	Furnished herewith.
+	Confidential treatment has been requested with respect to certain portions hereof.
1.	Confidential treatment granted as to certain portions of these exhibits.
2.	Previously filed as an Exhibit to the Registrant’s Registration Statement on Form S-1 (No. 33-96298).
3.	Previously filed as an Exhibit to the Registrant’s Current Report on Form 8-K/A dated April 18, 1997.
4.	Previously filed as an Exhibit to the Registrant’s Form 10-Q for the quarter ended June 30, 2000.
5.	Previously filed as an Exhibit to the Registrant’s Schedule 13(d) dated January 26, 2001.
6.	Previously filed as an Exhibit to the Registrant’s Registration Statement on Form S-3 (No. 333-85686).
7.	Previously filed as an Exhibit to the Registrant’s Form 10-Q for the quarter ended June 30, 2002.
8.	Previously filed as an Exhibit to the Registrant’s Form 10-Q for the quarter ended September 30, 2002.
9.	Previously filed as an Exhibit to the Registrant’s Registration Statement on Form S-8 (No. 333-85320).
10.	Previously filed as an Exhibit to the Registrant’s Registration Statement on Form S-8 (No. 333-112139).
11.	Previously filed as an Exhibit to the Registrant’s Registration Statement on Form 8-A dated September 25, 2003.
12.	Previously filed as an Exhibit to the Registrant’s 2003 Annual Report on Form 10-K.
13.	Previously filed as an Exhibit to the Registrant’s Form 8-K dated November 12, 2008.
14.	Previously filed as an Exhibit to the Registrant’s Form 10-Q for the quarter ended September 26, 2004.
15.	Previously filed as an Exhibit to the Registrant’s 2004 Annual Report on Form 10-K.
16.	Previously filed as an Exhibit to the Registrant’s Current Report on Form 8-K dated June 3, 2005.
17.	Previously filed as an Exhibit to the Registrant’s Form 10-Q for the quarter ended March 30, 2008.
18.	Previously filed as an Exhibit to the Registrant’s 2005 Annual Report on Form 10-K.
19.	Previously filed as an Exhibit to the Registrant’s Form 8-K dated July 27, 2007.
20.	Previously filed as an Exhibit to the Registrant’s Form 8-K dated May 9, 2006.
21.	Previously filed as an Exhibit to the Registrant’s Form 8-K dated May 15, 2006.
22.	Previously filed as an Exhibit to the Registrant’s Form 8-K dated June 17, 2008.
23.	Previously filed as an Exhibit to the Registrant’s Form 10-Q for the quarter ended October 1, 2006
24.	Previously filed as an Exhibit to the Registrant’s Form 8-K dated June 1, 2006.
25.	Previously filed as an Exhibit to the Registrant’s Form 10-Q for the quarter ended July 2, 2006.
26.	Previously filed as an Exhibit to the Registrant’s Schedule 13(d)/A dated October 12, 2006.
27.	Previously filed as an Exhibit to the Registrant’s Form 8-A/A dated November 8, 2006.
28.	Previously filed as an Exhibit to the Registrant’s 2006 Annual Report on Form 10-K.

For the benefit of others working with EDGAR filings:

The SEC began blocking API calls that did not include an HTTP user-agent header, but did not provide anything in the way of documentation. It also limits calls to no more than 10/sec. None of the R packages that mine the EDGAR data appear to have been updated to work around these obstacles.
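For anyone who wants to stay in base R, both constraints can be handled by hand. The sketch below is an assumption-laden illustration, not a fix for any of the existing packages: the names wait_needed and fetch_all are made up, the user-agent string is whatever contact string you choose, and the headers argument of url() requires R 3.6.0 or later.

```r
# Hypothetical helpers for polite downloading; not from any EDGAR package.

min_interval <- 0.1  # seconds between requests, i.e. at most 10 per second

# How long to pause, given the time of the last request and the current time.
wait_needed <- function(last_request, now, interval = min_interval) {
  max(0, interval - (now - last_request))
}

# Fetch each URL as a character vector of lines, sending a User-Agent header.
fetch_all <- function(urls, user_agent) {
  last <- -Inf
  results <- vector("list", length(urls))
  for (i in seq_along(urls)) {
    Sys.sleep(wait_needed(last, as.numeric(Sys.time())))
    last <- as.numeric(Sys.time())
    con <- url(urls[i], headers = c("User-Agent" = user_agent))
    results[[i]] <- tryCatch(readLines(con, warn = FALSE),
                             finally = close(con))
  }
  results
}

# Example call (not run here because it needs network access):
# pages <- fetch_all(some_urls, user_agent = "MyName@gmail.com")
```

Keeping the timing logic in its own small function makes it easy to check without touching the network.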

I was able to use {reticulate} to import a Python library that does the same thing as the R packages—download files of specified types and dates given a vector of CIK codes for the companies of interest.

It’s been some time since I’ve done much with EDGAR documents. Originally, they had to be plaintext with only simple SGML markup in the headers. All of today’s filings are in HTML. Some of those are easy to pull text from. A large portion, however, are 95% markup in various flavors, all of which is contained in a single string that can be of length 10MB or longer.

I’ve now processed a few hundred of these files, using pandoc to convert them to markdown and running shell scripts on the results to extract the paragraphs of interest. I use a combination of awk, grep, Perl, sed, and tr. For most of this, a line-oriented editing tool won’t work readily, because the text of interest spans several lines; stream editing with buffering is needed. In theory, it could be done as a Perl one-liner, but that would be so ugly as to be unreadable and might as well be stored as a write-only file.
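The same idea can be sketched in R instead of shell tools, for readers who prefer to stay in one language. This is not the script I actually use: the function names are made up, the sample text is a toy input, and the pandoc call assumes pandoc is installed and on the PATH. The point is the buffering: lines are grouped into blank-line-separated paragraphs first, and only then searched.

```r
# Hypothetical helpers; a sketch of paragraph-level extraction.

# Convert an HTML file to markdown by calling the pandoc command-line tool.
html_to_markdown <- function(infile, outfile) {
  system2("pandoc", c("-f", "html", "-t", "markdown",
                      "-o", shQuote(outfile), shQuote(infile)))
}

# Group lines into paragraphs (separated by blank lines), then keep the
# paragraphs that match a pattern.
paragraphs_matching <- function(lines, pattern) {
  blank <- grepl("^\\s*$", lines)
  if (all(blank)) return(character(0))
  para_id <- cumsum(blank)   # increases by one at every blank line
  paras <- tapply(lines[!blank], para_id[!blank], paste, collapse = " ")
  paras <- unname(as.vector(paras))
  paras[grepl(pattern, paras, ignore.case = TRUE)]
}

# Toy markdown standing in for readLines() on pandoc output
md <- c(
  "Item 1. Business",
  "We make flash memory.",
  "",
  "Item 1A. Risk Factors",
  "Our revenue depends",
  "on a few customers.",
  "",
  "Item 2. Properties"
)
paragraphs_matching(md, "risk factors")
```

Because each paragraph is collapsed to a single string before matching, a pattern can match text that was wrapped across several lines in the markdown file.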
