How to clean local txt files in R?

I am trying to clean 70 GB of local 8-K filings that I downloaded with the edgar package in R. The next step is to clean all these files (strip HTML tags etc.) so that only the filing text remains in each text file. I wrote a for loop that goes through all my folders and subfolders, but I am having trouble with the gsub() function. I would like to remove all HTML tags, along with characters such as =?./,^() that appear inside them. How can I remove these characters only inside the HTML tags <> and NOT from the filing text? I would be very grateful for any help.

PS: Is it possible to write the cleaned text back to the file? With my code the result only exists in RStudio as a value, but I would like the cleaned text to overwrite the original text file.

for (i in 1:nrow(Data8K)) {

  dest.filename <- paste0("Edgar filings_full text/Form 8-K/", Data8K$cik[i], "/", Data8K$cik[i], "_8-K_", Data8K$date.filed[i], "_", Data8K$accession.number[i], ".txt")

  # Read filing 
  filing.text <- readLines(dest.filename)

  # Extract data from first <DOCUMENT> to </DOCUMENT>
  filing.text <- filing.text[(grep("<DOCUMENT>", filing.text, ignore.case = TRUE)[1]):(grep("</DOCUMENT>", filing.text, ignore.case = TRUE)[1])]

  # Preprocessing the filing text
  filing.text <- gsub("\\n|\\t|,", " ", filing.text)
  filing.text <- paste(filing.text, collapse=" ")
  filing.text <- gsub("'s ", "", filing.text)
  filing.text <- gsub("[[:punct:]]", "", filing.text, perl=T)
  filing.text <- gsub("[[:digit:]]", "", filing.text, perl=T)
  filing.text <- iconv(filing.text, from = 'UTF-8', to = 'ASCII//TRANSLIT')
  filing.text <- tolower(filing.text)
  filing.text <- gsub("\\s{2,}", " ", filing.text)  
}

There's a package for that!

install.packages("rvest")
library(rvest)
#> Loading required package: xml2
y <- read_html("https://www.sec.gov/Archives/edgar/data/40545/000120677419001160/0001206774-19-001160.txt")
y %>% html_nodes("p") %>% html_text()
#>  [1] "UNITED STATESSECURITIES AND EXCHANGE COMMISSION"
#>  [2] "Washington, D.C. 20549"
#>  [3] "FORM 8-K"
#>  [4] "CURRENT REPORTPursuant to Section 13 or 15(d) of the Securities Exchange Act of 1934"
#>  [5] "Date of Report (Date of earliest event reported)  April 1, 2019"
#>  [6] "Registrant’s telephone number, including area code (617) 443-3000"
#>  [7] "Check the appropriate box below if the Form 8-K filing is intended to simultaneously satisfy the filing obligation of the registrant under any of the following provisions:"
#>  [8] "Indicate by check mark whether the registrant is an emerging growth company as defined in Rule 405 of the Securities Act of 1933 (§230.405 of this chapter) or Rule 12b-2 of the\nSecurities Exchange Act of 1934 (§240.12b-2 of this chapter)."
#>  [9] "Emerging growth company ☐ "
#> [10] "If an emerging growth company, indicate by check mark if the registrant has elected not to use the extended transition period for complying with any new or revised financial accounting\nstandards provided pursuant to Section 13(a) of the Exchange Act. ☐ "
#> [11] "Item 5.02. Departure of Directors or Certain Officers; Election of Directors; Appointment of Certain Officers; Compensatory Arrangements of Certain Officers."
#> [12] "On April 1, 2019, General Electric Company (“GE” or the “Company”) announced annual equity awards for the Company’s executives and the framework for the Company’s 2019 annual cash bonus plan, in each case as approved by the Management Development & Compensation Committee (“MDCC”) of the GE Board."
#> [13] "Equity Awards. All of the 2019 equity awards approved by the MDCC for GE’s Chairman and CEO, H. Lawrence Culp, Jr., are in the form of performance stock units (“PSUs”), with a grant date fair value of approximately $15 million, in accordance with his employment agreement. The MDCC also approved an equity award with a grant date fair value of approximately $15 million for David Joyce, Vice Chairman, GE and President and CEO, GE Aviation, anticipating that Mr. Joyce is not expected to receive other equity awards in future years prior to his retirement. The performance conditions for Mr. Joyce’s PSU award will be based upon operating metrics for the Aviation business and the PSU award will vest in two tranches of 50% each on December 31, 2020 and December 31, 2021."
#> [14] "For the other named executive officers in GE’s Definitive Proxy Statement for the 2019 Annual Meeting of Shareowners, approximately 50% by value of their 2019 equity awards were delivered in the form of PSUs, approximately 30% by value were delivered in the form of options (with an exercise price of $10.19) and approximately 20% by value were delivered as restricted stock units (“RSUs”)."
#> [15] "The stock options and RSUs become exercisable and vest, respectively, in two equal tranches on the second and third anniversary of the grant date. The PSU awards granted in March 2019 (with the exception of those for Mr. Joyce), have an approximately three-year performance period and will settle in equity based upon a single performance metric: GE total shareholder return (“TSR”) versus the S&P 500 from the grant date through December 31, 2021. PSUs will be earned to the extent that the performance condition is satisfied as follows, with proportional adjustment for performance between threshold, target and maximum:"
#> [16] "The MDCC determined that a single TSR-based metric for the PSUs continued to be appropriate due to the difficulty in forecasting Company performance over a three-year period during the ongoing portfolio restructuring. TSR performance will take into account any change in GE’s capital structure. Additionally, any shares issued under the PSUs (other than those issued to Mr. Joyce) will have a mandatory one-year hold period, regardless of whether the executive has satisfied the Company’s stock ownership requirement."
#> [17] "Annual Bonus Plan. Similar to 2018, the Company has determined that the annual cash bonus program for 2019 will focus on two metrics – an earnings metric and a free cash flow metric. Individual targets will be set at the business level, while achievement of the performance objectives for Corporate executives will be measured against Company-wide results."
#> [18] "(2)"
#> [19] "\nSIGNATURES"
#> [20] "\nPursuant to the requirements of the Securities Exchange Act of 1934, the registrant has duly caused this report to be signed on its behalf by the undersigned hereunto duly authorized."
#> [21] "(3)"

Created on 2019-04-09 by the reprex package (v0.2.1)

It works the same with the html version.

For further cleansing there's the stringr package.
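As a rough sketch of what that further cleansing might look like: the `paragraphs` vector below is made-up sample data standing in for the html_text() output, and the page-marker pattern is only a guess at what you would want to drop.

```r
library(stringr)

# Hypothetical sample standing in for the html_text() output above
paragraphs <- c(
  "Washington, D.C. 20549",
  "(2)",
  "\nSIGNATURES",
  "Rule 12b-2 of the\nSecurities Exchange Act of 1934"
)

# Collapse runs of whitespace (including embedded newlines) and trim the ends
cleaned <- str_squish(paragraphs)

# Drop fragments that consist only of a page marker such as "(2)"
cleaned <- cleaned[!str_detect(cleaned, "^\\(\\d+\\)$")]

# Optional: lower-case and strip punctuation, as in the loop from the question
tokens_ready <- str_replace_all(str_to_lower(cleaned), "[[:punct:]]", "")
tokens_ready <- str_squish(tokens_ready)
```

Because the tags are already gone at this point, removing punctuation here only affects the filing text itself, so skip the last two lines if you want to keep it.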


Dear technocrat, many thanks for your quick response. I am working with local (downloaded) txt files. Is it possible to pass a local file to the read_html() function? I have been trying but haven't found a solution yet. I hope you have an idea.

Try something like:

# Line 1 just gives dummy data
download.file("https://www.jumpingrivers.com", destfile = "/tmp/tmp.html")
# Open a file connection
f = file("/tmp/tmp.html")
read_html(f)

Make sure you have stringi installed

    # A file on your local system
    path <- system.file("some.txt",  package = "rvest")
    x <- read_html(path)
    clean.txt <- x %>% html_nodes("p") %>% html_text()

When you've finished processing you can write back clean.txt

write(clean.txt, file = "GE1Q19.txt")
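To put the read and write steps together, here is a minimal sketch that runs end to end. It assumes a plain file path is enough for read_html(); the temporary file and its contents are invented for illustration and stand in for one of your downloaded filings.

```r
library(rvest)

# Hypothetical stand-in for one downloaded filing; replace with a real path
path <- tempfile(fileext = ".txt")
writeLines(c(
  "<html><body>",
  "<p>FORM 8-K</p>",
  "<p>Item 5.02. Departure of Directors</p>",
  "</body></html>"
), path)

# read_html() accepts a path to a local file as well as a URL
x <- read_html(path)
clean.txt <- html_text(html_nodes(x, "p"))

# Overwrite the original file with the cleaned text, one paragraph per line
writeLines(clean.txt, path)
```

Note that this keeps only the text inside <p> nodes, so anything in tables or other elements is dropped. Overwriting cannot be undone, so it is probably worth trying this on a copy of the folder first.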

I tried your idea, which I think could be the solution, but when I read the local file with system.file, path only contains the string "", so it is empty.

path <- system.file("1800/1800_8-K_2008-01-23_0001104659-08-003972.txt", package = "rvest")
x <- read_html(path)
clean.txt <- x %>% html_nodes("p") %>% html_text()

When I run the read_html() line I get this error:

> x <- read_html(path)
Error: '' does not exist in current working directory ('D:/Stock_Price_Prediction').

Is this the actual path of your file?
D:/Stock_Price_Prediction/1800/1800_8-K_2008-01-23_0001104659-08-003972.txt

Keep in mind that relative paths start at the current working directory. Also, if you share a sample of filing.text (in a copy/paste-friendly format), we could try to give you a regex-based alternative solution.
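In the meantime, a regex-based approach could look roughly like the sketch below. The sample string is invented, not taken from your file, and the pattern assumes that no tag contains a literal ">" inside an attribute value. Removing whole tags first means the punctuation in the filing text is left alone.

```r
# Hypothetical sample line from a filing (not taken from the real file)
raw <- "<P ALIGN=\"center\"><FONT SIZE=2>Net sales rose 4.5% (see Item 2.02).</FONT></P>&nbsp;"

# Remove every tag: "<", then anything that is not ">", then ">"
no_tags <- gsub("<[^>]+>", " ", raw)

# Replace HTML entities such as &nbsp; or &#160; with a space
no_entities <- gsub("&#?[[:alnum:]]+;", " ", no_tags)

# Collapse repeated whitespace and trim the ends
clean <- trimws(gsub("\\s+", " ", no_entities))
```

For the path problem, `getwd()` shows where relative paths are resolved from, and `file.exists()` on your path tells you whether R can see the file before you try to read it.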


The problem was with the system.file function, which should be used to find the path of files inside an installed package and not of arbitrary files. So check my first post at the top; that's my actual code. Here is one txt file of my 70,000 files for testing: https://ufile.io/ki4pb

The actual code I am working with is:

library(quantmod)
library(rvest)
library(tidyverse)
library(edgar)
library(readtext)
library(dplyr)
library(qdap)
library(stringi)
library(RCurl)
library(XML)

for (i in 1:nrow(d1800)) {
  
  dest.filename <- paste0("1800/1800_8-K_", d1800$date.filed[i], "_", d1800$accession.number[i], ".txt")
  
  # Read filing 
  filing.text <- readLines(dest.filename)

  # Extract data from first <DOCUMENT> to </DOCUMENT>
  filing.text <- filing.text[(grep("<DOCUMENT>", filing.text, ignore.case = TRUE)[1]):(grep("</DOCUMENT>", filing.text, ignore.case = TRUE)[1])]
  
  filing.text <- gsub("\\n|\\t|,", " ", filing.text)
  filing.text <- paste(filing.text, collapse=" ")
  filing.text <- gsub("'s ", "", filing.text)
  filing.text <- gsub("[[:punct:]]", "", filing.text, perl=T)
  filing.text <- gsub("[[:digit:]]", "", filing.text, perl=T)
  filing.text <- iconv(filing.text, from = 'UTF-8', to = 'ASCII//TRANSLIT')
  filing.text <- tolower(filing.text)
  filing.text <- gsub("\\s{2,}", " ", filing.text)
  
  # To write cleaned data into the txt file (overwrite file)
  # writeLines(filing.text, dest.filename)
  
  browser()
}
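One possible way to restructure the loop body, as a sketch only: `clean_filing()` is a hypothetical helper name, the demo file is invented, and the iconv() step from the code above is left out for brevity. The main change is that tags are stripped before the punctuation step, while "<" and ">" are still intact, and that the result is written back to the same file.

```r
# clean_filing() is a hypothetical helper name, not part of any package
clean_filing <- function(path) {
  filing.text <- readLines(path, warn = FALSE)

  # Keep the first <DOCUMENT> ... </DOCUMENT> block, if both markers exist
  start <- grep("<DOCUMENT>", filing.text, ignore.case = TRUE)[1]
  end   <- grep("</DOCUMENT>", filing.text, ignore.case = TRUE)[1]
  if (!is.na(start) && !is.na(end)) filing.text <- filing.text[start:end]

  filing.text <- paste(filing.text, collapse = " ")
  # Strip tags and entities BEFORE touching punctuation
  filing.text <- gsub("<[^>]+>", " ", filing.text)
  filing.text <- gsub("&#?[[:alnum:]]+;", " ", filing.text)
  # Same cleanup steps as the original loop
  filing.text <- gsub("[[:punct:]]", "", filing.text)
  filing.text <- gsub("[[:digit:]]", "", filing.text)
  filing.text <- tolower(filing.text)
  trimws(gsub("\\s+", " ", filing.text))
}

# Demo on a throwaway file; in the real loop `path` would be dest.filename
path <- tempfile(fileext = ".txt")
writeLines(c("<SEC-HEADER>ignored</SEC-HEADER>",
             "<DOCUMENT>",
             "<P>Item 8.01 Other Events.</P>",
             "</DOCUMENT>"), path)
writeLines(clean_filing(path), path)  # overwrites the file in place
```

Inside the loop this would reduce to `writeLines(clean_filing(dest.filename), dest.filename)`. Since the overwrite is destructive, running it against a copy of the 1800 folder first seems safer.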
