Removing text from a character string with carriage returns

StatSteph · August 10, 2020, 7:34pm

I am trying to remove everything in a string from the word "Footnotes" onward (inclusive) in the following example, but it's not working as I expected: it only deletes characters from that line and not from the lines after it. Any ideas?

The end goal is to remove the footer and then use read_csv to read the data in without it.

library(stringr)

j <- ",,,\"$75,000 or more\",, 3310, ! , 5224, ! , \r\n,,,\"Unknown\",, 1936, ! , 5369, ! , \r\n\r\n\r\n\r\n\r\nFootnotes:\r\n*  Special tabulations from the NCVS Victimization Analysis Tool (NVAT).\r\n\"*  Detail may not sum to total due to rounding and/or missing data."
writeLines(j)
#> ,,,"$75,000 or more",, 3310, ! , 5224, ! , 
#> ,,,"Unknown",, 1936, ! , 5369, ! , 
#> 
#> 
#> 
#> 
#> Footnotes:
#> *  Special tabulations from the NCVS Victimization Analysis Tool (NVAT).
#> "*  Detail may not sum to total due to rounding and/or missing data.

# Goal is to remove everything starting with "Footnotes" from j
j2 <- str_replace_all(j, "Footnotes.*", "")
writeLines(j2)
#> ,,,"$75,000 or more",, 3310, ! , 5224, ! , 
#> ,,,"Unknown",, 1936, ! , 5369, ! , 
#> 
#> 
#> 
#> 
#> 
#> *  Special tabulations from the NCVS Victimization Analysis Tool (NVAT).
#> "*  Detail may not sum to total due to rounding and/or missing data.

Created on 2020-08-10 by the reprex package (v0.3.0)

StatSteph · August 10, 2020, 8:35pm

I got some help elsewhere (thanks Cass on R-Ladies Slack) and here's a solution.

library(stringr)
j <- ",,,\"$75,000 or more\",, 3310, ! , 5224, ! , \r\n,,,\"Unknown\",, 1936, ! , 5369, ! , \r\n\r\n\r\n\r\n\r\nFootnotes:\r\n*  Special tabulations from the NCVS Victimization Analysis Tool (NVAT).\r\n\"*  Detail may not sum to total due to rounding and/or missing data."
j2 <- str_split(j, "Footnotes.*", simplify=TRUE)
j2_before_footnotes <- j2[1]
writeLines(j2_before_footnotes)
#> ,,,"$75,000 or more",, 3310, ! , 5224, ! , 
#> ,,,"Unknown",, 1936, ! , 5369, ! , 
#> 
#> 
#> 
#>

Created on 2020-08-10 by the reprex package (v0.3.0)

riva · August 10, 2020, 8:36pm

Hi, StatSteph!

By default, . matches any character except a line terminator, so .* stops at the end of the "Footnotes:" line. You need to allow the line breaks explicitly, with something like this:

j2 <- str_remove_all(j, "Footnotes(.|\r\n)*")
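
An alternative to spelling out the line break in the pattern is to switch on "dotall" mode, in which . also matches line terminators. A minimal sketch, assuming stringr's regex() helper and a shortened stand-in string (the j below is made up for illustration, not the original data):

```r
library(stringr)

# Shortened, hypothetical stand-in for the real input
j <- "a,b\r\n1,2\r\n\r\nFootnotes:\r\n* first note\r\n* second note"

# dotall = TRUE lets "." match line terminators, so ".*" runs to the end of the string
j2 <- str_remove(j, regex("Footnotes.*", dotall = TRUE))

# The inline flag (?s) is another way to request the same behaviour
j3 <- str_remove(j, "(?s)Footnotes.*")

writeLines(j2)
```

Both calls leave only the text before "Footnotes", including the blank lines that preceded it.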

StatSteph · August 10, 2020, 8:37pm

Thanks @riva. These are both great solutions and that explains why it was ignoring the new line!

technocrat · August 10, 2020, 8:58pm

j is ugly. If there is a lot of input like it, I would tokenize the lot and snip off the footers along the following lines:

# Only stringr is needed here
suppressPackageStartupMessages({
  library(stringr)
})

j <- ",,,\"$75,000 or more\",, 3310, ! , 5224, ! , \r\n,,,\"Unknown\",, 1936, ! , 5369, ! , \r\n\r\n\r\n\r\n\r\nFootnotes:\r\n*  Special tabulations from the NCVS Victimization Analysis Tool (NVAT).\r\n\"*  Detail may not sum to total due to rounding and/or missing data."

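# Tokenize on single spaces
k <- str_split(j, " ")

# Keep every token before the first one that contains "Footnote"
k[[1]][seq_len(which(str_detect(k[[1]], "Footnote"))[1] - 1)]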
#>  [1] ",,,\"$75,000"         "or"                   "more\",,"            
#>  [4] "3310,"                "!"                    ","                   
#>  [7] "5224,"                "!"                    ","                   
#> [10] "\r\n,,,\"Unknown\",," "1936,"                "!"                   
#> [13] ","                    "5369,"                "!"                   
#> [16] ","

# Glue the kept tokens back together (note: collapse = "" drops the spaces)
paste(k[[1]][seq_len(which(str_detect(k[[1]], "Footnote"))[1] - 1)], collapse = "")
#> [1] ",,,\"$75,000ormore\",,3310,!,5224,!,\r\n,,,\"Unknown\",,1936,!,5369,!,"

Created on 2020-08-10 by the reprex package (v0.3.0)

StatSteph · August 10, 2020, 11:56pm

Thanks, but this doesn't quite work. It changes "$75,000 or more" to "$75,000ormore", which isn't ideal. I think the other solutions are great. In reality, my input is much longer than what I've shown, but it always ends with "Footnotes:" and then a few lines I want to discard.
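
For reference, here is roughly how the footer removal and the read step could fit together. This is only a sketch: raw is a small made-up sample rather than my real file, and it assumes readr 2.0 or later, where I() marks a string as literal data and show_col_types is available.

```r
library(stringr)
library(readr)

# Small, hypothetical sample with the same shape: data rows, blank lines, footer
raw <- "income,count\r\n\"$75,000 or more\",3310\r\nUnknown,1936\r\n\r\n\r\nFootnotes:\r\n* note one\r\n* note two"

# Drop "Footnotes" and everything after it, then trim the trailing blank lines
body <- str_trim(str_remove(raw, regex("Footnotes.*", dotall = TRUE)), side = "right")

# I() tells read_csv the string is the data itself, not a file path (readr >= 2.0)
dat <- read_csv(I(body), show_col_types = FALSE)
dat
```

The quoted "$75,000 or more" field keeps its spaces and its embedded comma, because the string is never split on spaces.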

technocrat · August 11, 2020, 12:07am

FWIW: that could be fixed with a more careful split. More broadly this is a class of parsing problem that benefits from a broad look. It may be that the source data has a consistent

\r\n\r\n\r\n\r\n\r\nFootnotes:\r\n*

but when the whitespace padding varies ...
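
One way to be less sensitive to the padding is to work line by line instead of on the raw string. A rough sketch, assuming the footer always begins on a line whose first non-blank text is "Footnotes" (the sample j here is made up, with deliberately irregular padding):

```r
library(stringr)

# Hypothetical input with uneven blank lines and an indented footer heading
j <- "a,b \r\n1,2 \r\n\r\n \r\n  Footnotes:\r\n* note"

# Split on either line-ending style
lines <- str_split(j, "\r?\n")[[1]]

# Index of the first line that starts with "Footnotes", ignoring leading whitespace
foot <- which(str_detect(lines, "^\\s*Footnotes"))[1]

# Keep the lines before the footer (all lines if no footer is found)
kept <- if (is.na(foot)) lines else lines[seq_len(foot - 1)]

# Drop every blank or whitespace-only line, wherever it occurs
kept <- kept[str_trim(kept) != ""]

clean <- paste(kept, collapse = "\n")
writeLines(clean)
```

Because the split is on line breaks rather than spaces, the content of each data row is left untouched.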
