Can I remove stopwords with one exception, and remove duplicate words within one document?

DLU · January 25, 2020, 3:39pm

Dear all,

I used the remove-stopwords function (Dutch) in tm, but one of the words it removes is a word I really want to keep. Is there a way to add an exception to this function, or can I put that word back afterwards with some other function?

I also have another question. I have a text file with 1015 lines of explanations. Within each line some words are used more than once, and that is influencing the results I get from findAssocs. Is there any way to remove duplicate words within the same line / document?

I hope someone can help me with this. Many thanks in advance.

With kind regards,

Diana

technocrat · January 25, 2020, 7:31pm

Stopword editing is easy in the Python nltk package, but I haven't figured out a way to do it in R yet.
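
One idea that might work in R, offered as an untested sketch rather than a confirmed recipe: a stopword list is just a character vector, so the word to keep could be dropped from it with setdiff() before the list is used. The small dutch_stops vector and the kept word "niet" below are made up for illustration; with tm loaded, stopwords("dutch") would supply the real vector, and the edited list could be passed to removeWords.

```r
# Hypothetical mini stopword list (illustration only); with tm loaded,
# stopwords("dutch") would supply the real character vector instead.
dutch_stops <- c("de", "het", "een", "niet", "en", "van")

# Assumed word to keep; replace with the word you actually need.
keep <- "niet"
custom_stops <- setdiff(dutch_stops, keep)

# Base-R demonstration of the effect on one line of text.
text <- "het is niet een probleem van de gebruiker"
tokens <- unlist(strsplit(text, split = " ", fixed = TRUE))
result <- paste(tokens[!tokens %in% custom_stops], collapse = " ")
print(result)

# With tm, the edited list could then be applied to a corpus, e.g.:
# corpus <- tm_map(corpus, removeWords, custom_stops)
```

Here "niet" survives while the other stopwords are dropped. Editing the list before removal avoids having to re-insert the word afterwards, which would be hard once its position in the text is lost.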

This recipe by @andrekos, from https://stackoverflow.com/questions/20283624/removing-duplicate-words-in-a-string-in-r, should work for the duplicated words (on a per-line basis). This example removes the second "een":

text <- "Het verzamelen puur voor het genoegen was een nieuw fenomeen dat in de Republiek een hoge vlucht nam"
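# Split the line into words, then keep only the first occurrence of each.
# !duplicated() also works when a line has no repeated words.
d <- unlist(strsplit(text, split = " "))
paste(d[!duplicated(d)], collapse = " ")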
#> [1] "Het verzamelen puur voor het genoegen was een nieuw fenomeen dat in de Republiek hoge vlucht nam"

Created on 2020-01-25 by the reprex package (v0.3.0)
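
To apply the same idea to every line of a file, something along these lines might work. It is a sketch that assumes the explanations have been read into a character vector with one element per line (for example with readLines()). The lines vector and the dedupe_line helper are made up for illustration, and the ignore_case option is an optional extra so that "Het" and "het" can be treated as the same word.

```r
# Hypothetical input: one element per line/document,
# e.g. the result of readLines() on the text file.
lines <- c(
  "een hoge vlucht een hoge prijs",
  "het genoegen was groot",
  "Het verzamelen voor het genoegen"
)

# Illustrative helper: keep only the first occurrence of each word in a line.
dedupe_line <- function(x, ignore_case = FALSE) {
  words <- unlist(strsplit(x, split = " ", fixed = TRUE))
  # Compare on lowercased words if requested, but keep the original spelling.
  key <- if (ignore_case) tolower(words) else words
  paste(words[!duplicated(key)], collapse = " ")
}

# Case-sensitive: "Het" and "het" count as different words.
deduped <- vapply(lines, dedupe_line, character(1), USE.NAMES = FALSE)

# Case-insensitive variant.
deduped_ci <- vapply(lines, dedupe_line, character(1),
                     ignore_case = TRUE, USE.NAMES = FALSE)

print(deduped)
print(deduped_ci)
```

The deduplicated vector could then be used to build the corpus, so each word counts at most once per document before findAssocs is run.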
