I am doing an LSTM analysis of tweets and am facing the following issue:
I want to replace every word in a data frame with a number based on its word frequency.
The first five rows of my data frame look like this:
text
1 Apropos baerbockfails Von CDU und CSU ist die sogenannte "bürgerliche Mitte" Betrug und Trickserei gewöhnt.
2 CDU: Die Laschet-Union tut nichts, will nichts – und trifft damit den Nerv einer genervten Bevölkerung.
3 Ich habe heute Bilanz des Scheiterns der Klimapolitik von #Merkel gezogen
4 Die_Waffenlobby saki_statement dieLinke Die_Gruenen Warum verschwendet unsere Waffenlobby ihre Zeit
5 BMieterverein: MieterInnentag21: Regierungsprogramm der #CDU soll am 21.6.2021 vorgestellt werden. Wir sind gespannt!
I used the following code:
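# LSTM
# Word count
# install.packages(c("dplyr", "tidytext"))  # run once if the packages are missing
library(dplyr)
library(tidytext)  # provides unnest_tokens()
prof.tm <- unnest_tokens(twitter, word, text)
word.freq <- prof.tm %>% count(word, sort = TRUE)
# nr is the frequency rank, 1 = most frequent word (18420 distinct words in my data)
word.freq <- cbind(word.freq, "nr" = seq_len(nrow(word.freq)))
word.freq2 <- word.freq %>%
  select(nr, word)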
tweet <- twitter$text
tweettxt <- data.frame(
  stringsAsFactors = FALSE,
  tweetwords = strsplit(tweet, " ")[[1]]  # [[1]] selects only the first tweet
)
# combine the two tables: column n will contain the frequencies, nr the ranks
tweetnum <- tweettxt %>%
  left_join(word.freq, by = c("tweetwords" = "word")) %>%
  mutate(n = ifelse(is.na(n), 0, n),
         nr = ifelse(is.na(nr), Inf, nr))
tweetchar <- paste("[", tweetnum$nr, "]", sep = "", collapse = " ")
Do you know how I can apply this code to every row, not only to the first row of the dataset?
And how do I replace the words in the data frame with these values?
I hope I have made my point clear, and I am grateful for any help!
So far I did it manually: I looked up each word's position in the word.freq table and replaced the word with that number.
For example, [1] stands for "die", which is the most frequent word in the dataset.
I am looking for code that does this automatically for the whole dataset.
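One possible way to do this for every row is sketched below. It is a rough sketch under assumptions, not a definitive solution: the three-row `twitter` data frame is made-up sample data, and the names `tweet_id`, `tokens`, `tweet.ranks` and `ranks` are hypothetical and not part of the original code. The idea is to tokenise all tweets at once while keeping a row id, join the tokens against the rank table, and then paste the ranks back together per tweet. Using `unnest_tokens()` for the lookup as well as for the frequency table keeps the tokens consistent, since by default it lowercases words and strips punctuation, which `strsplit(tweet, " ")` does not.

```r
library(dplyr)
library(tidytext)

# Made-up sample data standing in for the real `twitter` data frame
twitter <- data.frame(
  text = c("Die CDU und die CSU", "die Union tut nichts", "CDU will nichts"),
  stringsAsFactors = FALSE
)

# Tokenise every tweet, keeping a row id so tokens can be traced back
tokens <- twitter %>%
  mutate(tweet_id = row_number()) %>%
  unnest_tokens(word, text)

# Frequency table; nr is the frequency rank (1 = most frequent word)
word.freq <- tokens %>%
  count(word, sort = TRUE) %>%
  mutate(nr = row_number())

# Look up the rank of every token, then collapse to one string per tweet
tweet.ranks <- tokens %>%
  left_join(word.freq, by = "word") %>%
  group_by(tweet_id) %>%
  summarise(ranks = paste0("[", nr, "]", collapse = " "), .groups = "drop")

# Attach the result to the original rows; tweets that yield no tokens get NA
twitter$ranks <- tweet.ranks$ranks[match(seq_len(nrow(twitter)), tweet.ranks$tweet_id)]

print(twitter)
```

Because the frequency table here is built from the same tokens that are looked up, every token finds a rank and no `NA` handling is needed in the join. If the rank table came from a different corpus, the `ifelse(is.na(nr), ...)` step from the original code would still be required. For feeding an LSTM, integer vectors per tweet are probably more convenient than the pasted strings; those could be obtained by splitting the joined `nr` column by `tweet_id` instead of pasting it.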