Replace words in a data frame.

Hey community!

I am doing an LSTM analysis of tweets and am facing the following issue:

I want to replace each word in a data frame with a number based on its word frequency (its rank in the frequency table).

The first five rows of my data frame look like this:

    text
1 Apropos baerbockfails Von CDU und CSU ist die sogenannte "bürgerliche Mitte" Betrug und Trickserei  gewöhnt.
2    CDU: Die Laschet-Union tut nichts, will nichts – und trifft damit den Nerv einer genervten Bevölkerung. 
3   Ich habe heute Bilanz des Scheiterns der Klimapolitik von #Merkel gezogen 
4   Die_Waffenlobby saki_statement dieLinke Die_Gruenen Warum verschwendet unsere Waffenlobby ihre Zeit 
5  BMieterverein: MieterInnentag21: Regierungsprogramm der #CDU soll am 21.6.2021 vorgestellt werden. Wir sind gespannt! 

I used the following code:

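# LSTM
# `twitter` is my data frame with the tweets in the column `text`

# install.packages("dplyr")  # run once if the package is missing
library(dplyr)
library(tidytext)

# Word count: one row per word, most frequent first
prof.tm <- unnest_tokens(twitter, word, text)
word.freq <- prof.tm %>% count(word, sort = TRUE)

# nr is the frequency rank (1 = most frequent word)
word.freq <- cbind(word.freq, "nr" = seq_len(nrow(word.freq)))
word.freq2 <- word.freq %>%
  select(nr, word)

# Split the first tweet into words (only row 1 is handled here)
tweet <- twitter$text
tweettxt <- data.frame(
  stringsAsFactors = FALSE,
  tweetwords = strsplit(tweet, " ")[[1]]
)

# Combine the two tables: column n will contain the frequencies, nr the ranks
tweetnum <- tweettxt %>%
  left_join(word.freq, by = c("tweetwords" = "word")) %>%
  mutate(n = ifelse(is.na(n), 0, n),
         nr = ifelse(is.na(nr), Inf, nr))

tweetchar <- paste("[", tweetnum$nr, "]", sep = "", collapse = " ")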

Do you know how I can apply this code to every row, and not only to the first row of the dataset?
And how do I replace the words with these values in the data frame?

I hope I have made my point clear, and I look forward to any help!

Can you show an example of what the final data frame should look like?

Hey @cactusoxbird,

the final data frame should look like this:

text
1 [2890] [2829] [14] [8] [6] [48] [13] [1] [1282] [2460] [972] [1733] [6] [16698] [11959]
2 [8] [1] [162] [766] [300] [125] [300] [6] [1519] [90] [12] [5412] [173] [5106] [1480]
3 [23] [212] [77] [2013] [60] [15750] [3] [888] [14] [276] [11962]
4 [6424] [7789] [896] [1149] [131] [3936] [190] [8273] [62] [755]
5 [4104] [3398] [7723] [3] [8] [369] [69] [5886] [4677] [103] [34] [50] [2038]

I did this manually by looking up each word's position in the word.freq table and replacing the word with that number.
In this example, [1] is "die", which is the most frequent word in the dataset.

I am looking for code that does this automatically for the whole dataset :wink:

I think you'll need to refine this to account for punctuation, spacing, etc., but I believe this generally does what you're looking for:

library(tidyverse)
library(tidytext)

# Adapted from the Help page example for str_replace
# https://www.rdocumentation.org/packages/stringr/versions/1.4.0/topics/str_replace:
c("one apple", "two pears", "three bananas") %>%
  str_c(collapse = "---") %>%
  str_replace_all(c("one" = "1", "two" = "2", "three" = "3"))
#> [1] "1 apple---2 pears---3 bananas"


df <- tribble(
  ~text,
  'Apropos baerbockfails Von CDU und CSU ist die sogenannte "bürgerliche Mitte" Betrug und Trickserei  gewöhnt.',
  'CDU: Die Laschet-Union tut nichts, will nichts - und trifft damit den Nerv einer genervten Bevölkerung.', 
  'Ich habe heute Bilanz des Scheiterns der Klimapolitik von #Merkel gezogen',
  'Die_Waffenlobby saki_statement dieLinke Die_Gruenen Warum verschwendet unsere Waffenlobby ihre Zeit', 
  'BMieterverein: MieterInnentag21: Regierungsprogramm der #CDU soll am 21.6.2021 vorgestellt werden. Wir sind gespannt!'
)


word_counts <- unnest_tokens(df, word, text) %>%
  count(word, sort = TRUE) %>%
  mutate(word_rank = rank(desc(n), ties.method = "first"))


df %>%
  mutate(text = str_replace_all(tolower(text),
                                word_counts %>%
                                  select(word, word_rank) %>%
                                  mutate(word_rank = paste0("[", .$word_rank, "]")) %>%
                                  deframe()))
#> # A tibble: 5 x 1
#>   text                                                                          
#>   <chr>                                                                         
#> 1 "[9] [10] [6] [1] [2] [16] [32] [4] [43] \"[15] [37]\" [11] [2] [45]  [26]."  
#> 2 "[1]: [4] [34]-[48] [47] [5], [55] [5] - [2] [46] d[8]it [18] [38] [23] [24] ~
#> 3 "[30] [28] [29] [13] [19] [41] [3] [33] [6] #[35] [27]"                       
#> 4 "[4]_[52] [40] [4]linke [4]_gruenen [53] [50] [49] [52] [31] [57]"            
#> 5 "[14]: [36]: regierungsprogr[8]m [3] #[1] [44] [8] [7] [51] wer[18]. [56] [42~

Created on 2021-08-30 by the reprex package (v2.0.0)
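
The output above shows some partial matches (for example `d[8]it` and `wer[18].`), because `str_replace_all()` also replaces words that occur inside longer words. One possible way around this is to tokenize each tweet, join the tokens against the rank table, and paste the ranks back together per tweet. The following is only a minimal sketch on toy data; the names `tweet_id`, `tokens`, `word_ranks`, and `df_ranked` are made up for illustration, and by default `unnest_tokens()` lowercases the text and drops punctuation, which may or may not be what you want.

```r
library(dplyr)
library(tidytext)

# Toy data (hypothetical); replace with the real tweets
df <- tibble(text = c("die CDU und die CSU", "damit am Ende die CDU"))

# One row per word, keeping track of which tweet each word came from
tokens <- df %>%
  mutate(tweet_id = row_number()) %>%
  unnest_tokens(word, text)

# Rank table: 1 = most frequent word
word_ranks <- tokens %>%
  count(word, sort = TRUE) %>%
  mutate(word_rank = row_number())

# Look up each word's rank and rebuild one string per tweet
df_ranked <- tokens %>%
  left_join(word_ranks, by = "word") %>%
  group_by(tweet_id) %>%
  summarise(text = paste0("[", word_rank, "]", collapse = " "),
            .groups = "drop")

df_ranked
```

Because every token is matched as a whole, "damit" gets its own rank here instead of having "am" replaced inside it.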

Thanks @cactusoxbird, that's brilliant! :slight_smile:

I just added df3<-cbind(df, "integer"=df2) at the end
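
For anyone reading later: an alternative that keeps the column names explicit might be to add the encoded text as a new column with `mutate()`. This is only a sketch on toy data, and it assumes `df2` holds the encoded tweets in a column called `text`, which is a guess about the setup described above.

```r
library(dplyr)

# Toy stand-ins (hypothetical): original tweets and their encoded versions
df  <- tibble(text = c("die CDU und die CSU", "die CDU"))
df2 <- tibble(text = c("[1] [2] [3] [1] [4]", "[1] [2]"))

# Keep the original text and add the encoded version next to it
df3 <- df %>%
  mutate(integer = df2$text)

df3
```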
