Hello.
I have a question about how to clean my data for bigrams.
Here is my code:
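library(tidyverse)  # read_lines(), tibble(), %>%, separate(), filter(), count()
library(tidytext)   # unnest_tokens(), stop_words

# Read the transcript and number its lines
r <- read_lines('Blinken.txt')
text_r <- tibble(line = seq_along(r), text = r)
# Split each line into bigrams
r_bigrams <- text_r %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2)
# Put the two words of each bigram in separate columns
bigrams_separated <- r_bigrams %>%
  separate(bigram, c("word1", "word2"), sep = " ")
# Drop bigrams that contain a stop word, and rows with missing words
bigrams_filtered <- bigrams_separated %>%
  filter(!word1 %in% stop_words$word,
         !word2 %in% stop_words$word) %>%
  na.omit()
# Count each remaining bigram, most frequent first
bigram_counts <- bigrams_filtered %>%
  count(word1, word2, sort = TRUE)
bigram_counts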
# A tibble: 2,639 x 3
   word1     word2              n
 1 president elect             51
 2 human     rights            31
 3 national  security          28
 4 biden     administration    27
 5 foreign   policy            24
 6 um        hmm               19
 7 elect     biden             14
 8 senator   menendez          14
 9 trump     administration    14
10 al        qaeda             13
# ... with 2,629 more rows
I would like to remove the "um hmm" bigram in the sixth row of the result.
More generally, I want to filter out a custom set of words, including "um", "hmm", and others such as senators' names.
How should I do this?
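One possible approach, as a minimal sketch rather than a definitive answer: keep a vector of extra words to drop and filter both word columns against it, just as the code above does with stop_words$word. The name custom_words and its contents are assumptions for illustration, and the small tibble only stands in for the real bigrams_filtered so the example runs on its own.

```r
library(dplyr)
library(tibble)

# Toy stand-in for bigrams_filtered (word pairs borrowed from the output above)
bigrams_filtered <- tibble(
  word1 = c("president", "um", "senator", "human"),
  word2 = c("elect", "hmm", "menendez", "rights")
)

# Hypothetical list of extra words to remove; extend it as needed
custom_words <- c("um", "hmm", "menendez")

# Keep only bigrams in which neither word is on the custom list
bigrams_cleaned <- bigrams_filtered %>%
  filter(!word1 %in% custom_words,
         !word2 %in% custom_words)

# Recount the remaining bigrams
bigram_counts <- bigrams_cleaned %>%
  count(word1, word2, sort = TRUE)
bigram_counts
```

In the full pipeline, the same filter could be applied right after the stop_words filter. Another option would be to append the custom words to stop_words itself, for example with bind_rows(stop_words, tibble(word = custom_words, lexicon = "custom")), and then filter once against the combined list.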