How to clean data for bigrams


I have a question about how to clean my data for bigrams.

Here is my code, followed by the first rows of the result.

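library(tidyverse)  # read_lines(), tibble(), separate(), filter(), count()
library(tidytext)   # unnest_tokens(), stop_words

r <- read_lines('Blinken.txt')

# Number the lines from the data itself rather than hard-coding 2229
text_r <- tibble(line = seq_along(r), text = r)

# Split each line into two-word sequences
r_bigrams <- text_r %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2)

bigrams_separated <- r_bigrams %>%
  separate(bigram, c("word1", "word2"), sep = " ")

# Drop bigrams containing a stop word, then rows with missing values
bigrams_filtered <- bigrams_separated %>%
  filter(!word1 %in% stop_words$word,
         !word2 %in% stop_words$word) %>%
  na.omit()

bigram_counts <- bigrams_filtered %>%
  count(word1, word2, sort = TRUE)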


# A tibble: 2,639 x 3
   word1     word2              n
 1 president elect             51
 2 human     rights            31
 3 national  security          28
 4 biden     administration    27
 5 foreign   policy            24
 6 um        hmm               19
 7 elect     biden             14
 8 senator   menendez          14
 9 trump     administration    14
10 al        qaeda             13
# ... with 2,629 more rows

I would like to remove "um hmm" (the sixth row) from the result.
More generally, I want to filter out certain words, including "um", "hmm", and others such as senators' names.
How can I do this?
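One possible approach, sketched here under assumptions: keep the unwanted words in a character vector and drop every bigram in which either word appears in it. The tibble below just re-enters the ten rows shown above so the snippet runs on its own, and `custom_words` is a hypothetical name with "menendez" standing in for whichever names should go.

```r
library(dplyr)
library(tibble)

# The ten rows of bigram_counts shown in the question, re-entered by hand
bigram_counts <- tribble(
  ~word1,      ~word2,           ~n,
  "president", "elect",          51,
  "human",     "rights",         31,
  "national",  "security",       28,
  "biden",     "administration", 27,
  "foreign",   "policy",         24,
  "um",        "hmm",            19,
  "elect",     "biden",          14,
  "senator",   "menendez",       14,
  "trump",     "administration", 14,
  "al",        "qaeda",          13
)

# Hypothetical list of extra words to drop (assumption); extend as needed
custom_words <- c("um", "hmm", "menendez")

# Keep only bigrams where neither word is in the custom list
bigram_counts_clean <- bigram_counts %>%
  filter(!word1 %in% custom_words,
         !word2 %in% custom_words)

print(bigram_counts_clean)
```

On the real data the same `filter()` call could be applied to `bigram_counts` directly, or added to the step that already filters against `stop_words$word`.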


To help you, we need a reprex so we can actually run the part of the code you are asking about. A reprex consists of the minimal code and data needed to recreate the issue or question you're having. You can find instructions on how to build and share one here:

Also, please provide a bit more detail on what you would like to accomplish (for example, a before/after of the result).
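For illustration, a reprex for this question might look roughly like the sketch below. The two sample lines are invented stand-ins for `Blinken.txt`, and `custom_stop_words` is a hypothetical name. It also shows a second option: extending tidytext's `stop_words` table with the extra words, so the existing filtering step handles them.

```r
library(dplyr)
library(tidyr)
library(tibble)
library(tidytext)

# Invented sample text (assumption) so nobody needs the original file
text_r <- tibble(
  line = 1:2,
  text = c("Um hmm, the president elect spoke about human rights",
           "Senator Menendez asked about national security")
)

# stop_words plus a few extra words; "custom" is just a label for the lexicon column
custom_stop_words <- bind_rows(
  stop_words,
  tibble(word = c("um", "hmm", "menendez"), lexicon = "custom")
)

bigram_counts <- text_r %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2) %>%
  separate(bigram, c("word1", "word2"), sep = " ") %>%
  filter(!word1 %in% custom_stop_words$word,
         !word2 %in% custom_stop_words$word) %>%
  count(word1, word2, sort = TRUE)

print(bigram_counts)
```

A short before/after built on data like this makes it clear which rows should disappear.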

