Hi. I looked at the word frequencies in President Trump's speech to the UN General Assembly (UNGA) this year (the original text can be found here: https://www.whitehouse.gov/briefings-statements/remarks-president-trump-75th-session-united-nations-general-assembly/)
Below is the code I used to check which words were used most often.
library(tidyverse)   # read_lines(), tibble(), dplyr verbs
library(tidytext)    # unnest_tokens(), stop_words

r <- read_lines('UN.txt')
text_r <- tibble(line = 1:43, text = r)   # the file has 43 lines
tidy_r <- text_r %>%
  unnest_tokens(word, text) %>%   # one lowercased token per row
  anti_join(stop_words)           # drop common stop words
count_tidy_r <- tidy_r %>% count(word, sort = TRUE)
count_tidy_r
The result is as follows.
# A tibble: 327 x 2
   word          n
   <chr>     <int>
 1 world        11
 2 china         8
 3 united        8
 4 peace         7
 5 america       6
 6 human         6
 7 countries     5
 8 nations       5
 9 god           4
10 it's          4
# ... with 317 more rows
But the problem is that when I check the word "world" in the speech, it actually appears 13 times. I first thought that some occurrences of "world" had been eliminated when I cleaned the speech, so I re-counted the most frequently used words without the anti_join(stop_words) step. But the frequency for "world" was still 11. Why can't I get 13?
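For reference, here is a minimal sketch of that check. The first part just repeats the count without the stop-word filter; the second part is my own addition (using str_detect from stringr, loaded with the tidyverse) that lists every token containing "world", which should show where any missing occurrences went:

tidy_all <- text_r %>% unnest_tokens(word, text)   # tokenize without removing stop words

# "world" still comes out at 11 here
tidy_all %>% count(word, sort = TRUE) %>% filter(word == "world")

# List every distinct token that contains "world". Note that unnest_tokens
# keeps possessives/contractions as single tokens (the output above shows
# "it's"), so a token like "world's" would be counted separately from "world".
tidy_all %>% filter(str_detect(word, "world")) %>% count(word, sort = TRUE)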