Word count is incorrect

Hi. I tried counting word frequencies in President Trump's speech to the UNGA this year (the original text can be found here: https://www.whitehouse.gov/briefings-statements/remarks-president-trump-75th-session-united-nations-general-assembly/).

The following is my code to check which words were used most.

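library(tidyverse)  # read_lines(), tibble(), %>%, count()
library(tidytext)   # unnest_tokens(), stop_words

r <- read_lines('UN.txt')
text_r <- tibble(line = seq_along(r), text = r)  # one row per line of the speech
tidy_r <- text_r %>% unnest_tokens(word, text) %>% anti_join(stop_words)
count_tidy_r <- tidy_r %>% count(word, sort = TRUE)
count_tidy_r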

The result is as follows.

# A tibble: 327 x 2
   word          n
 1 world        11
 2 china         8
 3 united        8
 4 peace         7
 5 america       6
 6 human         6
 7 countries     5
 8 nations       5
 9 god           4
10 it’s          4
# ... with 317 more rows

But the problem is that when I search for the word "world" in the speech, it is used 13 times. I first thought that some occurrences of "world" had been dropped when I cleaned the speech, so I counted the most frequently used words again without "anti_join(stop_words)". But the count for "world" is still 11. Why can't I get 13?

Is it possible it is not counting capitalized World as world?

Hello @supreme02,

I am pretty sure your issue is with the two instances of world that were written as world’s. They are the only two instances that differ from the rest.

Hello @GreyMerchant. You are right, thank you. I have checked all the cases: in two of the 13, the word appears as "world's". How can I get those counted as well?


If the 's doesn't matter on other words, I would remove it from your r object directly by replacing that string with nothing.

Since this is a short speech, I could just delete the 's by hand, but I want to know how to do it in code. I tried the following after "count_tidy_r", but it did not work.

str_replace_all(count_tidy_r, "'s", "")

Try this...

tidy_r %>% 
  mutate(word = sub('[[:punct:]\u2019].*', '', word)) %>%
  count(word, sort = TRUE)

The quote mark in the text is a curly ’ rather than a straight ', so a pattern written with ' does not match it. The unicode for the curly one is '\u2019'.
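To see the difference between the two quote marks, here is a minimal sketch on a made-up token (the names token, straight, curly and cleaned are only for illustration and are not part of the code above):

```r
# Hypothetical sample token; "\u2019" is the curly apostrophe
token <- "world\u2019s"

# A straight apostrophe in the pattern does not match the curly one
straight <- grepl("'s", token, fixed = TRUE)

# A pattern written with the curly apostrophe does match
curly <- grepl("\u2019s", token, fixed = TRUE)

# A character class covering both kinds removes either form of 's
cleaned <- gsub("['\u2019]s", "", c(token, "world's"))
```

Note also that str_replace_all() expects a character vector, so it would need to be applied to the word column (for example inside mutate()) rather than to the whole count_tidy_r tibble.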


Wow, thank you so much @jmcvw! Would you be able to explain what is inside the mutate() call? I guess 'word' refers to the column from "tidy_r", but I am not sure what the other parts inside the () mean. The reason I am asking is that I have always wanted to change or delete some words in a tibble but don't know how. Thanks a lot!

The mutate function allows you to change an existing column or add a new column with the name specified before the = sign.
Here word is a column that exists in the dataframe that is created by unnest_tokens().

Following the = is the base function sub(), repeated below with the argument names inserted:

sub(pattern = '[[:punct:]\u2019].*', replacement = '', x = word)

The sub function substitutes text as specified by the pattern argument, which is a regular expression. Regexes are pretty confusing when you first encounter them.

This one looks for any punctuation character ([:punct:]) or a ’ ("\u2019"). These two parts are enclosed in [] to form a "character class", which matches any single character it lists. The brackets are followed by .*, which matches every character after that first match, so the punctuation mark and everything following it are replaced.

The next argument is the replacement, i.e. what should replace any text found to match the pattern. In this case it is replaced with an empty string ''.

The final argument x specifies the name of the object / column to search.
So this example replaces the word column with a modified version of itself.
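As a small worked example of the same call outside of mutate(), here is a sketch on a few made-up tokens (the vector words and the names stems and counts are assumptions for illustration, not taken from the speech data):

```r
# Made-up sample tokens, similar to what unnest_tokens() might produce
words <- c("world", "world\u2019s", "it\u2019s", "nations")

# Drop everything from the first punctuation mark or curly apostrophe onward
stems <- sub(pattern = '[[:punct:]\u2019].*', replacement = '', x = words)

# Tokens that now share a spelling are counted together
counts <- table(stems)
```

Here both "world" and "world’s" end up as "world", so they are counted together. Be aware that the same rule also shortens "it’s" to "it", which may or may not be what you want for other words.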

For more info on regular expressions you could start with ?regex.
Here I used the base sub(); the stringr package has tidyverse equivalents of the base regex functions.
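For comparison, this is roughly how the same substitution might look with stringr. It is only a sketch on made-up tokens, assuming stringr is installed; str_remove() and str_replace() are real stringr functions, but the vector and result names are just for illustration:

```r
library(stringr)

# Made-up sample tokens, not taken from the speech
words <- c("world", "world\u2019s", "nations")

# str_remove() deletes the first match of the pattern in each string
stems_remove <- str_remove(words, "[[:punct:]\u2019].*")

# str_replace() does the same with an explicit empty replacement,
# which mirrors sub(pattern, '', x) more closely
stems_replace <- str_replace(words, "[[:punct:]\u2019].*", "")
```

Inside the pipeline this would go in the same place as sub(), e.g. mutate(word = str_remove(word, '[[:punct:]\u2019].*')).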

Hope this helps a bit.


I really appreciate your detailed explanation! It helps a lot! Thank you so much!!
