1 world 11
2 china 8
3 united 8
4 peace 7
5 america 6
6 human 6
7 countries 5
8 nations 5
9 god 4
10 it’s 4
... with 317 more rows
The problem is that when I search the speech for the word "world", it appears 13 times. At first I thought some instances of "world" had been dropped when I cleaned the speech, so I counted the most frequent words again without "anti_join(stop_words)". But the count for "world" is still 11. Why can't I get 13?
I am pretty sure your issue is with the two instances of world that were written as world’s. They are the only two instances that differ from the rest of the set.
Hello @GreyMerchant. You are right. Thank you. I have checked all the cases. In two of the 13 cases, the word appears as "world's". How can I get those counted as well?
Since this is a short speech, I could just delete the 's by hand, but I want to know how to do it in code. I tried the following code after "count_tidy_r", but it did not work.
Wow, thank you so much @jmcvw! Would you be able to explain the contents of the mutate function? I guess 'word' refers to the column from "tidy_r", but I am not sure what the other parts inside the () mean. The reason I am asking is that I have always wanted to change or delete some words in a tibble but don't know how to do so. Thanks a lot!
The mutate function allows you to change an existing column or add a new column with the name specified before the = sign.
Here word is a column that exists in the dataframe that is created by unnest_tokens().
Following the = is the base function sub(), repeated below with the argument names inserted:
sub(pattern = '[[:punct:]\u2019].*', replacement = '', x = word)
The sub function substitutes text as specified by the pattern argument, which is a regular expression. Regexes are pretty confusing when you first encounter them.
This one looks for any punctuation character ([:punct:]) or a ’ ("\u2019"). These two parts are enclosed in [] to form a "character class", which matches a single character from that set. The brackets are followed by .*, which matches every character after it, so the first punctuation mark and everything following it are replaced.
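To see the pattern on its own, here is a minimal sketch that applies it to a small character vector. The vector `words` is made up for illustration and is not taken from the speech.

```r
# Hypothetical sample words, including both apostrophe styles.
words <- c("world", "world's", "world\u2019s", "it\u2019s", "peace")

# Remove the first punctuation mark (or ’) and everything after it.
cleaned <- sub(pattern = "[[:punct:]\u2019].*", replacement = "", x = words)
cleaned
```

Note that the same pattern also shortens contractions, so "it’s" becomes "it".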
The next argument is the replacement, i.e. what should replace any text that matches the pattern. In this case it is replaced with an empty string, ''.
The final argument x specifies the name of the object / column to search.
So this example replaces the word column with a modified version of itself.
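Put together, the whole step might look like the sketch below. It assumes dplyr is installed, and the tiny `tidy_r` tibble is a hypothetical stand-in for the real output of unnest_tokens(), so the counts are not those of the speech.

```r
library(dplyr)

# Hypothetical stand-in for the tokenised speech: one word per row.
tidy_r <- tibble(word = c("world", "world\u2019s", "peace", "world's", "world"))

count_tidy_r <- tidy_r %>%
  # Overwrite the word column with a cleaned copy of itself.
  mutate(word = sub(pattern = "[[:punct:]\u2019].*", replacement = "", x = word)) %>%
  # Count after cleaning, so "world’s" is tallied together with "world".
  count(word, sort = TRUE)

count_tidy_r
```

The key point is the order: the mutate() has to run before count(), otherwise "world" and "world’s" have already been tallied as separate rows.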
For more info on regular expressions you could start with ?regex.
Here I used the base sub(); the stringr package has tidyverse equivalents of the base regex functions.
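As a rough sketch of the stringr version, assuming the package is installed, str_remove() drops the first match of a pattern, which is what sub() with an empty replacement does. The `words` vector is again made up for illustration.

```r
library(stringr)

# Hypothetical sample words, including both apostrophe styles.
words <- c("world", "world's", "world\u2019s", "peace")

# str_remove(string, pattern) deletes the first match of the pattern.
cleaned_stringr <- str_remove(words, "[[:punct:]\u2019].*")
cleaned_stringr
```

Inside mutate() this would read mutate(word = str_remove(word, "[[:punct:]\u2019].*")).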