Dear Learned Community,

I am very new to R and the tidyverse, so I beg your pardon for what may be a very basic question. I am trying to conduct a tf-idf analysis using tidy tools, more specifically by following Text Mining with R. But I am running into a problem early on.

Here's the relevant code from Chapter 3 of "Text Mining":


book_words <- austen_books() %>%
  unnest_tokens(word, text) %>%
  count(book, word, sort = TRUE)

total_words <- book_words %>%
  group_by(book) %>%
  summarize(total = sum(n))

book_words <- left_join(book_words, total_words)

I adapted it as follows, with 'stoppedwords.Baillie' being one of the somewhat cleaned-up corpora. I removed the code for 'book', since the Jane Austen library has all of her separate novels and I have no need at this point to split Baillie into separate plays (and I don't believe the corpus is structured with those differences marked).

First step:

From: book_words <- austen_books() %>%
  unnest_tokens(word, text) %>%
  count(book, word, sort = TRUE)

To: tdfBaillie <- stoppedwords.Baillie %>%
  count(word, sort = TRUE)

This does return a tibble that looks right--words ranked by frequency.

Second Step:

From: total_words <- book_words %>%
  group_by(book) %>%
  summarize(total = sum(n))

To: total_words <- tdfBaillie %>%
  summarize(total = sum(n))

This also LOOKS like it may be right, returning a tibble of one row and summing up as 219508 words. But then I run into trouble.

Third Step:
From: book_words <- left_join(book_words, total_words)

To: book_words <- left_join(tdfBaillie, total_words)

This returns the following error:

Error: `by` must be supplied when `x` and `y` have no common variables.
i use `by = character()` to perform a cross-join.

I'm not sure what's gone wrong here. Before trying to integrate "by=character()," which I don't know how to do in any case, I need to understand why there seem to be no common variables between x (tdfBaillie) and y (total_words), since the latter is built on the former.

Grateful for any help!

Steve Newman

It's difficult to be sure because I can't reproduce without having the Baillie data, but are you sure they actually have common variables?
count(word, sort = TRUE) gives you a dataframe with variables word and n.
summarise(total = sum(n)) gives you your one row tibble that you mention, that presumably has one variable total.
So there's nothing in common, I think.

In any case, if you want to do a cartesian (cross) join, isn't it normal that they have no variables in common?

But why would you want to have a column where every value says 219508? [Or have I misunderstood this?]
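
To make the point about column names concrete, here is a minimal sketch. It assumes dplyr is installed and uses a made-up `toy_tokens` tibble as a stand-in for the Baillie data, which I don't have; `word_counts` is likewise just an illustrative name.

```r
library(dplyr)

# Hypothetical stand-in for stoppedwords.Baillie: one row per token.
toy_tokens <- tibble(word = c("thou", "thou", "thee", "thy", "thou", "thee"))

# Same two steps as in the question.
word_counts <- toy_tokens %>% count(word, sort = TRUE)
total_words <- word_counts %>% summarise(total = sum(n))

names(word_counts)   # "word" "n"
names(total_words)   # "total"

# No shared column names, so left_join() has nothing to join by.
intersect(names(word_counts), names(total_words))   # character(0)
```

In the book's version both tables carry a `book` column, which is presumably what the join picks up there.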

Dear David,

Thank you very much for this helpful reply.

I see now that x (tdfBaillie) and y (total_words) are different in kind. And, if I'm understanding this, there's no point in left-joining, since that move in Text Mining with R is designed to show the different word frequencies in individual Austen novels. BUT: The question then is how I might get to the next step, which is to calculate the term frequency.

Text Mining with R codes it this way:

freq_by_rank <- book_words %>%
  group_by(book) %>%
  mutate(rank = row_number(),
         `term frequency` = n/total) %>%
  ungroup()

I adapted it as follows:

freq_by_rank <- tdfBaillie %>%
  mutate(rank = row_number(),
         `term frequency` = n/total_words)

I changed 'total' to 'total_words' because I get an error when I use 'total' [Error: object 'total' not found]. This surprises me because I thought I had defined 'total' above as 'total = sum(n)'. This might be a sign that I have something wrong here. And that's borne out by the result of the code above, which is:

# A tibble: 17,896 x 4
   word      n  rank `term frequency`$t~
   <chr> <int> <int>               <dbl>
 1 thou   4144     1              0.0189
 2 thee   2105     2              0.0189
 3 thy    2051     3              0.0189
 4 sir    1349     4              0.0189
 5 enter  1293     5              0.0189
 6 lady   1188     6              0.0189
 7 hand    865     7              0.0189
 8 lord    838     8              0.0189
 9 art     834     9              0.0189
10 dear    758    10              0.0189
# ... with 17,886 more rows

At first, I hoped that perhaps the difference among these top 10 was too slight to register in frequency--grasping at straws. But when I printed out to 200, it's still the same frequency. So that can't be right. But I don't know where I'm going awry here.

Thanks again for any further help.
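
A possible explanation, offered as a sketch rather than a definitive answer: `total` is a column inside the one-row tibble `total_words`, not a column of `tdfBaillie`, which would explain "object 'total' not found" (in the book, the left_join is what copies `total` into `book_words`). Dividing by the tibble itself appears to produce a single value, 4144/219508, or about 0.0189, which then gets repeated on every row. One way around both problems, assuming dplyr and using a made-up `toy_counts` tibble in place of `tdfBaillie`, is to compute the total inside `mutate()`:

```r
library(dplyr)

# Hypothetical stand-in for tdfBaillie: words already counted and sorted.
toy_counts <- tibble(word = c("thou", "thee", "thy"),
                     n    = c(5L, 3L, 2L))

freq_by_rank <- toy_counts %>%
  mutate(rank = row_number(),
         total = sum(n),                  # scalar, recycled to every row
         `term frequency` = n / total)    # now varies row by row

freq_by_rank
```

With a single corpus there is no grouping, so `sum(n)` is the corpus-wide total. Keeping the existing `total_words` tibble and dividing by `total_words$total` (the number, not the tibble) should give the same result.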



