Error: `by` must be supplied when `x` and `y` have no common variables. i use `by = character()` to perform a cross-join.

Dear Learned Community,

I am very new to R and the tidyverse, so I beg your pardon for what may be a very basic question. I am trying to conduct a tf-idf analysis using TidyTools--more specifically, using Text Mining With R. But I am running into a problem early on.

Here's the relevant code from Chapter 3 of "Text Mining":

library(dplyr)
library(janeaustenr)
library(tidytext)

book_words <- austen_books() %>%
  unnest_tokens(word, text) %>%
  count(book, word, sort = TRUE)

total_words <- book_words %>%
  group_by(book) %>%
  summarize(total = sum(n))

book_words <- left_join(book_words, total_words)

I adapted it as follows, with 'stoppedwords.Baillie' being one of the somewhat cleaned-up corpora. I removed the code for 'books,' since the Jane Austen library has all of her separate novels and I have no need at this point to split Baillie into separate plays (and the corpus is not structured with those differences marked, I don't believe).

First step:

From:
book_words <- austen_books() %>%
  unnest_tokens(word, text) %>%
  count(book, word, sort = TRUE)

To:
tdfBaillie <- stoppedwords.Baillie %>%
  count(word, sort = TRUE)

This does return a tibble that looks right: words ranked by frequency.

Second Step:

From:
total_words <- book_words %>%
  group_by(book) %>%
  summarize(total = sum(n))

To:
total_words <- tdfBaillie %>%
  summarize(total = sum(n))

This also LOOKS like it may be right, returning a tibble of one row with a total of 219508 words. But then I run into trouble.

Third Step:
From: book_words <- left_join(book_words, total_words)

To: book_words <- left_join(tdfBaillie, total_words)

This returns the following error:
Error: `by` must be supplied when `x` and `y` have no common variables.
i use `by = character()` to perform a cross-join.

I'm not sure what's gone wrong here. Before trying to integrate "by=character()," which I don't know how to do in any case, I need to understand why there seem to be no common variables between x (tdfBaillie) and y (total_words), since the latter is built on the former.

Grateful for any help!

Sincerely,
Steve Newman

It's difficult to be sure because I can't reproduce without having the Baillie data, but are you sure they actually have common variables?
count(word, sort = TRUE) gives you a dataframe with variables word and n.
summarise(total = sum(n)) gives you your one row tibble that you mention, that presumably has one variable total.
So there's nothing in common, I think.

In any case, if you want to do a cartesian (cross) join, isn't it normal that they have no variables in common?

But why would you want to have a column where every value says 219508? [Or have I misunderstood this?]
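
The point about column names can be checked directly. The following is only a minimal sketch: `toy_words` is made-up data standing in for stoppedwords.Baillie, which is not available here, and the other `toy_` names are likewise hypothetical. It compares the column names of the two tables and then shows one way to get a `total` column without a join, assuming the whole corpus is treated as a single document.

```r
library(dplyr)

# Made-up tokens standing in for stoppedwords.Baillie (one word per row).
toy_words <- tibble(word = c("thou", "thee", "thou", "thy", "thou", "thee"))

toy_counts <- toy_words %>% count(word, sort = TRUE)    # columns: word, n
toy_total  <- toy_counts %>% summarize(total = sum(n))  # column: total

# No shared column names, so left_join() has nothing to match on.
shared <- intersect(names(toy_counts), names(toy_total))
print(shared)

# With a single document, the total can be added without any join.
toy_with_total <- toy_counts %>% mutate(total = sum(n))
print(toy_with_total)
```

In the book, the join matches on `book`, which is why each novel gets its own total. With no grouping column there is only one total, so `mutate()` should give the same column the cross-join would, and that column is what the later term-frequency step divides by.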

Dear David,

Thank you very much for this helpful reply.

I see now that x (tdfBaillie) and y (total_words), are different in kind. And, if I'm understanding this, there's no point in left-joining, since that move in Text Mining with R is designed to show the different word frequencies in individual Austen novels. BUT: The question then is how I might get to the next step, which is to calculate the term frequency.

Text Mining with R codes it this way:

freq_by_rank <- book_words %>% 
  group_by(book) %>% 
  mutate(rank = row_number(), 
         `term frequency` = n/total) %>%
  ungroup()

I adapted it as follows:

freq_by_rank <- tdfBaillie %>%
  mutate(rank = row_number(), 
         `term frequency` = n/total_words) %>%
  ungroup()

I changed 'total' to 'total_words,' because I get an error when I use 'total' [Error: object 'total' not found]. This surprises me because I thought I had defined 'total' above as 'total=sum(n)'. This might be a sign that I have something wrong here. And that's proven out by the result of the code above, which is:

# A tibble: 17,896 x 4
   word      n  rank `term frequency`$t~
   <chr> <int> <int>               <dbl>
 1 thou   4144     1              0.0189
 2 thee   2105     2              0.0189
 3 thy    2051     3              0.0189
 4 sir    1349     4              0.0189
 5 enter  1293     5              0.0189
 6 lady   1188     6              0.0189
 7 hand    865     7              0.0189
 8 lord    838     8              0.0189
 9 art     834     9              0.0189
10 dear    758    10              0.0189
# ... with 17,886 more rows

At first, I hoped that perhaps the difference among these top 10 was too slight to register in frequency--grasping at straws. But when I printed out to 200, it's still the same frequency. So that can't be right. But I don't know where I'm going awry here.

Thanks again for any further help.

Sincerely,

Steve
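
A likely explanation for the constant 0.0189, offered as a sketch rather than a confirmed diagnosis: `total_words` is a one-row tibble, not a bare number, so `n / total_words` appears to yield a single value (4144 / 219508 is about 0.0189) that is then repeated down the column. That would also fit the `$t~` in the column header. `total` is "not found" because it is a column inside `total_words`, not a column of `tdfBaillie`. The example below uses the top three rows of the output above as toy data, so its total covers only those three words; `toy_counts`, `toy_total`, and `total_n` are hypothetical names.

```r
library(dplyr)

# Toy stand-in for tdfBaillie: the top three rows from the output above.
toy_counts <- tibble(word = c("thou", "thee", "thy"),
                     n = c(4144L, 2105L, 2051L))
toy_total <- toy_counts %>% summarize(total = sum(n))

# The summary is a one-row tibble, not a plain number.
print(is.data.frame(toy_total))

# Pull the single number out of the tibble before dividing.
total_n <- toy_total$total

freq_by_rank <- toy_counts %>%
  mutate(rank = row_number(),
         `term frequency` = n / total_n)
print(freq_by_rank)
```

Writing `n / sum(n)` inside `mutate()` should give the same result without a separate `total_words` object. The `ungroup()` call from the book can be dropped here because nothing was grouped.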
