Can not use anti join

yeahno · March 18, 2021, 2:26pm

Hello everyone!

I'm trying to use RStudio to analyze the reviews of a certain product.

I have already broken down the sentences into words in R, so each word of a sentence is in a separate column right now.

The problem is that it is filled with stop words (of, in, a, etc.), so I tried to use anti_join to get rid of them.

Here is how I tried to put the useful words of the reviews into a new dataset:

Cleanwords <- reviews %>% anti_join(my_stop_words)

When I tried to do this, I got this error message:

Error: `by` must be supplied when `x` and `y` have no common variables.
i use `by = character()` to perform a cross-join.

So I tried it like this:

Cleanwords <- reviews %>% anti_join(my_stop_words, by =c("word", "stopword"))

But then I got another error message:

Error: Join columns must be present in data.
x Problem with `stopword`.

Can you help me with this?

Thanks in advance!

yeahno · March 18, 2021, 2:28pm

Sorry, at the beginning I meant to say "each word of a sentence is in a separate ROW right now."

FJCC · March 18, 2021, 3:04pm

Here is an example that I think fits your case.

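library(dplyr)

reviews <- data.frame(word = c("bread", "of", "toast", "in"))
stop_words <- data.frame(stopword = c("of", "in"))
# The named vector matches `word` in reviews against `stopword` in stop_words
anti_join(reviews, stop_words, by = c(word = "stopword"))
#>    word
#> 1 bread
#> 2 toast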
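
A note on why the named form matters: an unnamed `by = c("word", "stopword")` asks dplyr to find columns called `word` and `stopword` in both tables, whereas the named vector `c(word = "stopword")` pairs `word` on the left with `stopword` on the right. A minimal sketch, assuming dplyr is installed; the toy data and the name `my_stop_words` are stand-ins, not your real data:

```r
library(dplyr)

# Toy data; the column names mirror the ones used in this thread.
reviews <- data.frame(word = c("bread", "of", "toast", "in"),
                      stringsAsFactors = FALSE)
my_stop_words <- data.frame(stopword = c("of", "in"),
                            stringsAsFactors = FALSE)

# Named `by`: left-hand column `word` is matched against right-hand `stopword`.
cleaned <- anti_join(reviews, my_stop_words, by = c(word = "stopword"))

# Unnamed `by`: both names must exist in both tables, so this call errors.
attempt <- try(
  anti_join(reviews, my_stop_words, by = c("word", "stopword")),
  silent = TRUE
)
failed <- inherits(attempt, "try-error")
```

Here `failed` records that the unnamed call raised an error, which is the same situation as the second error message in the original question.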

yeahno · March 18, 2021, 5:01pm

Thank you for your advice!

My problem, however, is that my "reviews" data frame has more than 4000 words, and my "my_stop_words" data frame has about 150 words.

With the method you mentioned, would I have to enumerate all the words manually?

FJCC · March 18, 2021, 6:05pm

If you already have data frames with the words, there is no need to make new ones. I only built mine to have something to work with. If you cannot get the method to work, please post a small sample of your data. You can use the output of the command

dput(head(reviews))

Please place lines containing only three backticks just before and after the pasted output, like this:
```
your output here
```
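
To find out which column name to put in the join, inspecting the stop-word data frame is usually enough. A rough sketch in base R; `my_stop_words` below is a hypothetical stand-in, and the real column name may well differ:

```r
# Hypothetical stand-in for the real stop-word table; the actual column
# name may differ, which is exactly what names() reveals.
my_stop_words <- data.frame(stopword = c("of", "in", "a"),
                            stringsAsFactors = FALSE)

names(my_stop_words)   # column names available for joining
str(my_stop_words)     # column types and a preview of the values

# dput() prints R code that rebuilds the object, so others can copy it.
dput(head(my_stop_words))

# Round trip: evaluating the text that dput() printed recreates the data.
dput_text <- paste(capture.output(dput(head(my_stop_words))), collapse = "\n")
rebuilt <- eval(parse(text = dput_text))
```

The round trip at the end is only there to show why `dput()` output is convenient to paste into a post: anyone can run it and get the same object back.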

yeahno · March 18, 2021, 6:10pm


structure(list(word = c("would", "be", "nice", "to", "have", 
"a")), row.names = c(NA, 6L), class = c("tbl_df", "tbl", "data.frame"))

like this?

FJCC · March 18, 2021, 7:27pm

Using the data you posted and a stop_words data frame that I made by hand, I can filter out the words "to" and "a" with this code. I believe you have a stop_words data frame already, so you should be able to use it, changing the column name in the anti_join from stopword to whatever it is in your data frame.

library(dplyr)

stop_words <- data.frame(stopword = c("to", "a"))

review <- structure(list(word = c("would", "be", "nice", "to", "have",
                         "a")), row.names = c(NA, 6L), class = c("tbl_df", "tbl", "data.frame"))
# Keep only the rows of review whose word has no match in stop_words$stopword
anti_join(review, stop_words, by = c(word = "stopword"))
#> # A tibble: 4 x 1
#>   word
#>   <chr>
#> 1 would
#> 2 be
#> 3 nice
#> 4 have
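
If the named `by` feels awkward, there are two other routes that should give the same rows. This is only a sketch, assuming dplyr is installed and that the stop-word column is called `stopword` (swap in the real name):

```r
library(dplyr)

review <- tibble(word = c("would", "be", "nice", "to", "have", "a"))
stop_words <- tibble(stopword = c("to", "a"))

# Option 1: rename so both tables share a `word` column, then join on it.
via_rename <- review %>%
  anti_join(rename(stop_words, word = stopword), by = "word")

# Option 2: skip the join and filter on membership directly.
via_filter <- review %>%
  filter(!word %in% stop_words$stopword)
```

Both keep the rows of `review` whose `word` does not appear among the stop words; which one reads better is mostly a matter of taste.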

yeahno · March 18, 2021, 8:16pm

Thanks a lot, I think that worked!

My next step would be analyzing the text by the first 2 words or first 3 words in a sentence, just to get a clear picture of the nature of these reviews (sad, disappointed, satisfied, etc).

I'll try to do that on my own tomorrow, but I have a feeling that I'll be back with another question..

rwalker · March 18, 2021, 8:53pm

If you "think that worked", perhaps you would consider verifying this and marking it as solved as a courtesy to the community.

fredoxvii · March 19, 2021, 2:23am

Hello, welcome to the community!

For something like this you should consider the tidytext package.

cleaned_books <- tidy_books %>% anti_join(get_stopwords())
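
The `tidy_books` object in that line comes from the tidytext documentation, not from this thread. Here is a self-contained sketch of the same idea, assuming the dplyr, tidytext and stopwords packages are installed; the two review sentences and the `id` and `text` column names are made up:

```r
library(dplyr)
library(tidytext)

# Made-up reviews, one sentence per row.
raw_reviews <- tibble(
  id   = 1:2,
  text = c("It would be nice to have a manual", "The toast is in the box")
)

# unnest_tokens() splits each sentence into one lowercase word per row,
# stored in a new column named `word`.
review_words <- raw_reviews %>%
  unnest_tokens(word, text)

# get_stopwords() returns a tibble with a `word` column, so the join
# can use by = "word" without any renaming.
clean_words <- review_words %>%
  anti_join(get_stopwords(), by = "word")
```

Because both tables already have a column called `word`, the "no common variables" error from the original question does not come up here.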
