Anti_join function not working

I am trying to remove stop words from a data frame using the anti_join function. The code runs, but it isn't doing what it's supposed to do, and I'm not sure it is doing anything at all. Any help is greatly appreciated.

joined <- anti_join(data, stop_words, by = "word")

and I get the output:

joined
# A tibble: 13,005 x 1
   word                                                                                                                          
   <chr>                                                                                                                         
 1 "c(\"how\", \"did\", \"everyone\", \"feel\", \"about\", \"the\", \"climate\", \"change\", \"question\", \"last\", \"night\", …
 2 "c(\"didnt\", \"catch\", \"the\", \"full\")"                                                                                  
 3 "c(\"no\", \"mention\", \"of\", \"tamir\", \"rice\", \"and\", \"the\")"                                                       
 4 "c(\"that\", \"carly\", \"fiorina\", \"is\", \"trendinghours\", \"after\", \"her\", \"debateabove\", \"any\", \"of\", \"the\"…
 5 "c(\"on\", \"my\", \"first\", \"day\", \"i\", \"will\", \"rescind\", \"every\", \"illegal\", \"executive\", \"action\", \"tak…
 6 "c(\"i\", \"liked\", \"her\", \"and\", \"was\", \"happy\", \"when\", \"i\", \"heard\", \"she\", \"was\", \"going\", \"to\", \…
 7 "c(\"going\", \"on\")"                                                                                                        
 8 "c(\"deer\", \"in\", \"the\", \"headlightsben\", \"carson\", \"may\", \"be\", \"the\", \"only\", \"brain\", \"surgeon\", \"wh…
 9 "c(\"last\", \"nights\", \"debate\", \"proved\", \"it\")"                                                                     
10 "c(\"in\", \"all\", \"fairness\")"                                                                                            
# … with 12,995 more rows

The stop words from the 'stop_words' data frame have not been taken out of the 'data' data frame.


Hi,

First, a reminder to use a full reprex whenever possible. See the FAQ: What's a reproducible example (`reprex`) and how do I do one?

Second, there are natural language processing tools to handle this job. One of the best places to start is the tidytext package and its free online book, which includes tools for just this kind of task.
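As a rough illustration of that workflow, here is a minimal sketch that assumes you still have the raw text, one document per row. The `tweets` tibble and its `id` and `text` columns are made up for this example; `unnest_tokens()` and the `stop_words` data set come from tidytext.

```r
library(dplyr)
library(tidytext)

# Hypothetical raw input: one row per document, text not yet split into words
tweets <- tibble(
  id   = 1:2,
  text = c("didnt catch the full", "going on")
)

# Split the text column into one lowercase word per row
tidy_words <- tweets %>%
  unnest_tokens(word, text)

# Drop every row whose word appears in tidytext's stop_words data set
cleaned <- tidy_words %>%
  anti_join(stop_words, by = "word")

print(cleaned)
```

Because `unnest_tokens()` produces one word per row, the `by = "word"` join has individual words to compare against the stop-word list.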

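Third, your data frame doesn't actually contain what you want. Each row holds a single string that spells out a c(...) call, rather than one word per row, so anti_join has no individual words to match against the stop words. It's a subtle difference.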

existing <- "c(\"how\", \"did\", \"everyone\", \"feel\", \"about\", \"the\", \"climate\", \"change\", \"question\", \"last\", \"night\")"
print(existing)
#> [1] "c(\"how\", \"did\", \"everyone\", \"feel\", \"about\", \"the\", \"climate\", \"change\", \"question\", \"last\", \"night\")"
needed <-  c("how", "did", "everyone", "feel", "about", "the", "climate", "change", "question", "last", "night")
existing == needed
#>  [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
print(needed)
#>  [1] "how"      "did"      "everyone" "feel"     "about"    "the"     
#>  [7] "climate"  "change"   "question" "last"     "night"

Created on 2020-02-23 by the reprex package (v0.3.0)

The difference is caused by making the c() function a part of the string, by enclosing everything in the outer pair of " characters. Once each word is its own element, anti_join behaves as expected:

suppressPackageStartupMessages(library(dplyr))
# A proper character vector: one word per element
needed <- c("how", "did", "everyone", "feel", "about", "the", "climate", "change", "question", "last", "night")
stops <- c("the")
# anti_join() works on data frames, so convert both vectors
needed <- as.data.frame(needed)
stops <- as.data.frame(stops)
# Give both data frames the same column name to join on
colnames(needed) <- "word"
colnames(stops) <- "word"
# Keep only the rows of `needed` that have no match in `stops`
anti_join(needed, stops)
#> Joining, by = "word"
#> Warning: Column `word` joining factors with different levels, coercing to
#> character vector
#>        word
#> 1       how
#> 2       did
#> 3  everyone
#> 4      feel
#> 5     about
#> 6   climate
#> 7    change
#> 8  question
#> 9      last
#> 10    night

Created on 2020-02-23 by the reprex package (v0.3.0)
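If you can't easily rebuild the column from the original text, one possible way to recover the words from the stringified c(...) values is sketched below. It rests on assumptions: `data` here is a two-row stand-in for your tibble, `stops` is a tiny illustrative stop-word list (you would use `stop_words` instead), and the regular expression assumes no word contains a double quote.

```r
library(dplyr)

# Hypothetical stand-in for the `data` tibble in the question:
# each row is one string containing the text of a c(...) call
data <- tibble(
  word = c(
    "c(\"didnt\", \"catch\", \"the\", \"full\")",
    "c(\"going\", \"on\")"
  )
)

# Small illustrative stop-word list (an assumption for this sketch)
stops <- tibble(word = c("the", "on"))

# Extract every run of characters enclosed in double quotes
quoted <- regmatches(data$word, gregexpr("\"[^\"]*\"", data$word))

# Strip the quote characters and flatten to one word per row
words <- tibble(word = gsub("\"", "", unlist(quoted), fixed = TRUE))

# Now the join compares individual words, so stop words are removed
cleaned <- anti_join(words, stops, by = "word")
print(cleaned)
```

This flattens all rows into a single column of words. If you need to know which row each word came from, you would have to carry an identifier along, which this sketch does not do.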

