How can I filter rows based on a text column containing certain keywords?

I streamed tweets from Twitter for about 2 hours and I have the tweets in a data frame. Now I want to filter the data frame down to the tweets that contain certain keywords like 'coronavirus', but R returns a 0 x 3 tibble.

library(dplyr)

# sample data frame
tweet <- c("I was tested for coronavirus today", "my covid-19 test came out negative")
is_retweet <- c("TRUE", "FALSE")
is_quote <- c("FALSE", "FALSE")

df <- data.frame(tweet, is_retweet, is_quote)

And here is how I am trying to filter for the rows where the "tweet" column contains keywords like covid or corona:

filter_df <- df
filter_df %>%
  select(tweet, is_quote, is_retweet) %>%
  filter(tweet %in% c("covid", "covid-19", "face mask", "pandemic", "coronavirus", "virus"))

Hi,

The %in% operator looks for exact matches, so it only catches rows where the whole tweet is exactly one of those keywords. I would suggest something like the following, which uses the function str_detect to match a keyword anywhere in the text. I also added a tweet that has none of the keywords; as you can see, it is no longer included in the result.

library(tidyverse)

tweet <- c("I was tested for coronavirus today", "my covid-19 test came out negative", "no key words")
is_retweet <- c("TRUE", "FALSE", "FALSE")
is_quote <- c("FALSE", "FALSE", "FALSE")

df <- data.frame(tweet, is_retweet, is_quote)

filter_df <- df %>%
  select(tweet, is_quote, is_retweet) %>% 
  filter(str_detect(tweet, "covid|covid-19|face mask|pandemic|coronavirus|virus"))
filter_df
#>                                tweet is_quote is_retweet
#> 1 I was tested for coronavirus today    FALSE       TRUE
#> 2 my covid-19 test came out negative    FALSE      FALSE

Created on 2020-06-01 by the reprex package (v0.3.0)
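
If the keyword list grows, or the tweets mix upper and lower case, it may be easier to keep the keywords in a vector and build the pattern from it. The following is only a sketch under a few assumptions: the `keywords` vector and the mixed-case sample tweets are made up for illustration, and the keywords are assumed to contain no regex special characters (those would need escaping first).

```r
library(dplyr)
library(stringr)

# Illustrative data with mixed-case text (not from the original thread)
df <- data.frame(
  tweet = c("I was tested for Coronavirus today",
            "my COVID-19 test came out negative",
            "no key words"),
  is_retweet = c("TRUE", "FALSE", "FALSE"),
  is_quote = c("FALSE", "FALSE", "FALSE"),
  stringsAsFactors = FALSE
)

# Hypothetical keyword vector; "covid" already covers "covid-19",
# and "virus" already covers "coronavirus"
keywords <- c("covid", "face mask", "pandemic", "virus")

# Collapse the keywords into one alternation pattern,
# "covid|face mask|pandemic|virus", and ignore case when matching
pattern <- regex(paste(keywords, collapse = "|"), ignore_case = TRUE)

# Keep only the rows whose tweet matches at least one keyword
filter_df <- df %>%
  filter(str_detect(tweet, pattern))

filter_df$tweet
```

With this sample data the first two tweets are kept and "no key words" is dropped. Wrapping the call as `!str_detect(tweet, pattern)` (or passing `negate = TRUE`) would do the opposite and remove the matching tweets instead.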

This works! Thanks @StatSteph.
