How to filter large dataframe to select for specific rows?

Context: I am trying to filter a data frame to keep only the rows whose "lemma" value is one of the words below.

method_matches <- matches %>% filter(lemma==c("analysis", "analyze", "assess", "assessment", "assign", "author", "autobiographical", "background", "base", "baseline", "cross", "construct", "construction", "correlation", "design", "development", "diagnose", "diagnosis", "diagnostic", "discuss", "discussion", "document", "documentation", "factor", "item", "measure", "measurement", "model", "modelling", "n", "personality", "result", "sample", "scale", "score", "structure", "student", "study", "use", "valid", "validate", "validity", "question", "questionnaire"))

Problem: R doesn't seem to like this. It produces the message below, and the data frame it returns is too small: it does not contain all of the matching rows. I am guessing this requires functions that deal with text data, but I don't have much experience with those, so any pointers would be helpful.
ERROR: "longer object length is not a multiple of shorter object length"

Example dataset:

col_name <- c("id", "year", "epoch", "lemma", "repeat")
col_value <- c(3, 1998, 5, "abandon", 1)

I am not sure what you want to do. Does the following code help you?

VALUES <- c("analysis", "analyze", "assess", "assessment", "assign", 
            "author", "autobiographical", "background", "base", 
            "baseline", "cross", "construct", "construction", 
            "correlation", "design", "development", "diagnose", 
            "diagnosis", "diagnostic", "discuss", "discussion", 
            "document", "documentation", "factor", "item", "measure", 
            "measurement", "model", "modelling", "n", "personality", 
            "result", "sample", "scale", "score", "structure", "student", 
            "study", "use", "valid", "validate", "validity", "question", 
            "questionnaire")
DF <- data.frame(id = 1:5, lemma = c("frog", "analyze", "moose", "score",
                                     "question"))
DF
#>   id    lemma
#> 1  1     frog
#> 2  2  analyze
#> 3  3    moose
#> 4  4    score
#> 5  5 question
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
DF |> filter(lemma %in% VALUES)
#>   id    lemma
#> 1  2  analyze
#> 2  4    score
#> 3  5 question

Created on 2022-06-09 by the reprex package (v2.0.1)

Thank you and sorry for not being clear. Following on from your post, I have attempted to convey the idea below.

I am trying to filter a large dataset of ~40k rows to keep only the rows whose value in the lemma column appears in VALUES. But when I use the filter function, it fails to find all the rows that match VALUES.

filtered_df <- DF %>% filter(lemma==(values in VALUES, but not sure if this would work!) 

It should work if you change the == to %in% as I did in my example.

library(dplyr)
filtered_df <- DF %>% filter(lemma  %in% VALUES)

Thank you!

What does %in% mean?
In this case, does it mean: filter the lemma column for all values within VALUES?

The %in% operator checks whether each item on the left is one of the elements of the vector on the right. With a single item on the left, it returns a single TRUE/FALSE.

"A" %in% c("B", "C", "A", "E")
[1] TRUE

If you use ==, you will get a vector of TRUE/FALSE values comparing the item on the left to each element of the vector.

"A" == c("B", "C", "A", "E")
[1] FALSE FALSE  TRUE FALSE

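For the filter() function, you want one TRUE/FALSE per row: is that row's value of lemma one of the elements of VALUES? With ==, R instead compares lemma and VALUES position by position, recycling the shorter vector, which is why you got the "longer object length" message and too few rows.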
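
To make the difference concrete, here is a minimal sketch in base R. The short lemma and VALUES vectors are made-up stand-ins for the real data, not taken from the 40k-row dataset, and suppressWarnings() is only there to keep the output quiet.

```r
# Made-up stand-ins for the lemma column and the word list.
lemma  <- c("frog", "analyze", "moose", "score", "question")
VALUES <- c("analyze", "score", "question")

# `==` pairs elements by position and recycles the shorter vector:
# lemma[1] vs VALUES[1], lemma[2] vs VALUES[2], lemma[3] vs VALUES[3],
# lemma[4] vs VALUES[1], lemma[5] vs VALUES[2].
# Without suppressWarnings(), R emits the "longer object length" warning here.
eq_result <- suppressWarnings(lemma == VALUES)

# `%in%` asks, for each element of lemma, whether it occurs anywhere in VALUES.
in_result <- lemma %in% VALUES

print(eq_result)  # all FALSE: no word happens to sit at a matching position
print(in_result)  # FALSE TRUE FALSE TRUE TRUE
```

Both results have one entry per element of lemma, but only the %in% result marks the rows you actually want to keep.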

Thank you for this detailed explanation!

Now I am wondering: if I want to exclude all the rows whose lemma appears in this vector of words (VALUES), can %in% be altered?

For example, == can be altered to !=. Is there an equivalent for %in%, or another function to use? My attempt is below [modelled on filter(lemma != VALUES)], but it does not work.

library(dplyr)
filtered_df2 <- DF %>% filter(lemma !%in% VALUES)

You need to put the ! at the left end of the whole comparison to invert the value that is returned. In your case that would be filter(!lemma %in% VALUES).

"A" %in% c("B", "C", "A", "E")
[1] TRUE

!"A" %in% c("B", "C", "A", "E")
[1] FALSE
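
The same negation applied to a whole column can be sketched as follows. This uses base R indexing so that it runs without any packages; DF and VALUES are small made-up stand-ins, and excluded_df is a hypothetical name.

```r
# Made-up stand-ins for the data frame and the word list.
DF <- data.frame(id = 1:5,
                 lemma = c("frog", "analyze", "moose", "score", "question"))
VALUES <- c("analyze", "score", "question")

# TRUE for every row whose lemma is NOT one of VALUES.
keep <- !(DF$lemma %in% VALUES)

# Keep only those rows.
excluded_df <- DF[keep, ]
print(excluded_df)  # the "frog" and "moose" rows remain
```

In a dplyr pipeline the equivalent should be DF %>% filter(!lemma %in% VALUES). The parentheses are optional there because %in% is evaluated before the leading !, but writing !(lemma %in% VALUES) makes the intent easier to read.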