Filter sentences with two words

Rodri2095 · October 8, 2022, 10:38pm

Hello everyone

I have a large data frame with information on species (location, registration date, kingdom, class, order, genus, etc.). The column that interests me is "scientific name".

I would like to filter this column to keep only the values that have two words, and so get rid of the rows that have only one word in it (which would mean that I do not have the complete scientific name).

I would appreciate it if you could help me with this. Thank you very much, and greetings.

hichammoadsafhi · October 9, 2022, 1:14am

Hi. So the words are separated by spaces, right?
We can use str_count from the stringr package to count the words, then keep only the rows whose count is greater than 1.
For example:

library(dplyr)
library(stringr)
# "\\S+" matches each run of non-whitespace characters, i.e. each word
myFilteredDF <- myDF %>% filter(str_count(`scientific name`, "\\S+") > 1)

Without using dplyr:

# which() drops NA counts, so missing names do not come back as all-NA rows
myFilteredDF <- myDF[which(str_count(myDF$`scientific name`, "\\S+") > 1), ]
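
To check by hand what this kind of filter keeps, here is a minimal sketch on made-up data. The toy values and the helper count_words are hypothetical, and the helper uses base R only as a stand-in for the stringr call, so it needs no packages. It assumes words are separated by whitespace.

```r
# Hypothetical toy data; only the column name comes from this thread.
myDF <- data.frame(
  `scientific name` = c("Melanogrammus aeglefinus", "Gadus",
                        "  Salmo   salar ", NA),
  check.names = FALSE,        # keep the space in the column name
  stringsAsFactors = FALSE
)

# Hypothetical helper: number of whitespace-separated words per element.
count_words <- function(x) {
  n <- lengths(strsplit(trimws(x), "\\s+"))
  n[is.na(x)] <- NA_integer_  # a missing name has no word count
  n
}

word_count <- count_words(myDF$`scientific name`)

# which() ignores NA, so rows with a missing name are dropped
myFilteredDF <- myDF[which(word_count > 1), , drop = FALSE]
print(myFilteredDF)
```

Only the two-word names should survive. Extra spaces around or between the words do not change the count.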

technocrat · October 9, 2022, 4:12am

How well it works to classify names by word count (two words vs. one) depends on the data.

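It will work well enough with the pair Melanogrammus aeglefinus and Haddock, but not with Lycaeides melissa samuelis and Karner Blue: there the common name also has two words, so a word count cannot tell it apart from a scientific name.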
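
If common names can end up in the column, one possible workaround is to test the shape of the name instead of only counting words. The sketch below is not from the thread. It assumes scientific names are written as a capitalised genus followed by one or two lowercase epithets, and "Canis lupus familiaris" is added only as an extra three-word example. Hyphenated epithets, abbreviations such as "sp." and author citations would need a looser pattern.

```r
# Hypothetical sample values for illustration
sci_names <- c("Melanogrammus aeglefinus", "Haddock",
               "Canis lupus familiaris", "Karner Blue")

# Plain word count: more than one whitespace-separated word
has_two_plus_words <- lengths(strsplit(sci_names, "\\s+")) > 1

# Assumed convention: Genus + 1 or 2 lowercase epithets, single spaces
binomial_like <- grepl("^[[:upper:]][[:lower:]]+( [[:lower:]]+){1,2}$",
                       sci_names)

print(data.frame(sci_names, has_two_plus_words, binomial_like))
```

The word count accepts "Karner Blue", while the pattern rejects it because the second word is capitalised.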

hichammoadsafhi · October 9, 2022, 4:54pm

Can you share a data sample and the expected output?

Rodri2095 · October 16, 2022, 12:16am

Hey Hicham

Actually, your answer was all I needed. It worked perfectly.

Thanks a lot
