How to filter sentences with two or more words in R

fiorepalombina · June 17, 2021, 5:09pm

Dear community,
I have a data frame that contains text data in one column.
I would like to know whether there are functions I could use to keep only the observations that have two or more words, so that all observations with just one word are removed.

Thanks

Below are some examples from my dataset.

[1] "acqua valmora residuo fisso" "acquisto materiale per ufficio on line"
[3] "agenda 2021 giornaliera" "agenda settimanale 2021"
[5] "agende 2021" "agende giornaliere 2021"
[7] "agende settimanali 2021" "armadio metallico"
[9] "barriere scrivania" "bicchieri plastica caffè"
[11] "bio bottle" "bioform"

So in this case I would like to remove the last string, "bioform".

mhakanda · June 17, 2021, 5:40pm

You can use unnest_tokens to split the text into one word per row, then group_by and filter to keep only the rows that had more than one word.

library(tidyverse)
library(tidytext)
df # only the 12th row will be removed
#>                                     word1
#> 1             acqua valmora residuo fisso
#> 2  acquisto materiale per ufficio on line
#> 3                 agenda 2021 giornaliera
#> 4                 agenda settimanale 2021
#> 5                             agende 2021
#> 6                 agende giornaliere 2021
#> 7                 agende settimanali 2021
#> 8                       armadio metallico
#> 9                      barriere scrivania
#> 10               bicchieri plastica caffè
#> 11                             bio bottle
#> 12                                bioform

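df %>%
  mutate(id = row_number()) %>%   # tag each row so its words can be regrouped
  unnest_tokens(word, word1) %>%  # one word per row (lowercases text by default)
  group_by(id) %>%
  filter(n() > 1) %>%             # keep ids that have more than one word
  summarise(updated_word = paste0(word, collapse = " ")) %>%
  select(-id)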
#> # A tibble: 11 x 1
#>    updated_word                          
#>    <chr>                                 
#>  1 acqua valmora residuo fisso           
#>  2 acquisto materiale per ufficio on line
#>  3 agenda 2021 giornaliera               
#>  4 agenda settimanale 2021               
#>  5 agende 2021                           
#>  6 agende giornaliere 2021               
#>  7 agende settimanali 2021               
#>  8 armadio metallico                     
#>  9 barriere scrivania                    
#> 10 bicchieri plastica caffè              
#> 11 bio bottle
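
One thing to keep in mind with the approach above: unnest_tokens lowercases the text and strips punctuation by default, so the rebuilt strings may not match the originals exactly. If the original rows need to stay untouched, a possible alternative is to count the words in place and filter on that count. This is only a minimal sketch: it assumes the column is called word1 as in the example above, uses a shortened copy of the sample data, and treats any run of non-whitespace characters as a word.

```r
library(dplyr)
library(stringr)

# Shortened sample data from the question; word1 is the column name used above.
df <- data.frame(
  word1 = c("acqua valmora residuo fisso", "agende 2021", "bio bottle", "bioform"),
  stringsAsFactors = FALSE
)

# Count runs of non-whitespace characters and keep rows with at least two.
multi_word <- df %>%
  filter(str_count(word1, "\\S+") >= 2)

multi_word
```

Because the text itself is never split and pasted back together, any other columns in the data frame are kept as they are.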

arthur.t · June 17, 2021, 6:29pm

library(tidyverse)
# Split each string on spaces and count the resulting words
c("three little words", "two words", "one") %>% strsplit(" ") %>% map_dbl(length)
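
The line above only returns the word counts, so one more step is needed to actually drop the one-word strings. A rough base R sketch of that step follows. The variable names x, n_words and kept are just illustrative, and splitting on "\\s+" after trimws is an assumption to cope with repeated or leading spaces.

```r
x <- c("three little words", "two words", "one")

# Split on runs of whitespace, then count the pieces in each element.
n_words <- lengths(strsplit(trimws(x), "\\s+"))

# Keep only the strings with two or more words.
kept <- x[n_words >= 2]
kept
```

The same logical vector, n_words >= 2, could also be used to subset the rows of a data frame instead of a plain character vector.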