problem extracting matching sentence from column in data frame

ammarhm · May 25, 2019, 1:18am

Quick question.
I am using the following code to extract matching sentences from a column in a data frame:

Data$Prep <- grep("the preparation was", unlist(strsplit(Data$REPORT_TEXT, '(?<=\\.)\\s+', perl=TRUE)), value=TRUE, ignore.case = TRUE)

The problem is that not all the rows in the data frame column contain the matching pattern, so the returned vector is shorter than the data frame itself, resulting in an error:
Error in $<-.data.frame(*tmp*, Prep, value = c(4L, 22L, 41L, 67L, :
replacement has 685 rows, data has 700

Is there a way to avoid this? Is there a way to return an empty string or NA when the searched string doesn't contain the matching words?
Thank you

technocrat · May 25, 2019, 2:54am

There's a tidy way to do this. I made a toy example to illustrate.

library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library(stringr)
library(tibble)
pattern <- "the preparation was"
phrases <- c("the preparation was", "the outcome will be")
my.df <- enframe(phrases)
colnames(my.df) <- c("index", "phrase")
my.df <- my.df %>% select(-index)
my.df <- my.df %>% filter(phrase == pattern)
my.df
#> # A tibble: 1 x 1
#>   phrase             
#>   <chr>              
#> 1 the preparation was

Created on 2019-05-24 by the reprex package (v0.3.0)

ammarhm · May 25, 2019, 3:11am

Thank you for taking the time to reply.
I would still prefer to use the regex approach above, though, if possible.

ammarhm · May 25, 2019, 3:14am

In Python, the way I would do this is the following:

try:
  the function above, i.e. something similar to Data$Prep <- grep("the preparation was", unlist(strsplit(Data$REPORT_TEXT, '(?<=\\.)\\s+', perl=TRUE)), value=TRUE, ignore.case = TRUE)
except:
  something to return NA in case the function above raises an error because there is no matching text

Is there a way to do this in R?
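
For reference, base R's closest analogue to try/except is tryCatch(). The sketch below is only an illustration of that pattern, not a tested solution for the real data: safe_first_match() is a hypothetical helper name, and keeping only the first matching sentence is an assumption. Note that grep() does not raise an error when nothing matches, it returns a zero-length vector, so the sketch raises the error itself to mirror the try/except shape.

```r
# Hypothetical helper: return the first matching sentence of `text`,
# or NA if anything inside the tryCatch() block signals an error.
safe_first_match <- function(text, pattern) {
  tryCatch({
    # Same split as in the original call: break after each full stop.
    sentences <- unlist(strsplit(text, "(?<=\\.)\\s+", perl = TRUE))
    hits <- grep(pattern, sentences, value = TRUE, ignore.case = TRUE)
    # grep() returns character(0) on no match, so signal the error explicitly.
    if (length(hits) == 0) stop("no matching sentence")
    hits[1]
  }, error = function(e) NA_character_)
}

# Toy input (made up for illustration), one element per data frame row.
reports <- c("Intro. The preparation was good. End.", "Nothing here.")
prep <- vapply(reports, safe_first_match, character(1),
               pattern = "the preparation was", USE.NAMES = FALSE)
```

Because the helper is applied once per element, prep always has the same length as the input, which is what the assignment to a data frame column needs.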

andresrcs · May 25, 2019, 3:24am

You can do something like this.

library(dplyr)

iris %>%
  count(Species) %>% # Example data
  mutate(Prep = ifelse(grepl("virginica", Species), "virginica", NA))
#> # A tibble: 3 x 3
#>   Species        n Prep     
#>   <fct>      <int> <chr>    
#> 1 setosa        50 <NA>     
#> 2 versicolor    50 <NA>     
#> 3 virginica     50 virginica

It would be easier to help you if you could provide a minimal REPRoducible EXample (reprex).
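
The same "value or NA" idea can probably be applied to the sentence extraction from the question without unlist(), so that each row keeps its own result. This is a minimal sketch on made-up data: the contents of REPORT_TEXT are invented, and taking only the first matching sentence per row is an assumption about what is wanted.

```r
# Toy data standing in for the real data frame (contents are hypothetical).
Data <- data.frame(
  REPORT_TEXT = c(
    "Patient arrived. The preparation was adequate. No issues.",
    "Nothing relevant was noted.",
    "The Preparation Was poor. Repeat advised."
  ),
  stringsAsFactors = FALSE
)

# strsplit() returns a list with one character vector of sentences per row.
# Skipping unlist() keeps the link between sentences and rows.
sentences <- strsplit(Data$REPORT_TEXT, "(?<=\\.)\\s+", perl = TRUE)

# For each row, keep the first matching sentence, or NA when none matches.
Data$Prep <- vapply(sentences, function(s) {
  hits <- grep("the preparation was", s, value = TRUE, ignore.case = TRUE)
  if (length(hits) > 0) hits[1] else NA_character_
}, character(1))
```

The length mismatch in the original call comes from unlist(), which flattens all sentences from all rows into one vector before grep() filters it. Working on the list row by row returns exactly one value per row.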

technocrat · May 25, 2019, 4:41am

If "the preparation was" was just an example, rather than a literal, stringr supports regex. I'm not sure I follow your Python example, because your original example did include the search string and the problem you were trying to solve was to exclude the records without it. As @andresrcs suggests, a reproducible example, called a reprex, would be a great help.
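
As a rough illustration of the stringr route, str_extract() returns NA for elements with no match, so the result always lines up with the input. The sketch below assumes that a sentence is any run of text between full stops, which will break on abbreviations or decimals; the sample reports are made up.

```r
library(stringr)

# Made-up reports standing in for Data$REPORT_TEXT.
reports <- c(
  "Patient arrived. The preparation was adequate. No issues.",
  "Nothing relevant was noted.",
  "The Preparation Was poor. Repeat advised."
)

# Match the phrase plus the surrounding text up to the nearest full stops.
sentence_pattern <- regex("[^.]*the preparation was[^.]*\\.?", ignore_case = TRUE)

# str_extract() gives one result per input element (NA when nothing matches);
# str_trim() drops the space left over from the previous sentence boundary.
prep <- str_trim(str_extract(reports, sentence_pattern))
```

The result can be assigned directly to a column, for example inside dplyr::mutate(), since it has one element per row.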
