Problem extracting matching sentence from a column in a data frame

Quick question.
I am using the following code to extract matching sentences from a column in a data frame:

Data$Prep <- grep("the preparation was", unlist(strsplit(Data$REPORT_TEXT, '(?<=\\.)\\s+', perl=TRUE)), value=TRUE, ignore.case = TRUE)

The problem is that not all the rows in the data frame column contain the matching pattern, so the returned vector is shorter than the data frame itself, resulting in an error:
Error in $<-.data.frame(*tmp*, Prep, value = c(4L, 22L, 41L, 67L, :
replacement has 685 rows, data has 700

Is there a way to avoid this? Is there a way to return an empty string or NA when the searched string doesn't contain the matching words?
Thank you

There's a tidy way to do this. I made a toy example to illustrate:

library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library(stringr)
library(tibble)
pattern <- "the preparation was"
phrases <- c("the preparation was", "the outcome will be")
my.df <- enframe(phrases)
colnames(my.df) <- c("index", "phrase")
my.df <- my.df %>% select(-index)
my.df <- my.df %>% filter(phrase == pattern)
my.df
#> # A tibble: 1 x 1
#>   phrase             
#>   <chr>              
#> 1 the preparation was

Created on 2019-05-24 by the reprex package (v0.3.0)

Thank you for taking the time to reply.
I would still prefer to use the regex approach above, though, if possible.

In python, the way I would do this is the following:

try:
  the function above, i.e. something similar to Data$Prep <- grep("the preparation was", unlist(strsplit(Data$REPORT_TEXT, '(?<=\\.)\\s+', perl=TRUE)), value=TRUE, ignore.case = TRUE)
except:
  something to return NA in case the function above raises an error because there is no matching text

Is there a way to do this in R?
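
For reference, the closest base R counterpart to try/except is tryCatch(). Note that grep() does not raise an error when nothing matches (it returns a zero-length vector), and unlist(strsplit(...)) pools the sentences of all rows together, so the matches are no longer aligned with rows. The sketch below therefore works one row at a time. The toy data and the helper names extract_sentence() and safe_extract() are assumptions made up for illustration, not part of the original code.

```r
# Hypothetical toy data standing in for Data$REPORT_TEXT
Data <- data.frame(
  REPORT_TEXT = c(
    "Patient arrived at 9am. The preparation was good. No complications.",
    "Patient arrived at 10am. No complications.",
    "The preparation was poor. Procedure abandoned."
  ),
  stringsAsFactors = FALSE
)

# Hypothetical helper: first sentence of ONE report that matches `pattern`.
# It signals an error when there is no match, to mimic a call that can fail.
extract_sentence <- function(text, pattern) {
  sentences <- unlist(strsplit(text, "(?<=\\.)\\s+", perl = TRUE))
  hits <- grep(pattern, sentences, value = TRUE, ignore.case = TRUE)
  if (length(hits) == 0) stop("no matching sentence")
  hits[1]
}

# tryCatch() plays the role of try/except: on error, return NA instead
safe_extract <- function(text, pattern) {
  tryCatch(extract_sentence(text, pattern), error = function(e) NA_character_)
}

# vapply() guarantees exactly one character value per row
Data$Prep <- vapply(Data$REPORT_TEXT, safe_extract, character(1),
                    pattern = "the preparation was", USE.NAMES = FALSE)
Data$Prep
```

Because the helper returns exactly one value per row, Data$Prep has the same length as the data frame, with NA for the report that has no matching sentence. If a report can contain several matching sentences, only the first is kept here; that is a design choice of this sketch.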

You can do something like this.

library(dplyr)

iris %>%
  count(Species) %>% # Example data
  mutate(Prep = ifelse(grepl("virginica", Species), "virginica", NA))
#> # A tibble: 3 x 3
#>   Species        n Prep     
#>   <fct>      <int> <chr>    
#> 1 setosa        50 <NA>     
#> 2 versicolor    50 <NA>     
#> 3 virginica     50 virginica

It would be easier to help you if you could provide a minimal REPRoducible EXample (reprex).

If "the preparation was" was just an example rather than a literal, stringr supports regex. I'm not sure I follow your Python example, because your original example did include the search string and the problem you were trying to solve was to exclude the records without it. As @andresrcs suggests, a reproducible example, called a reprex, would be a great help.
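
As a rough illustration of the stringr route, str_extract() is vectorised and returns NA for elements without a match, so the result always has one entry per row. The toy data and the sentence pattern below are assumptions for illustration; in particular, the pattern treats every full stop as a sentence boundary, which would break on text such as "2.5 mg".

```r
library(stringr)

# Hypothetical toy data standing in for Data$REPORT_TEXT
Data <- data.frame(
  REPORT_TEXT = c(
    "Sedation given. The preparation was good. Scope passed easily.",
    "Sedation given. Scope passed easily.",
    "THE PREPARATION WAS poor."
  ),
  stringsAsFactors = FALSE
)

# Assumed pattern: the run of non-full-stop characters around the phrase,
# plus the closing full stop if present; matching ignores case
sentence_re <- regex("[^.]*the preparation was[^.]*\\.?", ignore_case = TRUE)

# str_extract() yields NA where nothing matches; str_trim() drops the
# leading space left over from the previous sentence
Data$Prep <- str_trim(str_extract(Data$REPORT_TEXT, sentence_re))
Data$Prep
```

Only the first matching sentence per row is returned; str_extract_all() would be the function to look at if every match is needed.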
