Regular expression character including space

I'm very new to the R community and could really use some help. I have a column that contains these unique values:

  1. corn
  2. good corn
  3. bad corn
  4. corn fine

Now, I want to find out how many rows contain %corn%, including the ones with spaces.

I tried many options but to no avail:

nrow(subset(df, col_name == '\\bcorn\\b'))

and

nrow(subset(df, col_name == '\\<corn\\>'))

They all return zero.

This code returns only 1, which is the first row:

nrow(subset(df, col_name == 'corn'))

How can I make it return the number of all rows that contain 'corn', including those with spaces? Please let me know if I can provide more information. Thanks

Here are two methods. One uses the grepl function from base R and the other uses a function from the stringr package.

DF <- data.frame(Things = c("corn", "barn", "good corn", 
                             "yellow", "corn bad"))
DF
     Things
1      corn
2      barn
3 good corn
4    yellow
5  corn bad
#method 1
sum(grepl("corn", DF$Things))
[1] 3
 
#method2
library(stringr)
sum(str_detect(DF$Things, "corn"))
[1] 3
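
For reference, the attempts in the question return zero because `==` compares whole strings literally and never treats its right-hand side as a regular expression. Below is a minimal base R sketch, rebuilding the same DF so it runs on its own. The `\\b` word-boundary pattern and the variable names are illustrative additions, not something the methods above require.

```r
# Same example data as in the answer above
DF <- data.frame(Things = c("corn", "barn", "good corn",
                             "yellow", "corn bad"))

# `==` is an exact, literal comparison: only the value "corn" itself matches
exact_count <- sum(DF$Things == "corn")

# A regex pattern passed to `==` is treated as plain text, so nothing matches
literal_count <- sum(DF$Things == "\\bcorn\\b")

# grepl() interprets the pattern as a regular expression;
# \\b restricts the match to "corn" as a whole word
word_count <- sum(grepl("\\bcorn\\b", DF$Things))

# Keep the matching rows instead of only counting them
corn_rows <- subset(DF, grepl("corn", Things))
```

Here `exact_count` reproduces the "only 1" result from the question, while `grepl()` picks up the rows where corn appears next to other words.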

Thank you so much. You're my lifesaver. I haven't been able to sleep for 2 days.

Although, I would like to add that the second method doesn't work for me (I've installed and loaded stringr). Is there any limitation on how to use it?

And also, can we make a table out of the output for future use?
Example:

sum(grepl("corn", DF$Things))
sum(grepl("barn", DF$Things))

will return:

corn 3
barn 1

Thank you.

You can make a named vector of results like this.

Words <- c("corn", "barn")
Results <- sapply(Words, function(x) sum(grepl(x, DF$Things)))
Results
corn barn 
   3    1 

I can't say why the stringr version of my code is not working for you. Can you make a small example of it not working, similar to what I put in my first post?

Is there any way I can make it as a new table for future reference for plotting? I want to be able to define x and y axis from the table.

The second method only shows NA as a result. I did exactly what you wrote:

sum(str_detect(DF$Things, "corn"))
[1] <NA>

I honestly have no idea why

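The str_detect version is returning NA because one of the values in the Things column is NA. str_detect returns NA for a missing value, and sum() then returns NA as well. The grepl function returns FALSE for NA values instead, so they simply drop out of the count. If you set the na.rm argument of sum() to TRUE, you will get the desired result.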

DF <- data.frame(Things = c("corn", "barn", "good corn", 
                             NA, "corn bad"))
sum(grepl("corn", DF$Things))
[1] 3
sum(stringr::str_detect(DF$Things, "corn"))
[1] NA
sum(stringr::str_detect(DF$Things, "corn"), na.rm = TRUE)
[1] 3
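
To see where the NA comes from without needing stringr, here is a small base R sketch. The vector `hits_with_na` is a hand-written stand-in for the kind of logical vector str_detect() returns when the input has a missing value; it and the other variable names are assumptions for illustration.

```r
x <- c("corn", NA, "corn bad")

# grepl() reports FALSE for a missing value, so sum() never sees an NA
grepl_hits <- grepl("corn", x)

# Stand-in for a logical result that keeps the missing value as NA
hits_with_na <- c(TRUE, NA, TRUE)

# sum() propagates missing values unless told to drop them
total_with_na <- sum(hits_with_na)
total_without_na <- sum(hits_with_na, na.rm = TRUE)

# Count the missing entries so they are not silently lost
missing_count <- sum(is.na(x))
```

Checking `missing_count` first is a cheap way to find out whether a column has NA values before counting matches.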

To make a data frame of the counts, you could use the data.frame function, though there is no reason you cannot store the vector that the original code produced.

Words <- c("corn", "barn")
Results <- sapply(Words, function(x) sum(grepl(x, DF$Things)))
Results <- data.frame(Words, Results) 
Results
     Words Results
corn  corn       3
barn  barn       1

Wow, thanks!!! I do have NA values in my dataset, sorry I didn't mention it in the first place.

I have made a data frame of the counts but unfortunately, when plotting, R doesn't recognize it as a data frame.

Here's how I plot:

ggplot(data = Results) %>%
    geom_bar(mapping = aes(x = Words, y = Results))

It returns:

Error in `fortify()`:
! `data` must be a data frame, or other object coercible by `fortify()`, not an S3 object with class gg/ggplot.

I'm sure it's a data frame by now, but I wonder why R doesn't recognize it.

What is the result of

class(Results)
[1] "data.frame"

I tried inspecting the structure and all; I'm positive it's a data frame.

Please post the output of

dput(Results)

If Results is large enough to make that unwieldy, you can post the output of

dput(head(Results, 20))

I would also change the name of the y column so it doesn't match the data frame name. That shouldn't be a problem but it strikes me as dangerous, though I did it myself.
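
One way to do that renaming, as a rough sketch: build the data frame from the names and values of the count vector, so no column shares a name with the data frame. The names `Counts`, `WordCounts`, `Word`, and `Count` are hypothetical choices for this example, not names used elsewhere in the thread.

```r
DF <- data.frame(Things = c("corn", "barn", "good corn",
                             NA, "corn bad"))
Words <- c("corn", "barn")

# Named integer vector: one count per search word
Counts <- sapply(Words, function(x) sum(grepl(x, DF$Things)))

# unname() drops the names from the values, so the data frame gets
# plain 1, 2, ... row names instead of corn / barn
WordCounts <- data.frame(Word = names(Counts),
                         Count = unname(Counts))
WordCounts
```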

It's late here, so I will not be able to respond for several hours. Someone else will, I hope.

This is what it returns (it's symptom data for a disease):

structure(list(name = c("fever", "headache", "muscle pain", "backache", 
"lymph nodes", "fatigue", "lesion", "pustule", "blister", "cough", 
"rash", "ulcer"), number = c(38L, 5L, 4L, 0L, 2L, 2L, 66L, 3L, 
2L, 1L, 15L, 73L)), class = "data.frame", row.names = c(NA, -12L
))

I've changed the column names to name and number.

It's alright. Thank you for the help, you've been so kind

I'm sorry for the confusion, but apparently what was wrong in the above problem was my plot.

It should've been:

ggplot(data = Results)  +
    geom_col(mapping = aes(x = Words, y = Results))

I was not careful.
Thanks for the help. All good now.
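
For anyone landing here later, a sketch of the working plot using the data posted via dput() and its renamed columns (name, number). It assumes ggplot2 is installed. As far as I can tell, the earlier error arose because `%>%` handed the ggplot object to geom_bar() as an argument, where it ended up as `data`; ggplot2 layers are combined with `+` instead.

```r
library(ggplot2)

# Data exactly as posted via dput() earlier in the thread
Results <- structure(list(name = c("fever", "headache", "muscle pain", "backache",
"lymph nodes", "fatigue", "lesion", "pustule", "blister", "cough",
"rash", "ulcer"), number = c(38L, 5L, 4L, 0L, 2L, 2L, 66L, 3L,
2L, 1L, 15L, 73L)), class = "data.frame", row.names = c(NA, -12L
))

# Layers are added with +, not piped with %>%.
# geom_col() draws the y values as given; geom_bar() counts rows by default.
p <- ggplot(data = Results) +
  geom_col(mapping = aes(x = name, y = number))
p
```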
