Subsetting Data


Dear All,
I am trying to subset some particular rows in the data with the picture attached. My goal is to (1) List the "Make.and.model" with model names ending with 'R'.

(2) List the "Make.and.model" that might be smaller than 'liter' bikes (engine size < 1000 cc), based on their name. I want to achieve this by first, excluding motorcycles with 1 in the name (these will mostly be 1000+ numbers). From that set of names, I will now select those with numbers in range 2-9 in their names.
Please, how do I achieve these?


It is not a good idea to put your data in the screenshot like this. Here you can find what you should do instead:

This way it'll be much easier to others to help you.

I think, both of your questions can be answered with stringr package, specifically stringr::str_detect().


As @mishabalyasin mentioned, it will be much easier to help you with a reprex. However, a general suggestion would be to take a look at the stringr package and its Regular Expression helpers (likely for use in combination with tidyr, or a similar tool for helping you separate out multiple variables from a single column, which is what you have right now). See, for example, the section on separating multiple vars in the Tidy data vignette:

The stringr cheat sheet can be an invaluable asset as you go, too:

Once you've done that (which is the majority of the work), you'll be able to begin to subset data. For example, if you got a reliable displacement figure (I don't know how accurate you want to be, for example, the Diavel is 1200cc, which isn't in the name ā€” R can also mean different things depending on the make) you could use dplyr::filter() to select rows with engine displacements >= 1000.


Thank you all for your suggestions. I got it figured out. I used the sqldf and it was helpful.


I got it figured Mishabalyasin. Thanks very much. I used the sqldf and it help me out.


Great! Can you share your solution here as well? It is very likely that someone else might have similar problem in future, so it will be helpful to have it in one place.


So for the first part of my problem, this is how I got it done
sqldf("select * from data where variable like '%R' ").

This is the second part of the problem and the code:

# selecting those with numbers with 1
 data$variable[grepl("[1]", data$variable)]

# dropping those with numbers with 1
exlud <- data$variable[!grepl("[1]", data$variable)]
# those with numbers between 2-9
exlud[grepl("[2-9]", exlud)]

Hope it helps...