Subsetting - subset function producing odd output


#1

Hi, this is my first post, so here goes… :persevere:

I have been practising with the ‘iris’ dataset, filtering the data with three methods to compare the results. However, in Example Three below, the subset() call (iris_3) produces different results from iris_1 and iris_2. Can anyone explain why? Many thanks.

# Example One: | (OR) returns 87 obs/5 variables for each line
(iris_1 <- filter(iris, Sepal.Length > 7 | Sepal.Width <= 3))
(iris_2 <- iris[iris$Sepal.Length > 7 | iris$Sepal.Width <= 3, ])
(iris_3 <- subset(iris, Sepal.Length > 7 | Sepal.Width <= 3))
#-------------------------------------------------------
# Example Two: & (And) returns 8 obs/5 variables for each line
(iris_1 <- filter(iris, Sepal.Length > 7 & Sepal.Width <= 3))
(iris_2 <- iris[iris$Sepal.Length > 7 & iris$Sepal.Width <= 3, ])
(iris_3 <- subset(iris, Sepal.Length > 7 & Sepal.Width <= 3))
#------------------------------------------------------
# Example Three: , (comma) returns 8 obs for iris_1 and iris_2,
# but 12 obs for iris_3?
(iris_1 <- filter(iris, Sepal.Length > 7, Sepal.Width <= 3))
(iris_2 <- iris[iris$Sepal.Length > 7, iris$Sepal.Width <= 3, ])
(iris_3 <- subset(iris, Sepal.Length > 7, Sepal.Width <= 3))

#2

Could you please provide the output from search() so that we can see which packages you have loaded.


#3

Hi martin.R - here are the results of search():

search()
 [1] ".GlobalEnv"        "package:bindrcpp"  "package:forcats"   "package:stringr"
 [5] "package:dplyr"     "package:purrr"     "package:readr"     "package:tidyr"
 [9] "package:tibble"    "package:ggplot2"   "package:tidyverse" "tools:rstudio"
[13] "package:stats"     "package:graphics"  "package:grDevices" "package:utils"
[17] "package:datasets"  "package:methods"   "Autoloads"         "package:base"
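
The search path matters here because both stats and dplyr define a function called filter(), and R uses whichever package comes first on the path (dplyr in the listing above). As a small sketch, not part of the original posts, one way to check which filter() a bare call resolves to in the current session:

```r
# Which package does a bare `filter` resolve to right now?
# In a fresh session this is "stats"; after library(dplyr) it is "dplyr".
which_filter <- environmentName(environment(filter))
print(which_filter)

# find() lists every attached package that defines the name, in search order.
filter_sources <- find("filter")
print(filter_sources)
```

Calling the function with an explicit prefix, such as dplyr::filter(), avoids depending on the order of the search path.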


#4

OK, filter() accepts several conditions separated by commas, but subset() does not.

(I get an error for Example 3 iris_2.)


#5

I think the differences in your third example are caused by the way the functions operate:

  • dplyr::filter() (which I’m assuming is the filter() you’re using) accepts comma-separated conditions, which are combined with &.
  • [ can’t take multiple comma-separated conditions the way you have written it. Check the help file with ?"[": the commas separate the indices you pass when you call the function, and a “rectangle” of data has just two (rows and columns).
  • subset() works differently from filter() and doesn’t accept comma-separated filter conditions. When you separate them with a comma as you have done, subset() treats the expression after the comma as its next argument (select, which chooses columns) and not as another row condition. Again, check the help file (via ?subset) for details.
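
The three behaviours in this list can be sketched as follows. This is a minimal illustration rather than code from the thread: the variable names are made up for the example, it uses the built-in iris data, and the dplyr part runs only if dplyr happens to be installed.

```r
# `[` takes one row index and one column index, so row conditions
# must be combined with & before the first comma.
rows_bracket <- iris[iris$Sepal.Length > 7 & iris$Sepal.Width <= 3, ]

# subset(): put both conditions in the first (subset) argument.
rows_subset <- subset(iris, Sepal.Length > 7 & Sepal.Width <= 3)

print(nrow(rows_bracket))                    # 8, as in Example Two
print(all.equal(rows_bracket, rows_subset))

# Passing a second condition as an extra index to `[` is an error
# (as reported in the thread); capture it instead of stopping.
bracket_attempt <- tryCatch(
  iris[iris$Sepal.Length > 7, iris$Sepal.Width <= 3, ],
  error = function(e) e
)
print(inherits(bracket_attempt, "error"))

# dplyr::filter() combines comma-separated conditions with &.
if (requireNamespace("dplyr", quietly = TRUE)) {
  rows_comma <- dplyr::filter(iris, Sepal.Length > 7, Sepal.Width <= 3)
  rows_and   <- dplyr::filter(iris, Sepal.Length > 7 & Sepal.Width <= 3)
  print(nrow(rows_comma) == nrow(rows_and))
}
```

Writing the conditions with an explicit & works the same way in all three approaches, which makes it the safer habit when switching between them.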

(On a side note, you can have code displayed as code on the forum by wrapping it in backticks (`), or use the </> button in the editor).


#6

Hi martin.R, yes, I just re-ran the code and you’re right: Example 3, iris_2 does throw an error. I hadn’t noticed this because I was watching the environment pane to see what happened. Thank you for your response :grinning:


#7

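Just to expand further on why you get the odd output for Example 3 iris_3:

(iris_3 <- subset(iris, Sepal.Length > 7, Sepal.Width <= 3))
translates to:
(iris_3 <- subset(iris, subset = Sepal.Length > 7, select = Sepal.Width <= 3))
i.e. the second condition is matched to the select argument, which chooses columns rather than rows. Inside select, column names stand for their column numbers, so Sepal.Width is 2 and the expression becomes 2 <= 3, which is TRUE. A select of TRUE keeps every column, so the second condition is silently dropped and only Sepal.Length > 7 filters the rows, giving 12 obs.

You will often get unintended output, rather than an error, if you are not careful about how a function matches its arguments.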
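
To see what select actually receives in this call, the following sketch compares the call from Example Three with a plain row filter. It is an illustration added for clarity, using only base R and the built-in iris data; the variable names are invented for the example.

```r
# Inside `select`, subset() evaluates column names as column positions.
col_positions <- setNames(seq_along(iris), names(iris))
print(col_positions[["Sepal.Width"]])   # 2, so `Sepal.Width <= 3` is `2 <= 3`

# The call from Example Three versus a filter on Sepal.Length alone.
with_stray_arg <- subset(iris, Sepal.Length > 7, Sepal.Width <= 3)
rows_only      <- subset(iris, Sepal.Length > 7)

print(dim(with_stray_arg))              # 12 rows, 5 columns
print(all.equal(with_stray_arg, rows_only))

# `select` used as intended: keep a range of columns by name.
two_cols <- subset(iris, Sepal.Length > 7, select = Sepal.Width:Petal.Length)
print(names(two_cols))
```

Naming the arguments (subset = ..., select = ...) makes this kind of accidental matching easier to spot.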


#8

Hi Jim89, I’ve just reviewed my code against your comments. Yes, I used the tidyverse package, which includes dplyr. Really helpful comments, which I’ll look into further - many thanks.


#9

Aha! That makes sense - thanks martin.R :grinning: