A dataframe question - solved

noelchiu · March 1, 2018, 7:33am

I am learning R and following the code in the book Discovering Statistics Using R:

lecturerData<-read.delim("Lecturer Data.dat", header = TRUE); lecturerData
lecturerData$job<-factor(lecturerData$job, levels = c(1:2), labels = c("Lecturer", "Student")); lecturerData
lecturerOnly <- lecturerData[job=="Lecturer",]
lecturerOnly

However, the output is a bunch of "NA."

So I modified the code as follows:

lecturerData<-read.delim("Lecturer Data.dat", header = TRUE); lecturerData
lecturerData$job<-factor(lecturerData$job, levels = c(1:2), labels = c("Lecturer", "Student")); lecturerData
lecturerOnly <- lecturerData[lecturerData$job=="Lecturer",]
lecturerOnly

Then it works: the output shows only the rows with job = "Lecturer".

I am wondering if it is supposed to be

lecturerOnly <- lecturerData[lecturerData$job=="Lecturer",]

instead of

lecturerOnly <- lecturerData[job=="Lecturer",]

Or did I do something wrong?

Thank you for your help.

mishabalyasin · March 1, 2018, 11:44am

Your correction is exactly how you are supposed to do it, so you are definitely not wrong.

As for the example from the book, there are several possible explanations for why it doesn't work. I think the simplest is that there is a typo. Another possibility is that they forgot to write attach(lecturerData), since it would then have been possible to write lecturerOnly <- lecturerData[job=="Lecturer",] and still get the correct answer. However, I don't recommend this approach, since it is a good way to end up with a bunch of fairly awkward bugs that are difficult to understand.
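
To make the "awkward bugs" point concrete, here is a minimal sketch of one such pitfall: attach() puts a copy of the data frame on the search path, so later changes to the data frame are not seen through the bare column name. The data frame `scores` and its values are made up for illustration, and the sketch assumes no other object called `x` exists in the workspace.

```r
# Hypothetical data, not from the book
scores <- data.frame(x = c(-1, 2, 3))

attach(scores)              # a copy of scores goes on the search path

scores$x <- c(5, -6, -7)    # modify the data frame after attaching

stale <- x                  # bare x still refers to the attached copy: -1 2 3

# The filter silently uses the stale x, so rows 2 and 3 are kept
# even though their current values are negative
wrong_rows <- scores[x > 0, , drop = FALSE]

# Referring to the column explicitly uses the current values
right_rows <- scores[scores$x > 0, , drop = FALSE]

detach(scores)              # always clean up the search path
```

No error or warning is raised at any point, which is what makes this kind of bug hard to track down.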

Here is an illustration of what I mean, showing multiple ways to do the same thing. Personally, I would always try to use the dplyr approach since it makes your code more readable, but it is up to you.

library(tidyverse)

df <- tibble::tibble(x = rnorm(n = 100, mean = 0, sd = 1), y = rnorm(n = 100))

df[x > 0, ]
#> Error in `[.tbl_df`(df, x > 0, ): object 'x' not found

attach(df)

df[x > 0, ]
#> # A tibble: 49 x 2
#>        x       y
#>    <dbl>   <dbl>
#>  1 0.793 -0.382 
#>  2 0.847  0.204 
#>  3 0.312  1.53  
#>  4 0.749  0.761 
#>  5 0.149 -0.559 
#>  6 0.289 -2.29  
#>  7 0.485 -0.536 
#>  8 0.947  0.877 
#>  9 1.06  -0.318 
#> 10 0.293  0.0916
#> # ... with 39 more rows

detach(df)

df[x > 0, ]
#> Error in `[.tbl_df`(df, x > 0, ): object 'x' not found

with(data = df, expr = df[x > 0, ])
#> # A tibble: 49 x 2
#>        x       y
#>    <dbl>   <dbl>
#>  1 0.793 -0.382 
#>  2 0.847  0.204 
#>  3 0.312  1.53  
#>  4 0.749  0.761 
#>  5 0.149 -0.559 
#>  6 0.289 -2.29  
#>  7 0.485 -0.536 
#>  8 0.947  0.877 
#>  9 1.06  -0.318 
#> 10 0.293  0.0916
#> # ... with 39 more rows

df[df$x > 0, ]
#> # A tibble: 49 x 2
#>        x       y
#>    <dbl>   <dbl>
#>  1 0.793 -0.382 
#>  2 0.847  0.204 
#>  3 0.312  1.53  
#>  4 0.749  0.761 
#>  5 0.149 -0.559 
#>  6 0.289 -2.29  
#>  7 0.485 -0.536 
#>  8 0.947  0.877 
#>  9 1.06  -0.318 
#> 10 0.293  0.0916
#> # ... with 39 more rows

df %>%
  dplyr::filter(x > 0)
#> # A tibble: 49 x 2
#>        x       y
#>    <dbl>   <dbl>
#>  1 0.793 -0.382 
#>  2 0.847  0.204 
#>  3 0.312  1.53  
#>  4 0.749  0.761 
#>  5 0.149 -0.559 
#>  6 0.289 -2.29  
#>  7 0.485 -0.536 
#>  8 0.947  0.877 
#>  9 1.06  -0.318 
#> 10 0.293  0.0916
#> # ... with 39 more rows

Created on 2018-03-01 by the reprex package (v0.2.0).
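
As a side note on the original question, the book's line printed rows of NA rather than an "object not found" error. One way that can happen, sketched below, is when some other object named `job` is visible and the comparison evaluates to NA, because indexing rows with a logical NA returns a row of NAs. The stray `job` vector and the data here are purely hypothetical; what was actually in the workspace is not known from the thread.

```r
# Hypothetical stand-in for the book's data
lecturers <- data.frame(
  name = c("A", "B", "C"),
  job  = factor(c(1, 1, 2), levels = 1:2, labels = c("Lecturer", "Student"))
)

# Assumption: a stray object called job exists and contains only NA
job <- c(NA, NA, NA)

# job == "Lecturer" is NA for every element, so every selected row is NA
na_rows <- lecturers[job == "Lecturer", ]

# Naming the column explicitly picks up the real values
ok_rows <- lecturers[lecturers$job == "Lecturer", ]
```

Here `na_rows` has three rows filled with NA, while `ok_rows` contains the two lecturers.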

noelchiu · March 1, 2018, 5:25pm

Thank you so much!!! I truly do appreciate your explanation and example. It is very helpful.

tbradley · March 1, 2018, 7:15pm

Rather than changing your question title to "-solved", please just mark the answer that solved your question as the solution. You can see how to do that here:

noelchiu · March 1, 2018, 7:51pm

I marked it but I don’t see the solution box for me to choose tho.
Could you help me find it? Thank you.

tbradley · March 1, 2018, 8:57pm

It should be available for you now. Can you try again? Thanks!

noelchiu · March 1, 2018, 9:37pm

It is done. Thank you very much.