Understanding Grouped Mutates

r4ds

#1

Hello, I was directed here by Hadley to pose a question. I am working through the R for Data Science book and I am a little confused by grouped mutates.

I am working with a dataset called “flights” that has 19 variables and over 300,000 rows of flight data from NYC.

I enter the following code:

library(dplyr)
library(nycflights13)

popular_dests <- flights %>%
  group_by(dest) %>%
  filter(n() > 365)
popular_dests

The following output is produced:

Source: local data frame [332,577 x 19]
Groups: dest [77]

    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier flight tailnum origin
   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>     <dbl>   <chr>  <int>   <chr>  <chr>
1   2013     1     1      517            515         2      830            819        11      UA   1545  N14228    EWR
2   2013     1     1      533            529         4      850            830        20      UA   1714  N24211    LGA
3   2013     1     1      542            540         2      923            850        33      AA   1141  N619AA    JFK
4   2013     1     1      544            545        -1     1004           1022       -18      B6    725  N804JB    JFK
5   2013     1     1      554            600        -6      812            837       -25      DL    461  N668DN    LGA
6   2013     1     1      554            558        -4      740            728        12      UA   1696  N39463    EWR
7   2013     1     1      555            600        -5      913            854        19      B6    507  N516JB    EWR
8   2013     1     1      557            600        -3      709            723       -14      EV   5708  N829AS    LGA
9   2013     1     1      557            600        -3      838            846        -8      B6     79  N593JB    JFK
10  2013     1     1      558            600        -2      753            745         8      AA    301  N3ALAA    LGA
# ... with 332,567 more rows, and 6 more variables: dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>,
#   time_hour <dttm>



I don’t really understand why this code doesn’t appear to have changed popular_dests much (there were 336,776 rows in flights, and there are 332,577 rows in popular_dests).

Is it that the code has taken the original dataset and only removed those flights whose destination does not have more than 365 flights in the year? Does it only temporarily group the data for the sake of the filter, and then ungroup again?

Thanks in advance.


#2

Does this help?

library(dplyr, warn.conflicts = FALSE)
library(nycflights13)

flights %>%
  group_by(dest) %>%
  filter(n() < 365)
#> # A tibble: 3,834 x 19
#> # Groups:   dest [27]
#>     year month   day dep_time sched_dep_time dep_delay arr_time
#>    <int> <int> <int>    <int>          <int>     <dbl>    <int>
#>  1  2013     1     1      831            835        -4     1021
#>  2  2013     1     1      848            851        -3     1155
#>  3  2013     1     1      926            928        -2     1233
#>  4  2013     1     1      946            959       -13     1146
#>  5  2013     1     1     1655           1700        -5     1953
#>  6  2013     1     1     1814           1815        -1     2122
#>  7  2013     1     1     1840           1845        -5     2055
#>  8  2013     1     1     1900           1845        15     2212
#>  9  2013     1     1     1923           1859        24     2239
#> 10  2013     1     1     1952           1930        22     2358
#> # ... with 3,824 more rows, and 12 more variables: sched_arr_time <int>,
#> #   arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
#> #   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
#> #   minute <dbl>, time_hour <dttm>

i.e. very few flights go to destinations that don’t have at least 365 flights: only 3,834 rows across 27 destinations.
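
To make the mechanics easier to see, here is a minimal sketch on a made-up toy data frame (the `toy` table, its `AAA`/`BBB`/`CCC` destinations, and the names `busy` and `plain` are all hypothetical, not part of nycflights13). Inside a grouped `filter()`, `n()` is the number of rows in the current group, so each destination is kept or dropped as a whole. The result also stays grouped, which is why the output in the question still shows a `Groups: dest [77]` line; the grouping is not undone automatically.

```r
library(dplyr, warn.conflicts = FALSE)

# Hypothetical toy data: three destinations with 3, 2 and 1 flights.
toy <- tibble(
  dest   = c("AAA", "AAA", "AAA", "BBB", "BBB", "CCC"),
  flight = 1:6
)

# Within a grouped filter, n() is the size of the current group,
# so whole groups are kept or dropped together.
busy <- toy %>%
  group_by(dest) %>%
  filter(n() > 1)

nrow(busy)           # rows belonging to AAA and BBB only
is_grouped_df(busy)  # the result is still grouped
group_vars(busy)     # the grouping variable(s)

# ungroup() removes the grouping explicitly.
plain <- ungroup(busy)
is_grouped_df(plain)
```

One more detail: `n() > 365` and `n() < 365` are not exact complements, because a destination with exactly 365 flights is dropped by both. That would be consistent with the row counts in this thread, where 332,577 + 3,834 falls 365 short of 336,776.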


#3

It does; thanks a lot. (I’m so sorry for taking so long to reply, but I really appreciate your help.)


#4

Wow, I didn’t think it was going to work with a database connection but it does. Learned something new today.