dplyr ifelse() mean calculation

Hello,
In the first code chunk, I'm trying to use ifelse() inside dplyr's summarise() to calculate a mean based on a condition. The code runs, but it returns incorrect values (the correct values are produced in the second chunk, which filters the df first and then runs the summary). The sum() with ifelse() works fine for counting the rows that pass the condition, and those counts match the values returned by the filtered chunk.

I suspect I'm doing something wrong with the second argument of the ifelse() and have unsuccessfully tried several options, and would appreciate any help.

library(tidyverse)
library(nycflights13)
flt_late_b <- flights %>% 
     group_by(origin) %>% 
   summarise(
   late_mean_delay = mean( ifelse( dep_delay >= 3.00, dep_delay, 0), na.rm = TRUE),
   count_late = sum( ifelse( dep_delay >= 3.00, 1, 0), na.rm = TRUE)
  ) %>% 
print('flt_late_b') 
flt_late <- flights %>% 
   filter(dep_delay >= 3.00) %>% 
   group_by(origin) %>% 
   summarise(
   late_ave_delay = mean( dep_delay, na.rm = TRUE),
   count_late = n()
  ) %>% 
 print( 'flt_late')

If you want to exclude the flights with departure delays of less than 3.00 from the mean, have ifelse() set those values to NA, not zero. Zeros are still included in the calculation of the mean, which gives a much smaller value. NA values are dropped by na.rm = TRUE, so they do not affect the result.
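To see the difference on a small scale, here is a minimal sketch with a made-up vector (the values are hypothetical, not taken from nycflights13):

```r
# Hypothetical delays, including one missing value
delay <- c(1, 2, 10, 20, NA)

# Non-late values become 0 and stay in the denominator:
# (0 + 0 + 10 + 20) / 4
mean_with_zero <- mean(ifelse(delay >= 3, delay, 0), na.rm = TRUE)

# Non-late values become NA and are dropped by na.rm = TRUE:
# (10 + 20) / 2
mean_with_na <- mean(ifelse(delay >= 3, delay, NA), na.rm = TRUE)

mean_with_zero
#> [1] 7.5
mean_with_na
#> [1] 15
```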

I am not 100% sure, but I think the print('flt_late_b') at the end of the pipe is a problem. The pipe already passes the summarised tibble to print() as its first argument, so the string is treated as an extra argument rather than as the name of the object. When I ran your code it caused an error. I just wrapped the code in (), which means it will both assign the output to a name and print it.
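A minimal sketch of the two ways to assign and print in one step, using a small made-up tibble (`toy`, `by_origin`, and `by_origin2` are names invented for this illustration):

```r
library(dplyr)

# Hypothetical data for illustration only
toy <- tibble(origin = c("A", "A", "B"), dep_delay = c(5, 1, 10))

# Parentheses around the assignment: assigns the result and prints it
(by_origin <- toy %>% count(origin))

# print() with no extra arguments at the end of the pipe also prints;
# it returns its input invisibly, so the assignment still receives the tibble
by_origin2 <- toy %>% count(origin) %>% print()
```

Both forms should leave the same tibble in the assigned name; the parenthesised form just avoids adding a step to the pipe.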

library(tidyverse)
library(nycflights13)

(flt_late_b <- flights %>% 
  group_by(origin) %>% 
  summarise(
    late_mean_delay = mean(ifelse(dep_delay >= 3.00, dep_delay, NA), na.rm = TRUE),
    count_late = sum(ifelse(dep_delay >= 3.00, 1, 0), na.rm = TRUE)
  )
)
#> # A tibble: 3 × 3
#>   origin late_mean_delay count_late
#>   <chr>            <dbl>      <dbl>
#> 1 EWR               43.8      46759
#> 2 JFK               42.8      37213
#> 3 LGA               46.3      30177

(flt_late <- flights %>% 
  filter(dep_delay >= 3.00) %>% 
  group_by(origin) %>% 
  summarise(
    late_ave_delay = mean( dep_delay, na.rm = TRUE),
    count_late = n()
  )
)
#> # A tibble: 3 × 3
#>   origin late_ave_delay count_late
#>   <chr>           <dbl>      <int>
#> 1 EWR              43.8      46759
#> 2 JFK              42.8      37213
#> 3 LGA              46.3      30177

Created on 2022-12-31 with reprex v2.0.2
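As a possible alternative to ifelse(), the condition can be used to subset the column directly inside summarise(). This is only a sketch on a small made-up tibble (`toy`, `by_subset`, and `by_filter` are hypothetical names), comparing it with the filter-first approach:

```r
library(dplyr)

# Hypothetical data for illustration only
toy <- tibble(
  origin    = c("A", "A", "A", "B", "B"),
  dep_delay = c(1, 10, 20, NA, 6)
)

# Subset inside summarise(): keep only the delays that meet the condition.
# An NA in the condition yields an NA element, which na.rm = TRUE drops.
by_subset <- toy %>%
  group_by(origin) %>%
  summarise(
    late_mean_delay = mean(dep_delay[dep_delay >= 3], na.rm = TRUE),
    count_late = sum(dep_delay >= 3, na.rm = TRUE)
  )

# Filter first, then summarise (filter() drops rows where the condition is NA)
by_filter <- toy %>%
  filter(dep_delay >= 3) %>%
  group_by(origin) %>%
  summarise(
    late_mean_delay = mean(dep_delay),
    count_late = n()
  )
```

One difference to keep in mind: if a group had no rows meeting the condition, the subset version would still return a row for it (with a mean of NaN and a count of 0), while the filter version would drop that group entirely.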

Worked great, thanks.
I fixed the print as well; it was left over from previous attempts.

And Happy New Year :grinning:
