I am working with the nycflights13 dataset in R (library(nycflights13)) and I’m trying to answer this specific question:
If I am leaving before noon, what are my top two airline options at each airport (JFK, LGA, EWR) that will have the least amount of delay time?
At first I created a summary table showing, for each airport, the two airlines with the smallest average departure delay.
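library(dplyr)
# Morning departures from the three NYC airports
flights %>%
  filter(sched_dep_time < 1200, origin %in% c("JFK", "LGA", "EWR")) %>%
  # Mean departure delay per carrier and airport (the formula interface drops NA delays)
  aggregate(dep_delay ~ carrier + origin, data = ., FUN = mean) %>%
  group_by(origin) %>%
  # Negative n keeps the two smallest mean delays per airport
  top_n(n = -2, wt = dep_delay)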
But then I thought that means aren’t really a good summary statistic to base my decision on, so I started plotting my data.
I can see there are quite a few outliers, but my thought is to keep them in the dataset, as they provide valuable information on how badly a departure can be delayed at times. After getting a glimpse of the entire dataset, I wanted to look more closely at departure delays that are negative (meaning the flight departed early) or around zero.
These plots are valuable, but they don’t make it obvious which airline and airport combination would be best for me given all the information I have. My next thought is to estimate the CDF of each airline’s departure delay and then compute P(X ≤ 0). I can then select the airlines with the highest probability of a departure delay of at most zero.
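A rough sketch of that idea, not a settled implementation: the empirical CDF evaluated at zero is just the observed share of flights with a delay of at most zero, so it can be computed per carrier and airport with a grouped mean. Dropping flights with a missing dep_delay and the 100-flight minimum are assumptions made here for illustration, and the names morning, on_time, top_two and ua_ewr are made up for this example.

```r
library(dplyr)
library(nycflights13)

# Flights scheduled to leave before noon with a recorded departure delay.
# Assumption: rows with NA dep_delay (mostly cancelled flights) are dropped.
morning <- flights %>%
  filter(sched_dep_time < 1200, !is.na(dep_delay))

# mean(dep_delay <= 0) is the empirical CDF evaluated at 0, i.e. the
# observed share of flights that left on time or early.
on_time <- morning %>%
  group_by(origin, carrier) %>%
  summarise(
    n_flights    = n(),
    p_on_time    = mean(dep_delay <= 0),
    median_delay = median(dep_delay),
    .groups = "drop"
  ) %>%
  filter(n_flights >= 100)  # arbitrary cutoff to skip carriers with few flights

# Two carriers per airport with the highest estimated P(X <= 0)
top_two <- on_time %>%
  group_by(origin) %>%
  slice_max(p_on_time, n = 2, with_ties = FALSE) %>%
  ungroup()

print(top_two)

# The same quantity via the built-in ecdf(), for one carrier/airport pair
ua_ewr <- filter(morning, origin == "EWR", carrier == "UA")
ecdf(ua_ewr$dep_delay)(0)
```

The median is included next to the probability only as a second outlier-resistant summary; ranking by p_on_time alone ignores how late the delayed flights are, so the two columns may disagree.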
My question has two parts:
Is this sound statistical reasoning? Is there a test I could perform that would be better? I would really just like some guidance on thinking this problem through.
If this is sound statistical thinking, are there any resources you can direct me towards that would teach me how to implement it in R?