Help with Averaging Data per county per state for AQI data... then maps

KarmicDreamwork · March 12, 2019, 11:54pm

I am trying to study wildfire impacts on air quality. I downloaded the EPA's AQI data (https://aqs.epa.gov/aqsweb/airdata/download_files.html#AQI) and I need help with getting the averages.

Table looks like this. I made a reprex with some examples (thanks to andresrcs). The actual dataset has almost a million pieces of data in there.

data.frame(AQI = c(67L, 84L, 61L, 49L, 58L, 67L, 49L, 44L, 43L, 
    44L, 44L, 42L, 37L, 26L, 35L, 43L, 42L, 39L, 40L, 39L), State.Name = as.factor(c("Alabama", 
    "Alabama", "Alaska", "Alaska", "Washington", "Washington", 
    "Oregon", "Oregon", "Alabama", "Alabama", "Wyoming", "Wyoming", 
    "Wyoming", "Wyoming", "Wyoming", "Utah", "Washington", "Wyoming", 
    "Idaho", "Wyoming")), county.Name = as.factor(c("Autauga", 
    "Fayette", "Ada", "Fayette", "Clark", "Washington", "Clark", 
    "Cowlitz", "Clark", "Autauga", "Weston", "Fayette", "Weston", 
    "Cowlitz", "Weston", "Weston", "Weston", "Weston", "Clark", 
    "Weston")), Date = as.factor(c("1980-06-05", "1985-09-06", 
    "1989-04-07", "2007-11-08", "1980-10-09", "1990-11-10", "1980-10-11", 
    "2000-04-12", "1980-09-13", "1980-04-14", "2016-12-22", "2016-12-23", 
    "2016-12-24", "2012-10-25", "2013-12-26", "2010-12-27", "2016-12-28", 
    "2016-10-29", "2016-09-30", "2016-12-31")))
#>    AQI State.Name county.Name       Date
#> 1   67    Alabama     Autauga 1980-06-05
#> 2   84    Alabama     Fayette 1985-09-06
#> 3   61     Alaska         Ada 1989-04-07
#> 4   49     Alaska     Fayette 2007-11-08
#> 5   58 Washington       Clark 1980-10-09
#> 6   67 Washington  Washington 1990-11-10
#> 7   49     Oregon       Clark 1980-10-11
#> 8   44     Oregon     Cowlitz 2000-04-12
#> 9   43    Alabama       Clark 1980-09-13
#> 10  44    Alabama     Autauga 1980-04-14
#> 11  44    Wyoming      Weston 2016-12-22
#> 12  42    Wyoming     Fayette 2016-12-23
#> 13  37    Wyoming      Weston 2016-12-24
#> 14  26    Wyoming     Cowlitz 2012-10-25
#> 15  35    Wyoming      Weston 2013-12-26
#> 16  43       Utah      Weston 2010-12-27
#> 17  42 Washington      Weston 2016-12-28
#> 18  39    Wyoming      Weston 2016-10-29
#> 19  40      Idaho       Clark 2016-09-30
#> 20  39    Wyoming      Weston 2016-12-31

I need the average AQI for September-November per county per state (different states share same county names)
I need to know how many days or years of data there are for each county

Then I'll need to learn how to figure out how to find the difference between that info and a separate dataset for 2017 with the same type of data

My goal is to do this with 2017 as well and map the difference between the two data sets per county to see if there's any trends during the Eagle Creek Fire (occurred Sept-Nov 2017) and I know there's going to be a large potential error or other factors. I'm told I can do this in R but also no idea how.

I tried to do loops but I'm a bit confused on how to do it properly. I'm told I could use tidyverse and aggregate but also not sure if that would actually work for this.

So far, I was able to combine 1980-2016 files into one .csv file. I have been having issues with getting the loops to work. This is what I have so far:

#1980-2016 Air Quality Index Data
install.packages("dplyr")
library(dplyr)

#Combinate datasets for Air Quality Index Data for 1980-2016
AQIData <- rbind(AQI1980,AQI1981,AQI1982,AQI1983,AQI1984,AQI1985,AQI1986,AQI1987,AQI1988,AQI1989,AQI1990,AQI1991,AQI1992,AQI1993,AQI1994,AQI1995,AQI1996,AQI1997,AQI1998,AQI1999,AQI2000,AQI2001,AQI2002,AQI2003,AQI2004,AQI2005,AQI2006,AQI2007,AQI2008,AQI2009,AQI2010,AQI2011,AQI2012,AQI2013,AQI2014,AQI2015,AQI2016)

write.csv(AQIData, file = "AQIData.csv", row.names = FALSE)

#checkdata
str(AQIData)

#check how many counties there are
length(unique(AQIData$County.Code))

#county code is inconsistent, 200 codes, 3007 counties in US, over 1000 counties in dataset
#use county name/state

levels(AQIData$State.Name)
#remove canada, mexio, guam, virgin islands, puerto rico, 
#to get rid of states: tell R to remove rows where state is state.name
#which command, in a dataset, which are equal to ... which(v==1), will give location in vector equal to 1, finds row names
#Leave main one untouched
AQIData51 <- AQIData

#Mexico deleted observation, coded it weirdly. find it StateName <- levels(AQIData$State.Name); then length(which(AQIData$State.Name==StateName[9]))
StateName <- levels(AQIData$State.Name)
StateName
length(which(AQIData$State.Name==StateName[9]))

#Remove Country of Mexico, Canada, Virgin Islands, Puerto Rico, and Guam
AQIData51 <- AQIData51[-which(AQIData51$State.Name==StateName[9]),]
AQIData51 <- AQIData51[-which(AQIData51$State.Name=="Canada"),]
AQIData51 <- AQIData51[-which(AQIData51$State.Name=="Virgin Islands"),]
AQIData51 <- AQIData51[-which(AQIData51$State.Name=="Puerto Rico"),]
AQIData51 <- AQIData51[-which(AQIData51$State.Name=="Guam"),]



#factors... empty spots, instead of gone. tell R to get a different set of dividers
AQIData51 <- droplevels(AQIData51)
#always check structure to verify it's all good, such as if NA appears
str(AQIData51)
#51 states instead of 56

StateName2 <- levels(AQIData51$State.Name)
StateName2

#ask R to relate state to county to find AQI
#loops! 
#how to map it? Don't have Lat/Long in the same dataset. test it., example: test <- merge(AQIData51, LatLong, By County)
#extract it? merging has to have identical names, 
#ex. merge(both, seeds,by.x=c("genus", "species"), by.y-c("name1,"name2") ........... ?merge
#assuming no spelling errors

AQIData51$date <- as.Date(AQIData51$Date,format="%Y/%m/%d")
#check if it worked
str(AQIData51)
#contains NAs, didn't work

AQIData51$date <- as.Date(AQIData$Date,format="%Y/%m/%d")

#dates messed up
csvdat <- read.csv("C:/Users/keson/Desktop/Climate Change Lab/AQI data by year and county/AQI1980.csv")
str(csvdat)
#dates still different

#strptime function? didnt work
#change dates to character
#%d is day of the month as 0-31, but we have 1-31, day of month is 0-11 (?)
#anytime package?
#lubridate?
#try and split, date column is character, possible to split into day, month, year, columns 
#only need months
#grep
test1 <- as.Date(AQIData51$Date, formate="%m/%d/%Y")
head(test1)
#error na na na, reader is... 

test <- as.Date(csvdat$Date, formate="%m/%d/%Y")
head(test)



#formate is format.date
test <- as.Date(csvdat$Date,format.Date="%m/%d/%Y")
head(test)
#works for Y-m-d

test <- as.Date(AQIData51$Date,format.Date="%m/%d/%Y")
head(test)
#works

#sub data for states and county names based on certain months
state <- unique(AQIData51$State.Name)
#number of states
nS <- 51
#year variable

#November
#Took about 2 hours + to run
#for loop, do something for the number of times I tell you, indexes across that
meanmonth <- numeric()
#array of start and end dates to contain how many years of data for each county
#state name, county name, start date of record, end date of record
#tmp.m3 contains all info, subset 
st.end.dates = list()
k=1
for(i in 1:nS){
  tmp <- subset(AQIData51, State.Name==state[i])
  county <- unique(tmp$county.Name)
  nC <- length(county)
  CM <- numeric()
  for(t in 1:nC){
    #   x.tmp = which(tmp$county.Name==county[t])
    tmp.c <- subset(tmp,county.Name==county[t])
    tmp.m3 <-subset(tmp.c,format.Date(Date,'%m')%in%c('11'))
    #tmp.m3 want to extract date variable .. get year .. from year, find unique years 
    #print every step to make sure its doing it right, add to st.end.dates. unique function
    size.tmp3 = dim(tmp.m3)[1]
    st.end.dates[[k]] <- cbind(tmp.m3[1,c(1,2,5)],tmp.m3[size.tmp3,5])
    k=k+1
    print(size.tmp3)
    cm <- mean(tmp.m3[,"AQI"])
    CM <- c(CM,cm)
    print(CM)
  }
  names(CM) <- rep(state[i],nC)
  meanmonth <- c(meanmonth,CM)
}

I don't know how to finish the start-end date part of the loop. Can anyone please help with this? I am new to R and this has been the bane of my existence all semester.

andresrcs · March 13, 2019, 1:47am

You have been told right, you can do all this and more with the tools of the tidyverse, Here is a free online book where you can learn how to do it. (IMHO using loops is not the best approach)

For getting more specific help, I encourage you to make a REPRoducible EXample (reprex). A reprex makes it much easier for others to understand your issue and figure out how to help.

If you've never heard of a reprex before, you might want to start by reading this FAQ:

FAQ: How to do a minimal reproducible example ( reprex ) for beginners Guides & FAQs

A minimal reproducible example consists of the following items: A minimal dataset, necessary to reproduce the issue The minimal runnable code necessary to reproduce the issue, which can be run on the given dataset, and including the necessary information on the used packages. Let's quickly go over each one of these with examples: Minimal Dataset (Sample Data) You need to provide a data frame that is small enough to be (reasonably) pasted on a post, but big enough to reproduce your issue. Let's say, as an example, that you are working with the iris data frame head(iris) #> Sepal.Length Sepal.Width Petal.Length Petal.Width Species #> 1 5.1 3.5 1.4 0.…

KarmicDreamwork · March 13, 2019, 9:44am

Thank you! I am working on that tutorial you sent me. This is what I've been looking for.

For the tidyverse, the only thing I'm worried about is my AQI data set has Date in the y-m-d format and doesn't see it as a date (i think it was character? i can't remember atm)

Would I be able to use the filter in the dplyr package to sort out September-November for specified years? or do I have to figure out how to separate it into columns? (if so, how would I do that??)

andresrcs · March 13, 2019, 1:33pm

It's not clear to me what exactly you are trying to accomplish because you are not providing a reproducible example, but I think this example is similar to what you are trying to do

df <- data.frame(stringsAsFactors = FALSE,
                 dates = c("2019-03-01", "2019-03-02", "2019-03-03",
                           "2019-03-04", "2019-03-05", "2019-03-06",
                           "2019-03-07", "2019-03-08", "2019-03-09",
                           "2019-03-10", "2019-02-01", "2019-02-02",
                           "2019-02-03", "2019-02-04", "2019-02-05",
                           "2019-02-06", "2019-02-07", "2019-02-08",
                           "2019-02-09", "2019-02-10"),
                 values = 1:20,
                 group = sample(c("a", "b", "c"), 20, replace = TRUE)
)

library(dplyr)
library(lubridate)
library(tsibble)

df %>% 
    mutate(dates = ymd(dates)) %>% # convert text values into date class
    filter(dates <= "2019-03-05") %>% # filter by dates
    as_tsibble(key = id(group), index = dates) %>%
    group_by(group) %>% 
    index_by(year_month = yearmonth(dates)) %>% # monthly aggregates
    summarise(avg_value = mean(values))
#> # A tsibble: 6 x 3 [1M]
#> # Key:       group [3]
#>   group year_month avg_value
#>   <chr>      <mth>     <dbl>
#> 1 a       2019 feb      20  
#> 2 a       2019 mar       1.5
#> 3 b       2019 feb      16.2
#> 4 b       2019 mar       3  
#> 5 c       2019 feb      14  
#> 6 c       2019 mar       4.5

^{Created on 2019-03-13 by the reprex package (v0.2.1)}

KarmicDreamwork · March 15, 2019, 2:20am

Thank you. I'm trying to understand that code atm. I think I was able to make a proper reprex, though. Does this help? I tried to randomize it.
*my reply went to moderator or something so editing and hoping you can see it!

I need the average AQI for September-November per county per state (different states share same county names)
I need to know how many days or years of data there are for each county and hopefully be able to say only show me counties with over 100 (havent decided #) or so pieces of data to minimize error

Then I'll need to learn how to figure out how to find the difference between that info and a separate dataset for 2017 with the same type of data

data.frame(AQI = c(67L, 84L, 61L, 49L, 58L, 67L, 49L, 44L, 43L, 
    44L, 44L, 42L, 37L, 26L, 35L, 43L, 42L, 39L, 40L, 39L), State.Name = as.factor(c("Alabama", 
    "Alabama", "Alaska", "Alaska", "Washington", "Washington", 
    "Oregon", "Oregon", "Alabama", "Alabama", "Wyoming", "Wyoming", 
    "Wyoming", "Wyoming", "Wyoming", "Utah", "Washington", "Wyoming", 
    "Idaho", "Wyoming")), county.Name = as.factor(c("Autauga", 
    "Fayette", "Ada", "Fayette", "Clark", "Washington", "Clark", 
    "Cowlitz", "Clark", "Autauga", "Weston", "Fayette", "Weston", 
    "Cowlitz", "Weston", "Weston", "Weston", "Weston", "Clark", 
    "Weston")), Date = as.factor(c("1980-06-05", "1985-09-06", 
    "1989-04-07", "2007-11-08", "1980-10-09", "1990-11-10", "1980-10-11", 
    "2000-04-12", "1980-09-13", "1980-04-14", "2016-12-22", "2016-12-23", 
    "2016-12-24", "2012-10-25", "2013-12-26", "2010-12-27", "2016-12-28", 
    "2016-10-29", "2016-09-30", "2016-12-31")))
#>    AQI State.Name county.Name       Date
#> 1   67    Alabama     Autauga 1980-06-05
#> 2   84    Alabama     Fayette 1985-09-06
#> 3   61     Alaska         Ada 1989-04-07
#> 4   49     Alaska     Fayette 2007-11-08
#> 5   58 Washington       Clark 1980-10-09
#> 6   67 Washington  Washington 1990-11-10
#> 7   49     Oregon       Clark 1980-10-11
#> 8   44     Oregon     Cowlitz 2000-04-12
#> 9   43    Alabama       Clark 1980-09-13
#> 10  44    Alabama     Autauga 1980-04-14
#> 11  44    Wyoming      Weston 2016-12-22
#> 12  42    Wyoming     Fayette 2016-12-23
#> 13  37    Wyoming      Weston 2016-12-24
#> 14  26    Wyoming     Cowlitz 2012-10-25
#> 15  35    Wyoming      Weston 2013-12-26
#> 16  43       Utah      Weston 2010-12-27
#> 17  42 Washington      Weston 2016-12-28
#> 18  39    Wyoming      Weston 2016-10-29
#> 19  40      Idaho       Clark 2016-09-30
#> 20  39    Wyoming      Weston 2016-12-31

andresrcs · March 15, 2019, 2:35am

I guess this is a sample of your data but could you also paste a sample of your desired output? is not clear to me what exactly are you trying to accomplish.

andresrcs · March 15, 2019, 2:47am

Is this what you want?

library(lubridate)
library(dplyr)
library(tsibble)

df <- data.frame(AQI = c(67L, 84L, 61L, 49L, 58L, 67L, 49L, 44L, 43L, 44L, 44L, 42L,
                         37L, 26L, 35L, 43L, 42L, 39L, 40L, 39L),
                 Date = c("1980-06-05", "1985-09-06", "1989-04-07", "2007-11-08",
                          "1980-10-09", "1990-11-10", "1980-10-11", "2000-04-12",
                          "1980-09-13", "1980-04-14", "2016-12-22", "2016-12-23",
                          "2016-12-24", "2012-10-25", "2013-12-26", "2010-12-27", "2016-12-28",
                          "2016-10-29", "2016-09-30", "2016-12-31"),
                 State.Name = as.factor(c("Alabama", "Alabama", "Alaska", "Alaska",
                                          "Washington", "Washington", "Oregon", "Oregon",
                                          "Alabama", "Alabama", "Wyoming", "Wyoming", "Wyoming",
                                          "Wyoming", "Wyoming", "Utah", "Washington",
                                          "Wyoming", "Idaho", "Wyoming")),
                 county.Name = as.factor(c("Autauga", "Fayette", "Ada", "Fayette", "Clark",
                                           "Washington", "Clark", "Cowlitz", "Clark",
                                           "Autauga", "Weston", "Fayette", "Weston", "Cowlitz",
                                           "Weston", "Weston", "Weston", "Weston", "Clark",
                                           "Weston"))
)

df %>% 
    mutate(Date = ymd(Date)) %>% 
    as_tsibble(key = id(State.Name, county.Name), index = Date) %>% 
    group_by(State.Name, county.Name) %>%
    index_by(year_month = yearmonth(Date)) %>%
    summarise(mean_aqi = mean(AQI))
#> # A tsibble: 18 x 4 [1M]
#> # Key:       State.Name, county.Name [15]
#> # Groups:    State.Name [7]
#>    State.Name county.Name year_month mean_aqi
#>    <fct>      <fct>            <mth>    <dbl>
#>  1 Alabama    Autauga       1980 abr       44
#>  2 Alabama    Autauga       1980 jun       67
#>  3 Alabama    Clark         1980 sep       43
#>  4 Alabama    Fayette       1985 sep       84
#>  5 Alaska     Ada           1989 abr       61
#>  6 Alaska     Fayette       2007 nov       49
#>  7 Idaho      Clark         2016 sep       40
#>  8 Oregon     Clark         1980 oct       49
#>  9 Oregon     Cowlitz       2000 abr       44
#> 10 Utah       Weston        2010 dic       43
#> 11 Washington Clark         1980 oct       58
#> 12 Washington Washington    1990 nov       67
#> 13 Washington Weston        2016 dic       42
#> 14 Wyoming    Cowlitz       2012 oct       26
#> 15 Wyoming    Fayette       2016 dic       42
#> 16 Wyoming    Weston        2013 dic       35
#> 17 Wyoming    Weston        2016 oct       39
#> 18 Wyoming    Weston        2016 dic       40

KarmicDreamwork · March 15, 2019, 2:55am

Basically, I need a list of average AQI for counties per state for september-november

That code helps! my date piece in my dataset doesn't see it as date data, so it's been difficult to handle. I'll see if this works

I need the average AQI for September-November per county per state (different states share same county names)

I need to know how many days or years of data there are for each county and hopefully be able to say only show me counties with over 100 (havent decided #) or so pieces of data to minimize error

Then I'll need to learn how to figure out how to find the difference between that info and a separate dataset for 2017 with the same type of data but may be missing some counties since i think it differs for each year

andresrcs · March 15, 2019, 3:02am

All this sounds doable with the tidyverse you just have to apply what you are learning from the book I advised you to read before, give it a try and if you get stuck with something just ask.

KarmicDreamwork · March 15, 2019, 3:05am

That's great news. That's mostly what I was worried about. I'm working with the filter function right now. Thank you for all your help! I will post again when I need more help.

KarmicDreamwork · March 15, 2019, 9:17am

Okay, so I made a bit of progress. Now I'm struggling with weird errors for the 2017 dataset. The reprex should recreate the problem. I tested it and it showed the same error..

"1: Factor county.Name contains implicit NA, consider using forcats::fct_explicit_na
2: Factor county.Name contains implicit NA, consider using forcats::fct_explicit_na"

I tried using forcats but it didn't fix my problem, also tried dropping levels but with no luck.

Also, when I get this to work. How can I find the difference between the average AQI's between this and the other dataset? I know they will have different counties listed and so it won't match perfectly. How would I get around that problem?

My code:

#1. each county per state
#group_by(county.Name, State.Name) %>%
#2. only those with over 100 days calculated
#filter(n() > 100)
#3. Average AQI for each county
#summarise (AQIMean = mean (AQI, na.rm = TRUE))

#named dataframe below "attempt"

attempt1 <- attempt %>% 
  group_by(county.Name, State.Name) %>%
  filter(n() > 100) %>% 
  summarise (AQIMean = mean (AQI, na.rm = TRUE))

head(attempt1)

library(lubridate)
library(dplyr)
library(tsibble)
library(tidyverse)
data.frame(AQI = c(42L, 31L, 30L, 23L, 25L, 44L, 42L, 48L, 56L, 
    74L), year = c(2017, 2017, 2017, 2017, 2017, 2017, 2017, 
    2017, 2017, 2017), month = c(11, 11, 11, 11, 11, 11, 11, 
    11, 11, 11), day = c(1L, 2L, 3L, 4L, 5L, 26L, 27L, 28L, 29L, 
    30L), State.Name = as.factor(c("Oregon", "Oregon", "Oregon", 
    "Oregon", "Oregon", "Washington", "Washington", "Washington", 
    "Washington", "Washington")), county.Name = as.factor(c("Baker", 
    "Yakima", "Baker", "Baker", "Baker", "Yakima", "Baker", "Yakima", 
    "Yakima", "Yakima")), Category = as.factor(c("Good", "Good", 
    "Good", "Good", "Good", "Good", "Good", "Good", "Moderate", 
    "Moderate")))
#>    AQI year month day State.Name county.Name Category
#> 1   42 2017    11   1     Oregon       Baker     Good
#> 2   31 2017    11   2     Oregon      Yakima     Good
#> 3   30 2017    11   3     Oregon       Baker     Good
#> 4   23 2017    11   4     Oregon       Baker     Good
#> 5   25 2017    11   5     Oregon       Baker     Good
#> 6   44 2017    11  26 Washington      Yakima     Good
#> 7   42 2017    11  27 Washington       Baker     Good
#> 8   48 2017    11  28 Washington      Yakima     Good
#> 9   56 2017    11  29 Washington      Yakima Moderate
#> 10  74 2017    11  30 Washington      Yakima Moderate

KarmicDreamwork · March 15, 2019, 10:14am

My desired output is the average AQI per county per state between the months of September-November for 1980-2016

So, basically a list of the counties per state with average AQI

the dataset I have is for years 1980-2016, so I'll be comparing that with another dataset for year 2017 where I'll calculate the differences between the average AQI's per county and try to find a way to map it on a visual map where you can see highest or lowest differences.

I need per county per state since different states have the same county names

I'm studying impacts of wildfires on air quality, so I'm basically trying to map this to see differences in air quality index (AQI) during the time of the Eagle Creek Fire (Sept-Nov 2017) compared to the overall average. I'll do the same type of thing with PM2.5 data and climate data.

I hope that helps. I'm new to R so trying to learn it as i go.

mara · March 15, 2019, 10:23am

If you actually run the code below using reprex, it will show the output as well as the input — that's why it's so helpful. You can also assign the data to the name, etc. so that all anyone else has to do to recreate what you're seeing on their own machine. is copy and paste.

The code you have now doesn't include your library() calls (one of the key pieces of a reprex — see the guide @andresrcs linked to in his first reply), so if I run it, I get errors about not having the functions.

So, if I put all these pieces together, I'm able to reproduce what you're seeing, but the goal of the reprex is to help us help you!

library(tidyverse)

attempt <- data.frame(
           AQI = c(42L, 31L, 30L, 23L, 25L, 44L, 42L, 48L, 56L, 74L),
          year = c(2017, 2017, 2017, 2017, 2017, 2017, 2017, 2017, 2017, 2017),
         month = c(11, 11, 11, 11, 11, 11, 11, 11, 11, 11),
           day = c(1L, 2L, 3L, 4L, 5L, 26L, 27L, 28L, 29L, 30L),
    State.Name = as.factor(c("Oregon", "Oregon", "Oregon", "Oregon", "Oregon",
                             "Washington", "Washington", "Washington",
                             "Washington", "Washington")),
   county.Name = as.factor(c("Baker", "Yakima", "Baker", "Baker", "Baker",
                             "Yakima", "Baker", "Yakima", "Yakima", "Yakima")),
      Category = as.factor(c("Good", "Good", "Good", "Good", "Good", "Good",
                             "Good", "Good", "Moderate", "Moderate"))
)

attempt1 <- attempt %>% 
  group_by(county.Name, State.Name) %>%
  filter(n() > 100) %>% 
  summarise (AQIMean = mean (AQI, na.rm = TRUE))

head(attempt1)
#> Warning: Factor `county.Name` contains implicit NA, consider using
#> `forcats::fct_explicit_na`

#> Warning: Factor `county.Name` contains implicit NA, consider using
#> `forcats::fct_explicit_na`
#> # A tibble: 1 x 3
#> # Groups:   county.Name [1]
#>   county.Name State.Name AQIMean
#>   <fct>       <fct>        <dbl>
#> 1 <NA>        <NA>           NaN

^{Created on 2019-03-15 by the reprex package (v0.2.1.9000)}

andresrcs · March 15, 2019, 2:18pm

This is happening because you can't have more than 100 daily observations between 3 months so no County-State combination satisfies that condition, if you lower the bar your code works e.g.

attempt %>% 
    group_by(county.Name, State.Name) %>%
    filter(n() > 3) %>% 
    summarise (AQIMean = mean (AQI, na.rm = TRUE))
#> # A tibble: 2 x 3
#> # Groups:   county.Name [2]
#>   county.Name State.Name AQIMean
#>   <fct>       <fct>        <dbl>
#> 1 Baker       Oregon        30  
#> 2 Yakima      Washington    55.5

KarmicDreamwork · March 15, 2019, 10:29pm

Sorry about that! I edited my original reprex. I'll keep in mind to have that in the future

library(lubridate)
library(dplyr)
library(tsibble)
library(tidyverse)

I figured it out with the help of andresrcs, though.

Now I'm trying to figure out how to find the differences of the average AQI in two separate dataframes with different column sizes. Such as these (i don't think they have any needed packages?)

I want the difference in its own column so I can map it later. I will need to find a way to match lat/long with the county names so I can map the differences visually on a county map. I'm told I can do this in R but no idea yet.

data.frame(AQIMean = c(24.3663594470046, 37.8342696629214, 26.8333333333333), 
    county.Name = as.factor(c("Adams", "Asotin", "Baker")), State.Name = as.factor(c("Washington", 
        "Washington", "Oregon")))
#>    AQIMean county.Name State.Name
#> 1 24.36636       Adams Washington
#> 2 37.83427      Asotin Washington
#> 3 26.83333       Baker     Oregon

data.frame(AQIMean = c(24.3663594470046, 37.8342696629214, 26.8333333333333), 
    county.Name = as.factor(c("Adams", "Asotin", "Columbia")), 
    State.Name = as.factor(c("Washington", "Washington", "Oregon")))
#>    AQIMean county.Name State.Name
#> 1 24.36636       Adams Washington
#> 2 37.83427      Asotin Washington
#> 3 26.83333    Columbia     Oregon

KarmicDreamwork · March 17, 2019, 6:51am

Oh, duh. haha. thanks! I got it

Now I'm trying to figure out how to find the differences of the average AQI in two separate dataframes with different column sizes. Such as these

I want the difference in its own column so I can map it later. I will need to find a way to match lat/long with the county names so I can map the differences visually on a county map. I'm told I can do this in R but no idea yet.

data.frame(AQIMean = c(24.3663594470046, 37.8342696629214, 26.8333333333333), 
    county.Name = as.factor(c("Adams", "Asotin", "Baker")), State.Name = as.factor(c("Washington", 
        "Washington", "Oregon")))
#>    AQIMean county.Name State.Name
#> 1 24.36636       Adams Washington
#> 2 37.83427      Asotin Washington
#> 3 26.83333       Baker     Oregon

data.frame(AQIMean = c(24.3663594470046, 37.8342696629214, 26.8333333333333), 
    county.Name = as.factor(c("Adams", "Asotin", "Columbia")), 
    State.Name = as.factor(c("Washington", "Washington", "Oregon")))
#>    AQIMean county.Name State.Name
#> 1 24.36636       Adams Washington
#> 2 37.83427      Asotin Washington
#> 3 26.83333    Columbia     Oregon

KarmicDreamwork · March 17, 2019, 11:27am

Ah man, i guess I was tired, haha. Thank you!

So tackling two things now.

First, Is there a way I can see how much data (AQI for each county) was collected in each year in the 1980-2016 dataset? My concern is maybe most the data was collected in a few years like 1980-1990 versus the entire span.

Second (and more importantly), how can I take the difference for each average AQI between the dataframes (such as #1 and #2)? the counties are a little different. One has about 50 and the other has 53.

I want to keep the county names for each state with the differences added in its own column.

The goal is to find a way to add the lat/long to the table so I can make a map with the differences for each county which I'll be tackling next.

Such as data.frame #1 (1980-2016) (I don't think I need packages here??)

data.frame(AQIMean = c(24.3663594470046, 37.8342696629214, 26.8333333333333), 
    county.Name = as.factor(c("Adams", "Asotin", "Baker")), State.Name = as.factor(c("Washington", 
        "Washington", "Oregon")))
#>    AQIMean county.Name State.Name
#> 1 24.36636       Adams Washington
#> 2 37.83427      Asotin Washington
#> 3 26.83333       Baker     Oregon

data.frame #2 (2017)

data.frame(AQIMean = c(24.3663594470046, 37.8342696629214, 26.8333333333333), 
    county.Name = as.factor(c("Adams", "Asotin", "Columbia")), 
    State.Name = as.factor(c("Washington", "Washington", "Oregon")))
#>    AQIMean county.Name State.Name
#> 1 24.36636       Adams Washington
#> 2 37.83427      Asotin Washington
#> 3 26.83333    Columbia     Oregon

andresrcs · March 17, 2019, 4:04pm

You just have to aggregate by year and add a count

df %>% 
    mutate(Date = ymd(Date)) %>%
    filter(month(Date) %in% c(9, 10, 11)) %>%
    as_tsibble(key = id(State.Name, county.Name), index = Date) %>%
    group_by(State.Name, county.Name) %>%
    index_by(Year = year(Date)) %>%
    summarise(mean_AQI = mean(AQI),
              days = n())
#> # A tsibble: 9 x 5 [1Y]
#> # Key:       State.Name, county.Name [9]
#> # Groups:    State.Name [6]
#>   State.Name county.Name  Year mean_AQI  days
#>   <fct>      <fct>       <dbl>    <dbl> <int>
#> 1 Alabama    Clark        1980       43     1
#> 2 Alabama    Fayette      1985       84     1
#> 3 Alaska     Fayette      2007       49     1
#> 4 Idaho      Clark        2016       40     2
#> 5 Oregon     Clark        1980       49     1
#> 6 Washington Clark        1980       58     1
#> 7 Washington Washington   1990       67     1
#> 8 Wyoming    Cowlitz      2012       26     1
#> 9 Wyoming    Weston       2016       39     2

It's not clear to me what you want but this is how to join them and calculate difference when posibble

library(dplyr)

df1 <- data.frame(AQIMean = c(24.3663594470046, 37.8342696629214, 26.8333333333333), 
                  county.Name = as.factor(c("Adams", "Asotin", "Baker")), State.Name = as.factor(c("Washington", 
                                                                                                   "Washington", "Oregon")))

df2 <- data.frame(AQIMean = c(14.3663594470046, 17.8342696629214, 16.8333333333333), 
                  county.Name = as.factor(c("Adams", "Asotin", "Columbia")), 
                  State.Name = as.factor(c("Washington", "Washington", "Oregon")))
df1 %>% 
    full_join(df2, by = c("State.Name", "county.Name")) %>% 
    select(State.Name, county.Name, AQIMean_1980_2016 = AQIMean.x, AQIMean_2017 = AQIMean.y) %>% 
    mutate(difference = AQIMean_2017 - AQIMean_1980_2016)
#> Warning: Column `county.Name` joining factors with different levels,
#> coercing to character vector
#>   State.Name county.Name AQIMean_1980_2016 AQIMean_2017 difference
#> 1 Washington       Adams          24.36636     14.36636        -10
#> 2 Washington      Asotin          37.83427     17.83427        -20
#> 3     Oregon       Baker          26.83333           NA         NA
#> 4     Oregon    Columbia                NA     16.83333         NA

KarmicDreamwork · March 18, 2019, 4:30am

This worked like a charm, thank you so much. It's exactly what I needed.

Okay... so, last thing. I need to figure out how to show the differences visually on a county map. I believe I need to add lat/long to each county in the dataframe and figure out how to map it in R.

I found a promising website on how to do this with tidyverse and urbnmapr but when I tried to install.packages("urbnmapr"), I was met with:

"Warning in install.packages :
package ‘urbnmapr’ is not available (for R version 3.5.1)"

Do you know how to bind the county names to lat/long and map it in R with the AQI differences data? Or where I might find how to do this? I noticed the book you sent me noted they didn't have enough pages to go over mapping in R.

andresrcs · March 18, 2019, 4:57am

Actually I think that urbnmapr it's a good choice for you, but you need to install it from GitHub not from CRAN, try with this command

devtools::install_github(“UrbanInstitute/urbnmapr”)