There is a large NYTimes Covid data set at
It's about 4 MB.
I'm an old guy who can barely tie my shoes, but I badly need to get that sucker
into a data frame in RStudio.
If anybody would show me how, I'd be forever grateful.
Thanks
You can read CSV files into R by using the read.csv() function.
data <- read.csv(file = "https://raw.githubusercontent.com/nytimes/covid-19-data/master/us-counties.csv")
head(data)
#> date county state fips cases deaths
#> 1 2020-01-21 Snohomish Washington 53061 1 0
#> 2 2020-01-22 Snohomish Washington 53061 1 0
#> 3 2020-01-23 Snohomish Washington 53061 1 0
#> 4 2020-01-24 Cook Illinois 17031 1 0
#> 5 2020-01-24 Snohomish Washington 53061 1 0
#> 6 2020-01-25 Orange California 6059 1 0
Created on 2020-05-08 by the reprex package (v0.3.0)
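One detail worth noticing in the output above: the fips code for Orange, California prints as 6059, which suggests read.csv() guessed the column was a number and dropped a leading zero. A minimal sketch of one way around that, assuming the raw file stores the code as 06059. The rows are modelled on the output above and held in a string so the example runs without a download; the names csv_text, as_number and counties are just for illustration.

```r
# Two rows modelled on the output above, kept in a string so the
# example runs without a network connection.
csv_text <- "date,county,state,fips,cases,deaths
2020-01-24,Cook,Illinois,17031,1,0
2020-01-25,Orange,California,06059,1,0"

# Default behaviour: fips is guessed to be an integer, so 06059 becomes 6059.
as_number <- read.csv(text = csv_text)

# colClasses can force a named column to stay a character string.
counties <- read.csv(text = csv_text, colClasses = c(fips = "character"))

as_number$fips  # 17031 6059
counties$fips   # "17031" "06059"
```

The same colClasses argument should work with the file = URL call shown earlier.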
You, sir, are a very kind man. I hope your parents are proud.
But we have a saying: "No kind act goes unpunished."
Now I wish to select a subset of "data": all entries for a given state
and a subset of its counties, c(county1, county2, etc.),
as a data frame with the headings
Date State County1 County2 ...
where each County column shows the total number of Covid-19 cases.
I would be too proud to ask, but I suspect this code will be of great use to others as well.
With gratitude,
Allan
For these kinds of data manipulation tasks, I strongly recommend the dplyr package. You can learn more about it here.
The filter() verb is used to subset a data frame. In the example below, I've filtered for all records pertaining to Orange County, CA.
The second operation you're describing is a job for group_by() + summarise(). In the example below, this adds up the cases column for each state-county combination.
library(dplyr, warn.conflicts = FALSE)
library(readr)
data <- read_csv(file = "https://raw.githubusercontent.com/nytimes/covid-19-data/master/us-counties.csv")
#> Parsed with column specification:
#> cols(
#> date = col_date(format = ""),
#> county = col_character(),
#> state = col_character(),
#> fips = col_character(),
#> cases = col_double(),
#> deaths = col_double()
#> )
filter(data, county == "Orange" & state == "California")
#> # A tibble: 103 x 6
#> date county state fips cases deaths
#> <date> <chr> <chr> <chr> <dbl> <dbl>
#> 1 2020-01-25 Orange California 06059 1 0
#> 2 2020-01-26 Orange California 06059 1 0
#> 3 2020-01-27 Orange California 06059 1 0
#> 4 2020-01-28 Orange California 06059 1 0
#> 5 2020-01-29 Orange California 06059 1 0
#> 6 2020-01-30 Orange California 06059 1 0
#> 7 2020-01-31 Orange California 06059 1 0
#> 8 2020-02-01 Orange California 06059 1 0
#> 9 2020-02-02 Orange California 06059 1 0
#> 10 2020-02-03 Orange California 06059 1 0
#> # ... with 93 more rows
data %>%
  group_by(state, county) %>%
  summarise(total_cases = sum(cases))
#> # A tibble: 2,911 x 3
#> # Groups: state [55]
#> state county total_cases
#> <chr> <chr> <dbl>
#> 1 Alabama Autauga 1073
#> 2 Alabama Baldwin 4088
#> 3 Alabama Barbour 775
#> 4 Alabama Bibb 917
#> 5 Alabama Blount 860
#> 6 Alabama Bullock 330
#> 7 Alabama Butler 1073
#> 8 Alabama Calhoun 2543
#> 9 Alabama Chambers 8073
#> 10 Alabama Cherokee 401
#> # ... with 2,901 more rows
Created on 2020-05-08 by the reprex package (v0.3.0)
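One caveat about the totals above: the cases column looks like a running total (Orange, California stays at 1 for ten straight days), so adding it up across dates would count the same cases many times. Here is a sketch of an alternative, assuming cases really is cumulative, that keeps each county's most recent value instead. The toy tibble and its numbers are made up for illustration.

```r
library(dplyr, warn.conflicts = FALSE)

# Made-up data: two counties, three days each, cumulative case counts.
toy <- tibble(
  date   = as.Date(c("2020-05-01", "2020-05-02", "2020-05-03",
                     "2020-05-01", "2020-05-02", "2020-05-03")),
  county = c("A", "A", "A", "B", "B", "B"),
  state  = "X",
  cases  = c(10, 12, 15, 3, 3, 4)
)

# For each state-county pair, keep the cases value from the latest date.
latest <- toy %>%
  group_by(state, county) %>%
  summarise(total_cases = cases[which.max(date)]) %>%
  ungroup()

latest
```

With the real data, replacing toy with data should give one row per county holding its latest reported count.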
Note: Unlike in my previous post, I've used read_csv() from the readr package instead of read.csv() to read the file. Among other things, read_csv() returns a tibble, which prints only the first ten rows and shows each column's type, so large data frames display neatly.
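For the layout Allan asked for (Date, State, then one column per county), one possible approach is filter() followed by pivot_wider() from the tidyr package. This is only a sketch on a made-up toy tibble so that it runs offline; with the real data you would swap in data, your own state, and your own county names. The names toy, wanted and wide are hypothetical, and pivot_wider() needs tidyr 1.0.0 or later.

```r
library(dplyr, warn.conflicts = FALSE)
library(tidyr)

# Made-up data: three counties in one state, two days each.
toy <- tibble(
  date   = as.Date(c("2020-05-01", "2020-05-01", "2020-05-01",
                     "2020-05-02", "2020-05-02", "2020-05-02")),
  county = c("Orange", "Marin", "Kern", "Orange", "Marin", "Kern"),
  state  = "California",
  cases  = c(100, 20, 50, 110, 22, 55)
)

# The counties to keep as columns.
wanted <- c("Orange", "Marin")

wide <- toy %>%
  filter(state == "California", county %in% wanted) %>%  # one state, chosen counties
  select(date, state, county, cases) %>%                 # drop columns not needed
  pivot_wider(names_from = county, values_from = cases)  # one column per county

wide
```

Each county column then holds the cases value reported for that date, one row per date.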
And here is my updated plot for the San Francisco Bay Area,
where cases per day are going back up, reasons unclear.
Thanks again, Allan