There is a large NYTimes Covid data set at https://raw.githubusercontent.com/nytimes/covid-19-data/master/us-counties.csv. It's about 4 MB. I'm an old guy, can barely tie my shoes, but I badly need to get that sucker into a dataframe in RStudio. If anybody would show me how, I'd be forever grateful. Thanks
You can read CSV files into R by using the read.csv() function.
```r
data <- read.csv(file = "https://raw.githubusercontent.com/nytimes/covid-19-data/master/us-counties.csv")
head(data)
#>         date    county      state  fips cases deaths
#> 1 2020-01-21 Snohomish Washington 53061     1      0
#> 2 2020-01-22 Snohomish Washington 53061     1      0
#> 3 2020-01-23 Snohomish Washington 53061     1      0
#> 4 2020-01-24      Cook   Illinois 17031     1      0
#> 5 2020-01-24 Snohomish Washington 53061     1      0
#> 6 2020-01-25    Orange California  6059     1      0
```
Created on 2020-05-08 by the reprex package (v0.3.0)
You, sir, are a very kind man. I hope your parents are proud. But we have a saying: "No kind act goes unpunished." Now I wish to select a subset of "data": all entries for a given state, and a subset of counties c(county1, county2, etc.), in a dataframe with headings Date, State, County1, County2, ..., where each county column shows the total number of Covid-19 cases. I would be too proud to ask, but I suspect this code will be of great use to others as well. With gratitude, Allan
For these kinds of data manipulation tasks, I strongly recommend the dplyr package. You can learn more about it here.
The filter() verb is used to subset a data frame. In the example below, I've filtered for all records pertaining to Orange county, CA.
The second operation you're describing is a job for group_by() + summarise(). This will calculate the total number of cases for each state-county combination.
```r
library(dplyr, warn.conflicts = FALSE)
library(readr)

data <- read_csv(file = "https://raw.githubusercontent.com/nytimes/covid-19-data/master/us-counties.csv")
#> Parsed with column specification:
#> cols(
#>   date = col_date(format = ""),
#>   county = col_character(),
#>   state = col_character(),
#>   fips = col_character(),
#>   cases = col_double(),
#>   deaths = col_double()
#> )

filter(data, county == "Orange" & state == "California")
#> # A tibble: 103 x 6
#>    date       county state      fips  cases deaths
#>    <date>     <chr>  <chr>      <chr> <dbl>  <dbl>
#>  1 2020-01-25 Orange California 06059     1      0
#>  2 2020-01-26 Orange California 06059     1      0
#>  3 2020-01-27 Orange California 06059     1      0
#>  4 2020-01-28 Orange California 06059     1      0
#>  5 2020-01-29 Orange California 06059     1      0
#>  6 2020-01-30 Orange California 06059     1      0
#>  7 2020-01-31 Orange California 06059     1      0
#>  8 2020-02-01 Orange California 06059     1      0
#>  9 2020-02-02 Orange California 06059     1      0
#> 10 2020-02-03 Orange California 06059     1      0
#> # ... with 93 more rows

data %>%
  group_by(state, county) %>%
  summarise(total_cases = sum(cases))
#> # A tibble: 2,911 x 3
#> # Groups:   state [55]
#>    state   county   total_cases
#>    <chr>   <chr>          <dbl>
#>  1 Alabama Autauga         1073
#>  2 Alabama Baldwin         4088
#>  3 Alabama Barbour          775
#>  4 Alabama Bibb             917
#>  5 Alabama Blount           860
#>  6 Alabama Bullock          330
#>  7 Alabama Butler          1073
#>  8 Alabama Calhoun         2543
#>  9 Alabama Chambers        8073
#> 10 Alabama Cherokee         401
#> # ... with 2,901 more rows
```
Note: Unlike in my previous post, I've used read_csv() from the readr package instead of read.csv() to read the file. This has (among other things) some advantages when it comes to neatly displaying the contents of large data frames.
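To get the exact wide layout Allan described (a Date column plus one column per county), one more step is needed beyond filter(): spreading the county names into columns. A minimal sketch using tidyr's pivot_wider() might look like the following; the state and county names here are just examples, so swap in your own:

```r
library(dplyr, warn.conflicts = FALSE)
library(tidyr)
library(readr)

data <- read_csv("https://raw.githubusercontent.com/nytimes/covid-19-data/master/us-counties.csv")

# Keep one state and a chosen set of counties, then turn each county into its own column
data %>%
  filter(state == "California", county %in% c("Orange", "Los Angeles")) %>%
  select(date, state, county, cases) %>%
  pivot_wider(names_from = county, values_from = cases)
```

One caveat: the `cases` column in this data set is already a cumulative running total, so each county column will show the total cases to date on each day, which appears to be what was asked for.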
And here is my updated plot for the San Francisco Bay Area, where cases per day are going back up; reasons unclear.
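(The plot image itself is not shown here. For anyone wanting to reproduce something similar, a rough sketch follows: it derives daily new cases from the cumulative `cases` column with lag(), for an assumed example subset of Bay Area counties, and plots them with ggplot2.)

```r
library(dplyr, warn.conflicts = FALSE)
library(readr)
library(ggplot2)

data <- read_csv("https://raw.githubusercontent.com/nytimes/covid-19-data/master/us-counties.csv")

bay_area <- c("San Francisco", "Alameda", "Santa Clara")  # example counties only

data %>%
  filter(state == "California", county %in% bay_area) %>%
  group_by(county) %>%
  arrange(date, .by_group = TRUE) %>%
  mutate(new_cases = cases - lag(cases, default = 0)) %>%  # cumulative -> daily
  ggplot(aes(x = date, y = new_cases, colour = county)) +
  geom_line() +
  labs(title = "Daily new Covid-19 cases, selected Bay Area counties",
       x = NULL, y = "New cases per day")
```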
This topic was automatically closed 21 days after the last reply. New replies are no longer allowed.