Reading Covid-19 .csv files

There is a large NYTimes Covid data set at


It's about 4 MB.
I'm an old guy, can barely tie my shoes, but I badly need to get that sucker
into a data frame in RStudio.
If anybody could show me how, I'd be forever grateful.
Thanks

You can read CSV files into R by using the read.csv() function.

data <- read.csv(file = "https://raw.githubusercontent.com/nytimes/covid-19-data/master/us-counties.csv")

head(data)
#>         date    county      state  fips cases deaths
#> 1 2020-01-21 Snohomish Washington 53061     1      0
#> 2 2020-01-22 Snohomish Washington 53061     1      0
#> 3 2020-01-23 Snohomish Washington 53061     1      0
#> 4 2020-01-24      Cook   Illinois 17031     1      0
#> 5 2020-01-24 Snohomish Washington 53061     1      0
#> 6 2020-01-25    Orange California  6059     1      0

Created on 2020-05-08 by the reprex package (v0.3.0)

You sir, are a very kind man. I hope your parents are proud.
But we have a saying, "No kind act goes unpunished."
Now I wish to select a subset of "data": all entries for a given state
and a subset of its counties, c(county1, county2, etc.),
in a data frame with headings
Date State County1 County2 ...
with each County column showing the total number of Covid-19 cases.
I would be too proud to ask, but I suspect this code will be of great use to others as well.
With gratitude,
Allan

For these kinds of data manipulation tasks, I strongly recommend the dplyr package. You can learn more about it in the package documentation.

The filter() verb is used to subset a data frame. In the example below, I've filtered for all records pertaining to Orange County, CA.

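The second operation you're describing is a job for group_by() + summarise(), which computes one summary value for each state-county combination. One caveat: the cases column in the NYT file is a running cumulative count, so sum(cases) adds those running totals across every date. For a county's total to date, use its most recent value instead.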

library(dplyr, warn.conflicts = FALSE)
library(readr)

data <- read_csv(file = "https://raw.githubusercontent.com/nytimes/covid-19-data/master/us-counties.csv")
#> Parsed with column specification:
#> cols(
#>   date = col_date(format = ""),
#>   county = col_character(),
#>   state = col_character(),
#>   fips = col_character(),
#>   cases = col_double(),
#>   deaths = col_double()
#> )

filter(data, county == "Orange" & state == "California")
#> # A tibble: 103 x 6
#>    date       county state      fips  cases deaths
#>    <date>     <chr>  <chr>      <chr> <dbl>  <dbl>
#>  1 2020-01-25 Orange California 06059     1      0
#>  2 2020-01-26 Orange California 06059     1      0
#>  3 2020-01-27 Orange California 06059     1      0
#>  4 2020-01-28 Orange California 06059     1      0
#>  5 2020-01-29 Orange California 06059     1      0
#>  6 2020-01-30 Orange California 06059     1      0
#>  7 2020-01-31 Orange California 06059     1      0
#>  8 2020-02-01 Orange California 06059     1      0
#>  9 2020-02-02 Orange California 06059     1      0
#> 10 2020-02-03 Orange California 06059     1      0
#> # ... with 93 more rows

data %>% 
  group_by(state, county) %>% 
  summarise(total_cases = sum(cases))
#> # A tibble: 2,911 x 3
#> # Groups:   state [55]
#>    state   county   total_cases
#>    <chr>   <chr>          <dbl>
#>  1 Alabama Autauga         1073
#>  2 Alabama Baldwin         4088
#>  3 Alabama Barbour          775
#>  4 Alabama Bibb             917
#>  5 Alabama Blount           860
#>  6 Alabama Bullock          330
#>  7 Alabama Butler          1073
#>  8 Alabama Calhoun         2543
#>  9 Alabama Chambers        8073
#> 10 Alabama Cherokee         401
#> # ... with 2,901 more rows

Created on 2020-05-08 by the reprex package (v0.3.0)
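
For a county's total to date, here is a minimal sketch that takes the value on the most recent date in each group rather than summing. It assumes cases is a running count and that date was parsed as a date (as read_csv() does above). The names latest_cases and cases_to_date are made up for illustration and are not part of dplyr.

```r
library(dplyr, warn.conflicts = FALSE)

# Hypothetical helper (not from the thread): for each state-county pair,
# keep the cumulative case count reported on that pair's latest date.
latest_cases <- function(data) {
  data %>%
    group_by(state, county) %>%
    # which.max() returns the position of the most recent date in the group
    summarise(cases_to_date = cases[which.max(as.Date(date))]) %>%
    ungroup()
}

# Example call, assuming `data` is the data frame read in above:
# latest_cases(data)
```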

Note: Unlike in my previous post, I've used read_csv() from the readr package instead of read.csv() to read the file. This has (among other things) some advantages when it comes to neatly displaying the contents of large data frames.
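
The layout asked for earlier in the thread (Date, State, then one column per county) can be sketched by combining filter() with pivot_wider() from the tidyr package. This is only one possible approach. The helper name cases_wide and its arguments are hypothetical, and the county names in the example call are assumed to match the spelling used in the file.

```r
library(dplyr, warn.conflicts = FALSE)
library(tidyr)

# Hypothetical helper (not from the thread): one row per date,
# one column per selected county, holding the reported `cases` value.
cases_wide <- function(data, state_name, counties) {
  data %>%
    filter(state == state_name, county %in% counties) %>%
    select(date, state, county, cases) %>%
    # spread the county names into column headings
    pivot_wider(names_from = county, values_from = cases) %>%
    arrange(date)
}

# Example call, assuming `data` is the tibble read with read_csv() above:
# cases_wide(data, "California", c("Orange", "Los Angeles"))
```

Dates on which a county has no row in the file come out as NA in that county's column. Since the values are copied straight from cases, each cell is the cumulative count reported for that date.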

And here is my updated plot for the San Francisco Bay Area,
where cases per day are going back up, reasons unclear.


Thanks again, Allan
