New, calculated column from Date column in dataframe returning NAs

Hello, I am fairly new to R & RStudio. I'm reworking some code developed by another person, trying to update a dataset and re-run the same analyses. The code ran just fine previously but for some reason now it is returning NAs when it is supposed to create a new column called Days based on subtracting the Date column from a calculated StartDate column. It looks like the StartDate is not correct. The minimum of Date should be 4/1/2020, not 1/1/2021. I think the Date column needs to be in Date format = Year-month-day, maybe?

Any help would be much appreciated!

data.frame(
stringsAsFactors = FALSE,
Date = c("4/1/2020","4/2/2020",
"4/3/2020","4/4/2020","4/5/2020","4/6/2020","4/7/2020",
"4/8/2020","4/9/2020","4/10/2020"),
Tests = c(13, 15, 12, 5, 9, 25, 23, 13, 30, 35),
Positives = c(0, 1, 0, 0, 0, 0, 0, 0, 0, 1),
Covid_Admits = c(1, 0, 0, 0, 0, 0, 1, 0, 0, 0)
)

# Load needed libraries
library(readxl)
library(dplyr)
library(tidyr)
library(effects)
library(zoo)
#> 
#> Attaching package: 'zoo'
#> The following objects are masked from 'package:base':
#> 
#>     as.Date, as.Date.numeric
library(Hmisc)


# Read the excel file into data frame.
covid_data <- read_excel(path="C:/Users/csk4117/Documents/Working files/Covid_Predictions_Dataset4.xlsx",sheet="Data_Admits")

# Create a subset with select columns
COVID19_Data <- select(covid_data, "Date", "Tests", "Positives", "Covid_Admits")
print(head(COVID19_Data,10))
#> # A tibble: 10 x 4
#>    Date      Tests Positives Covid_Admits
#>    <chr>     <dbl>     <dbl>        <dbl>
#>  1 4/1/2020     13         0            1
#>  2 4/2/2020     15         1            0
#>  3 4/3/2020     12         0            0
#>  4 4/4/2020      5         0            0
#>  5 4/5/2020      9         0            0
#>  6 4/6/2020     25         0            0
#>  7 4/7/2020     23         0            1
#>  8 4/8/2020     13         0            0
#>  9 4/9/2020     30         0            0
#> 10 4/10/2020    35         1            0

# Get the starting date
COVID19_Data$StartDate <- min(COVID19_Data$Date)

# Get the days from the start
COVID19_Data$Day <- as.Date(as.character(COVID19_Data$Date), format="%Y-%m-%d")-
  as.Date(as.character(COVID19_Data$StartDate), format="%Y-%m-%d")

# Checking my data
print(head(COVID19_Data))
#> # A tibble: 6 x 6
#>   Date     Tests Positives Covid_Admits StartDate Day    
#>   <chr>    <dbl>     <dbl>        <dbl> <chr>     <drtn> 
#> 1 4/1/2020    13         0            1 1/1/2021  NA days
#> 2 4/2/2020    15         1            0 1/1/2021  NA days
#> 3 4/3/2020    12         0            0 1/1/2021  NA days
#> 4 4/4/2020     5         0            0 1/1/2021  NA days
#> 5 4/5/2020     9         0            0 1/1/2021  NA days
#> 6 4/6/2020    25         0            0 1/1/2021  NA days

plot(COVID19_Data$Day, COVID19_Data$Tests, main = "Tests by Date",
     ylab = "Tests",
     xlab = "Test Day")
#> Warning in min(x): no non-missing arguments to min; returning Inf
#> Warning in max(x): no non-missing arguments to max; returning -Inf
#> Error in plot.window(...): need finite 'xlim' values


Created on 2021-04-13 by the reprex package (v0.3.0)
1 Like

The dates are all typeof character, which doesn’t support date arithmetic. See the vignette for the {lubridate} package for the mdy() function to convert to typeof dttm and the section on date arithmetic

I think your main issue is that you are using a format string that doesn't match the character format that your Date column is in. You are using %Y-%m-%d, but your dates are in a format of %m/%d/%Y!

Also, you need to convert your dates to a Date vector before you compute the min() of the column, otherwise that won't work correctly.

So, you might need something like this:

df <- tibble::tibble(
  Date = c(
    "4/1/2020","4/2/2020", "4/3/2020",
    "4/4/2020","4/5/2020","4/6/2020","4/7/2020",
    "4/8/2020","4/9/2020","4/10/2020"
  ),
  Tests = c(13, 15, 12, 5, 9, 25, 23, 13, 30, 35),
  Positives = c(0, 1, 0, 0, 0, 0, 0, 0, 0, 1),
  Covid_Admits = c(1, 0, 0, 0, 0, 0, 1, 0, 0, 0)
)

# Date is a character column
head(df, n = 2)
#> # A tibble: 2 x 4
#>   Date     Tests Positives Covid_Admits
#>   <chr>    <dbl>     <dbl>        <dbl>
#> 1 4/1/2020    13         0            1
#> 2 4/2/2020    15         1            0

# Convert Date character column to Date
# (Notice my format string is different from yours!)
df$Date <- as.Date(df$Date, format = "%m/%d/%Y")

# Date is now a date column
head(df, n = 2)
#> # A tibble: 2 x 4
#>   Date       Tests Positives Covid_Admits
#>   <date>     <dbl>     <dbl>        <dbl>
#> 1 2020-04-01    13         0            1
#> 2 2020-04-02    15         1            0

# Get the starting date
start_date <- min(df$Date)
start_date
#> [1] "2020-04-01"

# Get the days from the start
df$Day <- df$Date - start_date

df
#> # A tibble: 10 x 5
#>    Date       Tests Positives Covid_Admits Day   
#>    <date>     <dbl>     <dbl>        <dbl> <drtn>
#>  1 2020-04-01    13         0            1 0 days
#>  2 2020-04-02    15         1            0 1 days
#>  3 2020-04-03    12         0            0 2 days
#>  4 2020-04-04     5         0            0 3 days
#>  5 2020-04-05     9         0            0 4 days
#>  6 2020-04-06    25         0            0 5 days
#>  7 2020-04-07    23         0            1 6 days
#>  8 2020-04-08    13         0            0 7 days
#>  9 2020-04-09    30         0            0 8 days
#> 10 2020-04-10    35         1            0 9 days

Created on 2021-04-13 by the reprex package (v1.0.0)

3 Likes

Yes! That did it. Thanks. I thought that was the issue and was trying to change it but had the format string wrong.

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.