difftime not working with NA values

I am brand new to R/RStudio and am stumbling over how to deal with NA values.
Apologies in advance if post is not properly formatted, or if I've made other newbie mistakes.
Thanks for any suggestions (simplified explanations much appreciated!):

Here's a simplified example of the problem:

##########################################
# Test data
##########################################
df = data.frame(
  DT1 = c(NA, NA, NA, "2020-01-01 0900", "2020-01-02 0915", "2020-01-03 0930"), 
  DT2 = c("2020-01-01 0900", "2020-01-01 0900", "2020-02-01 1000", "2020-01-01 1000", "2020-01-02 1100", "2020-01-03 1200"),
  stringsAsFactors = FALSE
)
##########################################
# Convert to POSIXct
##########################################
library(lubridate)  # provides ymd_hm()
df$DT1 <- ymd_hm(df$DT1)
df$DT2 <- ymd_hm(df$DT2)
df
##########################################
# Try to use difftime() to calculate elapsed time
##########################################
for(i in 1:nrow(df)){
  if(!is.na(df$DT1[i])) {df$TimeElapsedDays[i] <- difftime(df$DT2[i], df$DT1[i], units = c("days"))}
}

df before last command looks ok:

                 DT1                 DT2
1                <NA> 2020-01-01 09:00:00
2                <NA> 2020-01-01 09:00:00
3                <NA> 2020-02-01 10:00:00
4 2020-01-01 09:00:00 2020-01-01 10:00:00
5 2020-01-02 09:15:00 2020-01-02 11:00:00
6 2020-01-03 09:30:00 2020-01-03 12:00:00

Here's the error:

Error in `$<-.data.frame`(`*tmp*`, "TimeElapsedDays", value = c(NA, NA,  : 
  replacement has 4 rows, data has 6

Your problem is caused by trying to construct TimeElapsedDays one element at a time when the column does not exist yet.
Initialising it with NA first would work:

df$TimeElapsedDays <- NA
for (i in 1:nrow(df)) {
  if (!is.na(df$DT1[i])) {
    df$TimeElapsedDays[i] <- difftime(df$DT2[i], df$DT1[i], units = c("days"))
  }
}

Alternatively, using the tidyverse:

library(tidyverse)
# mutate() returns a new data frame, so assign it back to keep the column
df <- mutate(df,
             TimeElapsedDays = difftime(DT2, DT1, units = "days"))

This is great! Thanks much for the quick and helpful reply.

Might I impose on your kindness to help me understand this? I have the following questions:

  1. Why didn't constructing TimeElapsedDays work one element at a time? Is it because, since DT1[1] to DT1[3] were NA, the first 3 entries of TimeElapsedDays were left open, kind of indeterminate?

  2. Why did mutate work? Or, perhaps, what is special about mutate?

  3. Why is "<-" used sometimes [as in your first solution], vs "=" [as in your second]?

Feel free just to point me at some reading.
Thanks again! Got me unstuck!

It's really just 'how R works'.
If you wrote

abc[5] <- 1

this would fail, as there is no abc to set the 5th element of.
You can initialise it first, like

zzz <- NULL
# then
(zzz[5] <- 1) # would work
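
To spell out what that means for the original loop, here is a small sketch. The data frame d and the column name new are made up for illustration; they are not the poster's data. Assigning to position 4 of something that doesn't exist yet gives a vector of length 4, and a length-4 vector can't become a column of a 6-row data frame. That appears to be where the "replacement has 4 rows, data has 6" message comes from.

```r
x <- NULL
x[4] <- 1          # R pads the missing positions 1 to 3 with NA
x                  # NA NA NA 1
length(x)          # 4

d <- data.frame(a = 1:6)   # hypothetical 6-row data frame

# d$new does not exist, so d$new[4] <- 1 first builds a length-4 vector
# and then tries to store it as a column of a 6-row data frame.
msg <- tryCatch({
  d$new[4] <- 1
  "no error"
}, error = function(e) conditionMessage(e))
msg                # the captured error message
```

Creating the column first (for example with NA, as in the fix above) means every element already exists, so the loop only ever overwrites a single value in place.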

mutate is a function that's been designed to work the way it does; I can't say more than that. It's a tool: if you like what it does, and how it does it, use it :)

<- is the conventional choice when assigning objects. = is necessary when assigning a value to a parameter name. You could actually use <- inside the mutate, though it would look weird and result in a 'bad' name for the column (i.e. the column name would be the expression). mutate encourages the use of = within it. This makes it look like passing parameters to a typical function, and keeps it distinct from conventional do-it-yourself assignment.
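
The same difference shows up in base R, so here is a rough illustration using data.frame() rather than mutate(), so it runs without any packages. The names a, b, d1 and d2 are made up for the example.

```r
d1 <- data.frame(a = 1:3)    # `=` names the argument, so the column is called "a"
d2 <- data.frame(b <- 1:3)   # `<-` assigns b in your workspace, then passes the value unnamed

names(d1)     # "a"
names(d2)     # a name built from the text of the expression, not "b"
exists("b")   # TRUE: the assignment happened as a side effect
```

So inside a function call, = means "this is the value for that parameter", while <- still means "create an object", which is rarely what you want there.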

Welcome! Most functions in R are "vectorized", so you don't have to construct a loop. Mutate isn't special. You can simply subtract one column from another and create a new column at the same time. Plain old subtraction works because there is a subtraction method for datetime objects that creates a difftime object automatically. The `difftime()` function gives you finer control over the result, as you show by setting the units of the result to days.

> df$diff <- df$DT2 - df$DT1 # That was easy!
> df
# A tibble: 6 x 3
  DT1                 DT2                 diff      
  <dttm>              <dttm>              <drtn>    
1 NA                  2020-01-01 09:00:00   NA hours
2 NA                  2020-01-01 09:00:00   NA hours
3 NA                  2020-02-01 10:00:00   NA hours
4 2020-01-01 09:00:00 2020-01-01 10:00:00 1.00 hours
5 2020-01-02 09:15:00 2020-01-02 11:00:00 1.75 hours
6 2020-01-03 09:30:00 2020-01-03 12:00:00 2.50 hours
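
If you want days rather than the automatically chosen hours, difftime() is vectorized in the same way. Here is a minimal base R sketch. The vectors dt1 and dt2 are shortened, made-up stand-ins for the two columns, and tz = "UTC" is an assumption to keep the example reproducible.

```r
fmt <- "%Y-%m-%d %H:%M"
dt1 <- as.POSIXct(c(NA, "2020-01-01 09:00", "2020-01-02 09:15"),
                  format = fmt, tz = "UTC")
dt2 <- as.POSIXct(c("2020-01-01 09:00", "2020-01-01 10:00", "2020-01-02 11:00"),
                  format = fmt, tz = "UTC")

# One call handles the whole vector; NA inputs simply give NA results
elapsed <- difftime(dt2, dt1, units = "days")
elapsed

# Drop the difftime class when you just want plain numbers
as.numeric(elapsed)                    # in days
as.numeric(elapsed, units = "hours")   # same values converted to hours
```

No loop and no is.na() check is needed, because missing values propagate through the calculation on their own.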

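As an aside, stringsAsFactors in your data.frame() call is an argument rather than a column: it stops character columns from being converted to factors. From R 4.0.0 onward FALSE is already the default, so it isn't doing anything there. In older versions, you might customarily use options(stringsAsFactors = FALSE) to start your session.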

Thank you, both!! Very helpful.
I knew R was 'vectorized' but theory vs practice is a bit of a leap.
And, there is an old dog/new tricks challenge.

Anyway, I much appreciate your replies. Suspect I will be back!
