Compute continuous running days of a set of engines

dplyr
lubridate
tibble

#1

I need to find the days of continuous operations for a set of engines. Since I have sensor measurements for these engines at more or less regular intervals in time, the idea is to look for wide gaps in the sensor measurements. If the gaps between sensor measurements are short, we carry on and continue incrementing the number of running days - the environment in which these engines operate is very hostile, so some missed measurements every now and then are to be expected. However, if the time between two sensor measurements is larger than a user-chosen threshold (1 day in the example below), we consider that to be a failure, and we restart computing running days from 0. Here is an example to illustrate the problem: as always, you don't need to go into the details of how I generated the data set, you can just look at the input and desired output.

library(lubridate)
#> 
#> Attaching package: 'lubridate'
#> The following object is masked from 'package:base':
#> 
#>     date
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:lubridate':
#> 
#>     intersect, setdiff, union
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library(tibble)

set.seed(3)

# build input dataframe
create_input_df <- function(){
  m <- 3
  a_day <- 6
  n <- a_day*10
  lvl <- paste0("engine_", LETTERS[1:m])
  end_date <- as_datetime("2018-09-13 19:26:29")
  start_date <- end_date - n * hours(4)
  date_time <- seq(start_date, end_date, length.out = n)
  # short stops don't restart the running days count
  short_stops <- sample(seq_len(n), 5)
  # long stops, however, do
  medium_stop <- sample(seq_len(n), 1)
  medium_stop <- rep(medium_stop, each = a_day) + (-3:2)
  long_stop <- seq(30,40)
  # merge stop indices
  index <- sort(unique(c(short_stops, medium_stop, long_stop)))
  # remove the rows corresponding to the stops
  date_time <- date_time[-index]

  # build data frame
  n <- length(date_time)
  ntot <- n * m
  engines <- factor(rep(lvl, each = n), levels = lvl)
  x <- runif(ntot)
  y <- rnorm(ntot)
  my_df <- data_frame(engines, date_time = rep(date_time, m), x, y) 
  return(list(my_df, start_date))
  
}

foo <- create_input_df()
input_df <- foo[[1]]
start_date <- foo[[2]]
rm(foo)

# this is the data frame I would like to generate, with the added
# run_days column
output_df <- input_df %>%
  group_by(engines) %>%
  rowid_to_column() %>%
  mutate(run_days = as.integer(floor(case_when(
            rowid < 14 ~ (date_time - start_date) / ddays(1),
            rowid < 20 ~ (date_time - date_time[14]) / ddays(1),
            rowid >= 20 ~ (date_time - date_time[20]) / ddays(1))))) %>%
  ungroup

output_df
#> # A tibble: 135 x 6
#>    rowid engines  date_time               x      y run_days
#>    <int> <fct>    <dttm>              <dbl>  <dbl>    <int>
#>  1     1 engine_A 2018-09-03 19:26:29 0.125 -0.470        0
#>  2     2 engine_A 2018-09-03 23:30:33 0.295  1.09         0
#>  3     3 engine_A 2018-09-04 03:34:37 0.578 -0.621        0
#>  4     4 engine_A 2018-09-04 07:38:41 0.631  0.958        0
#>  5     5 engine_A 2018-09-04 11:42:45 0.512 -0.846        0
#>  6     6 engine_A 2018-09-04 15:46:49 0.505 -1.15         0
#>  7     7 engine_A 2018-09-04 19:50:53 0.534 -0.498        1
#>  8     8 engine_A 2018-09-04 23:54:57 0.557 -0.636        1
#>  9     9 engine_A 2018-09-05 03:59:01 0.868 -1.25         1
#> 10    10 engine_A 2018-09-05 08:03:05 0.830 -1.77         1
#> # ... with 125 more rows

Created on 2018-09-14 by the reprex package (v0.2.0).

Question: how can I generate the column run_days automatically, without of course inspecting the dataframe manually? The solution should be as fast as possible, because the actual dataframe is about 10^6 x 100, so relatively big.


#2

The lubridate Duration function-reference section is definitely your friend for this one.

I'm sure there's a faster way of computing this, but, for starters, I'd get the previous and/or next date_time for each recording by engine.:

output_df %>%
  group_by(engines) %>%
  mutate(next_dt = lead(date_time, order_by = engines),
         prev_dt = lag(date_time, order_by = engines))

You could then set your threshold based on a difftime, something to the effect of:

dt_diff = next_dt - date_time

The question then becomes a matter of what you mean by "days." Do you care about the calendar date or the duration of time? You'll be able to compute it either way, it just changes your specific calculation.


#3

Heeey @mara, thanks a lot for your answer! I had a look at the help for lead and lag and I understand that one finds the following value in the vector (the appropriately named next_dt) and the other the preceding value prev_dt. Concerning days: I care about the duration of time, not the calendar date (my final goal is to build a TTE (Time to Event) model). I'm not interested in fraction of days because, unlike my simple example where continuous operation only lasts a few days, in my real problem it can easily last hundreds of days. Thus the use of the floor function in building my output data frame. So I guess I could define dt_diff as

dt_diff = floor((next_dt - date_time)/ddays(1))

And I could define a stop_threshold <- 1. But now, how do I restart the run_days counter when dt_diff > stop_threshold? I guess some mutate=ifelse(...) magic, but I'm not sure exactly how would I go about this....Maybe something like

mutate(run_days = 0,
       run_days = ifelse(dt_diff < stop_threshold, cumsum(dt_diff), 0))

Something like this?


#4

Yeah, you basically want a "conditional restart" (I don't know if that's a real term, but it's how I think of it). See, for example, the answer in this SO thread:


#5

Thanks, I'll have a look at it in the next days and let you know if it solves my issues.


#6

Hi @mara, I made some progress but I can't just nail it and I'm getting crazy. I simplified the example, so that it's easier for follow for someone else, and I think I only need one little push to solve it. Should I post another question, or shall I just add a reply including my new reprex?


#7

If it's a continuation of this question, I'd add a reply with your simplified reprex.


#8

Ok, I considerably simplified my Minimal Reproducible Example:

  • only 2 engines, the minimum necessary
  • simpler/more readable dates: 10 days of operation for both engines, same time stamps for both engines.
  • At a sampling rate of 3 samples/day, 10 days of operation would make 30 observations for engine, but we have 10 skipped samples -> 20 observations for engine. The shorter data frame makes it easier to double check the solution by hand
  • "days" here indicates a period of time (not calendar dates), rounded down to the closest integer (as indicated by the floor function)
  • the days counter should be only reset on rows 12 (>1 day break), 17 (>2 days break), 21 (change from engine A to engine B), 32 and 37 (same dates as for engine A, but for engine B)

Here's the calculated_output and the expected_output: as you can see, they don't correspond

library(lubridate, warn.conflicts = FALSE)
library(dplyr, warn.conflicts = FALSE)
library(tibble, warn.conflicts = FALSE)
# tibble printing options
options(tibble.print_min = 100, tibble.width = Inf)

# build input dataframe
create_input_df <- function(){
  m <- 2
  a_day <- 3
  n <- a_day*10
  lvl <- paste0("engine_", LETTERS[1:m])
  start_date <- as_datetime("2018-09-13 00:00:00")
  end_date <- start_date + n * hours(8)
  date_time <- seq(start_date, end_date, length.out = n)
  # short stops don't restart the running days count
  short_stops <- seq(2, n, by = n/5)
  # long stops, however, do
  center <- floor(n/2)
  medium_stop <- seq(center-1, center)
  long_stop <- seq(22,26)
  # merge stop indices
  index <- sort(unique(c(short_stops, medium_stop, long_stop)))
  # remove the rows corresponding to the stops
  date_time <- date_time[-index]

  # build data frame
  n <- length(date_time)
  ntot <- n * m
  engines <- factor(rep(lvl, each = n), levels = lvl)
  x <- runif(ntot)
  y <- rnorm(ntot)
  my_df <- data_frame(engines, date_time = rep(date_time, m), x, y) 
  return(list(my_df, start_date))
  
}

foo <- create_input_df()
input_df <- foo[[1]]
start_date <- foo[[2]]
rm(foo)

stop_threshold <- 1
calculated_output <- input_df %>%
  group_by(engines) %>%
  mutate(prev_dt = lag(date_time, order_by = engines),
         dt_diff = ifelse(is.na(date_time - prev_dt), 0, (date_time - prev_dt)/ddays(1))) %>%
  group_by(engines, grp = cumsum(dt_diff > stop_threshold)) %>% 
  mutate(run_days = floor(cumsum(dt_diff))) %>%
  ungroup
calculated_output
#> # A tibble: 40 x 8
#>    engines  date_time                 x       y prev_dt             dt_diff
#>    <fct>    <dttm>                <dbl>   <dbl> <dttm>                <dbl>
#>  1 engine_A 2018-09-13 00:00:00 0.196   -1.30   NA                    0    
#>  2 engine_A 2018-09-13 16:33:06 0.0333  -0.108  2018-09-13 00:00:00   0.690
#>  3 engine_A 2018-09-14 00:49:39 0.693   -1.29   2018-09-13 16:33:06   0.345
#>  4 engine_A 2018-09-14 09:06:12 0.735    1.56   2018-09-14 00:49:39   0.345
#>  5 engine_A 2018-09-14 17:22:45 0.651   -2.11   2018-09-14 09:06:12   0.345
#>  6 engine_A 2018-09-15 01:39:18 0.837   -0.152  2018-09-14 17:22:45   0.345
#>  7 engine_A 2018-09-15 18:12:24 0.492   -0.189  2018-09-15 01:39:18   0.690
#>  8 engine_A 2018-09-16 02:28:57 0.664   -0.553  2018-09-15 18:12:24   0.345
#>  9 engine_A 2018-09-16 10:45:31 0.180    0.600  2018-09-16 02:28:57   0.345
#> 10 engine_A 2018-09-16 19:02:04 0.0507  -0.697  2018-09-16 10:45:31   0.345
#> 11 engine_A 2018-09-17 03:18:37 0.837   -0.326  2018-09-16 19:02:04   0.345
#> 12 engine_A 2018-09-18 04:08:16 0.0239   0.0353 2018-09-17 03:18:37   1.03 
#> 13 engine_A 2018-09-18 12:24:49 0.237   -0.195  2018-09-18 04:08:16   0.345
#> 14 engine_A 2018-09-18 20:41:22 0.941   -0.170  2018-09-18 12:24:49   0.345
#> 15 engine_A 2018-09-19 04:57:55 0.120   -0.338  2018-09-18 20:41:22   0.345
#> 16 engine_A 2018-09-19 21:31:02 0.902   -0.0708 2018-09-19 04:57:55   0.690
#> 17 engine_A 2018-09-21 23:10:20 0.853   -0.174  2018-09-19 21:31:02   2.07 
#> 18 engine_A 2018-09-22 07:26:53 0.807    0.0990 2018-09-21 23:10:20   0.345
#> 19 engine_A 2018-09-22 15:43:26 0.799    0.865  2018-09-22 07:26:53   0.345
#> 20 engine_A 2018-09-23 00:00:00 0.237    0.757  2018-09-22 15:43:26   0.345
#> 21 engine_B 2018-09-13 00:00:00 0.424   -0.0950 NA                    0    
#> 22 engine_B 2018-09-13 16:33:06 0.597   -0.0994 2018-09-13 00:00:00   0.690
#> 23 engine_B 2018-09-14 00:49:39 0.589    0.0720 2018-09-13 16:33:06   0.345
#> 24 engine_B 2018-09-14 09:06:12 0.471   -0.884  2018-09-14 00:49:39   0.345
#> 25 engine_B 2018-09-14 17:22:45 0.755   -0.688  2018-09-14 09:06:12   0.345
#> 26 engine_B 2018-09-15 01:39:18 0.497   -0.0582 2018-09-14 17:22:45   0.345
#> 27 engine_B 2018-09-15 18:12:24 0.809   -0.839  2018-09-15 01:39:18   0.690
#> 28 engine_B 2018-09-16 02:28:57 0.0380  -1.31   2018-09-15 18:12:24   0.345
#> 29 engine_B 2018-09-16 10:45:31 0.193   -0.645  2018-09-16 02:28:57   0.345
#> 30 engine_B 2018-09-16 19:02:04 0.991    0.0701 2018-09-16 10:45:31   0.345
#> 31 engine_B 2018-09-17 03:18:37 0.422   -0.835  2018-09-16 19:02:04   0.345
#> 32 engine_B 2018-09-18 04:08:16 0.222   -1.60   2018-09-17 03:18:37   1.03 
#> 33 engine_B 2018-09-18 12:24:49 0.207    0.814  2018-09-18 04:08:16   0.345
#> 34 engine_B 2018-09-18 20:41:22 0.957   -0.458  2018-09-18 12:24:49   0.345
#> 35 engine_B 2018-09-19 04:57:55 0.00400  0.932  2018-09-18 20:41:22   0.345
#> 36 engine_B 2018-09-19 21:31:02 0.482   -0.774  2018-09-19 04:57:55   0.690
#> 37 engine_B 2018-09-21 23:10:20 0.0872   1.16   2018-09-19 21:31:02   2.07 
#> 38 engine_B 2018-09-22 07:26:53 0.908    1.68   2018-09-21 23:10:20   0.345
#> 39 engine_B 2018-09-22 15:43:26 0.696   -0.383  2018-09-22 07:26:53   0.345
#> 40 engine_B 2018-09-23 00:00:00 0.451    0.922  2018-09-22 15:43:26   0.345
#>      grp run_days
#>    <int>    <dbl>
#>  1     0        0
#>  2     0        0
#>  3     0        1
#>  4     0        1
#>  5     0        1
#>  6     0        2
#>  7     0        2
#>  8     0        3
#>  9     0        3
#> 10     0        3
#> 11     0        4
#> 12     1        1
#> 13     1        1
#> 14     1        1
#> 15     1        2
#> 16     1        2
#> 17     2        2
#> 18     2        2
#> 19     2        2
#> 20     2        3
#> 21     0        0
#> 22     0        0
#> 23     0        1
#> 24     0        1
#> 25     0        1
#> 26     0        2
#> 27     0        2
#> 28     0        3
#> 29     0        3
#> 30     0        3
#> 31     0        4
#> 32     1        1
#> 33     1        1
#> 34     1        1
#> 35     1        2
#> 36     1        2
#> 37     2        2
#> 38     2        2
#> 39     2        2
#> 40     2        3

# this is the data frame I would like to generate, with the added run_days column
expected_output <- input_df %>%
  mutate(run_days = rep(c(0, 0, 1, 1, 1, 2, 2, 3, 3, 3, 4, 0, 0, 0, 1, 1, 0, 0, 0, 1), times = 2))
expected_output
#> # A tibble: 40 x 5
#>    engines  date_time                 x       y run_days
#>    <fct>    <dttm>                <dbl>   <dbl>    <dbl>
#>  1 engine_A 2018-09-13 00:00:00 0.196   -1.30          0
#>  2 engine_A 2018-09-13 16:33:06 0.0333  -0.108         0
#>  3 engine_A 2018-09-14 00:49:39 0.693   -1.29          1
#>  4 engine_A 2018-09-14 09:06:12 0.735    1.56          1
#>  5 engine_A 2018-09-14 17:22:45 0.651   -2.11          1
#>  6 engine_A 2018-09-15 01:39:18 0.837   -0.152         2
#>  7 engine_A 2018-09-15 18:12:24 0.492   -0.189         2
#>  8 engine_A 2018-09-16 02:28:57 0.664   -0.553         3
#>  9 engine_A 2018-09-16 10:45:31 0.180    0.600         3
#> 10 engine_A 2018-09-16 19:02:04 0.0507  -0.697         3
#> 11 engine_A 2018-09-17 03:18:37 0.837   -0.326         4
#> 12 engine_A 2018-09-18 04:08:16 0.0239   0.0353        0
#> 13 engine_A 2018-09-18 12:24:49 0.237   -0.195         0
#> 14 engine_A 2018-09-18 20:41:22 0.941   -0.170         0
#> 15 engine_A 2018-09-19 04:57:55 0.120   -0.338         1
#> 16 engine_A 2018-09-19 21:31:02 0.902   -0.0708        1
#> 17 engine_A 2018-09-21 23:10:20 0.853   -0.174         0
#> 18 engine_A 2018-09-22 07:26:53 0.807    0.0990        0
#> 19 engine_A 2018-09-22 15:43:26 0.799    0.865         0
#> 20 engine_A 2018-09-23 00:00:00 0.237    0.757         1
#> 21 engine_B 2018-09-13 00:00:00 0.424   -0.0950        0
#> 22 engine_B 2018-09-13 16:33:06 0.597   -0.0994        0
#> 23 engine_B 2018-09-14 00:49:39 0.589    0.0720        1
#> 24 engine_B 2018-09-14 09:06:12 0.471   -0.884         1
#> 25 engine_B 2018-09-14 17:22:45 0.755   -0.688         1
#> 26 engine_B 2018-09-15 01:39:18 0.497   -0.0582        2
#> 27 engine_B 2018-09-15 18:12:24 0.809   -0.839         2
#> 28 engine_B 2018-09-16 02:28:57 0.0380  -1.31          3
#> 29 engine_B 2018-09-16 10:45:31 0.193   -0.645         3
#> 30 engine_B 2018-09-16 19:02:04 0.991    0.0701        3
#> 31 engine_B 2018-09-17 03:18:37 0.422   -0.835         4
#> 32 engine_B 2018-09-18 04:08:16 0.222   -1.60          0
#> 33 engine_B 2018-09-18 12:24:49 0.207    0.814         0
#> 34 engine_B 2018-09-18 20:41:22 0.957   -0.458         0
#> 35 engine_B 2018-09-19 04:57:55 0.00400  0.932         1
#> 36 engine_B 2018-09-19 21:31:02 0.482   -0.774         1
#> 37 engine_B 2018-09-21 23:10:20 0.0872   1.16          0
#> 38 engine_B 2018-09-22 07:26:53 0.908    1.68          0
#> 39 engine_B 2018-09-22 15:43:26 0.696   -0.383         0
#> 40 engine_B 2018-09-23 00:00:00 0.451    0.922         1

The problem with my "solution" is that the counter is not reset to 0 - it's reset to the "restarting" value of dt_diff, which is larger than stop_threshold, thus it's always 1 at least.