Generating time intervals of a data set based on shortest time interval and retaining corresponding values

Hi all!

I am quite new to R, so I would like to ask for help with the kind of approach I should take.

I have time series data of gaze behavior that I would like to analyze. It is structured like this:

Participant - Start - End - Duration - Gazed_Entity

The issue is that each participant has a unique set of time intervals. Although the interval boundaries differ from gaze to gaze, the gazes of all participants were recorded over the same time span. The data look like this:

|Participant_Code|Start|End|Duration|Gazed_entity|
|Pink |00:00:00,000|00:00:50,368|00:00:50,368|Laptop|
|Pink |00:00:50,368|00:00:51,316|00:00:00,948|Yellow|
|Pink |00:00:51,316|00:01:12,287|00:00:20,971|Laptop|
|Pink |00:01:12,287|00:01:12,874|00:00:00,587|Other|
|Green|00:00:00,000|00:00:14,222|0:00:14,222|Laptop|
|Green|00:00:14,222|00:00:15,023|0:00:00,801|Pink|
|Green|00:00:15,023|00:01:16,201|0:01:01,178|Laptop|
|Green|00:01:16,201|00:01:16,869|0:00:00,668|Yellow|

For the analysis I will be doing (with the crqa package in R), I need time intervals of equal length for each participant. How can I do this while also retaining the "Gazed_Entity" that corresponds to each time interval?

The results should look something like this (which I am currently doing manually):

Shortest duration in this example: |00:00:00,587| so,
Pink - start: |00:00:00,000| end: |00:00:00,587| Gazed_entity: Laptop
Pink - start: |00:00:00,587| end: |00:00:01,174| Gazed_entity: Laptop
Pink - start: |00:00:01,174| end: |00:00:01,761| Gazed_entity: Laptop

I am not specifically asking for the formula, since I have to figure that out based on the data I have. However, any suggestions about the direction I should take in terms of functions and methodology would be appreciated!

Thanks all!

Have a look at the lubridate package, which can handle times, including fractional seconds. However, since your measurements are this precise, it is worth taking the time to read a bit on Stack Overflow about the (numerical) precision of time storage in R, to get an idea of the potential problems you may face:

R xts: .001 millisecond in index - Stack Overflow

R lubridate ymd_hms millisecond diff - Stack Overflow
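One way to sidestep floating-point issues with fractional seconds is to store every timestamp as a whole number of milliseconds. The following is only a minimal sketch in base R. The helper `to_ms()` is a hypothetical name (it is not part of lubridate or any other package), and it assumes the timestamps always follow the `HH:MM:SS,mmm` pattern shown in the question.

```r
# Hypothetical helper (not from any package):
# converts "HH:MM:SS,mmm" strings to integer milliseconds.
# Assumes every string has exactly four numeric fields
# separated by ":" and ",".
to_ms <- function(x) {
  parts <- strsplit(x, "[:,]")
  vapply(parts, function(p) {
    p <- as.integer(p)
    # hours -> minutes -> seconds -> milliseconds, all in integer arithmetic
    ((p[1] * 60L + p[2]) * 60L + p[3]) * 1000L + p[4]
  }, integer(1))
}

to_ms(c("00:00:50,368", "0:01:01,178"))
#> [1] 50368 61178
```

Because the values are integers, differences and comparisons between timestamps are exact, which is not guaranteed when the same times are stored as fractional seconds in a double.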

As for the general procedure (not covering the fractional seconds problem), you can do something like the following. It uses only data.table and collapse, since both are pretty fast. The ITime class from data.table works well with the fast statistical functions in collapse and also supports arithmetic operations (like division), which cannot be done with the base R POSIXt class:

## Read in the data
Data <- data.table::fread(text = '|Participant_Code|Start|End|Duration|Gazed_entity|
|Pink |00:00:01|00:00:50|00:00:49|Laptop|
|Pink |00:00:50|00:00:51|00:00:01|Yellow|
|Pink |00:00:51|00:01:12|00:00:21|Laptop|
|Pink |00:01:12|00:01:13|00:00:01|Other|
|Green|00:00:00|00:00:14|0:00:14|Laptop|
|Green|00:00:14|00:00:15|0:00:01|Pink|
|Green|00:00:15|00:01:16|0:01:01|Laptop|
|Green|00:01:16|00:01:17|0:00:01|Yellow|',
                          sep = '|',header = TRUE) |>
  ## only keep columns 2 to 6, because the others are NA
  collapse::fselect(-c(1,7)) |>
  ## convert the times to ITime format from data.table
  ## (which can be easily used within collapse)
  collapse::ftransformv(vars = Start:Duration, FUN = data.table::as.ITime)

## Find the smallest duration
Data  |>
  (\(x) collapse::fmin(x$Duration))() -> min_dur

Data <- Data |>
  ## Now create a weight, to expand the data corresponding to the smallest duration
  collapse::fmutate(weight = as.integer(Duration / min_dur) + 1) |>
  ## Expand the Data with `tidyr::uncount()`
  tidyr::uncount(weights = weight) |>
  ## Recreate the Start and End according to the Duration
  # First, split into corresponding groups
  collapse::rsplit(~ list(Participant_Code,Gazed_entity)) |>
  # Second, apply a function which takes the start and end as well as the weight as arguments
  collapse::rapply2d(FUN = \(x){
    start <- as.POSIXct(collapse::fmin(x$Start))
    end   <- as.POSIXct(collapse::fmax(x$End))
    # There will be the current date added, since POSIXct is a date time format
    x$sequence_start <- seq.POSIXt(start, end, by = 1) |>
      # remove the date
      data.table::as.ITime()
    x$sequence_end <- data.table::shift(x$sequence_start, type = 'lead')
    # remove the last row, since its sequence_end is NA (due to the lead shift)
    # (seq_len() avoids the operator-precedence trap in `1:nrow(x) - 1`)
    x[seq_len(nrow(x) - 1), ]
  }) |>
  ## Recreate the data.frame
  collapse::unlist2d(idcols = c("Participant_Code","Gazed_entity"))

head(Data)
#>   Participant_Code Gazed_entity    Start      End Duration sequence_start
#> 1            Green       Laptop 00:00:00 00:00:14 00:00:14       00:00:00
#> 2            Green       Laptop 00:00:00 00:00:14 00:00:14       00:00:01
#> 3            Green       Laptop 00:00:00 00:00:14 00:00:14       00:00:02
#> 4            Green       Laptop 00:00:00 00:00:14 00:00:14       00:00:03
#> 5            Green       Laptop 00:00:00 00:00:14 00:00:14       00:00:04
#> 6            Green       Laptop 00:00:00 00:00:14 00:00:14       00:00:05
#>   sequence_end
#> 1     00:00:01
#> 2     00:00:02
#> 3     00:00:03
#> 4     00:00:04
#> 5     00:00:05
#> 6     00:00:06

Created on 2022-11-28 with reprex v2.0.2

I hope this gives you a general understanding of the procedure.
Hopefully somebody else can cover the milliseconds problem, since I don't know the answer at the moment and don't have the time to dig into it.

Kind regards
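
For the milliseconds part that the reply above leaves open, one possible direction is to work in integer milliseconds and look up the gazed entity for each fixed-width bin with `findInterval()`. This is only a sketch in base R under stated assumptions: `to_ms()` and `resample()` are hypothetical helpers written for this example, the intervals of each participant are assumed to be contiguous (each `End` equals the next `Start`), and each bin is labelled with the entity being gazed at when the bin starts.

```r
# Hypothetical helper: "HH:MM:SS,mmm" -> integer milliseconds
to_ms <- function(x) {
  vapply(strsplit(x, "[:,]"), function(p) {
    p <- as.integer(p)
    ((p[1] * 60L + p[2]) * 60L + p[3]) * 1000L + p[4]
  }, integer(1))
}

# The example data from the question
gaze <- data.frame(
  Participant_Code = rep(c("Pink", "Green"), each = 4),
  Start = c("00:00:00,000", "00:00:50,368", "00:00:51,316", "00:01:12,287",
            "00:00:00,000", "00:00:14,222", "00:00:15,023", "00:01:16,201"),
  End   = c("00:00:50,368", "00:00:51,316", "00:01:12,287", "00:01:12,874",
            "00:00:14,222", "00:00:15,023", "00:01:16,201", "00:01:16,869"),
  Gazed_entity = c("Laptop", "Yellow", "Laptop", "Other",
                   "Laptop", "Pink", "Laptop", "Yellow"),
  stringsAsFactors = FALSE
)
gaze$start_ms <- to_ms(gaze$Start)
gaze$end_ms   <- to_ms(gaze$End)

# Shortest gaze duration across all participants, used as the bin width
bin_ms <- min(gaze$end_ms - gaze$start_ms)
bin_ms
#> [1] 587

# Hypothetical helper: build a regular grid for one participant and
# label each bin with the interval that contains the bin's start time.
resample <- function(d, bin_ms) {
  d <- d[order(d$start_ms), ]
  bin_start <- seq(min(d$start_ms), max(d$end_ms) - 1L, by = bin_ms)
  idx <- findInterval(bin_start, d$start_ms)
  data.frame(
    Participant_Code = d$Participant_Code[idx],
    bin_start = bin_start,
    bin_end = bin_start + bin_ms,
    Gazed_entity = d$Gazed_entity[idx],
    stringsAsFactors = FALSE
  )
}

# Apply per participant and stack the results
binned <- do.call(
  rbind,
  lapply(split(gaze, gaze$Participant_Code), resample, bin_ms = bin_ms)
)
head(binned[binned$Participant_Code == "Pink", ], 3)
```

The first three rows for Pink cover 0 to 587, 587 to 1174, and 1174 to 1761 milliseconds, all labelled Laptop, which matches the manual result in the question. Note that the last bin of each participant can extend past the final recorded `End`, and that labelling a bin by its start time is a design choice; assigning the entity with the largest overlap within the bin would be an alternative.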
