Joining datasets in range of time.

I have two datasets that I would like to join based on the time range. for instance Event name total render need to be joined based in the time range in df2. Is this possible?

tibble::tribble(
                  ~timestamp,                                ~hostname,      ~eventName, ~eventType,                                            ~jobId,                                ~taskId,
  "2018-11-08T07:41:45.459Z", "04dc4e9647154250beeee51b866b0715000000",   "TotalRender",    "START", "1024-lvl12-7e026be3-5fd0-48ee-b7d1-abd61f747705", "00390eee-c26c-41da-a02d-556bb7fcac67",
  "2018-11-08T07:41:45.459Z", "04dc4e9647154250beeee51b866b0715000000", "Saving Config",    "START", "1024-lvl12-7e026be3-5fd0-48ee-b7d1-abd61f747705", "00390eee-c26c-41da-a02d-556bb7fcac67",
  "2018-11-08T07:41:32.461Z", "04dc4e9647154250beeee51b866b0715000000",        "Render",    "START", "1024-lvl12-7e026be3-5fd0-48ee-b7d1-abd61f747705", "00390eee-c26c-41da-a02d-556bb7fcac67",
  "2018-11-08T07:41:32.461Z", "04dc4e9647154250beeee51b866b0715000000", "Saving Config",     "STOP", "1024-lvl12-7e026be3-5fd0-48ee-b7d1-abd61f747705", "00390eee-c26c-41da-a02d-556bb7fcac67",
  "2018-11-08T07:42:09.344Z", "04dc4e9647154250beeee51b866b0715000000",        "Render",     "STOP", "1024-lvl12-7e026be3-5fd0-48ee-b7d1-abd61f747705", "00390eee-c26c-41da-a02d-556bb7fcac67"
  )

DF2

tibble::tribble(
                  ~timestamp,                                ~hostname,   ~gpuSerial,                                   ~gpuUUID, ~powerDrawWatt, ~gpuTempC, ~gpuUtilPerc, ~gpuMemUtilPerc,
  "2018-11-08T07:41:44.242Z", "04dc4e9647154250beeee51b866b0715000000", 323217056165, "GPU-a1119ee9-9cd1-919f-a479-b902142c717d",          25.94,       32L,           0L,              0L,
  "2018-11-08T07:41:47.259Z", "04dc4e9647154250beeee51b866b0715000000", 323217056165, "GPU-a1119ee9-9cd1-919f-a479-b902142c717d",          25.84,       32L,           0L,              0L,
  "2018-11-08T07:41:31.285Z", "04dc4e9647154250beeee51b866b0715000000", 323217056165, "GPU-a1119ee9-9cd1-919f-a479-b902142c717d",          25.84,       32L,           0L,              0L,
  "2018-11-08T07:41:33.301Z", "04dc4e9647154250beeee51b866b0715000000", 323217056165, "GPU-a1119ee9-9cd1-919f-a479-b902142c717d",          25.94,       32L,           0L,              0L,
  "2018-11-08T07:41:35.322Z", "04dc4e9647154250beeee51b866b0715000000", 323217056165, "GPU-a1119ee9-9cd1-919f-a479-b902142c717d",          25.84,       32L,           0L,              0L
  )

It's not clear with that sample data, because you have not calculated the time range between start and stop off any given event, but apparently they overlap each other, so this will depend on what conclusion are you trying to get from these analysis.

As @andresrcs says, it's not entirely clear what your criteria for joining are, but you can accomplish this type of join using the fuzzyjoin package. Take a look at this question and answer for an example of a join based on time intervals:

data.table has a very nice function for this: foverlaps https://rdrr.io/cran/data.table/man/foverlaps.html

@mfherman I have tried the fussy join but didn't get the results I am after. It gave the same out put as the left_join.

I initially tried to calculate the time difference between each event based on the stop and start value and then joined based on the hostname and timestamp, but the data dimensions was two small and my analysis wasn't accurate. In df1 the measurement is recorded every 5 seconds where in df2ly it is every 2 second.

Thanks for pointing this out. I'n struggling to apply this into my data. Please can you indicate whta i need to do?

foverlaps(x, y, by.x = if (!is.null(key(x))) key(x) else key(y),
    by.y = key(y), maxgap = 0L, minoverlap = 1L,
    type = c("any", "within", "start", "end", "equal"),
    mult = c("all", "first", "last"),
    nomatch = getOption("datatable.nomatch"),
    which = FALSE, verbose = getOption("datatable.verbose"))

When trying the fuzzy join i am getting a messge regarding memory: Error: cannot allocate vector of size 3.7 Gb
Is there anyway round this please?

I feel like you could use some of lubridate's interval functions (see https://lubridate.tidyverse.org/reference/lubridate-package.html#intervals for more), but without a clear deliverable, I'm not sure what would you need exactly.

Another way to do this is to perform a cartesian join / cross join (e.g. merge(df1, df2, all=TRUE), and then pair it up with filter call, where your conditions are <, == or > on the respective timestamps.

This is my lastest code. It states that: Error in is.data.table(y) : object 'y' not found

Any idea what I am missing?

require(data.table)
simple example:
  x = data.table(df2,start=c(timestamp),end=c(taskId))
y = data.table(df1,start=c(timestamp),end=c(gpuMemUtilPerc))
setkey(y,start,end)
foverlaps(x, y, by.x = c("timestamp","hostname"), type="within", nomatch = 0L)

This discussion seems to have moved here Joining datasets using foverlaps