Joining two datasets

joewranglesdata · November 23, 2022, 6:47pm

Hi all,

I have two datasets that look a bit like this:

df_1 <- data.frame(time_point = c(1, 2, 3, 4, 5, 6, 7, 8))
df_2 <- data.frame(stretch_id = c("stretch1", "stretch2", "stretch3"),
                   stretch_starts = c(1, 4, 6),
                   stretch_ends = c(3, 5, 9))

Dataframe 1 is a list of time points. Dataframe 2 contains information about these time points should be organised - where they fall in between a value for stretch_starts and stretch_ends, they belong to that stretch.

I would like to join these two sets of data to reflect this, so that it looks like this:

df_aim <- data.frame(stretch_id = c("stretch1", "stretch1", "stretch1", "stretch2", "stretch2", "stretch3", "stretch3", "stretch3"),
                     time_point = c(1, 2, 3, 4, 5, 6, 7, 8),
                   stretch_starts = c(1, 1, 1, 4, 4, 6, 6, 6),
                   stretch_ends = c(3, 3, 3, 5, 5, 9, 9, 9))

i.e., if a value for time_point is equal or greater than a value in stretch_starts, and less than a value in stretch_ends, it belongs to that stretch - I want to create a single dataframe with information about which stretch my time_points belong to.

Can anyone propose any suggestions as to how I could get to something like df_aim from df_1 and df_2?

Thanks in advance

FactOREO · November 23, 2022, 7:39pm

To avoid confusion I renamed the two data.frames in my example below (they are data.table, not data.frame objects, hence the renaming):

library('data.table')
library('collapse')
df1 <- qDT(df_1)
df2 <- qDT(df_2)

df2 |>
  tidyr::uncount(weights = stretch_ends - stretch_starts + 1) |>
  fgroup_by(stretch_id) |>
  ### ignore the warnings
  fmutate(time_point = stretch_starts:stretch_ends) |>
  fungroup() |>
  ### join on df1
  (\(x) x[df1, on = 'time_point'])()

This will give the desired result of df_aim, but I bet there is a more elegant solution I cannot think of for now. This solutions throws warnings, since the mutate step create for every row in the group of stretch_id a vector of length max - min + 1 which is used to implement the counter. But since this happens for all rows per group, only the first one is used (which is fine, since we only need that one and the others are just copys). But it looks odd, if you have all those warning messages on screen.

Kind regards

Edit: With the same libraries as above, this one runs without warnings:

df_2 |>
  tidyr::uncount(weights = stretch_ends - stretch_starts + 1) |>
  rsplit(~ stretch_id) |> 
  rapply2d(
    FUN = \(x){
      time_point = seq.default(fmin(x$stretch_starts),fmax(x$stretch_ends),1)
      qDT(cbind(x, time_point))
    }
  ) |>
  rbindlist(idcol = 'stretch_id') |>
  ### join on df1
  (\(x) x[df1, on = 'time_point'])()

Is utilizes collapse::rapply2d(), which applies a function like lapply() to a list, but is faster and easily pipable. The resulting data.table objects inside the list are then bind together with data.table::rbindlist() and again joined on df1. Note that this approach starts with your data.frame object df_2 instead of the data.table object.

Matthias · November 24, 2022, 9:34am

The fuzzyjoin-package has a function to match "Intervals of (start, end) that overlap (interval_inner_join)"

I remember it wasn't very quick on big datasets, but at least easy to use.

system · December 15, 2022, 9:35am

This topic was automatically closed 21 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.