R data.table to calculate new columns from existing columns bases on certain conditions

Let's say I have the following data table:

dta <- data.table(
  criteria = c('A', 'A', 'B', 'A', 'A', 'B'),
  phase = list('block3', c('block1', 'block2'), 'block2', 'block2', 'block3', 'block1'),
  start_val = c(12.0, 1.0, 7.0, 7.0, 12.0, 1.0),
  end_val = c(15.0, 11.0, 11.0, 11.0, 15.0, 6.0),
  max_val = c(13.0, 8.0, 9.5, 11.0, 15.0, 6.0)
)

from which I need the resulting table with two additional column's, cor_start and cor_end

dtb <- data.table(
  criteria = c('A', 'A', 'B', 'A', 'A', 'B'),
  phase = list('block3', c('block1', 'block2'), 'block2', 'block2', 'block3', 'block1'),
  start_val = c(12.0, 1.0, 7.0, 7.0, 12.0, 1.0),
  end_val = c(15.0, 11.0, 11.0, 11.0, 15.0, 6.0),
  max_val = c(13.0, 8.0, 9.5, 11.0, 15.0, 6.0),
  cor_start = c(12.0, 1.0, 8.0, 9.5, 13.0, 6.0),
  cor_end = c(13.0, 8.0, 9.5, 11.0, 15.0, 6.0)
)

the new columns need to be calculated with reference to phase column by checking if there is any previous row with the current matching phase value.

For better understanding, in this example:

  • row 3 has a matching phase of block2 in row 2
  • row 4 has a matching phase of block2 in row 3
  • row 5 has a matching phase of block3 in row 1
  • row 6 has a matching phase of block1 in row 2

however, row 1 and row 2 have no previous matching phase rows. Note that the phase is of type list.

So, when there is a previous matching row, below are the conditions:

if (max_val in previous matching row is < end_val in current row)
  cor_start = previous matching row max_val
  cor_end = current row end_val

if (max_val in previous matching row is > end_val in current row)
  cor_start = current row end_val
  cor_end = current row end_val

and when there is no previous matching row, below are the conditions:

  cor_start = current row start_val
  cor_end = current row max_val

I looked into shift(), but could not figure out on how to set the above conditions ? Thanks!

In my experience, list columns are only good for storing non-basic values. It's tidier to have one "value" for each row/column intersection. For example, a list column is good for storing fitted models, but even then, it should be one model object per row.

Keeping with that line of thinking, working with the phases in "long form" is much simpler. We'll add an index to match things later. In your problem, you need to carry the last max_val forward for each phase. We can do this with the roll argument for data.table. We'll add a "just behind" index to force the rolling.

dta[, index := as.numeric(.I)]
phase_history <- dta[
  ,
  list(
    index   = rep(index, lengths(phase)),
    max_val = rep(max_val, lengths(phase)),
    phase   = unlist(phase)
  )
][
  ,
  pre_index := index - 0.5
]
phase_history
#    index max_val  phase pre_index
# 1:     1    13.0 block3       0.5
# 2:     2     8.0 block1       1.5
# 3:     2     8.0 block2       1.5
# 4:     3     9.5 block2       2.5
# 5:     4    11.0 block2       3.5
# 6:     5    15.0 block3       4.5
# 7:     6     6.0 block1       5.5

Next we'll join phase_history to itself, but using the pre_index as index. This matches each phase to the row with the closest lower value of index. First, let's see what this will look like.

phase_history[
  phase_history,
  on = c(phase = "phase", index = "pre_index"),
  roll = TRUE
]
#    index max_val  phase pre_index i.index i.max_val
# 1:   0.5      NA block3        NA       1      13.0
# 2:   1.5      NA block1        NA       2       8.0
# 3:   1.5      NA block2        NA       2       8.0
# 4:   2.5     8.0 block2       1.5       3       9.5
# 5:   3.5     9.5 block2       2.5       4      11.0
# 6:   4.5    13.0 block3       0.5       5      15.0
# 7:   5.5     8.0 block1       1.5       6       6.0

So indices 3, 4, 5, and 6 (shown in the i.index column) have matches, which means they repeat previous phases. Let's save the results of this merge to add the previous max_val values back into dta.

previous_values <- phase_history[
  phase_history,
  on = c(phase = "phase", index = "pre_index"),
  roll = TRUE,
  list(
    index        = i.index,
    prev_max_val = max_val
  )
]

dta[
  previous_values,
  on = "index",
  prev_max_val := prev_max_val
]

dta
#    criteria         phase start_val end_val max_val index prev_max_val
# 1:        A        block3        12      15    13.0     1           NA
# 2:        A block1,block2         1      11     8.0     2           NA
# 3:        B        block2         7      11     9.5     3          8.0
# 4:        A        block2         7      11    11.0     4          9.5
# 5:        A        block3        12      15    15.0     5         13.0
# 6:        B        block1         1       6     6.0     6          8.0

It's finally time to apply the cor_start and cor_end logic. data.table's ability to update only selected rows is great. It will ignore any NAs in the selecting vector, which means any non-matched rows won't be updated until the end.

dta[
  prev_max_val < end_val,
  ":="(
    cor_start = prev_max_val,
    cor_end   = end_val
  )
][
  prev_max_val > end_val,
  ":="(
    cor_start = end_val,
    cor_end   = end_val
  )
][
  is.na(cor_start),
  ":="(
    cor_start = start_val,
    cor_end   = max_val
  )
]

dta
#    criteria         phase start_val end_val max_val index prev_max_val cor_start cor_end
# 1:        A        block3        12      15    13.0     1           NA      12.0      13
# 2:        A block1,block2         1      11     8.0     2           NA       1.0       8
# 3:        B        block2         7      11     9.5     3          8.0       8.0      11
# 4:        A        block2         7      11    11.0     4          9.5       9.5      11
# 5:        A        block3        12      15    15.0     5         13.0      13.0      15
# 6:        B        block1         1       6     6.0     6          8.0       6.0       6

1 Like

Thanks for your support.. this is what I was looking for. Great explanation with detailed and clear illustration. Helps to better understand the power of data table. :+1::+1:

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.