Conditionally interpolate values for one data frame based on another lookup table per group [solved]?

Reese · September 27, 2019, 5:11am

This is a follow up on this question. Now I want to do a few more steps:

Group by columns ID and order
For every val in df_dat, look up the corresponding ratio in the df_lookup table with the following conditions:

o If val < min(df_lookup$val), set new_ratio = min(df_lookup$ratio)
o If val > max(df_lookup$val), set new_ratio = max(df_lookup$ratio)
o If val falls within df_lookup$val range, do a simple linear interpolation

    library(dplyr)

    df_lookup <- tribble(
      ~ID, ~order, ~pct, ~val, ~ratio,
      "batch1", 1, 1,  1, 0.2,
      "batch1", 1, 10, 8, 0.5,
      "batch1", 1, 25, 25, 1.2,
      "batch2", 2, 1, 2, 0.1,
      "batch2", 2, 10, 15, 0.75,
      "batch2", 2, 25, 33, 1.5,
      "batch2", 2, 50, 55, 3.2,
    )

    df_dat <- tribble(
      ~order, ~ID, ~val,
      1, "batch1", 0.1,
      1, "batch1", 30,
      1, "batch1", 2,
      1, "batch1", 12,
      2, "batch1", 45,
      2, "batch2", 1.5,
      2, "batch2", 30,
      2, "batch2", 13,
      2, "batch2", 60,
    )

I have a working data.table code. Is it possible to achieve the same goal with tidyverse's verbs?


library(data.table)
setDT(df_lookup)
setDT(df_dat)

df_lookup[, m := (ratio - shift(ratio, -1L)) / (val - shift(val, -1L))]

df_dat[, new_ratio :=
         df_lookup[.SD, on=.(order, ID, val), roll=Inf, rollends=c(FALSE, FALSE),
                   x.m * (i.val - x.val) + x.ratio]
       ]
df_dat[is.na(new_ratio), new_ratio :=
         df_lookup[copy(.SD), on=.(order, ID, val), roll=Inf, x.ratio]]
df_dat[is.na(new_ratio), new_ratio :=
         df_lookup[copy(.SD), on=.(order, ID, val), roll=-Inf, x.ratio]]

	    order ID     val new_ratio
	1     1 batch1   0.1     0.2  
	2     1 batch1  30       1.2  
	3     1 batch1   2       0.243
	4     1 batch1  12       0.643
	5     2 batch1  45      NA    
	6     2 batch2   1.5     0.1 
	7     2 batch2  30       1.38 
	8     2 batch2  13       0.65 
	9     2 batch2  60       3.2

Example:

For order = 1, ID = batch2 and val = 30, new_ratio = 1.2 (max ratio value).
For order = 1, ID = batch1 and val = 2, new_ratio = 0.243 which is the interpolated ratio value between 0.2 and 0.5.
For order = 2 and ID = batch1, new_ratio = NA as those conditions aren’t in the lookup table.

Any help appreciated!

raytong · September 27, 2019, 6:51am

Hi @Reese. I would group and summaries the lookup table according to ID and order. Join two table together by ID and order. And test the condition row-wise.

library(tidyverse)

df_lookup <- tribble(
  ~ID, ~order, ~pct, ~val, ~ratio,
  "batch1", 1, 1,  1, 0.2,
  "batch1", 1, 10, 8, 0.5,
  "batch1", 1, 25, 25, 1.2,
  "batch2", 2, 1, 2, 0.1,
  "batch2", 2, 10, 15, 0.75,
  "batch2", 2, 25, 33, 1.5,
  "batch2", 2, 50, 55, 3.2,
)

df_dat <- tribble(
  ~order, ~ID, ~val,
  1, "batch1", 0.1,
  1, "batch1", 30,
  1, "batch1", 2,
  1, "batch1", 12,
  2, "batch1", 45,
  2, "batch2", 1.5,
  2, "batch2", 30,
  2, "batch2", 13,
  2, "batch2", 60,
)

df_lookup <- df_lookup %>%
  group_by(ID, order) %>%
  summarise(val_lookup = list(val), ratio = list(ratio))
  
df_dat %>%
  left_join(df_lookup, by = c("ID", "order")) %>%
  mutate(new_ratio = pmap_dbl(list(val, val_lookup, ratio), function(x, y, z){
    if(is.null(y)) {
      NA_integer_
    } else if (x < min(y)) {
      min(z)
    } else if (x > max(y)) {
      max(z)
    } else {
      approx(y, z, xout = x)$y
    }
  })) %>%
  select(order, ID, val, new_ratio)
#> # A tibble: 9 x 4
#>   order ID       val new_ratio
#>   <dbl> <chr>  <dbl>     <dbl>
#> 1     1 batch1   0.1     0.2  
#> 2     1 batch1  30       1.2  
#> 3     1 batch1   2       0.243
#> 4     1 batch1  12       0.665
#> 5     2 batch1  45      NA    
#> 6     2 batch2   1.5     0.1  
#> 7     2 batch2  30       1.38 
#> 8     2 batch2  13       0.65 
#> 9     2 batch2  60       3.2

^{Created on 2019-09-27 by the reprex package (v0.3.0)}

Reese · September 27, 2019, 12:46pm

Thank you so very much raytong. I tried your code with dtplyr package to get better performance for a much bigger dataset but it did not work. Are you familiar with dtplyr?

library(data.table)
library(dtplyr)
library(dplyr)

dt_lookup <- lazy_dt(df_lookup)
dt_dat <- lazy_dt(df_dat)

dt_lookup <- dt_lookup %>%
  group_by(ID, order) %>%
  summarise(val_lookup = list(val), ratio = list(ratio)) %>%
  as_tibble()
dt_lookup
#> # A tibble: 2 x 4
#>   ID     order val_lookup ratio    
#>   <chr>  <dbl> <list>     <list>   
#> 1 batch1     1 <dbl [3]>  <dbl [3]>
#> 2 batch2     2 <dbl [4]>  <dbl [4]>

dt_dat %>%
  left_join(dt_lookup, by = c("ID", "order")) %>%
  mutate(new_ratio = pmap_dbl(list(val, val_lookup, ratio), function(x, y, z){
    if(is.null(y)) {
      NA_integer_
    } else if (x < min(y)) {
      min(z)
    } else if (x > max(y)) {
      max(z)
    } else {
      approx(y, z, xout = x)$y
    }
  })) %>%
  select(order, ID, val, new_ratio)
#> Invalid input

rlang::last_error()
#> <error>
#> message: Invalid input
#> class:   `rlang_error`
#> backtrace:
#>   1. dplyr::left_join(., dt_lookup, by = c("ID", "order"))
#>   9. dplyr::mutate(...)
#>  10. dtplyr:::capture_dots(.data, ...)
#>  11. base::lapply(dots, dt_squash, vars = .data$vars, j = .j)
#>  12. dtplyr:::FUN(X[[i]], ...)
#>  13. dtplyr:::dt_squash(get_expr(x), get_env(x), vars = vars, j = j)
#>  14. base::lapply(x[-1], dt_squash, vars = vars, env = env, j = j)
#>  15. dtplyr:::FUN(X[[i]], ...)
#>  16. base::lapply(x[-1], dt_squash, vars = vars, env = env, j = j)
#>  17. dtplyr:::FUN(X[[i]], ...)
#> Call `rlang::last_trace()` to see the full backtrace

raytong · September 27, 2019, 1:52pm

@Reese. Sorry, I am not familiar with dtplyr package.

Reese · September 27, 2019, 5:40pm

There was warning when I ran your code for longer data. Is it safe to ignore?

There were 50 or more warnings (use warnings() to see the first 50)

Warning messages:
1: In regularize.values(x, y, ties, missing(ties)) :
  collapsing to unique 'x' values
2: In regularize.values(x, y, ties, missing(ties)) :
  collapsing to unique 'x' values
3: In regularize.values(x, y, ties, missing(ties)) :
  collapsing to unique 'x' values
4: In regularize.values(x, y, ties, missing(ties)) :
  collapsing to unique 'x' values
5: In regularize.values(x, y, ties, missing(ties)) :
  collapsing to unique 'x' values
6: In regularize.values(x, y, ties, missing(ties)) :
  collapsing to unique 'x' values

raytong · September 28, 2019, 2:40am

@Reese. The warning message is come from approx function which do the linear interpolation. Some report that the warning message appear after update to R 3.6. I cannot figure out the problem. You may try to add the argument ties in approx function which default value is mean.

approx(y, z, xout = x, ties = mean)$y

system · October 5, 2019, 2:40am

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.