VLOOKUP, match not exact

Hi. I am trying to do the equivalent of an Excel VLOOKUP when the match is not exact.

Here is a data frame of input and output values, where each input value is the upper end of a bin. The second bin for example goes from 0.5 < x <= 0.8. I would like to assign a uniform [0,1] random number to the output value in the data frame, so for example, .67 gets assigned to 1. I want to avoid a series of ifelse's. How do i this this?

input <- c(0.5, 0.8, 0.9, 0.97, 1)
output <- c(0,1,2,3,4)
df <- data.frame(cbind(input, output)

If all you want is a vector of random integers

input <- c(0.5, 0.8, 0.9, 0.97, 1)
replacements <- seq(1:9)
sample(replacements, length(input))
#> [1] 3 2 9 7 5

Created on 2020-08-12 by the reprex package (v0.3.0)

Thanks, but perhaps I did a poor job of explaining. If the random number were between 0 and 0.5, assign 0. If the random number were between 0.5 and 0.8, assign 1. If the random number were between 0.8 and 0.9, assign 2, etc. Can this be done without IFELSE-ing each range?

Ah, that makes more sense. To avoid multiple ifelse use case_when

case_when(
  item <= 0.5 ~ 0,
  item > 0.5 & <= 0.8 ~ 1,
  ...
 )

Thanks again, but isn't that the equivalent of multiple IFELSE? In Excel what I'm trying to do is a one line VLOOKUP statement. There ought to be a way to do this in R without enumerating each bin. What if there were 50 bins?

You might want to use the cut function.

cutpoints <- c(0, .5, .8, 1)
input <- c(0.5, 0.8, 0.9, 0.97, 1)
#note that 1 will go in highest category - with a random number, it will never be exactly 1 or exatcly 0
cut(input, cutpoints, include.lowest = TRUE) 
#> [1] [0,0.5]   (0.5,0.8] (0.8,1]   (0.8,1]   (0.8,1]  
#> Levels: [0,0.5] (0.5,0.8] (0.8,1]
#with labels as you wanted
cut(input, cutpoints, include.lowest = TRUE, labels=1:(length(cutpoints)-1)-1)
#> [1] 0 1 2 2 2
#> Levels: 0 1 2

Created on 2020-08-12 by the reprex package (v0.3.0)

2 Likes

StatSteph, thank you, this is close to what I need. In the following, x=.67 gives me the correct bin range.

breaks <- c(0, .5, .8, .9, .97, 1)
x <- .67
cut(x, breaks, include.lowest=TRUE)

How do I get it to tell me which bin number this is? WHICH is not the solution.

An additional option, if you want to keep mnemonic labels for the bin ranges, is to use the match function. For example:

breaks <- c(0, .5, .8, .9, .97, 1)

x <- c(.67, 0.2, 0.85, 0.97)

xcut = cut(x, breaks, include.lowest=TRUE)

xcut
#> [1] (0.5,0.8]  [0,0.5]    (0.8,0.9]  (0.9,0.97]
#> Levels: [0,0.5] (0.5,0.8] (0.8,0.9] (0.9,0.97] (0.97,1]

levels(xcut)
#> [1] "[0,0.5]"    "(0.5,0.8]"  "(0.8,0.9]"  "(0.9,0.97]" "(0.97,1]"

# Get bin numbers
match(xcut, levels(xcut))
#> [1] 2 1 3 4

cut(x, breaks, include.lowest=TRUE, labels = FALSE)?

I do like the case_when method, but if you didn't want to type it all out, you could create a function to accomplish your goal.

The below would need to be modified if you wanted to treat for the final else statement.

# load tidyverse
library(tidyverse)

# function ----
my_match <- function(match_tbl, n) {
  match_df <- match_tbl %>%
    rowid_to_column("key")

  new_key <- nrow(match_df) + 1

  match_df %>%
    full_join(
      tibble(
        key = new_key,
        input = n
      ),
      by = c("key", "input")
    ) %>%
    arrange(input) %>%
    fill(output, .direction = "updown") %>%
    filter(key == new_key) %>%
    pull(output)
}


# break table ----
my_breaks <- tibble(
  input = c(0.5, 0.8, 0.9, 0.97, 1),
  output = c(0, 1, 2, 3, 4)
)


# test ----
tibble(
  new_inputs = c(
    (c(0.5, 0.8, 0.9, 0.97, 1) - .01),
    (c(0.5, 0.8, 0.9, 0.97, 1) + .01)
  )
) %>%
  rowwise() %>% 
  mutate(matched_n = my_match(my_breaks,new_inputs)) %>% 
  ungroup()
#> # A tibble: 10 x 2
#>    new_inputs matched_n
#>         <dbl>     <dbl>
#>  1       0.49         0
#>  2       0.79         1
#>  3       0.89         2
#>  4       0.96         3
#>  5       0.99         4
#>  6       0.51         1
#>  7       0.81         2
#>  8       0.91         3
#>  9       0.98         4
#> 10       1.01         4

Created on 2020-08-13 by the reprex package (v0.3.0)

This topic was automatically closed 21 days after the last reply. New replies are no longer allowed.