sort values into groups by ranges

Hi
I have 2 data frames: the 1st includes numeric values (x column), and the 2nd set of ranges with unique IDs
I would like to add the 1st df a column for group ID, with default NA (which means - no group). For rows which df$x fall within a range - I want to set the group ID

df1<-data.frame(x=c(68, 39, 1, 34, 87, 43, 14, 82, 59, 21))
df.ranges<-data.frame(group=c("A", "B", "C"),
                     from=c(40, 20, 80),
                     to=c(70, 30, 90))

df1$group<-NA

I can do it with for loop

library(dplyr)
#expected results
for (i in 1:length(df.ranges$group)){
 df1$group[which(between(df1$x, df.ranges$from[i], df.ranges$to[i]))]<-levels(factor(df.ranges$group[i]))
}
df1

but I wondered if there is a vectorization method to do it (without loop) ?

Hi @sivan,

This is a good use-case for using the {dplyr} function case_when(...)

df1 %>% 
  mutate(group = case_when(
    between(x, 40, 70) ~ "A",
    between(x, 20, 30) ~ "B",
    between(x, 80, 90) ~ "C",
    TRUE ~ NA_character_
  ))

Thanks @mattattwarkentin
Actually, the data frame with the ranges is big, and generated automatically by other function. So, a solution based on the 2nd data frame ranges is not suiting to me.
I simplify the scenario for the example

I also tried to use inrange() {data.table} - which mark as TRUE any value, which fall into one of the ranges. The problem with inrange() that I don't know how to assign to the relevant group...

Ah I see. This code works but it's a bit hacky, I think. I wrote a helper function which returns the group label. Then you can apply this to your df1. Not sure this is any better than a for loop, expect maybe just a little easier to reason about.

library(purrr)
which_group <- function(x, df) {
  group_index <- pmap_lgl(df, ~between(x, ..2, ..3)) %>% which()
  if(!is_empty(group_index)) {
  df.ranges %>% 
    slice(group_index) %>% 
    pull(group)
  } else {
    NA
  }
}

df1 %>% 
  mutate(group = map_chr(x, which_group, df.ranges))

The function which_group() assumes df has three columns which are the group labels, starting value, and ending value, in that order.

This topic was automatically closed 21 days after the last reply. New replies are no longer allowed.