Grouping Data into Ranges

I have the following data:

set.seed(123)
my_data = data.frame(var1 =  rnorm(100,100,100))
min = min(my_data$var1)
max = max(my_data$var1)

Here is what I am trying to do:

  • Starting from the smallest value of var1, I would like to create a variable that groups values of var1 by some "fixed increment" (e.g. by 10) until the maximum value of var1 is reached
  • Then, I would like to create another variable that labels each of these groups by the min/max value of that group

Here is my attempt to do this:

# create a vector of increments
breaks <- seq(min(my_data$var1), max(my_data$var1), by = 10)

# initialize new variables
my_data$class <- NA
my_data$label <- NA

# get the number of breaks
n <- length(breaks)

# Loop 
for (i in 1:(n - 1)) {
    # find which "class" (i.e. break) each value of var1 is located within
    indices <- which(my_data$var1 > breaks[i] & my_data$var1 <= breaks[i + 1])
    
    # make assignment
    my_data$class[indices] <- i
    
    # create labels
    my_data$label[indices] <- paste(breaks[i], breaks[i + 1])
}

The code seems to have run, but I am not sure if this is correct (I don't think I have done this correctly because I see some NA's).

Can someone please show me how to do this correctly?

Thanks!

Hi, how about this?

set.seed(123)
my_data = data.frame(var1 =  rnorm(100,100,100))
min = min(my_data$var1)
max = max(my_data$var1)

library(tidyverse)
my_data %>% 
  as_tibble() %>% # just to make it easier to view
  mutate(bins = cut(var1, breaks = pretty(var1, n = (max-min)/10), include.lowest = TRUE))

# # A tibble: 100 x 2
# var1 bins     
# <dbl> <fct>    
# 1  44.0 (40,50]  
# 2  77.0 (70,80]  
# 3 256.  (250,260]
# 4 107.  (100,110]
# 5 113.  (110,120]
# 6 272.  (270,280]
# 7 146.  (140,150]
# 8 -26.5 (-30,-20]
# 9  31.3 (30,40]  
# 10  55.4 (50,60]  
# # ... with 90 more rows
# # i Use `print(n = ...)` to see more rows
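
On the NAs in the original loop: they seem to come from two places. The test var1 > breaks[i] excludes the minimum itself, and seq() stops at or below the maximum, so any value above the last break (typically including the maximum) is never assigned. Here is a minimal base R sketch that keeps the breaks starting at the minimum. The increment of 10 and the " to " label separator are illustrative choices, not something fixed by the question.

```r
set.seed(123)
my_data <- data.frame(var1 = rnorm(100, 100, 100))

width <- 10                      # the "fixed increment" (assumed value)
lo <- min(my_data$var1)
hi <- max(my_data$var1)

# Extend the sequence one step past the maximum so the top value is covered
breaks <- seq(lo, hi + width, by = width)

# include.lowest = TRUE closes the first interval on the left,
# so the minimum lands in group 1 instead of becoming NA
my_data$class <- cut(my_data$var1, breaks = breaks,
                     include.lowest = TRUE, labels = FALSE)

# Label each row with the lower and upper edge of its group
my_data$label <- paste(round(breaks[my_data$class], 1),
                       round(breaks[my_data$class + 1], 1),
                       sep = " to ")

sum(is.na(my_data$class))        # number of unassigned rows
head(my_data)
```

The difference from the pretty() version is only where the edges sit: here they start exactly at the smallest value, while pretty() snaps them to round numbers.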

Two Questions:

  • Is it possible to replace "10" with some other number in the future?
  • Is it possible to create a new column with a "rank" variable that assigns a number (e.g. 1 to 10) to each range?

Here is what I thought of for adding a "rank" variable - Is this correct?

set.seed(123)
my_data = data.frame(var1 =  rnorm(100,100,100))
min = min(my_data$var1)
max = max(my_data$var1)

library(tidyverse)
my_data = my_data %>% 
  as_tibble() %>% # just to make it easier to view
  mutate(bins = cut(var1, breaks = pretty(var1, n = (max-min)/10), include.lowest = TRUE)) %>% 
  mutate(rank = dense_rank(bins))

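pretty() just produces a vector of "round" numbers (see the pretty function page on RDocumentation), and cut() uses that vector as its break points. Its n argument is the desired number of intervals, so dividing the range by 10 asks for bins roughly 10 wide. You can change 10 to something else: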

pretty(my_data$var1, n = (max-min)/10)
# [1] -140 -130 -120 -110 -100  -90  -80  -70  -60  -50  -40  -30  -20  -10    0   10   20   30   40   50   60   70   80   90  100  110  120  130  140  150  160  170  180  190  200  210  220  230
# [39]  240  250  260  270  280  290  300  310  320

pretty(my_data$var1, n = (max-min)/20)
# [1] -140 -120 -100  -80  -60  -40  -20    0   20   40   60   80  100  120  140  160  180  200  220  240  260  280  300  320
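
One caveat: n in pretty() is only a target, so the resulting width is not guaranteed to be exactly the number you divide by. If an exact width matters, one option is to build the breaks with seq() instead. The helper below is a hypothetical sketch (bin_by_width is not an existing function in base R or the tidyverse), assuming the edges should sit on multiples of the width.

```r
# Hypothetical helper: bin x into intervals of exactly `width`,
# with edges on multiples of `width`
bin_by_width <- function(x, width) {
  lo <- floor(min(x) / width) * width     # round the lower edge down
  hi <- ceiling(max(x) / width) * width   # round the upper edge up
  if (hi == lo) hi <- lo + width          # guard against a single-point range
  cut(x, breaks = seq(lo, hi, by = width), include.lowest = TRUE)
}

set.seed(123)
x <- rnorm(100, 100, 100)

table(bin_by_width(x, 25))   # same data, bins 25 wide
```

Changing the increment is then just a matter of passing a different width, e.g. bin_by_width(x, 10) or bin_by_width(x, 50).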

I'm not sure what you're after in the rank column. How is the number meant to be assigned? At the moment in your code, there are 33 categories.
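
In case it helps to pin down what "rank" should mean, here are two possible readings, sketched in base R. Both are assumptions about the intent: the first numbers every interval, including empty ones, and the second numbers only the intervals that actually contain data. The names bin_index and bin_rank are made up for this example.

```r
set.seed(123)
var1 <- rnorm(100, 100, 100)

# Breaks of width 10 on round numbers, covering the whole range of var1
breaks <- seq(floor(min(var1) / 10) * 10, ceiling(max(var1) / 10) * 10, by = 10)
bins <- cut(var1, breaks = breaks, include.lowest = TRUE)

# Reading 1: position of the interval among all intervals (gaps are counted)
bin_index <- as.integer(bins)

# Reading 2: consecutive numbers over the occupied intervals only
bin_rank <- as.integer(droplevels(bins))

head(data.frame(var1, bins, bin_index, bin_rank))
```

If the goal is a fixed number of groups such as 1 to 10, neither of these will do it on its own, because the number of bins depends on the increment and on the range of the data.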
