Grouping Data into Ranges

omario · January 24, 2023, 12:18am

I have the following data:

set.seed(123)
my_data = data.frame(var1 =  rnorm(100,100,100))
min = min(my_data$var1)
max = max(my_data$var1)

Here is what I am trying to do:

Starting from the smallest value of var1, I would like to create a variable that groups values of var1 by some fixed increment (e.g. 10) until the maximum value of var1 is reached.
Then, I would like to create another variable that labels each of these groups by the min/max value of that group.

Here is my attempt to do this:

# create a vector of increments
breaks <- seq(min(my_data$var1), max(my_data$var1), by = 10)

# initialize new variables
my_data$class <- NA
my_data$label <- NA

# get the number of breaks
n <- length(breaks)

# Loop 
for (i in 1:(n - 1)) {
    # find which "class" (i.e. break) each value of var1 is located within
    indices <- which(my_data$var1 > breaks[i] & my_data$var1 <= breaks[i + 1])
    
    # make assignment
    my_data$class[indices] <- i
    
    # create labels
    my_data$label[indices] <- paste(breaks[i], breaks[i + 1])
}

The code seems to have run, but I am not sure if this is correct (I don't think I have done this correctly because I see some NA's).

Can someone please show me how to do this correctly?

Thanks!

williaml · January 24, 2023, 12:29am

Hi, how about this?

set.seed(123)
my_data = data.frame(var1 =  rnorm(100,100,100))
min = min(my_data$var1)
max = max(my_data$var1)

library(tidyverse)
my_data %>% 
  as_tibble() %>% # just to make it easier to view
  mutate(bins = cut(var1, breaks = pretty(var1, n = (max-min)/10), include.lowest = TRUE))

# # A tibble: 100 x 2
# var1 bins     
# <dbl> <fct>    
# 1  44.0 (40,50]  
# 2  77.0 (70,80]  
# 3 256.  (250,260]
# 4 107.  (100,110]
# 5 113.  (110,120]
# 6 272.  (270,280]
# 7 146.  (140,150]
# 8 -26.5 (-30,-20]
# 9  31.3 (30,40]  
# 10  55.4 (50,60]  
# # ... with 90 more rows
# # i Use `print(n = ...)` to see more rows
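
As for the NAs in the original loop, they most likely come from two places: the first interval is open on the left, so the minimum itself fails `var1 > breaks[1]`, and `seq()` stops at the last step that fits below the maximum, so anything above that last break is never assigned. A minimal base R sketch of the check and one possible repair (the names `at_min`, `past_last` and `breaks_full` are only for illustration):

```r
set.seed(123)
my_data <- data.frame(var1 = rnorm(100, 100, 100))

breaks <- seq(min(my_data$var1), max(my_data$var1), by = 10)

# Rows the loop cannot place
at_min    <- my_data$var1 <= breaks[1]               # fails `var1 > breaks[1]`
past_last <- my_data$var1 > breaks[length(breaks)]   # seq() stopped short of the maximum
my_data$var1[at_min | past_last]

# One possible repair: add a final break one step further on,
# and close the first interval on the left with include.lowest
breaks_full <- c(breaks, breaks[length(breaks)] + 10)
my_data$label <- cut(my_data$var1, breaks = breaks_full, include.lowest = TRUE)
my_data$class <- as.integer(my_data$label)

anyNA(my_data$class)  # FALSE once every value is covered
```

This keeps the breaks anchored at the smallest value of var1, as in the question, rather than at the rounded values that `pretty()` picks.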

omario · January 24, 2023, 4:42am

williaml:

set.seed(123)
my_data = data.frame(var1 =  rnorm(100,100,100))
min = min(my_data$var1)
max = max(my_data$var1)

library(tidyverse)
my_data %>% 
  as_tibble() %>% # just to make it easier to view
  mutate(bins = cut(var1, breaks = pretty(var1, n = (max-min)/10), include.lowest = TRUE))

Two Questions:

Is it possible to replace "10" with some other number in the future?
Is it possible to create a new column with a "rank" variable that assigns a number (e.g. 1 to 10) to each range?

omario · January 24, 2023, 4:45am

Here is what I thought of for adding a "rank" variable - Is this correct?

set.seed(123)
my_data = data.frame(var1 =  rnorm(100,100,100))
min = min(my_data$var1)
max = max(my_data$var1)

library(tidyverse)
my_data = my_data %>% 
  as_tibble() %>% # just to make it easier to view
  mutate(bins = cut(var1, breaks = pretty(var1, n = (max-min)/10), include.lowest = TRUE)) %>%
  mutate(rank = dense_rank(bins))

williaml · January 24, 2023, 5:32am

pretty() just produces a vector of numbers (see the pretty function page on RDocumentation), so you can change 10 to something else. Each value in that vector is used as an interval boundary in cut().

pretty(my_data$var1, n = (max-min)/10)
# [1] -140 -130 -120 -110 -100  -90  -80  -70  -60  -50  -40  -30  -20  -10    0   10   20   30   40   50   60   70   80   90  100  110  120  130  140  150  160  170  180  190  200  210  220  230
# [39]  240  250  260  270  280  290  300  310  320

pretty(my_data$var1, n = (max-min)/20)
# [1] -140 -120 -100  -80  -60  -40  -20    0   20   40   60   80  100  120  140  160  180  200  220  240  260  280  300  320
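
One caveat: `n` in `pretty()` is only a target number of intervals, and `pretty()` picks steps that are 1, 2 or 5 times a power of 10, so a width such as 7 would not be reproduced exactly. A rough sketch of an alternative, assuming exact widths matter, builds the breaks with `seq()` on multiples of the width (`width` here is a made-up variable, not something from the posts above):

```r
set.seed(123)
my_data <- data.frame(var1 = rnorm(100, 100, 100))
x <- my_data$var1

width <- 7  # hypothetical bin width, chosen only to illustrate the point

# pretty() treats n as a suggestion, so the step it returns is a "round" number
unique(diff(pretty(x, n = diff(range(x)) / width)))

# Breaks at exact multiples of width that cover the whole range of x
breaks <- seq(floor(min(x) / width) * width,
              ceiling(max(x) / width) * width,
              by = width)
my_data$bins <- cut(x, breaks = breaks, include.lowest = TRUE)

table(diff(breaks))   # every step equals width
anyNA(my_data$bins)   # checks that no value is left without a bin
```

For widths like 10 or 20 the two approaches should give the same kind of breaks; the `seq()` version only matters when the width is not a "round" number.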

I'm not sure what you're after in the rank column. How is the number meant to be assigned? At the moment, in your code, there are 33 categories.
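
To make the question concrete, here are two ways a number could be assigned to each range, sketched in base R under the assumption that the bins come from the same `cut()`/`pretty()` call as above (`bin_index` and `bin_rank` are illustrative column names):

```r
set.seed(123)
my_data <- data.frame(var1 = rnorm(100, 100, 100))
rng <- range(my_data$var1)

my_data$bins <- cut(my_data$var1,
                    breaks = pretty(my_data$var1, n = diff(rng) / 10),
                    include.lowest = TRUE)

# Option 1: position of the bin among all levels, empty bins included
my_data$bin_index <- as.integer(my_data$bins)

# Option 2: consecutive numbers over the bins that actually contain data
my_data$bin_rank <- as.integer(droplevels(my_data$bins))

nlevels(my_data$bins)          # bins defined by the breaks
length(unique(my_data$bins))   # bins that contain at least one value
```

As far as I can tell, `dense_rank(bins)` behaves like option 2, because it ranks only the bins that occur in the data. If the numbering should reflect where each range sits on the full scale, option 1 is probably closer to what is wanted.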
