Binning and Aggregating Data

I have the following dataset:

library(dplyr)
library(purrr)
library(ggplot2)
set.seed(123)

my_data1 = data.frame(Weight =  rnorm(500,100,100), asthma = sample(c(0,1), prob = c(0.7,0.3), replace=TRUE, size= 500))
my_data2 = data.frame(Weight = rnorm(500, 200, 50),  asthma = sample(c(0,1), prob = c(0.3,0.7), replace=TRUE, size= 500))
my_data_a = rbind(my_data1, my_data2)
my_data_a$gender =  "male"


my_data1 = data.frame(Weight =  rnorm(500,100,100), asthma = sample(c(0,1), prob = c(0.7,0.3), replace=TRUE, size= 500))
my_data2 = data.frame(Weight = rnorm(500, 200, 50),  asthma = sample(c(0,1), prob = c(0.3,0.7), replace=TRUE, size= 500))
my_data_b = rbind(my_data1, my_data2)
my_data_b$gender =  "female"


my_data = rbind(my_data_a, my_data_b)
my_data$id = 1:2000

My Question: For this dataset, I want to :

  • "Bin" everyone (regardless of gender) in this dataset into "n" bins (e.g. n = 30) in ascending order based on the available weight ranges (e.g. min_weight : min_weight+ 30 = bin_1, min_weigh+ 30 : min_weight+ 60 = bin_2, etc.)
  • Then, find out how many males and females are in each bin
  • Finally, find out the average asthma rate for men and women within each bin

I think the end, I am trying to get an aggregate dataset that has this format:

   bin gender count mean_asthma bin_min bin_max
1 bin1   male   244       0.320     100     130
2 bin1 female   123       0.440     100     130
3 bin2   male   199       0.311     160     190
4 bin2 female   205       0.233     160     190

Here is my attempt:

Final = my_data %>% 
  mutate(bins = cut(Weight , breaks = pretty(Weight , 
   n = (max(Weight)-min(Weight))/30), include.lowest = TRUE),
  rank = dense_rank(bins)) %>% 
 group_by(gender, bins) %>% 
 mutate(min_weight = min(Weight), max_weight = max(Weight), 
   count = n(), mean = mean(asthma)) %>% 
 ungroup

While the above code does not produce the output in the desired format - I am not sure if these numbers I have calculated are correct or incorrect.

Can someone please help me fix this?

Thanks!

you are close; the main issue; is that the final mutate() should have been a summarise()

1 Like

@ Nigrahamuk: Thank you so much for your reply!

Do you mean like this?

Final = my_data %>% 
    mutate(bins = cut(Weight , breaks = pretty(Weight , 
                                               n = (max(Weight)-min(Weight))/30), include.lowest = TRUE),
           rank = dense_rank(bins)) %>% 
    group_by(gender, bins) %>% 
    summarise(min_weight = min(Weight), max_weight = max(Weight), 
           count = n(), mean = mean(asthma)) %>% 
    ungroup

Do you know how I can bring the "rank" variable into the final version? For instance, is this correct?

Final = my_data %>% 
    mutate(bins = cut(Weight , breaks = pretty(Weight , 
                                               n = (max(Weight)-min(Weight))/30), include.lowest = TRUE),
           rank = dense_rank(bins)) %>% 
    group_by(gender, bins) %>% 
    summarise(min_weight = min(Weight), max_weight = max(Weight), 
              count = n(), mean = mean(asthma), rank = max(rank)) %>% 
    ungroup

Thanks!