I have the following dataset:
library(dplyr)
library(purrr)
library(ggplot2)
set.seed(123)
my_data1 = data.frame(Weight = rnorm(500,100,100), asthma = sample(c(0,1), prob = c(0.7,0.3), replace=TRUE, size= 500))
my_data2 = data.frame(Weight = rnorm(500, 200, 50), asthma = sample(c(0,1), prob = c(0.3,0.7), replace=TRUE, size= 500))
my_data_a = rbind(my_data1, my_data2)
my_data_a$gender = "male"
my_data1 = data.frame(Weight = rnorm(500,100,100), asthma = sample(c(0,1), prob = c(0.7,0.3), replace=TRUE, size= 500))
my_data2 = data.frame(Weight = rnorm(500, 200, 50), asthma = sample(c(0,1), prob = c(0.3,0.7), replace=TRUE, size= 500))
my_data_b = rbind(my_data1, my_data2)
my_data_b$gender = "female"
my_data = rbind(my_data_a, my_data_b)
my_data$id = 1:2000
My Question: For this dataset, I want to :
- "Bin" everyone (regardless of gender) in this dataset into "n" bins (e.g. n = 30) in ascending order based on the available weight ranges (e.g. min_weight : min_weight+ 30 = bin_1, min_weigh+ 30 : min_weight+ 60 = bin_2, etc.)
- Then, find out how many males and females are in each bin
- Finally, find out the average asthma rate for men and women within each bin
I think the end, I am trying to get an aggregate dataset that has this format:
bin gender count mean_asthma bin_min bin_max
1 bin1 male 244 0.320 100 130
2 bin1 female 123 0.440 100 130
3 bin2 male 199 0.311 160 190
4 bin2 female 205 0.233 160 190
Here is my attempt:
Final = my_data %>%
mutate(bins = cut(Weight , breaks = pretty(Weight ,
n = (max(Weight)-min(Weight))/30), include.lowest = TRUE),
rank = dense_rank(bins)) %>%
group_by(gender, bins) %>%
mutate(min_weight = min(Weight), max_weight = max(Weight),
count = n(), mean = mean(asthma)) %>%
ungroup
While the above code does not produce the output in the desired format - I am not sure if these numbers I have calculated are correct or incorrect.
Can someone please help me fix this?
Thanks!