Comparing Heights and Weights Across Different Datasets

I have the following dataset that contains the heights and weights of people from Canada - using the value of height (cm), I split weight (kg) into bins based on ntiles, and calculated the average value of var2 within each ntile bin:

library(dplyr)
library(gtools)
set.seed(123)
canada = data.frame(height =  rnorm(10000,150,10), weight = rnorm(10000,90, 10))

Part_1 = canada %>% 
  mutate(quants = quantcut(weight, 100),
         rank = as.numeric(quants)) %>%
  group_by(quants) %>% 
  mutate(min = min(weight), max = max(weight), count = n(), avg_height = mean(height))

Part_1 = Part_1 %>% distinct(rank, .keep_all = TRUE)

> Part_1
# A tibble: 100 x 8
# Groups:   quants [100]
   height weight quants         rank   min   max count avg_height
    <dbl>  <dbl> <fct>         <dbl> <dbl> <dbl> <int>      <dbl>
 1   144.  114.  (110.2,113.9]    99 110.  114.    100       150.
 2   148.   88.3 (88.12,88.38]    44  88.1  88.4   100       149.
 3   166.   99.3 (99.1,99.52]     83  99.1  99.5   100       152.
 4   151.   84.3 (84.14,84.44]    29  84.1  84.4   100       150.

For example, I see that there are 100 people between the weight range of 100.2 - 113.9 kg and the average height of these people is 150 cm

Now, suppose I have a similar dataset for people from the USA:

set.seed(124)
usa = data.frame(height =  rnorm(10000,150,10), weight = rnorm(10000,90, 10))

My Question: Based on the weight ranges I calculated using the Canada dataset - I want to find out how many people from the USA fall within these Canadian ranges and what is the average weight of the Americans within these Canadian ranges

For example:

  • In the Canada dataset, I saw that there are 100 people between the weight range of 100.2 - 113.9 kg and the average height of these people is 150 cm
  • How many Americans are between the weight range of 100.2 - 113.9 kg and what is the average height of these Americans?

I know that I can do this manually for each rank:

americans_in_canadian_rank99 = usa %>% 
  filter(weight > 110.2 & weight < 113.9) %>% 
  group_by() %>% 
  summarize(count = n(), avg_height = mean(height))


   americans_in_canadian_rank44 = usa %>% 
      filter(weight > 88.1 & weight < 88.4) %>% 
      group_by() %>% 
      summarize(count = n(), avg_height = mean(height))

In the end, I would be looking for something a desired output like this:

  canadian_rank min_weight max_weight canadian_count canadian_avg_height american_count american_avg_height
1            99      110.2      113.9            100                 150            116                 150
2            44       88.1       88.4            100                 149            154                 150

Can someone please help me figure out a better way to do this?

Thanks!

Hi there,

Here is my implementation for your issue

library(dplyr)
library(gtools)

set.seed(123)
canada = data.frame(height =  rnorm(10000,150,10), weight = rnorm(10000,90, 10))

Part_1 = canada %>% 
  mutate(quants = quantcut(weight, 100),
         rank = as.numeric(quants)) %>%
  group_by(quants) %>% 
  mutate(min = min(weight), max = max(weight), count = n(), avg_height = mean(height))

Part_1 = Part_1 %>% distinct(rank, .keep_all = TRUE)


set.seed(124)
usa = data.frame(height =  rnorm(10000,150,10), weight = rnorm(10000,90, 10))


#Find the Canadian rank for each of the the American weights

#Get the center of each quantile range in the Canadian dataset
#Sort from low to high so rank 1 is first 
y = sort((Part_1$min + Part_1$max)/2)

#For each of the American weights, find the closest quantile by taking the 
#absolute difference with each range center (y) and returning the index of the
#smallest value (which equals the rank since y was sorted)
usa$rank = sapply(usa$weight, function(x){
  x = abs(x - y)
  which(x == min(x))
})

#If the weights are outside the Canadian intervals, replace the rank with NA
#They were assigned the highest or lowest rank before
usa = usa %>% mutate(rank = ifelse(between(weight, min(Part_1$min), max(Part_1$max)), rank, NA))
head(usa)
#>     height   weight rank
#> 1 136.1493 70.15438    3
#> 2 150.3832 95.21628   71
#> 3 142.3697 84.79760   31
#> 4 152.1231 95.80216   73
#> 5 164.2554 85.46808   34
#> 6 157.4448 75.43867    8

#Summarise the American stats (note that NA combines both extremes)
usa %>% group_by(rank) %>% summarise(n = n(), avg_weight = mean(weight))
#> # A tibble: 101 × 3
#>     rank     n avg_weight
#>    <int> <int>      <dbl>
#>  1     1    67       61.6
#>  2     2   147       67.4
#>  3     3   111       70.1
#>  4     4   103       71.7
#>  5     5   106       73.0
#>  6     6   106       73.8
#>  7     7    84       74.7
#>  8     8    93       75.4
#>  9     9    93       76.2
#> 10    10    86       76.8
#> # … with 91 more rows
#> # ℹ Use `print(n = ...)` to see more rows

Created on 2023-01-26 by the reprex package (v2.0.1)

Hope this helps,
PJ

1 Like

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.