Pivoting in Data.Table : Percentages Greater Than 100?

I have the following dataset:

set.seed(123)
gender <- c("Male","Female")
gender <- sample(gender, 5000, replace=TRUE, prob=c(0.45, 0.55))
gender <- as.factor(gender)


status <- c("Immigrant","Citizen")
status <- sample(status, 5000, replace=TRUE, prob=c(0.3, 0.7))
status  <- as.factor(status )


country <- c("A", "B", "C", "D")
country <- sample(country, 5000, replace=TRUE, prob=c(0.25, 0.25, 0.25, 0.25))
country  <- as.factor(country)



################

disease <- c("Yes","No")
disease <- sample(disease, 5000, replace=TRUE, prob=c(0.4, 0.6))
disease <- as.factor(disease)

###################
my_data = data.frame(gender, status, disease, country)

I learned how to make a summary table that shows the (relative) disease rates for each unique combination of factors:

library(tidyverse)
my_data %>% group_by (gender, status, country, disease) %>%
  summarise (n=n()) %>%
  mutate(rel.freq = paste0(round(100 * n/sum(n), 0), "%")) -> step_1
#> `summarise()` has grouped output by 'gender', 'status', 'country'. You can
#> override using the `.groups` argument.

step_1 |>group_by(country) |>  
  pivot_wider(names_from = disease, 
              values_from = c(n:rel.freq), 
              names_prefix = "disease_") |> 
  mutate(overallPerc = (n_disease_No + n_disease_Yes)/sum(step_1$n))

  gender status    country n_disease_No n_disease_Yes rel.freq_disease_No rel.freq_disease_Yes overallPerc
   <fct>  <fct>     <fct>          <int>         <int> <chr>               <chr>                      <dbl>
 1 Female Citizen   A                308           200 61%                 39%                       0.102 
 2 Female Citizen   B                291           169 63%                 37%                       0.092 
 3 Female Citizen   C                301           228 57%                 43%                       0.106 
 4 Female Citizen   D                245           189 56%                 44%        

Now, I am trying to convert this above code into "data.table" format to increase the efficiency of this code.

Here is my attempt:

library(data.table)

my_data <- as.data.table(my_data)

step_1 <- my_data[, .(n = .N), by = c("gender", "status", "country", "disease")][, rel.freq := paste0(round(100 * n/.N, 0), "%")]

step_1 <- dcast(step_1, gender + status + country ~ disease, value.var = c("n", "rel.freq"))[, overallPerc := (n_No + n_Yes)/.N]


    gender    status country n_No n_Yes rel.freq_No rel.freq_Yes overallPerc
 1: Female   Citizen       A  308   200        962%         625%     31.7500
 2: Female   Citizen       B  291   169        909%         528%     28.7500
 3: Female   Citizen       C  301   228        941%         712%     33.0625
 4: Female   Citizen       D  245   189        766%         591%     27.1250

However, many of these percentages are greater than 100 - can someone please show me how I can resolve this problem?

Thanks!

Is this what you want?

a <- step_1 |>group_by(country) |>  
  pivot_wider(names_from = disease, 
              values_from = c(n:rel.freq), 
              names_prefix = "disease_") |> 
  mutate(overallPerc = (n_disease_No + n_disease_Yes)/sum(step_1$n)) 

a %>% 
  group_by(gender, status, country) %>% 
  summarise(nno = sum(n_disease_No),
            nyes = sum(n_disease_Yes),
            ovperc = sum(overallPerc)*100,
            .groups = "drop"
  )
# A tibble: 16 × 6
   gender status    country   nno  nyes ovperc
   <fct>  <fct>     <fct>   <int> <int>  <dbl>
 1 Female Citizen   A         308   200  10.2 
 2 Female Citizen   B         291   169   9.2 
 3 Female Citizen   C         301   228  10.6 
 4 Female Citizen   D         245   189   8.68
 5 Female Immigrant A         107    95   4.04
 6 Female Immigrant B         126    76   4.04
 7 Female Immigrant C         137    70   4.14
 8 Female Immigrant D         129    74   4.06
 9 Male   Citizen   A         237   167   8.08
10 Male   Citizen   B         247   163   8.2 
11 Male   Citizen   C         250   171   8.42
12 Male   Citizen   D         230   139   7.38
13 Male   Immigrant A         103    68   3.42
14 Male   Immigrant B         117    63   3.6 
15 Male   Immigrant C          93    53   2.92
16 Male   Immigrant D         102    52   3.08

Maybe

step_1 <- my_data[, .(n = .N), bys = c("gender", "status", "country", "disease")][, rel.freq :=  n / nrow(my_data)]

This topic was automatically closed 42 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.