Sample size calculation for two independent proportions

MetinBulus · November 10, 2023, 8:42am

There are several methods available for determining the sample size with unequal allocation in the context of an experimental design. Nevertheless, it is generally advisable to avoid an uneven allocation unless there are compelling reasons. Below, I will use the R package {pwrss} (https://pwrss.shinyapps.io/index/) to demonstrate various approaches that can be used to answer your question:

1. Proportion Difference

Change the kappa argument below for an unbalanced allocation to treatment and control groups. kappa is the allocation ratio n1 / n2, so kappa = 1 gives equal group sizes.

## Segment A
pwrss.z.2props(p1 = 0.014, p2 = 0.009, power = 0.80,
               kappa = 1)

#>  Approach: Normal Approximation 
#>  Difference between Two Proportions 
#>  (Independent Samples z Test) 
#>  H0: p1 = p2 
#>  HA: p1 != p2 
#>  ------------------------------ 
#>   Statistical power = 0.8 
#>   n1 = 7135 
#>   n2 = 7135 
#>  ------------------------------ 
#>  Alternative = "not equal" 
#>  Non-centrality parameter = 2.802 
#>  Type I error rate = 0.05 
#>  Type II error rate = 0.2 

## Segment B
pwrss.z.2props(p1 = 0.007, p2 = 0.0029, power = 0.80,
               kappa = 1)

#>   Approach: Normal Approximation 
#>   Difference between Two Proportions 
#>   (Independent Samples z Test) 
#>   H0: p1 = p2 
#>   HA: p1 != p2 
#>   ------------------------------ 
#>    Statistical power = 0.8 
#>    n1 = 4596 
#>    n2 = 4596 
#>   ------------------------------ 
#>   Alternative = "not equal" 
#>   Non-centrality parameter = 2.802 
#>   Type I error rate = 0.05 
#>   Type II error rate = 0.2
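
To see how kappa enters the calculation, here is a minimal base R sketch of the textbook normal-approximation formula with unpooled variances. The helper name n_two_props is hypothetical, and the rounding rule is an assumption; pwrss.z.2props may differ in such details.

```r
## Hypothetical helper: textbook normal-approximation sample size for
## two independent proportions, with kappa = n1 / n2 (unpooled variances)
n_two_props <- function(p1, p2, kappa = 1, power = 0.80, alpha = 0.05) {
  z_alpha <- qnorm(1 - alpha / 2)  # two-sided critical value
  z_beta  <- qnorm(power)          # quantile matching the target power
  ## Var(p1.hat - p2.hat) = (p1 * (1 - p1) / kappa + p2 * (1 - p2)) / n2
  var_sum <- p1 * (1 - p1) / kappa + p2 * (1 - p2)
  n2 <- ceiling((z_alpha + z_beta)^2 * var_sum / (p1 - p2)^2)
  ## assumption: round n2 up first, then scale by kappa and round up again
  c(n1 = ceiling(kappa * n2), n2 = n2)
}

n_two_props(p1 = 0.014, p2 = 0.009, kappa = 1)  # Segment A, equal allocation
n_two_props(p1 = 0.014, p2 = 0.009, kappa = 2)  # twice as many units in group 1
```

With kappa = 1 the sketch should agree with the per-group sizes reported above. With kappa = 2 the total n1 + n2 is larger than in the balanced design, which is the cost of uneven allocation.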

2. Logistic Regression

Change the prob argument below for an unbalanced allocation to treatment and control groups. prob is the sampling rate, that is, the ratio n1 / (n1 + n2). Also note that the reported n is the total sample size, n = n1 + n2.

## Segment A
pwrss.z.logreg(p0 = 0.009, p1 = 0.014, power = 0.80, 
               distribution = list(dist = "bernoulli", prob = 0.50))

#>   Logistic Regression Coefficient 
#>   (Large Sample Approx. Wald's z Test) 
#>   H0: beta1 = 0 
#>   HA: beta1 != 0 
#>   Distribution of X = 'bernoulli' 
#>   Method = DEMIDENKO(VC) 
#>   ------------------------------ 
#>    Statistical power = 0.8 
#>    n = 14333 
#>   ------------------------------ 
#>   Alternative = "not equal" 
#>   Non-centrality parameter = 2.785 
#>   Type I error rate = 0.05 
#>   Type II error rate = 0.2 

## Segment B
pwrss.z.logreg(p0 = 0.007, p1 = 0.0029, power = 0.80,
               distribution = list(dist = "bernoulli", prob = 0.50))

#>   Logistic Regression Coefficient 
#>   (Large Sample Approx. Wald's z Test) 
#>   H0: beta1 = 0 
#>   HA: beta1 != 0 
#>   Distribution of X = 'bernoulli' 
#>   Method = DEMIDENKO(VC) 
#>    ------------------------------ 
#>    Statistical power = 0.8 
#>    n = 9369 
#>    ------------------------------ 
#>   Alternative = "not equal" 
#>   Non-centrality parameter = -2.738 
#>   Type I error rate = 0.05 
#>   Type II error rate = 0.2
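
Because kappa = n1 / n2 and prob = n1 / (n1 + n2), the two are linked by prob = kappa / (1 + kappa). The base R sketch below converts between them and splits a total n into group sizes. All three helper names are hypothetical, and rounding n1 up is an assumption.

```r
## kappa = n1 / n2 and prob = n1 / (n1 + n2), so prob = kappa / (1 + kappa)
kappa_to_prob <- function(kappa) kappa / (1 + kappa)  # hypothetical helper
prob_to_kappa <- function(prob) prob / (1 - prob)     # hypothetical helper

kappa_to_prob(1)  # 0.5, the balanced design used above
kappa_to_prob(2)  # 2/3, twice as many units in group 1

## Hypothetical helper: split a total n into group sizes for a given prob
## (assumption: n1 is rounded up and n2 takes the remainder)
split_total <- function(n, prob) {
  n1 <- ceiling(n * prob)
  c(n1 = n1, n2 = n - n1)
}

split_total(14333, prob = 0.50)  # Segment A total from the output above
```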

3. Chi-square Test

I am not sure how to incorporate unequal allocation here. I am including it, as per your question, in case you choose to conduct a chi-square test.

## Segment A
## create a 2 x 2 table of purchase probabilities; each column (group) sums to 1
cell.probs <- rbind(c(0.009, 0.014),
                    c(1 - 0.009, 1 - 0.014))

colnames(cell.probs) <- c("Control", "Treatment")
rownames(cell.probs) <- c("Purchased (Yes)", "Purchased (No)")
cell.probs

#>                  Control Treatment
#>  Purchased (Yes)   0.009    0.014
#>  Purchased (No)    0.991    0.986

## find the total sample size
pwrss.chisq.gofit(p1 = cell.probs, power = 0.80)

#>  Pearson's Chi-square Goodness-of-fit Test 
#>  for Contingency Tables 
#>   ------------------------------ 
#>   Statistical power = 0.8 
#>   Total n = 14276 
#>   ------------------------------ 
#>  Degrees of freedom = 1 
#>  Non-centrality parameter = 7.849
#>  Type I error rate = 0.05 
#>  Type II error rate = 0.2 

## Segment B
## create a 2 x 2 table of purchase probabilities; each column (group) sums to 1
cell.probs <- rbind(c(0.007, 0.0029),
                    c(1 - 0.007, 1 - 0.0029))

colnames(cell.probs) <- c("Control", "Treatment")
rownames(cell.probs) <- c("Purchased (Yes)", "Purchased (No)")
cell.probs

#>                 Control Treatment
#> Purchased (Yes)   0.007    0.0029
#> Purchased (No)    0.993    0.9971

## find the total sample size
pwrss.chisq.gofit(p1 = cell.probs, power = 0.80)

#>  Pearson's Chi-square Goodness-of-fit Test 
#>  for Contingency Tables 
#>  ------------------------------ 
#>   Statistical power = 0.8 
#>   Total n = 9200 
#>  ------------------------------ 
#>  Degrees of freedom = 1 
#>  Non-centrality parameter = 7.849 
#>  Type I error rate = 0.05 
#>  Type II error rate = 0.2
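
For a sense of where these totals come from, here is a minimal base R sketch that computes Cohen's w from the table and solves for the non-centrality parameter of the chi-square test. It is an illustration under stated assumptions, not the {pwrss} implementation: the helper name n_chisq and its alloc argument are hypothetical, and weighting the columns by alloc is only one possible way to think about unequal allocation in this setting.

```r
## Hypothetical helper: total n for a chi-square test via Cohen's w.
## cell.probs: within-group probabilities (each column sums to 1)
## alloc: assumed share of units in each column (group); must sum to 1
n_chisq <- function(cell.probs, alloc = c(0.5, 0.5),
                    power = 0.80, alpha = 0.05) {
  joint <- sweep(cell.probs, 2, alloc, "*")          # joint cell probabilities
  expected <- outer(rowSums(joint), colSums(joint))  # cells under independence
  w2 <- sum((joint - expected)^2 / expected)         # squared Cohen's w
  df <- (nrow(joint) - 1) * (ncol(joint) - 1)
  crit <- qchisq(1 - alpha, df)                      # critical value under H0
  ## non-centrality parameter that attains the requested power
  ncp <- uniroot(function(l) pchisq(crit, df, ncp = l, lower.tail = FALSE) - power,
                 interval = c(0.01, 100), tol = 1e-10)$root
  ceiling(ncp / w2)                                  # total n, since ncp = n * w^2
}

seg_a <- rbind(c(0.009, 0.014), c(1 - 0.009, 1 - 0.014))
n_chisq(seg_a)                           # equal allocation
n_chisq(seg_a, alloc = c(1 / 3, 2 / 3))  # assumed 1:2 control-to-treatment split
```

With the default equal weights, the sketch should land on the same totals as the output above for both segments.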