Let's say df presents an aggregated metric from an A/B test with groups A and B. x is, for example, the number of page visits, and n is the number of users with that number of visits. (In reality, there are far more users and the differences are small.) Note that the groups have different numbers of users.
library(tidyverse)
df <- bind_rows(
  tibble(group = "A", x = rpois(100, 1)),
  tibble(group = "B", x = rpois(200, 2))
) %>%
  count(group, x)
I want to compare tiles of users. By a tile, I mean the users in group A that share the same x value. For example, if 34.17% of users in group A have the value 0, I want to compare them to the average x of the lowest 34.17% of users in group B. Next, if users with 1 visit in group A fall between the 34.17th and 74.8th percentiles, I want to compare them with the users in group B in that same percentile range (ideally with more precise boundaries), and so on.
Here's my attempt:
n_fake <- 1000
df_agg_per_imp <- df %>%
  group_by(group) %>%
  mutate(
    p_max = n_fake * cumsum(n) / sum(n),
    p_min = lag(p_max, default = 0),
    p = map2(p_min + 1, p_max, seq)
  ) %>%
  ungroup()
df_agg_per_imp %>%
  unnest(p) %>%
  pivot_wider(id_cols = p, names_from = group, values_from = x) %>%
  group_by(A) %>%
  summarise(
    p_min = min(p) / n_fake,
    p_max = max(p) / n_fake,
    rel_uplift = mean(B) / mean(A)
  )
#> # A tibble: 6 × 4
#>       A p_min p_max rel_uplift
#>   <int> <dbl> <dbl>      <dbl>
#> 1     0 0.001  0.34     Inf
#> 2     1 0.341  0.74       1.92
#> 3     2 0.741  0.91       1.57
#> 4     3 0.911  0.96       1.33
#> 5     4 0.961  0.99       1.21
#> 6     5 0.991  1          1.2
What I don't like is that I have to create a row for each user (and this could be millions) to get the results I want. Is there a simpler or better way to do it?
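To make the tile definition concrete, here is a minimal sketch of one possible reading of it: treat each x value as an interval of cumulative user share within its group, then average group B's x over each group A interval, weighting by how much the intervals overlap. The toy counts and the names toy, tiles, overlap, mean_b and toy_result are hypothetical and only for illustration; this is an assumption about what "same percentile" should mean, not a confirmed solution.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy counts (not the simulated df above)
toy <- tibble(
  group = c("A", "A", "B", "B", "B"),
  x     = c(0, 1, 0, 1, 2),
  n     = c(1, 1, 1, 1, 2)
)

# Interval of cumulative user share (p_min, p_max] covered by each x value
tiles <- toy %>%
  group_by(group) %>%
  arrange(x, .by_group = TRUE) %>%
  mutate(
    p_max = cumsum(n) / sum(n),
    p_min = lag(p_max, default = 0)
  ) %>%
  ungroup()

a <- tiles %>%
  filter(group == "A") %>%
  select(x_a = x, a_min = p_min, a_max = p_max)
b <- tiles %>%
  filter(group == "B") %>%
  select(x_b = x, b_min = p_min, b_max = p_max)

# Pair every A tile with every B tile; the size depends on the number of
# distinct x values per group, not on the number of users
toy_result <- crossing(a, b) %>%
  mutate(overlap = pmax(0, pmin(a_max, b_max) - pmax(a_min, b_min))) %>%
  group_by(x_a, a_min, a_max) %>%
  summarise(mean_b = sum(x_b * overlap) / sum(overlap), .groups = "drop") %>%
  mutate(rel_uplift = mean_b / x_a)
```

On these toy counts, the A tile with x = 0 covers the lowest 50% of users and overlaps B's x = 0 and x = 1 tiles equally, so mean_b is 0.5; the A tile with x = 1 covers the upper 50% and overlaps only B's x = 2 tile, so mean_b is 2. No per-user rows are created, and the boundaries are exact shares rather than multiples of 1 / n_fake.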