df presents an aggregated metric in an A/B test with groups A and B.
x is, for example, the number of page visits, and
n is the number of users with that number of visits. (In reality, there are far more users and the differences are small.) Note that the number of users differs between the groups.
    library(tidyverse)

    df <- bind_rows(
      tibble(group = "A", x = rpois(100, 1)),
      tibble(group = "B", x = rpois(200, 2))
    ) %>%
      count(group, x)
I want to compare tiles of users. By tile, I mean the users in group A that have the same x value. For example, if 34.17% of users in group A have value 0, I want to compare them to the average x of the lowest 34.17% of users in group B. Next, users with 1 visit in group A lie between the 34.17th and 74.8th percentiles, so I want to compare them with the users in the same percentile range in group B (but the match should be more precise). And so on.
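To make the definition concrete, here is a minimal sketch with made-up counts (the numbers and the names a, b, b_sorted and b_mean_first_tile are hypothetical, chosen only so the arithmetic is easy to check; they are not the data above). It uses base R and expands group B to one element per user, which is only reasonable for a toy example this small.

```r
# Hypothetical counts, chosen only to make the arithmetic easy to follow
a <- data.frame(x = 0:2, n = c(40, 40, 20))      # 100 users in group A
b <- data.frame(x = 0:3, n = c(40, 80, 60, 20))  # 200 users in group B

# Lower and upper percentile bound of each tile in group A
a$p_max <- cumsum(a$n) / sum(a$n)
a$p_min <- c(0, head(a$p_max, -1))

# Group B users sorted by x, one element per user (toy example only)
b_sorted <- rep(b$x, times = b$n)

# Positions in sorted group B that fall inside the first tile of group A
first <- a[1, ]
idx <- seq(
  round(first$p_min * length(b_sorted)) + 1,
  round(first$p_max * length(b_sorted))
)

# Mean x in group B over the percentile range of the first tile of A
b_mean_first_tile <- mean(b_sorted[idx])
```

With these counts, the first tile of group A (x = 0) covers the lowest 40% of users. The lowest 40% of group B is 80 users, 40 with x = 0 and 40 with x = 1, so the mean to compare against is 0.5.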
Here's my try:
    n_fake <- 1000

    df_agg_per_imp <- df %>%
      group_by(group) %>%
      mutate(
        p_max = n_fake * cumsum(n) / sum(n),
        p_min = lag(p_max, default = 0),
        p = map2(p_min + 1, p_max, seq)
      ) %>%
      ungroup()

    df_agg_per_imp %>%
      unnest(p) %>%
      pivot_wider(id_cols = p, names_from = group, values_from = x) %>%
      group_by(A) %>%
      summarise(
        p_min = min(p) / n_fake,
        p_max = max(p) / n_fake,
        rel_uplift = mean(B) / mean(A)
      )
    #> # A tibble: 6 × 4
    #>       A p_min p_max rel_uplift
    #>   <int> <dbl> <dbl>      <dbl>
    #> 1     0 0.001  0.34     Inf
    #> 2     1 0.341  0.74       1.92
    #> 3     2 0.741  0.91       1.57
    #> 4     3 0.911  0.96       1.33
    #> 5     4 0.961  0.99       1.21
    #> 6     5 0.991  1          1.2
What I don't like is that I have to create a row for each user (and there could be millions of them) to get the results I want. Is there a simpler or better way to do it?