Tidy way of comparing "tiles" of users

Let's say df present aggregated metric in AB test with groups A and B. x is for example number of page visits, n number of users with this number of visits. (In reality, there are way more users and differences are small). Note that there's different number of users per group.

library(tidyverse)

df <- bind_rows(
    tibble(group = "A", x = rpois(100, 1)),
    tibble(group = "B", x = rpois(200, 2))
) %>% 
    count(group, x)

I want to compare tiles of users. By tile, I mean users in group A that have the same x value. For example, I if 34.17% of users in group A has value 0, I want to compare it to average number of x for the lowest 34.17% of users in group B. Next, for example, users with 1 visits in group A are between 34.17% and 74.8% - I want to compare them with the same percentile (but should be more precise) users in group B. Etc...

Here's my try:

n_fake <- 1000

df_agg_per_imp <- df %>% 
    group_by(group) %>% 
    mutate(
        p_max = n_fake * cumsum(n) / sum(n),
        p_min = lag(p_max, default = 0),
        p = map2(p_min + 1, p_max, seq)
    ) %>%
    ungroup()

df_agg_per_imp %>% 
    unnest(p) %>% 
    pivot_wider(id_cols = p, names_from = group, values_from = x) %>% 
    group_by(A) %>% 
    summarise(
        p_min = min(p) / n_fake,
        p_max = max(p) / n_fake,
        rel_uplift = mean(B) / mean(A)
    )

#> # A tibble: 6 × 4
#>       A p_min p_max rel_uplift
#>   <int> <dbl> <dbl>      <dbl>
#> 1     0 0.001  0.34     Inf   
#> 2     1 0.341  0.74       1.92
#> 3     2 0.741  0.91       1.57
#> 4     3 0.911  0.96       1.33
#> 5     4 0.961  0.99       1.21
#> 6     5 0.991  1          1.2

What I don't like is that I have to create row for each user (and this could be millions) to get the results I want. Is there simpler/better way to do it?

Do you mean like this:

suppressPackageStartupMessages(
  suppressWarnings(
    {
      library(tibble)
      library(dplyr)
      library(purrr)
      library(tidyr)
    }
  )
)

set.seed(2022)

df <- bind_rows(
    tibble(group = "A", x = rpois(100, 1)),
    tibble(group = "B", x = rpois(200, 2))
) %>% 
  count(group, x) %>%
  group_by(group) %>%
  mutate(perc = n / sum(n)) %>%
  ungroup() %>%
  print()
#> # A tibble: 14 x 4
#>    group     x     n  perc
#>    <chr> <int> <int> <dbl>
#>  1 A         0    29 0.29 
#>  2 A         1    43 0.43 
#>  3 A         2    25 0.25 
#>  4 A         3     1 0.01 
#>  5 A         4     1 0.01 
#>  6 A         5     1 0.01 
#>  7 B         0    29 0.145
#>  8 B         1    47 0.235
#>  9 B         2    51 0.255
#> 10 B         3    38 0.19 
#> 11 B         4    25 0.125
#> 12 B         5     4 0.02 
#> 13 B         6     3 0.015
#> 14 B         7     3 0.015

df2 <- df %>%
  pivot_wider(id_cols = x, names_from = group, values_from = perc) %>%
  mutate(B_A = B / A) %>%
  print()
#> # A tibble: 8 x 4
#>       x     A     B    B_A
#>   <int> <dbl> <dbl>  <dbl>
#> 1     0  0.29 0.145  0.5  
#> 2     1  0.43 0.235  0.547
#> 3     2  0.25 0.255  1.02 
#> 4     3  0.01 0.19  19    
#> 5     4  0.01 0.125 12.5  
#> 6     5  0.01 0.02   2    
#> 7     6 NA    0.015 NA    
#> 8     7 NA    0.015 NA
Created on 2022-03-15 by the reprex package (v2.0.1)

No, please look at the printed result at the end of my post.
To repeat, lowest 34 percentiles of users in A has x value 0. I want to also take the lowest 34 percentiles of users in B and compare their average x to A. Then, users from A with x=1 are between 34 and 74 percentile. Their mean(x)=1. I want to calculate mean(x) for users in B between 34 and 74 percentile. Etc...

library(tidyverse)
set.seed(42)
df <- bind_rows(
  tibble(group = "A", x = rpois(100, 1)),
  tibble(group = "B", x = rpois(200, 2))
) %>% 
  count(group, x) 

astat <- df %>% filter(group=="A") %>% mutate(sum=sum(n),
                                              csum=cumsum(n),
                                              cprop=csum/sum,
                                              lcprop = lag(cprop,default = 0))
raw_df <- uncount(df,weights = n)
raw_df_b <- filter(raw_df,
                   group=="B")

blength <- nrow(raw_df_b)


(fin <- bind_cols(astat,
                  b_mean = map2_dbl(.x = astat$lcprop,
                                    .y =  astat$cprop,
                                    ~ {
                                      b_low <- blength*.x
                                      b_hi  <- blength*.y
  cat(paste("finding mean of b between",b_low, "&", b_hi,"\n"))
  slice(raw_df_b,
   b_low:b_hi
  ) %>%
    pull(x) %>%
    mean()
})))

Then maybe this:

suppressPackageStartupMessages(
  suppressWarnings(
    {
      library(tibble)
      library(dplyr)
      library(purrr)
      library(tidyr)
    }
  )
)

set.seed(2022)

df <- bind_rows(
    tibble(group = "A", x = rpois(100, 1)),
    tibble(group = "B", x = rpois(200, 2))
) %>% 
  arrange(group,x)

df1 <- df %>% 
  filter(group=="A") %>%
  count(x) %>%
  mutate(perc = n / sum(n) ,
         hperc = cumsum(perc)) %>%
  print()
#> # A tibble: 6 x 4
#>       x     n  perc hperc
#>   <int> <int> <dbl> <dbl>
#> 1     0    29  0.29  0.29
#> 2     1    43  0.43  0.72
#> 3     2    25  0.25  0.97
#> 4     3     1  0.01  0.98
#> 5     4     1  0.01  0.99
#> 6     5     1  0.01  1

df2 <- df %>% 
  filter(group=="B") %>%
  mutate(cperc = row_number() / n() ,
         grp=cut(cperc,c(0,df1$hperc),labels=F)-1) %>%
  group_by(grp) %>%
  summarise(meanx=mean(x)) %>%
  print()
#> # A tibble: 6 x 2
#>     grp meanx
#>   <dbl> <dbl>
#> 1     0  0.5 
#> 2     1  1.99
#> 3     2  3.66
#> 4     3  6   
#> 5     4  6.5 
#> 6     5  7
Created on 2022-03-15 by the reprex package (v2.0.1)

Here's an approach using fuzzyjoin to perform a non-equi join between the data and itself, where I've pre-computed the cumulative means for each group.

library(tidyverse); library(fuzzyjoin)

df_share <- df %>%
  uncount(n) %>%
  group_by(group) %>%
  mutate(obs = row_number(),
         share = obs / n(),
         cuml_avg = cummean(x)) %>%
  ungroup()

df_share %>%
  filter(x != lag(x, default = -1)) %>%   # take first occasion for each x
  fuzzy_left_join(df_share, by = c("group", "share"), match_fun = list(`!=`, `>=`)) %>%
  group_by(group.x, x.x, obs.x, share.x, group.y) %>%
  slice_max(share.y) %>% ungroup()

Result

We see that the first observation in A (#1/100) represents the first percentile. This is matched with the second observation in B (#2/200).

In my data, which I produced using set.seed(0) at the outset, the first observation in A with x of 1 is #32, at percentile 0.32, with a cumulative mean x of 0.0312, or 1/32th. This is matched with the observation 64 in B, which has a cumulative mean of 0.594, since 38 of the first 64 values in B have an x of 1.

# A tibble: 11 × 10
   group.x   x.x obs.x share.x cuml_avg.x group.y   x.y obs.y share.y cuml_avg.y
   <chr>   <int> <int>   <dbl>      <dbl> <chr>   <int> <int>   <dbl>      <dbl>
 1 A           0     1   0.01      0      B           0     2    0.01      0    
 2 A           1    32   0.32      0.0312 B           1    64    0.32      0.594
 3 A           2    73   0.73      0.589  B           2   146    0.73      1.22 
 4 A           3    97   0.97      0.948  B           5   194    0.97      1.78 
 5 A           4   100   1         1.02   B           6   200    1         1.89 
 6 B           1    27   0.135     0.0370 A           0    13    0.13      0    
 7 B           2    89   0.445     0.719  A           1    44    0.44      0.295
 8 B           3   147   0.735     1.23   A           2    73    0.73      0.589
 9 B           4   175   0.875     1.52   A           2    87    0.87      0.816
10 B           5   192   0.96      1.74   A           2    96    0.96      0.927
11 B           6   198   0.99      1.85   A           3    99    0.99      0.990

I think I didn't explain it good enough... I want to get the same result as I got BUT at the same time avoid having one row per user during the calculation since in reality there will be millions of users.

Then please explain better: where are these users?
I understood they belong to group A or B and the calculations are done on group level (not on user level) ??

I understood they belong to group A or B and the calculations are done on group level (not on user level) ??
You understood it correctly.

I'm just afraid of the potential performance issues due to generating large numbers of rows (up to 10 million) and would like to avoid this step if possible.

At the end, I would like to have the same tibble as I have.

This topic was automatically closed 21 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.