Calculate difference in column based on every combination of other column

robinsones · January 4, 2018, 8:40pm

I have some results for ab tests in a dataframe, recording the percentage of visitors with different events in different variants. For each variant combination of a test on a given date, I would like to know the change in the percentage of visitors with that event.

See below for a toy example; original_df is what I have, and I would like to get to desired result. I've tried widyr and tidyr in combination with dplyr, but nothing quite works.

library(tibble)
#> Warning: package 'tibble' was built under R version 3.4.1
original_df <- tribble(
  ~date, ~ab_test,  ~ab_variant, ~event_name, ~perc_w_event,
  "10-09-2017", "cool_test",  "old", "clicks", .1,
  "10-09-2017", "cool_test", "new",  "clicks", .2,
  "10-09-2017", "cool_test", "third_variant",  "clicks", .5,
  "10-09-2017", "awesome_test", "off", "clicks", .3,
  "10-09-2017", "awesome_test", "on",  "clicks", .4,
  "10-10-2017", "awesome_test", "off",  "clicks", .6,
  "10-10-2017", "awesome_test", "on",  "clicks", .8

)

desired_result <- tribble(
  ~date, ~ab_test,  ~variant_1, ~variant_2, ~event_name, ~perc_change,
  "10-09-2017", "cool_test",  "old", "new", "clicks", 100,
  "10-09-2017", "cool_test", "old", "third_variant",  "clicks", 400,
  "10-09-2017", "cool_test", "new", "third_variant",  "clicks", 150,
  "10-09-2017", "awesome_test", "off", "on", "clicks", 33.33,
  "10-10-2017", "awesome_test", "off", "on", "clicks", 33.33

)

scottbrenstuhl · January 4, 2018, 9:21pm

I don't know if this is the "best" way to do it but I think it should get the job done:

library(dplyr)

original_df %>%
    inner_join(original_df, 
               by = c("date", "ab_test", "event_name")) %>%
    filter(ab_variant.x != ab_variant.y) %>%
    mutate(perc_change = ((perc_w_event.y-perc_w_event.x)/perc_w_event.x)*100,
           perc_change = round(perc_change,2)) %>%
    filter(perc_change >= 0) %>%
    select(date,
           ab_test, 
           variant_1 = ab_variant.x, 
           variant_2 = ab_variant.y,
           event_name,
           perc_change)

Output:

Thanks for posting! I love taking a break to try something with a clearly defined goal.

nick · January 4, 2018, 9:25pm

Going a slightly different route that assumes that you want percent change based on order instead of which one was greater, I came up with the following. Similar logic, though (didn't see @scottbrenstuhl's until I was done!).

suppressPackageStartupMessages(library(tidyverse))

original_df <- tribble(
  ~date, ~ab_test,  ~ab_variant, ~event_name, ~perc_w_event,
  "10-09-2017", "cool_test",  "old", "clicks", .1,
  "10-09-2017", "cool_test", "new",  "clicks", .2,
  "10-09-2017", "cool_test", "third_variant",  "clicks", .5,
  "10-09-2017", "awesome_test", "off", "clicks", .3,
  "10-09-2017", "awesome_test", "on",  "clicks", .4,
  "10-10-2017", "awesome_test", "off",  "clicks", .6,
  "10-10-2017", "awesome_test", "on",  "clicks", .8)

compare_variants <- function(dat) {
  dat %>%
    rowid_to_column() %>%
    crossing(., .) %>%
    filter(rowid < rowid1) %>%
    mutate(perc_change = (perc_w_event1/perc_w_event - 1) * 100) %>%
    select(variant_1 = ab_variant, variant_2 = ab_variant1, perc_change)
}

original_df %>%
  group_by(date, ab_test, event_name) %>%
  nest() %>%
  mutate(results = map(data, compare_variants)) %>%
  select(-data) %>%
  unnest()
#> # A tibble: 5 x 6
#>   date       ab_test      event_name variant_1 variant_2     perc_change
#>   <chr>      <chr>        <chr>      <chr>     <chr>               <dbl>
#> 1 10-09-2017 cool_test    clicks     old       new                 100  
#> 2 10-09-2017 cool_test    clicks     old       third_variant       400  
#> 3 10-09-2017 cool_test    clicks     new       third_variant       150  
#> 4 10-09-2017 awesome_test clicks     off       on                   33.3
#> 5 10-10-2017 awesome_test clicks     off       on                   33.3

robinsones · January 4, 2018, 9:38pm

Thank you! I need to translate this to SQL eventually (to go in looker), so this is perfect!

sgranitz · January 4, 2018, 10:03pm

Careful, using >= 0 you may run into issues when the clicks decreased. If the DF is in order, something like this might work:

original_df <- tribble(
  ~date, ~ab_test,  ~ab_variant, ~event_name, ~perc_w_event,
  "10-09-2017", "cool_test",  "old", "clicks", .1,
  "10-09-2017", "cool_test", "new",  "clicks", .2,
  "10-09-2017", "cool_test", "third_variant",  "clicks", .5,
  "10-09-2017", "awesome_test", "off", "clicks", .3,
  "10-09-2017", "awesome_test", "on",  "clicks", .4,
  "10-10-2017", "awesome_test", "off",  "clicks", .6,
  "10-10-2017", "awesome_test", "on",  "clicks", .8
)

ord <- as_factor(unique(original_df$ab_variant))

original_df %>%
  group_by(ab_test, date) %>% 
  inner_join(original_df, by = c("date", "ab_test")) %>% 
  filter(ab_variant.x != ab_variant.y) %>% 
  mutate(ab1_rank = match(ab_variant.x, ord),
         ab2_rank = match(ab_variant.y, ord)) %>% 
  filter(ab1_rank < ab2_rank) %>% 
  mutate(chg = (perc_w_event.y - perc_w_event.x) / perc_w_event.x * 100) %>% 
  select(date, ab_test, ab_variant.x, ab_variant.y, chg)

# A tibble: 5 x 5
# Groups:   ab_test, date [3]
        date      ab_test ab_variant.x  ab_variant.y       chg
       <chr>        <chr>        <chr>         <chr>     <dbl>
1 10-09-2017    cool_test          old           new 100.00000
2 10-09-2017    cool_test          old third_variant 400.00000
3 10-09-2017    cool_test          new third_variant 150.00000
4 10-09-2017 awesome_test          off            on  33.33333
5 10-10-2017 awesome_test          off            on  33.33333

Omayma · January 4, 2018, 10:48pm

How about this?

original_df %>% 
  group_by(date, ab_test) %>% 
  nest() %>% 
  mutate(test_comb = map(data, ~ combn(.x[["ab_variant"]], 2, simplify = T)) %>% 
           map(t) %>% 
           map(as_data_frame)) %>% 
  unnest(test_comb) %>% 
  left_join(original_df, by = c("date",
                                "ab_test",
                                "V1" = "ab_variant")) %>% 
  left_join(original_df, by = c("date",
                                "ab_test",
                                "event_name",
                                "V2" = "ab_variant" )) %>% 
  mutate(perc_change = (100*((perc_w_event.y - perc_w_event.x)/perc_w_event.x)) %>% 
           round(2)) %>% 
  select(-perc_w_event.x, -perc_w_event.y)

# A tibble: 5 x 6
        date      ab_test    V1            V2 event_name perc_change
       <chr>        <chr> <chr>         <chr>      <chr>       <dbl>
1 10-09-2017    cool_test   old           new     clicks      100.00
2 10-09-2017    cool_test   old third_variant     clicks      400.00
3 10-09-2017    cool_test   new third_variant     clicks      150.00
4 10-09-2017 awesome_test   off            on     clicks       33.33
5 10-10-2017 awesome_test   off            on     clicks       33.33

To clarify : The first part, gives all the combinations directly:

original_df %>% 
  group_by(date, ab_test) %>% 
  nest() %>% 
  mutate(test_comb = map(data, ~ combn(.x[["ab_variant"]], 2, simplify = T)) %>% 
           map(t) %>% 
           map(as_data_frame)) %>% 
  unnest(test_comb)

# A tibble: 5 x 4
        date      ab_test    V1            V2
       <chr>        <chr> <chr>         <chr>
1 10-09-2017    cool_test   old           new
2 10-09-2017    cool_test   old third_variant
3 10-09-2017    cool_test   new third_variant
4 10-09-2017 awesome_test   off            on
5 10-10-2017 awesome_test   off            on

Then we can join with the original dataframe to get the corresponding values and calculate perc_change

scottbrenstuhl · January 4, 2018, 11:53pm

I initially had the same concern but then thought it should be safe. An example of where it would be a problem would be greatly appreciated since I have used this sort of logic before and should revisit it if its problematic.

Here is my thought process:
Inner_join will cause a fan out for all combinations of each test/date/event combo, then there should be one positive and one negative perc_change for each unique combo (or two at exactly 0 but that would be wild in a real A/B test).

Since each unique pair is represented in the dataframe twice you can just take the >= 0 changes and then you're basically saying that the variant_1 column is the "winner" for that pairing and the perc_change is by how much.

So in the example outcome, all of the tests actually made clicks decrease which leads to old and off always being shown in the variant_1 column when compared to new, third_variant, or on.

sgranitz · January 5, 2018, 1:26am

Scott, it only would matter if direction is important. If you have a base of Old = 1, New = 0.5, Third = 2 and you want to know Old to New, Old to Third and New to Third then the >= 0 won't work. If direction is unimportant then your way would be fine.

scottbrenstuhl · January 5, 2018, 3:38am

Awesome! Thanks for the explanation, that totally makes sense. I'm breathing a huge sigh of relief right now since I'm pretty sure this doesn't apply to any of the times I have used this strategy before. Great thing for me to keep in mind in the future!!