Counting grouping occurrence

DesperatePhDStudent · March 13, 2020, 2:56pm

Hi!
First of all, thanks for taking the time to read this.
I have a dataset (AdvPreTest) in which different advertisements (let's say Adv1, Adv2, Adv3, Adv4, and Adv5) are classified into groups (group1, group2, group3) based on their similarities. What I want to know is how many times each pair of advertisements was grouped together, e.g. how many times Adv1 and Adv2 were grouped together / considered similar. The number of the group they are in is not important.

My data look like this:
Name Groups Value
AB282 group1 Adv1
AB282 group1 Adv2
AB282 group1 Adv3
AB282 group2 Adv4
AB282 group2 Adv5
AB20 group3 Adv1
AB20 group3 Adv2
AB20 group2 Adv3
AB20 group2 Adv4
AB20 group2 Adv5
LM28 group3 Adv1
LM28 group3 Adv2
LM28 group3 Adv3
LM28 group2 Adv4
LM28 group2 Adv5
GM25 group2 Adv1
GM25 group2 Adv2
GM25 group2 Adv3
GM25 group1 Adv4
GM25 group1 Adv5

And at the end I hope to have something like:
     Adv1 Adv2 Adv3 Adv4 Adv5
Adv1    X    4    3    0    0
Adv2    4    X    3    0    0
Adv3    3    3    X    1    1
Adv4    0    0    1    X    4
Adv5    0    0    1    4    X

But I have no idea how to compute this.

Thanks for your time and your help.

ttrodrigz · March 13, 2020, 5:07pm

So there are likely a number of ways to set this up, but here's what my approach would be:

Step 1: rearrange the data such that...

Columns represent the advertisements
Each row represents a group/name entry
Cells have a value of 1 if Adv_i was binned in the corresponding group, otherwise 0.

library(tidyverse)

AdvPreTest <- tribble(
      ~Name,  ~Groups, ~Value,
    "AB282", "group1", "Adv1",
    "AB282", "group1", "Adv2",
    "AB282", "group1", "Adv3",
    "AB282", "group2", "Adv4",
    "AB282", "group2", "Adv5",
     "AB20", "group3", "Adv1",
     "AB20", "group3", "Adv2",
     "AB20", "group2", "Adv3",
     "AB20", "group2", "Adv4",
     "AB20", "group2", "Adv5",
     "LM28", "group3", "Adv1",
     "LM28", "group3", "Adv2",
     "LM28", "group3", "Adv3",
     "LM28", "group2", "Adv4",
     "LM28", "group2", "Adv5",
     "GM25", "group2", "Adv1",
     "GM25", "group2", "Adv2",
     "GM25", "group2", "Adv3",
     "GM25", "group1", "Adv4",
     "GM25", "group1", "Adv5"
    )


AdvClean <-
    
    AdvPreTest %>%
    
    # ads placed in similar groups receive a value of 1
    mutate(dummy = 1) %>%
    
    # put ads in the columns, fill cells with 1/0
    pivot_wider(
        names_from = Value,
        values_from = dummy,
        values_fill = list(dummy = 0)
    )

AdvClean
#> # A tibble: 8 x 7
#>   Name  Groups  Adv1  Adv2  Adv3  Adv4  Adv5
#>   <chr> <chr>  <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 AB282 group1     1     1     1     0     0
#> 2 AB282 group2     0     0     0     1     1
#> 3 AB20  group3     1     1     0     0     0
#> 4 AB20  group2     0     0     1     1     1
#> 5 LM28  group3     1     1     1     0     0
#> 6 LM28  group2     0     0     0     1     1
#> 7 GM25  group2     1     1     1     0     0
#> 8 GM25  group1     0     0     0     1     1

Step 2: initialize a matrix to hold the co-occurrences

# only need the Advertisement columns; drop the ID columns instead of
# hard-coding Adv1:Adv5, so any number of ad columns is kept
AdvClean <- select(AdvClean, -Name, -Groups)

M <- ncol(AdvClean)

co_occur <- matrix(
    nrow = M, ncol = M,
    dimnames = list(
        names(AdvClean),
        names(AdvClean)
    )
)

co_occur
#>      Adv1 Adv2 Adv3 Adv4 Adv5
#> Adv1   NA   NA   NA   NA   NA
#> Adv2   NA   NA   NA   NA   NA
#> Adv3   NA   NA   NA   NA   NA
#> Adv4   NA   NA   NA   NA   NA
#> Adv5   NA   NA   NA   NA   NA

Step 3: use a `for` loop to tabulate the results

for (i in 1:M) {
    for (j in 1:M) {
        
        # logical vector of if i and j are grouped together
        grouped.together <- AdvClean[[i]] == 1 & AdvClean[[j]] == 1

        # sum that vector to tally results
        co_occur[i, j] <- sum(grouped.together)
        
    }
}

co_occur
#>      Adv1 Adv2 Adv3 Adv4 Adv5
#> Adv1    4    4    3    0    0
#> Adv2    4    4    3    0    0
#> Adv3    3    3    4    1    1
#> Adv4    0    0    1    4    4
#> Adv5    0    0    1    4    4

I see you're a new poster, so this might seem like a lot if you're also new to R. Let me know if anything needs clarification!
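For what it's worth, the tally in Step 3 can also be sketched without the double loop: if X is the 0/1 matrix from Step 1 (one row per Name/Groups combination, one column per ad), then t(X) %*% X gives the same counts. The block below is only a sketch under that assumption. It rebuilds the example data in base R so it runs on its own, and the names `incidence` and `co_occur_mat` are made up for illustration.

```r
# Example data from the question, rebuilt in base R so this block is self-contained
AdvPreTest <- data.frame(
    Name   = rep(c("AB282", "AB20", "LM28", "GM25"), each = 5),
    Groups = c("group1", "group1", "group1", "group2", "group2",
               "group3", "group3", "group2", "group2", "group2",
               "group3", "group3", "group3", "group2", "group2",
               "group2", "group2", "group2", "group1", "group1"),
    Value  = rep(paste0("Adv", 1:5), times = 4)
)

# Rows: one per Name/Groups combination; columns: one per ad; cells: 1 or 0
incidence <- table(
    paste(AdvPreTest$Name, AdvPreTest$Groups, sep = "_"),
    AdvPreTest$Value
)
incidence <- (unclass(incidence) > 0) * 1

# crossprod(X) computes t(X) %*% X: entry [i, j] counts the rows
# in which both ad i and ad j have a 1
co_occur_mat <- crossprod(incidence)
co_occur_mat
```

Because no column names are hard-coded, the same lines should apply to any number of ads. If the diagonal is not of interest, `diag(co_occur_mat) <- NA` blanks it out.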

nirgrahamuk · March 13, 2020, 5:51pm

I found this quite challenging; here is the solution I came up with:

library(tidyverse)
# example data not provided so will simulate
set.seed(42) # to get shareable random outcomes

num_ads <- 5
num_groups <- 3
num_samples <- 30
example_g <- sample.int(n=num_groups,
                        size=num_samples,
                        replace=TRUE)

example_ad <- sample.int(n=num_ads,
                        size=num_samples,
                        replace=TRUE)

example_df <- data.frame(groups=paste0("group",example_g),
                         ads=paste0("Adv",example_ad))  

## you can assign your own df here ... 

# for co-occurrence, only distinct examples matter
example_df <- distinct(example_df) %>% arrange(groups,ads)


# pairs of Adv
rootads <- paste0("Adv", 1:num_ads)
pairs_to_track <- expand_grid(ad1 = rootads, ad2 = rootads) # %>% filter(ad1!=ad2)

count_results <- map2_int(
  .x = pairs_to_track$ad1,
  .y = pairs_to_track$ad2,
  .f = ~ filter(example_df, ads %in% c(..1, ..2)) %>%
    group_by(groups) %>%
    count %>%
    filter(n > 1) %>%
    ungroup %>%
    count %>%
    # select(n) %>%
    unlist()
)
pairs_to_track$count_results <- count_results
table_res_1 <- pivot_wider(pairs_to_track,names_from = ad2,values_from = count_results)
table_res_1

dromano · March 13, 2020, 5:57pm

If you're familiar with the dplyr package (also part of the tidyverse package), this might work for you, too:

library(tidyverse)

AdvPreTest <- tribble(
~Name,  ~Groups, ~Value,
"AB282", "group1", "Adv1",
"AB282", "group1", "Adv2",
"AB282", "group1", "Adv3",
"AB282", "group2", "Adv4",
"AB282", "group2", "Adv5",
"AB20", "group3", "Adv1",
"AB20", "group3", "Adv2",
"AB20", "group2", "Adv3",
"AB20", "group2", "Adv4",
"AB20", "group2", "Adv5",
"LM28", "group3", "Adv1",
"LM28", "group3", "Adv2",
"LM28", "group3", "Adv3",
"LM28", "group2", "Adv4",
"LM28", "group2", "Adv5",
"GM25", "group2", "Adv1",
"GM25", "group2", "Adv2",
"GM25", "group2", "Adv3",
"GM25", "group1", "Adv4",
"GM25", "group1", "Adv5"
)

AdvPreTest %>% 
  inner_join(AdvPreTest, by = c('Name', 'Groups')) %>% 
  filter(Value.x < Value.y) %>% 
  with(table(Value.x, Value.y))
#>        Value.y
#> Value.x Adv2 Adv3 Adv4 Adv5
#>    Adv1    4    3    0    0
#>    Adv2    0    3    0    0
#>    Adv3    0    0    1    1
#>    Adv4    0    0    0    4

Created on 2020-03-13 by the reprex package (v0.3.0)
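The table above only holds the upper triangle. If the square layout from the question is wanted, one possible variation (a sketch, not the only way) is to keep both orderings of each pair and fix the factor levels so every ad appears as both a row and a column. The version below uses base R `merge()` in place of `inner_join()` so it runs without extra packages; `pairs_df`, `ads` and `full_mat` are illustrative names.

```r
# Example data from the question, rebuilt in base R so this block is self-contained
AdvPreTest <- data.frame(
    Name   = rep(c("AB282", "AB20", "LM28", "GM25"), each = 5),
    Groups = c("group1", "group1", "group1", "group2", "group2",
               "group3", "group3", "group2", "group2", "group2",
               "group3", "group3", "group3", "group2", "group2",
               "group2", "group2", "group2", "group1", "group1"),
    Value  = rep(paste0("Adv", 1:5), times = 4)
)

# Self-join on Name and Groups: every pair of ads a person put in the same group
pairs_df <- merge(AdvPreTest, AdvPreTest, by = c("Name", "Groups"))

# Drop an ad paired with itself, but keep both (x, y) and (y, x)
pairs_df <- pairs_df[pairs_df$Value.x != pairs_df$Value.y, ]

# Shared factor levels make the table square
ads <- sort(unique(AdvPreTest$Value))
full_mat <- unclass(table(
    factor(pairs_df$Value.x, levels = ads),
    factor(pairs_df$Value.y, levels = ads)
))

# The diagonal carries no information here, so mark it as missing
diag(full_mat) <- NA
full_mat
```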

nirgrahamuk · March 13, 2020, 5:59pm

Whoa, that with(table()) stuff is magical!

dromano · March 13, 2020, 6:08pm

Took a long time for the magic to happen, though -- lots of getting twisted into knots before a vague memory of table() surfaced.

nirgrahamuk · March 13, 2020, 6:11pm

Hmm, I think there is an issue though, perhaps solved simply by deduplicating (distinct) somewhere in your pipe. Because there are only 3 groups, no value in the matrix should be greater than 3, i.e. the Adv1 vs Adv2 count should be 3 rather than 4, as it measures the number of groups the pair has been in, which is 3, even though this was drawn from, say, 4 samples.

dromano · March 13, 2020, 6:19pm

I'm not sure I'm following you, but I think the context is that a person decides which ads should be grouped together, so that group1 for person A is a different group from group1 for person B. In other words, my impression was that @DesperatePhDStudent was simply reusing group names arbitrarily, but that the sets they reference are what the counts come from. But maybe I misunderstood? @DesperatePhDStudent?

nirgrahamuk · March 13, 2020, 6:24pm

Ahh, I think I see. So really Name concatenated with Groups can be considered the countable element when thinking about co-occurrences of ads within 'groups'. I hadn't understood that, and had thought the Name column was just noise. More fool me.
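To make the two readings concrete, here is a small sketch that counts co-occurrences under each choice of countable element on the example data. `count_pairs()` is a hypothetical helper written for this illustration, not a function from any package.

```r
# Example data from the question, rebuilt in base R so this block is self-contained
AdvPreTest <- data.frame(
    Name   = rep(c("AB282", "AB20", "LM28", "GM25"), each = 5),
    Groups = c("group1", "group1", "group1", "group2", "group2",
               "group3", "group3", "group2", "group2", "group2",
               "group3", "group3", "group3", "group2", "group2",
               "group2", "group2", "group2", "group1", "group1"),
    Value  = rep(paste0("Adv", 1:5), times = 4)
)

# Hypothetical helper: co-occurrence counts of `ad` within each distinct `unit`
count_pairs <- function(unit, ad) {
    incidence <- (unclass(table(unit, ad)) > 0) * 1  # 0/1, duplicates collapsed
    crossprod(incidence)                             # t(X) %*% X
}

# Reading 1: a "group" is one person's pile, i.e. Name and Groups together
by_person_group <- count_pairs(
    paste(AdvPreTest$Name, AdvPreTest$Groups, sep = "_"),
    AdvPreTest$Value
)

# Reading 2: a "group" is just the label, shared across people
by_group_label <- count_pairs(AdvPreTest$Groups, AdvPreTest$Value)

by_person_group["Adv1", "Adv2"]  # 4
by_group_label["Adv1", "Adv2"]   # 3
by_person_group["Adv1", "Adv4"]  # 0
by_group_label["Adv1", "Adv4"]   # 2
```

Under the second reading Adv1 and Adv4 appear to co-occur twice, even though no single person put them together, which is why the per-person reading matches the table requested in the question.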

DesperatePhDStudent · March 18, 2020, 4:57pm

Hi! Thanks a lot for the answers! That helps a lot. The first solution is exactly what I needed; however, I don't know why it only works with the example data / a subset and not the whole data file (46 advertisements). The solution with dplyr seems to work though, so I will take a look at it.

