loops to determine unique group membership

Hi everyone, Updating the post to provide some progress and to more fully detail what I think the logic should be. Basically, in the for loop, I want to do the following. First, for the first row in the dataframe, I want to check if the name exists in the name_group_id list we initialized. If It does, then assign the group_id associated with that name. If not, then check if the location exists in the location_group_id list. If it does, return that group_id, if it does not, then create a new group_id and add this row's name/group_id to the name_group_id and the location/group_id to the location_group_id

This code does not work yet, as row three should have group_id 1 because name interacted with location 1 in row 1. But it seems close. Help tweaking this code would be awesome.

Thanks, everyone.

# Create data frame with the sample data
df <- data.frame(name = c("a", "a", "b", "b", "b", "c", "c", "d", "d", "e", "e", "f", "g", "g", "h"),
                 location = c(1, 2, 1, 3, 4, 3, 2, 5, 6, 7, 8, 4, 9, 10, 5))

# Initialize two lists
name_id <- list()
location_id <- list()

# Counter for creating new group IDs
group_id_counter <- 0

# For loop that implements the logic
for (i in 1:nrow(df)) {
  row <- df[i, ]
  name <- row$_name
  location <- row$location

  # Check if the name exists in name_id
  if (name %in% names(name_id)) {
    group_id <- name_group_id[[name]]
  } else {
    # Check if the location exists in location_group_id
    if (location  %in% names(location _group_id)) {
      group_id <- location _group_id[[location ]]
    } else {
      group_id_counter <- group_id_counter + 1
      group_id <- paste0("Group ", group_id_counter)
      name_id[[name]] <- group_id
      location _group_id[[location ]] <- group_id
    }
  }

  # Assign the group_id to the current row
  df[i, "group_id"] <- group_id
}

OLD thinking below

I have what I think is a set partition problem requiring a for loop. My dataset that looks similar to this

group1 <- sample(c("person a" , "person b" , "person c",
                   "person d" , "person e" , "person f", "person g"),
                   25, replace = TRUE )
group2 <- sample(c("veggie" , "fruit" , "meat",
                   "dairy" , "dirt"),
                   25, replace = TRUE )
df <- data.frame(group1, group2)

You could think of these data as foods (group2) people (group1) eat.

What I would like to do is determine the unique group memberships at play. So if person 1 eats dirt, but no other person does and person 1 eats nothing else, then person 1 is in ID 1.

If both person 2 and person 5 each fruit, but no other person does, then they could be group 2. But, crucially, any overlap creates a group, so if person 5 from above eats both fruits and veggies, and person three also eats veggies, then persons 2, 5, and 3 become a common ID (e.g. ID 2).

My real data has distinct combinations, but no repeat combinations.

My real data is ~7million rows long and the combinations are more complicated, but the general idea stands.

I have toyed with a few ways to try this but increasingly think I need to employ some sort of for loop, which I am not very skilled at.

Any help would be great. I also welcome pointers to other solutions, in case I missed any.

thanks!

Hi @ehl ,
I think this example does what you are requesting. It's a bit clunky but gets the job done:

suppressPackageStartupMessages(library(tidyverse))
set.seed(42)

group1 <- sample(c("person a" , "person b" , "person c",
                   "person d" , "person e" , "person f", "person g"),
                   25, replace = TRUE )
group2 <- sample(c("veggie" , "fruit" , "meat", "dairy" , "dirt"),
                   25, replace = TRUE )
df <- data.frame(group1, group2)

df %>% 
  separate(col=group1, into=c("not_needed","person")) -> df

dat <- as.data.frame.matrix(with(df, table(person, group2)))

dat %>% 
  mutate_if(is.numeric, function(x){ifelse(x>1, 1, x)}) %>% 
  unite(., comb, sep="", remove=FALSE) %>% 
  rownames_to_column("person") %>% 
  mutate(ID = as.numeric(as.factor(comb))) %>% 
  arrange(ID)
#>   person  comb dairy dirt fruit meat veggie ID
#> 1      g 00110     0    0     1    1      0  1
#> 2      f 01000     0    1     0    0      0  2
#> 3      c 01100     0    1     1    0      0  3
#> 4      e 01100     0    1     1    0      0  3
#> 5      d 10110     1    0     1    1      0  4
#> 6      b 11101     1    1     1    0      1  5
#> 7      a 11111     1    1     1    1      1  6

Created on 2023-02-02 with reprex v2.0.2

1 Like

If group membership is thought of as a set of common connections, a graph object would be more tractable than any iteration method to construct the relationships.

library(igraph)
#> 
#> Attaching package: 'igraph'
#> The following objects are masked from 'package:stats':
#> 
#>     decompose, spectrum
#> The following object is masked from 'package:base':
#> 
#>     union
from <- sample(c("person a" , "person b" , "person c",
                   "person d" , "person e" , "person f", "person g"),
                 25, replace = TRUE )
to <- sample(c("veggie" , "fruit" , "meat",
                   "dairy" , "dirt"),
                 25, replace = TRUE )
dat <- data.frame(from,to)

g <- graph_from_data_frame(dat, directed=TRUE)


plot(g, layout=layout_with_kk, vertex.color="green")

cliques(g)
#> Warning in cliques(g): At igraph_cliquer.c:56 :Edge directions are ignored for
#> clique calculations
#> [[1]]
#> + 1/11 vertex, named, from 9cd461d:
#> [1] meat
#> 
#> [[2]]
#> + 1/11 vertex, named, from 9cd461d:
#> [1] veggie
#> 
#> [[3]]
#> + 1/11 vertex, named, from 9cd461d:
#> [1] dairy
#> 
#> [[4]]
#> + 1/11 vertex, named, from 9cd461d:
#> [1] dirt
#> 
#> [[5]]
#> + 1/11 vertex, named, from 9cd461d:
#> [1] fruit
#> 
#> [[6]]
#> + 1/11 vertex, named, from 9cd461d:
#> [1] person e
#> 
#> [[7]]
#> + 2/11 vertices, named, from 9cd461d:
#> [1] person e dairy   
#> 
#> [[8]]
#> + 1/11 vertex, named, from 9cd461d:
#> [1] person g
#> 
#> [[9]]
#> + 2/11 vertices, named, from 9cd461d:
#> [1] person g meat    
#> 
#> [[10]]
#> + 1/11 vertex, named, from 9cd461d:
#> [1] person c
#> 
#> [[11]]
#> + 2/11 vertices, named, from 9cd461d:
#> [1] person c fruit   
#> 
#> [[12]]
#> + 2/11 vertices, named, from 9cd461d:
#> [1] person c dirt    
#> 
#> [[13]]
#> + 2/11 vertices, named, from 9cd461d:
#> [1] person c dairy   
#> 
#> [[14]]
#> + 2/11 vertices, named, from 9cd461d:
#> [1] person c meat    
#> 
#> [[15]]
#> + 1/11 vertex, named, from 9cd461d:
#> [1] person b
#> 
#> [[16]]
#> + 2/11 vertices, named, from 9cd461d:
#> [1] person b dirt    
#> 
#> [[17]]
#> + 2/11 vertices, named, from 9cd461d:
#> [1] person b veggie  
#> 
#> [[18]]
#> + 2/11 vertices, named, from 9cd461d:
#> [1] person b meat    
#> 
#> [[19]]
#> + 1/11 vertex, named, from 9cd461d:
#> [1] person d
#> 
#> [[20]]
#> + 2/11 vertices, named, from 9cd461d:
#> [1] person d veggie  
#> 
#> [[21]]
#> + 1/11 vertex, named, from 9cd461d:
#> [1] person f
#> 
#> [[22]]
#> + 2/11 vertices, named, from 9cd461d:
#> [1] person f dairy   
#> 
#> [[23]]
#> + 2/11 vertices, named, from 9cd461d:
#> [1] person f veggie  
#> 
#> [[24]]
#> + 2/11 vertices, named, from 9cd461d:
#> [1] person f meat

g2 <- g
for (cnt in 1:clique.number(g)) {
  V(g)[largest[[1]][cnt]]$color = 'red'
    V(g2)[largest[[2]][cnt]]$color = 'blue'
}
#> Warning in clique.number(g): At cliques.c:1087 :directionality of edges is
#> ignored for directed graphs
#> Error in eval(x$expr, data, x$env): object 'largest' not found


plot(g2)

Thanks for this effort. It is very close to what I need but unfortunately the current approach is calculating combinations of dairy to veggie row wise (comb column). But ultimately, and here is the tricky part, all but person f (row 2) would be related through fruit, but then even person F is in the same group as the rest because they eat dirt and so do C, E, B, and A.

The graph solution could be an elegant one, but I have enough pairs that my R session aborts ever time I try.

Yeah, I was afraid of that. Depending on how many distinct objects in the two groups it may be possible to partition through slicing and attack each separately. Loops are going to be problematic for memory too because of the need to preserve what has been seen previously. Probably need some external cache.

I don't think we know enough to simulate larger data to profile approaches effectively...
but I had fun coming up with an approach to tackle this that at least works on the small data I think...

suppressPackageStartupMessages(library(tidyverse))

find_groups <- function(df){
  
(df2 <- df |> distinct() |> arrange(person,item))


(initial_wide <- pivot_wider(df2,id_cols="person",names_from = "item",values_from = "item",values_fn = \(x)TRUE))
  print("given")
  print(initial_wide)
(live <- initial_wide)
has_overlap <- function(row,table){
  map(seq_len(nrow(table)),\(x) {a <-row |> select(-person) ; b<-  slice(table,x) |> select(-person);any(a & b)})
}

do_merge <- function(row,table){
  p1 <- pull(row,person) |> paste0(collapse=",")
  p2 <- pull(table,person) |> paste0(collapse=",")
  persons <- paste(p1,p2,sep=",",collapse = ",")
  bind_rows(row,table) |> summarise(across(where(is.logical),any)) |> mutate(person=persons) |> relocate(person)
}

results <- initial_wide|>filter(FALSE)
while(nrow(live)>0){
  (active <- live[1,])
  (rest <- setdiff(live,active))
  hado <- has_overlap(active,rest)
  suppressWarnings(any_hado <- any(hado,na.rm = TRUE))
  if(!any_hado){
   results <- bind_rows(results,active)
  } else{
    rows_of_rest_to_merge <- seq_len(nrow(rest))[unlist(hado)]
   merge_result <- do_merge(active,slice(rest,
                          rows_of_rest_to_merge))
   rest <- slice(rest,-rows_of_rest_to_merge)
   results <-bind_rows(results,merge_result)
  }
  live <- rest
}
results
}

set.seed(42)
df <- tibble(person = sample(
  c(
    "person a", "person b", "person c",
    "person d", "person e", "person f", "person g"
  ),
  25,
  replace = TRUE
), item = sample(c("veggie", "fruit", "meat", "dairy", "dirt"),
  25,
  replace = TRUE
))

find_groups(df)
# # A tibble: 1 × 6
# person                                                         dairy dirt  fruit meat  veggie
# <chr>                                                          <lgl> <lgl> <lgl> <lgl> <lgl> 
#   1 person a,person b,person c,person d,person e,person f,person g TRUE  TRUE  TRUE  TRUE  TRUE  



 set.seed(42)

df <- tibble(person = sample(letters[1:7],
  8,
  replace = TRUE
), item = sample(c("veggie", "fruit", "meat", "dairy", "dirt"),
  8,
  replace = TRUE
))


 find_groups(df)
 # A tibble: 2 × 6
 # person dirt  veggie dairy fruit meat 
 # <chr>  <lgl> <lgl>  <lgl> <lgl> <lgl>
 #   1 a      TRUE  TRUE   NA    NA    NA   
 # 2 b,d,e  NA    NA     TRUE  TRUE  TRUE

yeah, my approach wouldnt work on data with many more 'items'/categories as the initial pivot wider would fail most likely...

I had a go reimplementing this same algorith while staying in the long form rather than pivoting wider; seems to work again on the smaller data, if its too slow on bigger data that might be hard to address without writing c/c++ code; but at least it shouldnt fall victim to limits on pivoting wide.

suppressPackageStartupMessages(library(tidyverse))


has_overlap <- function(person,other_people){
  iofp <- person$item
  other_names <- unique(other_people$name) 
  
  sapply(other_names, \(x){other_people |> 
      filter(name==x) |> 
      pull(item)|> 
      intersect(iofp) |> 
      length() > 0 })|>
    set_names(other_names)
}

do_merge <- function(person,other_people){
  p1 <- person$name |> unique() |> paste0(collapse=",") 
  p2 <- other_people$name |> unique() |> paste0(collapse=",")
  persons <- paste(p1,p2,sep=",",collapse = ",")
  items <-unique(c(person$item,
           other_people$item))
  enframe(items, name = NULL,value = "item") |> mutate(name=persons)
}

find_groups2 <- function(df){
  (df2 <- df |> distinct() |> arrange(name,item))
  live <- df2
  results <-df2 |> filter(FALSE)
  while(nrow(live)>0){
    first_name <- head(live$name,1)
    
    (active <- filter(live,name==first_name))
    
    (rest <- setdiff(live,active))
    hado <- has_overlap(active,rest)
    suppressWarnings(any_hado <- any(hado,na.rm = TRUE))
    if(!any_hado){
      results <- bind_rows(results,active)
    } else{
      names_of_rest_to_merge <- names(which(hado))
      merge_result <- do_merge(active,filter(rest,
                                            name %in% names_of_rest_to_merge))
      rest <- filter(rest,! (name %in% names_of_rest_to_merge))
      results <-bind_rows(results,merge_result)
    }
    live <- rest
  }
  results
  
}

set.seed(42)
df <- tibble(name = sample(
  c(
    "person a", "person b", "person c",
    "person d", "person e", "person f", "person g"
  ),
  25,
  replace = TRUE
), item = sample(c("veggie", "fruit", "meat", "dairy", "dirt"),
                 25,
                 replace = TRUE
))

find_groups2(df)
# # A tibble: 1 × 6
# person                                                         dairy dirt  fruit meat  veggie
# <chr>                                                          <lgl> <lgl> <lgl> <lgl> <lgl> 
#   1 person a,person b,person c,person d,person e,person f,person g TRUE  TRUE  TRUE  TRUE  TRUE  



set.seed(42)

df <- tibble(name = sample(letters[1:7],
                             8,
                             replace = TRUE
), item = sample(c("veggie", "fruit", "meat", "dairy", "dirt"),
                 8,
                 replace = TRUE
))


find_groups2(df)

This topic was automatically closed 21 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.