Distribute numbers 1:587 between buckets; each number picked 4 times and each bucket has 30 numbers; each bucket can't have numbers with < 2 difference

Hi all,

So I am designing a biological experiment by which i need to distribute 587 different peptides between 80 buckets. Each peptide should be present in 4 buckets, and the restriction is that if peptide X is present in a bucket, then peptides X-2, X-1, X+1, and X+2 are not allowed in that bucket.

I tried to brute force my way through this, by just randomly sampling different combination of buckets until i found a combination that fit this. HOWEVER, i quickly realized I am looking for a quite rare event, because I ended up the same peptide picked twice for at least one pool (and most often in many pools).

This is the code I tried: (not suggesting this is the way forward, but just to share with you what I tried)

# Set rules
buckets = 80
size = 30
coverage = 4
iterations = 100

system.time({
   # Create [iteration] random pools 
  pool_assignent = list()
  for (i in 1:iterations) {
    
    pool_assignent[[i]] <- data.frame(pool = rep(1:buckets, 30), peptide = sample(rep(1:587,4), replace = F)) %>% 
      split(.$pool) %>% 
      map(~ .$peptide)
  }
  
  # Name the pools
  names(pool_assignent) <- paste("i_", 1:iterations, sep="")

# Find minimum within each pool
min <- pool_assignent %>% 
  map(map, dist) %>%
  map(map, as.vector) %>% 
  map(~ bind_rows(.)) %>% 
  map(~ min(.)) %>% 
  do.call(rbind, .)

print(min(min))

})

Please let me know if you have any ideas. I did have a look at Sampling without replacement multiple times and ensuring equal samples but was not able to use this code in an effective way. I do believe however, this has to be a structured selection process, and not a random sampling.

Thanks all,
Nils

Hi strixeh,

See if this works for you. I also included a few tests at the bottom.

fill_buckets <- function(buckets = vector('list', 80), peps = rep(1:587, 4)) {
  for (p in peps) {
    b <- sample(80, 1)
    done <- FALSE
    while (!done) {
      already_in <- p %in% buckets[[b]]
      any_within_2 <- any((p-2):(p+2) %in% buckets[[b]])
      space_in_bucket <- length(buckets[[b]]) < 30
      
      if (!(already_in | any_within_2) &
          space_in_bucket) {
        
        buckets[[b]][max(1, length(buckets[[b]]) + 1)] <- p
        done <- TRUE
        
      } else {
        b <- sample(80, 1) # Might occasionally lead to failure or very small buckets - smallest I recorded was 15
      }
    }
    
  }
  buckets
}

buckets <- fill_buckets()

Tests

Total length of all buckets - should be 587 * 4 = 2348

sum(lengths(buckets))
#> [1] 2348

What are the min and max bucket sizes?

range(lengths(buckets))
#> [1] 21 30
# Replicating over 200 runs the smallest buckets had 15 elements
hist(lengths(buckets), main = '')

image

Do all peptides occur 4 times?

all(table(unlist(buckets)) == 4)
#> [1] TRUE
# any(table(unlist(bucket)) > 4)
# any(table(unlist(bucket)) < 4)
1 Like

I'm behind @jmcvw again. My sketch doesn't rely on randomization, since that's not necessary—the assignments can be arbitrary, provided they follow the rule.

I would first make buckets and split them into the four sub-buckets required for each peptide.

buckets <- paste0("B",seq(1:80),"")

sub_buckets <- split(buckets, ceiling(seq_along(buckets)/4))

I would then iterate through peptides

peptides <- sort(unique(stringi::stri_rand_strings(600, 3, pattern = "[A-Za-z]"))[1:590])

The last three are dummies to be discarded.

For each element in the peptides vector, assign to the next available element of the current sub_bucket. Then recycle: the next recycling of sub_buckets is necessarily far enough away from the last element of sub_bucket.

This depends, of course, on peptides being ordinal.

1 Like

AAAAaaarggggh!

As technocrat said, randomization is not an issue - and I went to bed with that thought rattling around in my skull, knowing there was a better way.

Well here it is:

fill_buckets <- function(buckets = vector('list', 80), n_pep = 587, n_buckets = 4) {
  b <- seq_len(n_buckets)
  l <- 1
  for (p in seq_len(n_pep)) {
    buckets[b] <- lapply(buckets[b], function(bkt) append(bkt, p, l))
    l <- l + 1
    b <- if (b[4] == 80) 1:4 else b + 4
  }
  buckets
}

buckets <- fill_buckets()

This still passes all the test from my other attempt (lets not mention THAT again!), and all buckets now contain 29 or 30 elements.

1 Like

Thank you both for responding and for working on this for me. I appreciate it a lot!

To the problem:

  1. I ran jmcvw's first suggestion that rely on random sampling overnight (1,000,000 iterations) and found buckets with a range of 27-30 peptides in each bucket.

  2. Then I ran jmcvw's second suggestion that did not require randomization, which is obviously both faster and gives more equally filled buckets.

  3. However, this is the thing I should have mentioned earlier....SORRY! One my requirements is that the buckets should be as different as possible. With the approach that does not rely on randomization, the buckets repeat themselves and there is only 20 unique buckets:

> buckets %>% length()
[1] 80
> buckets %>% unique() %>% length()
[1] 20
  1. Do you think there is a way to build unique buckets, while still following the rules below, without incorporating randomization? I understand it makes it more complex, but please let me know what you think.

  2. For the randomizaed buckets, there is no issues with identical buckets (by chance?):

> buckets %>% length()
[1] 80
> buckets %>% unique() %>% length
[1] 80
  1. Still, i don't know whether there are very similar buckets or not, and not sure how to test for that either. Suggestions?

  2. I will have to work a bit to figure out exactly how to implement technocrat's code, haven't gotten that far yet.

Again, thank you so much for your help, I am forever thankful.

I probably should have suggested a longer, rather than shorter, object to hold the four-bucket pieces. With 588 peptides, including the final two dummies to round the number required, it would be just

buckets <- paste0("B",seq(1:80),"")
sub_buckets <- split(buckets, ceiling(seq_along(buckets)/4))
pool <- rep(sub_buckets,147)

Each peptide draws a sub-bucket from the pool; there are twenty unique sub-buckets so, peptides should be consistently 20 sub-buckets apart, meeting the requirement for no near-neighbors in the same bucket. Because the buckets can be created in the same way as peptides above, although not assigned once created, no sub-bucket set is privileged over another. And regenerating the bucket list can be done to test for distribution. I'd expect that one of the tests for normality could be drafted as the metric.

Creating the object from two equal-length vectors shouldn't be difficult at all.

Thank you, I will try to work this out. I appreciate it!

Meanwhile, i worked on the "randomization" code example by jmcvw and calculated the intersect between each bucket. Actually, it looks pretty good, the buckets do not share much:

# Determine "uniquness" of buckets
sim_metric_list = list()
for (i in 1:80) {
  for (j in 1:80) {
    sim_metric_vector = vector()
    intersecting_peptides <- intersect(x = selected_bucket[[i]], y = selected_bucket[[j]]) %>% length() + 0
    sim_metric_vector[j] <- intersecting_peptides
  }
  sim_metric_list[[i]] <- sim_metric_vector 
}

sim_metric <- do.call(rbind, sim_metric_list)

hist(sim_metric, 30)

Seems almost unlikely to get something so good, but I also have not gone into understanding the code (busy day, my only lab member resigned today...). Any thoughts?

Thanks,
Nils

Looks like a power distribution, which is plausible. Proximal ordinal peptides land by design in non-overlapping groups of 4 buckets. Maximizing Euclidean distance might be another approach.

Thanks so much again for your responses and help.

I work in academia, and this will be helpful for a research project I am working on. Can I ask if you are willing to send your names to me via Direct Message so that I can acknowledge you properly for the help in any upcoming publication that uses this code?

Thanks,
Nils

This topic was automatically closed 21 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.