cut_width but minimize over/under runs?

I'm chunking text, trying to create chunks of as-equal-as-possible length by combining sentences. I have the sentences in a tibble (together with some metadata about each sentence). ggplot2::cut_width seemed ideal, but I've found that it creates groups with larger-than-necessary differences between them.

So I'm seeking a function that takes an ordered set of numbers and groups them (in sequence) to minimize the number of groups and the divergence of the group size from a goal.

e.g.

c(460, 100, 200, 200, 20)  # goal 500 ==> 1: [460], 2: [100, 200, 200, 20]

So, given a goal of 500, we start with 460, then know not to include the second item (100) because it is better to be 40 under than 60 over. Then we have a second group that starts with 100 and easily adds the next two items (200, 200). The last item would push the second group up to 520, but is added anyway, because it's better to be 20 over than to have a short final group (which would be 480 under).

I can't decide if this is interval cutting or a type of constrained agglomerative clustering. (I can't use tokenizers::chunk_text because I want to group rows to retain metadata).

Anyway, I've written a function that kinda works (see below), but I had to special-case the last value. And it actually seems harder than that, because there could be a few small values at the end that should all be added. So maybe it needs to be recursive?

Anyway, any suggestions appreciated! I'm terrible at iteration in R, as I'm sure is clear.

cut_width_min_diff <- function(df, width, label = FALSE) {
  # df is a vector of lengths (despite the name); label is currently unused
  curr_total <- 0
  curr_group <- 0
  group_labels <- c()
  for (i in seq_along(df)) {
    print(paste0("Considering index ", i))
    if (need_new_group(curr_total, df[i], width, i == length(df))) {
      # put this in a new group
      curr_group <- curr_group + 1
      curr_total <- df[i] # reset, just this item in the new group
    } else {
      # keep in this group (either not blown, or overage less than underage)
      curr_total <- curr_total + df[i]
    }
    # either way, store the group label for this df[i]
    group_labels <- c(group_labels, curr_group)
  }
  return(group_labels)
}

# Two choices: keep df[i] in the current group or start a new one.
# Test adding this item:
#   - if the current group would still be under, always add;
#   - if adding this would blow the group, test whether it's better to add it
#     or not, i.e. leave the group too far under (by not adding) or push it
#     too far over (by adding).

need_new_group <- function(curr_total, candidate, goal_width, last = FALSE) {
  print(paste0("Curr total: ", curr_total, "; candidate: ", candidate))
  poss_new_total <- curr_total + candidate
  if (poss_new_total < goal_width) {
    print("not blown")
    return(FALSE) # always add if it doesn't blow the group
  }
  # now the group must be blown, but is it better to add or not?

  # if we keep the item, how far over would the current group be?
  overage <- poss_new_total - goal_width
  print(paste0("If no new group, over by ", overage))

  # if we start a new group, how short would the current one be?
  underage <- goal_width - curr_total
  print(paste0("If new group, curr under by ", underage))

  decision <- underage < overage

  if (last) {
    # special case: starting a new group for the last item could leave it very short
    last_underage <- goal_width - candidate
    print(paste0("Last item; if new group, last group under by ", last_underage))
    if (abs(last_underage) > abs(underage) && abs(last_underage) > abs(overage)) {
      decision <- FALSE
    }
  }
  return(decision)
}
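
As a quick sanity check, the toy example from the top plays out as hoped with this function (debug output omitted; the labels happen to start at 0 because the first item never triggers a new group):

cut_width_min_diff(c(460, 100, 200, 200, 20), 500)
#> [1] 0 1 1 1 1  # i.e. groups [460] and [100, 200, 200, 20]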

Don't worry about your code; it's delightful compared to what's coming: I managed to do slightly better, but at what cost!

The idea is that if we define a cost and evaluate it on every possible sentence combination, we can select the best one. Of course, it's impractical to compute every possible combination, so I use a minimum and maximum group length and only build the combinations that fall in that range. The small problem is that I basically reinvented object-oriented programming from scratch, and it may not scale well with too many sentences.

I define a "combination" as an object that contains a list of groups, and the current group number.

library(stringr) # sentences, str_length()
library(purrr)   # map(), map_int(), map_dbl()
library(dplyr)   # tibble(), group_by(), summarize(), pull(), %>%

# example dataset
str <- stringr::sentences[1:100]
hist(str_length(str))

# the thresholds to decide when to dump a combination
min_g <- 200
max_g <- 260

clen <- function(comb, group = NULL){
  # cumulative length of current group
  if(is.null(group)) group <- comb$cur_group
  sum(str_length(comb$groups[[group]]))
}

# Initialization of the list of combinations
combins <- list()
combins[[1]] <- list(cur_group=1, groups=list(str[1]))

for(cur_sent in str[-1]){
  for(act_comb in seq_along(combins)){
    if(clen(combins[[act_comb]]) < min_g){
      #that group is too small, add the sentence
      combins[[act_comb]]$groups[[combins[[act_comb]]$cur_group]] <- c(combins[[act_comb]]$groups[[combins[[act_comb]]$cur_group]],
                                                                       cur_sent)
    } else if(clen(combins[[act_comb]])+str_length(cur_sent) > max_g){
      # that group is too big, start a new one
      combins[[act_comb]]$cur_group <- combins[[act_comb]]$cur_group + 1
      combins[[act_comb]]$groups[[combins[[act_comb]]$cur_group]] <- cur_sent
    } else{
      # create a new combination with a new group
      new_comb <- length(combins)+1
      combins[[new_comb]] <- combins[[act_comb]] # duplicate existing
      combins[[new_comb]]$cur_group <- combins[[new_comb]]$cur_group + 1
      combins[[new_comb]]$groups[[combins[[new_comb]]$cur_group]] <- cur_sent
      # add sentence to current group in current (old) combination
      combins[[act_comb]]$groups[[combins[[act_comb]]$cur_group]] <- c(combins[[act_comb]]$groups[[combins[[act_comb]]$cur_group]],
                                                                       cur_sent)
    }
  }
}

# Now we can look at the results and select the best combination among those we computed
group_lengths <- function(comb){
  map_int(seq_len(comb$cur_group),
          ~ clen(comb, .))
}
cost <- function(group_lengths, preferred){
  sum((group_lengths - preferred)^2)
}

comb_costs <- map(combins, group_lengths) %>%
  map_dbl(cost, preferred = 230)
best <- combins[[which.min(comb_costs)]]
# rough average deviation per group
sqrt(min(comb_costs))/length(best$groups)
# lengths of groups
group_lengths(best)

So to compare to your code:

# With min_g=200, max_g=260, preferred=230
> group_lengths(best)
 [1] 236 232 236 249 233 225 226 225 221 227 234 222 232 203 207 223 211
> tibble(len=str_length(str), group=cut_width_min_diff(len, 230)) %>%
+   group_by(group) %>%
+   summarize(glen = sum(len)) %>%
+   pull(glen)
 [1] 236 232 236 249 233 225 226 225 221 227 234 222 232 240 240 229 135
# With min_g=90, max_g=150, preferred=120, we get the exact same grouping:
 [1] 123 113 121 111 127 109 120 129 112 121 115 110 115 111 112 113 107 114
[19] 114 113 118 116 131 129 119 121 117 114 133 106 117 114  97
# With 300 +/- 20 we get the same, but with 300 +/- 40 I get a better combination:
> group_lengths(best)
 [1] 279 280 310 284 294 302 294 309 282 289 315 313 291
# vs cut_width_min_diff
 [1] 279 316 319 309 303 303 293 306 295 321 304 283 211

Note that it doesn't scale well with many sentences and a small group length, as the number of combinations explodes. It could be improved by computing the cost of each new combination and discarding the useless ones early. I think that's the concept behind the Knuth-Plass algorithm that stri_wrap() and TeX use to wrap words on a line.
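
For what it's worth, here's a rough sketch of that pruning idea as a dynamic program (just an illustrative sketch using the same squared-deviation cost, not the real Knuth-Plass implementation). For each prefix of the lengths only the best split needs to be kept, so the work stays around O(n^2), and it considers every possible split rather than only those within min_g/max_g:

chunk_dp <- function(lens, goal) {
  n <- length(lens)
  cums <- c(0, cumsum(lens))   # cums[j + 1] - cums[i] = sum of lens[i..j]
  best <- c(0, rep(Inf, n))    # best[i + 1] = minimal cost of splitting lens[1..i]
  back <- integer(n)           # back[j] = start index of the last group ending at j
  for (j in seq_len(n)) {
    for (i in seq_len(j)) {    # candidate last group: lens[i..j]
      cand <- best[i] + (cums[j + 1] - cums[i] - goal)^2
      if (cand < best[j + 1]) {
        best[j + 1] <- cand
        back[j] <- i
      }
    }
  }
  # walk back from the end to recover group labels
  labels <- integer(n)
  g <- 0L
  j <- n
  while (j > 0) {
    labels[back[j]:j] <- g
    j <- back[j] - 1L
    g <- g - 1L
  }
  labels - min(labels) + 1L    # relabel 1, 2, ... in sentence order
}

chunk_dp(c(460, 100, 200, 200, 20), 500)
#> [1] 1 2 2 2 2
# same interface as cut_width_min_diff(), so it drops straight into the comparison:
# chunk_dp(str_length(str), 230)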
