I'm chunking text, trying to create "as equal as possible" length chunks by combining sentences. I have the sentences in a tibble (together with some metadata about each sentence). ggplot2::cut_width seemed ideal, but I've found that it creates groups with larger-than-minimal differences between them.
So I'm seeking a function that takes an ordered set of numbers and groups them (in sequence) to minimize the number of groups and each group's divergence from a goal size.
e.g.
c(460, 100, 200, 200, 20) # goal 500 ==> 1: [460], 2: [100, 200, 200, 20]
So, given a goal of 500, we start with 460, then know not to include the second item (100), because it is better to be 40 under than 60 over. Then a second group starts with 100, easily adding the next two items (200, 200). The last item would push the second group up to 520, but is added anyway, because it's better to be 20 over than to leave a short final group (which would be 480 under).
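That first decision can be spelled out in a couple of lines (illustrative only; `running` and `nxt` are made-up names for the walkthrough values):

```r
goal <- 500
running <- 460  # current group total after the first sentence
nxt <- 100      # length of the next sentence

under_if_stop <- goal - running          # 40 under if we close the group here
over_if_add   <- (running + nxt) - goal  # 60 over if we pull the item in

# start a new group only when stopping short beats overshooting
under_if_stop < over_if_add
#> [1] TRUE
```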
I can't decide if this is interval cutting or a type of constrained agglomerative clustering. (I can't use tokenizers::chunk_text because I want to group rows to retain metadata.)
Anyway, I've written a function that kinda works (see below), but I had to special-case the last value. It actually seems harder than this, though, because there could be a few small values at the end that should all be added. So maybe it's recursive?
Anyway, any suggestions appreciated! I'm terrible at iteration in R, as I'm sure is clear.
cut_width_min_diff <- function(df, width) {
  # df is a vector of sentence lengths (one per tibble row), in order
  curr_total <- 0
  curr_group <- 1
  group_labels <- integer(length(df))  # preallocate rather than growing with c()
  for (i in seq_along(df)) {
    if (need_new_group(curr_total, df[i], width, i == length(df))) {
      # start a new group containing just this item
      curr_group <- curr_group + 1
      curr_total <- df[i]
    } else {
      # keep in this group (either not blown, or overage less than underage)
      curr_total <- curr_total + df[i]
    }
    # either way, store the group for this df[i]
    group_labels[i] <- curr_group
  }
  group_labels
}
# Two choices: keep df[i] in the current group or start a new one.
# Test adding this item: if the current group would still be under the goal,
# always add. If adding would blow the group, test whether it's better to
# add it (group too far over) or not (group too far under).
need_new_group <- function(curr_total, candidate, goal_width, last = FALSE) {
  poss_new_total <- curr_total + candidate
  if (poss_new_total < goal_width) {
    return(FALSE)  # doesn't blow the group, so always add
  }
  # The group is now blown; is it better to add the candidate or not?
  overage  <- poss_new_total - goal_width  # error if we add it anyway
  underage <- goal_width - curr_total      # error if we close the group now
  decision <- underage < overage
  if (last) {
    # Special case: starting a new group here would leave the candidate
    # alone in a (possibly very short) final group.
    last_underage <- goal_width - candidate
    if (abs(last_underage) > abs(underage) && abs(last_underage) > abs(overage)) {
      decision <- FALSE
    }
  }
  decision
}
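One direction I've wondered about, sketched here rather than as a polished answer: the "few small values at the end" problem goes away if every possible split is scored, e.g. with dynamic programming over contiguous partitions. `chunk_dp` is a made-up name, and it assumes the objective is minimizing the total absolute deviation from the goal (it doesn't separately minimize the number of groups, so it may differ from the greedy version):

```r
chunk_dp <- function(x, goal) {
  n <- length(x)
  csum <- c(0, cumsum(x))    # csum[i + 1] = sum of x[1:i]
  best <- c(0, rep(Inf, n))  # best[i + 1] = min total deviation for x[1:i]
  cut_at <- integer(n)       # cut_at[i] = start index of the group ending at i
  for (i in seq_len(n)) {
    for (j in seq_len(i)) {  # candidate final group x[j:i]
      cost <- best[j] + abs(goal - (csum[i + 1] - csum[j]))
      if (cost < best[i + 1]) {
        best[i + 1] <- cost
        cut_at[i] <- j
      }
    }
  }
  # walk the cuts back from the end, then relabel left to right
  labels <- integer(n)
  i <- n
  g <- 0
  while (i > 0) {
    g <- g + 1
    labels[cut_at[i]:i] <- g
    i <- cut_at[i] - 1
  }
  (g + 1) - labels
}

chunk_dp(c(460, 100, 200, 200, 20), 500)
#> [1] 1 2 2 2 2
```

It's O(n^2) over sentences, which should be fine at typical document sizes, and it reproduces the hand-worked example above without any last-item special case.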