Is there an R function to help me create 5 equally balanced lists?

I have 180 words that each have a numerical value for three variables: concreteness, length, and log frequency. For example, here are 4 of the words:

No.   Word       Concreteness   Length   Log Frequency
1     Ceiling    606            7        8.516
2     Dancer     558            6        8.004
3     Prince     542            6        9.31
4     Medicine   517            8        9.893

I want to divide the 180 words into five groups (A, B, C, D, E) so that each group has roughly the same overall average for concreteness, length, and log frequency. In total, there will be 5 groups of 36 words each.

I have been trying to do this by hand but it is very complicated and frustrating, because balancing the groups on log frequency then messes up the averages for length or concreteness, etc. I also have another set of words that need to be balanced on 12 variables!!

Thank you in advance for the help.

A 3-dimensional problem! There may be some clever stats/modelling approach to estimating this through some kind of stepwise iteration and comparison. But I don't know enough about that.

There are a lot of ways of obtaining a set of 36 elements from a list of 180! I don't think it's reasonable to attempt this by hand.

factorial(180)/(factorial(36)*factorial(144))
[1] Inf
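The Inf is just factorial(180) overflowing R's double-precision range; the count of possible 36-word subsets can be computed directly, and finitely, with the built-in binomial coefficient:

choose(180, 36)

It's still an astronomically large number, so an exhaustive search is out of the question.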

Having said that, you don't need the perfect answer, just one that's reasonably close.

I'd start by pulling out 5 words at random. It probably doesn't matter which words you begin with.
Then, for each of those, find the word from the remaining list that best balances its concreteness, i.e. gives the pair an average concreteness close to the average for the dataset as a whole. Now, as you say, the average length and log frequency are probably off. So next choose a third word from the remaining 170 that gives each trio the best average length. Then choose a fourth word from the remaining 165 to give each quartet a roughly average log frequency.
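To make the counterweight idea concrete using just the four example rows above: their mean concreteness is (606 + 558 + 542 + 517) / 4 = 555.75, so if a group currently held only "Ceiling" at 606, the ideal second word would have a concreteness near 2 * 555.75 - 606 = 505.5.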

Repeat until you've got five sets of 36.

I think you have to build all five groups at the same time, stepwise (in parallel). If you build the groups of 36 one at a time (in series), your first group will be a really good approximation to the dataset average, but your fifth group might be some way off, feeding off the scraps.

I've been working on a bit of code to emulate this. It almost works...

This doesn't really work but it sort of illustrates the approach I outlined above.

I stress that I am a complete novice here, and it's quite likely that there's a cleverer approach or a sound statistical method that could be applied that I'm simply not aware of. I'm also just starting to experiment with the new tools that dplyr 1.0.0 gives us, so some of the syntax below might be a mess.

Anyway, here's my code, which might be made to work if I knew how to make recursion work properly. It just uses your example data above.

library(tibble)
library(dplyr)
library(purrr)

words_dataset <- tibble(
  word = c("ceiling", "dancer", "prince", "medicine"),
  concreteness = c(606, 558, 542, 517),
  length = c(7, 6, 6, 8),
  log_frequency = c(8.516, 8.004, 9.31, 9.893)
)

# calculate averages of the initial, whole dataset:
# we want our final groups to have similar averages
avg <- words_dataset %>%
  summarise(across(where(is.numeric), ~ mean(., na.rm = TRUE)))


select_starters <- function(df, n) {            # n is number of groups desired
  
  # choose some random rows to slice
  row_numbers <- sample(1:nrow(df), n)
  
  # return a list containing the word groups plus the remaining words
  list(
    word_groups = map(row_numbers, ~ slice(df, .)),
    remainder = slice(df, -row_numbers)
  )
}


grow_groups <- function(df1, df2, avgs) {
  
  # for a group of words, calculate their current average for each variable
  group_avg <- df1 %>% 
    summarise(across(where(is.numeric), ~ mean(., na.rm = TRUE)))
  
  # calculate the ideal "counterweight" value for the next word:
  # the value that would bring the enlarged group's average back towards
  # the whole-dataset average
  n <- nrow(df1)
  target_avg <- unlist(avgs) * (n + 1) - unlist(group_avg) * n
  
  # calculate a score for each word in the remainder data
  # this score represents how close each word is overall to the target average
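  # (note: the three variables are on very different scales here, so the raw
  # concreteness difference will dominate this sum; scaling each variable
  # first, as the reply below does, would weight them more evenly)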
  df2 <- df2 %>% 
    mutate(
      concreteness_score = abs(concreteness - target_avg[["concreteness"]]),
      length_score = abs(length - target_avg[["length"]]),
      log_frequency_score = abs(log_frequency - target_avg[["log_frequency"]])) %>% 
    rowwise() %>% 
    mutate(
      score_total = sum(c_across(ends_with("score")))) %>% 
    # sort by score so that the word with the smallest score (closest to
    # target average) is at the top
    arrange(score_total) %>% 
    select(c(word, concreteness, length, log_frequency))
  
  # add the top row of the remainder to the word group
  df1 <- bind_rows(df1, head(df2, 1))
  
  # remove the top row, leaving the remainder
  if (nrow(df2) > 1) {
    df2 <- tail(df2, -1)
  } else {
    df2 <- NULL
  }

  # return the resulting list (new word group plus new remainder)
  list(df1, df2)
  
}


# create list of random initial seed words
initial_groups <- select_starters(words_dataset, 2)

choose_your_partners <- function(df_list) {
  
  if (length(df_list) != 2) {
    stop("List must have length 2")
  }
  
  # for each word group, add a new word
  # (note: every group is handed the same remainder here, which is the
  # problem described below)
  df_list[[1]] %>% 
    map( ~ grow_groups(., df2 = df_list[[2]], avgs = avg))
  
}

# problem with not knowing how to do recursion properly
# this **doesn't work** because the new (reduced) remainder doesn't get fed back in;
# instead the previous remainder data is re-used,
# which means that a word can be added to more than one group (words ought to be
# permanently removed from the remainder set immediately after being selected)
round1 <- choose_your_partners(df_list = initial_groups)
round2 <- choose_your_partners(df_list = round1)
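One way around this (just a sketch; grow_all_groups is a name I'm making up here, not something from a package) is to drop the map() call and thread the shrinking remainder through the groups with a plain loop, so each chosen word is removed before the next group picks:

grow_all_groups <- function(word_groups, remainder, avgs) {
  for (i in seq_along(word_groups)) {
    # stop handing out words once the remainder is exhausted
    if (is.null(remainder) || nrow(remainder) == 0) break
    result <- grow_groups(word_groups[[i]], remainder, avgs)
    word_groups[[i]] <- result[[1]]   # the group with its new word added
    remainder <- result[[2]]          # the remainder minus that word
  }
  list(word_groups = word_groups, remainder = remainder)
}

# each call adds (at most) one word to every group
round1_fixed <- grow_all_groups(initial_groups$word_groups, initial_groups$remainder, avg)
# with the full 180-word dataset you would keep calling this (or wrap it in
# another loop) until the remainder is empty and each group has 36 words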

You could attempt a purely random search and possibly find reasonable results in a reasonable time.
You should probably normalise the variables you want to balance first, so that no single variable dominates the comparison.


library(tidyverse)

set.seed(42)

# scale each numeric column to unit length so that no single variable
# dominates the balance score
normvar <- function(x) {
  x / sqrt(sum(x^2))
}

iris_norm <- iris %>%
  mutate_if(is.numeric, normvar) %>%
  mutate(original_index = factor(row_number()))

# whole-dataset means that every group should approximate
(global_means <- summarise_if(iris_norm, .predicate = is.numeric, mean))
cols <- ncol(global_means)
names(global_means) <- paste0("g", names(global_means))

# randomly split the rows into 5 (nearly) equal groups
rand_split_5 <- function(x) {
  x %>%
    mutate(randindex = sample.int(nrow(x), size = nrow(x), replace = FALSE) %% 5) %>%
    group_by(randindex) %>%
    group_map(~ as_tibble(.x))
}

r1 <- rand_split_5(iris_norm)
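Each of the 5 groups should contain exactly 30 of the 150 rows, which you can confirm with:

map_int(r1, nrow)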



summarise_a_split <- function(x) {
  group_means <- map_dfr(
    x,
    ~ summarise_if(., .predicate = is.numeric, mean)
  )
  comb <- expand_grid(group_means, global_means)
  # per-column RMSE between each group's mean and the global mean
  column_rmse <- map_dfc(
    names(group_means),
    ~ (comb[[.]] - comb[[paste0("g", .)]])^2
  ) %>%
    summarise_all(mean) %>%
    summarise_all(sqrt)
  ## sum the per-column RMSEs into a single balance score for this split
  reduce(column_rmse, .f = `+`)
}

summarise_a_split(r1)

splits_to_do <- 1000
# do x many splits and find the most balanced
splits_x <- map(
  1:splits_to_do,
  ~ rand_split_5(iris_norm)
)

# summarise them
summaries_x <- map_dbl(
  splits_x,
  ~ summarise_a_split(.)
)

# find the best (most balanced) split
(best_fit <- which(summaries_x == min(summaries_x)))
summaries_x[(best_fit - 2):(best_fit + 2)]

best <- splits_x[[best_fit]]


# find the worst (least balanced) split
(worst_fit <- which(summaries_x == max(summaries_x)))
summaries_x[worst_fit]

worst <- splits_x[[worst_fit]]

# group means of the best split (normalised scale)
map_dfr(
  best,
  ~ summarise_if(., .predicate = is.numeric, mean)
)
# # A tibble: 5 x 4
# Sepal.Length Sepal.Width Petal.Length Petal.Width
# <dbl>       <dbl>        <dbl>       <dbl>
# 1       0.0821      0.0817       0.0752      0.0698
# 2       0.0795      0.0789       0.0747      0.0692
# 3       0.0812      0.0806       0.0732      0.0671
# 4       0.0807      0.0832       0.0736      0.0704
# 5       0.0808      0.0799       0.0730      0.0684
# group means of the worst split (normalised scale)
map_dfr(
  worst,
  ~ summarise_if(., .predicate = is.numeric, mean)
)
# # A tibble: 5 x 4
# Sepal.Length Sepal.Width Petal.Length Petal.Width
# <dbl>       <dbl>        <dbl>       <dbl>
# 1       0.0747      0.0860       0.0510      0.0433
# 2       0.0804      0.0796       0.0753      0.0704
# 3       0.0797      0.0786       0.0714      0.0652
# 4       0.0851      0.0809       0.0831      0.0796
# 5       0.0843      0.0791       0.0889      0.0865

# interesting to compare species distribution

walk(
  best,
  ~ print(table(pull(., Species)))
)

# setosa versicolor  virginica
# 10          9         11
#
# setosa versicolor  virginica
# 10          7         13
#
# setosa versicolor  virginica
# 10         12          8
#
# setosa versicolor  virginica
# 11          9         10
#
# setosa versicolor  virginica
# 9         13          8


walk(
  worst,
  ~ print(table(pull(., Species)))
)
#
# setosa versicolor  virginica
# 20          5          5
#
# setosa versicolor  virginica
# 10          9         11
#
# setosa versicolor  virginica
# 10         12          8
#
# setosa versicolor  virginica
# 6         12         12
#
# setosa versicolor  virginica
# 4         12         14

#----------------------

# let's recover the chosen grouping for the original iris data rather than iris_norm

indexes_to_use <- map(
  best,
  ~ pull(., original_index) %>% as.integer()
) %>%
  enframe(value = "original_index", name = "group") %>%
  unnest(cols = c(original_index)) %>%
  arrange(original_index)

new_iris <- bind_cols(indexes_to_use, iris)
new_iris_split <- group_by(new_iris, group) %>% group_map(~ as_tibble(.x))
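As a quick sanity check (a sketch using the objects above), you can compare each recovered group's means on the original scale with the overall iris means:

map_dfr(new_iris_split, ~ summarise(., across(Sepal.Length:Petal.Width, mean)))
summarise(iris, across(Sepal.Length:Petal.Width, mean))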
