In search of optimization packages and tutorials to move from deterministic to stochastic models

Not sure if this is the best channel for this type of question, but it feels too half-baked to post on Stack Overflow or Cross Validated.

At the end of the day, I am in search of resources (tutorials) for moving from deterministic operations research / optimization models to stochastic ones. Admittedly, this is not my strong suit, so I may be abusing terminology.


Consider the following assignment (agent-to-task) problem:

  • Three (3) tasks need to be completed.
  • Four (4) agents [workers] to assign to those tasks.
  • Each agent can be assigned to at most one task, and not every agent needs to be used (since there are more agents than tasks).
  • Minimize (or maximize) the assignment costs.

| Agent | Task 1 | Task 2 | Task 3 |
|-------|--------|--------|--------|
| A     | 11     | 14     | 6      |
| B     | 8      | 10     | 11     |
| C     | 9      | 12     | 7      |
| D     | 10     | 11     | 7      |

We could solve (minimize) this via the following binary assignments:

| Agent | Task 1 | Task 2 | Task 3 |
|-------|--------|--------|--------|
| A     | 0      | 0      | 1      |
| B     | 1      | 0      | 0      |
| C     | 0      | 0      | 0      |
| D     | 0      | 1      | 0      |

...yielding a total cost of 25.
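For reference, here is a minimal sketch of how the deterministic version might be solved in R, assuming the lpSolve package. Padding the cost matrix with a zero-cost dummy task makes it square, so lp.assign() can be used directly; the agent that lands on the dummy column is simply the one left unused.

```r
library(lpSolve)  # assuming lpSolve is installed; lp.assign() solves assignment problems

# Cost matrix: rows = agents A-D, columns = tasks 1-3
cost <- matrix(c(11, 14,  6,
                  8, 10, 11,
                  9, 12,  7,
                 10, 11,  7),
               nrow = 4, byrow = TRUE,
               dimnames = list(c("A", "B", "C", "D"), paste("Task", 1:3)))

# Pad with a zero-cost dummy task so the matrix is square; the agent
# assigned to the dummy column is simply the one left unused.
cost_sq <- cbind(cost, Dummy = 0)

fit <- lp.assign(cost_sq, direction = "min")
fit$solution                 # binary assignment matrix (incl. dummy column)
sum(cost_sq * fit$solution)  # total cost: 25 for this data
```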

...but what happens if, instead of plain-vanilla deterministic costs, we consider variability? Say each cost is normally distributed: N(mu, sigma).

| Agent | Task 1    | Task 2    | Task 3    |
|-------|-----------|-----------|-----------|
| A     | (11, 2.0) | (14, 1.0) | (6, 1.5)  |
| B     | (8, 3.0)  | (10, 2.5) | (11, 1.0) |
| C     | (9, 1.5)  | (12, 0.5) | (7, 0.5)  |
| D     | (10, 1.0) | (11, 2.5) | (7, 1.5)  |

Questions

  1. How does the approach change? Do we bootstrap here? Simulate?

  2. How do the assignments change?

  3. If we assume Gaussian distributions, are there any tricks or shortcuts?

  4. How would I code this in R? I've found a few starting points (task views, random books/postings), but nothing really "complete" or at the appropriate beginner level.

  5. Packages? Vignettes?

I don't know about this either, but a simulation approach would be to generate a "realization" of costs by sampling from their distributions, solve the assignment problem (treating those sampled costs as deterministic), and then repeat many times. It seems to me that this estimates the probability that each possible assignment of agents to tasks is optimal, since the costs could vary enough that the optimal assignment depends on the particular realization you observe. Presumably you'd then look at your simulation results, see which assignment is chosen most often, and call that the "best" one over the distribution of costs.

In terms of programming that, I envisage a function that takes the distributions of costs as input (like your matrix above the questions), samples a set of costs, and works out and returns the optimal assignment for that sample. You would then run it through replicate, keep track of all the assignments that are returned, and find the most "popular" one.
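A rough sketch of that idea, assuming lpSolve is available and reusing the same dummy-column trick as above (solve_one is just an illustrative name):

```r
library(lpSolve)

# Mean and sd matrices from the (mu, sigma) table above (rows = agents A-D)
mu    <- matrix(c(11, 14,  6,   8, 10, 11,   9, 12, 7,   10, 11, 7),
                nrow = 4, byrow = TRUE,
                dimnames = list(c("A", "B", "C", "D"), paste("Task", 1:3)))
sigma <- matrix(c(2.0, 1.0, 1.5,  3.0, 2.5, 1.0,  1.5, 0.5, 0.5,  1.0, 2.5, 1.5),
                nrow = 4, byrow = TRUE, dimnames = dimnames(mu))

# Sample one realization of costs and return the optimal assignment as a label
solve_one <- function() {
  costs    <- matrix(rnorm(length(mu), mean = mu, sd = sigma),
                     nrow = nrow(mu), dimnames = dimnames(mu))
  costs_sq <- cbind(costs, Dummy = 0)          # dummy task for the unused agent
  sol      <- lp.assign(costs_sq, direction = "min")$solution
  # Which agent got each real task, e.g. "B-D-A" for tasks 1, 2, 3
  paste(rownames(mu)[apply(sol[, 1:3], 2, which.max)], collapse = "-")
}

set.seed(1)
results <- replicate(5000, solve_one())
sort(table(results), decreasing = TRUE)        # how often each assignment "wins"
```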

It strikes me as one of those problems that might have some theoretical shortcuts, but I don't know of any. It seems like the sort of thing that somebody must have explored.

I would add that you will get a more complete picture if you calculate every possible assignment and its associated cost for each random sample. Depending on the exact nature of the random inputs (unlikely with a normal distribution, but maybe possible), one assignment could be optimal a plurality of the time, yet with many of its non-optimal draws much costlier than a second assignment that is consistently in second place.
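Something along these lines, maybe; just a sketch, with the same mu and sigma values as above and purely illustrative names (grid, cost_draws):

```r
agents <- c("A", "B", "C", "D")
mu    <- matrix(c(11, 14, 6,  8, 10, 11,  9, 12, 7,  10, 11, 7), nrow = 4, byrow = TRUE)
sigma <- matrix(c(2, 1, 1.5,  3, 2.5, 1,  1.5, 0.5, 0.5,  1, 2.5, 1.5), nrow = 4, byrow = TRUE)

# All ordered choices of 3 distinct agents for tasks 1-3 (4 * 3 * 2 = 24 assignments)
grid <- expand.grid(t1 = agents, t2 = agents, t3 = agents, stringsAsFactors = FALSE)
grid <- grid[apply(grid, 1, function(a) length(unique(a)) == 3), ]

n_sims <- 5000
set.seed(2)
# For each simulated cost matrix, record the cost of *every* assignment
cost_draws <- replicate(n_sims, {
  costs <- matrix(rnorm(length(mu), mu, sigma), nrow = nrow(mu))
  apply(grid, 1, function(a) {
    idx <- match(a, agents)
    costs[idx[1], 1] + costs[idx[2], 2] + costs[idx[3], 3]
  })
})  # 24 x n_sims matrix: rows = assignments, columns = draws
rownames(cost_draws) <- apply(grid, 1, paste, collapse = "-")

col_min <- apply(cost_draws, 2, min)     # best cost achieved in each draw
summary_stats <- data.frame(
  mean   = rowMeans(cost_draws),
  sd     = apply(cost_draws, 1, sd),
  p_best = rowMeans(cost_draws == rep(col_min, each = nrow(cost_draws)))
)
head(summary_stats[order(summary_stats$mean), ])
```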

I'm tempted to write such a simulation myself, as it wouldn't be a ton of work and I need more tidyverse-style simulations under my belt, but something from my old math stats class is bugging me and making me think that I'm missing a numerical solution.

Edit: Oh, right. As I suspected, it took about three lines of code before I figured out what I was thinking. Assuming normal distributions for each cost as stated, the mean cost of a given assignment is just the sum of the means, and the standard deviation of the cost is the square root of the sum of the variances (the squared standard deviations).

If the goal is to minimize long-run average cost, the standard deviations don't matter; stick with the means. If you have a different loss function (say you want to minimize the chance that any given run costs more than 30), then you would calculate the corresponding probability from the combined normal for each assignment option.
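For example, for the deterministic optimum above (A to Task 3, B to Task 1, D to Task 2), a quick back-of-the-envelope version of that calculation (the threshold of 30 is just illustrative):

```r
# Combined normal for the assignment A -> Task 3, B -> Task 1, D -> Task 2
mu_total <- 6 + 8 + 11                        # sum of the means = 25
sd_total <- sqrt(1.5^2 + 3.0^2 + 2.5^2)       # sqrt of the summed variances, ~4.18

# Probability that a single realization of this assignment costs more than 30
1 - pnorm(30, mean = mu_total, sd = sd_total) # ~0.12
```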

I may still show how I would come up with a simulation anyway if I get some time, though.

Thank you @kenbutler and @nick. I went ahead and created a quick gist that has the data and the regular solver solution. I still might play with a replicate/simulation approach, but I think @nick is right and has found the "theoretical shortcut": in the long run the expected cost depends only on the means, and the standard deviations do not matter.
