# Best way to optimize loops

I am trying to build a simple simulation of data that takes in a unique mean and standard devation per row (My input dataset (`data` in the code below) is 100+ rows consisting of `Name`, `mean_proj`, and `proj_std`. I want to simulate the range of outcomes for each row many times over to then perform analysis on. The following is what I have done:

``````n_sims <- 5000

sim_results <- data.frame(matrix(NA, nrow = as.numeric(count(data)), ncol = 0))

for (i in 1:as.numeric(count(data))) {
sim_results[i, "name"] <- data\$Name[i]

for (j in 1:n_sims) {
sim_results[i, paste0("sim_", j)] <- qnorm(runif(1), data\$mean_proj[i], data\$proj_std[i])
}
}
``````

The good news is that it works (creates a new dataframe called `sim_results` with a `name` column followed by many columns `sim_n` with that iteration of the simulated data), the bad news, is that as `n_sims` gets larger, it takes a LOT longer to run. Is there a better, more r-centric, way that I should be looking in to? Would love some tips!

Is this quicker?

``````
n_sims <- 5000
#invent some data
set.seed(1)
data <- data.frame(Name = LETTERS[1:5], mean_proj = runif(5, min = 1, max =10), proj_sd = runif(5, 0.1, 2))

sim_results <- matrix(NA, nrow = nrow(data), ncol = n_sims)
ColNms <- paste("sim", 1:n_sims, sep = "_")
for (i in 1:nrow(data)) {
sim_results[i,] <- qnorm(runif(n_sims), data\$mean_proj[i], data\$proj_sd[i])

}
DF <- as.data.frame(sim_results)
colnames(DF) <- ColNms
DF\$Name <- data\$Name
``````

Created on 2019-12-17 by the reprex package (v0.2.1)

You can do this without any loops. Here's an example with fake data and a smaller number of simulations, just for illustration:

``````library(tidyverse)

data = data.frame(Name=1:3, mean_proj=c(0, 100, 1000), proj_std=c(5,10,15))
n_sims = 5

sim_result = data %>%
group_by(Name) %>%
mutate(sim.vals = list(qnorm(runif(n_sims), mean_proj, proj_std)))
``````

(Just curious: Is there some reason you wanted to do `qnorm(runif(n_sims), mean_proj, proj_std)` instead of `rnorm(n_sims, mean_proj, proj_std)`?)

You now have a data frame where the `sim.vals` column contains the simulation values (nested inside lists) for each row of `data`.

``````sim_result

Name mean_proj proj_std sim.vals
<int>     <dbl>    <dbl> <list>
1     1         0        5 <dbl >
2     2       100       10 <dbl >
3     3      1000       15 <dbl >
``````

You can see all of the `sim.vals` by unnesting:

``````sim_result %>% unnest(sim.vals)

Name mean_proj proj_std sim.vals
<int>     <dbl>    <dbl>    <dbl>
1     1         0        5    -2.70
2     1         0        5     3.14
3     1         0        5     8.81
4     1         0        5     5.90
5     1         0        5     2.59
6     2       100       10   105.
7     2       100       10    97.7
8     2       100       10   112.
9     2       100       10   103.
10     2       100       10    97.3
11     3      1000       15   987.
12     3      1000       15   983.
13     3      1000       15   997.
14     3      1000       15  1014.
15     3      1000       15   999.
``````

You can work with the data frame in various ways. For example:

``````sim_result %>%
mutate(mean.sim = mean(unlist(sim.vals)),
sd.sim = sd(unlist(sim.vals)))
``````

or

``````sim_result %>%
unnest(sim.vals) %>%
group_by(Name) %>%
summarise(mean=mean(sim.vals),
sd=sd(sim.vals))
``````

or

``````sim_result %>%
summarise(quantiles = list(enframe(quantile(unlist(sim.vals))))) %>%
unnest(quantiles) %>%
mutate(name = factor(name, levels=unique(name))) %>%
``````
3 Likes

Using this method, `mutate(sim.vals = list(qnorm(runif(n_sims), mean_proj, proj_std)))` is there a way to index each sim as well? Part of my analysis I would like to do would be to count what % of sims resulted in being the highest value for that iteration.

I'm not sure I understand. Can you say more about what you mean by "I would like to...count what % of sims resulted in being the highest value for that iteration."

Basically if I run through 100 iterations on simulations - I want to look at each entry grouped by `Name` and be able to say

10% of iterations player 1 was the top score
1% of iterations player 2 was the lowest score

I'm still not sure what we're comparing. Is each `Name` a player? What constitutes an "iteration"? Is this evaluation taking place within each `Name` or for a given index value across all of the `Name`s. A small example of what you want to do would be helpful. So in this example, and the kind of summary I want to be able to generate is:

"joe" lands up with the 1st place score in 52% of the sims, 2nd place 23% of the time, 3rd place 8%

"sally" is 1st place 22% of the time, 2nd 23.4% of the time, etc...

Here's an example of ranking by simulated scores:

``````# Fake data
# Set seeds for reproducibility
set.seed(2)
data = data.frame(Name=LETTERS[1:5], mean_proj=rnorm(5, 50, 5), proj_std=runif(5, 5, 10))
n_sims = 1000

set.seed(105)
sim_result = data %>%
group_by(Name) %>%
mutate(sim.vals = list(qnorm(runif(n_sims), mean_proj, proj_std)))

# Rank order Name within each sim.index
sim_result = sim_result %>%
unnest(sim.vals) %>%
group_by(Name) %>%
mutate(sim.index = 1:n()) %>%
group_by(sim.index) %>%
arrange(desc(sim.vals)) %>%
mutate(sim.rank = 1:n()) %>%
arrange(sim.index, sim.rank)

sim_result
``````
``````# A tibble: 5,000 x 6
# Groups:   sim.index [1,000]
Name  mean_proj proj_std sim.vals sim.index sim.rank
<fct>     <dbl>    <dbl>    <dbl>     <int>    <int>
1 E          49.6     7.03     54.2         1        1
2 B          50.9     6.19     53.8         1        2
3 C          57.9     8.80     51.0         1        3
4 D          44.3     5.90     43.8         1        4
5 A          45.5     7.76     35.5         1        5
6 A          45.5     7.76     65.9         2        1
7 C          57.9     8.80     62.2         2        2
8 B          50.9     6.19     55.7         2        3
9 E          49.6     7.03     40.0         2        4
10 D          44.3     5.90     38.0         2        5
# … with 4,990 more rows
``````
``````#Summarize simulation results
sim_result %>%
select(Name, sim.vals, sim.index) %>%
arrange(Name) %>%
pivot_wider(names_from=Name, values_from=sim.vals)
``````
``````# A tibble: 1,000 x 6
# Groups:   sim.index [1,000]
sim.index     A     B     C     D     E
<int> <dbl> <dbl> <dbl> <dbl> <dbl>
1         1  35.5  53.8  51.0  43.8  54.2
2         2  65.9  55.7  62.2  38.0  40.0
3         3  41.9  51.8  67.5  41.1  48.8
4         4  43.5  45.7  64.7  45.7  49.2
5         5  46.6  47.7  65.5  41.7  62.4
6         6  51.9  48.5  60.3  46.3  43.8
7         7  55.6  40.2  57.0  50.9  43.7
8         8  52.9  44.9  55.6  44.4  51.7
9         9  46.4  53.8  51.0  46.6  42.3
10        10  31.8  48.1  70.1  46.5  59.7
# … with 990 more rows
``````
``````# Summarize ranks
sim_result %>%
group_by(Name, sim.rank) %>%
tally %>%
mutate(pct = n/sum(n)) %>%
pivot_wider(names_from=Name, values_from=c(n, pct))
``````
``````  sim.rank   n_A   n_B   n_C   n_D   n_E pct_A pct_B pct_C pct_D pct_E
<int> <int> <int> <int> <int> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
1        1    73   160   597    34   136 0.073 0.16  0.597 0.034 0.136
2        2   164   327   181    89   239 0.164 0.327 0.181 0.089 0.239
3        3   176   273   106   172   273 0.176 0.273 0.106 0.172 0.273
4        4   251   166    73   305   205 0.251 0.166 0.073 0.305 0.205
5        5   336    74    43   400   147 0.336 0.074 0.043 0.4   0.147
``````
2 Likes

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.