Best way to optimize loops

I am trying to build a simple simulation of data that takes in a unique mean and standard deviation per row. My input dataset (data in the code below) is 100+ rows consisting of Name, mean_proj, and proj_std. I want to simulate the range of outcomes for each row many times over and then perform analysis on the results. The following is what I have done:

library(dplyr)  # count() used below comes from dplyr

n_sims <- 5000

sim_results <- data.frame(matrix(NA, nrow = as.numeric(count(data)), ncol = 0))

for (i in 1:as.numeric(count(data))) {
  sim_results[i, "name"] <- data$Name[i]
  
  # one qnorm(runif(1), ...) call and one data-frame assignment per cell
  for (j in 1:n_sims) {
    sim_results[i, paste0("sim_", j)] <- qnorm(runif(1), data$mean_proj[i], data$proj_std[i])
  }
}

The good news is that it works (it creates a new data frame called sim_results with a name column followed by many sim_n columns, one per simulated iteration); the bad news is that as n_sims gets larger, it takes a LOT longer to run. Is there a better, more R-centric way that I should be looking into? Would love some tips!

Is this quicker?


n_sims <- 5000
#invent some data
set.seed(1)
data <- data.frame(Name = LETTERS[1:5], mean_proj = runif(5, min = 1, max =10), proj_sd = runif(5, 0.1, 2))

sim_results <- matrix(NA, nrow = nrow(data), ncol = n_sims)
ColNms <- paste("sim", 1:n_sims, sep = "_")
# one vectorised qnorm() call fills a whole row of simulations at once
for (i in 1:nrow(data)) {
  sim_results[i, ] <- qnorm(runif(n_sims), data$mean_proj[i], data$proj_sd[i])
}
DF <- as.data.frame(sim_results)
colnames(DF) <- ColNms
DF$Name <- data$Name

Created on 2019-12-17 by the reprex package (v0.2.1)
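
Going one step further, the loop can be dropped entirely: rnorm() recycles its mean and sd arguments over the rows, and matrix() fills column-wise, so each row of the result keeps its own mean and sd. A minimal sketch using the same data as above (sim_mat and DF2 are just illustrative names):

# all draws in one call; row i always uses data$mean_proj[i] and data$proj_sd[i]
sim_mat <- matrix(
  rnorm(nrow(data) * n_sims, mean = data$mean_proj, sd = data$proj_sd),
  nrow = nrow(data)
)
DF2 <- as.data.frame(sim_mat)
colnames(DF2) <- paste("sim", 1:n_sims, sep = "_")
DF2$Name <- data$Name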

You can do this without any loops. Here's an example with fake data and a smaller number of simulations, just for illustration:

library(tidyverse)

data = data.frame(Name=1:3, mean_proj=c(0, 100, 1000), proj_std=c(5,10,15))
n_sims = 5

sim_result = data %>% 
  group_by(Name) %>% 
  mutate(sim.vals = list(qnorm(runif(n_sims), mean_proj, proj_std)))

(Just curious: Is there some reason you wanted to do qnorm(runif(n_sims), mean_proj, proj_std) instead of rnorm(n_sims, mean_proj, proj_std)?)
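Side note: qnorm(runif(n), m, s) is inverse transform sampling, so it draws from the same normal distribution as rnorm(n, m, s); a quick illustrative check with arbitrary mean and sd:

x <- qnorm(runif(1e5), 100, 10)  # inverse transform sampling
y <- rnorm(1e5, 100, 10)         # direct draw
c(mean(x), sd(x))  # both roughly 100 and 10
c(mean(y), sd(y))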

You now have a data frame where the sim.vals column contains the simulation values (nested inside lists) for each row of data.

sim_result

   Name mean_proj proj_std sim.vals 
  <int>     <dbl>    <dbl> <list>   
1     1         0        5 <dbl [5]>
2     2       100       10 <dbl [5]>
3     3      1000       15 <dbl [5]>

You can see all of the sim.vals by unnesting:

sim_result %>% unnest(sim.vals)

    Name mean_proj proj_std sim.vals
   <int>     <dbl>    <dbl>    <dbl>
 1     1         0        5    -2.70
 2     1         0        5     3.14
 3     1         0        5     8.81
 4     1         0        5     5.90
 5     1         0        5     2.59
 6     2       100       10   105.  
 7     2       100       10    97.7 
 8     2       100       10   112.  
 9     2       100       10   103.  
10     2       100       10    97.3 
11     3      1000       15   987.  
12     3      1000       15   983.  
13     3      1000       15   997.  
14     3      1000       15  1014.  
15     3      1000       15   999.  

You can work with the data frame in various ways. For example:

sim_result %>% 
  mutate(mean.sim = mean(unlist(sim.vals)),
         sd.sim = sd(unlist(sim.vals)))

or

sim_result %>% 
  unnest(sim.vals) %>% 
  group_by(Name) %>% 
  summarise(mean=mean(sim.vals),
            sd=sd(sim.vals))

or

sim_result %>% 
  summarise(quantiles = list(enframe(quantile(unlist(sim.vals))))) %>% 
  unnest(quantiles) %>% 
  mutate(name = factor(name, levels=unique(name))) %>% 
  spread(name, value)
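
For reference, spread() has been superseded by pivot_wider(), which keeps columns in their order of appearance, so the factor() step isn't needed. An equivalent sketch:

sim_result %>% 
  summarise(quantiles = list(enframe(quantile(unlist(sim.vals))))) %>% 
  unnest(quantiles) %>% 
  pivot_wider(names_from = name, values_from = value)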

Using this method, mutate(sim.vals = list(qnorm(runif(n_sims), mean_proj, proj_std))), is there a way to index each sim as well? Part of the analysis I would like to do is to count, for each Name, what % of sims resulted in the highest value for that iteration.

I'm not sure I understand. Can you say more about what you mean by "I would like to...count what % of sims resulted in being the highest value for that iteration"?

Basically, if I run through 100 iterations of simulations, I want to look at each entry grouped by Name and be able to say:

In 10% of iterations player 1 had the top score
In 1% of iterations player 2 had the lowest score

I'm still not sure what we're comparing. Is each Name a player? What constitutes an "iteration"? Is this evaluation taking place within each Name, or for a given index value across all of the Names? A small example of what you want to do would be helpful.

So, for example, the kind of summary I want to be able to generate is:

"joe" lands up with the 1st place score in 52% of the sims, 2nd place 23% of the time, 3rd place 8%

"sally" is 1st place 22% of the time, 2nd 23.4% of the time, etc...

Here's an example of ranking by simulated scores:

# Fake data
# Set seeds for reproducibility
set.seed(2)
data = data.frame(Name=LETTERS[1:5], mean_proj=rnorm(5, 50, 5), proj_std=runif(5, 5, 10))
n_sims = 1000

set.seed(105)
sim_result = data %>% 
  group_by(Name) %>% 
  mutate(sim.vals = list(qnorm(runif(n_sims), mean_proj, proj_std)))

# Rank order Name within each sim.index
sim_result = sim_result %>% 
  unnest(sim.vals) %>% 
  group_by(Name) %>% 
  mutate(sim.index = 1:n()) %>% 
  group_by(sim.index) %>% 
  arrange(desc(sim.vals)) %>% 
  mutate(sim.rank = 1:n()) %>%
  arrange(sim.index, sim.rank)

sim_result
# A tibble: 5,000 x 6
# Groups:   sim.index [1,000]
   Name  mean_proj proj_std sim.vals sim.index sim.rank
   <fct>     <dbl>    <dbl>    <dbl>     <int>    <int>
 1 E          49.6     7.03     54.2         1        1
 2 B          50.9     6.19     53.8         1        2
 3 C          57.9     8.80     51.0         1        3
 4 D          44.3     5.90     43.8         1        4
 5 A          45.5     7.76     35.5         1        5
 6 A          45.5     7.76     65.9         2        1
 7 C          57.9     8.80     62.2         2        2
 8 B          50.9     6.19     55.7         2        3
 9 E          49.6     7.03     40.0         2        4
10 D          44.3     5.90     38.0         2        5
# … with 4,990 more rows
# Summarize simulation results
sim_result %>% 
  select(Name, sim.vals, sim.index) %>% 
  arrange(Name) %>% 
  pivot_wider(names_from=Name, values_from=sim.vals)
# A tibble: 1,000 x 6
# Groups:   sim.index [1,000]
   sim.index     A     B     C     D     E
       <int> <dbl> <dbl> <dbl> <dbl> <dbl>
 1         1  35.5  53.8  51.0  43.8  54.2
 2         2  65.9  55.7  62.2  38.0  40.0
 3         3  41.9  51.8  67.5  41.1  48.8
 4         4  43.5  45.7  64.7  45.7  49.2
 5         5  46.6  47.7  65.5  41.7  62.4
 6         6  51.9  48.5  60.3  46.3  43.8
 7         7  55.6  40.2  57.0  50.9  43.7
 8         8  52.9  44.9  55.6  44.4  51.7
 9         9  46.4  53.8  51.0  46.6  42.3
10        10  31.8  48.1  70.1  46.5  59.7
# … with 990 more rows
# Summarize ranks
sim_result %>% 
  group_by(Name, sim.rank) %>% 
  tally %>% 
  mutate(pct = n/sum(n)) %>% 
  pivot_wider(names_from=Name, values_from=c(n, pct))
  sim.rank   n_A   n_B   n_C   n_D   n_E pct_A pct_B pct_C pct_D pct_E
     <int> <int> <int> <int> <int> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
1        1    73   160   597    34   136 0.073 0.16  0.597 0.034 0.136
2        2   164   327   181    89   239 0.164 0.327 0.181 0.089 0.239
3        3   176   273   106   172   273 0.176 0.273 0.106 0.172 0.273
4        4   251   166    73   305   205 0.251 0.166 0.073 0.305 0.205
5        5   336    74    43   400   147 0.336 0.074 0.043 0.4   0.147
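
If speed matters at larger n_sims, the same rank summary can also be computed with a plain matrix, skipping the unnest step. A base-R sketch assuming the same data and n_sims as above (sims and ranks are illustrative names):

# rows = players, columns = iterations; rnorm() recycles mean/sd over the rows
sims <- matrix(
  rnorm(nrow(data) * n_sims, mean = data$mean_proj, sd = data$proj_std),
  nrow = nrow(data)
)
# rank within each iteration (column); 1 = highest simulated score
ranks <- apply(-sims, 2, rank)
# for each player, the share of iterations finishing in each place
prop.table(table(Name = data$Name[row(ranks)], rank = ranks), margin = 1)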
