simple iteration question

purrr

#1

Every time I think I understand how to use purrr...

How would you rewrite this with purrr and ggplot?

plot(c(0,32),c(0,15),type="n",xlab="Sample size",ylab="Variance")

for (n in seq(3,31,2)) {
for( i in 1:30){
x <- rnorm(n,mean=10,sd=2)
points(n,var(x)) }}

Example from Crawley, Michael J.;. “An Introduction Using R”.


#2

With purrr and ggplot2 it makes more sense to create an input object first and then plot it, so something like that (if I understood the code correctly):

library(ggplot2)
library(magrittr)

input <- purrr::cross_df(.l = list(n = seq(3,31,2), i = 1:30)) %>%
  dplyr::mutate(variance = purrr::map_dbl(n, ~var(rnorm(.x, mean = 10, sd = 2))))


ggplot(input, aes(x = n, y = variance)) +
  geom_point(alpha = 0.2) +
  geom_smooth()
#> `geom_smooth()` using method = 'loess' and formula 'y ~ x'

Created on 2019-01-08 by the reprex package (v0.2.1)


#3

BTW, sd probably makes more sense than var?

library(ggplot2)
library(magrittr)

input <- purrr::cross_df(.l = list(n = seq(3,31,2), i = 1:30)) %>%
  dplyr::mutate(variance = purrr::map_dbl(n, ~sd(rnorm(.x, mean = 10, sd = 2))))


ggplot(input, aes(x = n, y = variance)) +
  geom_point(alpha = 0.2) +
  geom_smooth()
#> `geom_smooth()` using method = 'loess' and formula 'y ~ x'

Created on 2019-01-08 by the reprex package (v0.2.1)


#4

Just to show a different way of doing this, here's a solution using base R and matrices. In general, I suggest (and personally use) code like @mishabalyasin showed. Tables are great for organizing and reasoning about data. However, matrices and arrays have some benefits:

  • Efficiency, both for memory size and processing time.
  • Near universality, in that most functions are vectorized or even "array-ized." This can lead to cleaner code without much indexing notation.
  • Matrix algebra (of course).

This example's actually pretty good for showcasing all three, though the benefit to efficiency is tiny given the amount of data.

Because each number returned by rnorm is independent, and the inner loop always goes from 1 to 30, we can remove the inner loop by generating a chunk of numbers at once and storing them as a matrix. When you give var to the whole thing. The values along the result's diagonal are each column's variance.

set.seed(2)

# Separately defining a named function helps clarify code
sized_variances <- function(n, iterations) {
  x <- rnorm(n * iterations, mean = 10, sd = 2)
  dim(x) <- c(n, iterations)
  diag(var(x))
}

Because what'll actually be plotted, the variance of a vector, will always be a single number no matter the input's length, we know we'll get 30 numbers back for each value of n. So vapply is a good candidate for the outer loop.

n_values <- seq(3, 31, 2)

variances <- vapply(
  X          = n_values,
  FUN        = sized_variances,
  FUN.VALUE  = numeric(30),
  iterations = 30
)

str(variances)
# num [1:30, 1:15] 6.57 4.24 5.78 1.71 1.65 ...

variances[1:5, 1:5]
#          [,1]      [,2]     [,3]     [,4]     [,5]
# [1,] 6.571416 4.2608157 7.594818 2.713868 2.282991
# [2,] 4.239376 4.7380924 7.750159 2.461479 3.223732
# [3,] 5.784260 5.8654816 2.222504 4.473939 1.215411
# [4,] 1.714714 0.6747371 4.907330 5.716885 7.229157
# [5,] 1.649887 5.2541663 4.573695 2.874950 1.867682

Because R is column-oriented, variances' order expressed as indices would be [1, 1], [2, 1], ..., [30, 1], [2, 1], [2, 2], and so on. It'll keep this order if converted to a vector, like the plot function will do. Knowing that, we can use rep to provide the x values.

plot(
  rep(n_values, each = 30),
  variances
)

variances


#5

Thank you! I've never seen cross_df() so I'll have to have a look at that.. The example came from a chapter explaining variance, but he mentions sd in the same paragraph.

@nwerth, I won't claim to completely understand how your example works, but it IS much clearer than the one from the book. Much easier to work through step by step for a noob..:+1: