Why shouldn't we use c() inside a for loop?

Sometimes I write a for loop and use c() inside it.

x <- NULL
for (i in seq_along(1:100)) {
  x <- c(x, i)
}

When I write code like this, people tell me it is not efficient.
I have heard that it is recommended to assign into a preallocated vector of zeros instead of growing a NULL vector, and I would like to know why.

Thank you.

Some references below:
From R4DS (21 Iteration | R for Data Science, had.co.nz): "Before you start the loop, you must always allocate sufficient space for the output."

And from Advanced R (5 Control flow | Advanced R, hadley.nz): "If you're generating data, make sure to preallocate the output container. Otherwise the loop will be very slow."


Thank you.
I read R4DS, but I couldn't find an explanation of why it is slow.
That is what I want to know.

Because memory allocation is a costly operation. If you extend x via c(), R allocates a new vector on every iteration, so you allocate memory many more times than if you allocate once at the start.
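As a rough illustration of the point above, here is a back-of-the-envelope count rather than a measurement. It assumes that each c(x, i) call builds a fresh vector of the new length; the variable names are made up for this sketch.

```r
n <- 100

# Growing with c(): iteration k builds a new vector of length k,
# so the total number of elements written is 1 + 2 + ... + n.
written_grow <- sum(seq_len(n))

# Preallocating: one vector of length n, each slot written once.
written_prealloc <- n

written_grow
#> [1] 5050
written_prealloc
#> [1] 100
```

Under that assumption, the growing loop does about 50 times more writing for n = 100, and the gap widens as n increases.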


I see!!
Thank you!!
So if we create the zero vector first, we only need to allocate memory once!
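A minimal sketch of that idea, for comparison with the loop in the question. The names grown and filled are only for illustration, and integer(n) is used here so that both results have the same type.

```r
n <- 100

# Growing: a new, longer vector is built on every iteration.
grown <- NULL
for (i in seq_len(n)) {
  grown <- c(grown, i)
}

# Preallocating: reserve all n slots once, then fill them by index.
filled <- integer(n)
for (i in seq_len(n)) {
  filled[[i]] <- i
}

identical(grown, filled)
#> [1] TRUE
```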

Having read Ch. 21, as suggested by @ifendo, I believe this is the explanation you are looking for:

 21.3.3 Unknown output length

Sometimes you might not know how long the output will be. For example, imagine you want to simulate some random vectors of random lengths. You might be tempted to solve this problem by progressively growing the vector:

means <- c(0, 1, 2)

output <- double()
for (i in seq_along(means)) {
  n <- sample(100, 1)
  output <- c(output, rnorm(n, means[[i]]))
}
str(output)
#>  num [1:138] 0.912 0.205 2.584 -0.789 0.588 ...

But this is not very efficient because in each iteration, R has to copy all the data from the previous iterations. In technical terms you get “quadratic” (O(n^2)) behaviour which means that a loop with three times as many elements would take nine (3^2) times as long to run.
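For the unknown-length case, one common way around this is to store each piece in a preallocated list and combine everything once after the loop. The following is a sketch along those lines, reusing the names from the excerpt above; out is a name chosen for this example, and since the lengths are random no output is shown.

```r
means <- c(0, 1, 2)

# Preallocate a list with one slot per iteration; each slot can
# hold a vector of any length.
out <- vector("list", length(means))
for (i in seq_along(means)) {
  n <- sample(100, 1)
  out[[i]] <- rnorm(n, means[[i]])
}

# Combine once, after the loop, instead of on every iteration.
output <- unlist(out)
str(output)
```

The list itself has a known length (one element per iteration), so it can be preallocated even though the total number of values is not known in advance.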

There is actually an alternative that is still slower than a full-length preallocated vector but much faster than growing a NULL vector with c(): start with a vector of length 1 and let it grow through indexed assignment inside the loop.

x <- NULL
for(i in seq_along(1:10000)){
  x <- c(x, i)
}

y <- c(NA)
for(i in seq_along(1:10000)){
  y[[i]] <- i
}

z <- rep(NA, times = 10000)
for(i in seq_along(1:10000)){
  z[[i]] <- i
}

identical(x, y)
#> [1] TRUE
identical(x, z)
#> [1] TRUE



rbenchmark::benchmark(
  nul_vec = {
    x <- NULL
    for(i in seq_along(1:10000)){
      x <- c(x, i)
    }
  },
  uni_vec = {
    y <- c(NA)
    for(i in seq_along(1:10000)){
      y[[i]] <- i
    }
  },
  ful_vec = {
    z <- rep(NA, times = 10000)
    for(i in seq_along(1:10000)){
      z[[i]] <- i
    }
  }
  
)
#>      test replications elapsed relative user.self sys.self user.child sys.child
#> 3 ful_vec          100    0.42    1.000      0.42     0.00         NA        NA
#> 1 nul_vec          100   13.22   31.476     12.48     0.03         NA        NA
#> 2 uni_vec          100    0.68    1.619      0.67     0.00         NA        NA
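If rbenchmark is not installed, a rougher comparison can be sketched with base R's system.time(). The helper function names below are hypothetical, and absolute timings will vary by machine and R version, so no expected timings are shown.

```r
n <- 10000

# Hypothetical helper: grow from NULL with c().
grow_with_c <- function(n) {
  x <- NULL
  for (i in seq_len(n)) x <- c(x, i)
  x
}

# Hypothetical helper: preallocate, then fill by index.
fill_preallocated <- function(n) {
  z <- integer(n)
  for (i in seq_len(n)) z[[i]] <- i
  z
}

# Both approaches produce the same vector.
stopifnot(identical(grow_with_c(n), fill_preallocated(n)))

# Elapsed time for 10 runs of each approach.
system.time(for (r in 1:10) grow_with_c(n))
system.time(for (r in 1:10) fill_preallocated(n))
```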
