How to update within sapply (to avoid a for loop)

Yarnabrina · January 31, 2019, 3:45am

Hi!

I've observed that the apply family of functions is usually considered better than loops, and indeed there are some articles on this available online.

In one of my college assignments, I need to perform a task for a large number of items, and in between them I need to update a quantity. It takes a long time, so I thought I'd use sapply instead of a for loop. But after I implemented it, I noticed some problems with the output and verified that the update is not occurring.

Though not at all comparable to my assignment, here's an illustration:

## using for loop
x <- 10
y <- c()
for (i in seq_len(length.out = x))
{
  if (x > 5)
  {
    x <- (x - 1)
    y[i] <- x
  } else
  {
    y[i] <- x
  }
}
cat('\nx:', x, '\ny:', y)
#> 
#> x: 5 
#> y: 9 8 7 6 5 5 5 5 5 5

## using sapply
x <- 10
y <- sapply(X = seq_len(length.out = x),
            FUN = function(t)
            {
              if (x > 5)
              {
                x <- (x - 1)
                return(x)
              } else
              {
                return(x)
              }
            })
cat('\nx:', x, '\ny:', y)
#> 
#> x: 10 
#> y: 9 9 9 9 9 9 9 9 9 9

## using sapply with global variables
x <<- 10
y <- sapply(X = seq_len(length.out = x),
            FUN = function(t)
            {
              if (x > 5)
              {
                x <<- (x - 1)
                return(x)
              } else
              {
                return(x)
              }
            })
cat('\nx:', x, '\ny:', y)
#> 
#> x: 5 
#> y: 9 8 7 6 5 5 5 5 5 5

Created on 2019-01-31 by the reprex package (v0.2.1)

I can update by defining global variables, as in the third case above, but I don't want to do that (maybe because I'm not comfortable with it).

Any help will be appreciated.

hughparsonage · January 31, 2019, 9:59am

This is an example where a for loop is actually more appropriate than *apply, because you want the side effect.

for loops used to be slow last century, but the main reason they're avoided now is that a vectorized solution might be available.

Yarnabrina · January 31, 2019, 4:49pm

Thanks for the response!

I understand what you're saying, but is that a strict no? I mean, is there no way to update using the apply family? For my original problem, the for loop works, but it's really slow.

I've come across this many times, but I don't really know what it means. What exactly is a "vectorized code"? I know there's also a function Vectorize, which is supposed to vectorize a function, but I don't know what it does actually.

If sapply is not applicable, is there any alternative "vectorized" code which will be faster than loops to update something?

adam83 · January 31, 2019, 6:07pm

Hi Yarnabrina, I'll try to give a simple answer to your questions.

To be clear, sapply(...) calls the function you pass as FUN once for each element. It's a function within a function. But all variables created or assigned inside a function are local to that call and are discarded when it returns. So your trick of using a global variable is one way out of the dilemma (not the best solution, but it is a solution).
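A minimal sketch of that scoping behaviour, plus one possible way to keep the running value out of the global environment by using a closure. The names `step_local`, `make_step` and `step` are made up for illustration, and the threshold 5 comes from the toy example in the question.

```r
# Plain assignment inside FUN only changes a local copy, so the outer x never moves.
x <- 10
step_local <- function(i) {
  if (x > 5) x <- x - 1  # creates a new local x; discarded when the call returns
  x
}
y_local <- sapply(seq_len(10), step_local)  # every call starts from the outer x = 10

# A closure keeps the state in its own enclosing environment instead of globally.
make_step <- function(start) {
  state <- start
  function(i) {
    if (state > 5) state <<- state - 1  # <<- updates `state` inside make_step's environment
    state
  }
}
step <- make_step(10)
y_closure <- sapply(seq_len(10), step)
```

This still relies on sapply calling `step` one element at a time, in order, so it is a for loop in disguise and should not be expected to run any faster.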

I was thinking about your task and the update routine inside your loop. One solution (if possible) would be to split it into two steps: first calculate your items, then update your quantity. That gives a solution without dependencies between loop iterations (your tasks run in independent steps).

Without dependencies, computing your tasks in parallel should also be possible. It's often not that hard to parallelize a program (e.g. with the package "parallel"), especially for independent tasks.
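As a rough sketch of that idea, assuming the per-item work really is independent: the base `parallel` package can spread an lapply-style call over worker processes. The worker count of 2 and the function `slow_square` are placeholders, not part of the original problem.

```r
library(parallel)

# Hypothetical stand-in for an independent per-item task.
slow_square <- function(i) i^2

cl <- makeCluster(2)                        # start two local worker processes
squares <- parLapply(cl, 1:8, slow_square)  # each item is handled independently
stopCluster(cl)                             # always release the workers

squares <- unlist(squares)
```

For a task this small, the cost of starting the workers outweighs any gain; parallelism only pays off when each item is expensive.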

adam83 · January 31, 2019, 6:18pm

Sometimes an example is better than an explanation...

# Example 1: two nested for loops filling a big multiplication table
# (a rather bad but simple example!)
# There are two problems:
# First, the double loop runs n^2 iterations in interpreted R code
# (you lose time)!
# Second, assigning into res1[i, j] one element at a time also costs time
# (but less than the first point)!
n <- 10000
#run calculation and check system time
system.time({
  res1 <- matrix(NA, n,n)
  for(i in 1:n){
    for(j in 1:n){
      res1[i,j] <- i*j
    }
  }
})
#>    user  system elapsed 
#>   9.765   0.414  10.567

dim(res1)
#> [1] 10000 10000

# Example 2: do it the easy, fast way with vectors
# and matrix multiplication
# run the calculation and check the system time again
system.time({
  y <- 1:n
  res2 <- y %*% t(y)
})
#>    user  system elapsed 
#>   0.352   0.296   0.713

dim(res2)
#> [1] 10000 10000

#check objects
all(res1==res2)
#> [1] TRUE
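The same table can also be built with `outer()`, which may read more directly than the matrix product. This is only a sketch with a small `n` so that it runs instantly; `res_loop` and `res_outer` are illustrative names.

```r
n <- 5

# Loop version, with the result matrix allocated up front.
res_loop <- matrix(NA_integer_, n, n)
for (i in seq_len(n)) {
  for (j in seq_len(n)) {
    res_loop[i, j] <- i * j
  }
}

# Vectorized version: outer() applies "*" to every pair (i, j) in one call.
res_outer <- outer(seq_len(n), seq_len(n))

all(res_loop == res_outer)
```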

Yarnabrina · February 1, 2019, 3:22am

Thanks for the response.

I'm not sure that I follow what you are saying. Do you want me to try two sapply calls for the two steps? Can you clarify a bit further, perhaps with an example for my toy problem?

As for your example on the "vectorized" part, I know that it's fast to use them and I try to do so myself. But what I asked is this: when someone uses Vectorize on a function, what happens in the background that affects its performance? The new vectorized function surely (I'm guessing, though) calls the old non-vectorized function, and hence I don't understand why there would be any effect on performance.

hughparsonage · February 1, 2019, 4:16pm

> I understand what you're saying, but is that a strict no? I mean, is there no way to update using the apply family? For my original problem, the for loop works, but it's really slow.

The for loop will be faster and clearer than using apply in this case. To simplify, the only difference between a for loop and the apply method is that the apply method has no side-effects. Using sapply is like trying to soak a towel by putting it in a waterproof bag, putting it in the sink, then punching holes in the bag to get the water in. Yes, it's possible, but it's not the right way.
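If a functional style is still wanted for a running update, base R's `Reduce()` with `accumulate = TRUE` is designed to carry a value from one step to the next. The sketch below applies it to the toy example from the question; `step`, `path` and `x_final` are illustrative names, and this is not claimed to be faster than the for loop.

```r
# step() takes the current value and the loop index, and returns the next value.
step <- function(x, i) if (x > 5) x - 1 else x

# accumulate = TRUE keeps every intermediate value, starting with init itself.
path <- Reduce(step, seq_len(10), init = 10, accumulate = TRUE)

y <- unlist(path)[-1]    # drop the initial value to match the loop's y
x_final <- y[length(y)]  # the value x ends up with
```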

> I've come across this many times, but I don't really know what it means. What exactly is a "vectorized code"? I know there's also a function Vectorize, which is supposed to vectorize a function, but I don't know what it does actually.

A function is vectorized if it returns the same result when applied to a vector as it would if you applied it to each element of that vector and then combined the pieces into a single result. So f is vectorized if f(c(1, 2, 3)) gives the same result as c(f(1), f(2), f(3)).
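A small sketch of that definition, using made-up functions `f`, `g` and `g_vec`: arithmetic is vectorized out of the box, while a function built on `if` only handles a single value and needs something like `ifelse()` to work on a whole vector.

```r
# Arithmetic operators work element-wise, so f is vectorized.
f <- function(v) v^2 + 1
identical(f(c(1, 2, 3)), c(f(1), f(2), f(3)))

# `if` expects a single TRUE/FALSE, so g is only meant for one value at a time.
g <- function(v) if (v > 2) "big" else "small"

# ifelse() evaluates the condition element-wise, giving a vectorized equivalent.
g_vec <- function(v) ifelse(v > 2, "big", "small")
identical(g_vec(c(1, 2, 3)), c(g(1), g(2), g(3)))
```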

Unfortunately knowing how to vectorize some code means knowing the language. There is the function Vectorize but it's always the worst option.
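To illustrate why `Vectorize` rarely helps with speed: according to its documentation it returns a wrapper that behaves as if `mapply` had been called, so the original function still runs once per element. A sketch with a hypothetical scalar-only function `g`:

```r
# g only works on a single value because of `if`.
g <- function(v) if (v > 2) "big" else "small"

# Vectorize() wraps g so that it accepts a vector ...
g_wrapped <- Vectorize(g)
res_wrapped <- g_wrapped(c(1, 2, 3))

# ... but the work done is the same as an explicit element-by-element call.
res_vapply <- vapply(c(1, 2, 3), g, character(1))
```

The wrapper changes the interface, not the amount of work, which is why rewriting the body with genuinely vectorized operations is usually the better route.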

Without knowing your exact problem, I don't know how to vectorize your code.

Yarnabrina · February 1, 2019, 5:55pm

Thanks for the response.

I believe if I give my actual problem here, that will be against the standing homework policy of this community. So, let me skip that primarily for that reason. Secondary reason follows.

If this is what vectorized code is, then I'm pretty sure that it won't be possible to vectorize my code. My problem specifically depends on the order of the arguments, and the process itself is modified accordingly.

Still, out of curiosity, can you please show me how to vectorize my toy example? It's always good to learn something new.

adam83 · February 2, 2019, 12:21pm

Sorry that I didn't make it clear enough last time. I had been thinking about a redesign of your original code to make it faster, or in other words, about finding out why your loop performance is so bad.

The fact is that removing loops will almost always make your code run faster, and may make it simpler. Wherever you find yourself writing a for loop in R, stop and try to rethink. If you really do need a loop, keep as much as possible outside of it, e.g. the easy stuff. Build your empty objects (vectors, matrices etc.) beforehand: rather than growing an object with each iteration, allocate an empty object of the correct size at the beginning and fill it up using fast subscripting.

Just using an apply function does not necessarily improve your performance, but well-planned functionals are quite effective. You are dealing with a rather special case, or as Hugh mentioned before, 'you want the side effect' of the loop. There are performance techniques in R that everyone can use and learn (me too)... Below you will find some interesting links, try them out!

Best regards
Adam
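The preallocation advice above can be sketched as follows. The task (squaring the index) and the names `grown`, `prealloc` and `vec` are placeholders; all three approaches produce the same vector, and the difference lies only in how the result is built.

```r
n <- 1e4

# Growing: the vector is extended on every iteration.
grown <- c()
for (i in seq_len(n)) grown[i] <- i^2

# Preallocating: the vector has its final size before the loop starts.
prealloc <- numeric(n)
for (i in seq_len(n)) prealloc[i] <- i^2

# Vectorized: no explicit loop at all.
vec <- seq_len(n)^2

identical(grown, prealloc) && identical(prealloc, vec)
```

Wrapping each variant in `system.time()` is a simple way to compare them on your own machine.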

Yarnabrina · February 4, 2019, 2:43pm

Thanks for the nice references. I'll certainly go through these.

I'm marking this thread as solved and choosing this as the solution. But if someone can show me one example of vectorizing a loop, like my toy example in the question, that'll be much appreciated.

Thanks once again.
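For the toy example in the question, one possible vectorized version is sketched here. It works only because that particular update has a closed form: x drops by 1 per step until it reaches 5 and then stays there. It assumes x starts as a whole number; `x0` and `floor_value` are illustrative names.

```r
x0 <- 10
n <- x0  # the loop in the question runs x0 times

# If x0 is already at or below 5, the loop never changes x.
floor_value <- min(x0, 5)

# Value of x after step i is x0 - i, but never below the floor.
y <- pmax(x0 - seq_len(n), floor_value)
x <- y[length(y)]

cat('\nx:', x, '\ny:', y)
```

When each step depends on earlier results in a way that cannot be written down in advance like this, the for loop remains the natural tool.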
