How to speed up this loop in R

Hi. I am a beginner in the RStudio, a person who is not programmer but start timidly/shyly to be interested of environment/program R and I've just started gaining experience so please forgive me mine lack of knowledge, mistakes that can sneak in when writing a post :blush: but to the point.

For the table in the attachment, I want to execute the code below.

for(r in 2:k){
 
  if(data[r,2]==data[r-1,2]){
    data[r,6]<-data[r,6]+data[r-1,6]}
  else{data[r,6]<-data[r,6]}

}

My "k" is 1 million (rows), but I can split dataset into the several smaller datasets if necessary.
I want to overwrite the 6th column in each row: in the 6th column, I want to count how many rows above the currently analyzed row contain the same value 'user_id'.
My code is ineffective for 100k. rows, I have to do this task for several tables with 1 million rows (16 million in total).

If I understand the code correctly, you want to replace the sixth column with either the cumulative sum of that column or with a series of row numbers. The two results are the same if the original values of the sixth column are always 1. In either case, you want to do the calculation within each user_id. Rather than using a for loop, I would use the row_number() or cumsum() functions to do that. The group_by() function of dplyr can be used to keep the calculation within each value of user_id.
The code below shows how to do both calculations.

DF <- data.frame(user_id = c(rep(10001, 5), rep(10002, 6)),
                 I_poprzednich = 1)
DF
#>    user_id I_poprzednich
#> 1    10001             1
#> 2    10001             1
#> 3    10001             1
#> 4    10001             1
#> 5    10001             1
#> 6    10002             1
#> 7    10002             1
#> 8    10002             1
#> 9    10002             1
#> 10   10002             1
#> 11   10002             1
library(dplyr)
DF <- DF |> group_by(user_id) |> mutate(I_poprzednich = row_number())
DF
#> # A tibble: 11 x 2
#> # Groups:   user_id [2]
#>    user_id I_poprzednich
#>      <dbl>         <int>
#>  1   10001             1
#>  2   10001             2
#>  3   10001             3
#>  4   10001             4
#>  5   10001             5
#>  6   10002             1
#>  7   10002             2
#>  8   10002             3
#>  9   10002             4
#> 10   10002             5
#> 11   10002             6

DF2 <- data.frame(user_id = c(rep(10001, 5), rep(10002, 6)),
                 I_poprzednich = 1)
DF2
#>    user_id I_poprzednich
#> 1    10001             1
#> 2    10001             1
#> 3    10001             1
#> 4    10001             1
#> 5    10001             1
#> 6    10002             1
#> 7    10002             1
#> 8    10002             1
#> 9    10002             1
#> 10   10002             1
#> 11   10002             1
DF2 <- DF2 |> group_by(user_id) |> mutate(I_poprzednich = cumsum(I_poprzednich))
DF2
#> # A tibble: 11 x 2
#> # Groups:   user_id [2]
#>    user_id I_poprzednich
#>      <dbl>         <dbl>
#>  1   10001             1
#>  2   10001             2
#>  3   10001             3
#>  4   10001             4
#>  5   10001             5
#>  6   10002             1
#>  7   10002             2
#>  8   10002             3
#>  9   10002             4
#> 10   10002             5
#> 11   10002             6

Created on 2022-02-16 by the reprex package (v2.0.1)

1 Like

Thanks a lot, that's exactly what I need!

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.