Improve speed purrr:: script

AndrePaind · March 5, 2021, 5:11pm

Good Afternoon,
I'm using this code to study rainfall datasets.
I have a list of rainfall records (format: YYYY-MM-DD HH:MM:SS xx.xx ) at irregular time steps and i'm merging it with an empty dataset to create a continuous dataset with records every minute (for many years). After that I look for the maximum at different time steps (5-10-15-20-30-60 minutes) usign purrr:.
Anyway, the code is still too slow (to give an idea, vectors contain more than 5.5 million elements).

Here's the first part of the code, which is also the bottleneck (it takes 95% of the computational time on purrr::: scripts)

How could I improve its speed?

%%%%%%%%%%%%%%%%

library(dplyr)
library(data.table)
library(xts)
library(rio)
library(stats4)
library(MASS)
library(gumbel)
library(ismev)
library(readr)
library(stringi)
library(fExtremes)
library(evd)


start_date <- as.POSIXct(paste0("2010/01/01", "00:00:00"), format= "%Y/%m/%d %H:%M:%S")
end_date <- as.POSIXct(paste0("2010/01/02", "00:00:01"), format= "%Y/%m/%d %H:%M:%S")
tot_date <- seq.POSIXt(start_date, end_date, by = "1 min")
zero_output <-data.frame(tot_date)
head(zero_output)


date <- c("2010-01-01 00:00:00","2010-01-01 01:03:00","2010-01-01 05:15:00","2010-01-01 06:22:00","2010-01-01 12:35:00","2010-01-01 12:36:00","2010-01-01 12:37:00","2010-01-01 15:28:00","2010-01-01 20:00:00","2010-01-01 23:00:00" )
rain <- c(0.2,0.5,0.6,0.4,0.5,1.2,8.5,4.5,12,15)
rio_5min <- data.frame(date,rain)
zero_output$tot_date <- as.character(zero_output$tot_date)
rio_5min$date <- as.character(rio_5min$date)
AD <- left_join(zero_output,rio_5min,by=c("tot_date"="date"))  
AD <- data.frame(AD)
AD$rain[is.na(AD$rain)]<-0


options(digits =3)

fiv <- vector()
ten <- vector()
fift <- vector()
twe <- vector()
midhou <- vector()
hou <- vector()

fiv <- purrr::map_dbl(seq_along(AD$rain),
                      ~{
                        sum(AD$rain[.:(.+4)]) / 5
                      })
fiv[is.na(fiv)]<-0



ten <- purrr::map_dbl(seq_along(AD$rain),
                      ~{
                        sum(AD$rain[.:(.+9)]) / 10
                      })
ten[is.na(ten)]<-0

fift <- purrr::map_dbl(seq_along(AD$rain),
                       ~{
                         sum(AD$rain[.:(.+14)]) / 15
                       })
fift[is.na(fift)]<-0

twe <- purrr::map_dbl(seq_along(AD$rain),
                      ~{
                        sum(AD$rain[.:(.+19)]) / 20
                      })
twe[is.na(twe)]<-0

midhou <- purrr::map_dbl(seq_along(AD$rain),
                         ~{
                           sum(AD$rain[.:(.+29)]) / 30
                         })
midhou[is.na(midhou)]<-0


hou <- purrr::map_dbl(seq_along(AD$rain),
                      ~{
                        sum(AD$rain[.:(.+59)]) / 60
                      })
hou[is.na(hou)]<-0

#fiv <- AD$rain
fiv <- fiv*60
ten <- ten*60
fift <- fift*60
twe <- twe*60
midhou <-midhou*60
hou <-hou*60
....
....
....
%%%%%%%%%%%%%%%%

Thank You,
Andrea

nirgrahamuk · March 5, 2021, 5:30pm

Hi!

To help us help you, could you please prepare a reproducible example (reprex) illustrating your issue? Please have a look at this guide, to see how to create one:

FAQ: How to do a minimal reproducible example ( reprex ) for beginners Guides & FAQs

A minimal reproducible example consists of the following items: A minimal dataset, necessary to reproduce the issue The minimal runnable code necessary to reproduce the issue, which can be run on the given dataset, and including the necessary information on the used packages. Let's quickly go over each one of these with examples: Minimal Dataset (Sample Data) You need to provide a data frame that is small enough to be (reasonably) pasted on a post, but big enough to reproduce your issue. Let's say, as an example, that you are working with the iris data frame head(iris) #> Sepal.Length Sepal.Width Petal.Length Petal.Width Species #> 1 5.1 3.5 1.4 0.…

Short Version

You can share your data in a forum friendly way by passing the data to share to the dput() function.
If your data is too large you can use standard methods to reduce it before sending to dput().
When you come to share the dput() text that represents your data, please be sure to format your post with triple backticks on the line before your code begins to format it appropriately.

```
( example_df <- structure(list(Sepal.Length = c(5.1, 4.9, 4.7, 4.6, 5, 5.4, 4.6, 
5, 4.4, 4.9), Sepal.Width = c(3.5, 3, 3.2, 3.1, 3.6, 3.9, 3.4, 
3.4, 2.9, 3.1), Petal.Length = c(1.4, 1.4, 1.3, 1.5, 1.4, 1.7, 
1.4, 1.5, 1.4, 1.5), Petal.Width = c(0.2, 0.2, 0.2, 0.2, 0.2, 
0.4, 0.3, 0.2, 0.2, 0.1), Species = structure(c(1L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, 1L), .Label = c("setosa", "versicolor", "virginica"
), class = "factor")), row.names = c(NA, -10L), class = c("tbl_df", 
"tbl", "data.frame")))
```

AndrePaind · March 5, 2021, 6:01pm

Thank You for the hint @nirgrahamuk , I've just modified the post adding an example of the dataset

nirgrahamuk · March 5, 2021, 6:33pm

How do you see us accessing that example? What's the process you intend us to use ?

AndrePaind · March 5, 2021, 6:46pm

What about now?
@nirgrahamuk thanks for the suggestions

nirgrahamuk · March 5, 2021, 6:54pm

Hardly, did you try it yourself to see what happens with that?...

Why not dput?

AndrePaind · March 8, 2021, 10:29am

Ok, now should be fixed and compute correctly, thanks for your guidelines @nirgrahamuk. I appreciate.

nirgrahamuk · March 8, 2021, 12:49pm

Hello, it doesnt look like your code matches your statements...

it seems like you take a rolling overage ?

I wanted averages over different windows of data by the minute, I would probably first reach for slider package.
I would not have the as.character transformations you did, that convert away from date-times, but rather use the date-time info.


library(slider)
custslide<-function(ev){
  slide_period_dbl(.x = AD$rain,
                   .i = AD$tot_date,
                   .period = "minute",
                   .f = ~mean(.,na.rm=TRUE),
                   .every = ev) * 60
}
(fivx <- custslide(5))

I won't try profiling your issue from a speed/efficiency point of view as there is insufficient data to make it reasonable to begin.

AndrePaind · March 8, 2021, 4:41pm

Thanks for the reply @nirgrahamuk, but unfortunately I got this error.

Error: .i has an incorrect type.
x It must inherit from Date, POSIXct, or POSIXlt, not character.

Referring to the function you suggest me to use. I tried to convert the "AD" vector to other format (such as double or numeric) before running your function, but the issue still remains.

How can i fix it? I think you suggestion will work.

Thank You,
Andrea

nirgrahamuk · March 8, 2021, 8:07pm

zero_output$tot_date <- as.character(zero_output$tot_date)
rio_5min$date <- as.character(rio_5min$date)

It's these two, a mistake to make them character. Keep one the posixct date that it is and make the other similar.

AndrePaind · March 9, 2021, 11:44am

@nirgrahamuk thank you, now it works, but i still have a question about your solution.

Is this function making the average for different slots with a moving window?
I am asking that, because increasing the number in (fivx <- custslide(5)) to (fivx <- custslide(10)).. and more (so ideally 5-10-20-... minutes widow), the number of elements in my resulting vector decrease, but they should remain the same.
Actually what I'm trying to do is calculating the average at different durations using a moving window (so for every element, it should evaluate the average between that element and those elements after it (so 5 elements in case window of 5 minutes, 10 elements in case of 10 minutes window.... 60 elements in case of 60 minutes and so on), so the resulting vector should always have the same length.

I'm so sorry to bother you, but being an entry level to R, i've so many doubts and just this community can solve me them.

Thank you again,
Andrea

nirgrahamuk · March 9, 2021, 5:22pm

ok, so I extended your 'example data' substantially, so that there is sufficient data, for performance to become slightly visible. I do like slider for the cleaner code I can write with it. but it is less performant than your original attempt. I will show a modification to the slider approach that does the rolling average rather than average by grouped, then I will show an attempt to use furrr package to process data in parallel, which with the example and on my machine is signifantly faster.


start_date <- as.POSIXct(paste0("2010/01/01", "00:00:00"), format= "%Y/%m/%d %H:%M:%S")
end_date <- as.POSIXct(paste0("2011/01/02", "00:00:01"), format= "%Y/%m/%d %H:%M:%S")
tot_date <- seq.POSIXt(start_date, end_date, by = "1 min")
zero_output <-data.frame(tot_date)



date <- as.POSIXct(c("2010-01-01 00:00:00","2010-01-01 01:03:00","2010-01-01 05:15:00","2010-01-01 06:22:00","2010-01-01 12:35:00","2010-01-01 12:36:00","2010-01-01 12:37:00","2010-01-01 15:28:00","2010-01-01 20:00:00","2010-01-01 23:00:00" ))
rain <- c(0.2,0.5,0.6,0.4,0.5,1.2,8.5,4.5,12,15)
rio_5min <- data.frame(date,rain)

AD <- left_join(zero_output,rio_5min,by=c("tot_date"="date"))  

AD$rain[is.na(AD$rain)]<-0

library(slider)
custslide<-function(ev){
  slide_period_dbl(.x = AD$rain,
                   .i = AD$tot_date,
                   .period = "minute",
                   .f = ~mean(.,na.rm=TRUE),
                   .after = ev-1) * 60
}
fivx <- custslide(5)

### benchmarking original approach
library(microbenchmark)
microbenchmark::microbenchmark(
  {
fiv <- vector()
fiv <- purrr::map_dbl(seq_along(AD$rain),
                      ~{
                        sum(AD$rain[.:(.+4)]) / 5
                      })
fiv[is.na(fiv)]<-0
fiv <- fiv*60
},times=5L)


library(furrr)
plan(multisession, workers = 4)
microbenchmark::microbenchmark(
  {
    fiv <- vector()
    fiv <- furrr::future_map_dbl(seq_along(AD$rain),
                          ~{
                            sum(AD$rain[.:(.+4)]) / 5
                          })
    fiv[is.na(fiv)]<-0
    fiv <- fiv*60
  },times=5L)

AndrePaind · March 11, 2021, 10:20am

Thanks a lot @nirgrahamuk, the speed has increased a lot!!
Amazing!!

nirgrahamuk · March 11, 2021, 10:23am

glad it worked for you

system · March 18, 2021, 10:23am

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.