Improve speed purrr:: script

Good Afternoon,
I'm using this code to study rainfall datasets.
I have a list of rainfall records (format: YYYY-MM-DD HH:MM:SS xx.xx ) at irregular time steps and i'm merging it with an empty dataset to create a continuous dataset with records every minute (for many years). After that I look for the maximum at different time steps (5-10-15-20-30-60 minutes) usign purrr:.
Anyway, the code is still too slow (to give an idea, vectors contain more than 5.5 million elements).

Here's the first part of the code, which is also the bottleneck (it takes 95% of the computational time on purrr::: scripts)

How could I improve its speed?

%%%%%%%%%%%%%%%%

library(dplyr)
library(data.table)
library(xts)
library(rio)
library(stats4)
library(MASS)
library(gumbel)
library(ismev)
library(readr)
library(stringi)
library(fExtremes)
library(evd)


start_date <- as.POSIXct(paste0("2010/01/01", "00:00:00"), format= "%Y/%m/%d %H:%M:%S")
end_date <- as.POSIXct(paste0("2010/01/02", "00:00:01"), format= "%Y/%m/%d %H:%M:%S")
tot_date <- seq.POSIXt(start_date, end_date, by = "1 min")
zero_output <-data.frame(tot_date)
head(zero_output)


date <- c("2010-01-01 00:00:00","2010-01-01 01:03:00","2010-01-01 05:15:00","2010-01-01 06:22:00","2010-01-01 12:35:00","2010-01-01 12:36:00","2010-01-01 12:37:00","2010-01-01 15:28:00","2010-01-01 20:00:00","2010-01-01 23:00:00" )
rain <- c(0.2,0.5,0.6,0.4,0.5,1.2,8.5,4.5,12,15)
rio_5min <- data.frame(date,rain)
zero_output$tot_date <- as.character(zero_output$tot_date)
rio_5min$date <- as.character(rio_5min$date)
AD <- left_join(zero_output,rio_5min,by=c("tot_date"="date"))  
AD <- data.frame(AD)
AD$rain[is.na(AD$rain)]<-0


options(digits =3)

fiv <- vector()
ten <- vector()
fift <- vector()
twe <- vector()
midhou <- vector()
hou <- vector()

fiv <- purrr::map_dbl(seq_along(AD$rain),
                      ~{
                        sum(AD$rain[.:(.+4)]) / 5
                      })
fiv[is.na(fiv)]<-0



ten <- purrr::map_dbl(seq_along(AD$rain),
                      ~{
                        sum(AD$rain[.:(.+9)]) / 10
                      })
ten[is.na(ten)]<-0

fift <- purrr::map_dbl(seq_along(AD$rain),
                       ~{
                         sum(AD$rain[.:(.+14)]) / 15
                       })
fift[is.na(fift)]<-0

twe <- purrr::map_dbl(seq_along(AD$rain),
                      ~{
                        sum(AD$rain[.:(.+19)]) / 20
                      })
twe[is.na(twe)]<-0

midhou <- purrr::map_dbl(seq_along(AD$rain),
                         ~{
                           sum(AD$rain[.:(.+29)]) / 30
                         })
midhou[is.na(midhou)]<-0


hou <- purrr::map_dbl(seq_along(AD$rain),
                      ~{
                        sum(AD$rain[.:(.+59)]) / 60
                      })
hou[is.na(hou)]<-0

#fiv <- AD$rain
fiv <- fiv*60
ten <- ten*60
fift <- fift*60
twe <- twe*60
midhou <-midhou*60
hou <-hou*60
....
....
....
%%%%%%%%%%%%%%%%

Thank You,
Andrea

Hi!

To help us help you, could you please prepare a reproducible example (reprex) illustrating your issue? Please have a look at this guide, to see how to create one:


Short Version

You can share your data in a forum friendly way by passing the data to share to the dput() function.
If your data is too large you can use standard methods to reduce it before sending to dput().
When you come to share the dput() text that represents your data, please be sure to format your post with triple backticks on the line before your code begins to format it appropriately.

```
( example_df <- structure(list(Sepal.Length = c(5.1, 4.9, 4.7, 4.6, 5, 5.4, 4.6, 
5, 4.4, 4.9), Sepal.Width = c(3.5, 3, 3.2, 3.1, 3.6, 3.9, 3.4, 
3.4, 2.9, 3.1), Petal.Length = c(1.4, 1.4, 1.3, 1.5, 1.4, 1.7, 
1.4, 1.5, 1.4, 1.5), Petal.Width = c(0.2, 0.2, 0.2, 0.2, 0.2, 
0.4, 0.3, 0.2, 0.2, 0.1), Species = structure(c(1L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, 1L), .Label = c("setosa", "versicolor", "virginica"
), class = "factor")), row.names = c(NA, -10L), class = c("tbl_df", 
"tbl", "data.frame")))
```
1 Like

Thank You for the hint @nirgrahamuk , I've just modified the post adding an example of the dataset :blush:

How do you see us accessing that example? What's the process you intend us to use ?

What about now?
@nirgrahamuk thanks for the suggestions

Hardly, did you try it yourself to see what happens with that?...

Why not dput?

1 Like

Ok, now should be fixed and compute correctly, thanks for your guidelines @nirgrahamuk. I appreciate.

Hello, it doesnt look like your code matches your statements...

it seems like you take a rolling overage ?

I wanted averages over different windows of data by the minute, I would probably first reach for slider package.
I would not have the as.character transformations you did, that convert away from date-times, but rather use the date-time info.


library(slider)
custslide<-function(ev){
  slide_period_dbl(.x = AD$rain,
                   .i = AD$tot_date,
                   .period = "minute",
                   .f = ~mean(.,na.rm=TRUE),
                   .every = ev) * 60
}
(fivx <- custslide(5))

I won't try profiling your issue from a speed/efficiency point of view as there is insufficient data to make it reasonable to begin.

Thanks for the reply @nirgrahamuk, but unfortunately I got this error.

Error: .i has an incorrect type.
x It must inherit from Date, POSIXct, or POSIXlt, not character.

Referring to the function you suggest me to use. I tried to convert the "AD" vector to other format (such as double or numeric) before running your function, but the issue still remains.

How can i fix it? I think you suggestion will work.

Thank You,
Andrea

zero_output$tot_date <- as.character(zero_output$tot_date)
rio_5min$date <- as.character(rio_5min$date)

It's these two, a mistake to make them character. Keep one the posixct date that it is and make the other similar.

1 Like

@nirgrahamuk thank you, now it works, but i still have a question about your solution.

Is this function making the average for different slots with a moving window?
I am asking that, because increasing the number in (fivx <- custslide(5)) to (fivx <- custslide(10)).. and more (so ideally 5-10-20-... minutes widow), the number of elements in my resulting vector decrease, but they should remain the same.
Actually what I'm trying to do is calculating the average at different durations using a moving window (so for every element, it should evaluate the average between that element and those elements after it (so 5 elements in case window of 5 minutes, 10 elements in case of 10 minutes window.... 60 elements in case of 60 minutes and so on), so the resulting vector should always have the same length.

I'm so sorry to bother you, but being an entry level to R, i've so many doubts and just this community can solve me them.

Thank you again,
Andrea

ok, so I extended your 'example data' substantially, so that there is sufficient data, for performance to become slightly visible. I do like slider for the cleaner code I can write with it. but it is less performant than your original attempt. I will show a modification to the slider approach that does the rolling average rather than average by grouped, then I will show an attempt to use furrr package to process data in parallel, which with the example and on my machine is signifantly faster.


start_date <- as.POSIXct(paste0("2010/01/01", "00:00:00"), format= "%Y/%m/%d %H:%M:%S")
end_date <- as.POSIXct(paste0("2011/01/02", "00:00:01"), format= "%Y/%m/%d %H:%M:%S")
tot_date <- seq.POSIXt(start_date, end_date, by = "1 min")
zero_output <-data.frame(tot_date)



date <- as.POSIXct(c("2010-01-01 00:00:00","2010-01-01 01:03:00","2010-01-01 05:15:00","2010-01-01 06:22:00","2010-01-01 12:35:00","2010-01-01 12:36:00","2010-01-01 12:37:00","2010-01-01 15:28:00","2010-01-01 20:00:00","2010-01-01 23:00:00" ))
rain <- c(0.2,0.5,0.6,0.4,0.5,1.2,8.5,4.5,12,15)
rio_5min <- data.frame(date,rain)

AD <- left_join(zero_output,rio_5min,by=c("tot_date"="date"))  

AD$rain[is.na(AD$rain)]<-0

library(slider)
custslide<-function(ev){
  slide_period_dbl(.x = AD$rain,
                   .i = AD$tot_date,
                   .period = "minute",
                   .f = ~mean(.,na.rm=TRUE),
                   .after = ev-1) * 60
}
fivx <- custslide(5)

### benchmarking original approach
library(microbenchmark)
microbenchmark::microbenchmark(
  {
fiv <- vector()
fiv <- purrr::map_dbl(seq_along(AD$rain),
                      ~{
                        sum(AD$rain[.:(.+4)]) / 5
                      })
fiv[is.na(fiv)]<-0
fiv <- fiv*60
},times=5L)


library(furrr)
plan(multisession, workers = 4)
microbenchmark::microbenchmark(
  {
    fiv <- vector()
    fiv <- furrr::future_map_dbl(seq_along(AD$rain),
                          ~{
                            sum(AD$rain[.:(.+4)]) / 5
                          })
    fiv[is.na(fiv)]<-0
    fiv <- fiv*60
  },times=5L)
1 Like

Thanks a lot @nirgrahamuk, the speed has increased a lot!!
Amazing!!

glad it worked for you :slight_smile:

1 Like

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.