I have the following code and want to rewrite it with the "apply" family of functions to reduce the run time. Is there anything else I could change to speed it up?
ord_db is a tibble with several million rows.
for (i in seq_len(nrow(ord_db))) {
  n <- ord_db$Date[i]
  if (n == 43465) {
    z <- 0
  } else if (n %in% c(43466, 43831, 44196)) {
    z <- ord_db$Value[i]
  } else {
    z <- ord_db$Value[i] - ord_db$Prev_Date_Value[i]
  }
  ord_db$Value_change[i] <- z
}
Here is an approach using the tidyverse that doesn't involve for loops or apply-family functions:
library(tidyverse)
#use built in population dataset as example
(small_df <- head(population,10))
(new_df <- mutate(small_df,
  z = case_when(
    year == 1999 ~ 0,
    year %in% c(1995, 1996) ~ as.numeric(population), # needs to be double, not integer, to match the rest
    TRUE ~ sqrt(population / 100)
  )
))
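Applied to the columns from the question, the same pattern might look like the sketch below. The small ord_db here is made-up data for illustration; only the column names and the date codes come from the original loop.

```r
library(dplyr)

# Made-up data; only the column names come from the question
ord_db <- tibble(
  Date            = c(43465, 43466, 43467, 43831, 44000),
  Value           = c(10, 12, 15, 20, 26),
  Prev_Date_Value = c(9, 10, 12, 18, 20)
)

# One vectorised pass over the columns instead of a row-by-row loop
ord_db <- mutate(ord_db,
  Value_change = case_when(
    Date == 43465                    ~ 0,
    Date %in% c(43466, 43831, 44196) ~ Value,
    TRUE                             ~ Value - Prev_Date_Value
  )
)
ord_db
```

The conditions are checked in order, so a row matches the first branch that is true, just like the if / else if chain in the loop.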
Thanks. I wanted to use "sapply" because I read that it is faster and more efficient. Do you know whether your suggested method is faster than "sapply"? Could you write this with "sapply"?
I don't think sapply is a good choice here, because your conditions draw on several columns at once. If you need to access them together, it is best to do so with standard base-style vectorisation, which works on whole columns rather than one row at a time.
Here is a comparison
library(tidyverse)
#use built in population dataset as example
(small_df <- head(population,10))
library(microbenchmark)
microbenchmark(
  tidy = {
    (new_df <- mutate(small_df,
      z = case_when(
        year == 1999 ~ 0,
        year %in% c(1995, 1996) ~ as.numeric(population), # needs to be double, not integer, to match the rest
        TRUE ~ sqrt(population / 100)
      )
    ))
  },
  base = {
    new_df2 <- small_df
    # same conditions as the tidy version, written as nested ifelse()
    new_df2$z <- ifelse(new_df2$year == 1999, 0,
      ifelse(new_df2$year %in% c(1995, 1996), new_df2$population,
        sqrt(new_df2$population / 100)))
    new_df2
  }
)
Unit: microseconds
 expr     min       lq     mean   median       uq     max neval
 tidy 185.100 192.8005 221.0891 200.8005 214.2520 483.302   100
 base  92.801 103.9010 122.0940 115.8015 120.4515 334.702   100
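For completeness, this is roughly what the sapply version asked about would look like on the columns from the question, next to the vectorised equivalent. It is only a sketch: the small ord_db is made-up data, and the names by_sapply and by_vector are just for illustration.

```r
# Made-up data; only the column names come from the question
ord_db <- data.frame(
  Date            = c(43465, 43466, 43467, 43831, 44000),
  Value           = c(10, 12, 15, 20, 26),
  Prev_Date_Value = c(9, 10, 12, 18, 20)
)

# sapply version: still one R-level function call per row
by_sapply <- sapply(seq_len(nrow(ord_db)), function(i) {
  d <- ord_db$Date[i]
  if (d == 43465) {
    0
  } else if (d %in% c(43466, 43831, 44196)) {
    ord_db$Value[i]
  } else {
    ord_db$Value[i] - ord_db$Prev_Date_Value[i]
  }
})

# Vectorised version: each comparison runs once over the whole column
by_vector <- ifelse(ord_db$Date == 43465, 0,
             ifelse(ord_db$Date %in% c(43466, 43831, 44196),
                    ord_db$Value,
                    ord_db$Value - ord_db$Prev_Date_Value))

# Both approaches give the same Value_change values
stopifnot(isTRUE(all.equal(by_sapply, by_vector)))
```

Note that sapply here is essentially the original loop in a different wrapper, since it still has to index into three columns row by row. On several million rows I would expect the vectorised form to be the faster of the two, but it is worth running microbenchmark on a sample of your own data to confirm.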