I'm trying to calculate some historic averages on a reasonably large data frame (size 50,000r x 150c) and to do so, I first need to subset the data according to two criteria then apply some arithmetic function to one of the subsetted columns (e.g. calculate an average between current date and date-3 months). I then write the result to another data frame as a new column of counts / averages / weighted averages over certain periods. Due to the size of the datasets (and the limitations of my coding skill) this is taking a long time (loop after loop after loop...) and my question is, is there a better way to approach this kind of problem in R?
Here's a simple example to illustrate what I'm trying to do:
library(lubridate)
### Create dataframe Df
date <- c("01/01/2020", "02/01/2020", "02/01/2020","02/01/2020", "03/01/2020",
"03/01/2020", "03/01/2020", "03/01/2020", "04/01/2020", "04/01/2020")
date <- dmy(date)
name <- c("john", "paul", "john", "peter", "peter",
"john", "andrew", "john", "peter", "peter")
visits <- c(1, 3, 2, 1, 3,
4, 6, 1 ,1, 9)
Df <- data.frame(date, name, visits)
Df
### Create dataframe Df1
date1 <- c("01/01/2020", "02/01/2020", "03/01/2020", "04/01/2020")
date1 <- dmy(date1)
name1 <- c("john", "paul", "andrew", "peter")
totvisits <- c(0, 0, 0, 0)
Df1 <- data.frame(date1, name1, totvisits)
Df1
Df$name <- as.character(Df$name)
Df1$name1 <- as.character(Df1$name1)
In this example I want to (for each row name1/date1 pair in Df1) subset Df according to date == date1 / name==name1 and return the number of visits each 'name1' has made prior to each 'date1' value ie by summing the 'visits' column for date < date1. I then want to save this value to the relevant row of a new column ('Df1$totvisits' in this example). As far as I can tell this requires some kind of loop which on bigger datasets is clunky and takes ages. I've tried mapply() too but that's no quicker and doesn't get away from the fact that my solution isn't very elegant.
### loop
for (i in 1:dim(Df1)[1]) {
Df1[i, 3] <- sum(subset(Df, Df$name == Df1$name1[i] & Df$date <= Df1$date1[i])[,3])
}
Df1
### apply()
f <- function(x, y) {
sum(subset(Df, (Df$name == x) & (Df$date <= y))[,3])
}
Df1[, 3] <- mapply(f, x = Df1$name1, y = Df1$date1)
Df1
To make the above clearer, what I'm trying to do is to add a new column to Df1 ('Df1$totvisits') where each entry is the result of looking up date1/name1 in the other data frame ('Df), and returning the sum of the visits that occurred before 'date1' for each 'name1'. For example, in the "2020-01-02 paul" row in Df1, I need to get values of visits from Df where 'name == paul' and 'date <= 02/01/2020', and then put the sum of them in the third column, second row of Df1. Since there is only one instance of paul in Df with date <= "2020-01-02", this entry becomes = 3.
I would like to be able to extend this to looking up value from a range of dates e.g. dates between x and y (specified by 2 columns of Df1 such as date1/date2) where person z has made a visit. I'd like to do this so that I can calculate average visits in per day over say a rolling 3m window.
In a dplyr context, I think what I'm trying to do is 'mutate' Df1 with a new column that contains an expression that returns values of sums of Df$visits before (/ between date1/date2 in my more advanced case) each date. It's just I cant seem to get it to work and it's driving me nuts!
Thanks in advance for any help with this.