Sum of values from previous few years

Hello!

I am trying to add values from last 10 years in a separate column.

Original data starts from 2012 - 2021 and I am interested in creating a column that has sum of Sales from previous 10 years. So, for values corresponding to year 2021, we need to add values from 2011 - 2020 and because we don't have value for 2011, we can consider value from 2012. Similarly, for values corresponding to year 2020, we need to add values from 2010 - 2019 and because we don't have values for 2011 & 2010, we can consider value from 2012 for both 2011 & 2010 and so on.

For this example, we can add values from previous 3 years Sales and fill in the empty values with the earliest years as above. So, for 2021, the new value in new column should be sum of values from 2020, 2019 & 2018. For 2020, the new value should be sum of values from 2019 & 2018 & 2018. For 2019, the new value should be sum of values from 2018, 2018 & 2018. The same goes for 2018 as we don't have any previous values.

Any other logic to fill in the gaps for years missing is very much welcome. The goal here is to be able to collect last 10 years' worth sales to be used for some other calculations.

I am not able to figure this out. Any help would be appreciated.

library(tidyverse)
library(lubridate)

vol <- data.frame(
  Date = c("2018 Jan",
            "2018 Feb","2018 Mar","2018 Apr","2018 May","2018 Jun",
            "2018 Jul","2018 Aug","2018 Sep","2018 Oct","2018 Nov",
            "2018 Dec","2019 Jan",
            "2019 Feb","2019 Mar","2019 Apr","2019 May","2019 Jun",
            "2019 Jul","2019 Aug","2019 Sep","2019 Oct","2019 Nov",
            "2019 Dec","2020 Jan",
            "2020 Feb","2020 Mar","2020 Apr","2020 May","2020 Jun",
            "2020 Jul","2020 Aug","2020 Sep","2020 Oct","2020 Nov",
            "2020 Dec", "2021 Jan",
            "2021 Feb","2021 Mar","2021 Apr","2021 May","2021 Jun",
            "2021 Jul","2021 Aug","2021 Sep","2021 Oct","2021 Nov",
            "2021 Dec"),
  Country = c("CA","CA","CA","CA","CA","CA","US","US","US","US","US","US",
          "CA","CA","CA","CA","CA","CA","US","US","US","US","US","US",
            "CA","CA","CA","CA","CA","CA","US","US","US","US","US","US",
            "CA","CA","CA","CA","CA","CA","US","US","US","US","US","US")
)
  Type = c("A", "B", "C", "D", "A", "B", "C", "D","A", "B", "C", "D",
"A", "B", "C", "D", "A", "B", "C", "D","A", "B", "C", "D",
"A", "B", "C", "D", "A", "B", "C", "D","A", "B", "C", "D",
"A", "B", "C", "D", "A", "B", "C", "D","A", "B", "C", "D"
)
  Sales = c(100,110, 120, 130, 140, 150, 160, 170, 180, 190, 200, 210,
            220, 230,240, 250, 260, 270, 280, 290, 300, 310, 320, 330,
            340, 350, 360, 370, 380, 390, 400, 410, 420, 430, 440, 450,
            460, 470, 480, 490, 500, 510, 520, 530, 540, 550, 560, 570
)

vol$Date <- ym(vol$Date)

Thank you!

Please have a look on Replace a value NA with the value from another column in R - Stack Overflow

Regards,
Grzegorz

Thanks @gsapijaszko!
The link shows how to replace NA values. But I am struggling with replacement while performing addition.

Hopefully, this is doable.

Thanks!

Hi,

If I understand correctly, this should be what you are looking for right?

library(tidyverse)
library(lubridate)

vol <- data.frame(
  Date = c("2018 Jan",
           "2018 Feb","2018 Mar","2018 Apr","2018 May","2018 Jun",
           "2018 Jul","2018 Aug","2018 Sep","2018 Oct","2018 Nov",
           "2018 Dec","2019 Jan",
           "2019 Feb","2019 Mar","2019 Apr","2019 May","2019 Jun",
           "2019 Jul","2019 Aug","2019 Sep","2019 Oct","2019 Nov",
           "2019 Dec","2020 Jan",
           "2020 Feb","2020 Mar","2020 Apr","2020 May","2020 Jun",
           "2020 Jul","2020 Aug","2020 Sep","2020 Oct","2020 Nov",
           "2020 Dec", "2021 Jan",
           "2021 Feb","2021 Mar","2021 Apr","2021 May","2021 Jun",
           "2021 Jul","2021 Aug","2021 Sep","2021 Oct","2021 Nov",
           "2021 Dec"),
  
  Country = c("CA","CA","CA","CA","CA","CA","US","US","US","US","US","US",
              "CA","CA","CA","CA","CA","CA","US","US","US","US","US","US",
              "CA","CA","CA","CA","CA","CA","US","US","US","US","US","US",
              "CA","CA","CA","CA","CA","CA","US","US","US","US","US","US"),

Type = c("A", "B", "C", "D", "A", "B", "C", "D","A", "B", "C", "D",
         "A", "B", "C", "D", "A", "B", "C", "D","A", "B", "C", "D",
         "A", "B", "C", "D", "A", "B", "C", "D","A", "B", "C", "D",
         "A", "B", "C", "D", "A", "B", "C", "D","A", "B", "C", "D"),

Sales = c(100,110, 120, 130, 140, 150, 160, 170, 180, 190, 200, 210,
          220, 230,240, 250, 260, 270, 280, 290, 300, 310, 320, 330,
          340, 350, 360, 370, 380, 390, 400, 410, 420, 430, 440, 450,
          460, 470, 480, 490, 500, 510, 520, 530, 540, 550, 560, 570)
)

vol$Date <- ym(vol$Date)

#Sum for the past 10 years
vol$Sum = sapply(vol$Date, function(x){
  sum(vol$Sales[vol$Date <= x & vol$Date >= x - years(10)])
})

vol %>% head()
#>         Date Country Type Sales Sum
#> 1 2018-01-01      CA    A   100 100
#> 2 2018-02-01      CA    B   110 210
#> 3 2018-03-01      CA    C   120 330
#> 4 2018-04-01      CA    D   130 460
#> 5 2018-05-01      CA    A   140 600
#> 6 2018-06-01      CA    B   150 750

Created on 2022-01-18 by the reprex package (v2.0.1)

Hope this helps,
PJ

Thanks @pieterjanvc!
So, with this approach, we are only able to add what we have the data for. In my situation, we don't have data from 2002 to 2011 and we need to consider sum of last 10 years data for each month of each year from 2012 - 2021 and we should not consider current month. Because we don't have previous years data, we would like to keep the values of earliest year for the previous years. So, we need to consider same values of 2012 for 2007 - 2011 (not realistic), but the best we can assume. Any suggestions are welcome to make assumptions in this scenario.

For simplicity, above sample data is for 3 years and we would like to have similar results with sum of last 3 years. So, For 2018 Jan, we should consider sum of values from 2015 Jan -2017 Dec. But we don't have data for these years. So, we assume values from 2018 Jan- 2018 Dec for all previous years (In this case, 2015 - 2017). Similarly, for 2018 Feb, we would need sum of values from 2015 Feb - 2017 Jan and so on.

Thank you!

HI,

How about this?

library(tidyverse)
library(lubridate)

vol <- data.frame(
  Date = c("2018 Jan",
           "2018 Feb","2018 Mar","2018 Apr","2018 May","2018 Jun",
           "2018 Jul","2018 Aug","2018 Sep","2018 Oct","2018 Nov",
           "2018 Dec","2019 Jan",
           "2019 Feb","2019 Mar","2019 Apr","2019 May","2019 Jun",
           "2019 Jul","2019 Aug","2019 Sep","2019 Oct","2019 Nov",
           "2019 Dec","2020 Jan",
           "2020 Feb","2020 Mar","2020 Apr","2020 May","2020 Jun",
           "2020 Jul","2020 Aug","2020 Sep","2020 Oct","2020 Nov",
           "2020 Dec", "2021 Jan",
           "2021 Feb","2021 Mar","2021 Apr","2021 May","2021 Jun",
           "2021 Jul","2021 Aug","2021 Sep","2021 Oct","2021 Nov",
           "2021 Dec"),
  
  Country = c("CA","CA","CA","CA","CA","CA","US","US","US","US","US","US",
              "CA","CA","CA","CA","CA","CA","US","US","US","US","US","US",
              "CA","CA","CA","CA","CA","CA","US","US","US","US","US","US",
              "CA","CA","CA","CA","CA","CA","US","US","US","US","US","US"),
  
  Type = c("A", "B", "C", "D", "A", "B", "C", "D","A", "B", "C", "D",
           "A", "B", "C", "D", "A", "B", "C", "D","A", "B", "C", "D",
           "A", "B", "C", "D", "A", "B", "C", "D","A", "B", "C", "D",
           "A", "B", "C", "D", "A", "B", "C", "D","A", "B", "C", "D"),
  
  Sales = c(100,110, 120, 130, 140, 150, 160, 170, 180, 190, 200, 210,
            220, 230,240, 250, 260, 270, 280, 290, 300, 310, 320, 330,
            340, 350, 360, 370, 380, 390, 400, 410, 420, 430, 440, 450,
            460, 470, 480, 490, 500, 510, 520, 530, 540, 550, 560, 570)
)

vol$Date <- ym(vol$Date)

#Fill in missing dates
vol = data.frame(
  Date = seq(ymd("2015-1-1"), ymd("2021-12-1"), by = "months")
) %>%  left_join(
  vol, by = "Date"
) %>% mutate(Month = month(Date))

#Fill in missing values
vol = vol %>% group_by(Month) %>% 
  fill(Country, Type, Sales, .direction = "up") %>% 
  ungroup()

#Sum for the past 3 years
vol$Sum = sapply(vol$Date, function(x){
  sum(vol$Sales[vol$Date <= x & vol$Date >= x - years(3)])
})

Created on 2022-01-20 by the reprex package (v2.0.1)

Thanks @pieterjanvc!

Yes, the Sales variable is exactly what I was looking for. Thanks for helping with this!

But the Sum variable doesn't seem to have right values...If we look at last row with 2021-12-01, it gives 14430. But if I try to check it in excel, I think it should be 13860 which should add values from 2018 Dec - 2021 Nov. If however, we add values from 2019 Jan - 2021 Dec while considering Dec value (570), then also it should be 14220. I didn't want to consider current month (Dec 2021 in this case), so in that case, looking for 13860 instead.

If I use mutate(Sum = lag(Sales, 1) + lag(Sales, 2) + lag(Sales, 3)) instead its calculating it right if I wanted to add for just last 3 months. But because we need last 3 years instead, not sure if we can modify it somehow to get the right result.

Thanks again for your help!

HI again,

All you have to do is change the sign in the sum function of the max date from <= to < like so:

#Sum for the past 3 years
vol$Sum = sapply(vol$Date, function(x){
  sum(vol$Sales[vol$Date < x & vol$Date >= x - years(3)])
})

That should provide the requested sum
Hope this helps,
PJ

1 Like

Thank you so much!
This is very helpful @pieterjanvc !

1 Like

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.