How to rollback 12 month for each month and summarise

dplyr

#1

Sample data at the bottom of the page.

My goal is following:

  1. for each month look back last 12 months and filter the data based on it. For an example, if we pick 2017-11-01. Therefore 12 months back from it is 2016-12-01. So filter data from period 2016-12-01 to 2017-11-01
  2. Within that 12 month period, count how many Unique ID are there for each Group
  3. Repeat above for each month in the dataframe i.e go back 12 month for each month in the dataset and get the count of Unique ID for each Group

If successfully executed above points, I can plot line graph with date on x axis and count of Unique ID for each Group on y axis for each month.

My attempt:

foo %>% 
  filter(Date >= "2016-12-01" &  Date <= "2017-11-01") %>% 
  group_by(ID, Group) %>% 
  count() %>% 
  group_by(Group) %>% 
  count

Output:
Min 2
sport 6
Xox 41

Above output is for 2017-11-01. However, my goal is to repeat this process for each month and store it in a dataframe so that I can draw a line plot.

Any help from the community is greatly appreciated.

Thanks

Here is my sample dataframe:

#>         ID       Date Group
#> 1   D_2313 2017-07-01   Xox
#> 2   D_2416 2017-07-01   Xox
#> 3    D_446 2017-02-01   Xox
#> 4   D_3466 2017-07-01   Xox
#> 5   D_1183 2017-01-01   Xox
#> 6   D_3751 2015-12-01   Xox
#> 7     D_76 2017-03-01 sport
#> 8    D_441 2015-10-01   Xox
#> 9   D_1417 2017-02-01   Xox
#> 10  D_2886 2016-12-01   Xox
#> 11  D_1027 2016-03-01   Xox
#> 12  D_1955 2016-05-01   Xox
#> 13  D_1227 2017-04-01   Xox
#> 14   D_371 2016-11-01   Xox
#> 15   D_293 2017-10-01 sport
#> 16  D_2712 2016-12-01   Xox
#> 17  D_1122 2016-06-01   Xox
#> 18   D_839 2015-11-01   Xox
#> 19    D_83 2016-10-01   Xox
#> 20  D_3286 2016-03-01   Xox
#> 21  D_1216 2017-09-01   Xox
#> 22  D_3182 2017-07-01   Xox
#> 23   D_376 2017-09-01   Xox
#> 24   D_946 2016-11-01   Xox
#> 25  D_2585 2017-06-01   Xox
#> 26   D_162 2017-03-01   Xox
#> 27  D_2485 2017-05-01   Min
#> 28  D_1994 2017-05-01 sport
#> 29   D_543 2016-12-01   Xox
#> 30  D_3338 2016-07-01   Xox
#> 31    D_72 2016-10-01   Xox
#> 32   D_840 2016-04-01   Xox
#> 33   D_331 2016-01-01   Xox
#> 34  D_2644 2017-11-01   Xox
#> 35  D_2884 2016-12-01   Xox
#> 36  D_2626 2017-03-01   Xox
#> 37  D_2937 2017-09-01   Xox
#> 38  D_2818 2017-02-01   Xox
#> 39  D_3636 2017-10-01   Xox
#> 40  D_1103 2016-12-01   Xox
#> 41  D_1922 2015-11-01   Xox
#> 42  D_1088 2017-05-01   Xox
#> 43  D_3460 2016-09-01   Xox
#> 44  D_1465 2016-07-01   Xox
#> 45  D_1974 2016-06-01   Xox
#> 46  D_3525 2015-10-01   Xox
#> 47  D_2650 2017-04-01 sport
#> 48  D_2691 2017-02-01   Xox
#> 49  D_3616 2017-03-01   Xox
#> 50  D_1104 2016-10-01   Xox
#> 51  D_1533 2017-10-01   Xox
#> 52  D_3431 2016-06-01   Xox
#> 53  D_3458 2016-09-01   Xox
#> 54  D_1632 2016-10-01   Xox
#> 55   D_687 2017-01-01   Xox
#> 56  D_2560 2016-03-01   Xox
#> 57  D_1545 2016-03-01   Xox
#> 58  D_3073 2016-02-01   Xox
#> 59   D_468 2017-02-01   Xox
#> 60  D_2891 2015-12-01   Xox
#> 61  D_2479 2016-05-01   Xox
#> 62   D_254 2016-11-01   Xox
#> 63  D_2410 2017-10-01   Xox
#> 64  D_3633 2016-09-01   Xox
#> 65  D_2773 2017-05-01   Xox
#> 66   D_749 2016-09-01   Xox
#> 67  D_3259 2017-02-01   Xox
#> 68   D_143 2015-11-01   Xox
#> 69  D_3272 2016-01-01   Xox
#> 70   D_545 2016-05-01   Xox
#> 71  D_1684 2017-09-01   Xox
#> 72   D_341 2017-09-01   Xox
#> 73  D_3351 2016-12-01   Xox
#> 74  D_2127 2016-04-01   Xox
#> 75  D_1364 2016-03-01   Xox
#> 76  D_3500 2017-06-01   Xox
#> 77  D_2290 2016-12-01 sport
#> 78  D_2847 2015-11-01   Xox
#> 79   D_724 2017-09-01   Xox
#> 80  D_2111 2017-02-01   Xox
#> 81  D_2225 2017-06-01   Xox
#> 82   D_720 2016-01-01   Xox
#> 83  D_2709 2016-06-01   Xox
#> 84  D_1648 2015-12-01   Xox
#> 85  D_3251 2016-10-01   Xox
#> 86   D_184 2017-06-01   Xox
#> 87   D_961 2016-07-01   Xox
#> 88  D_2671 2016-08-01   Xox
#> 89  D_3217 2017-05-01   Xox
#> 90  D_1077 2016-04-01   Xox
#> 91  D_1290 2016-05-01 sport
#> 92  D_3637 2017-08-01   Xox
#> 93   D_436 2016-10-01   Xox
#> 94  D_3819 2017-08-01 sport
#> 95    D_25 2015-10-01   Xox
#> 96   D_739 2017-05-01   Min
#> 97  D_3697 2016-08-01   Xox
#> 98  D_1851 2016-11-01   Xox
#> 99    D_50 2016-07-01   Xox
#> 100 D_1286 2016-02-01   Xox

For complete data just in case months not enough to go back 12 months: https://www.dropbox.com/s/xslrswrys0hkl32/foo_complete.csv?dl=0


#2

It would be better is you included a reprex of your code instead of pasting your code into the message. As is the quotes in your example are hard to deal with.

https://cran.r-project.org/web/packages/reprex/index.html


#3

My suggestion is to extract year and month, arrange the dataframe by them, the use stats::filter to do a 12 montha rolling sum.


#4

Inserted sample data using reprex. Thanks @danr for the advise


#5

maybe this package will be able help you:https://business-science.github.io/tibbletime?


#6

I don’t think this is very efficient, but if I do this I will fill every day with NA first and calculate the rolling sum.

reprex::reprex_info()
#> Created by the reprex package v0.1.1.9000 on 2017-12-26

library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library(tidyr)
library(purrr)

foo <- readr::read_table(
"    ID       Date Group
D_2313 2017-07-01   Xox
D_2416 2017-07-01   Xox
 D_446 2017-02-01   Xox
D_3466 2017-07-01   Xox
D_1183 2017-01-01   Xox
D_3751 2015-12-01   Xox
  D_76 2017-03-01 sport
 D_441 2015-10-01   Xox
D_1417 2017-02-01   Xox
D_2886 2016-12-01   Xox
D_1027 2016-03-01   Xox
D_1955 2016-05-01   Xox
D_1227 2017-04-01   Xox
 D_371 2016-11-01   Xox
 D_293 2017-10-01 sport
D_2712 2016-12-01   Xox
D_1122 2016-06-01   Xox
 D_839 2015-11-01   Xox
  D_83 2016-10-01   Xox
D_3286 2016-03-01   Xox
D_1216 2017-09-01   Xox
D_3182 2017-07-01   Xox
 D_376 2017-09-01   Xox
 D_946 2016-11-01   Xox
D_2585 2017-06-01   Xox
 D_162 2017-03-01   Xox
D_2485 2017-05-01   Min
D_1994 2017-05-01 sport
 D_543 2016-12-01   Xox
D_3338 2016-07-01   Xox
  D_72 2016-10-01   Xox
 D_840 2016-04-01   Xox
 D_331 2016-01-01   Xox
D_2644 2017-11-01   Xox
D_2884 2016-12-01   Xox
D_2626 2017-03-01   Xox
D_2937 2017-09-01   Xox
D_2818 2017-02-01   Xox
D_3636 2017-10-01   Xox
D_1103 2016-12-01   Xox
D_1922 2015-11-01   Xox
D_1088 2017-05-01   Xox
D_3460 2016-09-01   Xox
D_1465 2016-07-01   Xox
D_1974 2016-06-01   Xox
D_3525 2015-10-01   Xox
D_2650 2017-04-01 sport
D_2691 2017-02-01   Xox
D_3616 2017-03-01   Xox
D_1104 2016-10-01   Xox
D_1533 2017-10-01   Xox
D_3431 2016-06-01   Xox
D_3458 2016-09-01   Xox
D_1632 2016-10-01   Xox
 D_687 2017-01-01   Xox
D_2560 2016-03-01   Xox
D_1545 2016-03-01   Xox
D_3073 2016-02-01   Xox
 D_468 2017-02-01   Xox
D_2891 2015-12-01   Xox
D_2479 2016-05-01   Xox
 D_254 2016-11-01   Xox
D_2410 2017-10-01   Xox
D_3633 2016-09-01   Xox
D_2773 2017-05-01   Xox
 D_749 2016-09-01   Xox
D_3259 2017-02-01   Xox
 D_143 2015-11-01   Xox
D_3272 2016-01-01   Xox
 D_545 2016-05-01   Xox
D_1684 2017-09-01   Xox
 D_341 2017-09-01   Xox
D_3351 2016-12-01   Xox
D_2127 2016-04-01   Xox
D_1364 2016-03-01   Xox
D_3500 2017-06-01   Xox
D_2290 2016-12-01 sport
D_2847 2015-11-01   Xox
 D_724 2017-09-01   Xox
D_2111 2017-02-01   Xox
D_2225 2017-06-01   Xox
 D_720 2016-01-01   Xox
D_2709 2016-06-01   Xox
D_1648 2015-12-01   Xox
D_3251 2016-10-01   Xox
 D_184 2017-06-01   Xox
 D_961 2016-07-01   Xox
D_2671 2016-08-01   Xox
D_3217 2017-05-01   Xox
D_1077 2016-04-01   Xox
D_1290 2016-05-01 sport
D_3637 2017-08-01   Xox
 D_436 2016-10-01   Xox
D_3819 2017-08-01 sport
  D_25 2015-10-01   Xox
 D_739 2017-05-01   Min
D_3697 2016-08-01   Xox
D_1851 2016-11-01   Xox
  D_50 2016-07-01   Xox
D_1286 2016-02-01   Xox"
)

nested_foo <- foo %>%
  nest(-Group)

calculate_one_group <- . %>%
  # calculate count per day
  count(Date) %>%
  # fill every day with NA
  complete(Date = seq(min(Date) - 365, max(Date), by = "day")) %>%
  # aggregate counts with a 365-day window
  mutate(rolling_n = as.integer(RcppRoll::roll_sumr(n, 365, na.rm = TRUE))) %>%
  # filter out NA days
  filter(!is.na(n))

nested_foo %>%
  mutate(data = map(data, calculate_one_group)) %>%
  unnest(data)
#> # A tibble: 34 x 4
#>    Group Date           n rolling_n
#>    <chr> <date>     <int>     <int>
#>  1 Xox   2015-10-01     3         3
#>  2 Xox   2015-11-01     4         7
#>  3 Xox   2015-12-01     3        10
#>  4 Xox   2016-01-01     3        13
#>  5 Xox   2016-02-01     2        15
#>  6 Xox   2016-03-01     5        20
#>  7 Xox   2016-04-01     3        23
#>  8 Xox   2016-05-01     3        26
#>  9 Xox   2016-06-01     4        30
#> 10 Xox   2016-07-01     4        34
#> # ... with 24 more rows

#7

Is this the output you’re looking for?:

# A tibble: 26 x 4
     Min sport   Xox     period
   <int> <int> <int>     <date>
 1     2     4    50 2017-07-01
 2    NA     2    50 2017-02-01
 3    NA     2    45 2017-01-01
 4    NA    NA    10 2015-12-01
 5    NA     3    53 2017-03-01
...

If so here’s the code:

library(dplyr)
library(purrr)
library(lubridate)
library(tidyr)



unique(foo$Date) %>% 
  map(~foo %>% 
        filter(Date >= (ymd(.x) - dyears(1)) & Date <= .x) %>% 
        group_by(ID, Group) %>% 
        count() %>% 
        group_by(Group) %>% 
        count() %>%
        spread(Group, nn) %>%
        mutate(period = .x)) %>%
  bind_rows()
  


foo <- readr::read_csv(
  "ID,Date,Group
  D_2313,2017-07-01,Xox
  D_2416,2017-07-01,Xox
  D_446,2017-02-01,Xox
  D_3466,2017-07-01,Xox
  D_1183,2017-01-01,Xox
  D_3751,2015-12-01,Xox
  D_76,2017-03-01,sport
  D_441,2015-10-01,Xox
  D_1417,2017-02-01,Xox
  D_2886,2016-12-01,Xox
  D_1027,2016-03-01,Xox
  D_1955,2016-05-01,Xox
  D_1227,2017-04-01,Xox
  D_371,2016-11-01,Xox
  D_293,2017-10-01,sport
  D_2712,2016-12-01,Xox
  D_1122,2016-06-01,Xox
  D_839,2015-11-01,Xox
  D_83,2016-10-01,Xox
  D_3286,2016-03-01,Xox
  D_1216,2017-09-01,Xox
  D_3182,2017-07-01,Xox
  D_376,2017-09-01,Xox
  D_946,2016-11-01,Xox
  D_2585,2017-06-01,Xox
  D_162,2017-03-01,Xox
  D_2485,2017-05-01,Min
  D_1994,2017-05-01,sport
  D_543,2016-12-01,Xox
  D_3338,2016-07-01,Xox
  D_72,2016-10-01,Xox
  D_840,2016-04-01,Xox
  D_331,2016-01-01,Xox
  D_2644,2017-11-01,Xox
  D_2884,2016-12-01,Xox
  D_2626,2017-03-01,Xox
  D_2937,2017-09-01,Xox
  D_2818,2017-02-01,Xox
  D_3636,2017-10-01,Xox
  D_1103,2016-12-01,Xox
  D_1922,2015-11-01,Xox
  D_1088,2017-05-01,Xox
  D_3460,2016-09-01,Xox
  D_1465,2016-07-01,Xox
  D_1974,2016-06-01,Xox
  D_3525,2015-10-01,Xox
  D_2650,2017-04-01,sport
  D_2691,2017-02-01,Xox
  D_3616,2017-03-01,Xox
  D_1104,2016-10-01,Xox
  D_1533,2017-10-01,Xox
  D_3431,2016-06-01,Xox
  D_3458,2016-09-01,Xox
  D_1632,2016-10-01,Xox
  D_687,2017-01-01,Xox
  D_2560,2016-03-01,Xox
  D_1545,2016-03-01,Xox
  D_3073,2016-02-01,Xox
  D_468,2017-02-01,Xox
  D_2891,2015-12-01,Xox
  D_2479,2016-05-01,Xox
  D_254,2016-11-01,Xox
  D_2410,2017-10-01,Xox
  D_3633,2016-09-01,Xox
  D_2773,2017-05-01,Xox
  D_749,2016-09-01,Xox
  D_3259,2017-02-01,Xox
  D_143,2015-11-01,Xox
  D_3272,2016-01-01,Xox
  D_545,2016-05-01,Xox
  D_1684,2017-09-01,Xox
  D_341,2017-09-01,Xox
  D_3351,2016-12-01,Xox
  D_2127,2016-04-01,Xox
  D_1364,2016-03-01,Xox
  D_3500,2017-06-01,Xox
  D_2290,2016-12-01,sport
  D_2847,2015-11-01,Xox
  D_724,2017-09-01,Xox
  D_2111,2017-02-01,Xox
  D_2225,2017-06-01,Xox
  D_720,2016-01-01,Xox
  D_2709,2016-06-01,Xox
  D_1648,2015-12-01,Xox
  D_3251,2016-10-01,Xox
  D_184,2017-06-01,Xox
  D_961,2016-07-01,Xox
  D_2671,2016-08-01,Xox
  D_3217,2017-05-01,Xox
  D_1077,2016-04-01,Xox
  D_1290,2016-05-01,sport
  D_3637,2017-08-01,Xox
  D_436,2016-10-01,Xox
  D_3819,2017-08-01,sport
  D_25,2015-10-01,Xox
  D_739,2017-05-01,Min
  D_3697,2016-08-01,Xox
  D_1851,2016-11-01,Xox
  D_50,2016-07-01,Xox
  D_1286,2016-02-01,Xox"
)

#8

Never mind … now I see what you are looking for…


#9

Yes! Thank you so much @edgararuiz . This solves my problem - this really speeds up my work productivity.

One quick question - could you please explain how .x works. E.g. filter(Date >= (ymd(.x) - dyears(1)) & Date <= .x)


#10

.x represents each iteration of unique(foo$Date). Essentially, @edgararuiz’s code is taking each unique datetime and iterating over it. so the .x represents the current month that is being used at any given iteration.

Additionally, you could use map_df rather than map and exclude the bind_rows call since map_df by default combines the output of map into a dataframe, i.e. map_df takes care of the map() %>% bind_rows() for you.


#11

Hi @Rscotty, I’m glad that worked for you!

A small addition to @tbradley explanation, what has helped me with using map is to picture .x as the variable you would use in in a for loop. So, instead of having to always come up with a name, map has fixed that to be .x

for(.x in letters){
  print(.x)
}

@tbradley - Thanks for the tip on map_df!


#12

@tbradley @edgararuiz Thank you all.