Quick date range recoding

Hi,
I have to recode dates into specific ranges (for example financial year). I've done my coding but running that on my large data file id the most time consuming R process since I started playing with R. Is any easier/quicker way of running my code?

data.source <- data.frame(stringsAsFactors=FALSE,
                 ReferenceNumber = c("08d4-ea84-4cca", "08d4-ea84-4cca", "08d4-ea84-4cca",
                                     "08d4-ea84-4cca", "137f-dd7f-4120",
                                     "3302-f362-4244", "3302-f362-4244",
                                     "3302-f362-4244", "69f4-af0b-4640", "69f4-af0b-4640",
                                     "69f4-af0b-4640", "ac31-5bab-4798",
                                     "ac31-5bab-4798", "ac31-5bab-4798", "ac31-5bab-4798",
                                     "ac31-5bab-4798", "ac31-5bab-4798",
                                     "ac31-5bab-4798", "ac31-5bab-4798", "e0e5-7e8b-497c",
                                     "e0e5-7e8b-497c", "e0e5-7e8b-497c",
                                     "e0e5-7e8b-497c"),
                          RoDate = c("2014-06-10", "2013-03-15", "2013-03-05", "2012-12-17",
                                     "2019-06-26", "2017-10-20", "2018-10-24",
                                     "2017-01-27", "2015-10-07", "2015-06-03",
                                     "2014-02-27", "2016-11-24", "2015-11-24",
                                     "2015-01-09", "2014-11-28", "2013-11-26",
                                     "2012-11-27", "2011-11-29", "2011-11-24",
                                     "2014-06-05", "2014-06-05", "2014-03-21",
                                     "2014-01-15")
              )

library(dplyr)
date.range <- data.source %>% mutate(
  ReportingYear = case_when(
    RoDate >= "2011-10-05" & RoDate < "2012-10-05" ~ 1,
    RoDate >= "2012-10-05" & RoDate < "2013-10-05" ~ 2,
    RoDate >= "2013-10-05" & RoDate < "2014-10-05" ~ 3,
    RoDate >= "2014-10-05" & RoDate < "2015-10-05" ~ 4,
    RoDate >= "2015-10-05" & RoDate < "2016-10-05" ~ 5,
    RoDate >= "2016-10-05" & RoDate < "2017-10-05" ~ 6,
    RoDate >= "2017-10-05" & RoDate < "2018-10-05" ~ 7,
    RoDate >= "2018-10-05" & RoDate < "2019-10-05" ~ 8))

Thank you in advance

cut is a great function for this kind of thing. It categorizes ranges of a continuous variable, which included dates, to create a factor.

A short function for handling the specifics goes well in a dplyr pipeline.

library(lubridate)

divide_year <- function(dates) {
  years <- unique(year(dates))
  breaks <- make_date(years, month = 10, day = 5)
  breaks <- sort(breaks)
  cut(dates, breaks = breaks, right = FALSE, include.lowest = TRUE)
}

data.range <- data.source %>%
  mutate(RoDate = as.Date(RoDate)) %>%
  mutate(ReportingYear = divide_year(RoDate))

head(data.range)
#   ReferenceNumber     RoDate ReportingYear
# 1  08d4-ea84-4cca 2014-06-10    2013-10-05
# 2  08d4-ea84-4cca 2013-03-15    2012-10-05
# 3  08d4-ea84-4cca 2013-03-05    2012-10-05
# 4  08d4-ea84-4cca 2012-12-17    2012-10-05
# 5  137f-dd7f-4120 2019-06-26    2018-10-05
# 6  3302-f362-4244 2017-10-20    2017-10-05

You can use the labels argument of cut to change the level names. If you really need the results as integers, just add an extra line in the pipeline:

mutate(ReportingYear = as.integer(ReportingYear))

Edited: fixed incorrect variable names

1 Like

Thank you. Looks really clever but I have following errors:

 Error in make_date(years_included, month = 10, day = 5) : 
  object 'years_included' not found 

 Error in vapply(list(year, month, day), length, 1, USE.NAMES = FALSE) : 
  object 'years_included' not found 

Is this faster?

library(dplyr)

data.source <- data.frame(stringsAsFactors=FALSE,
                          ReferenceNumber = c("08d4-ea84-4cca", "08d4-ea84-4cca", "08d4-ea84-4cca",
                                              "08d4-ea84-4cca", "137f-dd7f-4120",
                                              "3302-f362-4244", "3302-f362-4244",
                                              "3302-f362-4244", "69f4-af0b-4640", "69f4-af0b-4640",
                                              "69f4-af0b-4640", "ac31-5bab-4798",
                                              "ac31-5bab-4798", "ac31-5bab-4798", "ac31-5bab-4798",
                                              "ac31-5bab-4798", "ac31-5bab-4798",
                                              "ac31-5bab-4798", "ac31-5bab-4798", "e0e5-7e8b-497c",
                                              "e0e5-7e8b-497c", "e0e5-7e8b-497c",
                                              "e0e5-7e8b-497c"),
                          RoDate = c("2014-06-10", "2013-03-15", "2013-03-05", "2012-12-17",
                                     "2019-06-26", "2017-10-20", "2018-10-24",
                                     "2017-01-27", "2015-10-07", "2015-06-03",
                                     "2014-02-27", "2016-11-24", "2015-11-24",
                                     "2015-01-09", "2014-11-28", "2013-11-26",
                                     "2012-11-27", "2011-11-29", "2011-11-24",
                                     "2014-06-05", "2014-06-05", "2014-03-21",
                                     "2014-01-15")
)


breaks <- seq.Date(as.Date("2011-10-05"), as.Date("2019-10-05"), by = "years")

data.source %>% 
    mutate(ReportingYear = cut(as.Date(RoDate), breaks, labels = 1:8))
#>    ReferenceNumber     RoDate ReportingYear
#> 1   08d4-ea84-4cca 2014-06-10             3
#> 2   08d4-ea84-4cca 2013-03-15             2
#> 3   08d4-ea84-4cca 2013-03-05             2
#> 4   08d4-ea84-4cca 2012-12-17             2
#> 5   137f-dd7f-4120 2019-06-26             8
#> 6   3302-f362-4244 2017-10-20             7
#> 7   3302-f362-4244 2018-10-24             8
#> 8   3302-f362-4244 2017-01-27             6
#> 9   69f4-af0b-4640 2015-10-07             5
#> 10  69f4-af0b-4640 2015-06-03             4
#> 11  69f4-af0b-4640 2014-02-27             3
#> 12  ac31-5bab-4798 2016-11-24             6
#> 13  ac31-5bab-4798 2015-11-24             5
#> 14  ac31-5bab-4798 2015-01-09             4
#> 15  ac31-5bab-4798 2014-11-28             4
#> 16  ac31-5bab-4798 2013-11-26             3
#> 17  ac31-5bab-4798 2012-11-27             2
#> 18  ac31-5bab-4798 2011-11-29             1
#> 19  ac31-5bab-4798 2011-11-24             1
#> 20  e0e5-7e8b-497c 2014-06-05             3
#> 21  e0e5-7e8b-497c 2014-06-05             3
#> 22  e0e5-7e8b-497c 2014-03-21             3
#> 23  e0e5-7e8b-497c 2014-01-15             3
1 Like

Much faster!!!
Thank you :smiley:

1 Like

Whoops! I forgot to change all instances of the years name. I've edited my answer, even though andresrcs posted a working version.

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.