Conditional difference calculation between dates in R using the dplyr function

dplyr

#1

Following is a sample of the data set that I am currently working on:
C_ID M_ID max_ad_date_id max_DIS_DATE SER_C_DATE_ID CL_CCT_IND
102472781 3874931 24/05/2015 28/05/2015 24MAY2015:00:00:00 N
102472781 3874931 24/05/2015 28/05/2015 24MAY2015:00:00:00 N
102472781 3874931 24/05/2015 28/05/2015 25MAY2015:00:00:00 N
102472781 3874931 24/05/2015 28/05/2015 25MAY2015:00:00:00 N
102472781 3874931 24/05/2015 28/05/2015 25MAY2015:00:00:00 N
103011920 3998668 28/01/2015 28/01/2015 28JAN2015:00:00:00 Y

I need to create a new dataset where I will have distinct of the first 4 columns & calculate a new variable LOS which will be a count of distinct SER_C_DATE_ID where CL_CCT_IND=‘Y’ else LOS is a difference between the max_ad_date_id and max_DIS_DATE.

I am new to the R environment & tried using the following code to do the same:
AllData$max_ad_date_id<- as.Date(AllData$max_ad_date_id,"%d%b%Y")
AllData$max_DIS_DATE<- as.Date(AllData$max_DIS_DATE,"%d%b%Y")

DummyData <-AllData %>%
group_by(C_ID, M_ID,
IND) %>%
summarise(max_ad_date=max(max_ad_date_id),
max_DIS=max(max_DIS_DATE),LOS=n_distinct(SER_C_DATE[CL_CCT_IND==“Y”] ,
LOS=(max_ad_date
-max_ad_date)[CL_CCT_IND==“N”])

The code works fine until the difference calculation.
Error in summarise_impl(.data, dots) :
Column LOS must be length 1 (a summary value), not 5

I clearly understand that this is a coding error & the difference calculation should probably not be a part of the summarize function. Also tried using mutate function but couldn’t use it correctly & tried ifelse as well:

LOS = ifelse(CL_CCT_IND == ‘Y’, length(unique(SER_C_ID)), max_ad_date_id - max_DIS_DATE))

Can I get some help around this.Thanks in advance


#2

Hi,

Would you be able to make this into a minimal reproducible example (aka a reprex)?

In addition to the reprex section of the tidyverse site (linked to above), there’s a quick, helpful overview of the package and how to use it (Jenny starts ~10:40) in the video below.

You can also see the slides from that video here:


#3

I can’t quite figure out what you are wanting to do, but for those that are trying to help here is a dataset you can use.

library(dplyr)
library(tibble)

AllData <- tribble(
  ~C_ID, ~M_ID, ~max_ad_date_id, ~max_DIS_DATE, ~SER_C_DATE_ID, ~CL_CCT_IND,
  102472781, 3874931, "24/05/2015", "28/05/2015", "24MAY2015:00:00:00", "N",
  102472781, 3874931, "24/05/2015", "28/05/2015", "24MAY2015:00:00:00", "N",
  102472781, 3874931, "24/05/2015", "28/05/2015", "25MAY2015:00:00:00", "N",
  102472781, 3874931, "24/05/2015", "28/05/2015", "25MAY2015:00:00:00", "N",
  102472781, 3874931, "24/05/2015", "28/05/2015", "25MAY2015:00:00:00", "N",
  103011920, 3998668, "28/01/2015", "28/01/2015", "28JAN2015:00:00:00", "Y"
)

AllData <- AllData %>%
  mutate(max_ad_date_id = as.Date(max_ad_date_id, format = "%d/%m/%Y"),
         max_DIS_DATE   = as.Date(max_DIS_DATE, format = "%d/%m/%Y"))

#4

My thoughts:

I would echo Mara’s suggestion of a reprex, as at the moment you seem to be group_bying variables that are not in the data you showed, so it is hard to follow along.

As I read it you are wanting a summary column which is either a count value or a difference in time. This bothers me at a philosophical level as you seem to be mixing kind of information.

In general thinking through a problem in R, it is easier to keep different things in different categories- so I might want the count and the difference, but I only care about the count or difference for some of my entries.

Also, looking at your description and your code and you are using group_by, which means “for each group that is a unique combination of grouping variable values” rather than distinct to mean “remove duplicates”.

This is getting you into trouble at the summarise step, where you have several members within a group at the summarise step, and in calculating the time difference you are not specifying which of the potential many answers is the single summary value (or how to combine the many answers into a single summary value). Hence the error message.


#5

Is this sort of what you’re looking for?

AllData %>% 
  group_by(SER_C_DATE_ID) %>% 
  mutate(SER_count = n()) %>% 
  ungroup() %>% 
  mutate(difference = max_DIS_DATE - max_ad_date_id)

Which returns:

# A tibble: 6 x 8
# Groups:   SER_C_DATE_ID [3]
       C_ID    M_ID max_ad_date_id max_DIS_DATE      SER_C_DATE_ID CL_CCT_IND SER_count difference
      <dbl>   <dbl>         <date>       <date>              <chr>      <chr>     <int>     <time>
1 102472781 3874931     2015-05-24   2015-05-28 24MAY2015:00:00:00          N         2     4 days
2 102472781 3874931     2015-05-24   2015-05-28 24MAY2015:00:00:00          N         2     4 days
3 102472781 3874931     2015-05-24   2015-05-28 25MAY2015:00:00:00          N         3     4 days
4 102472781 3874931     2015-05-24   2015-05-28 25MAY2015:00:00:00          N         3     4 days
5 102472781 3874931     2015-05-24   2015-05-28 25MAY2015:00:00:00          N         3     4 days
6 103011920 3998668     2015-01-28   2015-01-28 28JAN2015:00:00:00          Y         1     0 days

I’m using the reproducible example provided by @davis. Also, I agree with @thoughtfulnz that it might not be a good idea to have counts and time differences in a single column, this isn’t a tidy approach. Also, if this isn’t what you want, provide an example of your desired output. Reproducible examples and desired outputs make it much more likely that your question will get answered.


#6

Hi ,
Thanks for sharing this. The only reason I need the calculation in the same column ids because that will represent the duration given whether the CL_CCT_IND is Y/N.