I have data that looks like this:
library(dplyr)
library(lubridate)
Date_Construct= c("10/03/2018 00:00", "10/03/2018 00:00","01/01/2016 00:00","21/03/2015 01:25", "21/03/2015 01:25", "17/04/2016 00:00","17/04/2016 00:00", "20/02/2012 00:00","20/02/2020 00:00")
Date_first_use = c("02/08/2018 00:00","02/08/2018 00:00", "01/04/2016 00:00","NA", "NA", "NA", "NA","13/08/2012 00:00","20/04/2020 00:00")
Date_fail = c("02/08/2019 00:00","02/08/2019 00:00", "21/06/2018 06:42","NA" , "NA" , "17/04/2016 00:00", "17/04/2016 00:00","13/08/2014 07:45","NA")
P_ID = c("0001", "0001" ,"0001" ,"0001", "0001","34000","34000","34000", "00425")
Comp_date= c("16/05/2019 00:00", "10/04/2018 12:55","25/06/2017 00:00","22/04/2015 00:00","08/05/2015 00:00" ,"04/05/2017 00:00" ,"15/07/2016 00:00","01/03/2014 00:00", "20/03/2020 00:00")
Type = c("a","a","b","c","c","b","b","a","c")
Date_Construct= dmy_hm(Date_Construct)
dfq = data.frame(P_ID, Type, Date_Construct, Date_first_use, Date_fail, Comp_date) %>%
  arrange(P_ID, desc(Date_Construct)) %>%
  group_by(P_ID, Date_Construct, Type) %>%
  mutate(A_ID = cur_group_id()) %>%
  select(P_ID, A_ID, Type, Date_Construct, Date_first_use, Date_fail, Comp_date) %>%
  mutate(across(contains("Date", ignore.case = TRUE), dmy_hm))
View(dfq)
It is a data frame of different items (A_ID) of type a/b/c, created for different clients (P_ID), with a date of construction, date of first use and date of failure. Each P_ID may have multiple A_ID values, and each A_ID may have multiple Comp_date values.
Where Date_fail is NA, I need to fill it with the Date_Construct of the next constructed A_ID for the same P_ID.
For example, Date_fail for P_ID 0001, A_ID 1 should be 2016-01-01 00:00:00.
For an A_ID with no subsequent A_ID (as is the case for P_ID 00425, A_ID 4), Date_fail should remain NA.
So the result should look like:
  P_ID  A_ID Type Date_Construct      Date_first_use Date_fail           Comp_date
1 0001     1 c    2015-03-21 01:25:00 NA             2016-01-01 00:00:00 2015-04-22 00:00:00
2 0001     1 c    2015-03-21 01:25:00 NA             2016-01-01 00:00:00 2015-05-08 00:00:00
3 0001     2 b    2016-01-01 00:00:00 2016-04-01     2018-06-21 06:42:00 2017-06-25 00:00:00
4 0001     3 a    2018-03-10 00:00:00 2018-08-02     2019-08-02 00:00:00 2019-05-16 00:00:00
5 0001     3 a    2018-03-10 00:00:00 2018-08-02     2019-08-02 00:00:00 2018-04-10 12:55:00
6 00425    4 c    2020-02-20 00:00:00 2020-04-20     NA                  2020-03-20 00:00:00
7 34000    5 a    2012-02-20 00:00:00 2012-08-13     2014-08-13 07:45:00 2014-03-01 00:00:00
8 34000    6 b    2016-04-17 00:00:00 NA             2016-04-17 00:00:00 2017-05-04 00:00:00
9 34000    6 b    2016-04-17 00:00:00 NA             2016-04-17 00:00:00 2016-07-15 00:00:00
I tried the following, which I thought worked, but it just gives me the Date_Construct of the next row in the group, which isn't correct because some A_ID values have multiple entries:
dfq %>%
  arrange(P_ID, Date_Construct) %>%
  group_by(P_ID) %>%
  mutate(Date_fail2 = sort(Date_Construct, decreasing = FALSE)[row_number(Date_Construct) + 1]) %>%
  mutate(Date_fail = if_else(is.na(Date_fail), paste(Date_fail2), paste(Date_fail)))
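The problem with this attempt can be reproduced on a tiny stand-in vector. In the sketch below, x is a hypothetical substitute for Date_Construct within one P_ID, where the first item occupies two rows; plain numbers are used instead of dates only to keep it short.

```r
library(dplyr)

# Hypothetical stand-in for Date_Construct within one P_ID:
# the first item (value 1) has two rows, the second item (value 2) has one.
x <- c(1, 1, 2)

# row_number() breaks ties by position, so the duplicated value gets
# two different ranks.
ranks <- row_number(x)

# Indexing the sorted vector at rank + 1 hands the first duplicated row
# its own value back instead of the next distinct value.
next_by_row <- sort(x)[ranks + 1]
print(next_by_row)
```

The first element comes back as the item's own value rather than the value of the next item, which is the behaviour described above for A_ID with multiple rows.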
I'm ideally looking for a dplyr/tidyverse solution as I find them easier to understand and reproduce.
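One possible dplyr approach, offered as a sketch rather than a definitive answer: reduce the data to one row per item, take the next construction date with lead() within each P_ID, join that back, and fill only the missing Date_fail values with coalesce(). The names toy, next_construct, Next_Construct and result are hypothetical, and toy is a cut-down copy of the data above with dates already parsed.

```r
library(dplyr)

# Cut-down toy data (values taken from the question, dates already parsed).
toy <- tibble(
  P_ID = c("0001", "0001", "0001", "00425", "34000"),
  A_ID = c(1L, 1L, 2L, 4L, 5L),
  Date_Construct = as.POSIXct(
    c("2015-03-21 01:25", "2015-03-21 01:25", "2016-01-01 00:00",
      "2020-02-20 00:00", "2012-02-20 00:00"),
    format = "%Y-%m-%d %H:%M", tz = "UTC"),
  Date_fail = as.POSIXct(
    c(NA, NA, "2018-06-21 06:42", NA, "2014-08-13 07:45"),
    format = "%Y-%m-%d %H:%M", tz = "UTC")
)

# One row per item, so lead() steps to the next A_ID rather than the next row.
next_construct <- toy %>%
  distinct(P_ID, A_ID, Date_Construct) %>%
  arrange(P_ID, Date_Construct) %>%
  group_by(P_ID) %>%
  mutate(Next_Construct = lead(Date_Construct)) %>%
  ungroup() %>%
  select(P_ID, A_ID, Next_Construct)

# Join the lookup back and fill Date_fail only where it is NA.
result <- toy %>%
  left_join(next_construct, by = c("P_ID", "A_ID")) %>%
  mutate(Date_fail = coalesce(Date_fail, Next_Construct)) %>%
  select(-Next_Construct)

print(result)
```

To try this on dfq, it would presumably need an ungroup() first, since dfq is still grouped by P_ID, Date_Construct and Type. The last A_ID of each P_ID gets NA from lead(), so its Date_fail stays NA when it was missing.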