Sentence of duplicates?

Slavek · October 8, 2019, 10:04am

Hi,
I've gone through documentation related to deduplication but I have slightly different challenge. I have this data frame:

data.frame(
                              VitalCode = c(223L, 221L, 221L, 221L, 2415L),
                               Accountnum = as.factor(c("A", "A", "A", "A", "B")),
                               RoDate = as.factor(c("17/12/2012 00:00", "05/03/2013 00:00",
                                                      "15/03/2013 00:00",
                                                      "10/06/2014 00:00",
                                                      "26/06/2019 00:00")),
                                                       MOTFlag = as.factor(c("NO", "NO", "NO", "YES", "NO"))
                     )

I don't really need to use unique records from " Accountnum". I would like to assign their order based on "RoDate" so New variable should have following values:

Sequence <- c(1, 2, 3, 4, 1)

Can we also have it based only on year rather than a full date? In this case the result I need should be:

SequenceYear <- (1, 2, 2, 3, 1)

Is it simple? Most of issues R users have are related to deduplications only...

phiggins · October 8, 2019, 10:36am

Hi Slavek,

I think that this approach with lubridate and mutate will give you what you are looking for - a newvar that is a factor based on year

library(tidyverse)
library(lubridate)
#> 
#> Attaching package: 'lubridate'
#> The following object is masked from 'package:base':
#> 
#>     date
df <- data.frame(
  VitalCode = c(223L, 221L, 221L, 221L, 2415L),
  Accountnum = as.factor(c("A", "A", "A", "A", "B")),
  RoDate = dmy_hm(c("17/12/2012 00:00", "05/03/2013 00:00",
                       "15/03/2013 00:00",
                       "10/06/2014 00:00",
                       "26/06/2019 00:00")),
  MOTFlag =c("NO", "NO", "NO", "YES", "NO"))

df %>% 
  mutate(newvar = factor(year(RoDate))) ->
df

glimpse(df)
#> Observations: 5
#> Variables: 5
#> $ VitalCode  <int> 223, 221, 221, 221, 2415
#> $ Accountnum <fct> A, A, A, A, B
#> $ RoDate     <dttm> 2012-12-17, 2013-03-05, 2013-03-15, 2014-06-10, 2019…
#> $ MOTFlag    <fct> NO, NO, NO, YES, NO
#> $ newvar     <fct> 2012, 2013, 2013, 2014, 2019

^{Created on 2019-10-08 by the reprex package (v0.3.0)}

Slavek · October 8, 2019, 10:48am

That is great. What about a simple sentence based on full date (1, 2, 3, 4, 1)?

Slavek · October 8, 2019, 11:59am

Also, how can I have a sentence for your solution so (1, 2, 3, 4, 1) rather than (2012, 2013, 2013, 2014, 2019)?

phiggins · October 8, 2019, 12:26pm

Hi Slavek - two answers in one

library(tidyverse)
library(lubridate)
#> 
#> Attaching package: 'lubridate'
#> The following object is masked from 'package:base':
#> 
#>     date

df <- data.frame(
  VitalCode = c(223L, 221L, 221L, 221L, 2415L),
  Accountnum = as.factor(c("A", "A", "A", "A", "B")),
  RoDate = dmy_hm(c("17/12/2012 00:00", "05/03/2013 00:00",
                    "15/03/2013 00:00",
                    "10/06/2014 00:00",
                    "26/06/2019 00:00")),
  MOTFlag =c("NO", "NO", "NO", "YES", "NO"))

# for ranks by full date 
df %>% 
  mutate(newvar = factor(rank(RoDate))) ->
  df

glimpse(df)
#> Observations: 5
#> Variables: 5
#> $ VitalCode  <int> 223, 221, 221, 221, 2415
#> $ Accountnum <fct> A, A, A, A, B
#> $ RoDate     <dttm> 2012-12-17, 2013-03-05, 2013-03-15, 2014-06-10, 20...
#> $ MOTFlag    <fct> NO, NO, NO, YES, NO
#> $ newvar     <fct> 1, 2, 3, 4, 5

# for integer factors by year 
df %>% 
  mutate(newvar = as.integer(factor(year(RoDate)))) ->
  df

glimpse(df)
#> Observations: 5
#> Variables: 5
#> $ VitalCode  <int> 223, 221, 221, 221, 2415
#> $ Accountnum <fct> A, A, A, A, B
#> $ RoDate     <dttm> 2012-12-17, 2013-03-05, 2013-03-15, 2014-06-10, 20...
#> $ MOTFlag    <fct> NO, NO, NO, YES, NO
#> $ newvar     <int> 1, 2, 2, 3, 4

Slavek · October 8, 2019, 1:42pm

Thank you but they are not "Accountnum" related. As a result I have sequence of all record in the df…
I should get:

#> $ newvar     <fct> 1, 2, 3, 4, 1

andresrcs · October 8, 2019, 5:10pm

Since @phiggins already provided a proper reprex (what you should have done from the beginning), now is much easier to help you, this produces the result you are describing.

library(tidyverse)
library(lubridate)

df <- data.frame(
    VitalCode = c(223L, 221L, 221L, 221L, 2415L),
    Accountnum = c("A", "A", "A", "A", "B"),
    RoDate = dmy_hm(c("17/12/2012 00:00", "05/03/2013 00:00",
                      "15/03/2013 00:00",
                      "10/06/2014 00:00",
                      "26/06/2019 00:00")),
    MOTFlag =c("NO", "NO", "NO", "YES", "NO"))

df %>% 
    group_by(Accountnum) %>% 
    arrange(RoDate) %>% 
    mutate(Sequence = row_number(),
           SequenceYear = as.integer(factor(year(RoDate)))) %>% 
    ungroup()
#> # A tibble: 5 x 6
#>   VitalCode Accountnum RoDate              MOTFlag Sequence SequenceYear
#>       <int> <fct>      <dttm>              <fct>      <int>        <int>
#> 1       223 A          2012-12-17 00:00:00 NO             1            1
#> 2       221 A          2013-03-05 00:00:00 NO             2            2
#> 3       221 A          2013-03-15 00:00:00 NO             3            2
#> 4       221 A          2014-06-10 00:00:00 YES            4            3
#> 5      2415 B          2019-06-26 00:00:00 NO             1            1

Slavek · October 9, 2019, 12:04pm

Hi, it looks like a working solution.
Unfortunately it doesn't work with my longer files.

I know I should sort out my reprex error and submit everything in the proper reprex but can I use datapasta last time (I promise) to submit another short file your solution is not working?
(I promised to use proper reprex for all new posts)

the file is here:

data.frame(
   Accountnum = as.factor(c("SPO60", "SPO60", "SPO60", "SPO60", "LEV21",
                            "STO60", "STO60", "STO60", "HIL20", "HIL20",
                            "HIL20", "JJM55", "JJM55", "JJM55", "JJM55", "JJM55",
                            "JJM55", "JJM55", "JJM55")),
       RoDate = as.factor(c("17/12/2012 00:00", "05/03/2013 00:00",
                            "15/03/2013 00:00", "10/06/2014 00:00",
                            "26/06/2019 00:00", "27/01/2017 00:00", "20/10/2017 00:00",
                            "24/10/2018 00:00", "27/02/2014 00:00",
                            "03/06/2015 00:00", "07/10/2015 00:00", "24/11/2011 00:00",
                            "29/11/2011 00:00", "27/11/2012 00:00", "26/11/2013 00:00",
                            "28/11/2014 00:00", "09/01/2015 00:00",
                            "24/11/2015 00:00", "24/11/2016 00:00"))
)

and, sequence for each Accountnum is random.
I don't really know why. I have checked a few example files...

andresrcs · October 9, 2019, 12:09pm

It is because RoDate is a factor, you simply have to convert it into a date variable.

Slavek · October 10, 2019, 9:43am

This solution is absolutely brilliant and unique I think because now we could do further analysis of duplicates (for example first response, second etc.)
Well done!

I am wondering if reverse order is also possible (so in the initial example Sequence would be (4, 3, 2, 1, 1) and SequenceYear would be (3, 2, 2, 1, 1)

andresrcs · October 10, 2019, 12:01pm

Yes it is, and very simple in fact, give it a try, you already have said the key word, "order".

Slavek · October 10, 2019, 1:11pm

Hurray!!!

df %>% 
    group_by(Accountnum) %>% 
    arrange(desc(RoDate)) %>% 
    mutate(Sequence = row_number(),
           SequenceYear = as.integer(factor(year(RoDate)))) %>% 
    ungroup()

Thank you

system · October 17, 2019, 1:11pm

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.