How do i use the ranking function Row_number with multiple conditions in R

srisivanicharan · December 26, 2018, 10:48pm

Here is my dummy dataset.

 ID        Order       Case         Date_created   
123456   25800265        1     2018-06-27 07:40:23 
123456   25800265        1     2018-06-25 05:29:23
123456   25800265        0     2018-07-26 06:16:28
789454   25906588        1     2018-07-12 05:59:50
789454   25906588        0     2018-07-12 07:41:29
789454   25906588        0     2018-07-10 05:43:45
789454   25906588        0     2018-07-09 05:59:26
789454   25906588        0     2018-07-05 10:39:45
287541   32140567        0     2018-07-12 07:41:29
287541   32140567        0     2018-07-10 05:43:45
287541   32140567        0     2018-07-09 05:59:26
287541   32140567        0     2018-07-05 10:39:45

I need only one record for each order based on the following conditions.
Return the record where Case=1 when an Order contains both 0 and 1 in 'Case'. If multiple records are present where Case=1 then get the old Date_created record.
If the Order has only Case=0, then return the record with oldest Date_created date.

i.e.

 ID        Order       Case         Date_created        
123456   25800265        1     2018-06-25 05:29:23
789454   25906588        1     2018-07-12 05:59:50
287541   32140567        0     2018-07-05 10:39:45

In Redshift I could accomplish this using the following code.

select * from 
(
select *,
           ROW_NUMBER()over(partition by Order order by Case asc,Date_created desc ) as latest_time
 from tbl 
)where latest_time=1

How do i accomplish this in R?

rexevan · December 27, 2018, 8:16am

I think there is a way of doing this using tidyr and purrr to accomplish this. Something alongside of loop and double loop. Maybe just my hunch. This is the closest I got.

If I understand OP's post, then the rules as follows.
For each ID :

If Case has all 1 or 0, then take the oldest Date_created
If else, then take the oldest Date_created when the case is 1

First, let's import our data.

library(tidyverse)
library(magrittr)
#> 
#> Attaching package: 'magrittr'
#> The following object is masked from 'package:purrr':
#> 
#>     set_names
#> The following object is masked from 'package:tidyr':
#> 
#>     extract

foo_tbl <- tibble::tribble( 
  ~ID,        ~Order,    ~Case,           ~Date_created,   
  123456,   25800265,        1,     "2018-06-27 07:40:23",
  123456,   25800265,        1,     "2018-06-25 05:29:23",
  123456,   25800265,        0,     "2018-07-26 06:16:28",
  789454,   25906588,        1,     "2018-07-12 05:59:50",
  789454,   25906588,        0,     "2018-07-12 07:41:29",
  789454,   25906588,        0,     "2018-07-10 05:43:45",
  789454,   25906588,        0,     "2018-07-09 05:59:26",
  789454,   25906588,        0,     "2018-07-05 10:39:45",
  287541,   32140567,        0,     "2018-07-12 07:41:29",
  287541,   32140567,        0,     "2018-07-10 05:43:45",
  287541,   32140567,        0,     "2018-07-09 05:59:26",
  287541,   32140567,        0,     "2018-07-05 10:39:45"
)

To check if the case contains all 1 or 0, I will need to know how much of ID (ID_n) and the sum of case (case_sum). If case_sum matches ID_n then It contains all 1, if 0 then all contains 0.

foo_tbl <- foo_tbl %>% 
  arrange(ID, Date_created)

by_ID <- foo_tbl %>% 
  group_by(ID) %>%
  mutate(
    ID_n = n(),
    case_sum= sum(Case)
  )

by_ID
#> # A tibble: 12 x 6
#> # Groups:   ID [3]
#>        ID    Order  Case Date_created         ID_n case_sum
#>     <dbl>    <dbl> <dbl> <chr>               <int>    <dbl>
#>  1 123456 25800265     1 2018-06-25 05:29:23     3        2
#>  2 123456 25800265     1 2018-06-27 07:40:23     3        2
#>  3 123456 25800265     0 2018-07-26 06:16:28     3        2
#>  4 287541 32140567     0 2018-07-05 10:39:45     4        0
#>  5 287541 32140567     0 2018-07-09 05:59:26     4        0
#>  6 287541 32140567     0 2018-07-10 05:43:45     4        0
#>  7 287541 32140567     0 2018-07-12 07:41:29     4        0
#>  8 789454 25906588     0 2018-07-05 10:39:45     5        1
#>  9 789454 25906588     0 2018-07-09 05:59:26     5        1
#> 10 789454 25906588     0 2018-07-10 05:43:45     5        1
#> 11 789454 25906588     1 2018-07-12 05:59:50     5        1
#> 12 789454 25906588     0 2018-07-12 07:41:29     5        1

Then I told R to return ID based on that condition, and then filter the table according to each condition.

# Return ID when case_sum is equal to ID_n (all 1s) OR case_sum equals to 0 (all 0s)
ID_true <- by_ID %$% ID[(case_sum == ID_n | case_sum == 0)] %>% unique()
ID_true
#> [1] 287541

# Return ID for everything else above (has both 1 and 0 in 'case')
ID_false <- by_ID %$% ID[(case_sum != ID_n & case_sum != 0)] %>% unique()
ID_false
#> [1] 123456 789454

# for ID that has all 1 or 0 in 'Case'
ID_all_1_or_0 <- by_ID %>%
  filter(ID %>% is_in(ID_true)) %>% 
  filter(Date_created == min(Date_created)) # Take the oldest date

# for ID that has both 1 and 0 in 'Case'
ID_both_1_and_0 <- by_ID %>%
  filter(ID %>% is_in(ID_false)) %>% 
  filter(Case == 1) %>%  # show only when the case is 1
  filter(Date_created == min(Date_created)) # take the oldest date

Lastly, union them back.

ID_all_1_or_0 %>% 
  union(ID_both_1_and_0) %>% 
  select(-ID_n, -case_sum) %>% 
  arrange(ID)
#> # A tibble: 3 x 4
#> # Groups:   ID [3]
#>       ID    Order  Case Date_created       
#>    <dbl>    <dbl> <dbl> <chr>              
#> 1 123456 25800265     1 2018-06-25 05:29:23
#> 2 287541 32140567     0 2018-07-05 10:39:45
#> 3 789454 25906588     1 2018-07-12 05:59:50

^{Created on 2018-12-27 by the reprex package (v0.2.1)}

technocrat · December 27, 2018, 9:51am

I agree with @rexevan 's approach, with one additional caveat. Unless your data is as small as the reprex, to import the source you will need unique rownames, which your example lacks.

joels · December 27, 2018, 9:54am

Here's an approach that applies the given selection rules (assuming I'm understanding them properly) in a single dplyr pipe. I've used rexevan's foo_tbl as the starting data frame.

library(tidyverse)
library(lubridate)

foo_tbl %>% 
  mutate(Date_created = ymd_hms(Date_created)) %>% 
  arrange(Case, Date_created) %>% 
  # Keep the earliest Date_created for each ID-Order-Case combination
  group_by(ID, Order, Case) %>% 
  slice(1) %>% 
  # Keep the highest Case number for each Order
  arrange(desc(Case)) %>% 
  group_by(ID, Order) %>% 
  slice(1)

#> # A tibble: 3 x 4
#> # Groups:   ID, Order [3]
#>       ID    Order  Case Date_created       
#>    <dbl>    <dbl> <dbl> <dttm>             
#> 1 123456 25800265     1 2018-06-25 05:29:23
#> 2 287541 32140567     0 2018-07-05 10:39:45
#> 3 789454 25906588     1 2018-07-12 05:59:50

rexevan · December 27, 2018, 12:13pm

Here is an easier way than my previous one. The idea is a double filter() not a double loop as I post before. I will also use foo_tbl from previous post.

The idea :

Filter by the highest case.
Filter again by the earliest Date_created.

library(tidyverse)
library(lubridate)
#> 
#> Attaching package: 'lubridate'
#> The following object is masked from 'package:base':
#> 
#>     date

foo_tbl %>%
  mutate(Date_created = ymd_hms(Date_created)) %>%
  group_by(ID) %>%
  filter(Case == max(Case)) %>%
  filter(Date_created == min(Date_created))
#> # A tibble: 3 x 4
#> # Groups:   ID [3]
#>       ID    Order  Case Date_created       
#>    <dbl>    <dbl> <dbl> <dttm>             
#> 1 123456 25800265     1 2018-06-25 05:29:23
#> 2 789454 25906588     1 2018-07-12 05:59:50
#> 3 287541 32140567     0 2018-07-05 10:39:45

^{Created on 2018-12-27 by the reprex package (v0.2.1)}

srisivanicharan · December 27, 2018, 6:01pm

@rexevan, @joels, @technocrat Thank you for your inputs and approaches. I found another way which is a bit easy.

df %>% 
  group_by(Order) %>% 
  arrange(desc(Case), Date_created) %>% 
  mutate(row = row_number()) %>% 
  ungroup() %>% 
  filter(row == 1) %>% 
  select(-row) %>% 
  arrange(Order)

You can go through this link

mishabalyasin · December 27, 2018, 6:04pm

It's always a good idea to make it explicit, that you are cross-posting your question to multiple places so that people don't end up wasting their time on question that has been already answered. There is more about this forum policy on cross-posting:

system · January 3, 2019, 6:04pm

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.