Remove 3rd iteration of duplicate

Hi all,
I want R to go down my data line by line, and if it finds a third (fourth, fifth, etc.) line that is identical to the previous two lines, I want it to delete that line. I think it makes sense to start at the top, scan until it finds a third instance, delete it, and then start again from the top, over and over, until it goes through all my data without finding an instance.
I don't want to just dedupe, because if a pattern is AABAAA, I want to only remove the last A (the third in a row), not all 3 of the last As.
I have put example data below, with the word Delete after each row that I need R to delete. Date of purchase should not be used to find duplicates; only Contact Number, Year of purchase, and Type of purchase need to be the same three times in a row.

Is this possible? How would I go about doing it?

Row number  Contact Number  Date of purchase  Year of purchase  Type of purchase  Delete?
2           1               01/04/2008        2008              A
3           1               01/02/2009        2009              A
4           1               01/06/2009        2009              A
5           1               01/10/2009        2009              A                 Delete
6           1               01/02/2010        2010              A
7           1               01/02/2010        2010              A
8           1               01/03/2010        2010              B
9           1               01/11/2010        2010              A
10          1               01/12/2010        2010              A
11          2               01/01/2014        2014              A
12          2               01/06/2014        2014              A
13          2               01/09/2015        2015              A
14          3               01/06/2015        2015              A
15          3               01/07/2016        2016              B
16          3               01/10/2016        2016              B
17          3               01/11/2018        2018              B
18          3               01/11/2018        2018              B
19          3               01/06/2018        2018              B                 Delete
20          3               01/07/2019        2019              A
21          4               01/05/2018        2018              B
22          5               01/04/2010        2010              A
23          5               01/12/2015        2015              B
24          5               01/04/2016        2016              B
25          5               01/07/2016        2016              B
26          5               01/10/2016        2016              B                 Delete
27          5               01/10/2016        2016              A
28          6               01/07/2019        2019              A
29          6               01/08/2019        2019              A
30          6               01/09/2019        2019              B
31          6               01/11/2019        2019              A
32          6               01/11/2019        2019              A
33          6               01/11/2019        2019              B
34          7               01/10/2014        2014              A
35          7               01/06/2015        2015              B
36          7               01/07/2015        2015              B
37          8               01/07/2014        2014              B
38          8               01/08/2014        2014              B
39          9               01/11/2014        2014              B                 Delete
40          9               01/12/2014        2014              B                 Delete
41          9               01/12/2014        2014              B                 Delete
42          9               01/12/2014        2014              A
43          10              01/01/2016        2016              B
44          10              01/05/2016        2016              B
45          10              01/06/2016        2016              A
46          10              01/06/2016        2016              A
47          10              01/11/2016        2016              B
48          10              01/12/2016        2016              B
49          10              01/04/2018        2018              A
50          10              01/08/2018        2018              B

Thanks so much!

You can achieve the required functionality with the dplyr::lag() function; consider the example below:

It computes a helper column called duplicate by comparing the name value in the current row with the name value one row back and two rows back. The result is TRUE when all three values are equal and FALSE when they differ (note the NA in the first row, where lag is undefined; the second row would also be NA if it matched the first, because the two-row lag is still undefined there).

You can then use that column to filter out the offending rows, if so desired.

library(dplyr)

animals <- tibble(row = 1:6,
                  name = c("cat", "dog", "cat", "dog", "dog", "dog"))

# TRUE when name matches both the previous row and the row two back
animals <- animals %>% 
   mutate(duplicate = name == lag(name, n = 1) 
                      & name == lag(name, n = 2))

print(animals)

    row name  duplicate
  <int> <chr> <lgl>    
1     1 cat   NA       
2     2 dog   FALSE    
3     3 cat   FALSE    
4     4 dog   FALSE    
5     5 dog   FALSE    
6     6 dog   TRUE
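
To connect this back to the original question, here is a rough sketch of the same lag() idea applied to the first nine rows of the posted data. The column names (row, contact, date, year, type) and the helper columns key and third_repeat are my own assumptions, not anything from the original post. The three columns that matter are pasted into a single key so that date is ignored, and coalesce() turns the NA values in the first two rows into FALSE so that filter() does not silently drop them. It also assumes the data is already sorted in the order you want compared.

```r
library(dplyr)

# First nine rows of the posted data; column names are assumed for illustration
purchases <- tibble(
  row     = 2:10,
  contact = rep(1, 9),
  date    = c("01/04/2008", "01/02/2009", "01/06/2009", "01/10/2009",
              "01/02/2010", "01/02/2010", "01/03/2010", "01/11/2010",
              "01/12/2010"),
  year    = c(2008, 2009, 2009, 2009, 2010, 2010, 2010, 2010, 2010),
  type    = c("A", "A", "A", "A", "A", "A", "B", "A", "A")
)

cleaned <- purchases %>%
  mutate(
    # Combine only the columns that define a duplicate (date is left out)
    key = paste(contact, year, type, sep = "|"),
    # TRUE when this row matches both of the two rows before it;
    # NA from the undefined lags in rows 1 and 2 is treated as FALSE
    third_repeat = coalesce(key == lag(key, n = 1) & key == lag(key, n = 2),
                            FALSE)
  ) %>%
  filter(!third_repeat) %>%        # keep everything except the flagged rows
  select(-key, -third_repeat)      # drop the helper columns again

print(cleaned)
```

On this sample only row 5 is dropped, which matches the Delete marker in the question. A single pass should be enough: every row in a run after the second one gets flagged, so a run of any length is trimmed to two rows, which is the same result as repeatedly deleting the third instance and starting over.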