Remove rows if value in column repeats when using group_by()

I have a tibble containing small mammal trapping effort data that has some errors and I want to remove those records from analysis. I can identify those errors by using either arrange_by() or group_by() for my "trap_id" column. When I arrange that way it is clear to see that the "status" column sometimes has two records in a row that state the trap was "closed" for two calendar days in a row, which is not possible. I want to conditionally filter the data so that for any cases where there are repeating values of "closed" in the "status" column the first record in the repeating series is removed.

How would I do this?
Thanks

1 Like

Hi!

To help us help you, could you please prepare a reproducible example (reprex) illustrating your issue? Please have a look at this guide, to see how to create one:

Hello,
I created the following code for a tibble to reproduce my issue simulating what the data would look like after I group_by(trap_ID). You can see circumstances of two dates with either "open" or "closed" in a row, and I want to write code to recognize those circumstances and then be able to remove the first of two "open" records and the second record of repeating "closed" status so that the final data set, when grouped by trap_ID would have records alternating open and closed throughout the entire data set.
Can you help?
Thanks

library(lubridate)
library(tidyverse)

df <- 
tribble(
  ~trap_ID, ~date, ~status,
  1, "october 01, 2016", "open",
  1, "october 03, 2016", "closed",
  1, "october 04, 2016", "open",
  1, "october 10, 2016", "closed",
  1, "october 12, 2016", "closed",
  2, "october 12, 2016", "open",
  2, "october 03, 2016", "open",
  2, "october 10, 2016", "closed")  %>% 
  mutate(
    date = mdy(date))

Is this what you mean?

library(lubridate)
library(tidyverse)

df <- 
    tribble(
        ~trap_ID, ~date, ~status,
        1, "october 01, 2016", "open",
        1, "october 03, 2016", "closed",
        1, "october 04, 2016", "open",
        1, "october 10, 2016", "closed",
        1, "october 12, 2016", "closed",
        2, "october 12, 2016", "open",
        2, "october 03, 2016", "open",
        2, "october 10, 2016", "closed")  %>% 
    mutate(
        date = mdy(date))

df %>%
    pivot_wider(names_from = status,
                values_from = date,
                values_fn = list) %>% 
    mutate(open = map(open, min),
           closed = map(closed, max)) %>% 
    unnest_longer(col = c(closed, open)) %>% 
    pivot_longer(cols = -trap_ID, names_to = 'status', values_to = 'date')
#> # A tibble: 4 × 3
#>   trap_ID status date      
#>     <dbl> <chr>  <date>    
#> 1       1 open   2016-10-01
#> 2       1 closed 2016-10-12
#> 3       2 open   2016-10-03
#> 4       2 closed 2016-10-10

Created on 2022-05-16 by the reprex package (v2.0.1)

When the dates are sorted, there's no repeated status for trap_ID=2. Is that what you intended?

For the output, is this what you had in mind?

library(lubridate)
library(tidyverse)

df <- 
  tribble(
    ~trap_ID, ~date, ~status,
    1, "october 01, 2016", "open",
    1, "october 03, 2016", "closed",
    1, "october 04, 2016", "open",
    1, "october 10, 2016", "closed",
    1, "october 12, 2016", "closed",
    2, "october 12, 2016", "open",
    2, "october 03, 2016", "open",
    2, "october 10, 2016", "closed")  %>% 
  mutate(
    date = mdy(date))
df %>% 
  mutate(date=as_date(date)) %>% 
  arrange(trap_ID, date) %>% 
  group_by(trap_ID) %>% 
  filter(lag(status) %in% c(NA_character_, "open") & status=="closed" |
           lead(status) %in% c(NA_character_, "closed") & status=="open") 
#> # A tibble: 7 × 3
#> # Groups:   trap_ID [2]
#>   trap_ID date       status
#>     <dbl> <date>     <chr> 
#> 1       1 2016-10-01 open  
#> 2       1 2016-10-03 closed
#> 3       1 2016-10-04 open  
#> 4       1 2016-10-10 closed
#> 5       2 2016-10-03 open  
#> 6       2 2016-10-10 closed
#> 7       2 2016-10-12 open

Created on 2022-05-16 by the reprex package (v2.0.1)

Hmmm.
First, sorry I made a typo in my example df. I had intended for it to be as below where the three dates for trap_ID=2 were October 02, 03, and 10. So I corrected that and ran your script and ended up with 6 records which is what I was expecting (and alternating open/closed as described above). However, now when I ran the code for my data set of 16,000 records, I only had 12 records and all of them where status=closed.

I wasn't sure how "NA_character_" was operating in filter, so I also tried replacing it with ".", but got the same results with my large dataset.

  tribble(
    ~trap_ID, ~date, ~status,
    1, "october 01, 2016", "open",
    1, "october 03, 2016", "closed",
    1, "october 04, 2016", "open",
    1, "october 10, 2016", "closed",
    1, "october 12, 2016", "closed",
    2, "october 02, 2016", "open",
    2, "october 03, 2016", "open",
    2, "october 10, 2016", "closed")  %>% 
  mutate(
    date = mdy(date))

Thanks for your suggestion andresrcs, unfortunately no. I was looking for 6 of the 8 original records to be returned.

Your real data might have patterns of open/closed that aren't accounted for in my answer. Can you post a new data sample that has some additional patterns that appear in your real data?

Also, the NA_character_ values are there to handle logical tests that occur in the first and last rows of each trap_ID group. lag(status) returns a missing value for the first row of each group, while lead(status) returns a missing value for the last row of each group. For example, run lag(c("open", "closed")) or lead(c("open","closed")).

Hurray! You saved me. Thank you. It turns out I made a transcription error when applying your code to my real dataset, so now it worked. And I actually had to remove NA_character_ from both arguments because when NA_character_ was in the code it returned records at the end of the trap_ID group whereby the field staff had recorded "open" on the final date of data collection, but not recorded "closed" (and the same where "closed" was input on the first recorded date of data collection, but not "open" for the actual first date if that makes sense). Anyways, I wanted those on the end that didnʻt follow the pattern removed as well, and was able to do so because I understood what NA_character_ was doing.
Thanks a bunch.

1 Like

You're welcome! Good luck with your project!