Remove rows if value in column repeats when using group_by()

Smalickwahls · May 13, 2022, 12:52am

I have a tibble containing small mammal trapping effort data that has some errors, and I want to remove those records from analysis. I can identify the errors by using either arrange() or group_by() on my "trap_id" column. When I arrange the data that way, it is clear that the "status" column sometimes has two records in a row stating that the trap was "closed" on two consecutive records, which is not possible. I want to conditionally filter the data so that, wherever "closed" repeats in the "status" column, the first record in the repeating series is removed.

How would I do this?
Thanks

andresrcs · May 13, 2022, 1:00am

Hi!

To help us help you, could you please prepare a reproducible example (reprex) illustrating your issue? Please have a look at this guide, to see how to create one:

FAQ: How to do a minimal reproducible example (reprex) for beginners


Smalickwahls · May 17, 2022, 12:46am

Hello,
I created the following code for a tibble that reproduces my issue, simulating what the data look like after I group_by(trap_ID). You can see cases where two dates in a row have the same status, either "open" or "closed". I want to write code that recognizes those cases and removes the first of two "open" records and the second of two repeating "closed" records, so that the final data set, when grouped by trap_ID, alternates between open and closed throughout.
Can you help?
Thanks

library(lubridate)
library(tidyverse)

df <- 
tribble(
  ~trap_ID, ~date, ~status,
  1, "october 01, 2016", "open",
  1, "october 03, 2016", "closed",
  1, "october 04, 2016", "open",
  1, "october 10, 2016", "closed",
  1, "october 12, 2016", "closed",
  2, "october 12, 2016", "open",
  2, "october 03, 2016", "open",
  2, "october 10, 2016", "closed")  %>% 
  mutate(
    date = mdy(date))

andresrcs · May 17, 2022, 1:12am

Is this what you mean?

library(lubridate)
library(tidyverse)

df <- 
    tribble(
        ~trap_ID, ~date, ~status,
        1, "october 01, 2016", "open",
        1, "october 03, 2016", "closed",
        1, "october 04, 2016", "open",
        1, "october 10, 2016", "closed",
        1, "october 12, 2016", "closed",
        2, "october 12, 2016", "open",
        2, "october 03, 2016", "open",
        2, "october 10, 2016", "closed")  %>% 
    mutate(
        date = mdy(date))

df %>%
    pivot_wider(names_from = status,
                values_from = date,
                values_fn = list) %>% 
    mutate(open = map(open, min),
           closed = map(closed, max)) %>% 
    unnest_longer(col = c(closed, open)) %>% 
    pivot_longer(cols = -trap_ID, names_to = 'status', values_to = 'date')
#> # A tibble: 4 × 3
#>   trap_ID status date      
#>     <dbl> <chr>  <date>    
#> 1       1 open   2016-10-01
#> 2       1 closed 2016-10-12
#> 3       2 open   2016-10-03
#> 4       2 closed 2016-10-10

Created on 2022-05-16 by the reprex package (v2.0.1)

joels · May 17, 2022, 1:22am

When the dates are sorted, there's no repeated status for trap_ID=2. Is that what you intended?

For the output, is this what you had in mind?

library(lubridate)
library(tidyverse)

df <- 
  tribble(
    ~trap_ID, ~date, ~status,
    1, "october 01, 2016", "open",
    1, "october 03, 2016", "closed",
    1, "october 04, 2016", "open",
    1, "october 10, 2016", "closed",
    1, "october 12, 2016", "closed",
    2, "october 12, 2016", "open",
    2, "october 03, 2016", "open",
    2, "october 10, 2016", "closed")  %>% 
  mutate(
    date = mdy(date))

df %>% 
  mutate(date=as_date(date)) %>% 
  arrange(trap_ID, date) %>% 
  group_by(trap_ID) %>% 
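  filter(
    # keep a "closed" row only if the previous row is "open" (or it is the group's first row)
    (status == "closed" & lag(status) %in% c(NA_character_, "open")) |
      # keep an "open" row only if the next row is "closed" (or it is the group's last row)
      (status == "open" & lead(status) %in% c(NA_character_, "closed")))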
#> # A tibble: 7 × 3
#> # Groups:   trap_ID [2]
#>   trap_ID date       status
#>     <dbl> <date>     <chr> 
#> 1       1 2016-10-01 open  
#> 2       1 2016-10-03 closed
#> 3       1 2016-10-04 open  
#> 4       1 2016-10-10 closed
#> 5       2 2016-10-03 open  
#> 6       2 2016-10-10 closed
#> 7       2 2016-10-12 open

Created on 2022-05-16 by the reprex package (v2.0.1)

Smalickwahls · May 17, 2022, 7:10pm

Hmmm.
First, sorry, I made a typo in my example df. I had intended it to be as below, where the three dates for trap_ID=2 are October 02, 03, and 10. I corrected that, ran your script, and ended up with 6 records, which is what I was expecting (alternating open/closed as described above). However, when I ran the code on my data set of 16,000 records, I got only 12 records, and all of them were status=closed.

I wasn't sure how "NA_character_" was operating in filter, so I also tried replacing it with ".", but got the same results with my large dataset.

df <-
  tribble(
    ~trap_ID, ~date, ~status,
    1, "october 01, 2016", "open",
    1, "october 03, 2016", "closed",
    1, "october 04, 2016", "open",
    1, "october 10, 2016", "closed",
    1, "october 12, 2016", "closed",
    2, "october 02, 2016", "open",
    2, "october 03, 2016", "open",
    2, "october 10, 2016", "closed")  %>% 
  mutate(
    date = mdy(date))

Smalickwahls · May 17, 2022, 7:11pm

Thanks for your suggestion andresrcs, unfortunately no. I was looking for 6 of the 8 original records to be returned.

joels · May 17, 2022, 7:31pm

Your real data might have patterns of open/closed that aren't accounted for in my answer. Can you post a new data sample that has some additional patterns that appear in your real data?

Also, the NA_character_ values are there to handle logical tests that occur in the first and last rows of each trap_ID group. lag(status) returns a missing value for the first row of each group, while lead(status) returns a missing value for the last row of each group. For example, run lag(c("open", "closed")) or lead(c("open","closed")).
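
A minimal sketch of the point above, assuming only that dplyr is loaded: it shows what lag() and lead() return at the ends of a vector, and how %in% treats the resulting missing values. The variable names are illustrative. It also suggests why swapping NA_character_ for "." does not act as a wildcard: %in% compares exact values, so "." is just a literal string that never occurs in the status column.

```r
library(dplyr)

status <- c("open", "closed")

lagged <- lag(status)   # first element has no predecessor, so it is NA
led    <- lead(status)  # last element has no successor, so it is NA

lagged
#> [1] NA     "open"
led
#> [1] "closed" NA

# %in% matches NA against NA, so listing NA_character_ lets the
# first/last row of each group pass the test.
NA_character_ %in% c(NA_character_, "open")
#> [1] TRUE

# Without NA_character_ in the set, a missing neighbour never matches.
NA_character_ %in% c("open")
#> [1] FALSE

# "." is compared literally; it is not a wildcard.
"open" %in% c(".", "closed")
#> [1] FALSE
```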

Smalickwahls · May 18, 2022, 1:38am

Hurray! You saved me. Thank you. It turns out I made a transcription error when applying your code to my real dataset, and now it works. I actually had to remove NA_character_ from both arguments, because with NA_character_ in the code it returned records at the end of a trap_ID group where the field staff had recorded "open" on the final date of data collection but never recorded "closed" (and likewise records where "closed" was entered on the first recorded date of data collection without an "open" for the actual first date, if that makes sense). Anyway, I wanted those records on the ends that didn't follow the pattern removed as well, and I was able to do that because I understood what NA_character_ was doing.
Thanks a bunch.
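
For readers following along, here is a hedged sketch of the variant described in this post: the same filter, with NA_character_ dropped from both sets. It uses the corrected sample data from the thread plus a hypothetical trap 3 (not in the original data) that begins with an unpaired "closed" and ends with an unpaired "open". The exact code used on the real dataset was not posted, so treat this as an illustration only.

```r
library(dplyr)
library(tibble)

# Corrected sample from the thread, plus a HYPOTHETICAL trap 3 whose first
# record is "closed" and whose last record is "open" (unpaired at both ends).
df <- tribble(
  ~trap_ID, ~date,        ~status,
  1,        "2016-10-01", "open",
  1,        "2016-10-03", "closed",
  1,        "2016-10-04", "open",
  1,        "2016-10-10", "closed",
  1,        "2016-10-12", "closed",
  2,        "2016-10-02", "open",
  2,        "2016-10-03", "open",
  2,        "2016-10-10", "closed",
  3,        "2016-10-01", "closed",
  3,        "2016-10-02", "open",
  3,        "2016-10-05", "closed",
  3,        "2016-10-06", "open"
) %>%
  mutate(date = as.Date(date))

cleaned <- df %>%
  arrange(trap_ID, date) %>%   # lag()/lead() depend on row order
  group_by(trap_ID) %>%        # so neighbours never cross trap boundaries
  filter(
    # "closed" must directly follow an "open"; a missing neighbour fails
    (status == "closed" & lag(status) %in% "open") |
      # "open" must be directly followed by a "closed"
      (status == "open" & lead(status) %in% "closed")
  ) %>%
  ungroup()

cleaned
```

With this sample, 8 of the 12 rows remain, and every trap keeps only open/closed pairs. The unpaired records at either end of the hypothetical trap 3 are dropped, along with the repeated "closed" for trap 1 and the first of the two "open" records for trap 2.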

joels · May 18, 2022, 4:02am

You're welcome! Good luck with your project!
