Removing all rows after a certain time point depending on a condition

Hi,

I am trying to remove all rows from a dataframe based on a value in a column, but for each unique ID as shown below.

ID | Time | Data | Pregnancy
1 | 0 | 1 | Yes
1 | 1 | 0 | No
2 | 3 | 1 | No
3 | 0 | 1 | No
3 | 1 | 0 | Yes
3 | 4 | 1 | No

so it would show in the end:
ID | Time | Data | Pregnancy
2 | 3 | 1 | No
3 | 0 | 1 | No

So I am trying to ask, if df$Pregnancy = "Yes" then remove all rows for that same ID where Time is after or equal to the Time at Pregnancy.

ID <- c("1", "1", "2", "3", "3", "3")
Time <- c(0L, 1L, 3L, 0L, 1L, 4L)
data <- c("1", "0", "1", "0", "0", "1")
Pregnancy <- c("yes", "No", "No", "No", "Yes", "No")
df <- data.frame(ID, Time, data, Pregnancy)

I am stuck as to the code I would require to do this, and any help would be greatly appreciated as still very new to R!

I've got 2 solutions for you.

dplyr

library(dplyr)

df |>
  subset(preg == "No") |>
  group_by(id) |>
  slice_min(time)

data.table

library(data.table)

dt[preg == "No", .(time = min(time)), .(id, data, preg)][, .(id, time, data, preg)][]

I'd also suggest that you format your post so that it's more legible in the future.

Hi,

Thank you for your reply, I tried both your solutions and have formatted the data better now!
Using the dplyr code you suggested, the original:

 ID Time data Pregnancy
1  1    0    1       yes
2  1    1    0        No
3  2    3    1        No
4  3    0    0        No
5  3    1    0       Yes
6  3    4    1        No

turned into:

# A tibble: 3 × 4
# Groups:   ID [3]
  ID     Time data  Pregnancy
  <chr> <int> <chr> <chr>    
1 1         1 0     No       
2 2         3 1     No       
3 3         0 0     No

So some worked correctly, whereas the row 1 should have also been removed, since that ID was pregnant at Time 0 and it has kept in the row for this ID (1) at Time 1 which is after the pregnancy. But the others seemed to have worked.

Also, the second data.table package gave me this error:

dt[Pregnancy == "No", .(Time = min(Time)), .(ID, data, Pregnancy)][, .(ID, Time, data, Pregnancy)][]
Error in .(Time = min(Time)) : could not find function "."

Thank you again.

If you're keeping with dplyr, then you'd rearrange it like this;

df |>
  group_by(id) |>
  slice_min(time) |>
  subset(preg == "No")

Brilliant, thank you so much that worked!

I'm not entirely sure why it would return an error.
This may be a possible solution; r - Data table error could not find function "." - Stack Overflow

Regardless, this is how I would rewrite the data.table code to match the outcome of the dplyr code.

dt[, .SD[which.min(Time), .(Time, data, Pregnancy)], ID][Pregnancy == "No"][]

Thank you I appreciate your help!

1 Like

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.