Locate row indices with given conditions

tanishahudson1 · March 1, 2020, 7:27am

Background

I have a very large dataset, df, with Length, Date and Edit columns. My goal is to iterate through this dataset and find the index, start and end times for a given condition.

Working backwards, I need to get the index or row number where Edit is False, with the condition that the previous Edit is True. This will output an 'End' and the value that is in the Length column.

The Start is found by going backwards from the 'End' index (where Edit is False) until reaching the next row where Edit is False; the Start is the row right after that one (that index + 1).

 Length        Date                               Edit

  20            1/2/2020 1:00:00 AM               False
  21            1/2/2020 1:00:01 AM               True
  81            1/2/2020 1:00:02 AM               True
  81            1/2/2020 1:00:03 AM               True
  90            1/2/2020 1:00:04 AM               False
  20            1/2/2020 1:00:05 AM               True
  90            1/2/2020 1:00:06 AM               True
  81            1/2/2020 1:00:10 AM               True
  90            1/2/2020 1:00:15 AM               False        
  20            1/2/2020 1:00:25 AM               True

Desired output

Start                   End                   Duration   RowNum      Length 

1/2/2020 1:00:05 AM     1/2/2020 1:00:15 AM   10         8              90
1/2/2020 1:00:01 AM     1/2/2020 1:00:04 AM   3          4              90

Starting backwards, we see that the first End time is 1/2/2020 1:00:15 AM, because Edit is False and its previous Edit value is True. The Length is 90, and the RowNum is 8. The Start is found by going backwards from 1/2/2020 1:00:15 AM until we come to another line where Edit is False, plus 1, so it would be 1/2/2020 1:00:05 AM.

dput

structure(list(Length = c(20L, 21L, 81L, 81L, 90L, 20L, 90L, 
81L, 90L, 20L), Date = structure(1:10, .Label = c("1/2/2020 1:00:00 AM", 
"1/2/2020 1:00:01 AM", "1/2/2020 1:00:02 AM", "1/2/2020 1:00:03 AM", 
"1/2/2020 1:00:04 AM", "1/2/2020 1:00:05 AM", "1/2/2020 1:00:06 AM", 
"1/2/2020 1:00:10 AM", "1/2/2020 1:00:15 AM", "1/2/2020 1:00:25 AM"
 ), class = "factor"), Edit = c(FALSE, TRUE, TRUE, TRUE, FALSE, 
TRUE, TRUE, TRUE, FALSE, TRUE)), class = "data.frame", row.names = c(NA, 
-10L))

I have tried

 library(dplyr)
 library(readr)

 for (i in 1:nrow(df) {


if (df[i] == Edit == "False") {
print(df[rows]) 
}
    else if (df[i] < condition) {
print(df[rows])

}
    }

   mutate(Date = as.POSIXct(Date, format = '%m/%d/%Y %H:%M:%OS')) %>%
   mutate(RowNum = cumsum(!cond)) %>%
   group_by(Length) %>%
   summarize(Start = min(Date),
        End = max(Date),
        Duration = End - Start) %>%

I am still researching how to put this all together. Any help or suggestions are greatly appreciated.

Yarnabrina · March 1, 2020, 8:31am

Frankly speaking, I do not understand your question at all. I am not even sure whether your code works, as we do not know what cond is. I also do not understand how you avoided an error while using two == consecutively. Please post a reproducible example of your problem from your next post onwards.

Assuming RowNum is the normal row number (and not the cumsum of some condition that you did not specify) and you want the RowNum and Length of the last row of each group, the following code probably works:

library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union

sample_data <- tibble(Length = c(20L, 21L, 81L, 81L, 90L, 20L, 90L, 81L, 90L, 20L),
                      Date = c("1/2/2020 1:00:00 AM", "1/2/2020 1:00:01 AM", "1/2/2020 1:00:02 AM", "1/2/2020 1:00:03 AM", "1/2/2020 1:00:04 AM", "1/2/2020 1:00:05 AM", "1/2/2020 1:00:06 AM", "1/2/2020 1:00:10 AM", "1/2/2020 1:00:15 AM", "1/2/2020 1:00:25 AM"),
                      Edit = c(FALSE, TRUE, TRUE, TRUE, FALSE, TRUE, TRUE, TRUE, FALSE, TRUE))

sample_data %>%
  mutate(Date = as.POSIXct(x = Date,
                           format = "%m/%d/%Y %H:%M:%OS"),
         RowNum = row_number()) %>%
  arrange(-RowNum) %>%
  mutate(temp1 = !Edit & lead(x = Edit, n = 1),
         temp2 = cumsum(x = temp1)) %>%
  group_by(temp2) %>%
  filter(any(temp1, na.rm = TRUE)) %>%
  summarise(Start = min(Date),
            End = max(Date),
            Duration = End - Start,
            RowNum= first(x = RowNum),
            Length = first(x = Length)) %>%
  select(-temp2)
#> # A tibble: 2 x 5
#>   Start               End                 Duration RowNum Length
#>   <dttm>              <dttm>              <drtn>    <int>  <int>
#> 1 2020-01-02 01:00:05 2020-01-02 01:00:15 10 secs       9     90
#> 2 2020-01-02 01:00:01 2020-01-02 01:00:04  3 secs       5     90

Created on 2020-03-01 by the reprex package (v0.3.0)
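
For readers who want to see which rows the pipeline above picks, here is a minimal base R sketch of the same idea. It is not the code from the post, only an illustration under stated assumptions: the names df, prev_edit, end_idx, false_idx, start_idx and result are made up for this example, RowNum is taken to be the 1-based row number, and the time zone is fixed to UTC purely to keep the output stable.

```r
# Same sample data as above, in a plain data frame (tz = "UTC" is an assumption).
df <- data.frame(
  Length = c(20L, 21L, 81L, 81L, 90L, 20L, 90L, 81L, 90L, 20L),
  Date = as.POSIXct(
    c("1/2/2020 1:00:00 AM", "1/2/2020 1:00:01 AM", "1/2/2020 1:00:02 AM",
      "1/2/2020 1:00:03 AM", "1/2/2020 1:00:04 AM", "1/2/2020 1:00:05 AM",
      "1/2/2020 1:00:06 AM", "1/2/2020 1:00:10 AM", "1/2/2020 1:00:15 AM",
      "1/2/2020 1:00:25 AM"),
    format = "%m/%d/%Y %H:%M:%OS", tz = "UTC"
  ),
  Edit = c(FALSE, TRUE, TRUE, TRUE, FALSE, TRUE, TRUE, TRUE, FALSE, TRUE)
)

# Edit value of the previous row; the first row has no previous row, hence NA.
prev_edit <- c(NA, head(df$Edit, -1))

# An End row has Edit == FALSE and a previous Edit == TRUE (which() drops the NA).
end_idx <- which(!df$Edit & prev_edit)

# For each End row, the Start is the row after the closest earlier FALSE row.
# If there is no earlier FALSE row, fall back to row 1 (an assumption).
false_idx <- which(!df$Edit)
start_idx <- vapply(end_idx, function(e) {
  prior <- false_idx[false_idx < e]
  if (length(prior) > 0) max(prior) + 1L else 1L
}, integer(1))

result <- data.frame(
  Start = df$Date[start_idx],
  End = df$Date[end_idx],
  Duration = as.numeric(difftime(df$Date[end_idx], df$Date[start_idx],
                                 units = "secs")),
  RowNum = end_idx,
  Length = df$Length[end_idx]
)

# Latest End first, matching the order of the desired output.
result <- result[order(-result$RowNum), ]
print(result)
```

On the sample data this should select rows 9 and 5 as the End rows, the same rows the dplyr pipeline reports. Looking at end_idx and start_idx directly can make it easier to check the grouping logic on a larger dataset.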

Edit

If you want the Duration column in seconds only, you have to use the difftime function. There may be another way, but I do not know.

Replace Duration = End - Start with Duration = difftime(time1 = End, time2 = Start, units = "secs"). I cannot check as I don't have data, but that is supposed to do the job. Then depending on whether you need to keep it as a difftime object, or as a float or as a string, you can keep it as is, or use as.numeric or use format.
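
As a small illustration of the difference, here is a sketch with two made-up timestamps (start and end are hypothetical values, not taken from the data, and tz = "UTC" is an assumption). Subtracting two POSIXct values lets R choose the units automatically, while difftime with units = "secs" pins them down.

```r
# Two hypothetical timestamps, 130 seconds apart.
start <- as.POSIXct("2020-01-02 01:00:05", tz = "UTC")
end   <- as.POSIXct("2020-01-02 01:02:15", tz = "UTC")

# Plain subtraction: R picks the units itself (minutes for this gap).
auto_dur <- end - start
print(units(auto_dur))

# difftime with explicit units always reports seconds.
dur <- difftime(time1 = end, time2 = start, units = "secs")
print(dur)

# Convert to a plain number or a string, depending on what is needed downstream.
dur_num <- as.numeric(dur)
dur_chr <- format(dur)
print(dur_num)
print(dur_chr)
```

Inside summarise, the same call would look like Duration = difftime(time1 = End, time2 = Start, units = "secs"), optionally wrapped in as.numeric.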

Hope this helps.

tanishahudson1 · March 1, 2020, 8:42am

Sorry, I am new. This appears to work, thank you. Is there a way to output the Duration in seconds? It seems to be giving me minutes and seconds.

tanishahudson1 · March 1, 2020, 9:41am

I will try, thank you.
