filtering large data set 1st integer above zero

Ash · December 2, 2020, 3:15pm

I currently have 10 minutes of data sampled in 100th of a second, measuring ground contact when walking, the data comes in the form of 0,0,0,0,0,0,0,0,15,40,70,80,80,80,40,15,0,0,0,0,0,0,0,0,0,20,45,80,80,75,60,50,20,0 and so on. The only data point I'm interested in is the first numerical value above 0 (ground contact) this will happen on multiple occasions, approx 1,000 - 1,500 and won't always be the same number.

Any help greatly appreciated.

ADernild · December 2, 2020, 3:38pm

Could you provide some sample data? It sound like you have one variable/column with a ton of rows, and you want to find the first row above 0? Without knowing the purpose of this I guess you could use the filter() function:

> library(tidyverse)
> df <- tibble(id = 0:100, x = 0:100)
> df
# A tibble: 101 x 2
      id     x
   <int> <int>
 1     0     0
 2     1     1
 3     2     2
 4     3     3
 5     4     4
 6     5     5
 7     6     6
 8     7     7
 9     8     8
10     9     9
# … with 91 more rows

> filter(df, x > 5)
# A tibble: 95 x 2
      id     x
   <int> <int>
 1     6     6
 2     7     7
 3     8     8
 4     9     9
 5    10    10
 6    11    11
 7    12    12
 8    13    13
 9    14    14
10    15    15
# … with 85 more rows

Basically filter() creates a subset of your data based on a logical expression, in this case I told filter to return x values above 5.

Ash · December 2, 2020, 4:15pm

Yes that's correct I want to find the first row above 0, but on multiple occasions. In the screenshot it is the force row I'm interested in filtering, this pattern repeats but with differing values above 0.

gavg712 · December 2, 2020, 5:52pm

Seems like a problem of Inter-Event Time, mainly used in time series of environmental variables . The aim is to separate/classify different events in the signal based on a Inter-Event Time where "nothing happen". I used the package IETD for rain events that could be used as well with your data. Some tricks to do are following:

library(tidyverse)
#> 
#> Attaching package: 'lubridate'
#> The following objects are masked from 'package:base':
#> 
#>     date, intersect, setdiff, union
# data
x <- c(0,0,0,0,0,0,0,0,15,40,70,80,80,80,
       40,15,0,0,0,0,0,0,0,0,0,20,45,80,
       80,75,60,50,20,0,0,1, 0, 5)
data_raw <- tibble(Frames = seq_along(x),
                       Time = runif(length(x), 4.5, 6.5),
                       Abs_time = Sys.time() + Time,
                       Force = x)
data_raw
#> # A tibble: 38 x 4
#>    Frames  Time Abs_time            Force
#>     <int> <dbl> <dttm>              <dbl>
#>  1      1  5.70 2020-12-02 12:50:46     0
#>  2      2  5.53 2020-12-02 12:50:45     0
#>  3      3  5.31 2020-12-02 12:50:45     0
#>  4      4  5.37 2020-12-02 12:50:45     0
#>  5      5  6.38 2020-12-02 12:50:46     0
#>  6      6  5.05 2020-12-02 12:50:45     0
#>  7      7  4.78 2020-12-02 12:50:45     0
#>  8      8  6.13 2020-12-02 12:50:46     0
#>  9      9  5.44 2020-12-02 12:50:45    15
#> 10     10  5.54 2020-12-02 12:50:45    40
#> # … with 28 more rows
#Build a data frame with a pseudo-timestamp as first column, 
# then signal column, and the everything else
data <- data_raw %>% 
  mutate(pseudo_date = Sys.time() + Frames * 1000) %>%
  select(pseudo_date, Force, everything())
data
#> # A tibble: 38 x 5
#>    pseudo_date         Force Frames  Time Abs_time           
#>    <dttm>              <dbl>  <int> <dbl> <dttm>             
#>  1 2020-12-02 13:07:20     0      1  5.70 2020-12-02 12:50:46
#>  2 2020-12-02 13:24:00     0      2  5.53 2020-12-02 12:50:45
#>  3 2020-12-02 13:40:40     0      3  5.31 2020-12-02 12:50:45
#>  4 2020-12-02 13:57:20     0      4  5.37 2020-12-02 12:50:45
#>  5 2020-12-02 14:14:00     0      5  6.38 2020-12-02 12:50:46
#>  6 2020-12-02 14:30:40     0      6  5.05 2020-12-02 12:50:45
#>  7 2020-12-02 14:47:20     0      7  4.78 2020-12-02 12:50:45
#>  8 2020-12-02 15:04:00     0      8  6.13 2020-12-02 12:50:46
#>  9 2020-12-02 15:20:40    15      9  5.44 2020-12-02 12:50:45
#> 10 2020-12-02 15:37:20    40     10  5.54 2020-12-02 12:50:45
#> # … with 28 more rows

# separate events
IETD::drawre(as.data.frame(data), 0, 0)[[2]] %>% 
  #extract first row of each event
  map_df(slice, n = 1)
#>           pseudo_date Force Frames     Time            Abs_time
#> 1 2020-12-02 15:20:40    15      9 5.442333 2020-12-02 12:50:45
#> 2 2020-12-02 20:04:00    20     26 6.315026 2020-12-02 12:50:46
#> 3 2020-12-02 22:50:40     1     36 5.312519 2020-12-02 12:50:45
#> 4 2020-12-02 23:24:00     5     38 5.815844 2020-12-02 12:50:46

^{Created on 2020-12-02 by the reprex package (v0.3.0)}

The psuedo_date column can be calculated with a base time like Sys.Time() plus a sequence numbers multiplied by 1000. This is needed because you data is 10 minutes long (no enough time range to use IETD functions). And the column reordering is because the drawre() function requirements.

FJCC · December 2, 2020, 8:18pm

Here is a simplistic approach.

DF <- data.frame(Time = seq(1,34),
                 Force = c(0,0,0,0,0,0,0,0,15,40,70,80,80,80,40,15,0,0,0,0,0,0,
                           0,0,0,20,45,80,80,75,60,50,20,0))
library(dplyr)

DF2 <- DF %>% mutate(Lag = lag(Force),
                    FirstNonZero = Force > 0 & Lag == 0) %>% 
  filter(FirstNonZero)
DF2
#>   Time Force Lag FirstNonZero
#> 1    9    15   0         TRUE
#> 2   26    20   0         TRUE

^{Created on 2020-12-02 by the reprex package (v0.3.0)}

AlexisW · December 2, 2020, 10:33pm

FJCC's solution would also be my favorite. If speed is a concern (very big dataset), you can also use a base R solution that is 10x faster, but a bit harder to read:

first_non_zero <- function(x){
  c(0,x[-length(x)]) == 0 & x > 0
}

DF3 <- DF[first_non_zero(DF$Force),]

Ash · December 3, 2020, 3:43pm

Thank you for your help @FJCC , this works.