select/filter rows with R: identify values with the same variable or attribute a code to different consecutive cycles.

Hello, I am beginner with R and now I am struggling with a new problem.

I have many data.frame structured like the example below.
Every data frame is a long consecutive repetition (approx. 80000 rows per file) of measurements taken from different ports (in the example, only from port 1 to 5). Every cycle from port 1 to 5 is repeated multiple times.

I would like to find the way to filter only the last 3 values of every port in each repeated cycle, and later to calculate the mean between the 3 values.

My problem is how to tell R that every cycle is kind of independent. I tried to dplyr::filter each port but I don’t know how to explain R that I have consecutive cycles. I hope it is possible.

Otherwise I thought to find a way to identify each consecutive cycle, so later I could work on the data in according to the number of cycle. Would it be possible?
Thank you

	date	        time	        port	nh3
1	28.01.2019	09:44:52.657	1	361.9
2	28.01.2019	09:44:54.206	1	358.6
3	28.01.2019	09:44:56.012	1	363.6
4	28.01.2019	09:44:57.836	1	368.0
5	28.01.2019	09:44:59.637	1	371.8
6	28.01.2019	09:45:01.427	2	374.8
7	28.01.2019	09:45:03.226	2	378.4
8	28.01.2019	09:45:05.026	2	381.8
9	28.01.2019	09:45:06.828	2	386.3
10	28.01.2019	09:45:08.637	2	390.3
11	28.01.2019	09:45:10.457	2	396.3
12	28.01.2019	09:45:12.269	2	399.1
13	28.01.2019	09:45:13.917	3	399.1
14	28.01.2019	09:45:15.483	3	399.2
15	28.01.2019	09:45:17.315	3	398.1
16	28.01.2019	09:45:19.125	3	397.5
17	28.01.2019	09:45:20.934	3	396.1
18	28.01.2019	09:45:22.755	4	394.9
19	28.01.2019	09:45:24.563	4	390.6
20	28.01.2019	09:45:26.378	4	387.1
21	28.01.2019	09:45:28.193	4	384.9
22	28.01.2019	09:45:30.013	4	382.7
23	28.01.2019	09:45:31.812	5	381.8
24	28.01.2019	09:45:33.615	5	380.6
25	28.01.2019	09:45:35.276	5	380.6
26	28.01.2019	09:45:36.834	5	382.2
27	28.01.2019	09:45:38.647	5	385.5
28	28.01.2019	09:45:40.456	5	387.0
29	28.01.2019	09:45:42.266	1	386.3
30	28.01.2019	09:45:44.065	1	384.6
31	28.01.2019	09:45:45.875	1	382.4
32	28.01.2019	09:45:47.703	2	381.1
33	28.01.2019	09:45:49.516	2	380.5
34	28.01.2019	09:45:53.167	2	375.2
35	28.01.2019	09:45:54.973	2	370.8
36	28.01.2019	09:45:56.615	2	370.8
37	28.01.2019	09:45:58.157	3	413.1
38	28.01.2019	09:45:59.956	3	439.2
39	28.01.2019	09:46:01.755	3	470.7
40	28.01.2019	09:46:03.585	3	473.6
41	28.01.2019	09:46:05.408	4	482.6
42	28.01.2019	09:46:07.232	4	483.1
43	28.01.2019	09:46:09.055	4	506.0
44	28.01.2019	09:46:10.869	4	504.6
45	28.01.2019	09:46:12.693	4	504.
46	28.01.2019	09:46:14.510	5	508.7
47	28.01.2019	09:46:16.332	5	508.5
48	28.01.2019	09:46:18.006	5	508.5
49	28.01.2019	09:46:19.581	5	503.9
50	28.01.2019	09:46:21.395	5	502.4
51	28.01.2019	09:46:23.214	5	494.9

Hi, thank you for your help.

every time, once the port 5 has finished, a new "cycle" starts from port 1 and run until port 5, and then again from 1 to 5....for many times

In may case my files are much longer and complex, but for the example here: I would like to isolate the last 3 values of every port (in each cycle) and calculate then the average, for example:

cycle            port	  mean_nh3
1	         1	  value1
1	         2	  value2
1	         3	  value3
1	         4	  value4
1	         5	  value5
2	         1	  value6
2	         2	  value7
2	         3	  value8
2	         4	  value9
2	         5	  value10

Or, if it is not possible I would also be happy to add a column that identify the "cycle", so I can later filter and work on dirrent tables before to rbind everything again, for example:

date          	time	        port	nh3	cycle
28.01.2019	09:44:52.657	1	361.9	1
28.01.2019	09:44:54.206	1	358.6	1
28.01.2019	09:44:56.012	1	363.6	1
28.01.2019	09:44:57.836	1	368	        1
28.01.2019	09:44:59.637	1	371.8	1
28.01.2019	09:45:01.427	2	374.8	1
28.01.2019	09:45:03.226	2	378.4	1
28.01.2019	09:45:05.026	2	381.8	1
28.01.2019	09:45:06.828	2	386.3	1
28.01.2019	09:45:08.637	2	390.3	1
28.01.2019	09:45:10.457	2	396.3	1
28.01.2019	09:45:12.269	2	399.1	1
28.01.2019	09:45:13.917	3	399.1	1
28.01.2019	09:45:15.483	3	399.2	1
28.01.2019	09:45:17.315	3	398.1	1
28.01.2019	09:45:19.125	3	397.5	1
28.01.2019	09:45:20.934	3	396.1	1
28.01.2019	09:45:22.755	4	394.9	1
28.01.2019	09:45:24.563	4	390.6	1
28.01.2019	09:45:26.378	4	387.1	1
28.01.2019	09:45:28.193	4	384.9	1
28.01.2019	09:45:30.013	4	382.7	1
28.01.2019	09:45:31.812	5	381.8	1
28.01.2019	09:45:33.615	5	380.6	1
28.01.2019	09:45:35.276	5	380.6	1
28.01.2019	09:45:36.834	5	382.2	1
28.01.2019	09:45:38.647	5	385.5	1
28.01.2019	09:45:40.456	5	387	        1
28.01.2019	09:45:42.266	1	386.3	2
28.01.2019	09:45:44.065	1	384.6	2
28.01.2019	09:45:45.875	1	382.4	2
28.01.2019	09:45:47.703	2	381.1	2
28.01.2019	09:45:49.516	2	380.5	2
28.01.2019	09:45:53.167	2	375.2	2
28.01.2019	09:45:54.973	2	370.8	2
28.01.2019	09:45:56.615	2	370.8	2
28.01.2019	09:45:58.157	3	413.1	2
28.01.2019	09:45:59.956	3	439.2	2
28.01.2019	09:46:01.755	3	470.7	2
28.01.2019	09:46:03.585	3	473.6	2
28.01.2019	09:46:05.408	4	482.6	2
28.01.2019	09:46:07.232	4	483.1	2
28.01.2019	09:46:09.055	4	506	        2
28.01.2019	09:46:10.869	4	504.6	2
28.01.2019	09:46:12.693	4	504	        2
28.01.2019	09:46:14.510	5	508.7	2
28.01.2019	09:46:16.332	5	508.5	2
28.01.2019	09:46:18.006	5	508.5	2
28.01.2019	09:46:19.581	5	503.9	2
28.01.2019	09:46:21.395	5	502.4	2
28.01.2019	09:46:23.214	5	494.9	2

every suggestion is useful :slight_smile:

I think this will give you the required averages.

dataset %>%
  mutate(temp = with(data = rle(x = port),
                     expr = rep.int(x = seq_along(along.with = values),
                                    times = lengths)),
         row_index = row_number()) %>%
  group_by(temp) %>%
  top_n(n = 3) %>%
  summarise(corresponding_port = unique(x = port),
            avg_of_last_three = mean(nh3))

I'm trying on the cycle part. Will update this very post (hopefully soon enough)

Edit

Does this help?

library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union

dataset <- tibble(date = c("28.1.19", "28.1.19", "28.1.19", "28.1.19", "28.1.19",
                           "28.1.19", "28.1.19", "28.1.19", "28.1.19",
                           "28.1.19", "28.1.19", "28.1.19", "28.1.19", "28.1.19",
                           "28.1.19", "28.1.19", "28.1.19", "28.1.19", "28.1.19",
                           "28.1.19", "28.1.19", "28.1.19", "28.1.19", "28.1.19",
                           "28.1.19", "28.1.19", "28.1.19", "28.1.19", "28.1.19",
                           "28.1.19", "28.1.19", "28.1.19", "28.1.19",
                           "28.1.19", "28.1.19", "28.1.19", "28.1.19", "28.1.19",
                           "28.1.19", "28.1.19", "28.1.19", "28.1.19", "28.1.19",
                           "28.1.19", "28.1.19", "28.1.19", "28.1.19", "28.1.19",
                           "28.1.19", "28.1.19", "28.1.19"),
                  time = c("09:44:52.657", "09:44:54.206", "09:44:56.012",
                           "09:44:57.836", "09:44:59.637", "09:45:01.427",
                           "09:45:03.226", "09:45:05.026", "09:45:06.828",
                           "09:45:08.637", "09:45:10.457", "09:45:12.269",
                           "09:45:13.917", "09:45:15.483", "09:45:17.315", "09:45:19.125",
                           "09:45:20.934", "09:45:22.755", "09:45:24.563",
                           "09:45:26.378", "09:45:28.193", "09:45:30.013",
                           "09:45:31.812", "09:45:33.615", "09:45:35.276", "09:45:36.834",
                           "09:45:38.647", "09:45:40.456", "09:45:42.266",
                           "09:45:44.065", "09:45:45.875", "09:45:47.703",
                           "09:45:49.516", "09:45:53.167", "09:45:54.973", "09:45:56.615",
                           "09:45:58.157", "09:45:59.956", "09:46:01.755",
                           "09:46:03.585", "09:46:05.408", "09:46:07.232",
                           "09:46:09.055", "09:46:10.869", "09:46:12.693", "09:46:14.510",
                           "09:46:16.332", "09:46:18.006", "09:46:19.581",
                           "09:46:21.395", "09:46:23.214"),
                  port = c(1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 4, 4, 4, 4,
                           4, 5, 5, 5, 5, 5, 5, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3,
                           3, 3, 4, 4, 4, 4, 4, 5, 5, 5, 5, 5, 5),
                  nh3 = c(361.9, 358.6, 363.6, 368, 371.8, 374.8, 378.4, 381.8, 386.3,
                          390.3, 396.3, 399.1, 399.1, 399.2, 398.1, 397.5,
                          396.1, 394.9, 390.6, 387.1, 384.9, 382.7, 381.8, 380.6,
                          380.6, 382.2, 385.5, 387, 386.3, 384.6, 382.4,
                          381.1, 380.5, 375.2, 370.8, 370.8, 413.1, 439.2, 470.7,
                          473.6, 482.6, 483.1, 506, 504.6, 504, 508.7, 508.5,
                          508.5, 503.9, 502.4, 494.9))

dataset %>%
  mutate(cycle = cumsum(x = (((port == 1) & (lag(x = port) != 1)) | (row_number() == 1))),
         row_index = row_number()) %>%
  group_by(cycle, port) %>%
  top_n(3) %>%
  summarise(average_of_last_three_nh3 = mean(x = nh3)) %>%
  ungroup()
#> Selecting by row_index
#> # A tibble: 10 x 3
#>    cycle  port average_of_last_three_nh3
#>    <int> <dbl>                     <dbl>
#>  1     1     1                      368.
#>  2     1     2                      395.
#>  3     1     3                      397.
#>  4     1     4                      385.
#>  5     1     5                      385.
#>  6     2     1                      384.
#>  7     2     2                      372.
#>  8     2     3                      461.
#>  9     2     4                      505.
#> 10     2     5                      500.

Created on 2019-08-23 by the reprex package (v0.3.0)

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.