Conditional loop writing for big data


#1

I am new in Rstudio as well as in R. I am working on big data. I need to clean that big data based on some conditions in column variables Mac_address, UNIX_Time.x and UNIX_time.y. I want to write the code like as follows: if (Mac_address is same in successive rows, UNIX_Time.x is also same and UNIX_time.y is different in successive rows) {then Keep the rows having minimum TT value. And also keep the rows having 2nd min.TT value, 3rd min.TT value, and 4th min.TT value if the difference between 1st min.TT value and 2nd min.TT value within the range of 60. and keep all other rows}.if FALSE: keep all rows.


#2

My advice to people tackling new problems is to invest in a bit of time thinking about what you might want to achieve, for example in what you are looking for is that you are working with groups of records, but implicitly if the mac address is the same as a later one separated by other addresses that is a new group. So a first step is identifying groups by changes in the mac address.

as you are dealing with sequential comparisons, lag() or lead() in dplyr would seem helpful (depending on if you are choosing to look forwards from an entry or back from an entry) a test along the lines of

mac == lag(mac)

or the converse with lead() would identify thresholds. You could apply similar principles for the other criteria you want, to identify the start of the group, and then cumsum() to cumulatively add the successful thresholds for each group.

Then you group_by() the group and filter to keep the row with the minimum value.

Because you are working with log files, you may need to convert textual formats of time to time formats.

But the question you haven’t addressed is when you say you are working on “big data” what does that mean in terms of limitations in your ability to work with the data. do you have more data than you can load into the computer’s memory at once?


#3

@thoughtfulnz

Result_Min

Thank you very much. I have used group_by() and I got minimum TT for each group. I also need 2nd min.TT, 3rd min.TT and 4th min.TT. I tried code

tt<- tt%>% group_by(Mac_address)
tt<- tt%>% group_by(Mac_address)
tt1<- tt%>%group_by(UNIX_Time.x)
tt2<- tt1%>% filter(TT== min(TT) & TT<=(min(TT)+1000))#gives only minimum TT in each group
tt3<- tt2%>%group_by(UNIX_Time.y)
tt4<- tt3%>% filter(TT== min(TT) & TT<(min(TT)+2000))#gives only minimum TT in each group

but it does not work.

But expect the result look like:

Could you have any suggestions?


#4

This kind of idea uses group_by to mark records on the basis of rules, then filter for keeping those that match

library(dplyr)
extxt <- "
M, UTx, UTy, TT
1,1,1,45
1,1,2,44
1,1,3,43
2,4,4,42
2,4,5,41
2,5,6,40
2,5,7,39
1,4,8,38"

ex <- read.csv(text=extxt)
# First we will build the thresholds, 
# because I am using lag() the nist NA is reset as true
wanted <- ex %>% mutate(threshold = (M != lag(M) & UTy != lag(UTy)) | 
                (UTx != lag(UTx) & UTy != lag(UTy)),
              threshold = ifelse(is.na(threshold), TRUE, threshold),
              ## use cumsum to get different numbers
              groupID = cumsum(threshold)) %>%
  # order them
  arrange(groupID, UTx, TT) %>%
  group_by(groupID) %>%
  mutate(keepme = TT < lag(TT) +60,
  # because of the arrange the first entry in thr group 
  # is the minimum, so we just rest the NA as before
  keepme = ifelse(is.na(keepme), TRUE, keepme)
  ) %>%
  ungroup() %>% 
  filter(keepme)

I would just use that approach as an option- from what you described above I am not clear on what should be the outcomes if Mac, UnixtimeX and UnixTimeY are all the same and potential cases like that.


#5

@thoughtfulnz, I am surprised with your code. This is just what I wanted. You are right. I need to do the same operation in each group of Mac_address; if UNIX_Time.y is duplicated and UNIX_time.x is not duplicated in successive rows. {then Keep the rows having minimum TT value. And also keep the rows having 2nd min.TT value, 3rd min.TT value, and 4th min.TT value and 5th min.TT (which represents successive detection of Mac) if the difference between [1st min.TT value and 2nd min.TT value =60]; [2nd min.TT value and 3rd min.TT value) =60 & (1st min.TT value and 3rd min.TT value)=120]; [3rd min.TT value and 4th min.TT value = 60 & (1st min.TT value and 4th min.TT value)=180]; [4th min.TT value and 5th min.TT value = 60 & (1st min.TT value and 5th min.TT value)=240]; and keep all other rows.
This is same as previous case but I add another conditions to get my expected results. Your filtering range [TT<lag(TT)+60] is absolutely amazing. For the successive detection, I need to connect with 1st min.TT and 5th min.TT as i mentioned.

I need to do another operation. I want to keep the rows having 90th percentile of TT for each 5 minutes from beginning. I mean,I want to calculate 95th percentile TT for first 5 minutes and keep the rows having 95th percentile TT within the first 5-Minutes. Then proceed to the next 5 minutes successively.


#6

for introducing time periods to your data you are going to want to introduce a new variable , I would lean heavily on the lubridate package for the practical functions, however there is a slight subtlety depending on if you are just wanting 5 minute periods relative to UTC (easy) or from the arbitrary starting point in your data (harder).

For the easy option, you can just use lubridate’s floor_date(x, “5 mins”) function.

For the harder option, one approach to make sure that you time range is normalised to the same starting point as the floor_date function, by subtracting the difference between the first entry and the midnight preceding it from each entry.

example <- data.frame(x = c(ISOdatetime(2018,10,1,1,2,3, tz="UTC"),ISOdatetime(2018,10,1,14,26,38, tz="UTC")))
example %>% mutate(date_chunk = floor_date(x - (min(x) - ISOdatetime(year(x), month(x), day(x), 0, 0, 0)), "5 mins")

Once you have a variable of what time group you are in there are 2 general strategies for combining filtered summarised data with the full rows of the data:
the commonest strategy would be to work out the summary percentile values for each group (I am guess the upper and lower percentiles), save that data set, then you a database type join operation to stick the summary value to each row of the original data, so you can then compare each row to see if it is in the desired range.

The other approach is more of a set theory one where you imagine adjusting the boundaries of a Venn diagram- one calculates the time groups, then groups the data to the time groups and uses mutate to introduce the summary values to each group. You can then ungroup() or group_by() to different grouping levels in the data.

One trick related to this that might be needed subsequently, depending on approach, is if you want to set all members of a current group to the first member of a group you can go mutate(x2 = rep(x[1], n())) which repeats the first value of x for the entire group.


#7
`@thoughtfulnz, I want to make group, sequentially, 
for each 5minutes from the starting to finishing time of my data.
 I want to keep the rows having travel time less than 95th percentile for each group. 
The 95th percentile will be counted for each group and will be applied for the same group only, not through the data. 
There is no need of arbitrary time period. I have tried this code but does not work but it counts 95th percentile for all data;`

tt$UNIX_Time.x<- as.POSIXct(tt$UNIX_Time.x, origin = "1970-01-01", tz = "GMT")
tt<- tt%>% mutate(interval = floor_date(tt$UNIX_Time.x, "5mins")) %>% 
  group_by(interval)%>%
  mutate(keepme = (TT <= quantile(TT, 0.95)),
  keepme = ifelse(is.na(keepme), TRUE, kepme)) %>%
  ungroup() %>% 
  filter(keepme)
Could you assist me? I will be highly grateful to you.

#8

If you are looking at setting time periods or working on time series, you should definitely check out the tibbletime package


#9

I am working on the time of a single day using lubridate package.
I want to make sequentially group for each 5minutes, then calculate 95th percentile of TT for each group.
then keep the rows having TT<96th percentile of TT for each group.
Could you make an example regarding my problem?


#10
@thoughtfulnz, Thanks from my heart. I am sorry for misunderstading. 
I have tried the following code and it works well. 

tt$UNIX_Time.y<- as.POSIXct(tt$UNIX_Time.x, origin = "1970-01-01", tz = "GMT")
tt5<- tt%>% mutate(interval = floor_date(tt$UNIX_Time.y,  "3mins")) %>% 
  group_by(interval)%>%
  filter(TT <= quantile(TT, 0.90))

Here I just made 3mins time bin. It starts from the earlier than my data. 
But I just want to get from starting of my data. 
my data starts at 6:37am but interval starts at 6:36. You can see in fig. 
Could you give  me further suggestion?


#11

floor_date uses midnight of the first day as the starting point, and makes intervals from that point forwards. For starting at a point during the day, you need to adjust the times so that they are all moved by the distance from midnight from the first date, this makes the starting point a “fake midnight” to have the intervals from.

See my earlier comments about the harder strategy for setting intervals.


#12

@thoughtfulnz, I have tried the following code to make group for each 5 minutes of my data. The time starts from the 6:37:00 and finishes at 20:30:01.
tt$UNIX_Time.x<- data.frame(UNIX_Time.x= c(ISOdatetime(2017,4,20,6,37,1, tz="UTC"), ISOdatetime(2017,4,20,20,30,1, tz="UTC"))) x<- tt$UNIX_Time.x tt<- tt %>% mutate(INTVL = floor_date(x - (min(x) - ISOdatetime(year(x), month(x), day(x), 6, 37, 0)), "5 mins"))%>% group_by(INTVL)

The output is like as attached image. I can’t identify error in my code. Could you help me , please?


#13

Looking back, I see at one point I should have used the minimum date when writing an example instead of the date. Have a look at this example. I have also spaced out the steps so it is easier to see what each step is doing

library (lubridate)

ex1 <- data.frame(tm = seq.POSIXt(from= ISOdatetime(2018,10,1,1,2,3, tz="UTC"),
                                      to= ISOdatetime(2018,10,1,2,13,3, tz="UTC"),
                                      by="min"))
# find earliest date
date1 <- min(ex1$tm)
# work out offset to midnight as floor_date is based on midnight
offset_for_floor <- date1 - ISOdatetime(year(date1), month(date1), day(date1), 0, 0, 0)

# calculate the fllor with the offset to get starting point of first value
ex2 <- ex1 %>% 
  mutate(date_chunk = floor_date(tm - offset_for_floor, "5 mins") + offset_for_floor)

#14

@thoughtfulnz, This is exactly what I expected. Thank you so much!


#15

I think tibbletime can do exactly this. Note that what I show here may change a little bit in the next CRAN release. This package is still in early dev. With version 0.1.0:

# install.packages("tibbletime")
library(tibbletime)
library(dplyr)

# Minutely series
ex1 <- create_series("2018-10-01 01:02:03" ~ "2018-10-01 02:13:03",
                     period = "1 min", tz = "UTC")

ex1 %>%
  mutate(date_group = collapse_index(date, "5 min"))
#> # A time tibble: 72 x 2
#> # Index: date
#>    date                date_group         
#>    <dttm>              <dttm>             
#>  1 2018-10-01 01:02:03 2018-10-01 01:06:03
#>  2 2018-10-01 01:03:03 2018-10-01 01:06:03
#>  3 2018-10-01 01:04:03 2018-10-01 01:06:03
#>  4 2018-10-01 01:05:03 2018-10-01 01:06:03
#>  5 2018-10-01 01:06:03 2018-10-01 01:06:03
#>  6 2018-10-01 01:07:03 2018-10-01 01:11:03
#>  7 2018-10-01 01:08:03 2018-10-01 01:11:03
#>  8 2018-10-01 01:09:03 2018-10-01 01:11:03
#>  9 2018-10-01 01:10:03 2018-10-01 01:11:03
#> 10 2018-10-01 01:11:03 2018-10-01 01:11:03
#> # ... with 62 more rows

#16

@davis, thanks for your answer.

@thoughtfulnz and @davis
I have solved previous problem by using lubridate package. I need some operations within the group. I am adding my current data format as image bellow.

Firstly, I want to define the upper threshold of each group as Tuthr, for example
Tuthr= 06:42:28 (Upper threshold of a group) and
Tlthr=06:37:29 (Lower threshold of a group) . Please see graph.
I tried the code:

Tuthr= lag(date_chunk) #but it doesn’t work.
Tlthr= min(date_chunk) # it doesn’t work.
Secondly, my data frame is defined as tt,
tt %>% group by(date_chunk) %>% #grouped based on thoughtfulnz’s answer
Tuthr= lag(date_chunk) %>% #Tuthr= uper threshold, Tlthr= Lower threshold
filter(T2<Tuthr)%>%
if T1>=Tuthr & T2<=Tlthr
mutate (T= Tuthr - Tlthr)

if T1<=Tuthr & T2<=Tlthr
mutate (T= T1 - Tlthr)

if T1>=Tuthr & T2>=Tlthr
mutate (T= Tuthr -T2)

if T1<Tuthr & T2>Tlthr
else
mutate (T= T1-T2)

I want to include some data from previous group to the next successive group based on the following conditions, dynamically. if T1>Tuthr (uper threshold) of a group, it should be included in both current and next group. Please consider the graph for better understanding. This graph represent movement of vehicles from P2 location to P3 location, C1 means camera position.
Your helping hand is highly appreciated.


#17

Making some observations about your code based on my understanding of it, if defining members of a group is being done on the basis of sharing the same date_chunk, then all members of the group share the same date_chunk, so and lag of date_chunk or lead of date_chunk is not going to work as it is the shared group identifier. Do you actually want the lag or lead of the original time within the group.

If you want information to cross between groups , the the questions becomes is there some other arrangement of group that could be distinguished, otherwise you may need to work without a grouping level (note the ungroup() command) and work out the crossover rules.


#18

@thoughtfulnz,
Firstly, I don’t want lag or lead of date_chunk. In some conditions, I need to use upper threshold or lower threshold of each group as I mentioned in last question and graph.

if T1>=Tuthr & T2<=Tlthr
mutate (T= Tuthr - Tlthr)
if T1<=Tuthr & T2<=Tlthr
mutate (T= T1 - Tlthr)
if T1>Tuthr & T2>=Tlthr
mutate (T= Tuthr -T2)
else
mutate (T= T1 -T2)

Secondly, I want information to cross between groups if T1>Tuthr of a group. I need to fulfil both conditions for my analysis. I will be highly benefited with your solution.


#19

Stating at the outset that the kind of problem you are approaching is not one I have tackled (so have no innate familiarity with the domain), and I do not have your data, here are a few thoughts that might be useful:

If for an individual entry your have its chunk and can derive its upper threshold by a matter of rule, it should be that here is some pathway of either:
creating a new grouping variable that is the orginal date_chunk, but has time added to it to displace it to the next chunk if it meets the rule.

A alternative form of that, as a general strategy) is creating a new grouping variable, and then using a self join between the old and new data sets to carry information across. But in those cases you need to be careful how many entries you are joining as you seem to be moving between aggregate and individual data.

As for the multiple if statements, use case_when, as in this example

library(dplyr)

data(iris) 
iris %>%
    mutate(
      Tval = case_when(
        Sepal.Width  >= Petal.Width | Sepal.Length >= Petal.Length  ~ Petal.Width - Sepal.Width,
        Sepal.Width  >= Petal.Width | Sepal.Length <= Petal.Length  ~ Petal.Length - Sepal.Length,
        Sepal.Width  <= Petal.Width | Sepal.Length <= Petal.Length  ~ Sepal.Length - Petal.Length,
        TRUE                      ~  Sepal.Width  - Petal.Width 
      )
    )


#20

@thoughtfulnz, I could not understand your last codes. I have just added reproducible dataset with grouping variable. I tried the following code but it produce zero. Could you re-code by using my data? date_chunk represents the grouping variable.

tt<- tt %>%
  mutate(date_chunk = floor_date(Time.x - offset_for_floor, "5 mins") + offset_for_floor)%>%
   group_by(date_chunk)%>%
   mutate(T= case_when(
                    T2<=min(date_chunk) | T1>=max(date_chunk) ~ max(date_chunk) - min(date_chunk),
                    T2<=min(date_chunk) | T1<=max(date_chunk) ~ T1 - min(date_chunk),
                    T2>=min(date_chunk) | T1>=max(date_chunk) ~ max(date_chunk)- T2,
                    T2>=max(date_chunk) | T1>=max(date_chunk) ~ 0,
                    T2>=min(date_chunk) | T1<=max(date_chunk) ~ T1- T2 
                    )
       )
Mac_address	                        UNIX_T.x	  UTC_T.x     Loc.x	UNIX_T.y	  UTC_T.y	  Loc.y	Link_1	TT	      T1	       T2	        date_chunk
6174ba6829753695bb111bfd3a1288d6	1492670249	4/20/2017 6:37	P2	1492670369	4/20/2017 6:39	P3	P2P3	120	1492670323	1492670323	4/20/2017 6:37
9ca4cf09f1caf2de82e3ec982892df7a	1492670253	4/20/2017 6:37	P2	1492674921	4/20/2017 7:55	P3	P2P3	4668	1492673132	1492673127	4/20/2017 6:37
2521a4d2e65100d0fe6c57f111e0109c	1492670258	4/20/2017 6:37	P2	1492671709	4/20/2017 7:01	P3	P2P3	1451	1492671153	1492671151	4/20/2017 6:37
2521a4d2e65100d0fe6c57f111e0109c	1492670290	4/20/2017 6:38	P2	1492671709	4/20/2017 7:01	P3	P2P3	1419	1492671165	1492671164	4/20/2017 6:37
a4f70f558c1f3479a36d62a8a22ab77f	1492670354	4/20/2017 6:39	P2	1492670614	4/20/2017 6:43	P3	P2P3	260	1492670514	1492670514	4/20/2017 6:37
2f15e55a9c3cd70349c1cc5b0d5db00c	1492670432	4/20/2017 6:40	P2	1492670584	4/20/2017 6:43	P3	P2P3	152	1492670526	1492670526	4/20/2017 6:37
35571d9eb84fc322e868ba0f42936a99	1492670489	4/20/2017 6:41	P2	1492670629	4/20/2017 6:43	P3	P2P3	140	1492670575	1492670575	4/20/2017 6:37
5ee52129ae79c183c194ede0b934aa53	1492670520	4/20/2017 6:42	P2	1492670651	4/20/2017 6:44	P3	P2P3	131	1492670601	1492670601	4/20/2017 6:37
2521a4d2e65100d0fe6c57f111e0109c	1492670557	4/20/2017 6:42	P2	1492671709	4/20/2017 7:01	P3	P2P3	1152	1492671268	1492671266	4/20/2017 6:42
e4297c1687b3fea3da2392564720c443	1492670564	4/20/2017 6:42	P2	1492691151	4/20/2017 12:25	P3	P2P3	20587	1492683263	1492683241	4/20/2017 6:42
c27b544af1552a77cd7a38b36923d037	1492670566	4/20/2017 6:42	P2	1492670707	4/20/2017 6:45	P3	P2P3	141	1492670653	1492670653	4/20/2017 6:42
4e68c4a8ff6a71e1fc4df59b587556a7	1492670595	4/20/2017 6:43	P2	1492670742	4/20/2017 6:45	P3	P2P3	147	1492670686	1492670686	4/20/2017 6:42
4bb0326fd40a4a159296cc20a5cc196c	1492670640	4/20/2017 6:44	P2	1492670772	4/20/2017 6:46	P3	P2P3	132	1492670721	1492670721	4/20/2017 6:42
badbb2a52973196ace7cb66c50aa39b2	1492670700	4/20/2017 6:45	P2	1492671974	4/20/2017 7:06	P3	P2P3	1274	1492671486	1492671485	4/20/2017 6:42
badbb2a52973196ace7cb66c50aa39b2	1492670761	4/20/2017 6:46	P2	1492671889	4/20/2017 7:04	P3	P2P3	1128	1492671457	1492671456	4/20/2017 6:42
badbb2a52973196ace7cb66c50aa39b2	1492670820	4/20/2017 6:47	P2	1492671880	4/20/2017 7:04	P3	P2P3	1060	1492671474	1492671473	4/20/2017 6:42
badbb2a52973196ace7cb66c50aa39b2	1492670849	4/20/2017 6:47	P2	1492671989	4/20/2017 7:06	P3	P2P3	1140	1492671552	1492671551	4/20/2017 6:47
badbb2a52973196ace7cb66c50aa39b2	1492670849	4/20/2017 6:47	P2	1492682285	4/20/2017 9:58	P3	P2P3	11436	1492677903	1492677891	4/20/2017 6:47
badbb2a52973196ace7cb66c50aa39b2	1492670849	4/20/2017 6:47	P2	1492671882	4/20/2017 7:04	P3	P2P3	1033	1492671486	1492671485	4/20/2017 6:47
badbb2a52973196ace7cb66c50aa39b2	1492670883	4/20/2017 6:48	P2	1492671872	4/20/2017 7:04	P3	P2P3	989	1492671493	1492671492	4/20/2017 6:47
badbb2a52973196ace7cb66c50aa39b2	1492670941	4/20/2017 6:49	P2	1492671872	4/20/2017 7:04	P3	P2P3	931	1492671515	1492671514	4/20/2017 6:47
badbb2a52973196ace7cb66c50aa39b2	1492671001	4/20/2017 6:50	P2	1492671889	4/20/2017 7:04	P3	P2P3	888	1492671549	1492671548	4/20/2017 6:47
badbb2a52973196ace7cb66c50aa39b2	1492671061	4/20/2017 6:51	P2	1492671880	4/20/2017 7:04	P3	P2P3	819	1492671566	1492671565	4/20/2017 6:47
badbb2a52973196ace7cb66c50aa39b2	1492671120	4/20/2017 6:52	P2	1492671979	4/20/2017 7:06	P3	P2P3	859	1492671650	1492671649	4/20/2017 6:47
badbb2a52973196ace7cb66c50aa39b2	1492671149	4/20/2017 6:52	P2	1492671989	4/20/2017 7:06	P3	P2P3	840	1492671667	1492671666	4/20/2017 6:52
badbb2a52973196ace7cb66c50aa39b2	1492671149	4/20/2017 6:52	P2	1492671991	4/20/2017 7:06	P3	P2P3	842	1492671668	1492671667	4/20/2017 6:52
badbb2a52973196ace7cb66c50aa39b2	1492671149	4/20/2017 6:52	P2	1492682285	4/20/2017 9:58	P3	P2P3	11136	1492678018	1492678006	4/20/2017 6:52
badbb2a52973196ace7cb66c50aa39b2	1492671182	4/20/2017 6:53	P2	1492671882	4/20/2017 7:04	P3	P2P3	700	1492671614	1492671613	4/20/2017 6:52
badbb2a52973196ace7cb66c50aa39b2	1492671240	4/20/2017 6:54	P2	1492671987	4/20/2017 7:06	P3	P2P3	747	1492671701	1492671700	4/20/2017 6:52
badbb2a52973196ace7cb66c50aa39b2	1492671306	4/20/2017 6:55	P2	1492671882	4/20/2017 7:04	P3	P2P3	576	1492671661	1492671661	4/20/2017 6:52
badbb2a52973196ace7cb66c50aa39b2	1492671361	4/20/2017 6:56	P2	1492671889	4/20/2017 7:04	P3	P2P3	528	1492671687	1492671686	4/20/2017 6:52
badbb2a52973196ace7cb66c50aa39b2	1492671450	4/20/2017 6:57	P2	1492671989	4/20/2017 7:06	P3	P2P3	539	1492671782	1492671782	4/20/2017 6:57
badbb2a52973196ace7cb66c50aa39b2	1492671450	4/20/2017 6:57	P2	1492682285	4/20/2017 9:58	P3	P2P3	10835	1492678133	1492678122	4/20/2017 6:57
badbb2a52973196ace7cb66c50aa39b2	1492671480	4/20/2017 6:58	P2	1492671884	4/20/2017 7:04	P3	P2P3	404	1492671729	1492671729	4/20/2017 6:57
badbb2a52973196ace7cb66c50aa39b2	1492671541	4/20/2017 6:59	P2	1492671889	4/20/2017 7:04	P3	P2P3	348	1492671756	1492671755	4/20/2017 6:57
badbb2a52973196ace7cb66c50aa39b2	1492671602	4/20/2017 7:00	P2	1492682285	4/20/2017 9:58	P3	P2P3	10683	1492678192	1492678180	4/20/2017 6:57
badbb2a52973196ace7cb66c50aa39b2	1492671662	4/20/2017 7:01	P2	1492671991	4/20/2017 7:06	P3	P2P3	329	1492671865	1492671865	4/20/2017 6:57
badbb2a52973196ace7cb66c50aa39b2	1492671722	4/20/2017 7:02	P2	1492671991	4/20/2017 7:06	P3	P2P3	269	1492671888	1492671888	4/20/2017 6:57