I am new to RStudio as well as to R. I am working with a big data set that I need to clean based on conditions on the columns Mac_address, UNIX_Time.x, and UNIX_Time.y. In pseudocode, I want something like: if (Mac_address is the same in successive rows, UNIX_Time.x is also the same, and UNIX_Time.y is different in successive rows) {then keep the row having the minimum TT value. Also keep the rows having the 2nd, 3rd, and 4th smallest TT values if the difference between the 1st and 2nd smallest TT values is within a range of 60, and keep all other rows}. If FALSE: keep all rows.
My advice to people tackling new problems is to invest a bit of time thinking about what you want to achieve. For example, in what you are looking for, you are working with groups of records, but implicitly, if a mac address is the same as a later one separated by other addresses, that is a new group. So a first step is identifying groups by changes in the mac address.
As you are dealing with sequential comparisons, lag() or lead() in dplyr would seem helpful (depending on whether you choose to look forwards from an entry or back from one). A test along the lines of
mac == lag(mac)
or the converse with lead() would identify thresholds. You could apply similar principles to the other criteria you want, to identify the start of each group, and then use cumsum() to cumulatively add up the successive thresholds into group numbers.
Then you group_by() the group and filter() to keep the row with the minimum value.
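A minimal sketch of that lag()/cumsum()/filter() pattern, using made-up mac and tt columns rather than your real data:

```r
library(dplyr)

df <- data.frame(mac = c("a", "a", "b", "b", "a"),
                 tt  = c(5, 3, 9, 7, 1))

result <- df %>%
  mutate(new_group = mac != lag(mac),                            # TRUE at each change of mac
         new_group = ifelse(is.na(new_group), TRUE, new_group),  # first row starts a group
         group_id  = cumsum(new_group)) %>%                      # running count = group number
  group_by(group_id) %>%
  filter(tt == min(tt)) %>%                                      # keep the minimum per group
  ungroup()
# the final "a" row is its own group, because other macs separate it from the first two
```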
Because you are working with log files, you may need to convert textual formats of time to time formats.
But a question you haven't addressed: when you say you are working on "big data", what does that mean in terms of limitations on your ability to work with the data? Do you have more data than you can load into the computer's memory at once?
Thank you very much. I have used group_by() and got the minimum TT for each group. I also need the 2nd, 3rd, and 4th smallest TT values. I tried this code:
tt <- tt %>% group_by(Mac_address)
tt1 <- tt %>% group_by(UNIX_Time.x)
tt2 <- tt1 %>% filter(TT == min(TT) & TT <= (min(TT) + 1000)) # gives only the minimum TT in each group
tt3 <- tt2 %>% group_by(UNIX_Time.y)
tt4 <- tt3 %>% filter(TT == min(TT) & TT < (min(TT) + 2000)) # gives only the minimum TT in each group
but it does not work.
I expect the result to look like:
Do you have any suggestions?
This kind of idea uses group_by() to mark records on the basis of rules, then filter() to keep those that match:
library(dplyr)
extxt <- "
M, UTx, UTy, TT
1,1,1,45
1,1,2,44
1,1,3,43
2,4,4,42
2,4,5,41
2,5,6,40
2,5,7,39
1,4,8,38"
ex <- read.csv(text=extxt)
# First we will build the thresholds;
# because I am using lag(), the first NA is reset to TRUE
wanted <- ex %>% mutate(threshold = (M != lag(M) & UTy != lag(UTy)) |
(UTx != lag(UTx) & UTy != lag(UTy)),
threshold = ifelse(is.na(threshold), TRUE, threshold),
## use cumsum() to get distinct group numbers
groupID = cumsum(threshold)) %>%
# order them
arrange(groupID, UTx, TT) %>%
group_by(groupID) %>%
mutate(keepme = TT < lag(TT) + 60,
# because of the arrange(), the first entry in the group
# is the minimum, so we just reset the NA as before
keepme = ifelse(is.na(keepme), TRUE, keepme)
) %>%
ungroup() %>%
filter(keepme)
I would treat that approach as just one option. From what you described above, I am not clear on what the outcome should be if Mac, UNIX_Time.x, and UNIX_Time.y are all the same, and potential cases like that.
@thoughtfulnz, I am amazed by your code. This is just what I wanted. You are right: I need to do the same operation in each group of Mac_address if UNIX_Time.y is duplicated and UNIX_Time.x is not duplicated in successive rows. Then keep the row with the minimum TT value, and also keep the rows with the 2nd, 3rd, 4th, and 5th smallest TT values (which represent successive detections of the Mac) if: the difference between the 1st and 2nd smallest TT values is 60; the difference between the 2nd and 3rd smallest is 60 and between the 1st and 3rd is 120; the difference between the 3rd and 4th smallest is 60 and between the 1st and 4th is 180; the difference between the 4th and 5th smallest is 60 and between the 1st and 5th is 240; and keep all other rows.
This is the same as the previous case, but I add further conditions to get my expected results. Your filtering range [TT < lag(TT) + 60] is absolutely amazing. For the successive detections, I need to connect the 1st through 5th smallest TT values as I mentioned.
I need to do another operation. I want to keep the rows around the 95th percentile of TT for each 5 minutes from the beginning. That is, I want to calculate the 95th percentile of TT for the first 5 minutes and keep the rows whose TT is within the 95th percentile for that first 5 minutes, then proceed to the next 5 minutes, and so on.
For introducing time periods to your data, you are going to want to introduce a new variable. I would lean heavily on the lubridate package for the practical functions; however, there is a slight subtlety depending on whether you want 5-minute periods relative to UTC (easy) or relative to the arbitrary starting point of your data (harder).
For the easy option, you can just use lubridate's floor_date(x, "5 mins") function.
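A quick illustration of the easy option with a couple of toy timestamps:

```r
library(lubridate)

x <- as.POSIXct(c("2018-10-01 01:02:03", "2018-10-01 01:07:03"), tz = "UTC")
chunks <- floor_date(x, "5 mins")  # rounds each time down to the nearest 5-minute mark
# chunks is "2018-10-01 01:00:00 UTC" "2018-10-01 01:05:00 UTC"
```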
For the harder option, one approach is to make sure that your time range is normalised to the same starting point as the floor_date() function, by subtracting from each entry the difference between the first entry and the midnight preceding it.
example <- data.frame(x = c(ISOdatetime(2018,10,1,1,2,3, tz="UTC"),ISOdatetime(2018,10,1,14,26,38, tz="UTC")))
example %>% mutate(date_chunk = floor_date(x - (min(x) - ISOdatetime(year(x), month(x), day(x), 0, 0, 0)), "5 mins"))
Once you have a variable saying which time group you are in, there are two general strategies for combining filtered, summarised data with the full rows of the data:
The commonest strategy would be to work out the summary percentile values for each group (I am guessing the upper and lower percentiles), save that data set, then use a database-type join operation to stick the summary values onto each row of the original data, so you can then compare each row to see if it is in the desired range.
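A toy sketch of that join strategy, with hypothetical grp and tt columns and a 95th-percentile upper limit:

```r
library(dplyr)

df <- data.frame(grp = c(1, 1, 2, 2),
                 tt  = c(10, 100, 20, 200))

# one summary row per group
limits <- df %>%
  group_by(grp) %>%
  summarise(upper = quantile(tt, 0.95))

# join the summary back, then compare each row against its group's limit
kept <- df %>%
  left_join(limits, by = "grp") %>%
  filter(tt <= upper)
```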
The other approach is more of a set-theory one, where you imagine adjusting the boundaries of a Venn diagram: one calculates the time groups, then groups the data by the time groups and uses mutate() to introduce the summary values to each group. You can then ungroup() or group_by() different grouping levels in the data.
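A toy sketch of this grouped-mutate strategy (hypothetical grp and tt columns), which avoids the separate join step:

```r
library(dplyr)

df <- data.frame(grp = c(1, 1, 2, 2),
                 tt  = c(10, 100, 20, 200))

kept <- df %>%
  group_by(grp) %>%
  mutate(upper = quantile(tt, 0.95)) %>%  # summary value repeated within each group
  ungroup() %>%
  filter(tt <= upper)
```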
One trick related to this that might be needed subsequently, depending on approach: if you want to set all members of the current group to the first member of the group, you can write mutate(x2 = rep(x[1], n())), which repeats the first value of x for the entire group.
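For example, with a hypothetical grouped data frame:

```r
library(dplyr)

df <- data.frame(g = c(1, 1, 2, 2, 2),
                 x = c(5, 9, 3, 8, 4))

out <- df %>%
  group_by(g) %>%
  mutate(x_first = rep(x[1], n())) %>%  # copy the group's first x onto every row
  ungroup()
# x_first is 5 5 3 3 3
```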
@thoughtfulnz, I want to make groups sequentially, one for each 5 minutes from the starting to the finishing time of my data. I want to keep the rows having travel time less than the 95th percentile for each group. The 95th percentile will be calculated for each group and applied to that group only, not across the whole data set. There is no need for an arbitrary time period. I have tried this code, but it does not work; it calculates the 95th percentile over all the data:
tt$UNIX_Time.x <- as.POSIXct(tt$UNIX_Time.x, origin = "1970-01-01", tz = "GMT")
tt <- tt %>% mutate(interval = floor_date(tt$UNIX_Time.x, "5 mins")) %>%
group_by(interval) %>%
mutate(keepme = (TT <= quantile(TT, 0.95)),
keepme = ifelse(is.na(keepme), TRUE, keepme)) %>%
ungroup() %>%
filter(keepme)
Could you assist me? I will be highly grateful to you.
If you are looking at setting time periods or working on time series, you should definitely check out the tibbletime package
I am working with times from a single day using the lubridate package.
I want to make sequential groups of 5 minutes each, then calculate the 95th percentile of TT for each group,
then keep the rows having TT less than the 95th percentile of TT for each group.
Could you make an example for my problem?
@thoughtfulnz, thanks from my heart. I am sorry for the misunderstanding.
I have tried the following code and it works well:
tt$UNIX_Time.y <- as.POSIXct(tt$UNIX_Time.x, origin = "1970-01-01", tz = "GMT")
tt5 <- tt %>% mutate(interval = floor_date(tt$UNIX_Time.y, "3 mins")) %>%
group_by(interval) %>%
filter(TT <= quantile(TT, 0.90))
Here I just made 3-minute time bins. But the bins start earlier than my data, and I want them to start from the beginning of my data: my data starts at 6:37 am but the intervals start at 6:36. You can see it in the figure.
Could you give me a further suggestion?
floor_date() uses midnight of the first day as the starting point and makes intervals forwards from that point. For starting at a point during the day, you need to adjust the times so that they are all shifted by the first entry's distance from midnight; this makes the starting point a "fake midnight" to build the intervals from.
See my earlier comments about the harder strategy for setting intervals.
@thoughtfulnz, I have tried the following code to make groups of 5 minutes each from my data. The time starts at 6:37:00 and finishes at 20:30:01.
tt$UNIX_Time.x <- data.frame(UNIX_Time.x = c(ISOdatetime(2017,4,20,6,37,1, tz="UTC"), ISOdatetime(2017,4,20,20,30,1, tz="UTC")))
x <- tt$UNIX_Time.x
tt <- tt %>% mutate(INTVL = floor_date(x - (min(x) - ISOdatetime(year(x), month(x), day(x), 6, 37, 0)), "5 mins")) %>%
group_by(INTVL)
The output is like the attached image. I can't identify the error in my code. Could you help me, please?
Looking back, I see that at one point I should have used the minimum date when writing the example, instead of the date of each entry. Have a look at this example. I have also spaced out the steps so it is easier to see what each step is doing:
library(lubridate)
library(dplyr)
ex1 <- data.frame(tm = seq.POSIXt(from = ISOdatetime(2018,10,1,1,2,3, tz="UTC"),
to = ISOdatetime(2018,10,1,2,13,3, tz="UTC"),
by = "min"))
# find the earliest date
date1 <- min(ex1$tm)
# work out the offset to midnight, as floor_date() is based on midnight
offset_for_floor <- date1 - ISOdatetime(year(date1), month(date1), day(date1), 0, 0, 0)
# calculate the floor with the offset, so the intervals start at the first value
ex2 <- ex1 %>%
mutate(date_chunk = floor_date(tm - offset_for_floor, "5 mins") + offset_for_floor)
I think tibbletime can do exactly this. Note that what I show here may change a little in the next CRAN release; this package is still in early development. With version 0.1.0:
# install.packages("tibbletime")
library(tibbletime)
library(dplyr)
# Minutely series
ex1 <- create_series("2018-10-01 01:02:03" ~ "2018-10-01 02:13:03",
period = "1 min", tz = "UTC")
ex1 %>%
mutate(date_group = collapse_index(date, "5 min"))
#> # A time tibble: 72 x 2
#> # Index: date
#> date date_group
#> <dttm> <dttm>
#> 1 2018-10-01 01:02:03 2018-10-01 01:06:03
#> 2 2018-10-01 01:03:03 2018-10-01 01:06:03
#> 3 2018-10-01 01:04:03 2018-10-01 01:06:03
#> 4 2018-10-01 01:05:03 2018-10-01 01:06:03
#> 5 2018-10-01 01:06:03 2018-10-01 01:06:03
#> 6 2018-10-01 01:07:03 2018-10-01 01:11:03
#> 7 2018-10-01 01:08:03 2018-10-01 01:11:03
#> 8 2018-10-01 01:09:03 2018-10-01 01:11:03
#> 9 2018-10-01 01:10:03 2018-10-01 01:11:03
#> 10 2018-10-01 01:11:03 2018-10-01 01:11:03
#> # ... with 62 more rows
@davis, thanks for your answer.
@thoughtfulnz and @davis,
I have solved the previous problem by using the lubridate package. Now I need some operations within each group. I am adding my current data format as an image below.
Firstly, I want to define the upper threshold of each group as Tuthr and the lower threshold as Tlthr, for example
Tuthr = 06:42:28 (upper threshold of a group) and
Tlthr = 06:37:29 (lower threshold of a group). Please see the graph.
I tried this code:
Tuthr = lag(date_chunk) # but it doesn't work
Tlthr = min(date_chunk) # it doesn't work
Secondly, my data frame is defined as tt:
tt %>% group_by(date_chunk) %>% # grouped based on thoughtfulnz's answer
Tuthr = lag(date_chunk) %>% # Tuthr = upper threshold, Tlthr = lower threshold
filter(T2 < Tuthr) %>%
if T1 >= Tuthr & T2 <= Tlthr
mutate(T = Tuthr - Tlthr)
if T1 <= Tuthr & T2 <= Tlthr
mutate(T = T1 - Tlthr)
if T1 >= Tuthr & T2 >= Tlthr
mutate(T = Tuthr - T2)
if T1 < Tuthr & T2 > Tlthr
else
mutate(T = T1 - T2)
I want to include some data from the previous group in the next successive group, dynamically, based on the following condition: if T1 > Tuthr (the upper threshold) of a group, the row should be included in both the current and the next group. Please consider the graph for better understanding. The graph represents the movement of vehicles from location P2 to location P3; C1 is the camera position.
Your helping hand is highly appreciated.
Making some observations about your code based on my understanding of it: if membership of a group is defined by sharing the same date_chunk, then all members of a group share the same date_chunk, so any lag or lead of date_chunk is not going to work, as it is the shared group identifier. Do you actually want the lag or lead of the original time within the group?
If you want information to cross between groups, then the question becomes whether there is some other arrangement of groups that could be distinguished; otherwise you may need to work without a grouping level (note the ungroup() command) and work out the crossover rules.
@thoughtfulnz,
Firstly, I don't want the lag or lead of date_chunk. In some conditions, I need to use the upper or lower threshold of each group, as I mentioned in my last question and graph:
if T1>=Tuthr & T2<=Tlthr
mutate (T= Tuthr - Tlthr)
if T1<=Tuthr & T2<=Tlthr
mutate (T= T1 - Tlthr)
if T1>Tuthr & T2>=Tlthr
mutate (T= Tuthr -T2)
else
mutate (T= T1 -T2)
Secondly, I want information to cross between groups if T1 > Tuthr of a group. I need to fulfil both conditions for my analysis. I would greatly benefit from your solution.
Stating at the outset that the kind of problem you are approaching is not one I have tackled (so I have no innate familiarity with the domain), and I do not have your data, here are a few thoughts that might be useful:
If for an individual entry you have its chunk and can derive its upper threshold by a rule, there should be some pathway of either:
creating a new grouping variable that is the original date_chunk, but with time added to it to displace it to the next chunk if it meets the rule; or
(an alternative form of that, as a general strategy) creating a new grouping variable and then using a self-join between the old and new data sets to carry information across. In those cases you need to be careful how many entries you are joining, as you seem to be moving between aggregate and individual data.
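As a rough sketch of the first pathway (the column names and the 300-second chunk width are assumptions, not your real data): rows whose T1 runs past their chunk's upper threshold could be duplicated into the next chunk before regrouping.

```r
library(dplyr)

chunk_width <- 300  # assumed 5-minute chunks, in seconds

df <- data.frame(
  date_chunk = as.POSIXct(c("2017-04-20 06:37:00",
                            "2017-04-20 06:37:00",
                            "2017-04-20 06:42:00"), tz = "UTC"),
  T1 = as.POSIXct(c("2017-04-20 06:40:00",
                    "2017-04-20 06:45:00",   # runs past its chunk's upper threshold
                    "2017-04-20 06:44:00"), tz = "UTC"))

# rows whose T1 passes the chunk's upper threshold get a copy in the next chunk
crossers <- df %>%
  filter(T1 > date_chunk + chunk_width) %>%
  mutate(date_chunk = date_chunk + chunk_width)

# the copies now sit in both their original and their following group
both <- bind_rows(df, crossers) %>% arrange(date_chunk)
```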
As for the multiple if statements, use case_when(), as in this example:
library(dplyr)
data(iris)
iris %>%
mutate(
Tval = case_when(
Sepal.Width >= Petal.Width | Sepal.Length >= Petal.Length ~ Petal.Width - Sepal.Width,
Sepal.Width >= Petal.Width | Sepal.Length <= Petal.Length ~ Petal.Length - Sepal.Length,
Sepal.Width <= Petal.Width | Sepal.Length <= Petal.Length ~ Sepal.Length - Petal.Length,
TRUE ~ Sepal.Width - Petal.Width
)
)
@thoughtfulnz, I could not understand your last code. I have added a reproducible data set with the grouping variable below. I tried the following code, but it produces zero. Could you re-code it using my data? date_chunk represents the grouping variable.
tt<- tt %>%
mutate(date_chunk = floor_date(Time.x - offset_for_floor, "5 mins") + offset_for_floor)%>%
group_by(date_chunk)%>%
mutate(T= case_when(
T2<=min(date_chunk) | T1>=max(date_chunk) ~ max(date_chunk) - min(date_chunk),
T2<=min(date_chunk) | T1<=max(date_chunk) ~ T1 - min(date_chunk),
T2>=min(date_chunk) | T1>=max(date_chunk) ~ max(date_chunk)- T2,
T2>=max(date_chunk) | T1>=max(date_chunk) ~ 0,
T2>=min(date_chunk) | T1<=max(date_chunk) ~ T1- T2
)
)
Mac_address UNIX_T.x UTC_T.x Loc.x UNIX_T.y UTC_T.y Loc.y Link_1 TT T1 T2 date_chunk
6174ba6829753695bb111bfd3a1288d6 1492670249 4/20/2017 6:37 P2 1492670369 4/20/2017 6:39 P3 P2P3 120 1492670323 1492670323 4/20/2017 6:37
9ca4cf09f1caf2de82e3ec982892df7a 1492670253 4/20/2017 6:37 P2 1492674921 4/20/2017 7:55 P3 P2P3 4668 1492673132 1492673127 4/20/2017 6:37
2521a4d2e65100d0fe6c57f111e0109c 1492670258 4/20/2017 6:37 P2 1492671709 4/20/2017 7:01 P3 P2P3 1451 1492671153 1492671151 4/20/2017 6:37
2521a4d2e65100d0fe6c57f111e0109c 1492670290 4/20/2017 6:38 P2 1492671709 4/20/2017 7:01 P3 P2P3 1419 1492671165 1492671164 4/20/2017 6:37
a4f70f558c1f3479a36d62a8a22ab77f 1492670354 4/20/2017 6:39 P2 1492670614 4/20/2017 6:43 P3 P2P3 260 1492670514 1492670514 4/20/2017 6:37
2f15e55a9c3cd70349c1cc5b0d5db00c 1492670432 4/20/2017 6:40 P2 1492670584 4/20/2017 6:43 P3 P2P3 152 1492670526 1492670526 4/20/2017 6:37
35571d9eb84fc322e868ba0f42936a99 1492670489 4/20/2017 6:41 P2 1492670629 4/20/2017 6:43 P3 P2P3 140 1492670575 1492670575 4/20/2017 6:37
5ee52129ae79c183c194ede0b934aa53 1492670520 4/20/2017 6:42 P2 1492670651 4/20/2017 6:44 P3 P2P3 131 1492670601 1492670601 4/20/2017 6:37
2521a4d2e65100d0fe6c57f111e0109c 1492670557 4/20/2017 6:42 P2 1492671709 4/20/2017 7:01 P3 P2P3 1152 1492671268 1492671266 4/20/2017 6:42
e4297c1687b3fea3da2392564720c443 1492670564 4/20/2017 6:42 P2 1492691151 4/20/2017 12:25 P3 P2P3 20587 1492683263 1492683241 4/20/2017 6:42
c27b544af1552a77cd7a38b36923d037 1492670566 4/20/2017 6:42 P2 1492670707 4/20/2017 6:45 P3 P2P3 141 1492670653 1492670653 4/20/2017 6:42
4e68c4a8ff6a71e1fc4df59b587556a7 1492670595 4/20/2017 6:43 P2 1492670742 4/20/2017 6:45 P3 P2P3 147 1492670686 1492670686 4/20/2017 6:42
4bb0326fd40a4a159296cc20a5cc196c 1492670640 4/20/2017 6:44 P2 1492670772 4/20/2017 6:46 P3 P2P3 132 1492670721 1492670721 4/20/2017 6:42
badbb2a52973196ace7cb66c50aa39b2 1492670700 4/20/2017 6:45 P2 1492671974 4/20/2017 7:06 P3 P2P3 1274 1492671486 1492671485 4/20/2017 6:42
badbb2a52973196ace7cb66c50aa39b2 1492670761 4/20/2017 6:46 P2 1492671889 4/20/2017 7:04 P3 P2P3 1128 1492671457 1492671456 4/20/2017 6:42
badbb2a52973196ace7cb66c50aa39b2 1492670820 4/20/2017 6:47 P2 1492671880 4/20/2017 7:04 P3 P2P3 1060 1492671474 1492671473 4/20/2017 6:42
badbb2a52973196ace7cb66c50aa39b2 1492670849 4/20/2017 6:47 P2 1492671989 4/20/2017 7:06 P3 P2P3 1140 1492671552 1492671551 4/20/2017 6:47
badbb2a52973196ace7cb66c50aa39b2 1492670849 4/20/2017 6:47 P2 1492682285 4/20/2017 9:58 P3 P2P3 11436 1492677903 1492677891 4/20/2017 6:47
badbb2a52973196ace7cb66c50aa39b2 1492670849 4/20/2017 6:47 P2 1492671882 4/20/2017 7:04 P3 P2P3 1033 1492671486 1492671485 4/20/2017 6:47
badbb2a52973196ace7cb66c50aa39b2 1492670883 4/20/2017 6:48 P2 1492671872 4/20/2017 7:04 P3 P2P3 989 1492671493 1492671492 4/20/2017 6:47
badbb2a52973196ace7cb66c50aa39b2 1492670941 4/20/2017 6:49 P2 1492671872 4/20/2017 7:04 P3 P2P3 931 1492671515 1492671514 4/20/2017 6:47
badbb2a52973196ace7cb66c50aa39b2 1492671001 4/20/2017 6:50 P2 1492671889 4/20/2017 7:04 P3 P2P3 888 1492671549 1492671548 4/20/2017 6:47
badbb2a52973196ace7cb66c50aa39b2 1492671061 4/20/2017 6:51 P2 1492671880 4/20/2017 7:04 P3 P2P3 819 1492671566 1492671565 4/20/2017 6:47
badbb2a52973196ace7cb66c50aa39b2 1492671120 4/20/2017 6:52 P2 1492671979 4/20/2017 7:06 P3 P2P3 859 1492671650 1492671649 4/20/2017 6:47
badbb2a52973196ace7cb66c50aa39b2 1492671149 4/20/2017 6:52 P2 1492671989 4/20/2017 7:06 P3 P2P3 840 1492671667 1492671666 4/20/2017 6:52
badbb2a52973196ace7cb66c50aa39b2 1492671149 4/20/2017 6:52 P2 1492671991 4/20/2017 7:06 P3 P2P3 842 1492671668 1492671667 4/20/2017 6:52
badbb2a52973196ace7cb66c50aa39b2 1492671149 4/20/2017 6:52 P2 1492682285 4/20/2017 9:58 P3 P2P3 11136 1492678018 1492678006 4/20/2017 6:52
badbb2a52973196ace7cb66c50aa39b2 1492671182 4/20/2017 6:53 P2 1492671882 4/20/2017 7:04 P3 P2P3 700 1492671614 1492671613 4/20/2017 6:52
badbb2a52973196ace7cb66c50aa39b2 1492671240 4/20/2017 6:54 P2 1492671987 4/20/2017 7:06 P3 P2P3 747 1492671701 1492671700 4/20/2017 6:52
badbb2a52973196ace7cb66c50aa39b2 1492671306 4/20/2017 6:55 P2 1492671882 4/20/2017 7:04 P3 P2P3 576 1492671661 1492671661 4/20/2017 6:52
badbb2a52973196ace7cb66c50aa39b2 1492671361 4/20/2017 6:56 P2 1492671889 4/20/2017 7:04 P3 P2P3 528 1492671687 1492671686 4/20/2017 6:52
badbb2a52973196ace7cb66c50aa39b2 1492671450 4/20/2017 6:57 P2 1492671989 4/20/2017 7:06 P3 P2P3 539 1492671782 1492671782 4/20/2017 6:57
badbb2a52973196ace7cb66c50aa39b2 1492671450 4/20/2017 6:57 P2 1492682285 4/20/2017 9:58 P3 P2P3 10835 1492678133 1492678122 4/20/2017 6:57
badbb2a52973196ace7cb66c50aa39b2 1492671480 4/20/2017 6:58 P2 1492671884 4/20/2017 7:04 P3 P2P3 404 1492671729 1492671729 4/20/2017 6:57
badbb2a52973196ace7cb66c50aa39b2 1492671541 4/20/2017 6:59 P2 1492671889 4/20/2017 7:04 P3 P2P3 348 1492671756 1492671755 4/20/2017 6:57
badbb2a52973196ace7cb66c50aa39b2 1492671602 4/20/2017 7:00 P2 1492682285 4/20/2017 9:58 P3 P2P3 10683 1492678192 1492678180 4/20/2017 6:57
badbb2a52973196ace7cb66c50aa39b2 1492671662 4/20/2017 7:01 P2 1492671991 4/20/2017 7:06 P3 P2P3 329 1492671865 1492671865 4/20/2017 6:57
badbb2a52973196ace7cb66c50aa39b2 1492671722 4/20/2017 7:02 P2 1492671991 4/20/2017 7:06 P3 P2P3 269 1492671888 1492671888 4/20/2017 6:57