Grouping similar times together

delisol · February 6, 2021, 1:28am

Post revised to help with clarity. I have a dataset of systolic blood pressure and date/time taken. I have created a fictional example dataset for an individual below. The readings were sometimes taken within minutes of each other, sometimes not.

library(tidyverse)

date_time <- c("Jan 29 2020 13:46:08" , 
               "Jan 29 2020 13:42:53" , 
               "Jan 29 2020 12:13:27" ,
               "Jan 29 2020 12:11:19" , 
               "Jan 29 2020 12:09:21" , 
               "Jan 28 2020 12:22:26" , 
               "Jan 27 2020 8:22:20"  , 
               "Jan 25 2020 14:34:22" , 
               "Jan 25 2020 14:31:13" ,
               "Jan 23 2020 12:16:16" ,
               "Jan 23 2020 12:13:30" ,
               "Jan 20 2020 12:12:59" ,
               "Jan 20 2020 12:05:30" ,
               "Jan 20 2020 12:01:54")

systol <- c(132  , 132  , 118  , 115  , 110 , 148 , 120 , 
            115 , 117 , 134 , 136 , 131 , 132 , 137)

df <- data.frame(date_time , systol) %>%
  mutate(dtetime = lubridate::mdy_hms(date_time)) %>%
  arrange(dtetime)

df
#>               date_time systol             dtetime
#> 1  Jan 20 2020 12:01:54    137 2020-01-20 12:01:54
#> 2  Jan 20 2020 12:05:30    132 2020-01-20 12:05:30
#> 3  Jan 20 2020 12:12:59    131 2020-01-20 12:12:59
#> 4  Jan 23 2020 12:13:30    136 2020-01-23 12:13:30
#> 5  Jan 23 2020 12:16:16    134 2020-01-23 12:16:16
#> 6  Jan 25 2020 14:31:13    117 2020-01-25 14:31:13
#> 7  Jan 25 2020 14:34:22    115 2020-01-25 14:34:22
#> 8   Jan 27 2020 8:22:20    120 2020-01-27 08:22:20
#> 9  Jan 28 2020 12:22:26    148 2020-01-28 12:22:26
#> 10 Jan 29 2020 12:09:21    110 2020-01-29 12:09:21
#> 11 Jan 29 2020 12:11:19    115 2020-01-29 12:11:19
#> 12 Jan 29 2020 12:13:27    118 2020-01-29 12:13:27
#> 13 Jan 29 2020 13:42:53    132 2020-01-29 13:42:53
#> 14 Jan 29 2020 13:46:08    132 2020-01-29 13:46:08

I would like to group systolic readings together that took place in a span of 10 minutes. This is the result I would want:
Group 1: rows 1, 2
Group 2: row 3
Group 3: rows 4,5
G4: 6, 7
G4: 8
G5: 9
G6: 10, 11, 12
G7: 13, 14

The ultimate object is to then average readings that were taken close to each other in time. The criterion for "close in time" is readings that are all taken within 10 minutes--i.e,. the last reading in any group has to be within 10 minutes of the first reading in that group. Groups can have any number of readings, as long as the group meets the 10-minutes-from-start-to-finish criterion.

I've tried to solve this with the lag function, but it is awkward and possibly inaccurate. I suspect this might be addressed by detecting time "clusters" but I'm not familiar with those methods. I asked this question on stack overflow, but after a couple of weeks I haven't received any answers that I felt addressed the issue.

I'm a fairly new R user most familiar with the tidyverse approach. I suspect this is not an unusual example since blood pressure readings taken close together in time are often averaged.

Thanks for any help!

technocrat · February 6, 2021, 1:59am

What rounding convention will be you be using: ceiling, floor or 5/4?

delisol · February 6, 2021, 2:52am

Floor is the rounding convention, on minutes.

technocrat · February 6, 2021, 5:50am

First part is flooring the times

 suppressPackageStartupMessages({
   library(lubridate)
 })
 
decimate <- function(x) floor_date(x,unit = "10 minutes")

a_time <- Sys.time()
a_time
#> [1] "2021-02-05 21:48:27 PST"
decimate(a_time)
#> [1] "2021-02-05 21:40:00 PST"

See the FAQ: How to do a minimal repoducible xample reprx for beginners for posting back with some representative data, and we can go forward to the grouping.

delisol · February 6, 2021, 5:09pm

I see from your first steps that the rounding question is misunderstood. I do not want to lose precision in the readings' times, so no rounding there. As to the example data, there was one potentially troublesome data combination that was not included in the original, and I expanded it slightly if that will help.

technocrat · February 6, 2021, 8:59pm

Thanks for the reprex. Rounding is only for grouping, so no data discard.

So, we're only doing grouping and no lumping, meaning we are just categorizing and not, say, taking min, mean, median, max.

Edge case: observations at 0,5,10,15: 0,5,10 and 5,10,15 overlap. I think the method I'm working on will assign 0 to group A and the rest to group B. If that's a problem, we'll need to go back and decide how to classify the odd man out.

delisol · February 6, 2021, 10:19pm

Thanks for the response. The ultimate goal is to calculate the mean systolic reading of a group. The time element is one way from first to last. There are 7, non-overlapping groups in the example. Each group starts at a time (x). Any readings taking within 10 minutes after x is included in that group. The next observation after the end of a group starts a new group. (For example, row 3 is within 10 minutes after an earlier observation (row 2) but because it is more than 10 minutes after row 1, it starts a new group, in this case, a group of 1 row. This happens rarely.) Although, in the example, the maximum number of observations in a group is 3 (group 6), in the actual dataset the 10-minute group may contain any number of readings. My apologies, I should not have assumed directionality was understood.

technocrat · February 6, 2021, 11:27pm

Got it, thanks. Here's a very ugly solution.

Je n'ai fait celle-ci plus longue que parce que je n'ai pas eu le loisir de la faire plus courte.

suppressPackageStartupMessages({
  library(dplyr)
  library(magrittr)
})

date_time <- c(
  "Jan 29 2020 13:46:08",
  "Jan 29 2020 13:42:53",
  "Jan 29 2020 12:13:27",
  "Jan 29 2020 12:11:19",
  "Jan 29 2020 12:09:21",
  "Jan 28 2020 12:22:26",
  "Jan 27 2020 8:22:20",
  "Jan 25 2020 14:34:22",
  "Jan 25 2020 14:31:13",
  "Jan 23 2020 12:16:16",
  "Jan 23 2020 12:13:30",
  "Jan 20 2020 12:12:59",
  "Jan 20 2020 12:05:30",
  "Jan 20 2020 12:01:54"
)

systol <- c(
  132, 132, 118, 115, 110, 148, 120,
  115, 117, 134, 136, 131, 132, 137
)

# df and data are also names of built-in functions
# Usually , other objects of the same names peacefully
# co-exist, but there are some operations in which
# the built-in takes precedence, so I avoid 

DF <- data.frame(date_time, systol) %>%
  mutate(dtetime = lubridate::mdy_hms(date_time)) %>%
  arrange(dtetime)

# creates difftime object; no provision for DST

DF %<>% mutate(hiatus = dtetime - lag(dtetime,1))

# create a plug value for first row's hiatus

DF[1,4] <- (DF[2,4] - DF[2,4])

# identifiers for grouping, extend as needed

grp <- c(LETTERS,c(paste(LETTERS,LETTERS,sep = "")))

# codes whether a new group is starting and assign a group number

DF %<>% mutate(newone = ifelse(hiatus <= 10,0,1))
DF %<>% mutate(thegrp = cumsum(newone))

# assign a group name

DF %<>% mutate(thegrp = case_when(
              thegrp == 0 ~ grp[1],
              thegrp == 1 ~ grp[2],
              thegrp == 2 ~ grp[3],
              thegrp == 3 ~ grp[4],
              thegrp == 4 ~ grp[5],
              thegrp == 5 ~ grp[6],
              thegrp == 6 ~ grp[7]
              ))
# cleanup

DF %<>% select(dtetime,systol,thegrp)

# group and summarize
DF %>% 
  group_by(thegrp) %>%
  summarize(meansys = mean(systol))
#> # A tibble: 7 x 2
#>   thegrp meansys
#> * <chr>    <dbl>
#> 1 A         133.
#> 2 B         135 
#> 3 C         116 
#> 4 D         120 
#> 5 E         148 
#> 6 F         114.
#> 7 G         132

The %<>% operator is from the {magritte} package—it's just like %>% only it overwrites the left-hand expression.

delisol · February 7, 2021, 9:33pm

@technocrat, thanks for your explanation! I am pulled away at present but will get back to this today, tomorrow latest.

delisol · February 8, 2021, 6:07am

This does not resolve all instances where dtetime is more than 10 minutes after the initial value of dtetime for the group. The statement:

DF %<>% mutate(hiatus = dtetime - lag(dtetime,1))

only looks back one row. It needs to look back to dtetime in the row that starts the group no matter how many rows back--in this instance, to row 1. dtetime in row 3 is >10 minutes past dtetime in row 1 so it should start a new group. It is grouped with rows 1 and 2 in your code because it is less than 10 minutes after row 2.
The program needs to
(1) establish the start dtetime of the first group in row 1,
(2) advance to the next row,
(3) evaluate whether dtetime in that row is more than 10 minutes after the start dtetime in (step 1).
If "no", go to next row and repeat (step 3). If "yes", establish that row's dtetime as the start of a new group and repeat steps 1-3.
Of course, it needs to incorporate something like your LETTERS approach, too, to uniquely identify the groups. I am trying to find a solution using loops, but I'm not good at them yet and I haven't found one that will work.

technocrat · February 12, 2021, 7:38am

After some misadventure trying to do this functionally, I checked §11.6 of Advanced R, 1st ed. on loops and took that approach. For such an eventually straightforward solution, it proves surprisingly difficult.

suppressPackageStartupMessages({
  library(dplyr)
  library(lubridate)
})

# FUNCTIONS

# assign group identifiers

# create a store of identifiers

make_groups <- function(x) c(x,c(paste0(x,x)))

# assign group identifiers to observations

match_groups <- function(x,y) x[cumsum(y)]

# CONSTANTS

# the store of identifiers, here 52, size to greater than expected number of groups

G <- make_groups(LETTERS)

# time window, in minutes, containing groups of observations

tranche = 10 

# DATA

date_time <- c(
  "Jan 29 2020 13:46:08",
  "Jan 29 2020 13:42:53",
  "Jan 29 2020 12:13:27",
  "Jan 29 2020 12:11:19",
  "Jan 29 2020 12:09:21",
  "Jan 28 2020 12:22:26",
  "Jan 27 2020 8:22:20",
  "Jan 25 2020 14:34:22",
  "Jan 25 2020 14:31:13",
  "Jan 23 2020 12:16:16",
  "Jan 23 2020 12:13:30",
  "Jan 20 2020 12:12:59",
  "Jan 20 2020 12:05:30",
  "Jan 20 2020 12:01:54"
)
systol <- c(
  132, 132, 118, 115, 110, 148, 120,
  115, 117, 134, 136, 131, 132, 137
)
# PREPROCESSING

DF <- data.frame(date_time, systol) %>%
  mutate(dtetime = mdy_hms(date_time)) %>%
  select(dtetime,systol) %>%
  arrange(-desc(dtetime))

# MAIN

# take datetime objects as a vector of differences

diff(DF[,1]) -> d

# insert first value (no difference) and convert to numeric

dn <- c(0,as.numeric(d))

# object to hold results

receiver <- vector("double", length = length(DF[,1]))

# initialize index position of dn object; it increments to the index each time a 10-minute window passes

flag = 1

# no return value; used for side effects in updating flag and
# receiver

# find indices of observations 

for(i in seq_along(dn)) {
  if(sum(flag:dn[i]) <= tranche) receiver[i] <- 0
  if(sum(flag:dn[i]) >  tranche) receiver[i] <- 1
  if(sum(flag:dn[1]) >  tranche) flag <- i
}

# first index is always the beginning of a new group
# simpler to do separately than to provide one logic for
# first and different for remainder

receiver[1] <- 1

# match groups to character identifier

group <- match_groups(G,receiver)

# add to DF as column

DF[,3] <- group

names(DF) <- c("dtetime","systol","group")

# inspect result

DF
#>                dtetime systol group
#> 1  2020-01-20 12:01:54    137     A
#> 2  2020-01-20 12:05:30    132     A
#> 3  2020-01-20 12:12:59    131     B
#> 4  2020-01-23 12:13:30    136     C
#> 5  2020-01-23 12:16:16    134     C
#> 6  2020-01-25 14:31:13    117     D
#> 7  2020-01-25 14:34:22    115     D
#> 8  2020-01-27 08:22:20    120     E
#> 9  2020-01-28 12:22:26    148     F
#> 10 2020-01-29 12:09:21    110     G
#> 11 2020-01-29 12:11:19    115     G
#> 12 2020-01-29 12:13:27    118     G
#> 13 2020-01-29 13:42:53    132     H
#> 14 2020-01-29 13:46:08    132     H

delisol · February 12, 2021, 11:01pm

@technocrat, the result is perfect in the example. When I substituted the actual dataset, limited to the first 30 rows, all readings ended up in the first group, A. I double-checked my substitutions but nothing came up that would have caused that. I don't follow with complete understanding a fair amount of your looping code but it looks like a systematic approach that is not dependent on the sample dataset. I'd like to try to troubleshoot this and I will get back to you as soon as I can. (Injured hand this morning and can only spend a limited time at the keyboard.) Thanks for following through and for your patience.

technocrat · February 13, 2021, 1:20am

Ouch! I was worried about missing edge cases. Post or DM the longer series?

system · April 24, 2024, 4:14pm

This topic was automatically closed 42 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.