How to iterate over rows of data, find duration, and group, based on certain conditions

tanishahudson1 · February 27, 2020, 9:35pm

Here is my Objective

I have a dataset that tracks when someone was editing a message, left, and came back to resume editing that same message. I am trying to group this one message together and find its total duration.

I need to first group the message together if the following conditions apply:
if the Folder == 'out' OR drafts, the Message == "", Edit == "T", and if the contents in the
Subject and Re columns are consecutively the same.

Ideally, this would give me group A along with its duration. For example, this first 'block' of data would be labeled 'Group A' , would have a start time of 1/2/2020 1:00:01 AM and an end time of 1/2/2020 1:00:30 AM

Furthermore, I would like to 'match' group A with another 'block' of data if the last row of the Subject, Re and Length column matches its first row. So the second Group A would have a start time of 1/2/2020 1:02:00 AM and an end time of 1/2/2020 1:02:05 AM.

Subject Re                    Length         Folder      Message   Date                   Edit     
        a@mail.com,b@mail.com 80             out                   1/2/2020 1:00:01 AM     T                               
        a@mail.com,b@mail.com 80             out                   1/2/2020 1:00:05 AM     T                        
hey     a@mail.com,b@mail.com 80             out                   1/2/2020 1:00:10 AM     T                        
hey     a@mail.com,b@mail.com 80             out                   1/2/2020 1:00:15 AM     T                        
hey     a@mail.com,b@mail.com 80             out                   1/2/2020 1:00:30 AM     T 



hey     a@mail.com,b@mail.com 80            draft                  1/2/2020 1:02:00 AM     T                        
hey     a@mail.com,b@mail.com 80            draft                  1/2/2020 1:02:05 AM     T                        




hey     a@mail.com,b@mail.com 80             out                   1/2/2020 1:03:10 AM     T                        
hey     a@mail.com,b@mail.com 80             out                   1/2/2020 1:03:20 AM     T

Desired Output

 Start                  End                        Duration          Group
 1/2/2020 1:00:01 AM    1/2/2020 1:00:30 AM        20                A
 1/2/2020 1:02:00 AM    1/2/2020 1:02:05 AM        5                 A
 1/2/2020 1:03:10 AM    1/2/2020 1:03:20 AM        10                A

Here is the structure of my data:

 structure(list(Subject = structure(c(1L, 1L, 2L, 2L, 2L, 1L, 
 1L, 2L, 2L, 1L, 1L, 2L, 2L), .Label = c("", "hey"), class = "factor"), 
 Recipient = structure(c(1L, 1L, 2L, 2L, 2L, 1L, 1L, 2L, 2L, 
 1L, 1L, 2L, 2L), .Label = c("", "sarah@mail.com,gee@mail.com"
 ), class = "factor"), Length = c(80L, 80L, 80L, 80L, 80L, 
 NA, NA, 80L, 80L, NA, NA, 100L, 100L), Folder = structure(c(3L, 
 3L, 3L, 3L, 3L, 1L, 1L, 2L, 2L, 1L, 1L, 2L, 2L), .Label = c("", 
 "draft", "out"), class = "factor"), Message = c(NA, NA, NA, 
 NA, NA, NA, NA, NA, NA, NA, NA, NA, NA), Date = structure(c(2L, 
 3L, 4L, 5L, 6L, 1L, 1L, 7L, 8L, 1L, 1L, 9L, 10L), .Label = c("", 
 "1/2/2020 1:00:01 AM", "1/2/2020 1:00:05 AM", "1/2/2020 1:00:10 AM", 
 "1/2/2020 1:00:15 AM", "1/2/2020 1:00:30 AM", "1/2/2020 1:02:00 AM", 
 "1/2/2020 1:02:05 AM", "1/2/2020 1:03:00 AM", "1/2/2020 1:03:20 AM"
 ), class = "factor"), Edit = c(TRUE, TRUE, TRUE, TRUE, TRUE, 
 NA, NA, TRUE, TRUE, NA, NA, TRUE, TRUE)), class = "data.frame", row.names = c(NA, 
-13L))

I am thinking I may have to do a loop or some sort of filter. Any help/suggestion or advice would be greatly appreciated.

  library(tidyverse)
  library(lubridate)



  df1<-df2 %>%
  mutate(Date = lubridate::mdy_hms(Date), 
  cond = Edit == "T" & ItemFolderName == "out" | Folder == "drafts" &     Message == "" & Subject ==  ?   & Re ==   ?     & Length == ?   , 
     grp = cumsum(!cond)) %>%
  filter(cond) %>%
  group_by(grp) %>%

nirgrahamuk · February 27, 2020, 10:23pm

I took your structure, named it as df for myself, and was able to write code to produce df2 with the following contents:

> df2
# A tibble: 3 x 11
  GROUP BREAK_DETECTOR start               end                 duration min_subject max_subject min_recipient max_recipient
  <chr> <chr>          <dttm>              <dttm>              <drtn>   <chr>       <chr>       <chr>         <chr>        
1 a     1              2020-02-01 01:00:01 2020-02-01 01:00:30 29 secs  ""          hey         ""            sarah@mail.c~
2 a     2              2020-02-01 01:02:00 2020-02-01 01:02:05  5 secs  "hey"       hey         "sarah@mail.~ sarah@mail.c~
3 a     3              2020-02-01 01:03:00 2020-02-01 01:03:20 20 secs  "hey"       hey         "sarah@mail.~ sarah@mail.c~
# ... with 2 more variables: min_length <int>, max_length <int>

However, I think there would need to be thought given to the rules for distinguishing groups, and indeed, the fact that your data structure example provided doesnt cover more than 1 group, its going to be hard to code for that, or even think through whether there is a good rule to catch the move from one group to another. I'll leave that for you to think about. I labelled in my code comments, where I think you could add a condition to test to advance to the next group.

here is the full code:

library(tidyverse)
library(lubridate)


# dates should be dates not factors
# are they day month year dmy or month day year mdy ??? first of feb, or 2nd of jan ?
df$Date <- lubridate::dmy_hms(df$Date)

# actually i think factors in general are awkward for this data
df <- mutate_if(df, is.factor, as.character)


df$GROUP <- ""
df$BREAK_DETECTOR <- ""
group_count <- 0
break_count <- 0
for (i in 1:nrow(df)) {

  ## this stuff is for grouping, you need to determine a rule to advance the group count, what is such a rule ?
  ## the example data contains no other groups so it wont be possible to test an implementation until the data is extended
  if (i == 1) {
    group_count <- group_count + 1
    df$GROUP[[i]] <- letters[[group_count]]
  }
  if (i > 1) {
    if (df$GROUP[[i - 1]] != "") {
      df$GROUP[[i]] <- df$GROUP[[i - 1]]
    } else {
      group_count <- group_count + 1
      df$GROUP[[i]] <- letters[[group_count]]
    }
  }
  if (i == 1) {
    break_count <- break_count + 1
    df$BREAK_DETECTOR[[i]] <- break_count
  } else { #rules for detecting breaks - I chose to make it depend on NA values in the Length field
    if (is.na(df$Length[[i]])) {
      if (!is.na(df$Length[[i - 1]])) { # and only if the previous line isnt also NA for Length
        break_count <- break_count + 1
      }
    }
    df$BREAK_DETECTOR[[i]] <- break_count
  }
}


df2 <- df %>%
  filter(!is.na(Length)) %>%
  group_by(
    GROUP, BREAK_DETECTOR
  ) %>%
  summarise(
    start = min(Date),
    end = max(Date),
    duration = difftime(end, start, units = "secs"),
    min_subject = min(Subject),
    max_subject = max(Subject),
    min_recipient = min(Recipient),
    max_recipient = max(Recipient),
    min_length = min(Length),
    max_length = max(Length)
  ) %>%
  ungroup()

tanishahudson1 · February 28, 2020, 12:08am

Thank you so much
I will try this
You are greatly appreciated!!

tanishahudson1 · February 28, 2020, 3:04am

Hello! your code is extremely helpful, so in my original full dataset, there are no blank rows. It looks more like this:

But I wish to determine a 'group' by the following cond: if the Folder is "out" or "draft", if the Edit is T, if the Message is ''", and if the Length is identical consecutive values. I suppose I can add this to the rule?I am just not sure how to df$GROUP <- Folder == 'out'|'draft', Edit == 'T', Message == "", Length == RLE$Length ???

nirgrahamuk · February 28, 2020, 3:09am

Hello, I'm happy to help.
I can probably try to code more tomorrow, but just for now, I'll ask you.
Your example data shows all rows with Edit=T , so is using that as a condition, going to add anything in discriminating one row from the next? Similarly , all the data you shared has blank Message, so it seems irrelevant also. Also, in your second example where you shared the dput() structure, your length changes from 80 to 100, so is stability in length really a condition ? Finally, if Folder is always either out, or draft, then again, what good does testing that it be out or draft do for you ?

I'll check in with you later. best wishes.

tanishahudson1 · February 28, 2020, 3:21am

Ok yes it would be greatly appreciated if you can help tomorrow- I am stuck on this, and I feel your code is working in finding a solution. I will add, this is what the true data looks like:

So I am basically trying to group if these conditions apply:
I would like to keep the Subject row, even if it is blank (so the first starttime would be at: 1/2/2020 1:00:01AM)
The Folder must be 'out' or 'drafts'
The Message must be ""
The Edit must be T, when these conditions are all met, this will be a Group, and if these same conditions are met and the last row of the subject, recipients and length, match the next groups FIRST row of the subject, recipients and length, then these are in a group together

Thank you for your time, Ill continue to research

nirgrahamuk · February 28, 2020, 6:49pm

Hello,
I'm afraid I have more questions for you than answers...
My suspicion is that you have identified necessary, but not sufficient conditions for identifying an email 'group'. By that I mean that by applying the filter, you can elimate lots of content that you dont care about, however, all the content that remains, should probably be many different groups, and we are missing a rule for telling one group apart from another, unless we can rely on some heuristics which we know based on context of what the data represents or how it was extracted. My experience with emails generally, tells me that emails get drafted, parked, continued later, and this can cause an interleaving pattern, where the same email might be on two sides of another email that has nothing to do with it. Without knowing there are some rules on your email data (perhaps its a simple email system that wont allow more than 1 draft at a time) it will be a faulty assumption...

Do you have any context about the background to this work you are doing? Could you perhaps get a data extract with more information included in it. For example it is typical for an email system to generate unique ID's so that it can maintain coherent history over the activity of a given mail, if you could have your data include this, that would serve as an effective grouping variable, and the other summarising calculations of edit durations and the like, would become tractable, maybe even trivial.
Something to think about ?

tanishahudson1 · February 28, 2020, 7:22pm

Thank you for your response. The background on this:

I am trying to figure out the compose times for each unique email, so we have a tracking system for this, which tracks from the moment the email is open, to the moment it is sent. Unfortunately there are no messageIDs generated while the email is being edited, so I do not have this use. The scenario is this: someone is editing an email for 10 seconds, then walks away, then resumes that same email. I am trying to capture the total duration for that single email. So all the emails with a Length of 80 is one single email. I am trying to see how to group them. I came up with this, with some help:

  df1<-df %>% 

  mutate_if(is.factor, as.character) %>% 

 mutate_at(c( "Recipient"), ~if_else(is.na(.), "", stringr::str_trim(.))) %>%

 mutate(Date = as.POSIXct(Date, format = '%m/%d/%Y %H:%M:%OS')) %>%
 mutate(cond = Edit & Folder %in% c('out', 'draft') & Message == '') %>% 
 mutate(segment = cumsum(!cond)) %>%
 filter(cond) %>%  


 group_by( Recipient, Length, segment) %>%
 summarize(Start = min(Date),
        End = max(Date),
        Duration = End - Start) %>%


mutate(new_group = 
       (Recipient != lag(Recipient, 1, "")) *
       (Length    != lag(Length, 1, ""))) %>%
ungroup() %>%
mutate(group = LETTERS[cumsum(new_group)]),

which gives me this:

, but I still need to modify the code so that email with the letter k, is in the A group and not a separate group. The row with 'k' actually belongs to group A. I am struggling with finding out how to do that

nirgrahamuk · February 28, 2020, 7:29pm

Is the tracking system a commercial solution ? in which case I would file an issue
is it an inhouse solution, is there maybe a developer you could speak to about that. It seems so unlikely,.. I'm sorry for being super-sceptical, its just how I was raised

So can you explain to me, what exactly is 'length' in this context ? is it measuring some fact about the email ?

What did you think of the issue I raised which is that emails can be worked on out of order? how do you conceptualise that ?

Unfortunately I simply don't think the data you have is capable of being used for the purpose you want. As a human, I can't tell that the k line, should be considered the same email as the following 3 to sarah@... I can only take it on faith, that you say it is so, but if someone else posted here, they could say the opposite, and would there be a reasonable argument that could be had to settle the issue one way or another ?

I simply can't conceive of a system that is designed to track emails that... doesn't have id's to track them with. I can only apologise if this is frustrating for you to read. Is there anyone you can go back to on any of this ? in terms of what they are giving you ?

tanishahudson1 · February 29, 2020, 1:34am

The ‘k’ line is the same email because while typing an email, one can change the subject midway, or even add or remove recipients.
I am just trying to incorporate this logic in your code, or the code I have up here.
I like the idea of groups or breaks in your code. Is there a way for me to add the conditions of
When edit is T , message is ‘’, folder is out or drafts, and there being a consecutive identical body length ??

I need to play around w it-but I am unsure
Please advise

tanishahudson1 · February 29, 2020, 1:45am

I am still learning R and am trying to add conditions to your lines

df$GROUP <- ""
df$BREAK_DETECTOR <- ""

I was thinking I can add

df$GROUP <- "Edit == “T”, Message == “”, Folder == “out"| “draft”

And perhaps the Break Detector occurs whenever these conditions are not met . I am just not sure how to write this and am researching this.

df$BREAK_DETECTOR <- ""

Something like this?

nirgrahamuk · February 29, 2020, 1:48am

I feel like we're starting to go around in circles a little bit. Put simply, you don't have a systematic way of telling an email apart from another, especially if emails can be inbetween other emails. Anyways, I might be wrong, this is just my understanding. Feel free to continue, but I'm going to bow out of the conversation.

tanishahudson1 · February 29, 2020, 1:48am

Ok thank you for your input

Kill3rbee · March 4, 2020, 12:24pm

@tanishahudson1, I tried using data.tables which is my favourite got to. See if this helps with your issue.

library(lubridate)
library(data.table)
testdf <- setDT(structure(
  list(
    Subject = structure(
      c(1L, 1L, 2L, 2L, 2L, 1L,
        1L, 2L, 2L, 1L, 1L, 2L, 2L),
      .Label = c("", "hey"),
      class = "factor"
    ),
    Recipient = structure(
      c(1L, 1L, 2L, 2L, 2L, 1L, 1L, 2L, 2L,
        1L, 1L, 2L, 2L),
      .Label = c("", "sarah@mail.com,gee@mail.com"),
      class = "factor"
    ),
    Length = c(80L, 80L, 80L, 80L, 80L,
               NA, NA, 80L, 80L, NA, NA, 100L, 100L),
    Folder = structure(
      c(3L,
        3L, 3L, 3L, 3L, 1L, 1L, 2L, 2L, 1L, 1L, 2L, 2L),
      .Label = c("",
                 "draft", "out"),
      class = "factor"
    ),
    Message = c(NA, NA, NA,
                NA, NA, NA, NA, NA, NA, NA, NA, NA, NA),
    Date = structure(
      c(2L,
        3L, 4L, 5L, 6L, 1L, 1L, 7L, 8L, 1L, 1L, 9L, 10L),
      .Label = c(
        "",
        "1/2/2020 1:00:01 AM",
        "1/2/2020 1:00:05 AM",
        "1/2/2020 1:00:10 AM",
        "1/2/2020 1:00:15 AM",
        "1/2/2020 1:00:30 AM",
        "1/2/2020 1:02:00 AM",
        "1/2/2020 1:02:05 AM",
        "1/2/2020 1:03:00 AM",
        "1/2/2020 1:03:20 AM"
      ),
      class = "factor"
    ),
    Edit = c(TRUE, TRUE, TRUE, TRUE, TRUE,
             NA, NA, TRUE, TRUE, NA, NA, TRUE, TRUE)
  ),
  class = "data.frame",
  row.names = c(NA,-13L)
))


testdf[((Folder =="out") | (Folder =="draft")) & is.na(Message) & Edit =="TRUE", Group:= "A" ]

mintest <-testdf[!is.na(Length), .SD[unique(c(1,.N))], by=c("Length","Folder")]

odd<-mintest[seq(1, nrow(mintest), by=2)]
setnames(odd, "Date", "Start")
odd[, End:=mintest[seq(2, nrow(mintest), by=2)]$Date]
odd[, Duration:=difftime(mdy_hms(odd$End), mdy_hms(odd$Start), units = "secs")]

Desired_Output <- odd[, c(6,9:10,8)]

Desired_Output

Let me know what you think

tanishahudson1 · March 4, 2020, 1:10pm

I will try, thank you!

system · March 25, 2020, 1:10pm

This topic was automatically closed 21 days after the last reply. New replies are no longer allowed.