Cumulative Sum with Conditions

Hi!

Having some trouble with cumulative sums, any guidance appreciated.
I have some customer counter data, and I need to understand due to COVID how many customers are in the shop at any given moment. After the day is complete, I need it to reset to 0.

Below is the data.

tibble::tribble(
  ~DateValue, ~TimeValue, ~CustomersIn, ~CustomersOut, ~DoorsOpenPlan, ~DoorsOpenActual,
  "1/1/2020",     "4:00",           57,            18,         "NULL",                1,
  "1/1/2020",     "4:15",          107,           115,         "NULL",                6,
  "1/1/2020",     "4:30",           99,           119,         "NULL",                8,
  "1/1/2020",     "4:45",           93,            78,         "NULL",                7,
  "1/2/2020",     "4:00",           82,            36,            "2",                2,
  "1/2/2020",     "4:15",          165,           146,            "5",                7,
  "1/2/2020",     "4:30",          165,           199,            "7",                9,
  "1/2/2020",     "4:45",          151,           150,            "8",                8,
  "1/3/2020",     "4:00",           66,            21,            "4",                1,
  "1/3/2020",     "4:15",          102,           119,            "6",                7,
  "1/3/2020",     "4:30",          178,           175,            "6",                9,
  "1/3/2020",     "4:45",          186,           184,            "7",                9,
  "1/3/2020",     "5:00",          224,           206,            "8",                9,
  "1/3/2020",     "5:15",          202,           213,            "9",               10,
  "1/3/2020",     "5:30",          256,           244,            "9",               11,
  "1/3/2020",     "5:45",          225,           233,            "9",               10,
  "1/3/2020",     "6:00",          213,           226,           "11",               11,
  "1/3/2020",     "6:15",          260,           278,           "11",               11,
  "1/3/2020",     "6:30",          320,           275,           "11",               12
  )

The first step, I need is a check. Are columns 3,4,5,6 = NULL and if they are it should return a "TRUE" else "FALSE".

Following this, if the check column is TRUE it should return NULL else do a cumulative sum of the difference between Customers IN and Customers OUT.
I would like it to reset to 0 when the the Time is less than the row above.

Desired Output:

tibble::tribble(
    ~DateValue, ~TimeValue, ~CustomersIn, ~CustomersOut, ~DoorsOpenPlan, ~DoorsOpenActual, ~Flag, ~CustomersinStore,
  "01/01/2020",    "04:00",           57,            18,         "NULL",                1,  TRUE,            "NULL",
  "01/01/2020",    "04:15",          107,           115,         "NULL",                6,  TRUE,            "NULL",
  "01/01/2020",    "04:30",           99,           119,         "NULL",                8,  TRUE,            "NULL",
  "01/01/2020",    "04:45",           93,            78,         "NULL",                7,  TRUE,            "NULL",
  "02/01/2020",    "04:00",           82,            36,            "2",                2, FALSE,               "0",
  "02/01/2020",    "04:15",          165,           146,            "5",                7, FALSE,              "46",
  "02/01/2020",    "04:30",          165,           199,            "7",                9, FALSE,              "65",
  "02/01/2020",    "04:45",          151,           150,            "8",                8, FALSE,              "31",
  "03/01/2020",    "04:00",           66,            21,            "4",                1, FALSE,               "0",
  "03/01/2020",    "04:15",          102,           119,            "6",                7, FALSE,              "45",
  "03/01/2020",    "04:30",          178,           175,            "6",                9, FALSE,              "28",
  "03/01/2020",    "04:45",          186,           184,            "7",                9, FALSE,              "31",
  "03/01/2020",    "05:00",          224,           206,            "8",                9, FALSE,              "33",
  "03/01/2020",    "05:15",          202,           213,            "9",               10, FALSE,              "51",
  "03/01/2020",    "05:30",          256,           244,            "9",               11, FALSE,              "40",
  "03/01/2020",    "05:45",          225,           233,            "9",               10, FALSE,              "52",
  "03/01/2020",    "06:00",          213,           226,           "11",               11, FALSE,              "44",
  "03/01/2020",    "06:15",          260,           278,           "11",               11, FALSE,              "31",
  "03/01/2020",    "06:30",          320,           275,           "11",               12, FALSE,              "13"
  )

Thanks all!

Hello,

Here is a way to to this with the Tidyverse

CODE

library(dplyr)

myData = tibble::tribble(
  ~DateValue, ~TimeValue, ~CustomersIn, ~CustomersOut, ~DoorsOpenPlan, ~DoorsOpenActual,
  "1/1/2020",     "4:00",           57,            18,         "NULL",                1,
  "1/1/2020",     "4:15",          107,           115,         "NULL",                6,
  "1/1/2020",     "4:30",           99,           119,         "NULL",                8,
  "1/1/2020",     "4:45",           93,            78,         "NULL",                7,
  "1/2/2020",     "4:00",           82,            36,            "2",                2,
  "1/2/2020",     "4:15",          165,           146,            "5",                7,
  "1/2/2020",     "4:30",          165,           199,            "7",                9,
  "1/2/2020",     "4:45",          151,           150,            "8",                8,
  "1/3/2020",     "4:00",           66,            21,            "4",                1,
  "1/3/2020",     "4:15",          102,           119,            "6",                7,
  "1/3/2020",     "4:30",          178,           175,            "6",                9,
  "1/3/2020",     "4:45",          186,           184,            "7",                9,
  "1/3/2020",     "5:00",          224,           206,            "8",                9,
  "1/3/2020",     "5:15",          202,           213,            "9",               10,
  "1/3/2020",     "5:30",          256,           244,            "9",               11,
  "1/3/2020",     "5:45",          225,           233,            "9",               10,
  "1/3/2020",     "6:00",          213,           226,           "11",               11,
  "1/3/2020",     "6:15",          260,           278,           "11",               11,
  "1/3/2020",     "6:30",          320,           275,           "11",               12
) 

#Convert the "NULL" into NA
myData = myData %>% mutate(across(c(CustomersIn, CustomersOut, DoorsOpenPlan), 
                                  function(x) ifelse(x == "NULL", NA, x)))

#Create the Flag column
myData = myData  %>% 
  mutate(Flag = is.na(CustomersIn) | is.na(CustomersOut) | is.na(DoorsOpenPlan)) 

#Create the CustomersInStore column
myData = myData %>% group_by(DateValue, Flag) %>% 
  mutate(CustomersInStore = ifelse(Flag, NA, cumsum(CustomersIn) - cumsum(CustomersOut)),
         CustomersInStore = ifelse(Flag, NA, c(0, CustomersInStore[-n()])))

RESULT

# A tibble: 19 x 8
# Groups:   DateValue, Flag [3]
   DateValue TimeValue CustomersIn CustomersOut DoorsOpenPlan DoorsOpenActual Flag  CustomersInStore
   <chr>     <chr>           <dbl>        <dbl> <chr>                   <dbl> <lgl>            <dbl>
 1 1/1/2020  4:00               57           18 NA                          1 TRUE                NA
 2 1/1/2020  4:15              107          115 NA                          6 TRUE                NA
 3 1/1/2020  4:30               99          119 NA                          8 TRUE                NA
 4 1/1/2020  4:45               93           78 NA                          7 TRUE                NA
 5 1/2/2020  4:00               82           36 2                           2 FALSE                0
 6 1/2/2020  4:15              165          146 5                           7 FALSE               46
 7 1/2/2020  4:30              165          199 7                           9 FALSE               65
 8 1/2/2020  4:45              151          150 8                           8 FALSE               31
 9 1/3/2020  4:00               66           21 4                           1 FALSE                0
10 1/3/2020  4:15              102          119 6                           7 FALSE               45
11 1/3/2020  4:30              178          175 6                           9 FALSE               28
12 1/3/2020  4:45              186          184 7                           9 FALSE               31
13 1/3/2020  5:00              224          206 8                           9 FALSE               33
14 1/3/2020  5:15              202          213 9                          10 FALSE               51
15 1/3/2020  5:30              256          244 9                          11 FALSE               40
16 1/3/2020  5:45              225          233 9                          10 FALSE               52
17 1/3/2020  6:00              213          226 11                         11 FALSE               44
18 1/3/2020  6:15              260          278 11                         11 FALSE               31
19 1/3/2020  6:30              320          275 11                         12 FALSE               13

EXPLANATION

In the first step I converted the "NULL" into NA because it's a bit easier to work with than a string of NULL.

In the second step, Flag is created by checking if the value of any of the 3 columns is NA

In the 3rd step, I assume that every day the time goes up and the data is sorted, so I group the data by day then get the cumsum() of each day using the built in function. Since you want the results to be offset with the CustomersInStore displaying the value of the previous day, I shift everything one place down and put a 0 at the start, loosing the last value of every day.

Hope this helps,
PJ

1 Like

Thanks PJ!

Works great, thanks for your comments, makes it easier to understand!

Sometimes our customer counters are a bit "funny" and we often see the CustomersOut more than what comes in...

How can I reset the cumulative count to 0 again when CustomersInStore < 0?

Hi,

Is the CustomersOut always supposed to be smaller or equal to CustomersIn at a given time, or can it be that there are customers leaving that came in at previous timepoints hence making the CustomersOut > CustomersIn, but the total still positive.

If that's the case, we have to attack this problem at the level of CustomersInStore, if not, we can simply make sure CustomersOut <= CustomersIn at any given time point.

PJ

Hi PJ,

I think I've solved it based on your guidance previously, so thank you. Here is the solution..:slight_smile:

Where column 4 = CustomerOut
column 6 = CustomerIn
Column 15 = CumulativeColumn.

for (i in 1:nrow(df)) {
  
  if (i == 1) {df[i,15] = 0
                CQ = df[i,6] - df[i,4]}
  
  else if (df[i,1] != df[i-1,1]) {df[i,15] = 0
  CQ =  df[i,6] - df[i,4]} 
  
  else {
    
    if (CQ + df[i,6] - df[i,4] < 0) {CQ = 0} else {CQ = CQ + df[i,6] - df[i,4]}
    
    df[i,15] = CQ
    
    }
}

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.