How to increase efficiency using group by and mutate

Hi, I am currently experiencing an exponential increase in time it takes to perform a command using tidyverse package.

Consider the following structure (simplified):

data <- data.frame(name = c("a","b","c","d","e","f"),
                  ID =c(1,1,1,2,2,2),
                  sales = c(100, 250, 300, 50, 600, 390),
                  t   = c(0.1,0.3,0.4,0.05,0.15,0.2),
                  n=c(1,2,3,1,2,3),
                  correct_result = c(-221.4,-27.8,69.1,-143.71,-19.11,43.19))

data$ID <- as.integer(data$ID)

I found that it is more efficient to group by ID as integer, rather than factor.

The formula I am trying to calculate implies that for a given name, say, "a", I want to take the sum of sales of all other related names (by their ID) and divide by 1-t for the respective names.
To get a sense of what I am trying to compute for each ID & and name:

(data$sales[2]/(1-data$t[2]))*(data$t[1]-data$t[2]) + (data$sales[3]/(1-data$t[3]))*(data$t[1]-data$t[3])
(data$sales[1]/(1-data$t[1]))*(data$t[2]-data$t[1]) + (data$sales[3]/(1-data$t[3]))*(data$t[2]-data$t[3])
(data$sales[1]/(1-data$t[1]))*(data$t[3]-data$t[1]) + (data$sales[1]/(1-data$t[1]))*(data$t[3]-data$t[1])
library(tidyverse)

# The Model:
data <- data %>%
  mutate(ovt=sales/(1-t))

sumforgoup1 <-function(forname , groupid){   # Create the function: 
  
  key_t <- dplyr::filter(data,
                         ID == groupid,
                         name==forname) %>% pull(t)
  
  temp <- dplyr::filter(data,
                        ID == groupid,
                        name!=forname) %>% mutate(diff_key_t=
                                                    key_t - t)
  
  sum(temp$ovt*temp$diff_key_t)
}

mutate(rowwise(data),
       result = sumforgoup1(name,ID))          # Store result in a new column.

So, the function works fine in this dataset. However, when i apply this function over a larger dataset with, say, 300 rows, the formula takes approximately 6 seconds. Increasing the number of rows with 300 more (i.e., 600 rows) it takes around 35 seconds..
I have around 30.000 rows, so this would take hours..

In the full dataset i converted ID to factor so you can get a sense of the levels (sub here = name):

 $ ID   : Factor w/ 9097 levels "1","2","3","4",..: 1 2 2 3 4 5 5 5 5 5 ...
 $ sub  : Factor w/ 40 levels "1","2","3","4",..: 1 1 2 1 1 1 2 3 4 5 ...

Any recommendations/tips is appreciated,
Thanks!

Your code looks ok but I'm trying to see if the function and general approach can be simplified to increase efficiency. It's odd that you found a factor variable slower to work with than a character or integer vector - factors should be quicker in general, I believe.

One quick question, because I'm concerned about your key_t pull in the function, though I haven't tested it yet: do you know that the result of the filter there (the combination of ID and name) is always going to produce a single row? Because if it doesn't, then pull will produce a vector, not a single value. Is that what you want?

I at least think so.. As far as I'm concerned, the tests i have performed has always produces one single row in the result column. So, the idea with key_t is to take out "t" for the specific row I'm looking at, and use this further to get the differential between all other "t" with same ID.

1 Like

Here's a totally useless response, because it doesn't even work!

But my reason for posting this is to try to get a clearer (for me) sense of the logic of what you're trying to do. I found your description of your desired process a little confusing. But anyway, here's some code that tries to achieve your intention that

for a given name, say, "a", I want to take the sum of sales of all other related names (by their ID) and divide by 1-t for the respective names.

library(dplyr)

df <- tibble(
  name = letters[1:6],
  id = rep(1:2, each = 3),
  sales = c(100, 250, 300, 50, 600, 390),
  t = c(0.1, 0.3, 0.4, 0.05, 0.15, 0.2),
  n = rep(1:3, times = 2)
)

add_result <- function(df, name = name) {
  id <- df %>%
    dplyr::filter(name == name) %>%
    pull(id)

  filt <- df %>%
    dplyr::filter(id == id)

  t <- filt %>%
    dplyr::filter(name == name) %>%
    dplyr::pull(t) %>%
    head(1) # or mean() ... ensure that t is a single value

  sales_sum <- filt %>%
    dplyr::filter(!name == name) %>%
    dplyr::pull(sales) %>%
    sum()

  sales_sum / (1 - t)
}

df %>%
  dplyr::mutate(result = add_result(df = df))
#> # A tibble: 6 x 6
#>   name     id sales     t     n result
#>   <chr> <int> <dbl> <dbl> <int>  <dbl>
#> 1 a         1   100  0.1      1      0
#> 2 b         1   250  0.3      2      0
#> 3 c         1   300  0.4      3      0
#> 4 d         2    50  0.05     1      0
#> 5 e         2   600  0.15     2      0
#> 6 f         2   390  0.2      3      0

Created on 2020-09-24 by the reprex package (v0.3.0)

OK that didn't work, but I'm not sure why. Now I looked more carefully at your intentions, and tweaked the function a bit:

library(dplyr)

df <- tibble(
  name = letters[1:6],
  id = rep(1:2, each = 3),
  sales = c(100, 250, 300, 50, 600, 390),
  t = c(0.1, 0.3, 0.4, 0.05, 0.15, 0.2),
  n = rep(1:3, times = 2)
)


add_result <- function(df, name = name) {
  id <- df %>%
    dplyr::filter(name == name) %>%
    dplyr::pull(id)

  filt <- df %>%
    dplyr::filter(id == id)

  key_t <- filt %>%
    dplyr::filter(name == name) %>%
    dplyr::pull(t) %>%
    head(1) # or mean() ... ensure that t is a single value

  filt %>%
    dplyr::filter(!name == name) %>%
    dplyr::mutate(result = (sales/(1-t)) * (key_t-t)) %>% 
    dplyr::pull(result) %>% 
    sum()
}

df %>%
  dplyr::mutate(result = add_result(df = df))
#> # A tibble: 6 x 6
#>   name     id sales     t     n result
#>   <chr> <int> <dbl> <dbl> <int>  <dbl>
#> 1 a         1   100  0.1      1      0
#> 2 b         1   250  0.3      2      0
#> 3 c         1   300  0.4      3      0
#> 4 d         2    50  0.05     1      0
#> 5 e         2   600  0.15     2      0
#> 6 f         2   390  0.2      3      0

Created on 2020-09-24 by the reprex package (v0.3.0)

That still doesn't work. So I tried something different:

library(dplyr)

df <- tibble(
  name = letters[1:6],
  id = rep(1:2, each = 3),
  sales = c(100, 250, 300, 50, 600, 390),
  t = c(0.1, 0.3, 0.4, 0.05, 0.15, 0.2),
  n = rep(1:3, times = 2)
)

df %>% 
  dplyr::filter(id == id[1]) %>% 
  dplyr::filter(!name == name[1]) %>% 
  dplyr::mutate(result = (sales/(1-t)) * (df$t[1]-t)) %>% 
  dplyr::pull(result) %>% 
  sum()
#> [1] -221.4286

Created on 2020-09-24 by the reprex package (v0.3.0)

This approach gives you the result you're expecting, for row 1 only, but if I then try to generalise it:

library(dplyr)

df <- tibble(
  name = letters[1:6],
  id = rep(1:2, each = 3),
  sales = c(100, 250, 300, 50, 600, 390),
  t = c(0.1, 0.3, 0.4, 0.05, 0.15, 0.2),
  n = rep(1:3, times = 2)
)

calc_result <- function(df, n) {
  df %>% 
    dplyr::filter(id == id[n]) %>% 
    dplyr::filter(!name == name[n]) %>% 
    dplyr::mutate(result = (sales/(1-t)) * (df$t[n]-t)) %>% 
    dplyr::pull(result) %>% 
    sum()
}

purrr::map_dbl(1:nrow(df), ~ calc_result(df = df, .)) %>% 
  dplyr::bind_cols(df, result = .)
#> # A tibble: 6 x 6
#>   name     id sales     t     n result
#>   <chr> <int> <dbl> <dbl> <int>  <dbl>
#> 1 a         1   100  0.1      1      0
#> 2 b         1   250  0.3      2      0
#> 3 c         1   300  0.4      3      0
#> 4 d         2    50  0.05     1      0
#> 5 e         2   600  0.15     2      0
#> 6 f         2   390  0.2      3      0

Created on 2020-09-24 by the reprex package (v0.3.0)

we still get the zeroes in the result column. I'm a bit perplexed.

Anyway, this is all not much use to you at all in investigating the efficiency issue, but perhaps here might be something of value in the difficulties I've been experiencing in getting this "simpler" (haha) approach to work.

1 Like

This is reminiscent of the problem posted in this thread. As @nirgrahamuk stated in his post, the tidyverse functions aren't ideal for such algorithm-based calculations.

You may have to write a C++ function using Rcpp as he did to gain some real performance benefits. I've tagged him in this post to see if he can offer any additional advice.

2 Likes

I appreciate the effort! And, yes it is puzzling that your last generalization only returns zero's...

1 Like

I'm very interested to understand more. I don't know anything about programming in C :slight_smile:

Still think it's odd that turning the pipe that works into function like this, doesn't work:

library(dplyr)

df <- tibble(
  name = letters[1:6],
  id = rep(1:2, each = 3),
  sales = c(100, 250, 300, 50, 600, 390),
  t = c(0.1, 0.3, 0.4, 0.05, 0.15, 0.2),
  n = rep(1:3, times = 2)
)


df %>% 
  dplyr::filter(id == id[1]) %>% 
  dplyr::filter(!name == name[1]) %>% 
  dplyr::mutate(result = (sales/(1-t)) * (df[["t"]][1]-t)) %>% 
  dplyr::pull(result) %>% 
  sum()
#> [1] -221.4286


calc_result <- function(df, n) {
  df %>% 
    dplyr::filter(id == id[n]) %>% 
    dplyr::filter(!name == name[n]) %>% 
    dplyr::mutate(result = (sales/(1-t)) * (df[["t"]][n]-t)) %>% 
    dplyr::pull(result) %>% 
    sum()
}

df %>% 
  calc_result(n = 1)
#> [1] 0

Created on 2020-09-24 by the reprex package (v0.3.0)
It feels like I'm missing something obvious here like a typo.Or something else embarrassingly obvious.

I wanted to take the time to post a solution that worked for me:

library(tidyverse)

data <- tibble(
  name = letters[1:6],
  id = rep(1:2, each = 3),
  sales = c(100, 250, 300, 50, 600, 390),
  t = c(0.1, 0.3, 0.4, 0.05, 0.15, 0.2),
  n = rep(1:3, times = 2),
  correct_result = c(-221.4,-27.8,69.1,-143.71,-19.11,43.19))

library(dplyr)
library(purrr)

data %>% 
  group_by(id) %>% 
  mutate(result = map_dbl(seq_along(id), ~ sum((sales[-.x] / (1 - t[-.x]) * (t[.x] - t[-.x])))))

This code, applied on my original dataset of 200000 rows took maybe 2 seconds, compared to hours with the first code above.

3 Likes

That looks very elegant :clap:t3:

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.