Multiplication by levels

Lucas11 · May 18, 2020, 5:39pm

Hi, I am stuck with this problem

I want to multiply my data but each proportion of each observation for a different percentage.
As example: if the first observation of my Var_1 has 5000 value.
I want to multiply the proportion between 100 and 1000 by 2% (in this particular case 900 x 2%).
The proportion between 1000 and 2000 by 3% (in this case 1000 x 3%).
And the proportion >2000 by 5% (in this case 3000 x 5%).

And add the sum of this process for each observation in a new variable.
Any idea of how to proceed?

I have these data like example:

library(tidyverse)
my_data <- tibble(Var_1 = c(5000, 1500, 350, 1200, 750, 1000,1250, 2500))

siddharthprabhu · May 18, 2020, 7:15pm

We can write a function using basic if...else flow control statements. The function can then be applied to each element of Var_1.

Note: You haven't specified what should happen to numbers smaller than 100. This function just returns 0 for such values.

library(tidyverse)

my_data <- tibble(Var_1 = c(5000, 1500, 350, 1200, 750, 1000, 1250, 2500))

foo <- function(x) {
  if (x <= 100) {
    return (0)
  } else {
    if (x <= 1000) {
      return ((x - 100) * 0.02)
    } else if (x <= 2000) {
      return ((900 * 0.02) + ((x - 1000) * 0.03))
    } else {
      return ((900 * 0.02) + (1000 * 0.03) + ((x - 2000) * 0.05))
    }
  }
}

my_data$Var_2 <- map_dbl(my_data$Var_1, foo)

print(my_data)
#> # A tibble: 8 x 2
#>   Var_1 Var_2
#>   <dbl> <dbl>
#> 1  5000 198  
#> 2  1500  33  
#> 3   350   5  
#> 4  1200  24  
#> 5   750  13  
#> 6  1000  18  
#> 7  1250  25.5
#> 8  2500  73

^{Created on 2020-05-19 by the reprex package (v0.3.0)}

thelatemail · May 19, 2020, 6:03am

If you pre-calculate all the intervals and then only calculate the remainder in each interval for each value, I think you can do this safely in a vectorised fashion without requiring branching. This will then also be extendable to adding as many cutoffs and rates as required without having to adjust the body of the function.

propmult <- function(x, rate, coff) {
    spot <- findInterval(x, coff)
    base <- cumsum(rate * c(diff(coff), 0))[spot-1]
    base + ((x - coff[spot]) * rate[spot])
}

Comparing results with other answer:

propmult(my_data$Var_1, rate = c(0, 0.02, 0.03, 0.05), coff = c(0, 100, 1000, 2000))
##[1] 198.0  33.0   5.0  24.0  13.0  18.0  25.5  73.0
map_dbl(my_data$Var_1, foo)
##[1] 198.0  33.0   5.0  24.0  13.0  18.0  25.5  73.0

As a side benefit, it should also be quick to run on huge data if that is a concern:

datbig <- my_data[rep(1:8,1e6),]
nrow(datbig)
# 8 million rows - 8,000,000 - to test it out
system.time({ map_dbl(datbig$Var_1, foo) })
##   user  system elapsed 
##  5.049   0.092   5.140 

system.time({propmult(datbig$Var_1, rate = c(0, 0.02, 0.03, 0.05), coff = c(0, 100, 1000, 2000))})
##   user  system elapsed 
##  0.138   0.000   0.138

system · May 26, 2020, 6:03am

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.