sum() returning a vector

dg.aragones · September 18, 2019, 5:50pm

I am doing some calculations using R and I was wondering if there's a way to perform a sum() with multiple criteria in a vectorized way.

As a simple example:

a=c(-2,-1,1,2,3);
sum(a<0) #Returns 2
sum(a<2) #Returns 3
sum(a<3) #Returns 4

Can this operation be done in a simple way so that it returns a list (2,3,4), defining another vector b=c(0,2,3)?

Thank you in advance for any comments!

cderv · September 18, 2019, 5:56pm

You could be interested in purrr and functional programming

library(purrr)
a <- c(-2,-1,1,2,3)
map_int(c(0, 2, 3), ~ sum(a < .x))
#> [1] 2 3 4

You can also use base R but purrr has a lot of advantage

a <- c(-2,-1,1,2,3)
lapply(c(0, 2, 3), function(x) sum(a < x))
#> [[1]]
#> [1] 2
#> 
#> [[2]]
#> [1] 3
#> 
#> [[3]]
#> [1] 4
sapply(c(0, 2, 3), function(x) sum(a < x))
#> [1] 2 3 4

^{Created on 2019-09-18 by the reprex package (v0.3.0)}

nwerth · September 18, 2019, 7:33pm

The functional programming solution is definitely elegant and easy to understand. If I were writing a program, it's what I'd go with.

But, just for fun, here's another solution. You can use your vector of break points to divide the input into bins and count the number of values in each bin. If a value is below the first break point, then it's below the second, and so on. So the number below each break is just a cumulative sum.

I also threw in a couple calls to order, because findInterval demands a sorted vector of break points, but that shouldn't bind our hands.

under_breaks <- function(x, breaks) {
  break_order <- order(breaks)
  intervals <- 1 + findInterval(x, breaks[break_order])
  interval_counts <- tabulate(intervals)
  cumsum(interval_counts)[order(break_order)]
}

a <- c(-2,-1,1,2,3)
under_breaks(a, c(0, 2, 3))
# [1] 2 3 4
under_breaks(a, c(3, 0, 2))
# [1] 4 2 3

This way avoids a lot of repetitive checking, so it doesn't slow down as much with more breaks.

library(microbenchmark)
library(purrr)

func_purrr <- function(x, breaks) {
  map_int(breaks, ~ sum(x < .x))
}

func_vapply <- function(x, breaks) {
  vapply(breaks, function(b) sum(x < b), integer(1))
}

set.seed(007)
big_a <- rpois(1000, 20)
big_breaks <- rpois(100, 20)
microbenchmark(
  interval = under_breaks(big_a, big_breaks),
  purrr = func_purrr(big_a, big_breaks),
  vapply = func_vapply(big_a, big_breaks)
)
# Unit: microseconds
#      expr   min     lq    mean median     uq    max neval
#  interval  64.0  73.95  99.303  81.05 104.35  539.6   100
#     purrr 301.6 318.10 456.909 343.45 429.75 6590.5   100
#    vapply 236.1 239.95 307.319 248.05 355.65 2274.1   100

Of course, optimization's only worth the sacrifice if the task was actually taking too long.

system · September 25, 2019, 7:36pm

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.