Parallelized sampling using exponential variates
How can the seemingly iterative process of weighted sampling without replacement be transformed into something highly parallelizable? Turns out a well-known technique based on exponential variates accomplishes exactly that.
Yitao Li - July 28, 2020
As part of our recent work to support weighted sampling of Spark data frames in
sparklyr , we embarked on a journey searching for algorithms that can perform weighted sampling, especially sampling without replacement, in efficient and scalable ways within a distributed cluster-computing framework, such as Apache Spark.
In the interest of brevity, “weighted sampling without replacement” shall be shortened into SWoR for the remainder of this blog post.
In the following sections, we will explain and illustrate what SWoR means probability-wise, briefly outline some alternative solutions we have considered but were not completely satisfied with, and then deep-dive into exponential variates, a simple mathematical construct that made the ideal solution for this problem possible.
If you cannot wait to jump into action, there is also a section in which we showcase example usages of
sparklyr . In addition, you can examine the implementation detail of
sparklyr::sdf_weighted_sample() in this pull request.