Based on the adaptive counting approach of: Fast and Accurate Traffic Matrix Measurement Using Adaptive Cardinality Counting
by: Cai, Pan, Kwok, and Hwang
The following calculations are taken from:
http://www.cs.wisc.edu/~cao/papers/summary-cache/node8.html
"Bloom Filters - the math"
This class's static methods are meant to facilitate the use of the Bloom
Filter class by helping to choose correct values of 'bits per element' and
'number of hash functions, k'.
Given a maximum tolerable false positive probability, compute a Bloom
specification which will give less than the specified false positive rate,
but minimize the number of buckets per element and the number of hash
functions used.
Based on the Space-Saving algorithm and the Stream-Summary
data structure as described in:
Efficient Computation of Frequent and Top-k Elements in Data Streams
by Metwally, Agrawal, and Abbadi
Ideally used in multithreaded applications, otherwise see StreamSummary
Java implementation of HyperLogLog (HLL) algorithm from this paper:
http://algo.inria.fr/flajolet/Publications/FlFuGaMe07.pdf
HLL is an improved version of LogLog that is capable of estimating
the cardinality of a set with accuracy = 1.04/sqrt(m) where
m = 2^b.
Merge this HLL++ with a bunch of others! The power of minions!
Most of the logic consists of case analysis about the state of this HLL++ and each one it wants to merge
with.
For cardinalities less than 4.25M, obyCount provides a LinearCounting Builder
(see LinearCounting.Builder.onePercentError() ) using only the
space required to provide estimates within 1% of the actual cardinality,
up to ~65k.
Returns a LinearCounting.Builder that generates an LC
estimator which keeps estimates below 1% error on average and has
a low likelihood of saturation (0.7%) for any stream with
cardinality less than maxCardinality
Based on the Space-Saving algorithm and the Stream-Summary
data structure as described in:
Efficient Computation of Frequent and Top-k Elements in Data Streams
by Metwally, Agrawal, and Abbadi
Simple TopK command line utility
Usage:
> topk [capacity] [update-rate]
capacity : size of top / k (defaults to 1000)
update-rate: output results after every update-rate elements/lines
Example:
> cat elements.txt | topk 10
TopK() - Constructor for class com.clearspring.analytics.util.TopK