@Beta @GwtIncompatible public final class Quantiles extends Object
To compute the median:
double myMedian = median().compute(myDataset);
where median() has been statically imported.
To compute the 99th percentile:
double myPercentile99 = percentiles().index(99).compute(myDataset);
where percentiles() has been statically imported.
To compute median and the 90th and 99th percentiles:
Map<Integer, Double> myPercentiles =
percentiles().indexes(50, 90, 99).compute(myDataset);
where percentiles() has been statically imported. The resulting myPercentiles maps the keys
50, 90, and 99 to their corresponding quantile values.
To compute quartiles, use quartiles() instead of percentiles(). To compute arbitrary
q-quantiles, use scale(q).
These examples all take a copy of your dataset. If you have a double array, you are okay with
it being arbitrarily reordered, and you want to avoid that copy, you can use computeInPlace
instead of compute.
The definition of the kth q-quantile of N values is as follows: define x = k * (N - 1) / q; if
x is an integer, the result is the value which would appear at index x in the sorted dataset
(unless there are NaN values, see below); otherwise, the result is the average of the values
which would appear at the indexes floor(x) and ceil(x), weighted by (1-frac(x)) and frac(x)
respectively. This is the same definition as used by Excel and by S. It is the Type 7
definition in R, and it is described by Wikipedia as providing "Linear interpolation of the
modes for the order statistics for the uniform distribution on [0,1]."
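The definition above can be sketched directly in plain Java. This is only an illustration of the formula, not the library's implementation: it sorts a copy (so it runs in O(N log N)), and it assumes a non-empty dataset of finite values with 0 <= k <= q. The class and method names are hypothetical.

```java
import java.util.Arrays;

final class QuantileDefinitionSketch {
    /** Returns the kth q-quantile of a non-empty, finite dataset by sorting a copy. */
    static double quantile(double[] values, int k, int q) {
        double[] sorted = values.clone();
        Arrays.sort(sorted);
        // x = k * (N - 1) / q, kept as an exact quotient and remainder.
        long numerator = (long) k * (sorted.length - 1);
        int lower = (int) (numerator / q);   // floor(x)
        long remainder = numerator % q;
        if (remainder == 0) {
            return sorted[lower];            // x is an integer
        }
        double frac = (double) remainder / q; // frac(x)
        return sorted[lower] * (1 - frac) + sorted[lower + 1] * frac;
    }

    public static void main(String[] args) {
        double[] data = {4, 1, 3, 2};
        // Median: x = 1 * 3 / 2 = 1.5, so the result averages sorted[1] = 2 and sorted[2] = 3.
        System.out.println(quantile(data, 1, 2)); // 2.5
        // First quartile: x = 0.75, so the result is 0.25 * 1 + 0.75 * 2.
        System.out.println(quantile(data, 1, 4)); // 1.75
    }
}
```

Keeping x as a quotient and remainder avoids deciding "is x an integer" with floating-point arithmetic.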
If any values in the input are NaN then all values returned are NaN. (This is the one occasion
when the behaviour is not the same as you'd get from sorting with Arrays.sort(double[]) or
Collections.sort(List<Double>) and selecting the required value(s). Those methods would sort
NaN values as if they were greater than any other value and place them at the end of the
dataset, even after POSITIVE_INFINITY.)
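To see the sort ordering that the parenthetical contrasts with, the following plain-Java sketch sorts a small array containing NaN and shows a simple up-front NaN check. The class and helper names are hypothetical, and the check is only one way such a rule could be applied, not a description of the library's internals.

```java
import java.util.Arrays;

final class NaNOrderingSketch {
    /** Returns true if any element is NaN. */
    static boolean containsNaN(double[] values) {
        for (double v : values) {
            if (Double.isNaN(v)) {
                return true;
            }
        }
        return false;
    }

    /** Returns a sorted copy, leaving the argument untouched. */
    static double[] sortedCopy(double[] values) {
        double[] copy = values.clone();
        Arrays.sort(copy);
        return copy;
    }

    public static void main(String[] args) {
        double[] data = {Double.NaN, 1.0, Double.POSITIVE_INFINITY, -2.0};
        // Arrays.sort places NaN last, even after POSITIVE_INFINITY.
        System.out.println(Arrays.toString(sortedCopy(data))); // [-2.0, 1.0, Infinity, NaN]
        // A quantile computation following the rule above would return NaN for this input.
        System.out.println(containsNaN(data)); // true
    }
}
```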
Otherwise, NEGATIVE_INFINITY and POSITIVE_INFINITY sort to the beginning and the end of the
dataset, as you would expect.
If required to do a weighted average between an infinity and a finite value, or between an
infinite value and itself, the infinite value is returned. If required to do a weighted average
between NEGATIVE_INFINITY and POSITIVE_INFINITY, NaN is returned (note that this will only
happen if the dataset contains no finite values).
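These rules need explicit handling because the usual formula lower + (upper - lower) * frac yields NaN whenever an infinity is involved (for example, infinity minus infinity). A minimal sketch of an interpolation step that follows the rules above, assuming lower <= upper and neither is NaN; the class and method names are hypothetical and this is not claimed to be the library's code:

```java
final class InfinityInterpolationSketch {
    /** Weighted average of lower and upper with weights (1 - frac) and frac. */
    static double interpolate(double lower, double upper, double frac) {
        if (lower == Double.NEGATIVE_INFINITY) {
            // Opposite infinities have no meaningful average; otherwise -infinity wins.
            return upper == Double.POSITIVE_INFINITY
                    ? Double.NaN
                    : Double.NEGATIVE_INFINITY;
        }
        if (upper == Double.POSITIVE_INFINITY) {
            // Covers both (finite, +infinity) and (+infinity, +infinity).
            return Double.POSITIVE_INFINITY;
        }
        // Both values are finite here.
        return lower + (upper - lower) * frac;
    }

    public static void main(String[] args) {
        System.out.println(interpolate(2.0, 4.0, 0.5)); // 3.0
        System.out.println(interpolate(1.0, Double.POSITIVE_INFINITY, 0.25)); // Infinity
        System.out.println(
                interpolate(Double.NEGATIVE_INFINITY, Double.POSITIVE_INFINITY, 0.5)); // NaN
    }
}
```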
The average time complexity of the computation is O(N) in the size of the dataset. There is a worst case time complexity of O(N^2). You are extremely unlikely to hit this quadratic case on randomly ordered data (the probability decreases faster than exponentially in N), but if you are passing in unsanitized user data then a malicious user could force it. A light shuffle of the data using an unpredictable seed should normally be enough to thwart this attack.
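One way to apply the shuffle suggested above is to permute the array with an unpredictable random source before computing quantiles on untrusted input. The sketch below uses a full Fisher-Yates shuffle driven by SecureRandom, which is likely more than the "light shuffle" the text asks for; the class and method names are hypothetical.

```java
import java.security.SecureRandom;
import java.util.Random;

final class ShuffleSketch {
    /** Fisher-Yates shuffle in place, using the supplied random source. */
    static void shuffle(double[] values, Random random) {
        for (int i = values.length - 1; i > 0; i--) {
            int j = random.nextInt(i + 1); // uniform in [0, i]
            double tmp = values[i];
            values[i] = values[j];
            values[j] = tmp;
        }
    }

    /** Shuffles with a seed an attacker cannot predict. */
    static void shuffleUnpredictably(double[] values) {
        shuffle(values, new SecureRandom());
    }
}
```

The shuffle itself is O(N), so it does not change the average complexity of the overall computation. It reorders the array in place, so copy the data first if the original order matters.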
The time taken to compute multiple quantiles on the same dataset using indexes is generally
less than the total time taken to compute each of them separately, and sometimes much less.
For example, on a large enough dataset, computing the 90th and 99th percentiles together takes
about 55% as long as computing them separately.
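One plausible reason for such a saving, assuming a selection-based algorithm (which the stated O(N) average and O(N^2) worst case suggest, but which is an assumption here), is that work done to place one index narrows the range that must be searched for the next. The sketch below is a simple quickselect illustrating that idea; it is not the library's algorithm, the names are hypothetical, and it assumes no NaN values and strictly ascending indexes.

```java
final class MultiSelectSketch {
    /** Rearranges a[from..to] so that a[k] holds the value it would hold if sorted. */
    static void select(double[] a, int from, int to, int k) {
        while (from < to) {
            int p = partition(a, from, to);
            if (p == k) {
                return;
            }
            if (p < k) {
                from = p + 1;
            } else {
                to = p - 1;
            }
        }
    }

    /** Lomuto partition around the middle element; returns the pivot's final index. */
    static int partition(double[] a, int from, int to) {
        swap(a, from + (to - from) / 2, to);
        double pivot = a[to];
        int store = from;
        for (int i = from; i < to; i++) {
            if (a[i] < pivot) {
                swap(a, i, store++);
            }
        }
        swap(a, store, to);
        return store;
    }

    static void swap(double[] a, int i, int j) {
        double tmp = a[i];
        a[i] = a[j];
        a[j] = tmp;
    }

    /** Places every index in turn, searching only to the right of the previous one. */
    static void selectAll(double[] a, int[] ascendingIndexes) {
        int from = 0;
        for (int k : ascendingIndexes) {
            select(a, from, a.length - 1, k);
            from = k + 1; // everything at or left of k is already <= everything to its right
        }
    }
}
```

After the first selection, the second only has to partition the elements to the right of the first index, which for the 90th and 99th percentiles is about a tenth of the dataset.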
When calling compute (in either form), the memory requirement is 8*N bytes for the copy of the
dataset plus an overhead which is independent of N (but depends on the quantiles being
computed). When calling computeInPlace (in either form), only the overhead is required. The
number of object allocations is independent of N in both cases.
Modifier and Type | Class and Description |
---|---|
static class | Quantiles.Scale: Describes the point in a fluent API chain where only the scale (i.e. …) |
static class | Quantiles.ScaleAndIndex: Describes the point in a fluent API chain where the scale and a single quantile index (i.e. …) |
static class | Quantiles.ScaleAndIndexes: Describes the point in a fluent API chain where the scale and multiple quantile indexes (i.e. …) |
Constructor and Description |
---|
Quantiles() |
Modifier and Type | Method and Description |
---|---|
static Quantiles.ScaleAndIndex | median(): Specifies the computation of a median (i.e. …) |
static Quantiles.Scale | percentiles(): Specifies the computation of percentiles (i.e. …) |
static Quantiles.Scale | quartiles(): Specifies the computation of quartiles (i.e. …) |
static Quantiles.Scale | scale(int scale): Specifies the computation of q-quantiles. |
public Quantiles()
public static Quantiles.ScaleAndIndex median()
public static Quantiles.Scale quartiles()
public static Quantiles.Scale percentiles()
public static Quantiles.Scale scale(int scale)
scale - the scale for the quantiles to be calculated, i.e. the q of the q-quantiles, which
must be positive
Copyright © 2010-2016. All Rights Reserved.