PDF version (NAG web site
, 64-bit version, 64-bit version)
NAG Toolbox: nag_stat_quantiles_stream_arbitrary (g01ap)
Purpose
nag_stat_quantiles_stream_arbitrary (g01ap) finds approximate quantiles from a large arbitrary-sized data stream using an out-of-core algorithm.
Syntax
[
ind,
np,
qv,
rcomm,
icomm,
ifail] = g01ap(
ind,
rv,
eps,
q,
rcomm,
icomm, 'nb',
nb, 'nq',
nq, 'lrcomm',
lrcomm, 'licomm',
licomm)
[
ind,
np,
qv,
rcomm,
icomm,
ifail] = nag_stat_quantiles_stream_arbitrary(
ind,
rv,
eps,
q,
rcomm,
icomm, 'nb',
nb, 'nq',
nq, 'lrcomm',
lrcomm, 'licomm',
licomm)
Description
A quantile is a value which divides a frequency distribution such that there is a given proportion of data values below the quantile. For example, the median of a dataset is the quantile because half the values are less than or equal to it.
nag_stat_quantiles_stream_arbitrary (g01ap) uses a slightly modified version of an algorithm described in a paper by
Zhang and Wang (2007) to determine
-approximate quantiles of a large arbitrary-sized data stream of real values, where
is a user-defined approximation factor. Let
denote the number of data elements processed so far then, given any quantile
, an
-approximate quantile is defined as an element in the data stream whose rank falls within
. In case of more than one
-approximate quantile being available, the one closest to
is used.
References
Zhang Q and Wang W (2007) A fast algorithm for approximate quantiles in high speed data streams Proceedings of the 19th International Conference on Scientific and Statistical Database Management IEEE Computer Society 29
Parameters
Compulsory Input Parameters
- 1:
– int64int32nag_int scalar
-
On initial entry: must be set to .
Indicates the action required in the current call to
nag_stat_quantiles_stream_arbitrary (g01ap).
- Initialize the communication arrays and attempt to process the first nb values from the data stream. eps, rv and nb must be set and licomm must be at least .
- Attempt to process the next block of nb values from the data stream. The calling program must update rv and (if required) nb, and re-enter nag_stat_quantiles_stream_arbitrary (g01ap) with all other parameters unchanged.
- Continue calculation following the reallocation of either or both of the communication arrays rcomm and icomm.
- Calculate the nq -approximate quantiles specified in q. The calling program must set q and nq and re-enter nag_stat_quantiles_stream_arbitrary (g01ap) with all other parameters unchanged. This option can be chosen only when .
Constraint:
, , or .
- 2:
– double array
-
The dimension of the array
rv
must be at least
if
,
or
If
,
or
, the vector containing the current block of data, otherwise
rv is not referenced.
- 3:
– double scalar
-
Approximation factor .
Constraint:
.
- 4:
– double array
-
The dimension of the array
q
must be at least
if
If
, the quantiles to be calculated, otherwise
q is not referenced. Note that
, corresponds to the minimum value and
to the maximum value.
Constraint:
if , , for .
- 5:
– double array
-
If
or
then the first
elements of
rcomm as supplied to
nag_stat_quantiles_stream_arbitrary (g01ap) must be identical to the first
elements of
rcomm returned from the last call to
nag_stat_quantiles_stream_arbitrary (g01ap), where
is the value of
lrcomm used in the last call. In other words, the contents of
rcomm must not be altered between calls to this function. If
rcomm needs to be reallocated then its contents must be preserved. If
then
rcomm need not be set.
- 6:
– int64int32nag_int array
-
If
or
then the first
elements of
icomm as supplied to
nag_stat_quantiles_stream_arbitrary (g01ap) must be identical to the first
elements of
icomm returned from the last call to
nag_stat_quantiles_stream_arbitrary (g01ap), where
is the value of
licomm used in the last call. In other words, the contents of
icomm must not be altered between calls to this function. If
icomm needs to be reallocated then its contents must be preserved. If
then
icomm need not be set.
Optional Input Parameters
- 1:
– int64int32nag_int scalar
-
Default:
the dimension of the array
rv.
If
,
or
, the size of the current block of data. The size of blocks of data in array
rv can vary; therefore
nb can change between calls to
nag_stat_quantiles_stream_arbitrary (g01ap).
Constraint:
if , or , .
- 2:
– int64int32nag_int scalar
-
Default:
the dimension of the array
q.
If
, the number of quantiles requested, otherwise
nq is not referenced.
Constraint:
if , .
- 3:
– int64int32nag_int scalar
-
Default:
the dimension of the array
rcomm.
The dimension of the array
rcomm.
Constraints:
- if , ;
- otherwise .
- 4:
– int64int32nag_int scalar
-
Default:
the dimension of the array
icomm.
The dimension of the array
icomm.
Constraints:
- if , ;
- otherwise .
Output Parameters
- 1:
– int64int32nag_int scalar
-
Indicates output from the call.
- nag_stat_quantiles_stream_arbitrary (g01ap) has processed np data points and expects to be called again with additional data.
- Either one or more of the communication arrays rcomm and icomm is too small. The new minimum lengths of rcomm and icomm have been returned in and respectively. If the new minimum length is greater than the current length then the corresponding communication array needs to be reallocated, its contents preserved and nag_stat_quantiles_stream_arbitrary (g01ap) called again with all other parameters unchanged.
If there is more data to be processed, it is recommended that
lrcomm and
licomm are made significantly bigger than the minimum to limit the number of reallocations.
- nag_stat_quantiles_stream_arbitrary (g01ap) has returned the requested -approximate quantiles in qv. These quantiles are based on np data points.
- 2:
– int64int32nag_int scalar
-
, the number of elements processed so far.
- 3:
– double array
-
The dimension of the array
qv will be
if
If , contains the -approximate quantiles specified by the value provided in .
- 4:
– double array
-
rcomm holds information required by subsequent calls to
nag_stat_quantiles_stream_arbitrary (g01ap)
- 5:
– int64int32nag_int array
-
holds the minimum required length for
rcomm and
holds the minimum required length for
icomm. The remaining elements of
icomm are used for communication between subsequent calls to
nag_stat_quantiles_stream_arbitrary (g01ap).
- 6:
– int64int32nag_int scalar
unless the function detects an error (see
Error Indicators and Warnings).
As an out-of-core function
nag_stat_quantiles_stream_arbitrary (g01ap) will only perform certain argument checks when a data checkpoint (including completion of data input) is signaled. As such it will usually be inappropriate to halt program execution when an error is detected since any errors may be subsequently resolved without losing any processing already carried out. Therefore setting
ifail to a value of
is recommended. If the output of error messages is undesirable, then the value
is recommended.
When the value is used it is essential to test the value of ifail on exit.
Error Indicators and Warnings
Errors or warnings detected by the function:
-
-
Constraint: , , or .
-
-
Constraint: .
-
-
Constraint: if , or then .
-
-
Constraint: .
-
-
Constraint: .
-
-
The contents of
icomm have been altered between calls to this function.
-
-
The contents of
rcomm have been altered between calls to this function.
-
-
Number of data elements streamed,
is not sufficient for a quantile query when .
Supply more data or reprocess the data with a higher
eps value.
-
-
Constraint: if then .
-
-
Constraint: if then for all .
-
An unexpected error has been triggered by this routine. Please
contact
NAG.
-
Your licence key may have expired or may not have been installed correctly.
-
Dynamic memory allocation failed.
Accuracy
Not applicable.
Further Comments
The average time taken by nag_stat_quantiles_stream_arbitrary (g01ap) scales as .
It is not possible to determine in advance the final size of the communication arrays
rcomm and
icomm without knowing the size of the dataset. However, if a rough size (
) is known, the speed of the computation can be increased if the sizes of the communication arrays are not smaller than
where
Example
This example computes a list of -approximate quantiles. The data is processed in blocks of observations at a time to simulate a situation in which the data is made available in a piecemeal fashion.
Open in the MATLAB editor:
g01ap_example
function g01ap_example
fprintf('g01ap example results\n\n');
n = 0;
tol = 0.2;
q = [0.25 0.5 1.0];
nb = 20;
datastream = ['34.01 57.95 44.88 22.04 28.84 4.43 0.32 20.82 ' ...
'20.53 13.08 7.99 54.03 23.21 26.73 39.72 0.97 ' ...
'39.05 38.78 19.38 51.34 24.08 12.41 58.11 35.90 ' ...
'40.38 27.41 19.80 6.02 45.33 36.34 43.14 53.84 ' ...
'39.49 9.04 36.74 58.72 59.95 15.41 33.05 39.54 ' ...
'33.24 58.67 54.12 39.48 43.73 24.15 55.72 8.87 ' ...
'40.47 46.18 20.36 6.95 36.86 49.24 56.83 43.87 ' ...
'29.86 22.49 25.29 33.17'];
rcomm = zeros(100, 1);
icomm = zeros(400, 1, 'int64');
ind = int64(0);
repeat = true;
pos = 0;
while (repeat)
if (ind==0 || ind==1)
[rv, new_pos] = textscan(datastream(pos+1:end), '%f', nb);
pos = pos+new_pos;
nd = numel(rv{1});
if nd == 0
break;
elseif nd < nb
nb = nd;
repeat = false;
elseif pos == numel(datastream)
repeat = false;
end
n = n+nb;
end
[ind, np, qv, rcomm, icomm, ifail] = ...
g01ap(ind, rv{1}, tol, q, rcomm, icomm);
if ind == 2
if numel(rcomm) < icomm(1)
rcomm(icomm(1)) = 0;
end
if numel(icomm) < icomm(2)
icomm(icomm(2)) = 0;
end
end
end
ind = int64(3);
[ind, np, qv, rcomm, icomm, ifail] = ...
g01ap(ind, 0, tol, q, rcomm, icomm);
fprintf('\nInput Data:\n %d observations\n eps = %5.2f\n\n', n, tol);
fprintf('Quantile Result\n');
fprintf('%7.2f %7.2f\n', [q; qv']);
g01ap example results
Input Data:
60 observations
eps = 0.20
Quantile Result
0.25 22.49
0.50 39.54
1.00 59.95
PDF version (NAG web site
, 64-bit version, 64-bit version)
© The Numerical Algorithms Group Ltd, Oxford, UK. 2009–2015