NAG Library Routine Document

G01APF

A quantile is a value which divides a frequency distribution such that there is a given proportion of data values below the quantile. For example, the median of a dataset is the $0.5$ quantile because half the values are less than or equal to it.

G01APF uses a slightly modified version of an algorithm described in a paper by Zhang and Wang (2007) to determine $ε$-approximate quantiles of a large arbitrary-sized data stream of real values, where $ε$ is a user-defined approximation factor. Let $m$ denote the number of data elements processed so far; then, given any quantile $q \in [0.0, 1.0]$, an $ε$-approximate quantile is defined as an element in the data stream whose rank falls within $[(q - ε) m, (q + ε) m]$. In case of more than one $ε$-approximate quantile being available, the one closest to $q m$ is used.
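The rank-window definition above can be illustrated with a short Python sketch. It is not how G01APF works internally (the routine avoids holding and sorting the whole stream); it simply checks the definition against a fully sorted copy of the data. The function names are hypothetical, and the sketch assumes ranks are 1-based positions in ascending order.

```python
def rank_window(q, eps, m):
    """Return the rank interval [(q - eps) * m, (q + eps) * m]."""
    return (q - eps) * m, (q + eps) * m


def is_eps_approximate(sorted_data, value, q, eps):
    """True if `value` occupies a 1-based rank inside the window for q.

    Assumption: `sorted_data` is the whole stream, sorted ascending.
    """
    m = len(sorted_data)
    lo, hi = rank_window(q, eps, m)
    return any(
        lo <= rank <= hi
        for rank, v in enumerate(sorted_data, start=1)
        if v == value
    )


def pick_closest(candidate_ranks, q, m):
    """Among several admissible ranks, choose the one closest to q * m."""
    return min(candidate_ranks, key=lambda rank: abs(rank - q * m))


if __name__ == "__main__":
    data = list(range(1, 101))  # the value v has rank v
    print(is_eps_approximate(data, 45, q=0.5, eps=0.1))
    print(is_eps_approximate(data, 70, q=0.5, eps=0.1))
    print(pick_closest([42, 49, 58], q=0.5, m=100))
```

With $m = 100$, $q = 0.5$ and $ε = 0.1$ the admissible ranks run from about 40 to 60, so an element of rank 45 qualifies and one of rank 70 does not.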

4 References

Zhang Q and Wang W (2007) A fast algorithm for approximate quantiles in high speed data streams Proceedings of the 19th International Conference on Scientific and Statistical Database Management IEEE Computer Society 29

5 Parameters

1: $IND$ – INTEGER, Input/Output

On initial entry: must be set to $0$.

On entry: indicates the action required in the current call to G01APF.

$IND = 0$: Initialize the communication arrays and attempt to process the first NB values from the data stream. EPS, RV and NB must be set and LICOMM must be at least $10$.
$IND = 1$: Attempt to process the next block of NB values from the data stream. The calling program must update RV and (if required) NB, and re-enter G01APF with all other parameters unchanged.
$IND = 2$: Continue calculation following the reallocation of either or both of the communication arrays RCOMM and ICOMM.
$IND = 3$: Calculate the NQ $ε$-approximate quantiles specified in Q. The calling program must set Q and NQ and re-enter G01APF with all other parameters unchanged. This option can be chosen only when $NP \geq ⌈\exp (1.0) / EPS⌉$.

On exit: indicates output from the call.

$IND = 1$: G01APF has processed NP data points and expects to be called again with additional data.
$IND = 2$: One or both of the communication arrays RCOMM and ICOMM are too small. The new minimum lengths of RCOMM and ICOMM have been returned in $ICOMM (1)$ and $ICOMM (2)$ respectively. If the new minimum length is greater than the current length then the corresponding communication array needs to be reallocated, its contents preserved and G01APF called again with all other parameters unchanged.
If there is more data to be processed, it is recommended that LRCOMM and LICOMM are made significantly bigger than the minimum to limit the number of reallocations.
$IND = 3$: G01APF has returned the requested $ε$-approximate quantiles in QV. These quantiles are based on NP data points.

Constraint: $IND = 0$, $1$, $2$ or $3$.
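The condition $NP \geq ⌈\exp (1.0) / EPS⌉$ for a quantile query ($IND = 3$) is easy to evaluate before deciding how much data to stream. The following Python sketch only evaluates that bound; the function names are hypothetical and nothing here calls the NAG Library.

```python
import math


def min_points_for_query(eps):
    """Smallest NP for which a query is allowed: ceil(exp(1.0) / EPS)."""
    if not 0.0 < eps <= 1.0:
        raise ValueError("EPS must satisfy 0.0 < EPS <= 1.0")
    return math.ceil(math.exp(1.0) / eps)


def can_query(np_processed, eps):
    """True if NP data points are enough to request quantiles."""
    return np_processed >= min_points_for_query(eps)


if __name__ == "__main__":
    for eps in (1.0, 0.1, 0.01):
        print(eps, min_points_for_query(eps))
```

For example, with $EPS = 0.1$ at least 28 data points must have been processed, so a stream delivered in blocks of 20 needs two blocks before the first query.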

2: $RV (*)$ – REAL (KIND=nag_wp) array, Input

Note: the dimension of the array RV must be at least NB if $IND = 0$, $1$ or $2$.

On entry: if $IND = 0$, $1$ or $2$, the vector containing the current block of data, otherwise RV is not referenced.

3: $NB$ – INTEGER, Input

On entry: if $IND = 0$, $1$ or $2$, the size of the current block of data. The size of blocks of data in array RV can vary; therefore NB can change between calls to G01APF.

Constraint: if $IND = 0$, $1$ or $2$, $NB > 0$.

4: $EPS$ – REAL (KIND=nag_wp), Input

On entry: approximation factor $ε$.

Constraint: $EPS > 0.0$ and $EPS \leq 1.0$.

5: $NP$ – INTEGER, Output

On exit: $m$, the number of elements processed so far.

6: $Q (*)$ – REAL (KIND=nag_wp) array, Input

Note: the dimension of the array Q must be at least NQ if $IND = 3$.

On entry: if $IND = 3$, the quantiles to be calculated, otherwise Q is not referenced. Note that $Q (i) = 0.0$ corresponds to the minimum value and $Q (i) = 1.0$ to the maximum value.

Constraint: if $IND = 3$, $0.0 \leq Q (i) \leq 1.0$, for $i = 1, 2, \dots, NQ$.

7: $QV (*)$ – REAL (KIND=nag_wp) array, Output

Note: the dimension of the array QV must be at least NQ if $IND = 3$.

On exit: if $IND = 3$, $QV (i)$ contains the $ε$-approximate quantile specified by the value provided in $Q (i)$.

8: $NQ$ – INTEGER, Input

On entry: if $IND = 3$, the number of quantiles requested, otherwise NQ is not referenced.

Constraint: if $IND = 3$, $NQ > 0$.

9: $RCOMM (LRCOMM)$ – REAL (KIND=nag_wp) array, Communication Array

On entry: if $IND = 1$ or $2$ then the first $l$ elements of RCOMM as supplied to G01APF must be identical to the first $l$ elements of RCOMM returned from the last call to G01APF, where $l$ is the value of LRCOMM used in the last call. In other words, the contents of RCOMM must not be altered between calls to this routine. If RCOMM needs to be reallocated then its contents must be preserved. If $IND = 0$ then RCOMM need not be set.

On exit: RCOMM holds information required by subsequent calls to G01APF.

10: $LRCOMM$ – INTEGER, Input

On entry: the dimension of the array RCOMM as declared in the (sub)program from which G01APF is called.

Constraints:

if $IND = 0$, $LRCOMM \geq 1$;
otherwise $LRCOMM \geq ICOMM (1)$.

11: $ICOMM (LICOMM)$ – INTEGER array, Communication Array

On entry: if $IND = 1$ or $2$ then the first $l$ elements of ICOMM as supplied to G01APF must be identical to the first $l$ elements of ICOMM returned from the last call to G01APF, where $l$ is the value of LICOMM used in the last call. In other words, the contents of ICOMM must not be altered between calls to this routine. If ICOMM needs to be reallocated then its contents must be preserved. If $IND = 0$ then ICOMM need not be set.

On exit: $ICOMM (1)$ holds the minimum required length for RCOMM and $ICOMM (2)$ holds the minimum required length for ICOMM. The remaining elements of ICOMM are used for communication between subsequent calls to G01APF.
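The reallocation rule (grow an array when the reported minimum exceeds its current length, keep the existing contents, and preferably leave some headroom) can be sketched in Python as follows. This is an illustration of the bookkeeping only, under stated assumptions: the helper names and the `headroom` factor are hypothetical, Python lists stand in for the Fortran arrays, and `icomm[0]`, `icomm[1]` correspond to $ICOMM (1)$, $ICOMM (2)$.

```python
def grow_preserving(buf, min_len, headroom=2.0, fill=0):
    """Return a list of at least `min_len` elements whose leading part equals `buf`."""
    if min_len <= len(buf):
        return buf  # already long enough, nothing to reallocate
    # Allocate more than the minimum to limit the number of reallocations.
    new_len = max(min_len, int(min_len * headroom))
    return buf + [fill] * (new_len - len(buf))


def reallocate_comm(rcomm, icomm, headroom=2.0):
    """Grow both communication arrays to the minimum lengths stored in icomm."""
    need_r, need_i = icomm[0], icomm[1]  # ICOMM(1), ICOMM(2) in Fortran indexing
    new_rcomm = grow_preserving(rcomm, need_r, headroom, fill=0.0)
    new_icomm = grow_preserving(icomm, need_i, headroom, fill=0)
    return new_rcomm, new_icomm


if __name__ == "__main__":
    rcomm = [1.5, 2.5]
    icomm = [5, 12] + [7] * 8  # current length 10, new minimum 12
    rcomm, icomm = reallocate_comm(rcomm, icomm)
    print(len(rcomm), len(icomm))
```

After such a reallocation the routine would be re-entered with $IND = 2$ and the new array lengths, all other parameters unchanged.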

12: $LICOMM$ – INTEGER, Input

On entry: the dimension of the array ICOMM as declared in the (sub)program from which G01APF is called.

Constraints:

if $IND = 0$, $LICOMM \geq 10$;
otherwise $LICOMM \geq ICOMM (2)$.

13: $IFAIL$ – INTEGER, Input/Output

On entry: IFAIL must be set to $0$, $-1$ or $1$. If you are unfamiliar with this parameter you should refer to Section 3.3 in the Essential Introduction for details.

On exit: $IFAIL = 0$ unless the routine detects an error (see Section 6).

As an out-of-core routine G01APF will only perform certain parameter checks when a data checkpoint (including completion of data input) is signaled. As such it will usually be inappropriate to halt program execution when an error is detected since any errors may be subsequently resolved without losing any processing already carried out. Therefore setting IFAIL to a value of $-1$ or $1$ is recommended. If the output of error messages is undesirable, then the value $1$ is recommended. When the value $-1$ or $1$ is used it is essential to test the value of IFAIL on exit.

6 Error Indicators and Warnings

If on entry $IFAIL = 0$ or $-1$, explanatory error messages are output on the current error message unit (as defined by X04AAF).

Errors or warnings detected by the routine:

$IFAIL = 1$: On entry, $IND = 〈value〉$.
Constraint: $IND = 0$, $1$, $2$ or $3$.

$IFAIL = 2$: On entry, $EPS = 〈value〉$.
Constraint: $0.0 < EPS \leq 1.0$.

$IFAIL = 3$: On entry, $IND = 0$, $1$ or $2$ and $NB = 〈value〉$.
Constraint: if $IND = 0$, $1$ or $2$ then $NB > 0$.

$IFAIL = 4$: On entry, $LICOMM = 〈value〉$.
Constraint: $LICOMM \geq 10$.

$IFAIL = 5$: On entry, $LRCOMM = 〈value〉$.
Constraint: $LRCOMM \geq 1$.

$IFAIL = 6$: The contents of ICOMM have been altered between calls to this routine.

$IFAIL = 7$: The contents of RCOMM have been altered between calls to this routine.

$IFAIL = 8$: Number of data elements streamed, $〈value〉$ is not sufficient for a quantile query when $EPS = 〈value〉$.
Supply more data or reprocess the data with a higher EPS value.

$IFAIL = 9$: On entry, $IND = 3$ and $NQ = 〈value〉$.
Constraint: if $IND = 3$ then $NQ > 0$.

$IFAIL = 10$: On entry, $IND = 3$ and $Q (〈value〉) = 〈value〉$.
Constraint: if $IND = 3$ then $0.0 \leq Q (i) \leq 1.0$ for all $i$.

$IFAIL = -99$: An unexpected error has been triggered by this routine. Please contact NAG.
See Section 3.8 in the Essential Introduction for further information.

$IFAIL = -399$: Your licence key may have expired or may not have been installed correctly.
See Section 3.7 in the Essential Introduction for further information.

$IFAIL = -999$: Dynamic memory allocation failed.
See Section 3.6 in the Essential Introduction for further information.

7 Accuracy

Not applicable.

8 Parallelism and Performance

G01APF is threaded by NAG for parallel execution in multithreaded implementations of the NAG Library.

Please consult the X06 Chapter Introduction for information on how to control and interrogate the OpenMP environment used within this routine. Please also consult the Users' Note for your implementation for any additional implementation-specific information.

9 Further Comments

The average time taken by G01APF scales as $NP \log (1 / ε \log (ε NP))$.

It is not possible to determine in advance the final size of the communication arrays RCOMM and ICOMM without knowing the size of the dataset. However, if a rough size ($n$) is known, the speed of the computation can be increased if the sizes of the communication arrays are not smaller than

\begin{array}{lcl}
LRCOMM & = & (\log_{2} (n \times EPS + 1.0) - 2) \times ⌈1.0 / EPS⌉ + 1 + x + 2 \times \min (x, ⌈x / 2.0⌉ + 1) \times y + 1 \\
LICOMM & = & (\log_{2} (n \times EPS + 1.0) - 2) \times (2 \times (⌈1.0 / EPS⌉ + 1) + 1) + \\
 & & 2 \times (x + 2 \times \min (x, ⌈x / 2.0⌉ + 1) \times y) + y + 11
\end{array}

where

\begin{array}{l}
x = \max (1, ⌊\log (EPS \times n) / EPS⌋) \\
y = \log_{2} (n / x + 1.0) + 1 .
\end{array}
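The size estimates above can be evaluated ahead of the first call. The following Python sketch transcribes the two formulas term by term. It rests on several assumptions that the text does not state: $\log$ in the definition of $x$ is taken to be the natural logarithm, the real-valued results are rounded up to integers, and the results are never allowed below the documented minimums ($LRCOMM \geq 1$, $LICOMM \geq 10$). The function name is hypothetical.

```python
import math


def estimate_comm_lengths(n, eps):
    """Return (LRCOMM, LICOMM) estimates for a stream of roughly n values."""
    if n <= 0 or not 0.0 < eps <= 1.0:
        raise ValueError("need n > 0 and 0.0 < EPS <= 1.0")

    # Assumption: log in x is the natural logarithm; log_2 is used elsewhere.
    x = max(1, math.floor(math.log(eps * n) / eps))
    y = math.log2(n / x + 1.0) + 1.0

    levels = math.log2(n * eps + 1.0) - 2.0
    inv_eps = math.ceil(1.0 / eps)
    tail = x + 2 * min(x, math.ceil(x / 2.0) + 1) * y

    lrcomm = levels * inv_eps + 1 + tail + 1
    licomm = levels * (2 * (inv_eps + 1) + 1) + 2 * tail + y + 11

    # Assumption: round up and respect the documented minimum lengths.
    return max(1, math.ceil(lrcomm)), max(10, math.ceil(licomm))


if __name__ == "__main__":
    print(estimate_comm_lengths(1_000_000, 0.01))
```

If the dataset turns out larger than the rough size $n$, the routine still reports the required lengths through $ICOMM (1)$ and $ICOMM (2)$, so an underestimate costs extra reallocations rather than correctness.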

10 Example

This example computes a list of $ε$-approximate quantiles. The data is processed in blocks of $20$ observations at a time to simulate a situation in which the data is made available in a piecemeal fashion.
