Integer, Intent (In)	::	n1, n2, ntype
Integer, Intent (Inout)	::	ifail
Real (Kind=nag_wp), Intent (In)	::	x(n1), y(n2)
Real (Kind=nag_wp), Intent (Out)	::	d, z, p, sx(n1), sy(n2)

C Header Interface

#include <nag.h>

void	g08cdf_ (const Integer *n1, const double x[], const Integer *n2, const double y[], const Integer *ntype, double *d, double *z, double *p, double sx[], double sy[], Integer *ifail)

The routine may be called by the names g08cdf or nagf_nonpar_test_ks_2sample.

3 Description

The data consists of two independent samples, one of size $n_{1}$, denoted by $x_{1}, x_{2}, \dots, x_{n_{1}}$, and the other of size $n_{2}$, denoted by $y_{1}, y_{2}, \dots, y_{n_{2}}$. Let $F (x)$ and $G (x)$ represent their respective, unknown, distribution functions. Also let $S_{1} (x)$ and $S_{2} (x)$ denote the values of the sample cumulative distribution functions at the point $x$ for the two samples respectively.

The Kolmogorov–Smirnov test provides a test of the null hypothesis $H_{0}$ : $F (x) = G (x)$ against one of the following alternative hypotheses:

(i) $H_{1}$ : $F (x) \neq G (x)$ .
(ii) $H_{2}$ : $F (x) > G (x)$ . This alternative hypothesis is sometimes stated as, ‘The $x$ 's tend to be smaller than the $y$ 's’, i.e., it would be demonstrated in practical terms if the values of $S_{1} (x)$ tended to exceed the corresponding values of $S_{2} (x)$ .
(iii) $H_{3}$ : $F (x) < G (x)$ . This alternative hypothesis is sometimes stated as, ‘The $x$ 's tend to be larger than the $y$ 's’, i.e., it would be demonstrated in practical terms if the values of $S_{2} (x)$ tended to exceed the corresponding values of $S_{1} (x)$ .

One of the following test statistics is computed depending on the particular alternative hypothesis specified (see the description of the argument ntype in Section 5).

For the alternative hypothesis $H_{1}$:

$D_{n_{1}, n_{2}}$ – the largest absolute deviation between the two sample cumulative distribution functions.

For the alternative hypothesis $H_{2}$:

$D_{n_{1}, n_{2}}^{+}$ – the largest positive deviation between the sample cumulative distribution function of the first sample, $S_{1} (x)$, and the sample cumulative distribution function of the second sample, $S_{2} (x)$. Formally $D_{n_{1}, n_{2}}^{+} = \max_{x} \{S_{1} (x) - S_{2} (x), 0\}$.

For the alternative hypothesis $H_{3}$:

$D_{n_{1}, n_{2}}^{-}$ – the largest positive deviation between the sample cumulative distribution function of the second sample, $S_{2} (x)$, and the sample cumulative distribution function of the first sample, $S_{1} (x)$. Formally $D_{n_{1}, n_{2}}^{-} = \max_{x} \{S_{2} (x) - S_{1} (x), 0\}$.
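The three statistics can be illustrated with a minimal Python sketch. It is not the NAG implementation; the function names `ecdf` and `ks_two_sample_statistic` are hypothetical, and only the `ntype` coding (1, 2, 3) is taken from this document. It relies on the fact that both sample cumulative distribution functions are step functions that change only at observed values.

```python
from bisect import bisect_right


def ecdf(sorted_sample, t):
    """Fraction of observations <= t; sorted_sample must be ascending."""
    return bisect_right(sorted_sample, t) / len(sorted_sample)


def ks_two_sample_statistic(x, y, ntype=1):
    """Return D (ntype=1), D+ (ntype=2) or D- (ntype=3) for samples x and y."""
    if ntype not in (1, 2, 3):
        raise ValueError("ntype must be 1, 2 or 3")
    sx, sy = sorted(x), sorted(y)
    d_plus = d_minus = 0.0  # the formal definitions take a maximum with 0
    # S1 - S2 is a right-continuous step function, so its extremes are
    # attained at the pooled observations.
    for t in sx + sy:
        diff = ecdf(sx, t) - ecdf(sy, t)
        d_plus = max(d_plus, diff)     # S1 above S2
        d_minus = max(d_minus, -diff)  # S2 above S1
    return {1: max(d_plus, d_minus), 2: d_plus, 3: d_minus}[ntype]
```

For example, with `x = [1, 2, 3]` and `y = [2.5, 3.5, 4.5]` the first sample lies to the left of the second, so $S_{1}$ never falls below $S_{2}$ and the statistic for $H_{3}$ is zero.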

g08cdf also returns the standardized statistic $Z = \sqrt{\frac{n_{1} + n_{2}}{n_{1} n_{2}}} \times D$, where $D$ may be $D_{n_{1}, n_{2}}$, $D_{n_{1}, n_{2}}^{+}$ or $D_{n_{1}, n_{2}}^{-}$ depending on the choice of the alternative hypothesis. The distribution of this statistic converges asymptotically to a distribution given by Smirnov as $n_{1}$ and $n_{2}$ increase; see Feller (1948), Kendall and Stuart (1973), Kim and Jenrich (1973), Smirnov (1933) or Smirnov (1948).
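As a small illustration, the formula for $Z$ as stated in this section can be evaluated directly. This is a sketch only: the function name `standardized_statistic` is hypothetical, and the value $D = 0.18$ is an arbitrary illustrative input, not a result taken from the library.

```python
import math


def standardized_statistic(d, n1, n2):
    """Z = sqrt((n1 + n2) / (n1 * n2)) * D, as written in the text above."""
    if n1 < 1 or n2 < 1:
        raise ValueError("both sample sizes must be at least 1")
    return math.sqrt((n1 + n2) / (n1 * n2)) * d


# Illustrative input: D = 0.18 with n1 = 100 and n2 = 50.
z = standardized_statistic(0.18, 100, 50)
print(f"Z = {z:.4f}")
```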

The probability, under the null hypothesis, of obtaining a value of the test statistic as extreme as that observed, is computed. If $\max (n_{1}, n_{2}) \leq 2500$ and $n_{1} n_{2} \leq 10000$ then an exact method given by Kim and Jenrich (see Kim and Jenrich (1973)) is used. Otherwise $p$ is computed using the approximations suggested by Kim and Jenrich (1973). Note that the method used is only exact for continuous theoretical distributions. This method computes the two-sided probability. The one-sided probabilities are estimated by halving the two-sided probability. This is a good estimate for small $p$, that is $p \leq 0.10$, but it becomes very poor for larger $p$.
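The size rule and the halving step can be sketched as follows. The exact two-sided probability below uses a standard lattice-path counting argument for continuous data; it is offered as an assumption-laden illustration and may differ from the Kim and Jenrich procedure used inside g08cdf. All function names are hypothetical.

```python
from math import comb


def uses_exact_method(n1, n2):
    """Size rule quoted in the text for choosing the exact method."""
    return max(n1, n2) <= 2500 and n1 * n2 <= 10000


def exact_two_sided_p(d, n1, n2):
    """P(D >= d) under H0 for continuous data, by counting lattice paths."""
    # n1 * n2 * D is always an integer, so compare on that integer scale.
    k = round(d * n1 * n2)
    # row[j] = number of monotone paths from (0, 0) to (i, j) on which
    # |i*n2 - j*n1| < k held at every visited point (D stayed below d).
    row = [0] * (n2 + 1)
    for i in range(n1 + 1):
        for j in range(n2 + 1):
            if abs(i * n2 - j * n1) >= k:
                row[j] = 0
            elif i == 0 and j == 0:
                row[j] = 1
            else:
                from_left = row[j - 1] if j > 0 else 0
                from_below = row[j] if i > 0 else 0  # still holds row i-1
                row[j] = from_left + from_below
    return 1 - row[n2] / comb(n1 + n2, n1)


def one_sided_estimate(p_two_sided):
    """One-sided tail probability estimated by halving, as described above."""
    return p_two_sided / 2
```

With $n_{1} = n_{2} = 2$ there are six equally likely orderings of the pooled sample, and $D = 1$ occurs in two of them, so `exact_two_sided_p(1.0, 2, 2)` evaluates to one third.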

4 References

Conover W J (1980) Practical Nonparametric Statistics Wiley

Feller W (1948) On the Kolmogorov–Smirnov limit theorems for empirical distributions Ann. Math. Statist. 19 179–181

Kendall M G and Stuart A (1973) The Advanced Theory of Statistics (Volume 2) (3rd Edition) Griffin

Kim P J and Jenrich R I (1973) Tables of exact sampling distribution of the two sample Kolmogorov–Smirnov criterion $D_{m n} (m < n)$ Selected Tables in Mathematical Statistics 1 80–129 American Mathematical Society

Siegel S (1956) Non-parametric Statistics for the Behavioral Sciences McGraw–Hill

Smirnov N (1933) Estimate of deviation between empirical distribution functions in two independent samples Bull. Moscow Univ. 2(2) 3–16

Smirnov N (1948) Table for estimating the goodness of fit of empirical distributions Ann. Math. Statist. 19 279–281

5 Arguments

1: $n1$ – Integer Input

On entry: the number of observations in the first sample, $n_{1}$.

Constraint: $n1 \geq 1$.

2: $x (n1)$ – Real (Kind=nag_wp) array Input

On entry: the observations from the first sample, $x_{1}, x_{2}, \dots, x_{n_{1}}$.

3: $n2$ – Integer Input

On entry: the number of observations in the second sample, $n_{2}$.

Constraint: $n2 \geq 1$.

4: $y (n2)$ – Real (Kind=nag_wp) array Input

On entry: the observations from the second sample, $y_{1}, y_{2}, \dots, y_{n_{2}}$.

5: $ntype$ – Integer Input

On entry: the statistic to be computed, i.e., the choice of alternative hypothesis.

$ntype = 1$: Computes $D_{n_{1} n_{2}}$ , to test against $H_{1}$ .
$ntype = 2$: Computes $D_{n_{1} n_{2}}^{+}$ , to test against $H_{2}$ .
$ntype = 3$: Computes $D_{n_{1} n_{2}}^{-}$ , to test against $H_{3}$ .

Constraint: $ntype = 1$, $2$ or $3$.

6: $d$ – Real (Kind=nag_wp) Output

On exit: the Kolmogorov–Smirnov test statistic ($D_{n_{1} n_{2}}$, $D_{n_{1} n_{2}}^{+}$ or $D_{n_{1} n_{2}}^{-}$ according to the value of ntype).

7: $z$ – Real (Kind=nag_wp) Output

On exit: a standardized value, $Z$, of the test statistic, $D$, without any correction for continuity.

8: $p$ – Real (Kind=nag_wp) Output

On exit: the tail probability associated with the observed value of $D$, where $D$ may be $D_{n_{1}, n_{2}}$, $D_{n_{1}, n_{2}}^{+}$ or $D_{n_{1}, n_{2}}^{-}$ depending on the value of ntype (see Section 3).

9: $sx (n1)$ – Real (Kind=nag_wp) array Output

On exit: the observations from the first sample sorted in ascending order.

10: $sy (n2)$ – Real (Kind=nag_wp) array Output

On exit: the observations from the second sample sorted in ascending order.

11: $ifail$ – Integer Input/Output

On entry: ifail must be set to $0$, $-1$ or $1$ to set behaviour on detection of an error; these values have no effect when no error is detected.

A value of $0$ causes the printing of an error message and program execution will be halted; otherwise program execution continues. A value of $-1$ means that an error message is printed while a value of $1$ means that it is not.

If halting is not appropriate, the value $-1$ or $1$ is recommended. If message printing is undesirable, then the value $1$ is recommended. Otherwise, the value $0$ is recommended. When the value $-1$ or $1$ is used it is essential to test the value of ifail on exit.

On exit: $ifail = 0$ unless the routine detects an error or a warning has been flagged (see Section 6).

6 Error Indicators and Warnings

If on entry $ifail = 0$ or $-1$, explanatory error messages are output on the current error message unit (as defined by x04aaf).

Errors or warnings detected by the routine:

$ifail = 1$: On entry, $n1 = 〈value〉$ .
Constraint: $n1 \geq 1$ .

On entry, $n2 = 〈value〉$ .
Constraint: $n2 \geq 1$ .

$ifail = 2$: On entry, $ntype = 〈value〉$ .
Constraint: $ntype = 1$ , $2$ or $3$ .

$ifail = 3$: The iterative process used in the approximation of the probability for large $n_{1}$ and $n_{2}$ did not converge. For the two-sided test $p = 1$ is returned. For the one-sided test $p = 0.5$ is returned.

$ifail = - 99$: An unexpected error has been triggered by this routine. Please contact NAG.
See Section 7 in the Introduction to the NAG Library FL Interface for further information.

$ifail = - 399$: Your licence key may have expired or may not have been installed correctly.
See Section 8 in the Introduction to the NAG Library FL Interface for further information.

$ifail = - 999$: Dynamic memory allocation failed.
See Section 9 in the Introduction to the NAG Library FL Interface for further information.
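The input constraints behind $ifail = 1$ and $ifail = 2$ can be mirrored in a caller-side check before the routine is invoked. The Python sketch below is hypothetical: `check_arguments` is not part of the NAG Library, and the order in which the conditions are tested is an assumption.

```python
def check_arguments(n1, n2, ntype):
    """Return the ifail value the text associates with invalid input, else 0."""
    # ifail = 1: either sample size violates the constraint n >= 1.
    if n1 < 1 or n2 < 1:
        return 1
    # ifail = 2: ntype must select one of the three alternative hypotheses.
    if ntype not in (1, 2, 3):
        return 2
    return 0
```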

7 Accuracy

The large sample distributions used as approximations to the exact distribution should have a relative error of less than 5% for most cases.

8 Parallelism and Performance

g08cdf is threaded by NAG for parallel execution in multithreaded implementations of the NAG Library.

Please consult the X06 Chapter Introduction for information on how to control and interrogate the OpenMP environment used within this routine. Please also consult the Users' Note for your implementation for any additional implementation-specific information.

9 Further Comments

The time taken by g08cdf increases with $n_{1}$ and $n_{2}$, until $n_{1} n_{2} > 10000$ or $\max (n_{1}, n_{2}) \geq 2500$. At this point one of the approximations is used and the time decreases significantly. The time then increases again modestly with $n_{1}$ and $n_{2}$.

10 Example

This example computes the two-sided Kolmogorov–Smirnov test statistic for two independent samples of size $100$ and $50$ respectively. The first sample is from a uniform distribution $U (0, 2)$. The second sample is from a uniform distribution $U (0.25, 2.25)$. The test statistic, $D_{n_{1}, n_{2}}$, the standardized test statistic, $Z$, and the tail probability, $p$, are computed and printed.
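A comparable experiment can be sketched in Python without the NAG Library. The samples here are freshly simulated with an arbitrary seed, so the printed values will not match the library's program results; the helper `ks_d` is hypothetical, and $Z$ follows the formula given in Section 3.

```python
import math
import random
from bisect import bisect_right


def ks_d(x, y):
    """Two-sided statistic: largest absolute gap between the two sample CDFs."""
    sx, sy = sorted(x), sorted(y)
    return max(
        abs(bisect_right(sx, t) / len(sx) - bisect_right(sy, t) / len(sy))
        for t in sx + sy
    )


rng = random.Random(42)  # arbitrary seed, chosen only for repeatability
x = [rng.uniform(0.0, 2.0) for _ in range(100)]   # sample from U(0, 2)
y = [rng.uniform(0.25, 2.25) for _ in range(50)]  # sample from U(0.25, 2.25)

d = ks_d(x, y)
z = math.sqrt((len(x) + len(y)) / (len(x) * len(y))) * d
print(f"D = {d:.4f}, Z = {z:.4f}")
```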

NAG FL Interface
g08cdf (test_ks_2sample)