g08cd:: Nonparametric Statistics (NAG Toolbox)

The data consists of two independent samples, one of size $n_{1}$, denoted by $x_{1}, x_{2}, \dots, x_{n_{1}}$, and the other of size $n_{2}$, denoted by $y_{1}, y_{2}, \dots, y_{n_{2}}$. Let $F (x)$ and $G (x)$ represent their respective, unknown, distribution functions. Also let $S_{1} (x)$ and $S_{2} (x)$ denote the values of the sample cumulative distribution functions at the point $x$ for the two samples respectively.

The Kolmogorov–Smirnov test provides a test of the null hypothesis $H_{0}$ : $F (x) = G (x)$ against one of the following alternative hypotheses:

(i)	$H_{1}$ : $F (x) \neq G (x)$.
(ii)	$H_{2}$ : $F (x) > G (x)$. This alternative hypothesis is sometimes stated as, ‘The $x$'s tend to be smaller than the $y$'s’, i.e., it would be demonstrated in practical terms if the values of $S_{1} (x)$ tended to exceed the corresponding values of $S_{2} (x)$.
(iii)	$H_{3}$ : $F (x) < G (x)$. This alternative hypothesis is sometimes stated as, ‘The $x$'s tend to be larger than the $y$'s’, i.e., it would be demonstrated in practical terms if the values of $S_{2} (x)$ tended to exceed the corresponding values of $S_{1} (x)$.

For the alternative hypothesis $H_{1}$, the test statistic is $D_{n_{1}, n_{2}}$, the largest absolute deviation between the two sample cumulative distribution functions. Formally $D_{n_{1}, n_{2}} = \max \{D_{n_{1}, n_{2}}^{+}, D_{n_{1}, n_{2}}^{-}\}$.

For the alternative hypothesis $H_{2}$, the test statistic is $D_{n_{1}, n_{2}}^{+}$, the largest positive deviation between the sample cumulative distribution function of the first sample, $S_{1} (x)$, and the sample cumulative distribution function of the second sample, $S_{2} (x)$. Formally $D_{n_{1}, n_{2}}^{+} = \max \{S_{1} (x) - S_{2} (x), 0\}$.

For the alternative hypothesis $H_{3}$, the test statistic is $D_{n_{1}, n_{2}}^{-}$, the largest positive deviation between the sample cumulative distribution function of the second sample, $S_{2} (x)$, and the sample cumulative distribution function of the first sample, $S_{1} (x)$. Formally $D_{n_{1}, n_{2}}^{-} = \max \{S_{2} (x) - S_{1} (x), 0\}$.
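The deviations between the two sample cumulative distribution functions can be computed directly from their definitions. The following is a minimal sketch in base MATLAB, not the algorithm used by the NAG routine; the function name ks2_statistics is hypothetical and is introduced here only for illustration. It relies on the fact that both step functions change value only at observed data points.

```matlab
% Hypothetical helper, not part of the NAG Toolbox: computes the three
% two-sample Kolmogorov-Smirnov statistics directly from their definitions.
function [d, dplus, dminus] = ks2_statistics(x, y)
    x = x(:);  y = y(:);
    n1 = numel(x);  n2 = numel(y);
    % S1 and S2 are step functions that only change at observed values,
    % so it is enough to evaluate them at every distinct data point.
    pts = unique([x; y]);
    s1 = arrayfun(@(t) sum(x <= t), pts) / n1;   % S1 at each point
    s2 = arrayfun(@(t) sum(y <= t), pts) / n2;   % S2 at each point
    dplus  = max(max(s1 - s2), 0);   % largest positive value of S1 - S2
    dminus = max(max(s2 - s1), 0);   % largest positive value of S2 - S1
    d      = max(dplus, dminus);     % largest absolute deviation
end
```

For instance, with x = [1 2 3] and y = [2.5 3.5 4.5], the first sample lies mostly below the second, so $S_{1} (x)$ runs ahead of $S_{2} (x)$: the sketch gives dplus = 2/3, dminus = 0 and d = 2/3.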

nag_nonpar_test_ks_2sample (g08cd) also returns the standardized statistic $Z = \sqrt{\frac{n_{1} + n_{2}}{n_{1} n_{2}}} \times D$, where $D$ may be $D_{n_{1}, n_{2}}$, $D_{n_{1}, n_{2}}^{+}$ or $D_{n_{1}, n_{2}}^{-}$ depending on the choice of the alternative hypothesis. The distribution of this statistic converges asymptotically to a distribution given by Smirnov as $n_{1}$ and $n_{2}$ increase; see Feller (1948), Kendall and Stuart (1973), Kim and Jenrich (1973), Smirnov (1933) or Smirnov (1948).
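As a worked illustration of the standardization, take sample sizes $n_{1} = 100$ and $n_{2} = 50$ and suppose, purely as an assumed value for the arithmetic, that $D = 0.2$:

```latex
\sqrt{\frac{n_{1} + n_{2}}{n_{1} n_{2}}}
  = \sqrt{\frac{100 + 50}{100 \times 50}}
  = \sqrt{0.03} \approx 0.1732,
\qquad
Z \approx 0.1732 \times 0.2 \approx 0.0346 .
```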

The probability, under the null hypothesis, of obtaining a value of the test statistic as extreme as that observed, is computed. If $\max (n_{1}, n_{2}) \leq 2500$ and $n_{1} n_{2} \leq 10000$ then an exact method given by Kim and Jenrich (1973) is used. Otherwise $p$ is computed using the approximations suggested by Kim and Jenrich (1973). Note that the method used is only exact for continuous theoretical distributions. This method computes the two-sided probability. The one-sided probabilities are estimated by halving the two-sided probability. This is a good estimate for small $p$, that is $p \leq 0.10$, but it becomes very poor for larger $p$.
References

Parameters

Compulsory Input Parameters

Optional Input Parameters

Output Parameters

Error Indicators and Warnings

Accuracy

Further Comments

The time taken by nag_nonpar_test_ks_2sample (g08cd) increases with $n_{1}$ and $n_{2}$, until $n_{1} n_{2} > 10000$ or $\max (n_{1}, n_{2}) \geq 2500$. At this point one of the approximations is used and the time decreases significantly. The time then increases again modestly with $n_{1}$ and $n_{2}$.

Example

This example computes the two-sided Kolmogorov–Smirnov test statistic for two independent samples of size $100$ and $50$ respectively. The first sample is from a uniform distribution $U (0, 2)$. The second sample is from a uniform distribution $U (0.25, 2.25)$. The test statistic, $D_{n_{1}, n_{2}}$, the standardized test statistic, $Z$, and the tail probability, $p$, are computed and printed.

function g08cd_example


fprintf('g08cd example results\n\n');

x = [ 1.160 1.785 0.322 1.437 1.695 1.770 1.209 0.479 1.122 0.974 ...
      0.290 1.155 0.218 1.595 1.053 1.058 1.282 1.278 1.066 0.725 ...
      0.113 1.516 1.329 1.907 0.101 0.387 1.392 0.613 0.692 1.397 ...
      1.627 0.417 1.079 0.607 0.899 0.493 0.381 1.660 0.233 0.718 ...
      1.376 1.395 1.557 1.610 1.632 0.851 1.824 0.921 0.139 0.618 ...
      0.050 0.956 0.669 1.109 1.882 1.462 1.465 0.201 1.036 1.127 ...
      0.907 0.876 1.199 1.667 1.141 0.820 0.488 0.732 0.725 0.753 ...
      0.760 1.833 0.074 1.101 0.620 1.858 0.681 0.705 0.876 1.096 ...
      1.870 1.597 0.990 0.430 0.410 0.399 1.693 0.492 1.318 0.883 ...
      1.291 1.051 1.934 1.314 1.496 0.391 1.079 0.881 0.983 1.306];

y = [ 1.695 1.452 0.997 1.771 1.114 1.624 2.005 0.782 1.870 0.954 ...
      1.606 2.059 0.774 0.741 1.040 0.521 2.163 0.818 1.781 1.420 ...
      0.558 1.437 2.004 1.325 0.398 0.582 2.047 0.332 1.186 0.890 ...
      1.825 1.324 1.334 0.261 0.299 1.733 1.172 1.000 1.663 1.093 ...
      1.045 2.022 1.174 0.670 1.143 1.189 0.494 1.275 1.122 1.823];

ntype = int64(1);   % two-sided test, as described for this example
[d, z, p, sx, sy, ifail] = g08cd(x, y, ntype);

fprintf('Test statistic D = %8.4f\n', d);
fprintf('Z statistic      = %8.4f\n', z);
fprintf('Tail probability = %8.4f\n', p);

On entry, $n1 < 1$, or $n2 < 1$.
