Integer, Intent (In)	::	n, p, ldiff
Integer, Intent (Inout)	::	ifail
Integer, Intent (Out)	::	iout(n), niout
Real (Kind=nag_wp), Intent (In)	::	y(n), mean, var
Real (Kind=nag_wp), Intent (Out)	::	diff(ldiff), llamb(ldiff)

C Header Interface

#include <nag.h>

void	g07gaf_ (const Integer *n, const Integer *p, const double y[], const double *mean, const double *var, Integer iout[], Integer *niout, const Integer *ldiff, double diff[], double llamb[], Integer *ifail)

The routine may be called by the names g07gaf or nagf_univar_outlier_peirce_1var.

3 Description

g07gaf flags outlying values in data using Peirce's criterion. Let

$y$ denote a vector of $n$ observations (for example the residuals) obtained from a model with $p$ parameters,
$m$ denote the number of potential outlying values,
$μ$ and $σ^{2}$ denote the mean and variance of $y$ respectively,
$\tilde{y}$ denote a vector of length $n - m$ constructed by dropping the $m$ values from $y$ with the largest value of $|y_{i} - μ|$,
${\tilde{σ}}^{2}$ denote the (unknown) variance of $\tilde{y}$,
$λ$ denote the ratio of $\tilde{σ}$ and $σ$ with $λ = \frac{\tilde{σ}}{σ}$.

Peirce's method flags $y_{i}$ as a potential outlier if $|y_{i} - μ| \geq x$, where $x = σ^{2} z$ and $z$ is obtained from the solution of

$$R^{m} = λ^{m - n} \frac{m^{m} {(n - m)}^{n - m}}{n^{n}} \qquad (1)$$

where

$$R = 2 \exp \left(\frac{z^{2} - 1}{2}\right) (1 - Φ (z)) \qquad (2)$$

and $Φ$ is the cumulative distribution function for the standard Normal distribution.

As ${\tilde{σ}}^{2}$ is unknown, an assumption is made that the relationship between ${\tilde{σ}}^{2}$ and $σ^{2}$, hence $λ$, depends only on the sum of squares of the rejected observations, and the ratio is estimated as

$$λ^{2} = \frac{n - p - m z^{2}}{n - p - m}$$

which gives

$$z^{2} = 1 + \frac{n - p - m}{m} (1 - λ^{2}) \qquad (3)$$
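The step from the estimate of $λ^{2}$ to equation (3) is a plain rearrangement. Written out, using only the two formulas above:

$$λ^{2} (n - p - m) = n - p - m z^{2}$$

$$m z^{2} = (n - p - m) (1 - λ^{2}) + m$$

$$z^{2} = 1 + \frac{n - p - m}{m} (1 - λ^{2})$$

As an illustrative check with made-up values $n = 10$, $p = 1$, $m = 1$ and $λ^{2} = 0.7$, this gives $z^{2} = 1 + 8 \times 0.3 = 3.4$.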

A value for the cutoff $x$ is calculated iteratively. An initial value of $R = 0.2$ is used and a value of $λ$ is estimated using equation (1). Equation (3) is then used to obtain an estimate of $z$ and then equation (2) is used to get a new estimate for $R$. This process is then repeated until the relative change in $z$ between consecutive iterations is $\leq \sqrt{ε}$, where $ε$ is machine precision.
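The iteration just described can be sketched in a few lines of Python. This is an illustrative reading of the text, not the code inside g07gaf: the function name `peirce_z`, the iteration cap `max_iter` and the clamp of $z^{2}$ at zero are assumptions of this sketch. Equation (2) is taken in the form $R = \exp ((z^{2} - 1) / 2) \, \mathrm{erfc} (z / \sqrt{2})$, which follows from $2 (1 - Φ (z)) = \mathrm{erfc} (z / \sqrt{2})$.

```python
import math
import sys


def peirce_z(n, p, m, r0=0.2, max_iter=200):
    """Iterate equations (1), (3) and (2) to a fixed point and return z.

    Hypothetical helper for illustration; not part of the NAG Library.
    """
    # log of m^m (n - m)^(n - m) / n^n, the constant factor in equation (1)
    log_q = m * math.log(m) + (n - m) * math.log(n - m) - n * math.log(n)
    tol = math.sqrt(sys.float_info.epsilon)
    r, z_old = r0, None
    for _ in range(max_iter):
        # Equation (1) solved for lambda: lambda = (Q / R^m)^(1 / (n - m))
        lam = math.exp((log_q - m * math.log(r)) / (n - m))
        # Equation (3); clamping at zero is this sketch's own guard
        z2 = max(0.0, 1.0 + (n - p - m) / m * (1.0 - lam * lam))
        z = math.sqrt(z2)
        # Stop when the relative change in z is at most sqrt(machine epsilon)
        if z_old is not None and abs(z - z_old) <= tol * abs(z):
            return z
        z_old = z
        # Equation (2), written with erfc instead of 1 - Phi
        r = math.exp((z2 - 1.0) / 2.0) * math.erfc(z / math.sqrt(2.0))
    raise RuntimeError("iteration did not converge")
```

For $n = 10$ and $p = 1$, `peirce_z(10, 1, 1)` comes out at roughly 1.88 and `peirce_z(10, 1, 2)` at roughly 1.57, so the value for two potential outliers is the smaller of the two.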

By construction, the cutoff for testing for $m + 1$ potential outliers is less than the cutoff for testing for $m$ potential outliers. Therefore Peirce's criterion is used in sequence, with the existence of a single potential outlier being investigated first. If one is found, the existence of two potential outliers is investigated, and so on.

If one of a duplicate series of observations is flagged as an outlier, then all of them are flagged as outliers.
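The sequential use of the criterion and the treatment of duplicates can be sketched as follows. This is one possible reading of the description, not the routine's actual logic: the function name `flag_outliers` is hypothetical, the cutoffs are passed in as a precomputed list (`cutoffs[m - 1]` being the cutoff $x$ when $m$ potential outliers are assumed), and the rule "compare the $m$th largest deviation with the cutoff for $m$ outliers, stopping at the first failure" is an assumption.

```python
def flag_outliers(y, mean, cutoffs):
    """Return (iout, niout) with 1-based indices, testing m = 1, 2, ... in turn.

    Hypothetical helper for illustration; cutoffs[m - 1] is the cutoff x
    used when m potential outliers are assumed.
    """
    dev = [abs(v - mean) for v in y]
    # 1-based indices ordered by descending absolute deviation from the mean
    iout = sorted(range(1, len(y) + 1), key=lambda i: -dev[i - 1])
    niout = 0
    for m, x in enumerate(cutoffs, start=1):
        # The m-th largest deviation must reach the cutoff for m outliers
        if m > len(y) or dev[iout[m - 1] - 1] < x:
            break
        niout = m
    # If a flagged value has duplicates, flag all of them
    while 0 < niout < len(y) and y[iout[niout] - 1] == y[iout[niout - 1] - 1]:
        niout += 1
    return iout, niout
```

With `y = [0.1, -0.2, 5.0, 0.0, 5.0, -0.1]`, `mean = 0.0` and `cutoffs = [2.0, 6.0]`, the first 5.0 passes the test for one outlier, the test for two outliers fails, and the second 5.0 is still flagged because it duplicates a flagged value.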

4 References

Gould B A (1855) On Peirce's criterion for the rejection of doubtful observations, with tables for facilitating its application The Astronomical Journal 45

Peirce B (1852) Criterion for the rejection of doubtful observations The Astronomical Journal 45

5 Arguments

1: $n$ – Integer Input
On entry: $n$, the number of observations.
Constraint: $n \geq 3$.

2: $p$ – Integer Input
On entry: $p$, the number of parameters in the model used in obtaining the $y$. If $y$ is an observed set of values, as opposed to the residuals from fitting a model with $p$ parameters, then $p$ should be set to $1$, i.e., as if a model just containing the mean had been used.
Constraint: $1 \leq p \leq n - 2$.

3: $y (n)$ – Real (Kind=nag_wp) array Input
On entry: $y$, the data being tested.

4: $mean$ – Real (Kind=nag_wp) Input
On entry: if $var > 0.0$, mean must contain $μ$, the mean of $y$, otherwise mean is not referenced and the mean is calculated from the data supplied in y.

5: $var$ – Real (Kind=nag_wp) Input
On entry: if $var > 0.0$, var must contain $σ^{2}$, the variance of $y$, otherwise the variance is calculated from the data supplied in y.

6: $iout (n)$ – Integer array Output
On exit: the indices of the values in y sorted in descending order of the absolute difference from the mean, therefore $|y (iout (i - 1)) - μ| \geq |y (iout (i)) - μ|$, for $i = 2, 3, \dots, n$.

7: $niout$ – Integer Output
On exit: the number of potential outliers. The indices for these potential outliers are held in the first niout elements of iout. By construction there can be at most $n - p - 1$ values flagged as outliers.

8: $ldiff$ – Integer Input
On entry: the maximum number of values to be returned in arrays diff and llamb.
If $ldiff \leq 0$, arrays diff and llamb are not referenced.

9: $diff (ldiff)$ – Real (Kind=nag_wp) array Output
On exit: $diff (i)$ holds $|y - μ| - σ^{2} z$ for observation $y (iout (i))$, for $i = 1, 2, \dots, \min (ldiff, niout + 1, n - p - 1)$.

10: $llamb (ldiff)$ – Real (Kind=nag_wp) array Output
On exit: $llamb (i)$ holds $\log (λ^{2})$ for observation $y (iout (i))$, for $i = 1, 2, \dots, \min (ldiff, niout + 1, n - p - 1)$.

11: $ifail$ – Integer Input/Output
On entry: ifail must be set to $0$, $- 1$ or $1$ to set behaviour on detection of an error; these values have no effect when no error is detected.
A value of $0$ causes the printing of an error message and program execution will be halted; otherwise program execution continues. A value of $- 1$ means that an error message is printed while a value of $1$ means that it is not.

If halting is not appropriate, the value $- 1$ or $1$ is recommended. If message printing is undesirable, then the value $1$ is recommended. Otherwise, the value $0$ is recommended. When the value $- 1$ or $1$ is used it is essential to test the value of ifail on exit.

On exit: $ifail = 0$ unless the routine detects an error or a warning has been flagged (see Section 6).
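Because iout holds 1-based indices into y, a caller working in a 0-based language has to shift them when reading the results. The sketch below shows one way to pull out the flagged values and to check the ordering property stated for iout. Both helper names are hypothetical, and the arrays are assumed to have been copied into Python lists after a successful call.

```python
def flagged_values(y, iout, niout):
    """Return the flagged observations, given 1-based indices in iout."""
    return [y[iout[i] - 1] for i in range(niout)]


def is_descending_by_deviation(y, mu, iout):
    """Check |y(iout(i-1)) - mu| >= |y(iout(i)) - mu| for i = 2, ..., n."""
    dev = [abs(y[j - 1] - mu) for j in iout]
    return all(dev[i - 1] >= dev[i] for i in range(1, len(dev)))
```

For example, with `y = [0.0, 0.0, 0.0, 0.0, 10.0]`, `iout = [5, 1, 2, 3, 4]` and `niout = 1`, `flagged_values` returns `[10.0]`.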

6 Error Indicators and Warnings

If on entry $ifail = 0$ or $- 1$, explanatory error messages are output on the current error message unit (as defined by x04aaf).

Errors or warnings detected by the routine:

$ifail = 1$: On entry, $n = 〈value〉$ .
Constraint: $n \geq 3$ .

$ifail = 2$: On entry, $p = 〈value〉$ and $n = 〈value〉$ .
Constraint: $1 \leq p \leq n - 2$ .

$ifail = - 99$: An unexpected error has been triggered by this routine. Please contact NAG.
See Section 7 in the Introduction to the NAG Library FL Interface for further information.

$ifail = - 399$: Your licence key may have expired or may not have been installed correctly.
See Section 8 in the Introduction to the NAG Library FL Interface for further information.

$ifail = - 999$: Dynamic memory allocation failed.
See Section 9 in the Introduction to the NAG Library FL Interface for further information.
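The two argument constraints can be checked before the call. The following is a minimal sketch that mirrors the $ifail = 1$ and $ifail = 2$ conditions listed above; the function name `check_arguments` is hypothetical, and the order in which the two tests are applied is an assumption.

```python
def check_arguments(n, p):
    """Return 1 or 2 if the matching constraint is violated, otherwise 0."""
    if n < 3:
        return 1  # Constraint: n >= 3
    if not (1 <= p <= n - 2):
        return 2  # Constraint: 1 <= p <= n - 2
    return 0
```

For fifteen raw observations with $p = 1$, `check_arguments(15, 1)` returns 0.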

7 Accuracy

Not applicable.

8 Parallelism and Performance

g07gaf is not threaded in any implementation.

9 Further Comments

One problem with Peirce's algorithm as implemented in g07gaf is the assumed relationship between $σ^{2}$, the variance using the full dataset, and ${\tilde{σ}}^{2}$, the variance with the potential outliers removed. In some cases, for example if the data $y$ were the residuals from a linear regression, this assumption may not hold, as the regression line may change significantly when outlying values have been dropped, resulting in a radically different set of residuals. In such cases g07gbf should be used instead.

10 Example

This example reads in a series of data and flags any potential outliers.

The dataset used is from Peirce's original paper and consists of fifteen observations on the vertical semidiameter of Venus.
