
PDF version (NAG web site, 64-bit version, 64-bit version)

Chapter Contents

Chapter Introduction

NAG Toolbox

NAG Toolbox: nag_mv_discrim_group (g03dc)

Contents

1 Purpose

2 Syntax

3 Description

4 References

5 Parameters

5.1 Compulsory Input Parameters

5.2 Optional Input Parameters

5.3 Output Parameters

6 Error Indicators and Warnings

7 Accuracy

8 Further Comments

9 Example

Purpose

nag_mv_discrim_group (g03dc) allocates observations to groups according to selected rules. It is intended for use after nag_mv_discrim (g03da).

Syntax

[prior, p, iag, ati, ifail] = g03dc(typ, equal, priors, nig, gmn, gc, det, isx, x, prior, atiq, 'nvar', nvar, 'ng', ng, 'nobs', nobs, 'm', m)

[prior, p, iag, ati, ifail] = nag_mv_discrim_group(typ, equal, priors, nig, gmn, gc, det, isx, x, prior, atiq, 'nvar', nvar, 'ng', ng, 'nobs', nobs, 'm', m)

Note: the interface to this routine has changed since earlier releases of the toolbox:

At Mark 22:

nobs was made optional

Description

Discriminant analysis is concerned with the allocation of observations to groups using information from other observations whose group membership is known, $X_{t}$; these are called the training set. Consider $p$ variables observed on $n_{g}$ populations or groups. Let ${\bar{x}}_{j}$ be the sample mean and $S_{j}$ the within-group variance-covariance matrix for the $j$th group; these are calculated from a training set of $n$ observations with $n_{j}$ observations in the $j$th group, and let $x_{k}$ be the $k$th observation from the set of observations to be allocated to the $n_{g}$ groups. The observation can be allocated to a group according to a selected rule. The allocation rule or discriminant function will be based on the distance of the observation from an estimate of the location of the groups, usually the group means. A measure of the distance of the observation from the $j$th group mean is given by the Mahalanobis distance, $D_{k j}$:

D_{k j}^{2} = {(x_{k} - {\bar{x}}_{j})}^{T} S_{j}^{- 1} (x_{k} - {\bar{x}}_{j}) .

(1)

If the pooled estimate of the variance-covariance matrix $S$ is used rather than the within-group variance-covariance matrices, then the distance is:

D_{k j}^{2} = {(x_{k} - {\bar{x}}_{j})}^{T} S^{- 1} (x_{k} - {\bar{x}}_{j}) .

(2)

Instead of using the variance-covariance matrices $S$ and $S_{j}$, nag_mv_discrim_group (g03dc) uses the upper triangular matrices $R$ and $R_{j}$ supplied by nag_mv_discrim (g03da) such that $S = R^{T} R$ and $S_{j} = R_{j}^{T} R_{j}$. $D_{k j}^{2}$ can then be calculated as $z^{T} z$ where $R_{j}^{T} z = (x_{k} - {\bar{x}}_{j})$ or $R^{T} z = (x_{k} - {\bar{x}}_{j})$, as appropriate.
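The triangular-solve form of the distance can be illustrated with a minimal MATLAB sketch. The matrix, mean and observation below are made-up values, and chol is used only as a stand-in source of an upper triangular factor with $S = R^{T} R$; in practice the factors would be the ones returned by nag_mv_discrim (g03da).

```matlab
% Hypothetical two-variable data (not taken from this document).
S    = [2.0, 0.6; 0.6, 1.0];   % variance-covariance matrix
xbar = [1.0; 0.5];             % group mean
xk   = [2.0; -0.5];            % observation to be allocated

R  = chol(S);                  % upper triangular factor, S = R' * R
z  = R' \ (xk - xbar);         % solve R' z = x_k - xbar_j
d2 = z' * z;                   % squared Mahalanobis distance

% Same quantity from the defining formula, for comparison only.
d2_direct = (xk - xbar)' * (S \ (xk - xbar));
fprintf('%.6f %.6f\n', d2, d2_direct);
```

Solving with the triangular factor avoids forming $S^{- 1}$ explicitly; both expressions should agree to rounding error.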

In addition to the distances, a set of prior probabilities of group membership, $π_{j}$, for $j = 1, 2, \dots, n_{g}$, may be used, with $\sum π_{j} = 1$. The prior probabilities reflect your view as to the likelihood of the observations coming from the different groups. Two common cases for prior probabilities are $π_{1} = π_{2} = \dots = π_{n_{g}}$, that is, equal prior probabilities, and $π_{j} = n_{j} / n$, for $j = 1, 2, \dots, n_{g}$, that is, prior probabilities proportional to the number of observations in the groups in the training set.
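As a small illustration of these two common choices, the sketch below computes both sets of priors from the group sizes. The sizes 6, 10 and 5 are those of the training set used in the Example section; the variable names prior_equal and prior_prop are illustrative and are not part of the g03dc interface.

```matlab
nig = [6; 10; 5];                 % group sizes n_j in the training set
ng  = numel(nig);                 % number of groups n_g

prior_equal = ones(ng, 1) / ng;   % pi_j = 1 / n_g
prior_prop  = nig / sum(nig);     % pi_j = n_j / n

fprintf('%8.4f %8.4f\n', [prior_equal, prior_prop]');
```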

nag_mv_discrim_group (g03dc) uses one of four allocation rules. In all four rules the $p$ variables are assumed to follow a multivariate Normal distribution with mean $μ_{j}$ and variance-covariance matrix $Σ_{j}$ if the observation comes from the $j$th group. The different rules depend on whether or not the within-group variance-covariance matrices are assumed equal, i.e., $Σ_{1} = Σ_{2} = \dots = Σ_{n_{g}}$, and whether a predictive or estimative approach is used. If $p (x_{k} ∣ μ_{j}, Σ_{j})$ is the probability of observing the observation $x_{k}$ from group $j$, then the posterior probability of belonging to group $j$ is:

p (j ∣ x_{k}, μ_{j}, Σ_{j}) \propto p (x_{k} ∣ μ_{j}, Σ_{j}) π_{j} .

(3)

In the estimative approach, the parameters $μ_{j}$ and $Σ_{j}$ in (3) are replaced by their estimates calculated from $X_{t}$. In the predictive approach, a non-informative prior distribution is used for the parameters and a posterior distribution for the parameters, $p (μ_{j}, Σ_{j} ∣ X_{t})$, is found. A predictive distribution is then obtained by integrating $p (j ∣ x_{k}, μ_{j}, Σ_{j}) p (μ_{j}, Σ_{j} ∣ X_{t})$ over the parameter space. This predictive distribution then replaces $p (x_{k} ∣ μ_{j}, Σ_{j})$ in (3). See Aitchison and Dunsmore (1975), Aitchison et al. (1977) and Moran and Murphy (1979) for further details.

The observation is allocated to the group with the highest posterior probability. Denoting the posterior probabilities, $p (j ∣ x_{k}, μ_{j}, Σ_{j})$, by $q_{j}$, the four allocation rules are:

(i) Estimative with equal variance-covariance matrices – Linear Discrimination

\log q_{j} \propto - \frac{1}{2} D_{k j}^{2} + \log π_{j}

(ii) Estimative with unequal variance-covariance matrices – Quadratic Discrimination

\log q_{j} \propto - \frac{1}{2} D_{k j}^{2} + \log π_{j} - \frac{1}{2} \log |S_{j}|

(iii) Predictive with equal variance-covariance matrices

q_{j}^{- 1} \propto {((n_{j} + 1) / n_{j})}^{p / 2} {\{1 + [n_{j} / ((n - n_{g}) (n_{j} + 1))] D_{k j}^{2}\}}^{(n + 1 - n_{g}) / 2}

(iv) Predictive with unequal variance-covariance matrices

q_{j}^{- 1} \propto C {\{((n_{j}^{2} - 1) / n_{j}) |S_{j}|\}}^{p / 2} {\{1 + (n_{j} / (n_{j}^{2} - 1)) D_{k j}^{2}\}}^{n_{j} / 2},

where

C = \frac{Γ (\frac{1}{2} (n_{j} - p))}{Γ (\frac{1}{2} n_{j})} .

In the above the appropriate value of $D_{k j}^{2}$ from (1) or (2) is used. The values of the $q_{j}$ are standardized so that

\sum_{j = 1}^{n_{g}} q_{j} = 1 .

Moran and Murphy (1979) show the similarity between the predictive methods and methods based upon likelihood ratio tests.
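The two estimative rules and the standardization step can be sketched in a few lines of MATLAB. This is an illustration under stated assumptions, not the routine's implementation: the distances, priors and log-determinants are made-up numbers for a single observation and three groups, and all variable names are hypothetical.

```matlab
% Hypothetical inputs for one observation and three groups.
d2_pooled = [1.2, 0.4, 5.0];    % D_kj^2 from (2), pooled matrix
d2_within = [1.5, 0.3, 6.0];    % D_kj^2 from (1), within-group matrices
prior     = [1/3, 1/3, 1/3];    % prior probabilities pi_j
logdetS   = [0.1, -0.3, 0.8];   % log|S_j|, used by rule (ii) only

% Rule (i): linear discrimination.
logq_lin  = -0.5 * d2_pooled + log(prior);
% Rule (ii): quadratic discrimination.
logq_quad = -0.5 * d2_within + log(prior) - 0.5 * logdetS;

% Standardize so that the q_j sum to one. Subtracting the maximum
% before exponentiating guards against underflow.
q_lin  = exp(logq_lin  - max(logq_lin));
q_lin  = q_lin / sum(q_lin);
q_quad = exp(logq_quad - max(logq_quad));
q_quad = q_quad / sum(q_quad);

% Allocate to the group with the highest posterior probability.
[~, group_lin]  = max(q_lin);
[~, group_quad] = max(q_quad);
fprintf('linear: group %d, quadratic: group %d\n', group_lin, group_quad);
```

Because the rules are stated only up to proportionality, any constant common to all groups cancels in the standardization.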

In addition to allocating the observation to a group, nag_mv_discrim_group (g03dc) computes an atypicality index, $I_{j} (x_{k})$. The predictive atypicality index is returned, irrespective of the value of the parameter typ. This represents the probability of obtaining an observation more typical of group $j$ than the observed $x_{k}$ (see Aitchison and Dunsmore (1975) and Aitchison et al. (1977)). The atypicality index is computed for unequal within-group variance-covariance matrices as:

I_{j} (x_{k}) = P (B \leq z : \frac{1}{2} p, \frac{1}{2} (n_{j} - p))

where $P (B \leq β : a, b)$ is the lower tail probability from a beta distribution and

z = D_{k j}^{2} / (D_{k j}^{2} + (n_{j}^{2} - 1) / n_{j}),

and for equal within-group variance-covariance matrices as:

I_{j} (x_{k}) = P (B \leq z : \frac{1}{2} p, \frac{1}{2} (n - n_{g} - p + 1)),

with

z = D_{k j}^{2} / (D_{k j}^{2} + (n - n_{g}) (n_{j} + 1) / n_{j}) .

If $I_{j} (x_{k})$ is close to $1$ for all groups, it indicates that the observation may come from a grouping not represented in the training set. Moran and Murphy (1979) provide a frequentist interpretation of $I_{j} (x_{k})$.
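The two atypicality formulas can be evaluated directly with MATLAB's betainc, which returns the lower tail of the regularized incomplete beta function. The sketch below is only a check of the formulas under assumed inputs: the distance is a made-up value, the sizes mimic the Example section ($p = 2$, $n = 21$, $n_{g} = 3$, $n_{j} = 6$), and it makes no claim about how g03dc evaluates the beta probability internally.

```matlab
p  = 2;      % number of variables
n  = 21;     % total training observations
ng = 3;      % number of groups
nj = 6;      % training observations in group j
d2 = 3.0;    % hypothetical squared distance D_kj^2

% Unequal within-group variance-covariance matrices.
z_unequal   = d2 / (d2 + (nj^2 - 1) / nj);
ati_unequal = betainc(z_unequal, p / 2, (nj - p) / 2);

% Equal within-group variance-covariance matrices.
z_equal   = d2 / (d2 + (n - ng) * (nj + 1) / nj);
ati_equal = betainc(z_equal, p / 2, (n - ng - p + 1) / 2);

fprintf('unequal: %.4f  equal: %.4f\n', ati_unequal, ati_equal);
```

With $p = 2$ the first beta parameter is 1, so each index reduces to $1 - (1 - z)^{b}$, which gives a simple hand check of the output.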

References

Aitchison J and Dunsmore I R (1975) Statistical Prediction Analysis Cambridge

Aitchison J, Habbema J D F and Kay J W (1977) A critical comparison of two methods of statistical discrimination Appl. Statist. 26 15–25

Kendall M G and Stuart A (1976) The Advanced Theory of Statistics (Volume 3) (3rd Edition) Griffin

Krzanowski W J (1990) Principles of Multivariate Analysis Oxford University Press

Moran M A and Murphy B J (1979) A closer look at two alternative methods of statistical discrimination Appl. Statist. 28 223–232

Morrison D F (1967) Multivariate Statistical Methods McGraw–Hill

Parameters

Compulsory Input Parameters

1: $typ$ – string (length ≥ 1)

Whether the estimative or predictive approach is used.

$typ ='E'$: The estimative approach is used.
$typ ='P'$: The predictive approach is used.

Constraint: $typ ='E'$ or $'P'$.

2: $equal$ – string (length ≥ 1)

Indicates whether or not the within-group variance-covariance matrices are assumed to be equal and the pooled variance-covariance matrix used.

$equal ='E'$: The within-group variance-covariance matrices are assumed equal and the matrix $R$ stored in the first $p (p + 1) / 2$ elements of gc is used.
$equal ='U'$: The within-group variance-covariance matrices are assumed to be unequal and the matrices $R_{i}$ , for $i = 1, 2, \dots, n_{g}$ , stored in the remainder of gc are used.

Constraint: $equal ='E'$ or $'U'$.

3: $priors$ – string (length ≥ 1)

Indicates the form of the prior probabilities to be used.

$priors ='E'$: Equal prior probabilities are used.
$priors ='P'$: Prior probabilities proportional to the group sizes in the training set, $n_{j}$ , are used.
$priors ='I'$: The prior probabilities are input in prior.

Constraint: $priors ='E'$, $'I'$ or $'P'$.

4: $nig (ng)$ – int64/int32/nag_int array

The number of observations in each group in the training set, $n_{j}$.

Constraints:

if $equal ='E'$, $nig (j) > 0$, for $j = 1, 2, \dots, n_{g}$, and $\sum_{j = 1}^{n_{g}} nig (j) > ng + nvar$;
if $equal ='U'$, $nig (j) > nvar$, for $j = 1, 2, \dots, n_{g}$.

5: $gmn (ldgmn, nvar)$ – double array

ldgmn, the first dimension of the array, must satisfy the constraint $ldgmn \geq ng$.

The $j$th row of gmn contains the means of the $p$ variables for the $j$th group, for $j = 1, 2, \dots, n_{g}$. These are returned by nag_mv_discrim (g03da).

6: $gc ((ng + 1) \times nvar \times (nvar + 1) / 2)$ – double array

The first $p (p + 1) / 2$ elements of gc should contain the upper triangular matrix $R$ and the next $n_{g}$ blocks of $p (p + 1) / 2$ elements should contain the upper triangular matrices $R_{j}$. All matrices must be stored packed by column. These matrices are returned by nag_mv_discrim (g03da). If $equal ='E'$ only the first $p (p + 1) / 2$ elements are referenced; if $equal ='U'$ only the elements $p (p + 1) / 2 + 1$ to $(n_{g} + 1) p (p + 1) / 2$ are referenced.

Constraints:

if $equal ='E'$ , the diagonal elements of $R$ must be $\neq 0.0$ ;
if $equal ='U'$ , the diagonal elements of the $R_{j}$ must be $\neq 0.0$ , for $j = 1, 2, \dots, n_{g}$ .
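The packed layout of gc can be made concrete with a short sketch that extracts one block and expands it to a full upper triangular matrix. The values in gc here are placeholders (consecutive integers), not real factors, and the block index j and the helper names blk and packed are illustrative assumptions.

```matlab
p   = 2;                                % number of variables (hypothetical)
ng  = 3;                                % number of groups (hypothetical)
blk = p * (p + 1) / 2;                  % packed length of one triangle
gc  = (1:(ng + 1) * blk)';              % placeholder values only

j      = 2;                             % 0 selects R, j >= 1 selects R_j
packed = gc(j * blk + (1:blk));         % the block holding R_j

% Logical indexing assigns in column-major order, which matches
% upper triangular storage packed by column: R(1,1), R(1,2), R(2,2), ...
R = zeros(p);
R(triu(true(p))) = packed;
disp(R);
```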

7: $det (ng)$ – double array

If $equal ='U'$, the logarithms of the determinants of the within-group variance-covariance matrices as returned by nag_mv_discrim (g03da). Otherwise det is not referenced.

8: $isx (m)$ – int64/int32/nag_int array

$isx (l)$ indicates if the $l$th variable in x is to be included in the distance calculations. If $isx (l) > 0$, the $l$th variable is included, for $l = 1, 2, \dots, m$; otherwise the $l$th variable is not referenced.

Constraint: $isx (l) > 0$ for nvar values of $l$.
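The effect of isx can be pictured as a column selection on x. The following sketch uses a made-up data matrix with $m = 3$ variables and shows which columns would enter the distance calculations and the value nvar must then take; xsel is an illustrative name only, since g03dc performs the selection internally.

```matlab
x   = [1, 2, 3; 4, 5, 6];      % hypothetical data: 2 observations, m = 3
isx = int64([1; 0; 1]);        % include variables 1 and 3 only

xsel = x(:, isx > 0);          % columns used in the distance calculations
nvar = int64(nnz(isx > 0));    % must match the number of positive entries
disp(xsel);
```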

9: $x (ldx, m)$ – double array

ldx, the first dimension of the array, must satisfy the constraint $ldx \geq nobs$.

$x (k, l)$ must contain the $k$th observation for the $l$th variable, for $k = 1, 2, \dots, nobs$ and $l = 1, 2, \dots, m$.

10: $prior (ng)$ – double array

If $priors ='I'$, the prior probabilities for the $n_{g}$ groups.

Constraint: if $priors ='I'$, $prior (j) > 0.0$, for $j = 1, 2, \dots, n_{g}$, and $|1 - \sum_{j = 1}^{n_{g}} prior (j)| \leq 10 \times machine precision$.

11: $atiq$ – logical scalar

atiq must be true if atypicality indices are required. If atiq is false the array ati is not set.

Optional Input Parameters

1: $nvar$ – int64/int32/nag_int scalar: Default: the second dimension of the array gmn.
$p$, the number of variables in the variance-covariance matrices.

Constraint: $nvar \geq 1$.
2: $ng$ – int64/int32/nag_int scalar: Default: the dimension of the arrays nig, det, prior and the first dimension of the array gmn. (An error is raised if these dimensions are not equal.)
The number of groups, $n_{g}$.

Constraint: $ng \geq 2$.
3: $nobs$ – int64/int32/nag_int scalar: Default: the first dimension of the array x.
The number of observations in x which are to be allocated.

Constraint: $nobs \geq 1$.
4: $m$ – int64/int32/nag_int scalar: Default: the dimension of the array isx and the second dimension of the array x. (An error is raised if these dimensions are not equal.)
The number of variables in the data array x.

Constraint: $m \geq nvar$.

Output Parameters

1: $prior (ng)$ – double array: If $priors ='P'$ , the computed prior probabilities in proportion to group sizes for the $n_{g}$ groups.
If $priors ='I'$ , the input prior probabilities will be unchanged.

If $priors ='E'$ , prior is not set.
2: $p (ldp, ng)$ – double array: $p (k, j)$ contains the posterior probability $p_{k j}$ for allocating the $k$ th observation to the $j$ th group, for $k = 1, 2, \dots, nobs$ and $j = 1, 2, \dots, n_{g}$ .
3: $iag (nobs)$ – int64/int32/nag_int array: The groups to which the observations have been allocated.
4: $ati (ldp, :)$ – double array: The first dimension of the array ati will be $nobs$.
The second dimension of the array ati will be $ng$ if $atiq = true$ and $1$ otherwise.

If atiq is true, $ati (k, j)$ will contain the predictive atypicality index for the $k$th observation with respect to the $j$th group, for $k = 1, 2, \dots, nobs$ and $j = 1, 2, \dots, n_{g}$.
If atiq is false, ati is not set.
5: $ifail$ – int64/int32/nag_int scalar: $ifail = 0$ unless the function detects an error (see Error Indicators and Warnings).

Error Indicators and Warnings

Errors or warnings detected by the function:

$ifail = 1$

On entry,	$nvar < 1$ ,
or	$ng < 2$ ,
or	$nobs < 1$ ,
or	$m < nvar$ ,
or	$ldgmn < ng$ ,
or	$ldx < nobs$ ,
or	$ldp < nobs$ ,
or	$typ \neq'E'$ or $'P'$,
or	$equal \neq'E'$ or $'U'$,
or	$priors \neq'E'$, $'I'$ or $'P'$.

$ifail = 2$

On entry,	the number of variables indicated by isx is not equal to nvar,
or	$equal ='E'$ and $nig (j) \leq 0$ , for some $j$ ,
or	$equal ='E'$ and $\sum_{j = 1}^{n_{g}} nig (j) \leq ng + nvar$ ,
or	$equal ='U'$ and $nig (j) \leq nvar$ for some $j$ .

$ifail = 3$

On entry,	$priors ='I'$ and $prior (j) \leq 0.0$ for some $j$ ,
or	$priors ='I'$ and $\sum_{j = 1}^{n_{g}} prior (j)$ is not within $10 \times machine precision$ of $1$ .

$ifail = 4$

On entry,	$equal ='E'$ and a diagonal element of $R$ is zero,
or	$equal ='U'$ and a diagonal element of $R_{j}$ for some $j$ is zero.

$ifail = - 99$: An unexpected error has been triggered by this routine. Please contact NAG.

$ifail = - 399$: Your licence key may have expired or may not have been installed correctly.

$ifail = - 999$: Dynamic memory allocation failed.

Accuracy

The accuracy of the returned posterior probabilities will depend on the accuracy of the input $R$ or $R_{j}$ matrices. The atypicality index should be accurate to four significant places.

Further Comments

The distances $D_{k j}^{2}$ can be computed using nag_mv_discrim_mahal (g03db) if other forms of discrimination are required.

Example

The data, taken from Aitchison and Dunsmore (1975), is concerned with the diagnosis of three ‘types’ of Cushing's syndrome. The variables are the logarithms of the urinary excretion rates (mg/24hr) of two steroid metabolites. Observations for a total of $21$ patients are input and the group means and $R$ matrices are computed by nag_mv_discrim (g03da). A further six observations of unknown type are input and allocations made using the predictive approach and under the assumption that the within-group covariance matrices are not equal. The posterior probabilities of group membership, $q_{j}$, and the atypicality index are printed along with the allocated group. The atypicality index shows that observations $5$ and $6$ do not seem to be typical of the three types present in the initial $21$ observations.


function g03dc_example


fprintf('g03dc example results\n\n');

x = [1.1314,  2.4596;
     1.0986,  0.2624;
     0.6419, -2.3026;
     1.3350, -3.2189;
     1.4110,  0.0953;
     0.6419, -0.9163;
     2.1163,  0.0000;
     1.3350, -1.6094;
     1.3610, -0.5108;
     2.0541,  0.1823;
     2.2083, -0.5108;
     2.7344,  1.2809;
     2.0412,  0.4700;
     1.8718, -0.9163;
     1.7405, -0.9163;
     2.6101,  0.4700;
     2.3224,  1.8563;
     2.2192,  2.0669;
     2.2618,  1.1314;
     3.9853,  0.9163;
     2.7600,  2.0281];
[n,m] = size(x);
isx  = ones(m,1,'int64');
nvar = int64(m);
ing  = ones(n,1,'int64');
ing(7:16) = int64(2);
ing(17:n) = int64(3);
ng        = int64(3);

% Compute group means and R matrices from the training set
[nig, gmean, det, gc, stat, df, sig, ifail] = ...
  g03da( ...
         x, isx, nvar, ing, ng);

% Data to group
x = [1.6292, -0.9163;
     2.5572,  1.6094;
     2.5649, -0.2231;
     0.9555, -2.3026;
     3.4012, -2.3026;
     3.0204, -0.2231];

% Grouping parameters
typ    = 'P';
equal  = 'U';
priors = 'Equal priors';
prior  = zeros(3, 1);
atiq   = true;

[prior, p, iag, ati, ifail] = ...
  g03dc( ...
         typ, equal, priors, nig, gmean, gc, det, isx, x, prior, atiq);

fprintf('   Obs       Posterior        Allocated     Atypicality\n');
fprintf('             probabilities    to group      index\n');
for i = 1:size(x,1)
  fprintf('%6d     ', i);
  fprintf('%6.3f', p(i,:));
  fprintf('%6d     ', iag(i));
  fprintf('%6.3f', ati(i,:));
  fprintf('\n');
end

g03dc example results

   Obs       Posterior        Allocated     Atypicality
             probabilities    to group      index
     1      0.094 0.905 0.002     2      0.596 0.254 0.975
     2      0.005 0.168 0.827     3      0.952 0.836 0.018
     3      0.019 0.920 0.062     2      0.954 0.797 0.912
     4      0.697 0.303 0.000     1      0.207 0.860 0.993
     5      0.317 0.013 0.670     3      0.991 1.000 0.984
     6      0.032 0.366 0.601     3      0.981 0.978 0.887
