variables that maximizes the ratio of between-group to within-group variation. The variables formed, the canonical variates, can then be used to discriminate between groups.

The canonical variates can be calculated from the eigenvectors of the within-group sums of squares and cross-products matrix. However, nag_mv_canon_var (g03ac) calculates the canonical variates by means of a singular value decomposition (SVD) of a matrix $V$. Let the data matrix with variable (column) means subtracted be $X$, and let its rank be $k$; then the $k$ by $(n_{g} - 1)$ matrix $V$ is given by:

$V = Q_{X}^{T} Q_{g},$

where $Q_{g}$ is an $n$ by $(n_{g} - 1)$ orthogonal matrix that defines the groups and $Q_{X}$ is the first $k$ columns of the orthogonal matrix $Q$, taken either from the $QR$ decomposition of $X$:

$X = Q R$

if $X$ is of full column rank, i.e., $k = n_{x}$, or else from the SVD of $X$:

$X = Q D P^{T}.$

Let the SVD of $V$ be:

$V = U_{x} Δ U_{g}^{T};$

then the nonzero elements of the diagonal matrix $Δ$, $δ_{i}$, for $i = 1, 2, \dots, l$, are the $l$ canonical correlations associated with the $l$ canonical variates, where $l = \min (k, n_{g} - 1)$.
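The construction of $V$ and its SVD can be sketched in base MATLAB using the data from the Example section. This is an illustration only, not the routine's internal implementation. It assumes that $X$ has full column rank (so the economy-size QR decomposition applies) and it assumes one particular choice of $Q_{g}$: an orthonormal basis for the column-centred group indicator matrix. All variable names are chosen here for illustration.

```matlab
% Illustrative sketch (not the NAG implementation): canonical correlations
% as the singular values of V = Qx' * Qg, using the Example data.
x = [13.3, 10.6, 21.2; 13.6, 10.2, 21.0; 14.2, 10.7, 21.1;
     13.4,  9.4, 21.0; 13.2,  9.6, 20.1; 13.9, 10.4, 19.8;
     12.9, 10.0, 20.5; 12.2,  9.9, 20.7; 13.9, 11.0, 19.1];
ing = [1;2;3; 1;2;3; 1;2;3];
n   = size(x,1);
ng  = max(ing);

% Subtract the column means from the data
Xc = x - repmat(mean(x,1), n, 1);

% Assumed construction of Qg: centre the n-by-ng indicator matrix and
% take an orthonormal basis of its (ng-1)-dimensional column space.
G = zeros(n, ng);
G(sub2ind([n, ng], (1:n)', ing)) = 1;
Gc = G - repmat(mean(G,1), n, 1);
[Ug, ~, ~] = svd(Gc, 'econ');
Qg = Ug(:, 1:ng-1);

% Full-column-rank case: Qx from the economy-size QR decomposition of Xc
[Qx, ~] = qr(Xc, 0);

V     = Qx' * Qg;   % k-by-(ng-1) matrix
delta = svd(V);     % canonical correlations, largest first
fprintf('%8.4f\n', delta);
```

Under these assumptions the two singular values should agree, to the printed precision, with the canonical correlations listed in the Example results (0.8826 and 0.2623).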

The eigenvalues, $λ_{i}^{2}$, of the within-group sums of squares matrix are given by:

$λ_{i}^{2} = \frac{δ_{i}^{2}}{1 - δ_{i}^{2}}$

and the value of $π_{i} = λ_{i}^{2} / \sum λ_{i}^{2}$ gives the proportion of variation explained by the $i$th canonical variate. The values of the $π_{i}$ give an indication as to how many canonical variates are needed to adequately describe the data, i.e., the dimensionality of the problem.
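As a small numerical check of these two formulas, the sketch below starts from the canonical correlations printed in the Example results. Because those inputs are rounded to four decimal places, the computed eigenvalues agree with the printed ones only approximately.

```matlab
% Canonical correlations as printed in the Example results (4 d.p.)
delta   = [0.8826; 0.2623];

lambda2 = delta.^2 ./ (1 - delta.^2);   % eigenvalues lambda_i^2
prop    = lambda2 ./ sum(lambda2);      % proportions pi_i

fprintf('%10.4f %10.4f\n', [lambda2, prop]');
```

The first canonical variate accounts for roughly 98% of the variation here, which suggests that a single dimension describes most of the separation between the groups.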

To test for a significant dimensionality greater than $i$, the $χ^{2}$ statistic:

$(n - 1 - n_{g} - \frac{1}{2} (k - n_{g})) \sum_{j = i + 1}^{l} \log (1 + λ_{j}^{2})$

can be used. This is asymptotically distributed as a $χ^{2}$-distribution with $(k - i) (n_{g} - 1 - i)$ degrees of freedom. If the test for $i = h$ is not significant, then the remaining tests for $i > h$ should be ignored.
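The test statistic, its degrees of freedom and the corresponding upper-tail probability can be evaluated directly from the formula. The sketch below uses the values from the Example ($n = 9$, $n_{g} = 3$, $k = 3$ and the printed eigenvalues). The tail probability is obtained from the base MATLAB function gammainc, which is a choice made here for illustration; the routine's own method is not stated in this document.

```matlab
n = 9; ng = 3; k = 3;
lambda2 = [3.5238; 0.0739];   % eigenvalues as printed in the Example results
l = numel(lambda2);

stat = zeros(l,1);
df   = zeros(l,1);
sig  = zeros(l,1);
for i = 0:l-1
    % Statistic for dimensionality greater than i
    stat(i+1) = (n - 1 - ng - 0.5*(k - ng)) * sum(log(1 + lambda2(i+1:l)));
    df(i+1)   = (k - i) * (ng - 1 - i);
    % Upper-tail chi-square probability via the regularised
    % incomplete gamma function
    sig(i+1)  = gammainc(stat(i+1)/2, df(i+1)/2, 'upper');
end

fprintf('%10.4f%8.1f%8.4f\n', [stat, df, sig]');
```

The results should match the Chisq, DF and Sig columns of the Example output to within the rounding of the printed eigenvalues.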

The loadings for the canonical variates are calculated from the matrix $U_{x}$. This matrix is scaled so that the canonical variates have unit within-group variance.

In addition to the canonical variate loadings, the means for each canonical variate are calculated for each group.
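The scaling and the group means can be checked numerically from the Example. The sketch below applies the printed loadings to the mean-centred data, then computes the group means of the resulting canonical variates and their pooled within-group variance. It assumes that the within-group variance uses the divisor $n - n_{g}$; the variable names are illustrative.

```matlab
x = [13.3, 10.6, 21.2; 13.6, 10.2, 21.0; 14.2, 10.7, 21.1;
     13.4,  9.4, 21.0; 13.2,  9.6, 20.1; 13.9, 10.4, 19.8;
     12.9, 10.0, 20.5; 12.2,  9.9, 20.7; 13.9, 11.0, 19.1];
ing = [1;2;3; 1;2;3; 1;2;3];
% Loadings as printed in the Example results (4 d.p.)
cvx = [-1.7070, 0.7277; -1.3481, 0.3138; 0.9327, 1.2199];

n  = size(x,1);
ng = max(ing);

% Canonical variate scores for the mean-centred data
scores = (x - repmat(mean(x,1), n, 1)) * cvx;

cvm = zeros(ng, size(cvx,2));   % group means of the canonical variates
wss = zeros(1, size(cvx,2));    % pooled within-group sums of squares
for g = 1:ng
    s = scores(ing == g, :);
    cvm(g,:) = mean(s, 1);
    d = s - repmat(cvm(g,:), size(s,1), 1);
    wss = wss + sum(d.^2, 1);
end
wvar = wss / (n - ng);          % assumed divisor: n - ng

disp(cvm);
disp(wvar);
```

With these inputs, cvm should reproduce the "Canonical variate means" table of the Example and both entries of wvar should be close to 1, up to the rounding of the printed loadings.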

Weights can be used with the analysis, in which case the weighted means are subtracted from each column and then each row is scaled by an amount $\sqrt{w_{i}}$, where $w_{i}$ is the weight for the $i$th observation (row).
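The weighted centring and row scaling can be written in a few lines. The data and weights below are made up for illustration and are not taken from the Example.

```matlab
x = [1 2; 3 4; 5 9; 7 8];   % illustrative data
w = [1; 2; 0; 1];           % illustrative weights (a zero weight drops the row)
n = size(x,1);

xbar = (w' * x) / sum(w);   % weighted column means
% Subtract the weighted means, then scale row i by sqrt(w(i))
Xw = repmat(sqrt(w), 1, size(x,2)) .* (x - repmat(xbar, n, 1));

disp(xbar);
disp(Xw);
```

After this transformation the weighted column sums of the centred data are zero, and any row with zero weight contributes nothing to subsequent sums of squares.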

References

Chatfield C and Collins A J (1980) Introduction to Multivariate Analysis Chapman and Hall

Gnanadesikan R (1977) Methods for Statistical Data Analysis of Multivariate Observations Wiley

Hammarling S (1985) The singular value decomposition in multivariate statistics SIGNUM Newsl. 20(3) 2–25

Kendall M G and Stuart A (1969) The Advanced Theory of Statistics (Volume 1) (3rd Edition) Griffin

Parameters

Compulsory Input Parameters

1: $weight$ – string (length ≥ 1)

Indicates if weights are to be used.

$weight ='U'$: No weights are used.
$weight ='W'$ or $'V'$: Weights are used and must be supplied in wt.

If $weight ='W'$, the weights are treated as frequencies and the effective number of observations is the sum of the weights.

If $weight ='V'$, the weights are treated as being inversely proportional to the variance of the observations and the effective number of observations is the number of observations with nonzero weights.

Constraint: $weight ='U'$, $'W'$ or $'V'$.
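The difference between the two weighting modes can be seen by computing the effective number of observations for the same weight vector. The weights below are illustrative only.

```matlab
wt = [2; 0; 1; 3];      % illustrative weights

neffW = sum(wt);        % weight = 'W': weights act as frequencies
neffV = nnz(wt);        % weight = 'V': count of nonzero weights

fprintf('W: %g, V: %g\n', neffW, neffV);
```

Here the frequency interpretation gives an effective sample size of 6, while the variance interpretation gives 3, so the choice of mode affects the constraint on the minimum number of observations.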

2: $x (ldx, m)$ – double array

ldx, the first dimension of the array, must satisfy the constraint $ldx \geq n$.

$x (i, j)$ must contain the $i$th observation for the $j$th variable, for $i = 1, 2, \dots, n$ and $j = 1, 2, \dots, m$.

3: $isx (m)$ – int64/int32/nag_int array

$isx (j)$ indicates whether or not the $j$th variable is to be included in the analysis.

If $isx (j) > 0$, the variable contained in the $j$th column of x is included in the canonical variate analysis, for $j = 1, 2, \dots, m$.

Constraint: $isx (j) > 0$ for nx values of $j$.

4: $nx$ – int64/int32/nag_int scalar

The number of variables in the analysis, $n_{x}$.

Constraint: $nx \geq 1$.

5: $ing (n)$ – int64/int32/nag_int array

$ing (i)$ indicates which group the $i$th observation is in, for $i = 1, 2, \dots, n$. The effective number of groups is the number of groups with nonzero membership.

Constraint: $1 \leq ing (i) \leq ng$, for $i = 1, 2, \dots, n$.

6: $ng$ – int64/int32/nag_int scalar

The number of groups, $n_{g}$.

Constraint: $ng \geq 2$.

7: $wt (:)$ – double array

The dimension of the array wt must be at least $n$ if $weight ='W'$ or $'V'$, and at least $1$ otherwise.

If $weight ='W'$ or $'V'$, the first $n$ elements of wt must contain the weights to be used in the analysis. If $wt (i) = 0.0$, the $i$th observation is not included in the analysis.

If $weight ='U'$, wt is not referenced.

Constraints:

$wt (i) \geq 0.0$, for $i = 1, 2, \dots, n$;
$\sum_{i = 1}^{n} wt (i) \geq nx + \text{effective number of groups}$.

8: $tol$ – double scalar

The value of tol is used to decide if the variables are of full rank and, if not, what is the rank of the variables. The smaller the value of tol the stricter the criterion for selecting the singular value decomposition. If a non-negative value of tol less than machine precision is entered, the square root of machine precision is used instead.

Constraint: $tol \geq 0.0$.

Optional Input Parameters

1: $n$ – int64/int32/nag_int scalar

Default: the dimension of the array ing and the first dimension of the array x. (An error is raised if these dimensions are not equal.)

$n$, the number of observations.

Constraint: $n \geq nx + ng$.

2: $m$ – int64/int32/nag_int scalar

Default: the dimension of the array isx and the second dimension of the array x. (An error is raised if these dimensions are not equal.)

$m$, the total number of variables.

Constraint: $m \geq nx$.
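The constraints stated for the input parameters can be checked in plain MATLAB before the call is made. The sketch below is a hypothetical pre-check for the unweighted case, written for illustration; it is not part of the NAG Toolbox and does not replace the routine's own error reporting through ifail.

```matlab
% Inputs as in the Example (unweighted analysis)
x   = [13.3, 10.6, 21.2; 13.6, 10.2, 21.0; 14.2, 10.7, 21.1;
       13.4,  9.4, 21.0; 13.2,  9.6, 20.1; 13.9, 10.4, 19.8;
       12.9, 10.0, 20.5; 12.2,  9.9, 20.7; 13.9, 11.0, 19.1];
isx = int64([1; 1; 1]);
nx  = int64(3);
ing = int64([1;2;3; 1;2;3; 1;2;3]);
ng  = int64(3);
tol = 1e-06;

n = size(x,1);   % default for n
m = size(x,2);   % default for m

% Hypothetical pre-checks mirroring the documented constraints
assert(numel(ing) == n, 'ing must have one entry per row of x');
assert(numel(isx) == m, 'isx must have one entry per column of x');
assert(nx >= 1 && ng >= 2, 'need nx >= 1 and ng >= 2');
assert(m >= nx, 'need m >= nx');
assert(n >= nx + ng, 'need n >= nx + ng');
assert(nnz(isx > 0) == nx, 'isx must select exactly nx variables');
assert(all(ing >= 1 & ing <= ng), 'need 1 <= ing(i) <= ng');
assert(tol >= 0, 'need tol >= 0.0');
inputsOk = true;
```

Failing early with a descriptive message can be easier to diagnose than decoding an ifail value, although the routine still performs its own checks.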

Output Parameters

1: $nig (ng)$ – int64/int32/nag_int array

$nig (j)$ gives the number of observations in group $j$, for $j = 1, 2, \dots, n_{g}$.

2: $cvm (ldcvm, nx)$ – double array

$cvm (i, j)$ contains the mean of the $j$th canonical variate for the $i$th group, for $i = 1, 2, \dots, n_{g}$ and $j = 1, 2, \dots, l$; the remaining columns, if any, are used as workspace.

3: $e (lde, 6)$ – double array

The statistics of the canonical variate analysis.

$e (i, 1)$: The canonical correlations, $δ_{i}$ , for $i = 1, 2, \dots, l$ .
$e (i, 2)$: The eigenvalues of the within-group sum of squares matrix, $λ_{i}^{2}$ , for $i = 1, 2, \dots, l$ .
$e (i, 3)$: The proportion of variation explained by the $i$ th canonical variate, for $i = 1, 2, \dots, l$ .
$e (i, 4)$: The $χ^{2}$ statistic for the $i$ th canonical variate, for $i = 1, 2, \dots, l$ .
$e (i, 5)$: The degrees of freedom for the $χ^{2}$ statistic for the $i$ th canonical variate, for $i = 1, 2, \dots, l$ .
$e (i, 6)$: The significance level for the $χ^{2}$ statistic for the $i$ th canonical variate, for $i = 1, 2, \dots, l$ .
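One possible way to read the significance levels in e is to apply the sequential rule from the description of the $χ^{2}$ test: stop at the first non-significant test and ignore the rest. The sketch below assumes that row $r$ of e holds the test for dimensionality greater than $r - 1$ (which is consistent with the degrees of freedom printed in the Example) and assumes a threshold of 0.05; both are illustrative choices.

```matlab
% Array e as printed in the Example results
e = [0.8826, 3.5238, 0.9795, 7.9032, 6.0, 0.2453;
     0.2623, 0.0739, 0.0205, 0.3564, 2.0, 0.8368];
alpha = 0.05;   % assumed significance threshold

% First row whose test is not significant; later rows are ignored
firstNonSig = find(e(:,6) > alpha, 1);
if isempty(firstNonSig)
    ndim = size(e,1);        % every test was significant
else
    ndim = firstNonSig - 1;  % number of significant dimensions
end
fprintf('Estimated dimensionality: %d\n', ndim);
```

For the Example data the first test is already non-significant, so under these assumptions the nine observations give no evidence of group separation in any dimension.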

4: $ncv$ – int64/int32/nag_int scalar

The number of canonical variates, $l$. This will be the minimum of $n_{g} - 1$ and the rank of x.

5: $cvx (ldcvx, ng - 1)$ – double array

The canonical variate loadings.

$cvx (i, j)$ contains the loading coefficient for the $i$th variable on the $j$th canonical variate, for $i = 1, 2, \dots, n_{x}$ and $j = 1, 2, \dots, l$; the remaining columns, if any, are used as workspace.

6: $irankx$ – int64/int32/nag_int scalar

The rank of the dependent variables.

If the variables are of full rank then $irankx = nx$.

If the variables are not of full rank then irankx is an estimate of the rank of the dependent variables. irankx is calculated as the number of singular values greater than $tol \times (\text{largest singular value})$.
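The rank estimate can be illustrated with a small rank-deficient matrix. The sketch below counts the singular values of the mean-centred data that exceed tol times the largest one, and replaces a too-small tol by the square root of machine precision as described for the tol parameter. MATLAB's eps is used here as a stand-in for machine precision, and the data are made up for illustration.

```matlab
% Illustrative data: the third column is the sum of the first two
X = [1 2 3; 2 1 3; 4 0 4; 0 5 5; 3 3 6];
n = size(X,1);
Xc = X - repmat(mean(X,1), n, 1);   % subtract column means

tol = 0;                 % a value below machine precision
if tol < eps
    tol = sqrt(eps);     % fall back to sqrt(machine precision)
end

s = svd(Xc);                    % singular values, largest first
irankx = sum(s > tol * s(1));   % estimated rank
fprintf('Estimated rank = %d\n', irankx);
```

Because one column is an exact linear combination of the others, the smallest singular value is at rounding-error level and the estimated rank is 2 rather than 3.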

7: $ifail$ – int64/int32/nag_int scalar

$ifail = 0$ unless the function detects an error (see Error Indicators and Warnings).

Error Indicators and Warnings

Errors or warnings detected by the function:

Cases prefixed with W are classified as warnings and do not generate an error of type NAG:error_n. See nag_issue_warnings.

$ifail = 1$

On entry,	$nx < 1$ ,
or	$ng < 2$ ,
or	$m < nx$ ,
or	$n < nx + ng$ ,
or	$ldx < n$ ,
or	$ldcvx < nx$ ,
or	$ldcvm < ng$ ,
or	$lde < \min (nx, ng - 1)$ ,
or	$nx \geq ng - 1$ and $iwk < n \times nx + \max (5 \times (nx - 1) + (nx + 1) \times nx, n)$ ,
or	$nx < ng - 1$ and $iwk < n \times nx + \max (5 \times (nx - 1) + (ng - 1) \times nx, n)$ ,
or	$weight \neq'U'$ , $'W'$ or $'V'$ ,
or	$tol < 0.0$ .

$ifail = 2$

On entry, $weight ='W'$ or $'V'$ and a value of $wt < 0.0$.

$ifail = 3$

On entry,	a value of $ing < 1$ ,
or	a value of $ing > ng$ .

$ifail = 4$: On entry, the number of variables to be included in the analysis as indicated by isx is not equal to nx.

$ifail = 5$: A singular value decomposition has failed to converge. This is an unlikely error exit.

W $ifail = 6$: A canonical correlation is equal to $1$ . This will happen if the variables provide an exact indication as to which group every observation is allocated.

$ifail = 7$

On entry,	fewer than two groups have nonzero membership, i.e., the effective number of groups is less than $2$ ,
or	the effective number of groups plus the number of variables, nx, is greater than the effective number of observations.

W $ifail = 8$: The rank of the variables is $0$ . This will happen if all the variables are constants.

$ifail = - 99$: An unexpected error has been triggered by this routine. Please contact NAG.

$ifail = - 399$: Your licence key may have expired or may not have been installed correctly.

$ifail = - 999$: Dynamic memory allocation failed.

Accuracy

As the computation involves the use of orthogonal matrices and a singular value decomposition rather than the traditional computing of a sum of squares matrix and the use of an eigenvalue decomposition, nag_mv_canon_var (g03ac) should be less affected by ill-conditioned problems.

Further Comments

None.

Example

This example uses a sample of nine observations, each consisting of three variables plus a group indicator. There are three groups. An unweighted canonical variate analysis is performed and the results printed.


function g03ac_example


fprintf('g03ac example results\n\n');

x = [13.3, 10.6, 21.2;
     13.6, 10.2, 21.0;
     14.2, 10.7, 21.1;
     13.4,  9.4, 21.0;
     13.2,  9.6, 20.1;
     13.9, 10.4, 19.8;
     12.9, 10.0, 20.5;
     12.2,  9.9, 20.7;
     13.9, 11.0, 19.1];
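m = size(x,2);                       % number of variables (columns of x)
weight = 'U';                        % unweighted analysis
isx = ones(m,1,'int64');             % include every variable
nx  = int64(m);
ing = int64([1;2;3; 1;2;3; 1;2;3]);  % group indicator for each observation
ng  = int64(3);                      % number of groups
wt  = [];                            % not referenced when weight = 'U'
tol = 1e-06;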

[nig, cvm, e, ncv, cvx, irankx, ifail] = ...
  g03ac( ...
	 weight, x, isx, nx, ing, ng, wt, tol);

fprintf('Rank of x = %d\n\n', irankx);
fprintf('Canonical    Eigenvalues Percentage     Chisq      DF     Sig\n');
fprintf('correlations              variation\n');
fprintf('%11.4f%12.4f%12.4f%10.4f%8.1f%8.4f\n',e');
fprintf('\n');

mtitle = 'Canonical Coefficients for x';
matrix = 'General';
diag   = ' ';

[ifail] = x04ca( ...
                 matrix, diag, cvx, mtitle);

fprintf('\n');
mtitle = 'Canonical variate means';
[ifail] = x04ca( ...
                 matrix, diag, cvm(:,1:ncv), mtitle);

g03ac example results

Rank of x = 3

Canonical    Eigenvalues Percentage     Chisq      DF     Sig
correlations              variation
     0.8826      3.5238      0.9795    7.9032     6.0  0.2453
     0.2623      0.0739      0.0205    0.3564     2.0  0.8368

 Canonical Coefficients for x
             1          2
 1     -1.7070     0.7277
 2     -1.3481     0.3138
 3      0.9327     1.2199

 Canonical variate means
             1          2
 1      0.9841     0.2797
 2      1.1805    -0.2632
 3     -2.1646    -0.0164
