NAG FL Interface
g03aaf (prin_comp)
1
Purpose
g03aaf performs a principal component analysis on a data matrix; both the principal component loadings and the principal component scores are returned.
2
Specification
Fortran Interface
Subroutine g03aaf ( 
matrix, std, weight, n, m, x, ldx, isx, s, wt, nvar, e, lde, p, ldp, v, ldv, wk, ifail) 
Integer, Intent (In) 
:: 
n, m, ldx, isx(m), nvar, lde, ldp, ldv 
Integer, Intent (Inout) 
:: 
ifail 
Real (Kind=nag_wp), Intent (In) 
:: 
x(ldx,m), wt(*), wk(1) 
Real (Kind=nag_wp), Intent (Inout) 
:: 
s(m), e(lde,6), p(ldp,nvar), v(ldv,nvar) 
Character (1), Intent (In) 
:: 
matrix, std, weight 

C Header Interface
#include <nag.h>
void 
g03aaf_ (const char *matrix, const char *std, const char *weight, const Integer *n, const Integer *m, const double x[], const Integer *ldx, const Integer isx[], double s[], const double wt[], const Integer *nvar, double e[], const Integer *lde, double p[], const Integer *ldp, double v[], const Integer *ldv, const double wk[], Integer *ifail, const Charlen length_matrix, const Charlen length_std, const Charlen length_weight) 

C++ Header Interface
#include <nag.h> extern "C" {
void 
g03aaf_ (const char *matrix, const char *std, const char *weight, const Integer &n, const Integer &m, const double x[], const Integer &ldx, const Integer isx[], double s[], const double wt[], const Integer &nvar, double e[], const Integer &lde, double p[], const Integer &ldp, double v[], const Integer &ldv, const double wk[], Integer &ifail, const Charlen length_matrix, const Charlen length_std, const Charlen length_weight) 
}

The routine may be called by the names g03aaf or nagf_mv_prin_comp.
3
Description
Let
$X$ be an
$n$ by
$p$ data matrix of
$n$ observations on
$p$ variables
${x}_{1},{x}_{2},\dots ,{x}_{p}$ and let the
$p$ by
$p$ variancecovariance matrix of
${x}_{1},{x}_{2},\dots ,{x}_{p}$ be
$S$. A vector
${a}_{1}$ of length
$p$ is found such that:
The variable
${z}_{1}={\displaystyle \sum _{i=1}^{p}}{a}_{1i}{x}_{i}$ is known as the first principal component and gives the linear combination of the variables that gives the maximum variation. A second principal component,
${z}_{2}={\displaystyle \sum _{i=1}^{p}}{a}_{2i}{x}_{i}$, is found such that:
This gives the linear combination of variables that is orthogonal to the first principal component that gives the maximum variation. Further principal components are derived in a similar way.
The vectors
${a}_{1},{a}_{2},\dots ,{a}_{p}$, are the eigenvectors of the matrix
$S$ and associated with each eigenvector is the eigenvalue,
${\lambda}_{i}^{2}$. The value of
${\lambda}_{i}^{2}/\sum {\lambda}_{i}^{2}$ gives the proportion of variation explained by the
$i$th principal component. Alternatively, the
${a}_{i}$'s can be considered as the right singular vectors in a singular value decomposition with singular values
${\lambda}_{i}$ of the data matrix centred about its mean and scaled by
$1/\sqrt{\left(n1\right)}$,
${X}_{s}$. This latter approach is used in
g03aaf, with
where
$\Lambda $ is a diagonal matrix with elements
${\lambda}_{i}$,
$P$ is the
$p$ by
$p$ matrix with columns
${a}_{i}$ and
$V$ is an
$n$ by
$p$ matrix with
${V}^{\prime}V=I$, which gives the principal component scores.
Principal component analysis is often used to reduce the dimension of a dataset, replacing a large number of correlated variables with a smaller number of orthogonal variables that still contain most of the information in the original dataset.
The choice of the number of dimensions required is usually based on the amount of variation accounted for by the leading principal components. If
$k$ principal components are selected, then a test of the equality of the remaining
$pk$ eigenvalues is
which has, asymptotically, a
${\chi}^{2}$distribution with
$\frac{1}{2}\left(pk1\right)\left(pk+2\right)$ degrees of freedom.
Equality of the remaining eigenvalues indicates that if any more principal components are to be considered then they all should be considered.
Instead of the variancecovariance matrix the correlation matrix, the sums of squares and crossproducts matrix or a standardized sums of squares and crossproducts matrix may be used. In the last case $S$ is replaced by ${\sigma}^{\frac{1}{2}}S{\sigma}^{\frac{1}{2}}$ for a diagonal matrix $\sigma $ with positive elements. If the correlation matrix is used, the ${\chi}^{2}$ approximation for the statistic given above is not valid.
The principal component scores, $F$, are the values of the principal component variables for the observations. These can be standardized so that the variance of these scores for each principal component is $1.0$ or equal to the corresponding eigenvalue.
Weights can be used with the analysis, in which case the matrix $X$ is first centred about the weighted means then each row is scaled by an amount $\sqrt{{w}_{i}}$, where ${w}_{i}$ is the weight for the $i$th observation.
4
References
Chatfield C and Collins A J (1980) Introduction to Multivariate Analysis Chapman and Hall
Cooley W C and Lohnes P R (1971) Multivariate Data Analysis Wiley
Hammarling S (1985) The singular value decomposition in multivariate statistics SIGNUM Newsl. 20(3) 2–25
Kendall M G and Stuart A (1969) The Advanced Theory of Statistics (Volume 1) (3rd Edition) Griffin
Morrison D F (1967) Multivariate Statistical Methods McGraw–Hill
5
Arguments

1:
$\mathbf{matrix}$ – Character(1)
Input

On entry: indicates for which type of matrix the principal component analysis is to be carried out.
 ${\mathbf{matrix}}=\text{'C'}$
 It is for the correlation matrix.
 ${\mathbf{matrix}}=\text{'S'}$
 It is for a standardized matrix, with standardizations given by s.
 ${\mathbf{matrix}}=\text{'U'}$
 It is for the sums of squares and crossproducts matrix.
 ${\mathbf{matrix}}=\text{'V'}$
 It is for the variancecovariance matrix.
Constraint:
${\mathbf{matrix}}=\text{'C'}$, $\text{'S'}$, $\text{'U'}$ or $\text{'V'}$.

2:
$\mathbf{std}$ – Character(1)
Input

On entry: indicates if the principal component scores are to be standardized.
 ${\mathbf{std}}=\text{'S'}$
 The principal component scores are standardized so that ${F}^{\prime}F=I$, i.e., $F={X}_{s}P{\Lambda}^{1}=V$.
 ${\mathbf{std}}=\text{'U'}$
 The principal component scores are unstandardized, i.e., $F={X}_{s}P=V\Lambda $.
 ${\mathbf{std}}=\text{'Z'}$
 The principal component scores are standardized so that they have unit variance.
 ${\mathbf{std}}=\text{'E'}$
 The principal component scores are standardized so that they have variance equal to the corresponding eigenvalue.
Constraint:
${\mathbf{std}}=\text{'E'}$, $\text{'S'}$, $\text{'U'}$ or $\text{'Z'}$.

3:
$\mathbf{weight}$ – Character(1)
Input

On entry: indicates if weights are to be used.
 ${\mathbf{weight}}=\text{'U'}$
 No weights are used.
 ${\mathbf{weight}}=\text{'W'}$
 Weights are used and must be supplied in wt.
Constraint:
${\mathbf{weight}}=\text{'U'}$ or $\text{'W'}$.

4:
$\mathbf{n}$ – Integer
Input

On entry: $n$, the number of observations.
Constraint:
${\mathbf{n}}\ge 2$.

5:
$\mathbf{m}$ – Integer
Input

On entry: $m$, the number of variables in the data matrix.
Constraint:
${\mathbf{m}}\ge 1$.

6:
$\mathbf{x}\left({\mathbf{ldx}},{\mathbf{m}}\right)$ – Real (Kind=nag_wp) array
Input

On entry: ${\mathbf{x}}\left(\mathit{i},\mathit{j}\right)$ must contain the $\mathit{i}$th observation for the $\mathit{j}$th variable, for $\mathit{i}=1,2,\dots ,n$ and $\mathit{j}=1,2,\dots ,m$.

7:
$\mathbf{ldx}$ – Integer
Input

On entry: the first dimension of the array
x as declared in the (sub)program from which
g03aaf is called.
Constraint:
${\mathbf{ldx}}\ge {\mathbf{n}}$.

8:
$\mathbf{isx}\left({\mathbf{m}}\right)$ – Integer array
Input

On entry:
${\mathbf{isx}}\left(j\right)$ indicates whether or not the
$j$th variable is to be included in the analysis.
If
${\mathbf{isx}}\left(\mathit{j}\right)>0$, the variable contained in the
$\mathit{j}$th column of
x is included in the principal component analysis, for
$\mathit{j}=1,2,\dots ,m$.
Constraint:
${\mathbf{isx}}\left(j\right)>0$ for
nvar values of
$j$.

9:
$\mathbf{s}\left({\mathbf{m}}\right)$ – Real (Kind=nag_wp) array
Input/Output

On entry: the standardizations to be used, if any.
If
${\mathbf{matrix}}=\text{'S'}$, the first
$m$ elements of
s must contain the standardization coefficients, the diagonal elements of
$\sigma $.
Constraint:
if ${\mathbf{isx}}\left(j\right)>0$,
${\mathbf{s}}\left(\mathit{j}\right)>0.0$, for $\mathit{j}=1,2,\dots ,m$.
On exit: if
${\mathbf{matrix}}=\text{'S'}$,
s is unchanged on exit.
If
${\mathbf{matrix}}=\text{'C'}$,
s contains the variances of the selected variables.
${\mathbf{s}}\left(j\right)$ contains the variance of the variable in the
$j$th column of
x if
${\mathbf{isx}}\left(j\right)>0$.
If
${\mathbf{matrix}}=\text{'U'}$ or
$\text{'V'}$,
s is not referenced.

10:
$\mathbf{wt}\left(*\right)$ – Real (Kind=nag_wp) array
Input

Note: the dimension of the array
wt
must be at least
${\mathbf{n}}$ if
${\mathbf{weight}}=\text{'W'}$, and at least
$1$ otherwise.
On entry: if
${\mathbf{weight}}=\text{'W'}$, the first
$n$ elements of
wt must contain the weights to be used in the principal component analysis.
If ${\mathbf{wt}}\left(i\right)=0.0$, the $i$th observation is not included in the analysis. The effective number of observations is the sum of the weights.
If
${\mathbf{weight}}=\text{'U'}$,
wt is not referenced and the effective number of observations is
$n$.
Constraints:
 ${\mathbf{wt}}\left(\mathit{i}\right)\ge 0.0$, for $\mathit{i}=1,2,\dots ,n$;
 the sum of weights $\text{}\ge {\mathbf{nvar}}+1$.

11:
$\mathbf{nvar}$ – Integer
Input

On entry: $p$, the number of variables in the principal component analysis.
Constraint:
$1\le {\mathbf{nvar}}\le \mathrm{min}\phantom{\rule{0.125em}{0ex}}\left({\mathbf{n}}1,{\mathbf{m}}\right)$.

12:
$\mathbf{e}\left({\mathbf{lde}},6\right)$ – Real (Kind=nag_wp) array
Output

On exit: the statistics of the principal component analysis.
 ${\mathbf{e}}\left(i,1\right)$
 The eigenvalues associated with the
$\mathit{i}$th principal component, ${\lambda}_{\mathit{i}}^{2}$, for $\mathit{i}=1,2,\dots ,p$.
 ${\mathbf{e}}\left(i,2\right)$
 The proportion of variation explained by the
$\mathit{i}$th principal component, for $\mathit{i}=1,2,\dots ,p$.
 ${\mathbf{e}}\left(\mathit{i},3\right)$
 The cumulative proportion of variation explained by the first
$\mathit{i}$th principal components, for $\mathit{i}=1,2,\dots ,p$.
 ${\mathbf{e}}\left(\mathit{i},4\right)$
 The ${\chi}^{2}$ statistics, for $i=1,2,\dots ,p$.
 ${\mathbf{e}}\left(i,5\right)$
 The degrees of freedom for the ${\chi}^{2}$ statistics, for $i=1,2,\dots ,p$.
If ${\mathbf{matrix}}\ne \text{'C'}$,
${\mathbf{e}}\left(\mathit{i},6\right)$ contains significance level for the ${\chi}^{2}$ statistic, for $\mathit{i}=1,2,\dots ,p$.
If ${\mathbf{matrix}}=\text{'C'}$, ${\mathbf{e}}\left(i,6\right)$ is returned as zero.

13:
$\mathbf{lde}$ – Integer
Input

On entry: the first dimension of the array
e as declared in the (sub)program from which
g03aaf is called.
Constraint:
${\mathbf{lde}}\ge {\mathbf{nvar}}$.

14:
$\mathbf{p}\left({\mathbf{ldp}},{\mathbf{nvar}}\right)$ – Real (Kind=nag_wp) array
Output

On exit: the first
nvar columns of
p contain the principal component loadings,
${a}_{i}$. The
$j$th column of
p contains the
nvar coefficients for the
$j$th principal component.

15:
$\mathbf{ldp}$ – Integer
Input

On entry: the first dimension of the array
p as declared in the (sub)program from which
g03aaf is called.
Constraint:
${\mathbf{ldp}}\ge {\mathbf{nvar}}$.

16:
$\mathbf{v}\left({\mathbf{ldv}},{\mathbf{nvar}}\right)$ – Real (Kind=nag_wp) array
Output

On exit: the first
nvar columns of
v contain the principal component scores. The
$j$th column of
v contains the
n scores for the
$j$th principal component.
If ${\mathbf{weight}}=\text{'W'}$, any rows for which ${\mathbf{wt}}\left(i\right)$ is zero will be set to zero.

17:
$\mathbf{ldv}$ – Integer
Input

On entry: the first dimension of the array
v as declared in the (sub)program from which
g03aaf is called.
Constraint:
${\mathbf{ldv}}\ge {\mathbf{n}}$.

18:
$\mathbf{wk}\left(1\right)$ – Real (Kind=nag_wp) array
Input

This argument is no longer accessed by g03aaf. Workspace is provided internally by dynamic allocation instead.

19:
$\mathbf{ifail}$ – Integer
Input/Output

On entry:
ifail must be set to
$0$,
$1\text{or}1$. If you are unfamiliar with this argument you should refer to
Section 4 in the Introduction to the NAG Library FL Interface for details.
For environments where it might be inappropriate to halt program execution when an error is detected, the value
$1\text{or}1$ is recommended. If the output of error messages is undesirable, then the value
$1$ is recommended. Otherwise, if you are not familiar with this argument, the recommended value is
$0$.
When the value $\mathbf{1}\text{or}\mathbf{1}$ is used it is essential to test the value of ifail on exit.
On exit:
${\mathbf{ifail}}={\mathbf{0}}$ unless the routine detects an error or a warning has been flagged (see
Section 6).
6
Error Indicators and Warnings
If on entry
${\mathbf{ifail}}=0$ or
$1$, explanatory error messages are output on the current error message unit (as defined by
x04aaf).
Errors or warnings detected by the routine:
 ${\mathbf{ifail}}=1$

On entry, ${\mathbf{lde}}=\u2329\mathit{\text{value}}\u232a$ and ${\mathbf{nvar}}=\u2329\mathit{\text{value}}\u232a$.
Constraint: ${\mathbf{lde}}\ge {\mathbf{nvar}}$.
On entry, ${\mathbf{ldp}}=\u2329\mathit{\text{value}}\u232a$ and ${\mathbf{nvar}}=\u2329\mathit{\text{value}}\u232a$.
Constraint: ${\mathbf{ldp}}\ge {\mathbf{nvar}}$.
On entry, ${\mathbf{ldv}}=\u2329\mathit{\text{value}}\u232a$ and ${\mathbf{n}}=\u2329\mathit{\text{value}}\u232a$.
Constraint: ${\mathbf{ldv}}\ge {\mathbf{n}}$.
On entry, ${\mathbf{ldx}}=\u2329\mathit{\text{value}}\u232a$ and ${\mathbf{n}}=\u2329\mathit{\text{value}}\u232a$.
Constraint: ${\mathbf{ldx}}\ge {\mathbf{n}}$.
On entry, ${\mathbf{m}}=\u2329\mathit{\text{value}}\u232a$.
Constraint: ${\mathbf{m}}\ge 1$.
On entry, ${\mathbf{matrix}}=\u2329\mathit{\text{value}}\u232a$.
Constraint: ${\mathbf{matrix}}=\text{'C'}$, $\text{'S'}$, $\text{'U'}$ or $\text{'V'}$.
On entry, ${\mathbf{n}}=\u2329\mathit{\text{value}}\u232a$.
Constraint: ${\mathbf{n}}\ge 2$.
On entry, ${\mathbf{n}}=\u2329\mathit{\text{value}}\u232a$ and ${\mathbf{nvar}}=\u2329\mathit{\text{value}}\u232a$.
Constraint: ${\mathbf{n}}>{\mathbf{nvar}}$.
On entry, ${\mathbf{nvar}}=\u2329\mathit{\text{value}}\u232a$.
Constraint: ${\mathbf{nvar}}\ge 1$.
On entry, ${\mathbf{nvar}}=\u2329\mathit{\text{value}}\u232a$ and ${\mathbf{m}}=\u2329\mathit{\text{value}}\u232a$.
Constraint: ${\mathbf{nvar}}\le {\mathbf{m}}$.
On entry, ${\mathbf{std}}=\u2329\mathit{\text{value}}\u232a$.
Constraint: ${\mathbf{std}}=\text{'E'}$, $\text{'S'}$, $\text{'U'}$ or $\text{'Z'}$.
On entry, ${\mathbf{weight}}=\u2329\mathit{\text{value}}\u232a$.
Constraint: ${\mathbf{weight}}=\text{'U'}$ or $\text{'W'}$.
 ${\mathbf{ifail}}=2$

On entry, $i=\u2329\mathit{\text{value}}\u232a$ and ${\mathbf{wt}}\left(i\right)<0.0$.
Constraint: ${\mathbf{wt}}\left(\mathit{i}\right)\ge 0.0$.
 ${\mathbf{ifail}}=3$

Number of selected variables $\text{}\ge \text{}$ effective number of observations.
On entry,
${\mathbf{nvar}}=\u2329\mathit{\text{value}}\u232a$ and
$\u2329\mathit{\text{value}}\u232a$ values of
${\mathbf{isx}}>0$.
Constraint: exactly
nvar elements of
${\mathbf{isx}}>0$.
 ${\mathbf{ifail}}=4$

On entry, $j=\u2329\mathit{\text{value}}\u232a$ and ${\mathbf{s}}\left(j\right)\le 0.0$.
Constraint: ${\mathbf{s}}\left(j\right)>0.0$.
 ${\mathbf{ifail}}=5$

The singular value decomposition has failed to converge. This is an unlikely error exit.
 ${\mathbf{ifail}}=6$

All eigenvalues/singular values are zero. This will be caused by all the variables being constant.
 ${\mathbf{ifail}}=99$
An unexpected error has been triggered by this routine. Please
contact
NAG.
See
Section 7 in the Introduction to the NAG Library FL Interface for further information.
 ${\mathbf{ifail}}=399$
Your licence key may have expired or may not have been installed correctly.
See
Section 8 in the Introduction to the NAG Library FL Interface for further information.
 ${\mathbf{ifail}}=999$
Dynamic memory allocation failed.
See
Section 9 in the Introduction to the NAG Library FL Interface for further information.
7
Accuracy
As g03aaf uses a singular value decomposition of the data matrix, it will be less affected by illconditioned problems than traditional methods using the eigenvalue decomposition of the variancecovariance matrix.
8
Parallelism and Performance
g03aaf makes calls to BLAS and/or LAPACK routines, which may be threaded within the vendor library used by this implementation. Consult the documentation for the vendor library for further information.
Please consult the
X06 Chapter Introduction for information on how to control and interrogate the OpenMP environment used within this routine. Please also consult the
Users' Note for your implementation for any additional implementationspecific information.
None.
10
Example
A dataset is taken from
Cooley and Lohnes (1971), it consists of ten observations on three variables. The unweighted principal components based on the variancecovariance matrix are computed and the principal component scores requested. The principal component scores are standardized so that they have variance equal to the corresponding eigenvalue.
10.1
Program Text
10.2
Program Data
10.3
Program Results