NAG CL Interface
g03aac (prin_comp)

1 Purpose

g03aac performs a principal component analysis on a data matrix; both the principal component loadings and the principal component scores are returned.

2 Specification

#include <nag.h>
void g03aac (Nag_PrinCompMat pcmatrix, Nag_PrinCompScores scores,
             Integer n, Integer m, const double x[], Integer tdx,
             const Integer isx[], double s[], const double wt[],
             Integer nvar, double e[], Integer tde, double p[], Integer tdp,
             double v[], Integer tdv, NagError *fail)
The function may be called by the names: g03aac or nag_mv_prin_comp.

3 Description

Let $X$ be an $n \times p$ data matrix of $n$ observations on $p$ variables $x_1, x_2, \ldots, x_p$ and let the $p \times p$ variance-covariance matrix of $x_1, x_2, \ldots, x_p$ be $S$. A vector $a_1$ of length $p$ is found such that:
$$a_1^T S a_1$$
is maximized subject to
$$a_1^T a_1 = 1.$$
The variable $z_1 = \sum_{i=1}^{p} a_{1i} x_i$ is known as the first principal component and gives the linear combination of the variables that gives the maximum variation. A second principal component, $z_2 = \sum_{i=1}^{p} a_{2i} x_i$, is found such that:
$$a_2^T S a_2$$
is maximized subject to
$$a_2^T a_2 = 1$$
and
$$a_2^T a_1 = 0.$$
This gives the linear combination of variables that is orthogonal to the first principal component that gives the maximum variation. Further principal components are derived in a similar way.
The vectors $a_1, a_2, \ldots, a_p$ are the eigenvectors of the matrix $S$, and associated with each eigenvector is the eigenvalue, $\lambda_i^2$. The value of $\lambda_i^2 / \sum_{j=1}^{p} \lambda_j^2$ gives the proportion of variation explained by the $i$th principal component. Alternatively, the $a_i$ can be considered as the right singular vectors in a singular value decomposition, with singular values $\lambda_i$, of the data matrix centred about its mean and scaled by $1/\sqrt{n-1}$, $X_s$. This latter approach is used in g03aac, with
$$X_s = V \Lambda P^T$$
where $\Lambda$ is a diagonal matrix with elements $\lambda_i$, $P$ is the $p \times p$ matrix with columns $a_i$ and $V$ is an $n \times p$ matrix with $V^T V = I$, which gives the principal component scores.
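As a minimal sketch of the proportion of variation described above, the C fragment below turns a set of singular values into per-component and cumulative proportions. The function name `explained_variation` and its argument layout are illustrative assumptions and are not part of the NAG interface; g03aac itself returns these quantities in its output array e.

```c
#include <assert.h>
#include <stddef.h>

/* Given the p singular values lambda[i] of the centred, scaled data matrix,
 * fill prop[i] with lambda_i^2 / sum_j lambda_j^2 and cum[i] with the
 * running total of prop[0..i]. Hypothetical helper, not a NAG routine. */
void explained_variation(const double lambda[], size_t p,
                         double prop[], double cum[])
{
    double total = 0.0;
    double running = 0.0;
    size_t i;

    assert(p > 0);
    for (i = 0; i < p; ++i)
        total += lambda[i] * lambda[i];   /* eigenvalue is lambda_i^2 */
    assert(total > 0.0);                  /* all-zero values cannot be normalised */

    for (i = 0; i < p; ++i) {
        prop[i] = lambda[i] * lambda[i] / total;
        running += prop[i];
        cum[i] = running;
    }
}
```

For singular values 3 and 4 the eigenvalues are 9 and 16, so the proportions are 0.36 and 0.64.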
Principal component analysis is often used to reduce the dimension of a dataset, replacing a large number of correlated variables with a smaller number of orthogonal variables that still contain most of the information in the original dataset.
The choice of the number of dimensions required is usually based on the amount of variation accounted for by the leading principal components. If $k$ principal components are selected, then a test of the equality of the remaining $p-k$ eigenvalues is
$$\left(n - (2p+5)/6\right) \left\{ -\sum_{i=k+1}^{p} \log(\lambda_i^2) + (p-k) \log\left( \sum_{i=k+1}^{p} \lambda_i^2 / (p-k) \right) \right\}$$
which has, asymptotically, a $\chi^2$ distribution with $\frac{1}{2}(p-k-1)(p-k+2)$ degrees of freedom.
Equality of the remaining eigenvalues indicates that if any more principal components are to be considered then they all should be considered.
Instead of the variance-covariance matrix, the correlation matrix, the sums of squares and cross-products matrix or a standardized sums of squares and cross-products matrix may be used. In the last case $S$ is replaced by $\sigma^{-1/2} S \sigma^{-1/2}$ for a diagonal matrix $\sigma$ with positive elements. If the correlation matrix is used, the $\chi^2$ approximation for the statistic given above is not valid.
The principal component scores, $F$, are the values of the principal component variables for the observations. These can be standardized so that the variance of these scores for each principal component is 1.0 or equal to the corresponding eigenvalue.
Weights can be used with the analysis, in which case the matrix $X$ is first centred about the weighted means and then each row is scaled by an amount $\sqrt{w_i}$, where $w_i$ is the weight for the $i$th observation.

4 References

Chatfield C and Collins A J (1980) Introduction to Multivariate Analysis Chapman and Hall
Cooley W C and Lohnes P R (1971) Multivariate Data Analysis Wiley
Hammarling S (1985) The singular value decomposition in multivariate statistics SIGNUM Newsl. 20(3) 2–25
Kendall M G and Stuart A (1979) The Advanced Theory of Statistics (3 Volumes) (4th Edition) Griffin
Morrison D F (1967) Multivariate Statistical Methods McGraw–Hill

5 Arguments

1: pcmatrix Nag_PrinCompMat Input
On entry: indicates for which type of matrix the principal component analysis is to be carried out.
pcmatrix=Nag_MatCorrelation
It is for the correlation matrix.
pcmatrix=Nag_MatStandardised
It is for the standardized matrix, with standardizations given by s.
pcmatrix=Nag_MatSumSq
It is for the sums of squares and cross-products matrix.
pcmatrix=Nag_MatVarCovar
It is for the variance-covariance matrix.
Constraint: pcmatrix=Nag_MatCorrelation, Nag_MatStandardised, Nag_MatSumSq or Nag_MatVarCovar.
2: scores Nag_PrinCompScores Input
On entry: specifies the type of principal component scores to be used.
scores=Nag_ScoresStand
The principal component scores are standardized so that $F^T F = I$, i.e., $F = X_s P \Lambda^{-1} = V$.
scores=Nag_ScoresNotStand
The principal component scores are unstandardized, i.e., $F = X_s P = V \Lambda$.
scores=Nag_ScoresUnitVar
The principal component scores are standardized so that they have unit variance.
scores=Nag_ScoresEigenval
The principal component scores are standardized so that they have variance equal to the corresponding eigenvalue.
Constraint: scores=Nag_ScoresStand, Nag_ScoresNotStand, Nag_ScoresUnitVar or Nag_ScoresEigenval.
3: n Integer Input
On entry: the number of observations, $n$.
Constraint: n ≥ 2.
4: m Integer Input
On entry: the number of variables in the data matrix, $m$.
Constraint: m ≥ 1.
5: x[n×tdx] const double Input
On entry: x[(i-1)×tdx+j-1] must contain the $i$th observation for the $j$th variable, for $i = 1, 2, \ldots, n$ and $j = 1, 2, \ldots, m$.
6: tdx Integer Input
On entry: the stride separating matrix column elements in the array x.
Constraint: tdx ≥ m.
7: isx[m] const Integer Input
On entry: isx[j-1] indicates whether or not the $j$th variable is to be included in the analysis. If isx[j-1] > 0, then the variable contained in the $j$th column of x is included in the principal component analysis, for $j = 1, 2, \ldots, m$.
Constraint: isx[j-1] > 0 for nvar values of $j$.
8: s[m] double Input/Output
On entry: the standardizations to be used, if any.
If pcmatrix=Nag_MatStandardised, then the first m elements of s must contain the standardization coefficients, the diagonal elements of σ .
Constraint: if isx[j-1] > 0, s[j-1] > 0.0, for $j = 1, 2, \ldots, m$.
On exit: if pcmatrix=Nag_MatStandardised, then s is unchanged on exit.
If pcmatrix=Nag_MatCorrelation, then s contains the variances of the selected variables. s[j-1] contains the variance of the variable in the $j$th column of x if isx[j-1] > 0.
If pcmatrix=Nag_MatSumSq or Nag_MatVarCovar, then s is not referenced.
9: wt[n] const double Input
On entry: optionally, the weights to be used in the principal component analysis.
If wt[i-1] = 0.0, then the ith observation is not included in the analysis. The effective number of observations is the sum of the weights.
If weights are not provided then wt must be set to NULL and the effective number of observations is n.
Constraints:
  • if wt is not NULL, wt[i-1] ≥ 0.0, for $i = 1, 2, \ldots, n$;
  • if wt is not NULL, the sum of weights ≥ nvar + 1.
10: nvar Integer Input
On entry: the number of variables in the principal component analysis, $p$.
Constraint: 1 ≤ nvar ≤ min(n-1, m).
11: e[nvar×tde] double Output
On exit: the statistics of the principal component analysis, one row per principal component, for $i = 1, 2, \ldots, p$:
e[(i-1)×tde] contains the eigenvalue associated with the $i$th principal component, $\lambda_i^2$.
e[(i-1)×tde+1] contains the proportion of variation explained by the $i$th principal component.
e[(i-1)×tde+2] contains the cumulative proportion of variation explained by the first $i$ principal components.
e[(i-1)×tde+3] contains the $\chi^2$ statistic.
e[(i-1)×tde+4] contains the degrees of freedom for the $\chi^2$ statistic.
If pcmatrix ≠ Nag_MatCorrelation, then e[(i-1)×tde+5] contains the significance level for the $\chi^2$ statistic.
If pcmatrix=Nag_MatCorrelation, then e[(i-1)×tde+5] is returned as zero.
12: tde Integer Input
On entry: the stride separating matrix column elements in the array e.
Constraint: tde ≥ 6.
13: p[nvar×tdp] double Output
Note: the (i,j)th element of the matrix P is stored in p[(i-1)×tdp+j-1].
On exit: the first nvar columns of p contain the principal component loadings, $a_i$. The $j$th column of p contains the nvar coefficients for the $j$th principal component.
14: tdp Integer Input
On entry: the stride separating matrix column elements in the array p.
Constraint: tdp ≥ nvar.
15: v[n×tdv] double Output
Note: the (i,j)th element of the matrix V is stored in v[(i-1)×tdv+j-1].
On exit: the first nvar columns of v contain the principal component scores. The $j$th column of v contains the $n$ scores for the $j$th principal component.
If weights are supplied in the array wt, then any rows for which wt[i-1] is zero will be set to zero.
16: tdv Integer Input
On entry: the stride separating matrix column elements in the array v.
Constraint: tdv ≥ nvar.
17: fail NagError * Input/Output
The NAG error argument (see Section 7 in the Introduction to the NAG Library CL Interface).
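The storage conventions used by the array arguments above can be sketched with a few small helpers. These are illustrations only: the names `element_at`, `count_included`, `PcStats` and `pc_stats` are hypothetical and not part of the NAG interface, and `int` stands in for the library's `Integer` type.

```c
#include <assert.h>

/* Element (i, j), both 1-based as in the text, of a row-major array whose
 * rows are td elements apart. The same formula addresses x (td = tdx),
 * p (td = tdp) and v (td = tdv). */
double element_at(const double a[], int td, int i, int j)
{
    assert(i >= 1 && j >= 1 && j <= td);
    return a[(i - 1) * td + (j - 1)];
}

/* Number of variables flagged for inclusion (isx[j] > 0); this is the
 * count that nvar has to match. */
int count_included(const int isx[], int m)
{
    int count = 0;
    int j;

    assert(m >= 1);
    for (j = 0; j < m; ++j)
        if (isx[j] > 0)
            ++count;
    return count;
}

/* One row of the statistics array e, in the column order listed for e. */
typedef struct {
    double eigenvalue;    /* column 0 */
    double proportion;    /* column 1 */
    double cumulative;    /* column 2 */
    double chi_sq;        /* column 3 */
    double deg_freedom;   /* column 4 */
    double significance;  /* column 5 */
} PcStats;

/* Statistics for the i-th principal component (1-based). */
PcStats pc_stats(const double e[], int tde, int i)
{
    const double *row;
    PcStats s;

    assert(tde >= 6 && i >= 1);
    row = &e[(i - 1) * tde];
    s.eigenvalue = row[0];
    s.proportion = row[1];
    s.cumulative = row[2];
    s.chi_sq = row[3];
    s.deg_freedom = row[4];
    s.significance = row[5];
    return s;
}
```

Because the stride may exceed the number of columns in use, the helpers always step by the stride rather than by m, nvar or 6.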

6 Error Indicators and Warnings

NE_2_INT_ARG_GE
On entry, nvar=value while n=value. These arguments must satisfy nvar < n.
NE_2_INT_ARG_GT
On entry, nvar=value while m=value. These arguments must satisfy nvar ≤ m.
NE_2_INT_ARG_LT
On entry, tdp=value while nvar=value. These arguments must satisfy tdp ≥ nvar.
On entry, tdv=value while nvar=value. These arguments must satisfy tdv ≥ nvar.
On entry, tdx=value while m=value. These arguments must satisfy tdx ≥ m.
NE_ALLOC_FAIL
Dynamic memory allocation failed.
NE_BAD_PARAM
On entry, argument pcmatrix had an illegal value.
On entry, argument scores had an illegal value.
NE_INT_ARG_LT
On entry, m=value.
Constraint: m ≥ 1.
On entry, n=value.
Constraint: n ≥ 2.
On entry, nvar=value.
Constraint: nvar ≥ 1.
On entry, tde=value.
Constraint: tde ≥ 6.
NE_INTERNAL_ERROR
An internal error has occurred in this function. Check the function call and any array sizes. If the call is correct then please contact NAG for assistance.
NE_NEG_WEIGHT_ELEMENT
On entry, wt[value] = value.
Constraint: when referenced, all elements of wt must be non-negative.
NE_OBSERV_LT_VAR
With weighted data, the effective number of observations given by the sum of weights =value, while the number of variables included in the analysis, nvar=value.
Constraint: effective number of observations > nvar + 1.
NE_SVD_NOT_CONV
The singular value decomposition has failed to converge. This is an unlikely error exit.
NE_VAR_INCL_INDICATED
The number of variables in the analysis, nvar=value, while the number of variables included in the analysis via array isx=value.
Constraint: these two numbers must be the same.
NE_VAR_INCL_STANDARD
On entry, the standardization element s[value] = value, while the variable to be included isx[value] = value.
Constraint: when a variable is to be included, the standardization element must be positive.
NE_ZERO_EIGVALS
All eigenvalues/singular values are zero. This will be caused by all the variables being constant.

7 Accuracy

As g03aac uses a singular value decomposition of the data matrix, it will be less affected by ill-conditioned problems than traditional methods using the eigenvalue decomposition of the variance-covariance matrix.

8 Parallelism and Performance

g03aac is not threaded in any implementation.

9 Further Comments

None.

10 Example

A dataset is taken from Cooley and Lohnes (1971); it consists of ten observations on three variables. The unweighted principal components based on the variance-covariance matrix are computed and unstandardized principal component scores are requested.

10.1 Program Text

Program Text (g03aace.c)

10.2 Program Data

Program Data (g03aace.d)

10.3 Program Results

Program Results (g03aace.r)