naginterfaces.library.mv.prin_comp

naginterfaces.library.mv.prin_comp(matrix, std, x, isx, s, nvar, wt=None)

prin_comp performs a principal component analysis on a data matrix; both the principal component loadings and the principal component scores are returned.

For full information please refer to the NAG Library document for g03aa

https://support.nag.com/numeric/nl/nagdoc_30.3/flhtml/g03/g03aaf.html

Parameters

matrix : str, length 1

Indicates for which type of matrix the principal component analysis is to be carried out.

$\mathrm{matrix} = \texttt{'C'}$

It is for the correlation matrix.

$\mathrm{matrix} = \texttt{'S'}$

It is for a standardized matrix, with standardizations given by $s$.

$\mathrm{matrix} = \texttt{'U'}$

It is for the sums of squares and cross-products matrix.

$\mathrm{matrix} = \texttt{'V'}$

It is for the variance-covariance matrix.

std : str, length 1

Indicates if the principal component scores are to be standardized.

$\mathrm{std} = \texttt{'S'}$

The principal component scores are standardized so that $F' F = I$, i.e., $F = X_{s} P \Lambda^{-1} = V$.

$\mathrm{std} = \texttt{'U'}$

The principal component scores are unstandardized, i.e., $F = X_{s} P = V \Lambda$.

$\mathrm{std} = \texttt{'Z'}$

The principal component scores are standardized so that they have unit variance.

$\mathrm{std} = \texttt{'E'}$

The principal component scores are standardized so that they have variance equal to the corresponding eigenvalue.

x : float, array-like, shape $(n, m)$

$x[i - 1, j - 1]$ must contain the $i$th observation for the $j$th variable, for $j = 1, 2, \dots, m$ and $i = 1, 2, \dots, n$.

isx : int, array-like, shape $(m)$

$\mathrm{isx}[j - 1]$ indicates whether or not the $j$th variable is to be included in the analysis.

If $\mathrm{isx}[j - 1] > 0$, the variable contained in the $j$th column of $x$ is included in the principal component analysis, for $j = 1, 2, \dots, m$.

s : float, array-like, shape $(m)$

The standardizations to be used, if any.

If $\mathrm{matrix} = \texttt{'S'}$, the first $m$ elements of $s$ must contain the standardization coefficients, the diagonal elements of $\sigma$.

nvar : int

$p$, the number of variables in the principal component analysis.

wt : None or float, array-like, shape $(n)$, optional

If $\mathrm{wt}$ is not None, the first $n$ elements of $\mathrm{wt}$ must contain the weights to be used in the principal component analysis.

If $\mathrm{wt}[i - 1] = 0.0$, the $i$th observation is not included in the analysis.

The effective number of observations is the sum of the weights.

If $\mathrm{wt}$ is None, the effective number of observations is $n$.

Returns

s : float, ndarray, shape $(m)$

If $\mathrm{matrix} = \texttt{'S'}$, $s$ is unchanged on exit.

If $\mathrm{matrix} = \texttt{'C'}$, $s$ contains the variances of the selected variables. $s[j - 1]$ contains the variance of the variable in the $j$th column of $x$ if $\mathrm{isx}[j - 1] > 0$.

If $\mathrm{matrix} = \texttt{'U'}$ or $\texttt{'V'}$, $s$ is not referenced.

e : float, ndarray, shape $(\mathrm{nvar}, 6)$

The statistics of the principal component analysis.

$e[i - 1, 0]$

The eigenvalue associated with the $i$th principal component, $\lambda_{i}^{2}$, for $i = 1, 2, \dots, p$.

$e[i - 1, 1]$

The proportion of variation explained by the $i$th principal component, for $i = 1, 2, \dots, p$.

$e[i - 1, 2]$

The cumulative proportion of variation explained by the first $i$ principal components, for $i = 1, 2, \dots, p$.

$e[i - 1, 3]$

The $\chi^{2}$ statistics, for $i = 1, 2, \dots, p$.

$e[i - 1, 4]$

The degrees of freedom for the $\chi^{2}$ statistics, for $i = 1, 2, \dots, p$.

$e[i - 1, 5]$

If $\mathrm{matrix} \neq \texttt{'C'}$, $e[i - 1, 5]$ contains the significance level for the $\chi^{2}$ statistic, for $i = 1, 2, \dots, p$.

If $\mathrm{matrix} = \texttt{'C'}$, $e[i - 1, 5]$ is returned as zero.

p : float, ndarray, shape $(\mathrm{nvar}, \mathrm{nvar})$

The first $\mathrm{nvar}$ columns of $p$ contain the principal component loadings, $a_{i}$. The $j$th column of $p$ contains the $\mathrm{nvar}$ coefficients for the $j$th principal component.

v : float, ndarray, shape $(n, \mathrm{nvar})$

The first $\mathrm{nvar}$ columns of $v$ contain the principal component scores. The $j$th column of $v$ contains the $n$ scores for the $j$th principal component.

If $\mathrm{wt}$ is not None, any rows for which $\mathrm{wt}[i - 1]$ is zero will be set to zero.
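The following is a minimal calling sketch rather than an official example. The helper names `prepare_inputs` and `run_pca` are hypothetical, and the sketch assumes that the result unpacks in the order listed under Returns ($s$, $e$, $p$, $v$). Only `prepare_inputs` is executed here; `run_pca` needs a licensed NAG Library installation.

```python
def prepare_inputs(x, selected):
    """Build isx, s and nvar for a data matrix x given as a list of rows.

    `selected` holds the zero-based indices of the columns to include.
    """
    m = len(x[0])
    isx = [1 if j in selected else 0 for j in range(m)]
    nvar = sum(1 for flag in isx if flag > 0)
    # Placeholder values: s is documented as not referenced for 'U' or 'V'.
    s = [0.0] * m
    return isx, s, nvar


def run_pca(x, selected):
    """Call prin_comp on the variance-covariance matrix (matrix='V')."""
    # Imported lazily so prepare_inputs can be used without the NAG library.
    from naginterfaces.library import mv

    isx, s, nvar = prepare_inputs(x, selected)
    # Assumption: the result unpacks as (s, e, p, v), the order under Returns.
    _, e, p, v = mv.prin_comp('V', 'E', x, isx, s, nvar)
    return e, p, v


x = [[7.0, 4.0, 3.0], [4.0, 1.0, 8.0], [6.0, 3.0, 5.0], [8.0, 6.0, 1.0]]
isx, s, nvar = prepare_inputs(x, selected={0, 2})
print(isx, nvar)
```

Deriving `nvar` from `isx` in one place keeps the two consistent, which the routine requires (exactly `nvar` elements of `isx` must be positive).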

Raises

NagValueError

(errno $1$)

On entry, $\mathrm{nvar} = \langle\mathit{value}\rangle$ and $m = \langle\mathit{value}\rangle$.

Constraint: $\mathrm{nvar} \leq m$.

(errno $1$)

On entry, $\mathrm{nvar} = \langle\mathit{value}\rangle$.

Constraint: $\mathrm{nvar} \geq 1$.

(errno $1$)

On entry, $n = \langle\mathit{value}\rangle$ and $\mathrm{nvar} = \langle\mathit{value}\rangle$.

Constraint: $n > \mathrm{nvar}$.

(errno $1$)

On entry, $\mathrm{std} = \langle\mathit{value}\rangle$.

Constraint: $\mathrm{std} = \texttt{'E'}$, $\texttt{'S'}$, $\texttt{'U'}$ or $\texttt{'Z'}$.

(errno $1$)

On entry, $\mathrm{matrix} = \langle\mathit{value}\rangle$.

Constraint: $\mathrm{matrix} = \texttt{'C'}$, $\texttt{'S'}$, $\texttt{'U'}$ or $\texttt{'V'}$.

(errno $1$)

On entry, $n = \langle\mathit{value}\rangle$.

Constraint: $n \geq 2$.

(errno $1$)

On entry, $m = \langle\mathit{value}\rangle$.

Constraint: $m \geq 1$.

(errno $2$)

On entry, $i = \langle\mathit{value}\rangle$ and $\mathrm{wt}[i - 1] < 0.0$.

Constraint: $\mathrm{wt}[i - 1] \geq 0.0$.

(errno $3$)

Number of selected variables $\geq$ effective number of observations.

(errno $3$)

On entry, $\mathrm{nvar} = \langle\mathit{value}\rangle$ and $\langle\mathit{value}\rangle$ values of $\mathrm{isx} > 0$.

Constraint: exactly $\mathrm{nvar}$ elements of $\mathrm{isx} > 0$.

(errno $4$)

On entry, $j = \langle\mathit{value}\rangle$ and $s[j - 1] \leq 0.0$.

Constraint: $s[j - 1] > 0.0$.

(errno $5$)

The singular value decomposition has failed to converge.
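The input constraints listed above can be checked before calling the routine. The sketch below is a hypothetical helper, `check_inputs`, and is not part of naginterfaces. It mirrors only the documented constraints and assumes that the $s[j - 1] > 0.0$ check applies to the selected variables when matrix is 'S'. It does not replace the library's own checks.

```python
def check_inputs(matrix, std, x, isx, s, nvar, wt=None):
    """Return a list of violated constraints (empty if none are violated).

    x is a non-empty list of rows; isx, s and wt are plain sequences.
    """
    n, m = len(x), len(x[0])
    problems = []
    if matrix not in ("C", "S", "U", "V"):
        problems.append("matrix must be 'C', 'S', 'U' or 'V'")
    if std not in ("E", "S", "U", "Z"):
        problems.append("std must be 'E', 'S', 'U' or 'Z'")
    if n < 2:
        problems.append("n >= 2")
    if nvar < 1:
        problems.append("nvar >= 1")
    if nvar > m:
        problems.append("nvar <= m")
    if n <= nvar:
        problems.append("n > nvar")
    if wt is not None:
        if any(w < 0.0 for w in wt):
            problems.append("wt[i-1] >= 0.0")
        # The effective number of observations is the sum of the weights.
        if nvar >= sum(wt):
            problems.append("selected variables < effective observations")
    if sum(1 for flag in isx if flag > 0) != nvar:
        problems.append("exactly nvar elements of isx > 0")
    # Assumption: only the selected variables need positive standardizations.
    if matrix == "S" and any(
        s[j] <= 0.0 for j, flag in enumerate(isx) if flag > 0
    ):
        problems.append("s[j-1] > 0.0")
    return problems


x = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
ok = check_inputs("V", "E", x, [1, 1], [0.0, 0.0], 2)
bad = check_inputs("X", "E", x, [1, 0], [0.0, 0.0], 2)
print(ok, bad)
```

Here `ok` is empty, while `bad` reports the invalid `matrix` value and the mismatch between `nvar` and the number of positive `isx` entries.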

Warns

NagAlgorithmicWarning

(errno $6$): All eigenvalues/singular values are zero. This will be caused by all the variables being constant.

Notes

In the NAG Library the traditional C interface for this routine uses a different algorithmic base. Please contact NAG if you have any questions about compatibility.

Let $X$ be an $n \times p$ data matrix of $n$ observations on $p$ variables $x_{1}, x_{2}, \dots, x_{p}$ and let the $p \times p$ variance-covariance matrix of $x_{1}, x_{2}, \dots, x_{p}$ be $S$ . A vector $a_{1}$ of length $p$ is found such that:

$$a_{1}^{T} S a_{1} \text{ is maximized subject to } a_{1}^{T} a_{1} = 1.$$

The variable $z_{1} = \sum_{i = 1}^{p} a_{1 i} x_{i}$ is known as the first principal component and gives the linear combination of the variables that gives the maximum variation. A second principal component, $z_{2} = \sum_{i = 1}^{p} a_{2 i} x_{i}$ , is found such that:

$$a_{2}^{T} S a_{2} \text{ is maximized subject to } a_{2}^{T} a_{2} = 1 \text{ and } a_{2}^{T} a_{1} = 0.$$

This gives the linear combination of variables that is orthogonal to the first principal component that gives the maximum variation. Further principal components are derived in a similar way.
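The maximization that defines the first principal component can be illustrated with power iteration. This is a swapped-in technique chosen for brevity, not the method used by prin_comp (which uses a singular value decomposition). The function name `first_component` and the example matrix are assumptions for illustration, and the sketch assumes $S$ has a distinct largest eigenvalue.

```python
import math


def first_component(S, iterations=200):
    """Approximate the unit vector a that maximizes a^T S a.

    Uses power iteration on a symmetric positive semi-definite matrix S
    given as a list of rows. Returns (a, a^T S a).
    """
    p = len(S)
    # Arbitrary start vector; it must not be orthogonal to the answer.
    a = [1.0] + [0.0] * (p - 1)
    for _ in range(iterations):
        b = [sum(S[i][j] * a[j] for j in range(p)) for i in range(p)]
        norm = math.sqrt(sum(v * v for v in b))
        a = [v / norm for v in b]  # re-impose the constraint a^T a = 1
    variance = sum(a[i] * S[i][j] * a[j] for i in range(p) for j in range(p))
    return a, variance


# Hypothetical 2 x 2 variance-covariance matrix with eigenvalues 3 and 1.
S = [[2.0, 1.0], [1.0, 2.0]]
a1, var1 = first_component(S)
print(a1, var1)
```

For this matrix the loadings converge to equal weights on both variables and the maximized variance approaches the largest eigenvalue, 3.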

The vectors $a_{1}, a_{2}, \dots, a_{p}$ are the eigenvectors of the matrix $S$, and associated with each eigenvector is the eigenvalue $\lambda_{i}^{2}$. The value of $\lambda_{i}^{2} / \sum \lambda_{i}^{2}$ gives the proportion of variation explained by the $i$th principal component. Alternatively, the $a_{i}$'s can be considered as the right singular vectors in a singular value decomposition, with singular values $\lambda_{i}$, of the data matrix centred about its mean and scaled by $1 / \sqrt{n - 1}$, $X_{s}$. This latter approach is used in prin_comp, with

$$X_{s} = V \Lambda P'$$

where $\Lambda$ is a diagonal matrix with elements $\lambda_{i}$, $P$ is the $p \times p$ matrix with columns $a_{i}$ and $V$ is an $n \times p$ matrix with $V' V = I$, which gives the principal component scores.
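As a small illustration of how the singular values relate to the proportions of variation, the sketch below squares a set of singular values and normalizes them. The function name `explained_variation` and the input values are hypothetical.

```python
def explained_variation(singular_values):
    """Return eigenvalues, proportions and cumulative proportions.

    The eigenvalues are the squared singular values; each proportion is
    an eigenvalue divided by the sum of all eigenvalues.
    """
    eigenvalues = [lam * lam for lam in singular_values]
    total = sum(eigenvalues)
    proportions = [ev / total for ev in eigenvalues]
    cumulative = []
    running = 0.0
    for prop in proportions:
        running += prop
        cumulative.append(running)
    return eigenvalues, proportions, cumulative


# Hypothetical singular values, largest first.
eigs, props, cum = explained_variation([2.0, 1.0, 1.0])
print(eigs, props, cum)
```

These three lists correspond to the quantities described for the first three columns of $e$.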

Principal component analysis is often used to reduce the dimension of a dataset, replacing a large number of correlated variables with a smaller number of orthogonal variables that still contain most of the information in the original dataset.

The choice of the number of dimensions required is usually based on the amount of variation accounted for by the leading principal components. If $k$ principal components are selected, then a test of the equality of the remaining $p - k$ eigenvalues is

$$\left(n - \frac{2 p + 5}{6}\right) \left\{ - \sum_{i = k + 1}^{p} \log (\lambda_{i}^{2}) + (p - k) \log \left( \sum_{i = k + 1}^{p} \lambda_{i}^{2} / (p - k) \right) \right\}$$

which has, asymptotically, a $\chi^{2}$-distribution with $\frac{1}{2} (p - k - 1) (p - k + 2)$ degrees of freedom.
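The test statistic and its degrees of freedom can be evaluated directly from the eigenvalues. The sketch below is a plain transcription of the formula above into a hypothetical function, `equality_test`. It assumes unweighted data, so that $n$ is the number of observations, and strictly positive eigenvalues.

```python
import math


def equality_test(eigenvalues, n, k):
    """Statistic and degrees of freedom for equality of the last p - k eigenvalues.

    eigenvalues: the p eigenvalues (lambda_i squared), largest first.
    n: number of observations; k: number of components retained.
    """
    p = len(eigenvalues)
    rest = eigenvalues[k:]
    q = p - k
    log_sum = sum(math.log(ev) for ev in rest)
    mean_rest = sum(rest) / q
    statistic = (n - (2 * p + 5) / 6) * (-log_sum + q * math.log(mean_rest))
    # (q - 1) * (q + 2) is always even, so integer division is exact.
    dof = (q - 1) * (q + 2) // 2
    return statistic, dof


# Hypothetical eigenvalues: the last two are equal, so the k = 1 statistic is 0.
stat_k1, dof_k1 = equality_test([4.0, 1.0, 1.0], n=20, k=1)
stat_k0, dof_k0 = equality_test([4.0, 1.0, 1.0], n=20, k=0)
print(stat_k1, dof_k1, stat_k0, dof_k0)
```

When the remaining eigenvalues are exactly equal, the arithmetic and geometric means coincide and the statistic vanishes; otherwise it is positive.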

Equality of the remaining eigenvalues indicates that if any more principal components are to be considered then they all should be considered.

Instead of the variance-covariance matrix, the correlation matrix, the sums of squares and cross-products matrix or a standardized sums of squares and cross-products matrix may be used. In the last case $S$ is replaced by $\sigma^{- \frac{1}{2}} S \sigma^{- \frac{1}{2}}$ for a diagonal matrix $\sigma$ with positive elements. If the correlation matrix is used, the $\chi^{2}$ approximation for the statistic given above is not valid.

The principal component scores, $F$ , are the values of the principal component variables for the observations. These can be standardized so that the variance of these scores for each principal component is $1.0$ or equal to the corresponding eigenvalue.
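The four score standardizations can be related to one another with a short sketch. It rests on assumptions that are not stated explicitly above: the data are unweighted, a column of $V$ has zero mean and unit sum of squares, and variances use the divisor $n - 1$. Under those assumptions, multiplying a column of $V$ by $\sqrt{n - 1}$ gives unit variance. The function names are hypothetical.

```python
import math


def scale_scores(v_col, lam, n, std):
    """Scale one column of V according to the std option.

    Assumes v_col has zero mean and unit sum of squares (unweighted case)
    and that lam is the corresponding singular value.
    """
    if std == "S":
        factor = 1.0                      # F = V
    elif std == "U":
        factor = lam                      # F = V * Lambda
    elif std == "Z":
        factor = math.sqrt(n - 1)         # unit variance
    elif std == "E":
        factor = lam * math.sqrt(n - 1)   # variance equal to lam ** 2
    else:
        raise ValueError("std must be 'E', 'S', 'U' or 'Z'")
    return [factor * v for v in v_col]


def sample_variance(col):
    """Sample variance with divisor len(col) - 1."""
    mean = sum(col) / len(col)
    return sum((c - mean) ** 2 for c in col) / (len(col) - 1)


# Hypothetical score column for n = 3 with singular value 2 (eigenvalue 4).
v_col = [1.0 / math.sqrt(2.0), 0.0, -1.0 / math.sqrt(2.0)]
var_z = sample_variance(scale_scores(v_col, 2.0, 3, "Z"))
var_e = sample_variance(scale_scores(v_col, 2.0, 3, "E"))
print(var_z, var_e)
```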

Weights can be used with the analysis, in which case the matrix $X$ is first centred about the weighted means then each row is scaled by an amount $\sqrt{w_{i}}$ , where $w_{i}$ is the weight for the $i$ th observation.
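The weighted preprocessing described above can be sketched as follows. The function name `weighted_centre_scale` is hypothetical and the sketch covers only the centring and row scaling, not any further normalization the routine may apply.

```python
import math


def weighted_centre_scale(x, wt):
    """Centre x about its weighted column means, then scale rows by sqrt(w_i).

    x is a list of rows and wt a list of non-negative weights, one per row.
    Returns (means, scaled_rows).
    """
    total = sum(wt)  # effective number of observations
    m = len(x[0])
    means = [sum(w * row[j] for w, row in zip(wt, x)) / total for j in range(m)]
    scaled = [
        [math.sqrt(w) * (row[j] - means[j]) for j in range(m)]
        for w, row in zip(wt, x)
    ]
    return means, scaled


# The third observation has zero weight, so it does not affect the means
# and its scaled row is zero.
x = [[1.0, 2.0], [3.0, 4.0], [5.0, 9.0]]
means, scaled = weighted_centre_scale(x, [1.0, 1.0, 0.0])
print(means, scaled)
```

A zero weight removes an observation from both the means and the scaled matrix, consistent with the description of $\mathrm{wt}$ under Parameters.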

