G02 Chapter Introduction : NAG Library CL Interface, Mark 28

The (Pearson) product-moment correlation coefficients measure a linear relationship, while Kendall's tau and Spearman's rank order correlation coefficients measure monotonicity only. All three coefficients range from $-1.0$ to $+1.0$. A coefficient of zero always indicates that no linear relationship exists; a $+1.0$ coefficient implies a ‘perfect’ positive relationship (i.e., an increase in one variable is always associated with a corresponding increase in the other variable); and a coefficient of $-1.0$ indicates a ‘perfect’ negative relationship (i.e., an increase in one variable is always associated with a corresponding decrease in the other variable).

Consider the bivariate scattergrams in Figure 1: (a) and (b) show strictly linear functions for which the values of the product-moment correlation coefficient, and (since a linear function is also monotonic) both Kendall's tau and Spearman's rank order coefficients, would be

+ 1.0

and

- 1.0

respectively. However, though the relationships in figures (c) and (d) are respectively monotonically increasing and monotonically decreasing, for which both Kendall's and Spearman's nonparametric coefficients would be

+ 1.0

(in (c)) and

- 1.0

(in (d)), the functions are nonlinear so that the product-moment coefficients would not take such ‘perfect’ extreme values. There is no obvious relationship between the variables in figure (e), so all three coefficients would assume values close to zero, while in figure (f) though there is an obvious parabolic relationship between the two variables, it would not be detected by any of the correlation coefficients which would again take values near to zero; it is important, therefore, to examine scattergrams as well as the correlation coefficients.

In order to decide which type of correlation is the most appropriate, it is necessary to appreciate the different groups into which variables may be classified. Variables are generally divided into four types of scales: the nominal scale, the ordinal scale, the interval scale, and the ratio scale. The nominal scale is used only to categorise data; for each category a name, perhaps numeric, is assigned so that two different categories will be identified by distinct names. The ordinal scale, as well as categorising the observations, orders the categories. Each category is assigned a distinct identifying symbol, in such a way that the order of the symbols corresponds to the order of the categories. (The most common system for ordinal variables is to assign numerical identifiers to the categories, though if they have previously been assigned alphabetic characters, these may be transformed to a numerical system by any convenient method which preserves the ordering of the categories.) The interval scale not only categorises and orders the observations, but also quantifies the comparison between categories; this necessitates a common unit of measurement and an arbitrary zero-point. Finally, the ratio scale is similar to the interval scale, except that it has an absolute (as opposed to arbitrary) zero-point.

For a more complete discussion of these four types of scales, and some examples, you are referred to Churchman and Ratoosh (1959) and Hays (1970).

Figure 1

Product-moment correlation coefficients are used with variables which are interval (or ratio) scales; these coefficients measure the amount of spread about the linear least squares equation. For a product-moment correlation coefficient,

r

, based on

n

pairs of observations, testing against the null hypothesis that there is no correlation between the two variables, the statistic

r \sqrt{\frac{n - 2}{1 - r^{2}}}

has a Student's

t

-distribution with

n - 2

degrees of freedom; its significance can be tested accordingly.
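As a minimal sketch of the test described above, the following Python fragment computes $r$ and the associated $t$-statistic for a small made-up dataset. The function names and the data are illustrative assumptions and are not part of the NAG Library; the resulting statistic would be referred to a Student's $t$-distribution with $n - 2$ degrees of freedom.

```python
import math

def pearson_r(x, y):
    """Product-moment correlation coefficient of two equal-length samples."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def t_statistic(r, n):
    """r * sqrt((n - 2) / (1 - r^2)); undefined when |r| = 1."""
    return r * math.sqrt((n - 2) / (1.0 - r * r))

# Illustrative data only (hypothetical values).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 1.0, 4.0, 3.0, 5.0]
r = pearson_r(x, y)
t = t_statistic(r, len(x))
print(r, t)
```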

Ranked and ordinal scale data are generally analysed by nonparametric methods – usually either Spearman's or Kendall's tau rank order correlation coefficients, which, as their names suggest, operate solely on the ranks, or relative orders, of the data values. Interval or ratio scale variables may also be validly analysed by nonparametric methods, but such techniques are statistically less powerful than a product-moment method. For a Spearman rank order correlation coefficient,

R

, based on

n

pairs of observations, testing against the null hypothesis that there is no correlation between the two variables, for large samples the statistic

R \sqrt{\frac{n - 2}{1 - R^{2}}}

has approximately a Student's

t

-distribution with

n - 2

degrees of freedom, and may be treated accordingly. (This is similar to the product-moment correlation coefficient,

r

, see above.) Kendall's tau coefficient, based on

n

pairs of observations, has, for large samples, an approximately Normal distribution with mean zero and standard deviation

\sqrt{\frac{4 n + 10}{9 n (n - 1)}}

when tested against the null hypothesis that there is no correlation between the two variables; the coefficient should, therefore, be divided by this standard deviation and tested against the standard Normal distribution,

N (0, 1) .
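The large-sample test for Kendall's tau can be sketched as follows. This is an illustrative Python fragment, not NAG Library code: it assumes there are no ties (so no tie correction is applied), the function names are hypothetical, and the five-point dataset is far too small for the Normal approximation to be trusted in practice.

```python
import math
from itertools import combinations

def kendall_tau(x, y):
    """Kendall's tau for data without ties: (concordant - discordant) / total pairs."""
    concordant = discordant = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        s = (x1 - x2) * (y1 - y2)
        if s > 0:
            concordant += 1
        elif s < 0:
            discordant += 1
    n = len(x)
    return (concordant - discordant) / (n * (n - 1) / 2)

def kendall_z(tau, n):
    """Divide tau by its large-sample standard deviation under the null hypothesis."""
    sd = math.sqrt((4 * n + 10) / (9 * n * (n - 1)))
    return tau / sd

# Illustrative data only (hypothetical values).
x = [1, 2, 3, 4, 5]
y = [2, 1, 4, 3, 5]
tau = kendall_tau(x, y)
z = kendall_z(tau, len(x))
print(tau, z)
```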

When the number of ordinal categories a variable takes is large, and the number of ties is relatively small, Spearman's rank order correlation coefficients have advantages over Kendall's tau; conversely, when the number of categories is small, or there are a large number of ties, Kendall's tau is usually preferred. Thus, when the ordinal scale is more or less continuous, Spearman's rank order coefficients are preferred, whereas Kendall's tau is used when the data is grouped into a smaller number of categories; both measures do however include corrections for the occurrence of ties, and the basic concepts underlying the two coefficients are quite similar. The absolute value of Kendall's tau coefficient tends to be slightly smaller than Spearman's coefficient for the same set of data.

There is no authoritative dictum on the selection of correlation coefficients – particularly on the advisability of using correlations with ordinal data. This is a matter of discretion for you.

2.1.3 Partial correlation

The correlation coefficients described above measure the association between two variables ignoring any other variables in the system. Suppose there are three variables

X, Y

and

Z

as shown in the path diagram below.

Figure 2

The association between $Y$ and $Z$ is made up of the direct association between $Y$ and $Z$ and the association caused by the path through $X$, that is, the association of both $Y$ and $Z$ with the third variable $X$. For example, if $Z$ and $Y$ were cholesterol level and blood pressure and $X$ were age, then, since both blood pressure and cholesterol level may increase with age, the correlation between blood pressure and cholesterol level eliminating the effect of age is required.

The correlation between two variables eliminating the effect of a third variable is known as the partial correlation. If $ρ_{z y}$, $ρ_{z x}$ and $ρ_{x y}$ represent the correlations between $x$, $y$ and $z$, then the partial correlation between $Z$ and $Y$ given $X$ is

\frac{ρ_{z y} - ρ_{z x} ρ_{x y}}{\sqrt{(1 - ρ_{z x}^{2}) (1 - ρ_{x y}^{2})}} .

The partial correlation is then estimated by using product-moment correlation coefficients.
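A direct transcription of the three-variable formula is shown below as a hedged Python sketch. The function name and the input correlations are made up for illustration; in practice the inputs would be estimated product-moment coefficients.

```python
import math

def partial_correlation(r_zy, r_zx, r_xy):
    """Partial correlation between Z and Y given X from three pairwise correlations."""
    return (r_zy - r_zx * r_xy) / math.sqrt((1.0 - r_zx ** 2) * (1.0 - r_xy ** 2))

# Hypothetical correlations: Z and Y both correlate 0.6 with X and 0.5 with each other.
r_partial = partial_correlation(0.5, 0.6, 0.6)
print(r_partial)
```

In this made-up case most of the association between $Z$ and $Y$ is accounted for by $X$, so the partial correlation is much smaller than the raw value of $0.5$.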

In general, let a set of variables be partitioned into two groups

Y

and

X

with

n_{y}

variables in

Y

and

n_{x}

variables in

X

and let the variance-covariance matrix of all

n_{y} + n_{x}

variables be partitioned into

[\begin{array}{cc} Σ_{x x} & Σ_{x y} \\ Σ_{y x} & Σ_{y y} \end{array}] .

Then the variance-covariance of

Y

conditional on fixed values of the

X

variables is given by

Σ_{y ∣ x} = Σ_{y y} - Σ_{y x} Σ_{x x}^{−1} Σ_{x y} .

The partial correlation matrix is then computed by standardizing $Σ_{y ∣ x}$.
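The conditional variance-covariance calculation and its standardization can be sketched for the special case of a single conditioning variable, where $Σ_{x x}$ is a scalar and no matrix inversion is needed. This is an illustrative Python fragment under that assumption; the function names and input values are hypothetical.

```python
import math

def conditional_covariance(s_yy, s_yx, s_xx):
    """Sigma_yy - Sigma_yx Sigma_xx^{-1} Sigma_xy for one conditioning variable.

    s_yy is an m-by-m list of lists, s_yx a length-m list (covariances of each
    Y variable with X), and s_xx the scalar variance of X.
    """
    m = len(s_yy)
    return [[s_yy[i][j] - s_yx[i] * s_yx[j] / s_xx for j in range(m)]
            for i in range(m)]

def standardize(cov):
    """Scale a covariance matrix to unit diagonal."""
    d = [math.sqrt(cov[i][i]) for i in range(len(cov))]
    return [[cov[i][j] / (d[i] * d[j]) for j in range(len(cov))]
            for i in range(len(cov))]

# Hypothetical inputs: two Y variables, one X variable, all with unit variance.
sigma_yy = [[1.0, 0.5], [0.5, 1.0]]
sigma_yx = [0.6, 0.6]
sigma_xx = 1.0
partial = standardize(conditional_covariance(sigma_yy, sigma_yx, sigma_xx))
print(partial)
```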

2.1.4 Robust estimation of correlation coefficients

The product-moment correlation coefficient can be greatly affected by the presence of a few extreme observations or outliers. There are robust estimation procedures which aim to decrease the effect of extreme values.

Mathematically these methods can be described as follows. A robust estimate of the variance-covariance matrix, $C$, can be written as

C = τ^{2} {(A^{T} A)}^{−1}

where $τ^{2}$ is a correction factor to give an unbiased estimator if the data is Normal and $A$ is a lower triangular matrix. Let $x_{i}$ be the vector of values for the $i$th observation and let $z_{i} = A (x_{i} - θ)$, $θ$ being a robust estimate of location; then $θ$ and $A$ are found as solutions to

\frac{1}{n} \sum_{i = 1}^{n} w ({‖ z_{i} ‖}_{2}) z_{i} = 0

and

\frac{1}{n} \sum_{i = 1}^{n} u ({‖ z_{i} ‖}_{2}) z_{i} z_{i}^{T} - v ({‖ z_{i} ‖}_{2}) I = 0,

where $w (t)$, $u (t)$ and $v (t)$ are functions such that they return a value of $1$ for reasonable values of $t$ and decreasing values for large $t$. The correlation matrix can then be calculated from the variance-covariance matrix. If $w$, $u$, and $v$ returned $1$ for all values then the product-moment correlation coefficient would be calculated.

2.1.5 Missing values

When there are missing values in the data these may be handled in one of two ways. Firstly, if a case contains a missing observation for any variable, then that case is omitted in its entirety from all calculations; this may be termed casewise treatment of missing data. Secondly, if a case contains a missing observation for any variable, then the case is omitted from only those calculations involving the variable for which the value is missing; this may be called pairwise treatment of missing data. Pairwise deletion of missing data has the advantage of using as much of the data as possible in the computation of each coefficient. In extreme circumstances, however, it can have the disadvantage of producing coefficients which are based on a different number of cases, and even on different selections of cases or samples; furthermore, the ‘correlation’ matrices formed in this way need not necessarily be positive semidefinite, a requirement for a correlation matrix. Casewise deletion of missing data generally causes fewer cases to be used in the calculation of the coefficients than does pairwise deletion. How great this difference is will obviously depend on the distribution of the missing data, both among cases and among variables.

Pairwise treatment does, therefore, use more information from the sample but should not be used without careful consideration of the location of the missing observations in the data matrix, and the consequent effect of processing the missing data in that fashion.
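The difference between the two treatments can be seen by counting the cases each one retains. The following Python sketch uses `None` to mark a missing value; the helper names and the data are illustrative assumptions only.

```python
def complete_cases(rows):
    """Casewise treatment: keep only rows with no missing value in any variable."""
    return [r for r in rows if all(v is not None for v in r)]

def pairwise_cases(rows, i, j):
    """Pairwise treatment: keep rows where both variable i and variable j are present."""
    return [(r[i], r[j]) for r in rows if r[i] is not None and r[j] is not None]

# Hypothetical data: five cases on three variables, None marks a missing value.
data = [
    (1.0, 2.0, 3.0),
    (2.0, None, 1.0),
    (3.0, 5.0, None),
    (4.0, 4.0, 2.0),
    (None, 1.0, 5.0),
]

n_casewise = len(complete_cases(data))
pair_counts = {(i, j): len(pairwise_cases(data, i, j))
               for i in range(3) for j in range(i + 1, 3)}
print(n_casewise, pair_counts)
```

Here casewise deletion leaves two cases for every coefficient, while each pairwise coefficient is based on three cases, but on a different selection of three for each pair of variables.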

2.1.6 Nearest Correlation Matrix

A correlation matrix is, by definition, a symmetric, positive semidefinite matrix with unit diagonals and all elements in the range $[−1, 1]$.

In practice, rather than having a true correlation matrix, you may find that you have a matrix of pairwise correlations. This usually occurs in the presence of missing values, when the missing values are treated in a pairwise fashion as discussed in Section 2.1.5. Matrices constructed in this way may not be positive semidefinite and, therefore, may not be valid correlation matrices. However, a valid correlation matrix can be calculated that is in some sense ‘close’ to the original.

Given an $n \times n$ matrix, $G$, there are a number of available ways of computing the ‘nearest’ correlation matrix, $Σ$, to $G$:

(a) Frobenius Norm
Find $Σ$ such that

$\sum_{i = 1}^{n} \sum_{j = 1}^{n} {(s_{i j} - σ_{i j})}^{2}$

is minimized, where $S$ is the symmetric matrix defined as $S = \frac{1}{2} (G + G^{T})$ and $s_{i j}$ and $σ_{i j}$ denote the elements of $S$ and $Σ$ respectively.

A weighted Frobenius norm can also be used. The term being summed across, therefore, becomes $w_{i} w_{j} {(s_{i j} - σ_{i j})}^{2}$ if row and column weights are being used or $w_{i j} {(s_{i j} - σ_{i j})}^{2}$ when element-wise weights are used. A constraint on the rank of $Σ$ can also be added if a low-rank correlation matrix is required to constrain the number of independent random variables.
(b) Factor Loading Method
This method is similar to (a) in that it finds a $Σ$ that is closest to $S$ in the Frobenius norm. However, it also ensures that $Σ$ has a $k$ -factor structure, that is $Σ$ can be written as

$Σ = X X^{T} + diag (I - X X^{T})$

where $I$ is the identity matrix and $X$ has $n$ rows and $k$ columns.

$X$ is often referred to as the factor loading matrix. This problem primarily arises when a factor model $ξ = X η + D ε$ is used to describe a multivariate time series or collateralized debt obligations. In this model $η \in ℝ^{k}$ and $ε \in ℝ^{n}$ are vectors of independent random variables having zero mean and unit variance, with $η$ and $ε$ independent of each other, and $X \in ℝ^{n \times k}$ with $D \in ℝ^{n \times n}$ diagonal. In the case of modelling debt obligations $ξ$ can, for example, model the equity returns of $n$ different companies of a portfolio where $η$ describes $k$ factors influencing all companies, in contrast to the elements of $ε$ having only an effect on the equity of the corresponding company. With this model the complex behaviour of a portfolio, with potentially thousands of equities, is captured by looking at the major factors driving the behaviour.

The number of factors usually chosen is a lot smaller than $n$, perhaps between $1$ and $10$, yielding a large reduction in the complexity. The number of factors, $k$, which yields a matrix $X$ such that ${‖ G - (X X^{T} + diag (I - X X^{T})) ‖}_{F}$ is within a required tolerance can also be determined, by experimenting with the input $k$ and comparing the norms.
(c) Shrinking
Find the smallest $α$ such that

$α T + (1 - α) S$

is a correlation matrix. Here $T$ is a positive definite target matrix with unit diagonal. $T$ can be chosen to fix elements in the resulting matrix by having elements equal to the corresponding element in $S$ .

Shrinking algorithms can be very efficient, using bisection to find $α$ . A solution is always found, as $α = 1$ gives the result $T$ , which is necessarily a valid correlation matrix.

Note that shrinking algorithms do not find the nearest correlation matrix in any mathematical sense, simply the smallest $α$ in the structure above.
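The bisection idea behind shrinking can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the NAG Library algorithm: validity is tested by attempting a Cholesky factorization, which checks for strict positive definiteness rather than semidefiniteness, the target $T$ is taken to be the identity, and all function names and the example matrix are hypothetical. Bisection is applicable because, if a given $α$ yields a positive definite matrix, so does every larger $α$ up to $1$.

```python
import math

def is_positive_definite(a):
    """Return True if the symmetric matrix a admits a Cholesky factorization."""
    n = len(a)
    chol = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = a[i][j] - sum(chol[i][k] * chol[j][k] for k in range(j))
            if i == j:
                if s <= 0.0:
                    return False
                chol[i][i] = math.sqrt(s)
            else:
                chol[i][j] = s / chol[j][j]
    return True

def blend(alpha, target, s):
    """Form alpha * T + (1 - alpha) * S element by element."""
    n = len(s)
    return [[alpha * target[i][j] + (1.0 - alpha) * s[i][j] for j in range(n)]
            for i in range(n)]

def shrink(s, target, tol=1e-8):
    """Bisect on alpha in [0, 1]; assumes target itself is positive definite."""
    if is_positive_definite(s):
        return 0.0, s
    lo, hi = 0.0, 1.0          # blend(lo) is invalid, blend(hi) = target is valid
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if is_positive_definite(blend(mid, target, s)):
            hi = mid
        else:
            lo = mid
    return hi, blend(hi, target, s)

# Hypothetical indefinite 'correlation' matrix (eigenvalues 1.9, 1.9, -0.8).
S = [[1.0, 0.9, -0.9], [0.9, 1.0, 0.9], [-0.9, 0.9, 1.0]]
T = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
alpha, sigma = shrink(S, T)
print(alpha)
```

For this example the smallest eigenvalue of the blended matrix is $1 - 1.8 (1 - α)$, so the bisection converges to $α = 4 / 9$.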

2.2 Regression

2.2.1 Aims of regression modelling

In regression analysis the relationship between one specific random variable, the dependent or response variable, and one or more known variables, called the independent variables or covariates, is studied. This relationship is represented by a mathematical model, or an equation, which associates the dependent variable with the independent variables, together with a set of relevant assumptions. The independent variables are related to the dependent variable by a function, called the regression function, which involves a set of unknown parameters. Values of the parameters which give the best fit for a given set of data are obtained; these values are known as the estimates of the parameters.

The reasons for using a regression model are twofold. The first is to obtain a description of the relationship between the variables as an indicator of possible causality. The second reason is to predict the value of the dependent variable from a set of values of the independent variables. Accordingly, the most usual statistical problems involved in regression analysis are:

(i) to obtain best estimates of the unknown regression parameters;
(ii) to test hypotheses about these parameters;
(iii) to determine the adequacy of the assumed model; and
(iv) to verify the set of relevant assumptions.

2.2.2 Regression models and designed experiments

One application of regression models is in the analysis of experiments. In this case the model relates the dependent variable to qualitative independent variables known as factors. Factors may take a number of different values known as levels. For example, in an experiment in which one of four different treatments is applied, the model will have one factor with four levels. Each level of the factor can be represented by a dummy variable taking the values $0$ or $1$. So, in the example there are four dummy variables $x_{j}$, for $j = 1, 2, 3, 4$, such that:

x_{i j} = \begin{cases} 1 & \text{if the } i \text{th observation received the } j \text{th treatment} \\ 0 & \text{otherwise,} \end{cases}

along with a variable for the mean, $x_{0}$:

x_{i 0} = 1 \text{ for all } i .

If there were seven observations the data would be:

\begin{array}{ccccccc}
\text{Treatment} & Y & x_{0} & x_{1} & x_{2} & x_{3} & x_{4} \\
1 & y_{1} & 1 & 1 & 0 & 0 & 0 \\
2 & y_{2} & 1 & 0 & 1 & 0 & 0 \\
2 & y_{3} & 1 & 0 & 1 & 0 & 0 \\
3 & y_{4} & 1 & 0 & 0 & 1 & 0 \\
3 & y_{5} & 1 & 0 & 0 & 1 & 0 \\
4 & y_{6} & 1 & 0 & 0 & 0 & 1 \\
4 & y_{7} & 1 & 0 & 0 & 0 & 1
\end{array}

When dummy variables are used it is common for the model not to be of full rank. In the case above, the model would not be of full rank because

x_{i 4} = x_{i 0} - x_{i 1} - x_{i 2} - x_{i 3}, i = 1, 2, \dots, 7 .

This means that the effect of $x_{4}$ cannot be distinguished from the combined effect of $x_{0}, x_{1}, x_{2}$ and $x_{3}$. This is known as aliasing. In this situation, the aliasing can be deduced from the experimental design and, as a result, the model to be fitted; in such situations it is known as intrinsic aliasing. In the example above no matter how many times each treatment is replicated (other than $0$) the aliasing will still be present. If the aliasing is due to a particular dataset to which the model is to be fitted then it is known as extrinsic aliasing. If in the example above observation $1$ was missing then the $x_{1}$ term would also be aliased. In general, intrinsic aliasing may be overcome by changing the model, e.g., remove $x_{0}$ or $x_{1}$ from the model, or by introducing constraints on the parameters, e.g., $β_{1} + β_{2} + β_{3} + β_{4} = 0$.

If aliasing is present then there will no longer be a unique set of least squares estimates for the parameters of the model but the fitted values will still have a unique estimate. Some linear functions of the parameters will also have unique estimates; these are known as estimable functions. In the example given above the functions (

β_{0} + β_{1}

) and (

β_{2} - β_{3}

) are both estimable.
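The dummy-variable coding and the resulting intrinsic aliasing can be checked with a short Python sketch. The helper name `design_row` is hypothetical; the treatment allocation is the seven-observation example given above.

```python
def design_row(treatment, levels=4):
    """Row (x0, x1, ..., x_levels): a mean column plus one 0/1 dummy per level."""
    return [1] + [1 if treatment == j else 0 for j in range(1, levels + 1)]

# Treatment received by each of the seven observations in the example.
treatments = [1, 2, 2, 3, 3, 4, 4]
X = [design_row(t) for t in treatments]

# Intrinsic aliasing: x4 equals x0 - x1 - x2 - x3 in every row,
# so the five columns are linearly dependent.
aliased = all(row[4] == row[0] - row[1] - row[2] - row[3] for row in X)

for row in X:
    print(row)
print(aliased)
```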

2.2.3 Selecting the regression model

In many situations there are several possible independent variables, not all of which may be needed in the model. In order to select a suitable set of independent variables, two basic approaches can be used.

(a) All possible regressions
In this case all the possible combinations of independent variables are fitted and the one considered the best selected. To choose the best, two conflicting criteria have to be balanced. One is the fit of the model which will improve as more variables are added to the model. The second criterion is the desire to have a model with a small number of significant terms. Depending on how the model is fit, statistics such as $R^{2}$ , which gives the proportion of variation explained by the model, and $C_{p}$ , which tries to balance the size of the residual sum of squares against the number of terms in the model, can be used to aid in the choice of model.
(b) Stepwise model building
In stepwise model building the regression model is constructed recursively, adding or deleting the independent variables one at a time. When the model is built up the procedure is known as forward selection. The first step is to choose the single variable which is the best predictor. The second independent variable to be added to the regression equation is that which provides the best fit in conjunction with the first variable. Further variables are then added in this recursive fashion, adding at each step the optimum variable, given the other variables already in the equation. Alternatively, backward elimination can be used. This is when all variables are added and then the variables dropped one at a time, the variable dropped being the one which has the least effect on the fit of the model at that stage. There are also hybrid techniques which combine forward selection with backward elimination.

2.3 Linear Regression Models

When the regression model is linear in the parameters (but not necessarily in the independent variables), then the regression model is said to be linear; otherwise the model is classified as nonlinear.

The most elementary form of regression model is the simple linear regression of the dependent variable,

Y

, on a single independent variable,

x

, which takes the form

E (Y) = β_{0} + β_{1} x

(1)

where

E (Y)

is the expected or average value of

Y

and

β_{0}

and

β_{1}

are the parameters whose values are to be estimated, or, if the regression is required to pass through the origin (i.e., no constant term),

E (Y) = β_{1} x

(2)

where

β_{1}

is the only unknown parameter.

An extension of this is multiple linear regression in which the dependent variable,

Y

, is regressed on the

p

(

p > 1

) independent variables,

x_{1}, x_{2}, \dots, x_{p}

, which takes the form

E (Y) = β_{0} + β_{1} x_{1} + β_{2} x_{2} + \dots + β_{p} x_{p}

(3)

where

β_{1}, β_{2}, \dots, β_{p}

and

β_{0}

are the unknown parameters. Multiple linear regression models that include factors are sometimes known as General Linear (Regression) Models.

A special case of multiple linear regression is polynomial linear regression, in which the $p$ independent variables are in fact powers of the same single variable $x$ (i.e., $x_{j} = x^{j}$, for $j = 1, 2, \dots, p$). In this case, the model defined by (3) becomes

E (Y) = β_{0} + β_{1} x + β_{2} x^{2} + \dots + β_{p} x^{p} .

(4)

There are a great variety of nonlinear regression models; one of the most common is exponential regression, in which the equation may take the form

E (Y) = a + b e^{c x} .

(5)

It should be noted that equation (4) represents a linear regression, since even though the equation is not linear in the independent variable, $x$, it is linear in the parameters $β_{0}, β_{1}, β_{2}, \dots, β_{p}$, whereas the regression model of equation (5) is nonlinear, as it is nonlinear in the parameters ($a$, $b$, and $c$).

2.3.1 Fitting the regression model – least squares estimation

One method used to determine values for the parameters is, based on a given set of data, to minimize the sum of squares of the differences between the observed values of the dependent variable and the values predicted by the regression equation for that set of data – hence the term least squares estimation. For example, if a regression model of the type given by equation (3), namely

E (Y) = β_{0} x_{0} + β_{1} x_{1} + β_{2} x_{2} + \dots + β_{p} x_{p},

where

x_{0} = 1

for all observations, is to be fitted to the

n

data points

\begin{matrix} (x_{01}, x_{11}, x_{21}, \dots, x_{p 1}, y_{1}) \\ (x_{02}, x_{12}, x_{22}, \dots, x_{p 2}, y_{2}) \\ ⋮ \\ (x_{0 n}, x_{1 n}, x_{2 n}, \dots, x_{p n}, y_{n}) \end{matrix}

(6)

such that

y_{i} = β_{0} x_{0 i} + β_{1} x_{1 i} + β_{2} x_{2 i} + \dots + β_{p} x_{p i} + e_{i}, i = 1, 2, \dots, n

where $e_{i}$ are unknown independent random errors with $E (e_{i}) = 0$ and $var (e_{i}) = σ^{2}$, $σ^{2}$ being a constant, then the method used is to calculate the estimates of the regression parameters $β_{0}, β_{1}, β_{2}, \dots, β_{p}$ by minimizing

\sum_{i = 1}^{n} e_{i}^{2} .

(7)

If the errors do not have constant variance, i.e.,

var (e_{i}) = σ_{i}^{2} = \frac{σ^{2}}{w_{i}},

then weighted least squares estimation is used in which

\sum_{i = 1}^{n} w_{i} e_{i}^{2}

is minimized. For a more complete discussion of these least squares regression methods, and details of the mathematical techniques used, see Draper and Smith (1985) or Kendall and Stuart (1973).
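For the simple linear regression model (1) the least squares and weighted least squares estimates have a closed form, which the following Python sketch evaluates. The function name and data are illustrative assumptions; with all weights equal to $1$ the weighted criterion reduces to the unweighted one.

```python
def weighted_simple_regression(x, y, w=None):
    """Minimize sum_i w_i * (y_i - b0 - b1 * x_i)^2 and return (b0, b1)."""
    n = len(x)
    if w is None:
        w = [1.0] * n                      # ordinary least squares
    sw = sum(w)
    mx = sum(wi * xi for wi, xi in zip(w, x)) / sw      # weighted mean of x
    my = sum(wi * yi for wi, yi in zip(w, y)) / sw      # weighted mean of y
    sxy = sum(wi * (xi - mx) * (yi - my) for wi, xi, yi in zip(w, x, y))
    sxx = sum(wi * (xi - mx) ** 2 for wi, xi in zip(w, x))
    b1 = sxy / sxx
    b0 = my - b1 * mx
    return b0, b1

# Illustrative data only (hypothetical values).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 1.0, 4.0, 3.0, 5.0]
b0, b1 = weighted_simple_regression(x, y)
# A zero weight removes the last observation from the fit.
b0_w, b1_w = weighted_simple_regression(x, y, w=[1.0, 1.0, 1.0, 1.0, 0.0])
print(b0, b1, b0_w, b1_w)
```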

2.3.2 Computational methods for least squares regression

Let $X$ be the $n \times p$ matrix of independent variables and $y$ be the vector of values for the dependent variable. To find the least squares estimates of the vector of parameters, $\hat{β}$, the $Q R$ decomposition of $X$ is found, i.e.,

X = Q R^{*}

where

R^{*} = (\begin{matrix} R \\ 0 \end{matrix}),

$R$ being a $p \times p$ upper triangular matrix, and $Q$ an $n \times n$ orthogonal matrix. If $R$ is of full rank then $\hat{β}$ is the solution to

R \hat{β} = c_{1}

where $c = Q^{T} y$ and $c_{1}$ is the first $p$ rows of $c$. If $R$ is not of full rank, a solution is obtained by means of a singular value decomposition (SVD) of $R$,

R = Q_{*} (\begin{matrix} D & 0 \\ 0 & 0 \end{matrix}) P^{T},

where

D

is a

k \times k

diagonal matrix with nonzero diagonal elements,

k

being the rank of

R

, and

Q_{*}

and

P

are

p \times p

orthogonal matrices. This gives the solution

\hat{β} = P_{1} D^{- 1} Q_{*_{1}}^{T} c_{1},

$P_{1}$ being the first $k$ columns of $P$ and $Q_{*_{1}}$ being the first $k$ columns of $Q_{*}$.

This will be only one of the possible solutions. Other estimates may be obtained by applying constraints to the parameters. If weighted regression with a vector of weights $w$ is required then both $X$ and $y$ are premultiplied by $w^{1 / 2}$, i.e., row $i$ of each is scaled by $\sqrt{w_{i}}$.

The method described above will, in general, be more accurate than methods based on forming (

X^{T} X

), (or a scaled version), and then solving the equations

(X^{T} X) \hat{β} = X^{T} y .
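The full-rank case can be sketched in plain Python as follows. This is an illustration only: it uses modified Gram–Schmidt to form the first $p$ columns of $Q$ together with $R$, which is a simpler substitute for the orthogonal transformations a library routine would normally use, and it assumes $X$ has full column rank. The function name and data are hypothetical.

```python
import math

def qr_least_squares(X, y):
    """Solve min ||X beta - y||_2 via a thin QR factorization (full column rank assumed)."""
    n, p = len(X), len(X[0])
    q = [[X[i][j] for i in range(n)] for j in range(p)]   # columns of X, overwritten by Q
    r = [[0.0] * p for _ in range(p)]
    for j in range(p):
        r[j][j] = math.sqrt(sum(v * v for v in q[j]))
        q[j] = [v / r[j][j] for v in q[j]]
        for k in range(j + 1, p):
            r[j][k] = sum(a * b for a, b in zip(q[j], q[k]))
            q[k] = [b - r[j][k] * a for a, b in zip(q[j], q[k])]
    # c1 = first p elements of Q^T y
    c1 = [sum(a * b for a, b in zip(q[j], y)) for j in range(p)]
    # Back substitution for R beta = c1
    beta = [0.0] * p
    for j in reversed(range(p)):
        beta[j] = (c1[j] - sum(r[j][k] * beta[k] for k in range(j + 1, p))) / r[j][j]
    return beta

# Illustrative data only: a column of ones (the mean) and one independent variable.
X = [[1.0, 1.0], [1.0, 2.0], [1.0, 3.0], [1.0, 4.0], [1.0, 5.0]]
y = [2.0, 1.0, 4.0, 3.0, 5.0]
beta = qr_least_squares(X, y)
print(beta)
```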

2.3.3 Examining the fit of the model

Having fitted a model two questions need to be asked: first, ‘are all the terms in the model needed?’ and second, ‘is there some systematic lack of fit?’. To answer the first question, either confidence intervals can be computed for the parameters or

t

-tests can be calculated to test hypotheses about the regression parameters – for example, whether the value of the parameter,

β_{k}

, is significantly different from a specified value,

b_{k}

(often zero). If the estimate of $β_{k}$ is ${\hat{β}}_{k}$ and its standard error is $se ({\hat{β}}_{k})$ then the $t$-statistic is

\frac{{\hat{β}}_{k} - b_{k}}{se ({\hat{β}}_{k})} .

It should be noted that both the tests and the confidence intervals may not be independent. Alternatively,

F

-tests based on the residual sums of squares for different models can also be used to test the significance of terms in the model. If model

1

, giving residual sum of squares

{RSS}_{1}

with degrees of freedom

ν_{1}

, is a sub-model of model

2

, giving residual sum of squares

{RSS}_{2}

with degrees of freedom

ν_{2}

, i.e., all terms in model

1

are also in model

2

, then to test if the extra terms in model

2

are needed the

F

-statistic

F = \frac{({RSS}_{1} - {RSS}_{2}) / (ν_{1} - ν_{2})}{{RSS}_{2} / ν_{2}}

may be used. These tests and confidence intervals require the additional assumption that the errors,

e_{i}

, are Normally distributed.
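Both statistics are simple to evaluate once the fits are available. The Python sketch below takes the $t$-statistic as the estimate minus the hypothesised value, divided by its standard error, and uses made-up numbers corresponding to a five-point simple linear regression (model 2) compared with a mean-only model (model 1). The function names and values are illustrative assumptions.

```python
import math

def t_statistic(beta_hat, b, se):
    """(estimate - hypothesised value) / standard error of the estimate."""
    return (beta_hat - b) / se

def f_statistic(rss1, nu1, rss2, nu2):
    """F-statistic for the extra terms in model 2, where model 1 is a sub-model."""
    return ((rss1 - rss2) / (nu1 - nu2)) / (rss2 / nu2)

# Hypothetical fits: the mean-only model has RSS 10.0 on 4 degrees of freedom,
# the straight-line model has RSS 3.6 on 3 degrees of freedom.
f_value = f_statistic(10.0, 4, 3.6, 3)

# Slope estimate 0.8 with standard error sqrt(sigma^2 / Sxx) = sqrt(1.2 / 10).
t_value = t_statistic(0.8, 0.0, math.sqrt(1.2 / 10.0))
print(f_value, t_value)
```

When model 2 adds a single term, as here, the $F$-statistic equals the square of the $t$-statistic for that term.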

To check for systematic lack of fit the residuals, $r_{i} = y_{i} - {\hat{y}}_{i}$, where ${\hat{y}}_{i}$ is the fitted value, should be examined. If the model is correct then they should be random with no discernible pattern. Due to the way they are calculated the residuals do not have constant variance. Now the vector of fitted values can be written as a linear combination of the vector of observations of the dependent variable, $y$, as $\hat{y} = H y$. The variance-covariance matrix of the residuals is then $(I - H) σ^{2}$, $I$ being the identity matrix. The diagonal elements of $H$, $h_{i i}$, can, therefore, be used to standardize the residuals. The $h_{i i}$ are a measure of the effect of the $i$th observation on the fitted model and are sometimes known as leverages.
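For simple linear regression with a constant term the leverages have the closed form $h_{i i} = 1 / n + {(x_{i} - \bar{x})}^{2} / \sum_{j} {(x_{j} - \bar{x})}^{2}$, which makes a small worked sketch possible without forming $H$. The Python fragment below is illustrative only; the data and variable names are assumptions.

```python
import math

# Illustrative data only (hypothetical values).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 1.0, 4.0, 3.0, 5.0]
n = len(x)

# Least squares fit of y = b0 + b1 * x.
mx, my = sum(x) / n, sum(y) / n
sxx = sum((xi - mx) ** 2 for xi in x)
b1 = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sxx
b0 = my - b1 * mx

# Raw residuals and leverages (diagonal of H for this model).
residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
h = [1.0 / n + (xi - mx) ** 2 / sxx for xi in x]

# Estimate sigma from the residual sum of squares on n - 2 degrees of freedom,
# then scale each residual by its own standard deviation sigma * sqrt(1 - h_ii).
sigma_hat = math.sqrt(sum(r * r for r in residuals) / (n - 2))
standardized = [r / (sigma_hat * math.sqrt(1.0 - hi)) for r, hi in zip(residuals, h)]
print(h, standardized)
```

The leverages sum to the number of fitted parameters (two here), and the end points, being furthest from the mean of $x$, have the largest leverage.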

If the observations were taken serially the residuals may also be used to test the assumption of the independence of the

e_{i}

and hence the independence of the observations.

2.3.4 Ridge regression

When data on predictor variables

x

are multicollinear, ridge regression models provide an alternative to variable selection in the multiple regression model. In the ridge regression case, parameter estimates in the linear model are found by penalised least squares:

\sum_{i = 1}^{n} {[(\sum_{j = 1}^{p} x_{i j} {\hat{β}}_{j}) - y_{i}]}^{2} + h \sum_{j = 1}^{p} {\hat{β}}_{j}^{2}, h \in ℝ^{+},

where the value of the ridge parameter

h

controls the trade-off between the goodness-of-fit and smoothness of a solution.
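The effect of the ridge parameter is easiest to see with a single predictor and no constant term, where the penalised criterion has the closed-form minimizer ${\hat{β}} = \sum x_{i} y_{i} / (\sum x_{i}^{2} + h)$. The Python sketch below evaluates it for made-up data; the function name is hypothetical and the one-variable setting is a simplifying assumption.

```python
def ridge_slope(x, y, h):
    """Minimize sum_i (x_i * beta - y_i)^2 + h * beta^2 for a single predictor."""
    sxy = sum(xi * yi for xi, yi in zip(x, y))
    sxx = sum(xi * xi for xi in x)
    return sxy / (sxx + h)

# Illustrative data lying exactly on y = 2x (hypothetical values).
x = [1.0, 2.0, 3.0]
y = [2.0, 4.0, 6.0]
beta_ls = ridge_slope(x, y, 0.0)       # h = 0 gives the least squares estimate
beta_ridge = ridge_slope(x, y, 14.0)   # a larger h shrinks the estimate towards zero
print(beta_ls, beta_ridge)
```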

2.4 Robust Estimation

Least squares regression can be greatly affected by a small number of unusual, atypical, or extreme observations. To protect against such occurrences, robust regression methods have been developed. These methods aim to give less weight to an observation which seems to be out of line with the rest of the data given the model under consideration. That is, they seek to bound the influence of such observations. For a discussion of influence in regression, see Hampel et al. (1986) and Huber (1981).

There are two ways in which an observation for a regression model can be considered atypical. The values of the independent variables for the observation may be atypical or the residual from the model may be large.

The first problem of atypical values of the independent variables can be tackled by calculating weights for each observation which reflect how atypical it is, i.e., a strongly atypical observation would have a low weight. There are several ways of finding suitable weights; some are discussed in Hampel et al. (1986).

The second problem is tackled by bounding the contribution of the individual

e_{i}

to the criterion to be minimized. When minimizing (7) a set of linear equations is formed, the solution of which gives the least squares estimates. The equations are

\sum_{i = 1}^{n} e_{i} x_{i j} = 0, j = 0, 1, \dots, k .

These equations are replaced by

\sum_{i = 1}^{n} ψ (e_{i} / σ) x_{i j} = 0, j = 0, 1, \dots, k,

(8)

where

σ^{2}

is the variance of the

e_{i}

, and

ψ

is a suitable function which down weights large values of the standardized residuals

e_{i} / σ

. There are several suggested forms for

ψ

, one of which is Huber's function,

ψ (t) = \begin{cases} - c, & t < - c \\ t, & | t | \leq c \\ c, & t > c \end{cases}

(9)
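Huber's function simply clips its argument to the interval $[- c, c]$, as the following Python sketch shows. The function name is hypothetical, and the default $c = 1.345$ is a commonly quoted tuning value used here as an assumption; it is not taken from this document.

```python
def huber_psi(t, c=1.345):
    """Huber's psi: the identity on [-c, c], constant at -c or c outside it."""
    if t < -c:
        return -c
    if t > c:
        return c
    return t

# Small standardized residuals pass through unchanged; large ones are bounded.
values = [huber_psi(t) for t in (-3.0, -0.5, 0.0, 0.5, 3.0)]
print(values)
```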

The solution to (8) gives the

M

-estimates of the regression coefficients. The weights can be included in (8) to protect against both types of extreme value. The parameter

σ

can be estimated by the median absolute deviations of the residuals or as a solution to, in the unweighted case,

\sum_{i = 1}^{n} χ (e_{i} / \hat{σ}) = (n - k) β,

where

χ

is a suitable function and

β

is a constant chosen to make the estimate unbiased.

χ

is often chosen to be

ψ^{2} / 2

where

ψ

is given in (9). Another form of robust regression is to minimize the sum of absolute deviations, i.e.,

\sum_{i = 1}^{n} | e_{i} | .

For details of robust regression, see Hampel et al. (1986) and Huber (1981).

Robust regressions using least absolute deviations can be computed using functions in Chapter E02.

2.5 Generalized Linear Models

Generalized linear models are an extension of the general linear regression model discussed above. They allow a wide range of models to be fitted. These include certain nonlinear regression models, logistic and probit regression models for binary data, and log-linear models for contingency tables. A generalized linear model consists of three basic components:

(a)A suitable distribution for the dependent variable $Y$ . The following distributions are common:
1. (i)Normal
2. (ii)binomial
3. (iii)Poisson
4. (iv)gamma
In addition to the obvious uses of models with these distributions it should be noted that the Poisson distribution can be used in the analysis of contingency tables while the gamma distribution can be used to model variance components. The effect of the choice of the distribution is to define the relationship between the expected value of $Y$ , $E (Y) = μ$ , and its variance and so a generalized linear model with one of the above distributions may be used in a wider context when that relationship holds.
(b)A linear model $η = \sum β_{j} x_{j}$ , $η$ is known as a linear predictor.
(c)A link function $g (\cdot)$ between the expected value of $Y$ and the linear predictor, $g (μ) = η$ . The following link functions are available:
For the binomial distribution, observing $y$ out of $t$ :
1. (i)logistic link: $η = \log (\frac{μ}{t - μ})$ ;
2. (ii)probit link: $η = Φ^{- 1} (\frac{μ}{t})$ ;
3. (iii)complementary log-log: $η = \log (- \log (1 - \frac{μ}{t}))$ .
For the Normal, Poisson, and gamma distributions:
1. (i)exponent link: $η = μ^{a}$ , for a constant $a$ ;
2. (ii)identity link: $η = μ$ ;
3. (iii)log link: $η = \log μ$ ;
4. (iv)square root link: $η = \sqrt{μ}$ ;
5. (v)reciprocal link: $η = \frac{1}{μ}$ .
For each distribution there is a canonical link. For the canonical link there exist sufficient statistics for the parameters. The canonical links are:
1. (i)Normal – identity;
2. (ii)binomial – logistic;
3. (iii)Poisson – logarithmic;
4. (iv)gamma – reciprocal.
For the general linear regression model described above the three components are:
1. (i)Distribution – Normal;
2. (ii)Linear model – $\sum β_{j} x_{j}$ ;
3. (iii)Link – identity.
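The three binomial link functions listed in (c) can be written down directly. The sketch below is illustrative only: the function names are hypothetical, the standard Normal quantile $Φ^{- 1}$ is taken from Python's `statistics.NormalDist`, and no library function is called.

```python
import math
from statistics import NormalDist


def logistic_link(mu, t):
    """eta = log(mu / (t - mu)) for a binomial mean mu out of t trials."""
    return math.log(mu / (t - mu))


def probit_link(mu, t):
    """eta = Phi^{-1}(mu / t), with Phi the standard Normal distribution function."""
    return NormalDist().inv_cdf(mu / t)


def cloglog_link(mu, t):
    """eta = log(-log(1 - mu / t)), the complementary log-log link."""
    return math.log(-math.log(1 - mu / t))


def inverse_logistic_link(eta, t):
    """Recover mu from eta for the logistic link."""
    return t / (1 + math.exp(-eta))
```

For the logistic and probit links a mean of $μ = t / 2$ maps to $η = 0$, whereas the complementary log-log link is not symmetric about $μ = t / 2$.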

The model is fitted by maximum likelihood; this is equivalent to least squares in the case of the Normal distribution. The residual sum of squares used in regression models is generalized to the concept of deviance. The deviance is, up to a scale factor, minus twice the logarithm of the ratio of the likelihood of the model to that of the full model in which ${\hat{μ}}_{i} = y_{i}$, where ${\hat{μ}}_{i}$ is the estimated value of $μ_{i}$. For the Normal distribution the deviance is the residual sum of squares. Except for the case of the Normal distribution with the identity link, the $χ^{2}$ and $F$-tests based on the deviance are only approximate; also, the estimates of the parameters will only be approximately Normally distributed. Thus, only approximate $z$- or $t$-tests may be performed on the parameter values and approximate confidence intervals computed.

The estimates are found by using an iterative weighted least squares procedure. This is equivalent to the Fisher scoring method in which the Hessian matrix used in the Newton–Raphson method is replaced by its expected value. In the case of canonical links, the Fisher scoring method and the Newton–Raphson method are identical. Starting values for the iterative procedure are obtained by replacing the $μ_{i}$ by $y_{i}$ in the appropriate equations.
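The iterative weighted least squares procedure can be sketched for one simple case: a Poisson model with the (canonical) log link, an intercept and a single explanatory variable. This is a hedged illustration, not the library's implementation. The name `fit_poisson_irls`, the fixed iteration count and the requirement that every $y_{i} > 0$ (so that $\log y_{i}$ exists as a starting value) are all assumptions made for the example.

```python
import math


def fit_poisson_irls(x, y, iterations=25):
    """Fit log(mu_i) = b0 + b1 * x_i by iteratively reweighted least squares."""
    # Starting values: replace mu_i by y_i, so eta_i = log(y_i) (needs y_i > 0).
    eta = [math.log(yi) for yi in y]
    b0 = b1 = 0.0
    for _ in range(iterations):
        mu = [math.exp(e) for e in eta]
        # Working response and weights for the log link with Poisson variance.
        z = [e + (yi - m) / m for e, yi, m in zip(eta, y, mu)]
        w = mu
        # Weighted normal equations for the two parameters (a 2 x 2 system).
        sw = sum(w)
        swx = sum(wi * xi for wi, xi in zip(w, x))
        swxx = sum(wi * xi * xi for wi, xi in zip(w, x))
        swz = sum(wi * zi for wi, zi in zip(w, z))
        swxz = sum(wi * xi * zi for wi, xi, zi in zip(w, x, z))
        det = sw * swxx - swx * swx
        b0 = (swxx * swz - swx * swxz) / det
        b1 = (sw * swxz - swx * swz) / det
        eta = [b0 + b1 * xi for xi in x]
    return b0, b1


x = [0.0, 1.0, 2.0, 3.0, 4.0]
y = [1.0, 2.0, 4.0, 7.0, 12.0]
b0, b1 = fit_poisson_irls(x, y)
fitted = [math.exp(b0 + b1 * xi) for xi in x]
```

At convergence the likelihood equations $\sum (y_{i} - {\hat{μ}}_{i}) = 0$ and $\sum x_{i} (y_{i} - {\hat{μ}}_{i}) = 0$ hold, which gives a simple check on the fit. A practical implementation would test for convergence rather than run a fixed number of iterations.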

2.6 Linear Mixed Effects Regression

In a standard linear model, the independent (or explanatory) variables are assumed to take the same set of values for all units in the population of interest. This type of variable is called fixed. In contrast, an independent variable that fluctuates over the different units is said to be random. Modelling a variable as fixed allows conclusions to be drawn only about the particular set of values observed. Modelling a variable as random allows the results to be generalized to the different levels that may have been observed. In general, if the effects of the levels of a variable are thought of as being drawn from a probability distribution of such effects then the variable is random. If the levels are not a sample of possible levels then the variable is fixed. In practice many qualitative variables can be considered as having fixed effects and most blocking, sampling design, control and repeated measures as having random effects.

In a general linear regression model, defined by

y = X β + ε

where	$y$ is a vector of $n$ observations on the dependent variable,
	$X$ is an $n \times p$ design matrix of independent variables,
	$β$ is a vector of $p$ unknown parameters,
and	$ε$ is a vector of $n$ , independent and identically distributed, unknown errors, with $ε ~ N (0, σ^{2})$ ,

there are $p$ fixed effects (the $β$) and a single random effect (the error term $ε$).

An extension to the general linear regression model that allows for additional random effects is the linear mixed effects regression model (sometimes called the variance components model). One parameterisation of a linear mixed effects model is

y = X β + Z ν + ε

where	$y$ is a vector of $n$ observations on the dependent variable,
	$X$ is an $n \times p$ design matrix of fixed independent variables,
	$β$ is a vector of $p$ unknown fixed effects,
	$Z$ is an $n \times q$ design matrix of random independent variables,
	$ν$ is a vector of length $q$ of unknown random effects,
	$ε$ is a vector of length $n$ of unknown random errors,

and $ν$ and $ε$ are normally distributed with expectation zero and variance / covariance matrix defined by

Var \begin{pmatrix} ν \\ ε \end{pmatrix} = \begin{pmatrix} G & 0 \\ 0 & R \end{pmatrix} .

The functions currently available in this chapter are restricted to cases where $R = σ_{R}^{2} I$, where $I$ is the $n \times n$ identity matrix and $G$ is a diagonal matrix. Given this restriction the random variables, $Z$, can be subdivided into $g \leq q$ groups containing one or more variables. The variables in the $i$th group are identically distributed with expectation zero and variance $σ_{i}^{2}$. The model, therefore, contains three sets of unknowns, the fixed effects, $β$, the random effects, $ν$, and a vector of $g + 1$ variance components, $γ$, with $γ = \{σ_{1}^{2}, σ_{2}^{2}, \dots, σ_{g - 1}^{2}, σ_{g}^{2}, σ_{R}^{2}\}$. Rather than work directly with $γ$ and the full likelihood function, $γ$ is replaced by $γ^{*} = \{σ_{1}^{2} / σ_{R}^{2}, σ_{2}^{2} / σ_{R}^{2}, \dots, σ_{g - 1}^{2} / σ_{R}^{2}, σ_{g}^{2} / σ_{R}^{2}, 1\}$ and the profiled likelihood function is used instead.

The model parameters are estimated using an iterative method based on maximizing either the restricted (profiled) likelihood function or the (profiled) likelihood function. Fitting the model via restricted maximum likelihood involves maximizing $l_{R}$, or equivalently minimizing

−2 l_{R} = \log (| V |) + (n - p) \log (r^{T} V^{- 1} r) + \log | X^{T} V^{- 1} X | + (n - p) (1 + \log (2 π / (n - p))) + (n - p) .

Fitting the model via maximum likelihood involves maximizing $l$, or equivalently minimizing

−2 l = \log (| V |) + n \log (r^{T} V^{- 1} r) + n \log (2 π / n) + n .

In both cases

V = Z G Z^{T} + R, r = y - X b and b = {(X^{T} V^{- 1} X)}^{- 1} X^{T} V^{- 1} y .

Once the final estimates for $γ^{*}$ have been obtained, the value of $σ_{R}^{2}$ is given by

σ_{R}^{2} = (r^{T} V^{- 1} r) / (n - p) .

Case weights, $W_{c}$, can be incorporated into the model by replacing $X^{T} X$ and $Z^{T} Z$ with $X^{T} W_{c} X$ and $Z^{T} W_{c} Z$ respectively, for a diagonal weight matrix $W_{c}$.

2.7 Quantile Regression

Quantile regression is related to least squares regression in that both are interested in studying the relationship between a response variable and one or more independent or explanatory variables. However, whereas least squares regression is concerned with modelling the conditional mean of the dependent variable, quantile regression models the conditional $τ$th quantile of the dependent variable, for some value of $τ \in (0, 1)$. So, for example, $τ = 0.5$ would be the median.

Throughout this section we will be making use of the following definitions:

(a)If $Z$ is a real valued random variable with distribution function $F$ and density function $f$ , such that

$F (α) = P (Z \leq α) = \int_{- \infty}^{α} f (z) dz$

then the $τ$ th quantile, $α$ , can be defined as

$α = F^{- 1} (τ) = \inf {z : F (z) \geq τ}, τ \in (0, 1) .$
(b) $I (L)$ denotes an indicator function taking the value $1$ if the logical expression $L$ is true and 0 otherwise, e.g., $I (z < 0) = 1$ if $z < 0$ and $0$ if $z \geq 0$ .
(c) $y$ denotes a vector of $n$ observations on the dependent (or response) variable, $y = {y_{i} : i = 1, 2, \dots, n}$ .
(d) $X$ denotes an $n \times p$ matrix of explanatory or independent variables, often referred to as the design matrix, and $x_{i}$ denotes a column vector of length $p$ which holds the $i$ th row of $X$ .

2.7.1 Finding a sample quantile as an optimization problem

Consider the piecewise linear loss function

ρ_{τ} (z) = z (τ - I (z < 0)) .

The minimum of the expectation

E (ρ_{τ} (z - α)) = (τ - 1) \int_{- \infty}^{α} (z - α) f (z) dz + τ \int_{α}^{\infty} (z - α) f (z) dz

can be obtained by using the integral rule of Leibniz to differentiate with respect to $α$ and then setting the result to zero, giving

(1 - τ) \int_{- \infty}^{α} f (z) dz - τ \int_{α}^{\infty} f (z) dz = F (α) - τ = 0

hence

α = F^{- 1} (τ)

when the solution is unique. If the solution is not unique then there exists a range of quantiles, each of which is equally valid. Taking the smallest value of such a range ensures that the empirical quantile function is left-continuous. Therefore, obtaining the $τ$th quantile of a distribution $F$ can be achieved by minimizing the expected value of the loss function $ρ_{τ}$.

This idea of obtaining the quantile by solving an optimization problem can be extended to finding the $τ$th sample quantile. Given a vector of $n$ observed values, $y$, from some distribution the empirical distribution function,

F_{n} (α) = n^{- 1} \sum_{i = 1}^{n} I (y_{i} \leq α)

provides an estimate of the unknown distribution function $F$ giving an expected loss of

E (ρ_{τ} (y - α)) = n^{- 1} \sum_{i = 1}^{n} ρ_{τ} (y_{i} - α)

and, therefore, the problem of finding the $τ$th sample quantile, $\hat{α} (τ)$, can be expressed as finding the solution to the problem

\underset{α \in ℝ}{minimize} \sum_{i = 1}^{n} ρ_{τ} (y_{i} - α)

effectively replacing the operation of sorting, usually required when obtaining a sample quantile, with an optimization.
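The equivalence between the sorting definition and the optimization problem can be checked numerically on a small sample. The sketch below is an illustration under stated assumptions: the function names are hypothetical, the search is restricted to the observed values (the objective is piecewise linear and convex, so a minimizer always occurs at a data point), and ties are broken by taking the smallest minimizer. The example data are chosen so that the floating-point sums are exact.

```python
def check_loss(z, tau):
    """Piecewise linear loss rho_tau(z) = z * (tau - I(z < 0))."""
    return z * (tau - (1.0 if z < 0 else 0.0))


def quantile_by_optimization(y, tau):
    """Smallest observed value minimizing the summed check loss."""
    candidates = sorted(set(y))
    return min(candidates, key=lambda a: sum(check_loss(yi - a, tau) for yi in y))


def quantile_by_definition(y, tau):
    """inf{z : F_n(z) >= tau}, using the empirical distribution function."""
    n = len(y)
    for a in sorted(y):
        if sum(1 for yi in y if yi <= a) / n >= tau:
            return a


y = [3, 1, 4, 1, 5, 9, 2, 6]
median_opt = quantile_by_optimization(y, 0.5)
median_def = quantile_by_definition(y, 0.5)
```

With $n = 8$ and $τ = 0.5$ every value between the fourth and fifth order statistics minimizes the loss; taking the smallest minimizer reproduces the left-continuous definition.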

2.7.2 From least squares to quantile regression

Given the vector $y$ it is a well known result that the sample mean, $\bar{y}$, solves the least squares problem

\underset{μ \in ℝ}{minimize} \sum_{i = 1}^{n} {(y_{i} - μ)}^{2} .

This result leads to least squares regression where, given design matrix $X$ and defining the conditional mean of $y$ as $μ (X) = X β$, an estimate of $β$ is obtained from the solution to

\underset{β \in ℝ^{p}}{minimize} \sum_{i = 1}^{n} {(y_{i} - x_{i}^{T} β)}^{2} .

Quantile regression can be derived in a similar manner by specifying the $τ$th conditional quantile as $Q_{y} (τ | X) = X β (τ)$ and estimating $β (τ)$ as the solution to

\underset{β \in ℝ^{p}}{minimize} \sum_{i = 1}^{n} ρ_{τ} (y_{i} - x_{i}^{T} β) .

(10)

2.7.3 Quantile regression as a linear programming problem

By introducing $2 n$ slack variables, $u = \{u_{i} : i = 1, 2, \dots, n\}$ and $v = \{v_{i} : i = 1, 2, \dots, n\}$, the quantile regression minimization problem, (10), can be expressed as a Linear Programming (LP) problem, with primal and associated dual formulations:

(a)Primal form

\underset{(u, v, β) \in ℝ_{+}^{n} \times ℝ_{+}^{n} \times ℝ^{p}}{minimize} \; τ e^{T} u + (1 - τ) e^{T} v \quad \text{subject to} \quad y = X β + u - v

(11)

where $e$ is a vector of length $n$, where each element is $1$. If $r_{i}$ denotes the $i$th residual, $r_{i} = y_{i} - x_{i}^{T} β$, then the slack variables, $(u, v)$, can be thought of as corresponding to the absolute value of the positive and negative residuals respectively with

\begin{matrix} u_{i} = \begin{cases} r_{i} & \text{if } r_{i} > 0 \\ 0 & \text{otherwise} \end{cases} & v_{i} = \begin{cases} - r_{i} & \text{if } r_{i} < 0 \\ 0 & \text{otherwise} \end{cases} \end{matrix}

(b)Dual form
The dual formulation of (11) is given by

$\underset{d}{maximize} y^{T} d subject to X^{T} d = 0, d \in {[τ - 1, τ]}^{n}$

which, on setting $a = d + (1 - τ) e$ , is equivalent to

$\underset{a}{maximize} y^{T} a subject to X^{T} a = (1 - τ) X^{T} e, a \in {[0, 1]}^{n} .$ (12)

(c)Canonical form

Linear programming problems are often described in a standard way, called the canonical form. The canonical form of an LP problem is

\underset{z}{minimize} \; c^{T} z \quad \text{subject to} \quad l_{l} \leq \begin{pmatrix} z \\ A z \end{pmatrix} \leq l_{u} .

Letting $0_{p}$ denote a vector of $p$ zeros, $\pm \infty_{p}$ denote a vector of $p$ arbitrarily small or large values, $I_{n \times n}$ denote the $n \times n$ identity matrix, $c = \{a, b\}$ denote the row vector constructed by concatenating the elements of vector $b$ to the elements of vector $a$ and $C = [A, B]$ denote the matrix constructed by concatenating the columns of matrix $B$ onto the columns of matrix $A$, then setting

\begin{matrix} c^{T} = \{0_{p}, τ e^{T}, (1 - τ) e^{T}\} & z^{T} = \{β^{T}, u^{T}, v^{T}\} \\ A = [X, I_{n \times n}, - I_{n \times n}] & b = y \\ l_{u} = \{+ \infty_{p}, \infty_{n}, \infty_{n}, y\} & l_{l} = \{- \infty_{p}, 0_{n}, 0_{n}, y\} \end{matrix}

gives the quantile regression LP problem as described in (11).
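The role of the slack variables in the primal form (11) can be made concrete with a few lines of Python. This is a sketch only: it does not solve the LP, it merely takes a given $β$, splits the residuals into $u$ and $v$, and confirms that the constraint $y = X β + u - v$ holds and that the LP objective equals the quantile regression objective (10). All names are hypothetical.

```python
def split_residuals(y, X, beta):
    """Return slack vectors (u, v) holding the positive and negative residual parts."""
    u, v = [], []
    for yi, xi in zip(y, X):
        r = yi - sum(xij * bj for xij, bj in zip(xi, beta))
        u.append(r if r > 0 else 0.0)
        v.append(-r if r < 0 else 0.0)
    return u, v


def lp_objective(u, v, tau):
    """tau * e'u + (1 - tau) * e'v, the objective of the primal form."""
    return tau * sum(u) + (1 - tau) * sum(v)


X = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]  # intercept and one variable
y = [1.0, 3.0, 2.0, 6.0]
beta = [1.0, 1.0]  # an arbitrary trial value, not the optimum
u, v = split_residuals(y, X, beta)
objective = lp_objective(u, v, 0.25)
```

Here the residuals are $0, 1, - 1, 2$, so $u = (0, 1, 0, 2)$ and $v = (0, 0, 1, 0)$. At most one of $u_{i}$ and $v_{i}$ is nonzero for each observation.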

Once expressed as an LP problem the parameter estimates $\hat{β} (τ)$ can be obtained in a number of ways, for example via the inertia-controlling method of Gill and Murray (1978) (see e04mfc), the simplex method or an interior point method as used by g02qfc and g02qgc.

2.7.4 Estimation of the covariance matrix

Koenker (2005) shows that the limiting covariance matrix of $\sqrt{n} (\hat{β} (τ) - β (τ))$ is of the form of a Huber Sandwich. Therefore, under the assumption of Normally distributed errors

\sqrt{n} (\hat{β} (τ) - β (τ)) \sim N (0, τ (1 - τ) {H_{n} (τ)}^{- 1} J_{n} {H_{n} (τ)}^{- 1})

(13)

where

\begin{matrix} J_{n} = n^{- 1} \sum_{i = 1}^{n} x_{i} x_{i}^{T} \\ H_{n} (τ) = \lim_{n \to \infty} n^{- 1} \sum_{i = 1}^{n} x_{i} x_{i}^{T} f_{i} (Q_{y_{i}} (τ | x_{i})) \end{matrix}

and $f_{i} (Q_{y_{i}} (τ | x_{i}))$ denotes the conditional density of the response $y$ evaluated at the $τ$th conditional quantile.

More generally, the asymptotic covariance matrix for $\hat{β} (τ_{1}), \hat{β} (τ_{2}), \dots, \hat{β} (τ_{n})$ has blocks defined by

cov (\sqrt{n} (\hat{β} (τ_{i}) - β (τ_{i})), \sqrt{n} (\hat{β} (τ_{j}) - β (τ_{j}))) = (\min (τ_{i}, τ_{j}) - τ_{i} τ_{j}) {H_{n} (τ_{i})}^{- 1} J_{n} {H_{n} (τ_{j})}^{- 1}

(14)

Under the assumption of independent, identically distributed (iid) errors, (13) simplifies to

\sqrt{n} (\hat{β} (τ) - β (τ)) \sim N (0, τ (1 - τ) {s (τ)}^{2} {(X^{T} X)}^{- 1})

where $s (τ)$ is the sparsity function, given by

s (τ) = \frac{1}{f (F^{- 1} (τ))} .

A similar simplification occurs with (14).

In cases where the assumption of iid errors does not hold, Powell (1991) suggests using a kernel estimator of the form

{\hat{H}}_{n} (τ) = {(n c_{n})}^{- 1} \sum_{i = 1}^{n} K (\frac{y_{i} - x_{i}^{T} \hat{β} (τ)}{c_{n}}) x_{i} x_{i}^{T}

for some bandwidth parameter $c_{n}$ satisfying $\lim_{n \to \infty} c_{n} = 0$ and $\lim_{n \to \infty} \sqrt{n} c_{n} = \infty$, and Hendricks and Koenker (1991) suggest a method based on an extension of the idea of sparsity.

Rather than use an asymptotic estimate of the covariance matrix, it is also possible to use bootstrapping. Roughly speaking the original data is resampled and a set of parameter estimates obtained from each new sample. A sample covariance matrix is then constructed from the resulting matrix of parameter estimates.
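The bootstrap procedure can be sketched for the simplest possible model, one containing only an intercept, where the quantile regression estimate reduces to the $τ$th sample quantile and the covariance matrix to a single variance. This is an assumption made to keep the example short; for a general model each resample would draw whole rows $(y_{i}, x_{i})$ and refit the regression. The function names, the number of resamples and the seed are all hypothetical choices.

```python
import math
import random


def sample_quantile(y, tau):
    """tau-th sample quantile: the ceil(n * tau)-th order statistic."""
    ys = sorted(y)
    k = max(1, math.ceil(len(ys) * tau))
    return ys[k - 1]


def bootstrap_variance(y, tau, n_boot=200, seed=0):
    """Sample variance of the estimates obtained from n_boot resamples of y."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(n_boot):
        resample = [rng.choice(y) for _ in y]  # draw n values with replacement
        estimates.append(sample_quantile(resample, tau))
    mean = sum(estimates) / n_boot
    return sum((e - mean) ** 2 for e in estimates) / (n_boot - 1)


y = [3.0, 1.0, 4.0, 1.0, 5.0, 9.0, 2.0, 6.0, 5.0, 3.0]
variance = bootstrap_variance(y, 0.5)
```

Fixing the seed makes the result reproducible; in practice the number of resamples would be chosen with the required accuracy in mind.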

2.8 Latent Variable Methods

Regression by means of projections to latent structures, also known as partial least squares, is a latent variable linear model suited to data for which:

the number of $x$ -variables is high compared to the number of observations;
$x$ -variables and/or $y$ -variables are multicollinear.

Latent variables are linear combinations of $x$-variables that explain variance in $x$ and $y$-variables. These latent variables, known as factors, are extracted iteratively from the data. A choice of the number of factors to include in a model can be made by considering diagnostic statistics such as the variable influence on projections (VIP).

2.9 LARS, LASSO and Forward Stagewise Regression

Least Angle Regression (LARS), Least Absolute Shrinkage and Selection Operator (LASSO) and forward stagewise regression are three closely related regression techniques. Of the three, only LASSO has an easily accessible mathematical description suitable for being summarised here. A full description of all three methods and the relationship between them can be found in Efron et al. (2004) and the references therein.

Given a vector of $n$ observed values, $y = \{y_{i} : i = 1, 2, \dots, n\}$ and an $n \times p$ design matrix $X$, where the $j$th column of $X$, denoted $x_{j}$, is a vector of length $n$ representing the $j$th independent variable $x_{j}$, standardized such that $\sum_{i = 1}^{n} x_{i j} = 0$, and $\sum_{i = 1}^{n} x_{i j}^{2} = 1$ and a set of model parameters $β$ to be estimated from the observed values, the LASSO model of Tibshirani (1996) is given by

\underset{α \in ℝ, β \in ℝ^{p}}{minimize} \; {‖ y - α - X β ‖}^{2} \quad \text{subject to} \quad {‖ β ‖}_{1} \leq t

(15)

for a given value of $t$, where $α = \bar{y} = n^{−1} \sum_{i = 1}^{n} y_{i}$. The positive LASSO model is the same as the standard LASSO model, given above, with the added constraint that

β_{j} \geq 0, j = 1, 2, \dots, p .

Rather than solve (15) for a given value of $t$, Efron et al. (2004) defined an algorithm that returns a full solution path for all possible values of $t$. It turns out that this path is piecewise linear with a finite number of pieces, denoted $K$, corresponding to $K$ sets of parameter estimates.
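The piecewise linear nature of the solution path is easiest to see in the special case of a single standardized explanatory variable ($p = 1$), where the constrained problem (15) has a closed-form solution: the least squares coefficient clipped to the interval $[- t, t]$. The sketch below covers only this special case and is not the LARS algorithm of Efron et al. (2004); the function names are hypothetical.

```python
import math


def standardize(x):
    """Centre x and scale it so that sum(x) = 0 and sum(x^2) = 1."""
    mean = sum(x) / len(x)
    d = [xi - mean for xi in x]
    scale = math.sqrt(sum(di * di for di in d))
    return [di / scale for di in d]


def lasso_single_predictor(x, y, t):
    """LASSO estimate for one predictor subject to |beta| <= t."""
    xs = standardize(x)
    ybar = sum(y) / len(y)
    # Unconstrained least squares coefficient for the standardized variable.
    b_ols = sum(xi * (yi - ybar) for xi, yi in zip(xs, y))
    # The objective is a convex quadratic in beta, so the constrained
    # minimizer is the unconstrained one clipped to [-t, t].
    return max(-t, min(t, b_ols))


x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 6.0]
path = [lasso_single_predictor(x, y, t) for t in (0.0, 1.0, 2.0, 3.0, 10.0)]
```

As $t$ increases from zero the estimate grows linearly with $t$ until it reaches the least squares value, after which it stays constant, giving a path with two linear pieces.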

3 Recommendations on Choice and Use of Available Functions

3.1 Correlation

3.1.1 Product-moment correlation

Let ${SS}_{x}$ be the sum of squares of deviations from the mean, $\bar{x}$, for the variable $x$ for a sample of size $n$, i.e.,

{SS}_{x} = \sum_{i = 1}^{n} {(x_{i} - \bar{x})}^{2}

and let ${SC}_{x y}$ be the cross-products of deviations from the means, $\bar{x}$ and $\bar{y}$, for the variables $x$ and $y$ for a sample of size $n$, i.e.,

{SC}_{x y} = \sum_{i = 1}^{n} (x_{i} - \bar{x}) (y_{i} - \bar{y}) .

Then the sample covariance of $x$ and $y$ is

cov (x, y) = \frac{{SC}_{x y}}{(n - 1)}

and the product-moment correlation coefficient is

r = \frac{cov (x, y)}{\sqrt{var (x) var (y)}} = \frac{{SC}_{x y}}{\sqrt{{SS}_{x} {SS}_{y}}} .
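The definitions above translate directly into code. The following Python sketch computes ${SS}_{x}$, ${SS}_{y}$, ${SC}_{x y}$, the sample covariance and $r$ for two unweighted variables; the function name `product_moment` is hypothetical and no library function is called.

```python
import math


def product_moment(x, y):
    """Return (cov(x, y), r) from the sums of squares and cross-products."""
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    ss_x = sum((xi - xbar) ** 2 for xi in x)
    ss_y = sum((yi - ybar) ** 2 for yi in y)
    sc_xy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    cov = sc_xy / (n - 1)
    r = sc_xy / math.sqrt(ss_x * ss_y)
    return cov, r


cov, r = product_moment([1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 6.0, 8.0])
```

For this exactly linear example ${SS}_{x} = 5$, ${SS}_{y} = 20$ and ${SC}_{x y} = 10$, so $r = 1$ and the covariance is $10 / 3$.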

g02btc updates the sample sums of squares and cross-products of deviations from the means by the addition/deletion of a (weighted) observation.
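The idea of updating sums of squares and cross-products one observation at a time can be illustrated with a standard one-pass recurrence. This chapter does not describe the algorithm g02btc uses, so the sketch below is a generic textbook update for adding an unweighted observation, with hypothetical names; it is not the library's method and omits deletion and weights.

```python
def add_observation(state, x, y):
    """Update (n, mean_x, mean_y, SS_x, SS_y, SC_xy) with one new pair (x, y)."""
    n, mx, my, ss_x, ss_y, sc_xy = state
    n += 1
    dx = x - mx  # deviations from the old means
    dy = y - my
    mx += dx / n  # new means
    my += dy / n
    ss_x += dx * (x - mx)  # old deviation times new deviation
    ss_y += dy * (y - my)
    sc_xy += dx * (y - my)
    return (n, mx, my, ss_x, ss_y, sc_xy)


state = (0, 0.0, 0.0, 0.0, 0.0, 0.0)
for xi, yi in zip([1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 6.0, 8.0]):
    state = add_observation(state, xi, yi)
```

After all four observations have been added the state holds the same means, sums of squares and cross-products as a two-pass computation over the full data.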

g02buc computes the sample sums of squares and cross-products of deviations from the means (optionally weighted). The output from multiple calls to g02buc can be combined via a call to g02bzc, allowing large datasets to be summarised across multiple processing units.

g02bwc computes the product-moment correlation coefficients from the sample sums of squares and cross-products of deviations from the means.

The three functions compute only the upper triangle of the correlation matrix which is stored in a one-dimensional array in packed form.

g02bxc computes both the (optionally weighted) covariance matrix and the (optionally weighted) correlation matrix. These are returned in two-dimensional arrays. (Note that g02btc and g02buc can be used to compute the sums of squares from zero.)

3.1.2 Product-moment correlation with missing values

If there are missing values then g02buc and g02bxc, as described above, will allow casewise deletion by giving the observation zero weight (compared with unit weight for an otherwise unweighted computation).

3.1.3 Nonparametric correlation

g02brc computes Kendall and/or Spearman nonparametric rank correlation coefficients. The function allows for a subset of variables to be selected and for observations to be excluded from the calculations if, for example, they contain missing values.

3.1.4 Partial correlation

g02byc computes a matrix of partial correlation coefficients from the correlation coefficients or variance-covariance matrix returned by g02bxc.

3.1.5 Robust correlation

g02hlc and g02hmc compute robust estimates of the variance-covariance matrix by solving the equations

\frac{1}{n} \sum_{i = 1}^{n} w ({‖ z_{i} ‖}_{2}) z_{i} = 0

and

\frac{1}{n} \sum_{i = 1}^{n} u ({‖ z_{i} ‖}_{2}) z_{i} z_{i}^{T} - v ({‖ z_{i} ‖}_{2}) I = 0,

as described in Section 2.1.4 for user-supplied functions $w$ and $u$. Two options are available for $v$, either $v (t) = 1$ for all $t$ or $v (t) = u (t)$.

g02hmc requires only the functions $w$ and $u$ to be supplied while g02hlc also requires their derivatives. In general, g02hlc will be considerably faster than g02hmc and should be used if derivatives are available.

g02hkc computes a robust variance-covariance matrix for the following functions:

u (t) = \begin{cases} a_{u} / t^{2} & \text{if } t < a_{u}^{2} \\ 1 & \text{if } a_{u}^{2} \leq t \leq b_{u}^{2} \\ b_{u} / t^{2} & \text{if } t > b_{u}^{2} \end{cases}

and

w (t) = \begin{cases} 1 & \text{if } t \leq c_{w} \\ c_{w} / t & \text{if } t > c_{w} \end{cases}

for constants $a_{u}$, $b_{u}$ and $c_{w}$.

These functions solve a minimax space problem considered by Huber (1981). The values of $a_{u}$, $b_{u}$ and $c_{w}$ are calculated from the fraction of gross errors; see Hampel et al. (1986) and Huber (1981).

To compute a correlation matrix from the variance-covariance matrix, g02bwc may be used.

3.1.6 Nearest correlation matrix

A number of functions are provided to calculate a nearest correlation matrix. The choice of function will depend on what definition of ‘nearest’ is required and whether there is any particular structure desired in the resulting correlation matrix.

g02aac computes the nearest correlation matrix in the Frobenius norm, using the method of Qi and Sun (2006), modified by Borsdorf and Higham (2010). An extension to this function is g02abc which allows a row and column weighted Frobenius norm to be used as well as a bound on the minimum eigenvalue of the resulting correlation matrix to be specified. If a low-rank correlation matrix is required to constrain the number of independent random variables, g02akc can be used to find a nearest correlation matrix of maximum prescribed rank.

If fixing of individual elements is needed then g02asc can be used, which implements the alternating projection algorithm of Higham and Strabić (2016). If only elementwise weighting is required g02ajc can be used. Both of these functions again compute the nearest correlation matrix in the Frobenius norm and provide a bound on the eigenvalues. It should be noted that these functions can be computationally expensive. If computational time needs to be minimized, you should consider a shrinking algorithm.

g02anc uses the shrinking method of Higham et al. (2014) to fix the leading block of the input. However, it does not compute the nearest correlation matrix in the Frobenius norm but finds the smallest relative perturbation to the unfixed elements to give a positive definite output. This functionality is extended in g02apc which allows an arbitrary target matrix to be specified via elementwise weights.

g02aec computes the factor loading matrix, allowing a correlation matrix with a $k$-factor structure to be computed.

See also the NAG optimization modelling suite in Chapter E04 which can be used to solve a variety of nearest correlation matrix problems. See for example Section 10 in e04rfc.

3.2 Regression

3.2.1 Simple linear regression

Two functions are provided for simple linear regression. The function g02cac calculates the parameter estimates for a simple linear regression with or without a constant term. The function g02cbc calculates fitted values, residuals and confidence intervals for both the fitted line and individual observations. This function produces the information required for various regression plots.

3.2.2 Ridge regression

g02kac calculates a ridge regression, optimizing the ridge parameter according to one of four prediction error criteria.

g02kbc calculates ridge regressions for a given set of ridge parameters.

3.2.3 Polynomial regression and nonlinear regression

No functions are currently provided in this chapter for polynomial regression. If you wish to perform polynomial regressions you have three alternatives: you can use the multiple linear regression function, g02dac, with a set of independent variables which are in fact simply the same single variable raised to different powers; you can use the function g04eac to compute orthogonal polynomials which can then be used with g02dac; or you can use the functions in Chapter E02 (Curve and Surface Fitting) which fit polynomials to sets of data points using the techniques of orthogonal polynomials. This last course is to be preferred, since it is more efficient and liable to be more accurate, but in some cases more statistical information may be required than is provided by those functions, and it may be necessary to use the functions of this chapter.

More general nonlinear regression models may be fitted using the optimization functions in Chapter E04, which contains functions to minimize the function $\sum_{i = 1}^{n} e_{i}^{2}$ where the regression parameters are the variables of the minimization problem.

3.2.4 $l_{\infty}$ norm (the absolutely largest residual) regression

No functions are currently provided in this chapter for regression using the $l_{\infty}$ norm (the absolutely largest residual); however, such a model can be fit using e02gcc, which calculates an $l_{\infty}$ solution to the overdetermined system of equations

X β = y .

That is to say, it calculates a vector $β$, with $p$ elements, which minimizes the $l_{\infty}$ norm of the residuals (the absolutely largest residual)

r (β) = \max_{1 \leq i \leq n} | r_{i} |

where the residuals are given by $r = y - X β$.

3.2.5 Multiple linear regression – general linear model

g02dac fits a general linear regression model using the $Q R$ method and an SVD if the model is not of full rank. The results returned include: residual sum of squares, parameter estimates, their standard errors and variance-covariance matrix, residuals and leverages. There are also several functions to modify the model fitted by g02dac and to aid in the interpretation of the model.

g02dcc adds or deletes an observation from the model.

g02ddc computes the parameter estimates, and their standard errors and variance-covariance matrix for a model that is modified by g02dcc, g02dec or g02dfc.

g02dec adds a new variable to a model.

g02dfc drops a variable from a model.

g02dgc fits the regression to a new dependent variable, i.e., keeping the same independent variables.

g02dkc calculates the estimates of the parameters for a given set of constraints, (e.g., parameters for the levels of a factor sum to zero) for a model which is not of full rank and the SVD has been used.

g02dnc calculates the estimate of an estimable function and its standard error.

Note: g02dec also allows you to initialize a model building process and then to build up the model by adding variables one at a time.

3.2.6 Selecting regression models

To aid the selection of a regression model the following functions are available.

g02eac computes the residual sums of squares for all possible regressions for a given set of independent variables. The function allows some variables to be forced into all regressions.

g02ecc computes the values of $R^{2}$ and $C_{p}$ from the residual sums of squares as provided by g02eac.

g02eec enables you to fit a model by forward selection. You may call g02eec a number of times. At each call the function will calculate the changes in the residual sum of squares from adding each of the variables not already included in the model, select the variable which gives the largest change and then if the change in residual sum of squares meets the given criterion will add it to the model.

g02efc uses a full stepwise selection to choose a subset of the explanatory variables. The method repeatedly applies a forward selection step followed by a backward elimination step until neither step updates the current model.

3.2.7 Residuals

g02fac computes the following standardized residuals and measures of influence for the residuals and leverages produced by g02dac:

(i)Internally studentized residual;
(ii)Externally studentized residual;
(iii)Cook's $D$ statistic;
(iv)Atkinson's $T$ statistic.

g02fcc computes the Durbin–Watson test statistic and bounds for its significance to test for serial correlation in the errors, $e_{i}$.
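For reference, the Durbin–Watson statistic is conventionally defined as $d = \sum_{i = 2}^{n} {(e_{i} - e_{i - 1})}^{2} / \sum_{i = 1}^{n} e_{i}^{2}$. The short Python sketch below evaluates this standard definition for a vector of residuals; the function name is hypothetical, and the sketch does not compute the significance bounds that g02fcc returns.

```python
def durbin_watson(e):
    """Sum of squared successive differences divided by the sum of squares."""
    numerator = sum((e[i] - e[i - 1]) ** 2 for i in range(1, len(e)))
    denominator = sum(ei ** 2 for ei in e)
    return numerator / denominator


alternating = durbin_watson([1.0, -1.0, 1.0, -1.0])  # strong negative serial correlation
constant = durbin_watson([1.0, 1.0, 1.0, 1.0])  # strong positive serial correlation
```

The statistic lies between 0 and 4; values near 2 suggest little serial correlation, values towards 0 suggest positive and values towards 4 negative serial correlation.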

3.2.8 Robust regression

For robust regression using $M$-estimates instead of least squares the function g02hac will generally be suitable. g02hac provides a choice of four $ψ$-functions (Huber's, Hampel's, Andrews' and Tukey's) plus two different weighting methods and the option not to use weights. If other weights or different $ψ$-functions are needed the function g02hdc may be used. g02hdc requires you to supply weights, if required, and also functions to calculate the $ψ$-function and, optionally, the $χ$-function. g02hbc can be used in calculating suitable weights. The function g02hfc can be used after a call to g02hdc in order to calculate the variance-covariance estimate of the estimated regression coefficients.

For robust regression, using least absolute deviation, e02gac can be used.

3.2.9 Generalized linear models

There are four functions for fitting generalized linear models. The output includes: the deviance, parameter estimates and their standard errors, fitted values, residuals and leverages.

g02gac Normal distribution.

g02gbc binomial distribution.

g02gcc Poisson distribution.

g02gdc gamma distribution.

While g02gac can be used to fit linear regression models (i.e., by using an identity link) this is not recommended as g02dac will fit these models more efficiently. g02gcc can be used to fit log-linear models to contingency tables.

In addition to the functions to fit the models there is one function to predict from the fitted model and two functions to aid interpretation when the fitted model is not of full rank, i.e., aliasing is present.

g02gpc computes a predicted value and its associated standard error based on a previously fitted generalized linear model.

g02gkc computes parameter estimates for a set of constraints, (e.g., sum of effects for a factor is zero), from the SVD solution provided by the fitting function.

g02gnc calculates an estimate of an estimable function along with its standard error.

3.2.10 Linear mixed effects regression

Fitting a linear mixed effects regression model is split into three stages: model specification, data pre-processing and model fitting.

The model is specified using the modelling language described in Chapter G22. The fixed and random parts of the model are specified via one or more calls to g22yac.

Prior to pre-processing the data, the dataset must first be described via a call to g22ybc. The description of the model and dataset is then passed to g02jfc, along with the raw data, for pre-processing. For large problems it is possible to split the dataset up into smaller subsets of data, pre-processing each one separately and then combining the results using g02jgc.

Finally the model can be fit, either via maximum likelihood (ML) or restricted maximum likelihood (REML) using g02jhc. Labels for the various parameter estimates can be obtained using g22ydc.

The model specification function, g22yac, data description function, g22ybc and model fitting function g02jhc, have a number of optional parameters which can be set via g22zmc and queried using g22znc.

As the estimates of the variance components are found using an iterative procedure initial values must be supplied for each $σ$. In all four functions you can either specify these initial values or allow the function to calculate them from the data using minimum variance quadratic unbiased estimation (MIVQUE0). Setting the maximum number of iterations to zero in any of the functions will return the corresponding likelihood, parameter estimates, and standard errors based on these initial values.

The library contains a number of older functions for fitting linear mixed effects regression models, specifically g02jac, g02jbc, g02jdc and g02jec. Whilst the algorithmic details may vary, the functionality of these functions is identical to that supplied by g02jhc; however, g02jhc allows the model to be specified in a more intuitive way. These older functions have been deprecated and remain in the library only for backwards compatibility. It is, therefore, recommended that you use the newer functions going forward.

3.2.11 Linear quantile regression

Two functions are provided for performing linear quantile regression, g02qfc and g02qgc. Of these, g02qfc provides a simplified interface to g02qgc, where many of the input parameters have been given default values and the amount of output available has been reduced.

Prior to calling g02qgc, the optional parameter arrays must be initialized by calling g02zkc with optstr set to Initialize. Once these arrays have been initialized, g02zlc can be called to query the value of an optional parameter.

3.2.12 Partial Least Squares (PLS)

g02lac calculates a nonlinear, iterative PLS by using singular value decomposition.

g02lbc calculates a nonlinear, iterative PLS by using Wold's method.

g02lcc calculates parameter estimates for a given number of PLS factors.

g02ldc calculates predictions given a PLS model.

3.2.13 LARS, LASSO and Forward Stagewise Regression

Two functions for fitting a LARS, LASSO or forward stagewise regression are supplied: g02mac and g02mbc. The difference between the two functions is in the way that the data, $X$ and $y$, are supplied. The first function, g02mac, takes $X$ and $y$ directly, whereas g02mbc takes the data in the form of the cross-products $X^{T} X$, $X^{T} y$ and $y^{T} y$. In most situations g02mac will be the recommended function, as the full data tends to be available. However, when there is a large number of observations (i.e., $n$ is large) it might be preferable to split the data into smaller blocks and process one block at a time. In such situations g02buc and g02bzc can be used to construct the required cross-products and g02mbc called to fit the required model.
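The idea of accumulating cross-products one block at a time can be illustrated with a minimal sketch in plain C. It does not call g02buc or g02bzc, and the `CrossProd` type and `crossprod_update` function are hypothetical names introduced only for this illustration. The sketch forms raw (uncentred) sums in full row-major storage; any centring, scaling or storage layout that the library functions expect is not handled here and should be taken from their own documents.

```c
#include <assert.h>
#include <stddef.h>

/* Running cross-product sums for p predictors (hypothetical helper type).
   The caller must zero xtx, xty and yty before the first update. */
typedef struct {
    size_t p;     /* number of columns of X */
    double *xtx;  /* p*p array, row-major, accumulates X^T X */
    double *xty;  /* length-p array, accumulates X^T y */
    double yty;   /* scalar, accumulates y^T y */
} CrossProd;

/* Add one block of nb observations to the running sums.
   x is an nb-by-p block stored row-major; y has length nb. */
void crossprod_update(CrossProd *cp, const double *x, const double *y,
                      size_t nb)
{
    assert(cp != NULL && x != NULL && y != NULL);
    for (size_t i = 0; i < nb; ++i) {
        const double *row = x + i * cp->p;
        for (size_t j = 0; j < cp->p; ++j) {
            for (size_t k = 0; k < cp->p; ++k)
                cp->xtx[j * cp->p + k] += row[j] * row[k];
            cp->xty[j] += row[j] * y[i];
        }
        cp->yty += y[i] * y[i];
    }
}
```

Because each sum is additive over observations, calling `crossprod_update` once per block gives the same totals as a single pass over the full data, so only one block needs to be held in memory at a time.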

Both g02mac and g02mbc return $K$ sets of parameter estimates which, because the solution path is piecewise linear, define the full LARS, LASSO or forward stagewise regression solution path. However, parameter estimates are sometimes required at points along the solution path that differ from those returned by g02mac and g02mbc, for example when performing a cross-validation. g02mcc will return the parameter estimates in such cases.
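The piecewise linear property means that estimates between two adjacent returned points can be recovered by linear interpolation. The following is a minimal sketch of that idea in plain C, not a description of how g02mcc is implemented. The function name `path_interpolate` and the scalar path parameter `t` (for instance, some monotone measure of position along the path) are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Linearly interpolate p coefficients between two adjacent points on a
   piecewise linear solution path (hypothetical helper).
   b_lo, b_hi : coefficient vectors at path positions t_lo and t_hi
   t          : requested position, with t_lo <= t <= t_hi
   b_out      : receives the interpolated coefficients */
void path_interpolate(const double *b_lo, const double *b_hi, size_t p,
                      double t_lo, double t_hi, double t, double *b_out)
{
    assert(b_lo != NULL && b_hi != NULL && b_out != NULL);
    assert(t_hi > t_lo && t >= t_lo && t <= t_hi);
    double w = (t - t_lo) / (t_hi - t_lo); /* weight on the upper point */
    for (size_t j = 0; j < p; ++j)
        b_out[j] = (1.0 - w) * b_lo[j] + w * b_hi[j];
}
```

In a cross-validation, a routine of this kind would be applied to each fold's path so that all folds are compared at a common set of path positions.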

3.2.14 Design Matrix Construction

Many of the functions in this chapter require a design matrix to be constructed. Rather than constructing this by hand, functions from Chapter G22 may be used.

g22yac can be used to specify a linear model via a formula string. This model specification, along with a description of the dataset returned from g22ybc, can then be passed to g22ycc to produce a design matrix suitable for use with many of the Chapter G02 regression functions. In addition, g22ydc can be used to obtain submodel information or labels for parameter estimates.
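To make concrete what a design matrix contains, the following minimal sketch builds one by hand in plain C for a model with an intercept, one categorical factor and one continuous covariate. It does not use the Chapter G22 functions. The function name `build_design`, the row-major layout and the choice of treatment (dummy) coding with the first level as baseline are all assumptions for illustration; the coding produced by the library functions depends on their own settings.

```c
#include <assert.h>
#include <stddef.h>

/* Fill a row-major design matrix with columns:
     [intercept | dummies for factor levels 2..nlevels | covariate]
   factor[i] takes values 1..nlevels; level 1 is the baseline.
   dm must have room for n * (nlevels + 1) doubles.
   Returns the number of columns. (Hypothetical helper.) */
size_t build_design(const int *factor, const double *covar, size_t n,
                    int nlevels, double *dm)
{
    assert(factor != NULL && covar != NULL && dm != NULL && nlevels >= 1);
    size_t ncol = 1 + (size_t)(nlevels - 1) + 1;
    for (size_t i = 0; i < n; ++i) {
        double *row = dm + i * ncol;
        assert(factor[i] >= 1 && factor[i] <= nlevels);
        row[0] = 1.0;                          /* intercept */
        for (int l = 2; l <= nlevels; ++l)     /* indicator per level */
            row[l - 1] = (factor[i] == l) ? 1.0 : 0.0;
        row[ncol - 1] = covar[i];              /* continuous covariate */
    }
    return ncol;
}
```

Writing such code for every model quickly becomes error-prone once interactions and several factors are involved, which is the situation a formula-based specification is intended to avoid.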

3.2.15 Nonlinear Regression

Nonlinear regression is essentially a nonlinear optimization problem. This can be solved by the solver e04gnc from the NAG optimization modelling suite for general nonlinear data-fitting problems with constraints. This supports various forms of regression, including $\ell_{1}$, $\ell_{2}$ and $\ell_{\infty}$, and various forms of regularization including LASSO and ridge. As a general data fitting solver, e04gnc is also able to incorporate linear and nonlinear constraints on the model parameters.
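The three norms differ only in how the residuals are aggregated into a single objective value. The following minimal sketch in plain C computes all three for a given set of fitted values; it does not call e04gnc, and the `ResidualLoss` type and `residual_losses` function are hypothetical names used only for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Aggregated residual measures (hypothetical helper type). */
typedef struct {
    double l1;    /* sum of absolute residuals */
    double l2sq;  /* sum of squared residuals (squared l2 norm) */
    double linf;  /* largest absolute residual */
} ResidualLoss;

/* Compute the l1, squared l2 and l-infinity measures of y - fitted. */
ResidualLoss residual_losses(const double *y, const double *fitted,
                             size_t n)
{
    ResidualLoss r = {0.0, 0.0, 0.0};
    assert(y != NULL && fitted != NULL);
    for (size_t i = 0; i < n; ++i) {
        double e = y[i] - fitted[i];
        if (e < 0.0)
            e = -e;                 /* absolute residual */
        r.l1 += e;
        r.l2sq += e * e;
        if (e > r.linf)
            r.linf = e;
    }
    return r;
}
```

Minimizing `l2sq` corresponds to ordinary least squares, minimizing `l1` is less sensitive to outlying observations, and minimizing `linf` controls the worst-case residual.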

4 Functionality Index

Generalized linear models,

binomial errors

g02gbc

computes estimable function

g02gnc

gamma errors

g02gdc

Normal errors

g02gac

Poisson errors

g02gcc

prediction

g02gpc

transform model parameters

g02gkc

Least angle regression (includes LASSO),

Additional parameter calculation

g02mcc

Model fitting,

Cross-product matrix

g02mbc

Raw data

g02mac

Linear mixed effects regression,

fitting (via REML or ML)

g02jhc

initiation

g02jfc

initiation, combine

g02jgc

Multiple linear regression/General linear model,

add/delete observation from model

g02dcc

add independent variable to model

g02dec

computes estimable function

g02dnc

delete independent variable from model

g02dfc

general linear regression model

g02dac

regression for new dependent variable

g02dgc

regression parameters from updated model

g02ddc

transform model parameters

g02dkc

Nearest correlation matrix,

fixed elements

g02asc

fixed submatrix

g02anc

$k$-factor structure

g02aec

method of Qi and Sun,

element-wise weights

g02ajc

unweighted, unbounded

g02aac

weighted norm

g02abc

rank-constrained

g02akc

shrinkage method

g02apc

Non-parametric rank correlation (Kendall and/or Spearman):

missing values,

casewise treatment of missing values,

preserving input data

g02brc

Partial least squares,

calculates predictions given an estimated PLS model

g02ldc

fits a PLS model for a given number of factors

g02lcc

orthogonal scores using SVD

g02lac

orthogonal scores using Wold's method

g02lbc

Product-moment correlation,

correlation matrix,

compute correlation and covariance matrices

g02bxc

compute from sum of squares matrix

g02bwc

compute partial correlation and covariance matrices

g02byc

sum of squares matrix,

combine

g02bzc

compute

g02buc

update

g02btc

Quantile regression,

linear,

comprehensive

g02qgc

simple

g02qfc

Residuals,

Durbin–Watson test

g02fcc

standardized residuals and influence statistics

g02fac

Ridge regression,

ridge parameter(s) supplied

g02kbc

ridge parameter optimized

g02kac

Robust correlation,

Huber's method

g02hkc

user-supplied weight function only

g02hmc

user-supplied weight function plus derivatives

g02hlc

Robust regression,

compute weights for use with g02hdc

g02hbc

standard $M$-estimates

g02hac

user-supplied weight functions

g02hdc

variance-covariance matrix following g02hdc

g02hfc

Selecting regression model,

all possible regressions

g02eac

forward selection

g02eec

$R^{2}$ and $C_{p}$ statistics

g02ecc

Service functions,

general option getting function

g02zlc

general option setting function

g02zkc

Simple linear regression,

no intercept

g02cbc

with intercept

g02cac

Stepwise linear regression,

Clarke's sweep algorithm

g02efc


