NAG Library Function Document
nag_lars (g02mac)
1 Purpose
nag_lars (g02mac) performs Least Angle Regression (LARS), forward stagewise linear regression or Least Absolute Shrinkage and Selection Operator (LASSO).
2 Specification
#include <nag.h> |
#include <nagg02.h> |
void |
nag_lars (Nag_LARSModelType mtype,
Nag_LARSPreProcess pred,
Nag_LARSPreProcess prey,
Integer n,
Integer m,
const double d[],
Integer pdd,
const Integer isx[],
const double y[],
Integer mnstep,
Integer *ip,
Integer *nstep,
double b[],
Integer pdb,
double fitsum[],
const double ropt[],
Integer lropt,
NagError *fail) |
|
3 Description
nag_lars (g02mac) implements the LARS algorithm of
Efron et al. (2004) as well as the modifications needed to perform forward stagewise linear regression and fit LASSO and positive LASSO models.
Given a vector of
observed values,
and an
design matrix
, where the
th column of
, denoted
, is a vector of length
representing the
th independent variable
, standardized such that
, and
and a set of model parameters
to be estimated from the observed values, the LARS algorithm can be summarised as:
1. |
Set and all coefficients to zero, that is . |
2. |
Find the variable most correlated with , say . Add to the ‘most correlated’ set . If go to 8. |
3. |
Take the largest possible step in the direction of (i.e., increase the magnitude of ) until some other variable, say , has the same correlation with the current residual, . |
4. |
Increment and add to . |
5. |
If go to 8. |
6. |
Proceed in the ‘least angle direction’, that is, the direction which is equiangular between all variables in , altering the magnitude of the parameter estimates of those variables in , until the th variable, , has the same correlation with the current residual. |
7. |
Go to 4. |
8. |
Let . |
As well as being a model selection process in its own right, with a small number of modifications the LARS algorithm can be used to fit the LASSO model of
Tibshirani (1996), a positive LASSO model, where the independent variables enter the model in their defined direction (i.e.,
), forward stagewise linear regression (
Hastie et al. (2001)) and forward selection (
Weisberg (1985)). Details of the required modifications in each of these cases are given in
Efron et al. (2004).
The LASSO model of
Tibshirani (1996) is given by
for all values of
, where
. The positive LASSO model is the same as the standard LASSO model, given above, with the added constraint that
Unlike the standard LARS algorithm, when fitting either of the LASSO models, variables can be dropped as well as added to the set . Therefore the total number of steps is no longer bounded by .
Forward stagewise linear regression is an iterative procedure of the form:
1. |
Initialize and the vector of residuals . |
2. |
For each calculate . The value is therefore proportional to the correlation between the th independent variable and the vector of previous residual values, . |
3. |
Calculate , the value of with the largest absolute value of . |
4. |
If then go to 7. |
5. |
Update the residual values, with
where is a small constant and when and otherwise. |
6. |
Increment and go to 2. |
7. |
Set . |
If the largest possible step were to be taken, that is
then forward stagewise linear regression reverts to the standard forward selection method as implemented in
nag_step_regsn (g02eec).
The LARS procedure results in
models, one for each step of the fitting process. In order to aid in choosing which is the most suitable
Efron et al. (2004) introduced a
-type statistic given by
where
is the approximate degrees of freedom for the
th step and
One way of choosing a model is therefore to take the one with the smallest value of .
4 References
Efron B, Hastie T, Johnstone I and Tibshirani R (2004) Least Angle Regression The Annals of Statistics (Volume 32) 2 407–499
Hastie T, Tibshirani R and Friedman J (2001) The Elements of Statistical Learning: Data Mining, Inference and Prediction Springer (New York)
Tibshirani R (1996) Regression Shrinkage and Selection via the Lasso Journal of the Royal Statistics Society, Series B (Methodological) (Volume 58) 1 267–288
Weisberg S (1985) Applied Linear Regression Wiley
5 Arguments
- 1:
– Nag_LARSModelTypeInput
-
On entry: indicates the type of model to fit.
- LARS is performed.
- Forward linear stagewise regression is performed.
- LASSO model is fit.
- A positive LASSO model is fit.
Constraint:
, , or .
- 2:
– Nag_LARSPreProcessInput
-
On entry: indicates the type of data preprocessing to perform on the independent variables supplied in
d to comply with the standardized form of the design matrix.
- No preprocessing is performed.
- Each of the independent variables,
, for , are mean centered prior to fitting the model. The means of the independent variables, , are returned in b, with
, for .
- Each independent variable is normalized, with the th variable scaled by . The scaling factor used by variable is returned in .
- As and , all of the independent variables are mean centered prior to being normalized.
Suggested value:
Constraint:
, , or .
- 3:
– Nag_LARSPreProcessInput
-
On entry: indicates the type of data preprocessing to perform on the dependent variable supplied in
y.
- No preprocessing is performed, this is equivalent to setting .
- The dependent variable, , is mean centered prior to fitting the model, so . Which is equivalent to fitting a non-penalized intercept to the model and the degrees of freedom etc. are adjusted accordingly.
The value of used is returned in .
Suggested value:
Constraint:
or .
- 4:
– IntegerInput
-
On entry: , the number of observations.
Constraint:
.
- 5:
– IntegerInput
-
On entry: , the total number of independent variables.
Constraint:
.
- 6:
– const doubleInput
-
Note: the dimension,
dim, of the array
d
must be at least
.
On entry:
, the data, which along with
pred and
isx, defines the design matrix
. The
th observation for the
th variable must be supplied in
, for
and
.
- 7:
– IntegerInput
-
On entry: the stride separating row elements in the two-dimensional data stored in the array
d.
Constraint:
.
- 8:
– const IntegerInput
-
On entry: indicates which independent variables from
d will be included in the design matrix,
.
If
isx is
NULL, all variables are included in the design matrix.
Otherwise, for
when
must be set as follows:
- To indicate that the th variable, as supplied in d, is included in the design matrix;
- To indicated that the th variable, as supplied in d, is not included in the design matrix;
and
.
Constraint:
or and at least one value of , for .
- 9:
– const doubleInput
-
On entry: , the observations on the dependent variable.
- 10:
– IntegerInput
-
On entry: the maximum number of steps to carry out in the model fitting process.
If , the maximum number of steps the algorithm will take is if , otherwise .
If , the maximum number of steps the algorithm will take is likely to be several orders of magnitude more and is no longer bound by or .
If or , the maximum number of steps the algorithm will take lies somewhere between that of the LARS and forward linear stagewise regression, again it is no longer bound by or .
Constraint:
.
- 11:
– Integer *Output
-
On exit:
, number of parameter estimates.
If
isx is
NULL,
, i.e., the number of variables in
d.
Otherwise
is the number of nonzero values in
isx.
- 12:
– Integer *Output
-
On exit: , the actual number of steps carried out in the model fitting process.
- 13:
– doubleOutput
-
Note: the dimension,
dim, of the array
b
must be at least
.
On exit:
the parameter estimates, with
, the parameter estimate for the
th variable,
at the
th step of the model fitting process,
.
By default, when
or
the parameter estimates are rescaled prior to being returned. If the parameter estimates are required on the normalized scale, then this can be overridden via
ropt.
The values held in the remaining part of
b depend on the type of preprocessing performed.
- If ,
-
- If ,
-
- If ,
-
- If ,
-
for .
- 14:
– IntegerInput
-
On entry: the stride separating row elements in the two-dimensional data stored in the array
b.
Constraint:
, where
is the number of parameter estimates as described in
ip.
- 15:
– doubleOutput
-
On exit: summaries of the model fitting process. When
,
- , the sum of the absolute values of the parameter estimates for the th step of the modelling fitting process. If or , the scaled parameter estimates are used in the summation.
- , the residual sums of squares for the th step, where .
- , approximate degrees of freedom for the th step.
- , a -type statistic for the th step, where .
- , correlation between the residual at step and the most correlated variable not yet in the active set , where the residual at step is .
- , the step size used at step .
In addition
- , with if and otherwise.
- , the residual sums of squares for the null model, where when and otherwise.
- , the degrees of freedom for the null model, where if and otherwise.
- , a -type statistic for the null model, where .
- , where and .
Although the
statistics described above are returned when
NW_LIMIT_REACHED they may not be meaningful due to the estimate
not being based on the saturated model.
- 16:
– const doubleInput
-
On entry: optional arguments to control various aspects of the LARS algorithm.
The default value will be used for
if
, therefore setting
will use the default values for all optional arguments and
ropt need not be set and may be
NULL. The default value will also be used if an invalid value is supplied for a particular argument, for example, setting
will use the default value for argument
.
- The minimum step size that will be taken.
Default is
is used, where
is the
machine precision returned by
nag_machine_precision (X02AJC).
- General tolerance, used amongst other things, for comparing correlations.
Default is .
- If set to , parameter estimates are rescaled before being returned.
If set to , no rescaling is performed.
This argument has no effect when or .
Default is for the parameter estimates to be rescaled.
- If set to , it is assumed that the model contains an intercept during the model fitting process and when calculating the degrees of freedom.
If set to , no intercept is assumed.
This has no effect on the amount of preprocessing performed on
y.
Default is to treat the model as having an intercept when and as not having an intercept when .
- As implemented, the LARS algorithm can either work directly with and , or it can work with the cross-product matrices, and . In most cases it is more efficient to work with the cross-product matrices. This flag allows you direct control over which method is used, however, the default value will usually be the best choice.
If , and are worked with directly.
If , the cross-product matrices are used.
Default is when and and otherwise.
Constraints:
- ;
- ;
- or ;
- or ;
- or .
- 17:
– IntegerInput
-
On entry: length of the options array
ropt.
Constraint:
.
- 18:
– NagError *Input/Output
-
The NAG error argument (see
Section 3.6 in the Essential Introduction).
6 Error Indicators and Warnings
- NE_ALLOC_FAIL
-
Dynamic memory allocation failed.
See
Section 3.2.1.2 in the Essential Introduction for further information.
- NE_ARRAY_SIZE
-
On entry, .
Constraint: .
On entry,
and
.
Constraint: if
isx is
NULL then
.
On entry,
and
.
Constraint: if
isx is not
NULL then
.
On entry, and .
Constraint: .
- NE_BAD_PARAM
-
On entry, argument had an illegal value.
- NE_INT
-
On entry, .
Constraint: .
On entry, .
Constraint: .
- NE_INT_ARRAY
-
On entry, all values of
isx are zero.
Constraint: at least one value of
isx must be nonzero.
On entry, .
Constraint: or for all .
- NE_INTERNAL_ERROR
-
An internal error has occurred in this function. Check the function call and any array sizes. If the call is correct then please contact
NAG for assistance.
An unexpected error has been triggered by this function. Please contact
NAG.
See
Section 3.6.6 in the Essential Introduction for further information.
- NE_MAX_STEP
-
On entry, .
Constraint: .
- NE_NO_LICENCE
-
Your licence key may have expired or may not have been installed correctly.
See
Section 3.6.5 in the Essential Introduction for further information.
- NW_LIMIT_REACHED
-
Fitting process did not finish in
mnstep steps. Try increasing the size of
mnstep and supplying larger output arrays.
All output is returned as documented, up to step
mnstep, however,
and the
statistics may not be meaningful.
- NW_OVERFLOW_WARN
-
, therefore has been set to a large value. Output is returned as documented.
is approximately zero and hence the -type criterion cannot be calculated. All other output is returned as documented.
- NW_POTENTIAL_PROBLEM
-
Degenerate model, no variables added and . Output is returned as documented.
7 Accuracy
Not applicable.
nag_lars (g02mac) returns the parameter estimates at various points along the solution path of a LARS, LASSO or stagewise regression analysis. If the solution is required at a different set of points, for example when performing cross-validation, then
nag_lars_param (g02mcc) can be used.
For datasets with a large number of observations,
, it may be impractical to store the full
matrix in memory in one go. In such instances the cross-product matrices
and
can be calculated, using for example, multiple calls to
nag_sum_sqs (g02buc) and
nag_sum_sqs_combine (g02bzc), and
nag_lars_xtx (g02mbc) called to perform the analysis.
The amount of workspace used by nag_lars (g02mac) depends on whether the cross-product matrices are being used internally (as controlled by
ropt). If the cross-product matrices are being used then nag_lars (g02mac) internally allocates approximately
elements of real storage compared to
elements when
and
are used directly. In both cases approximately
elements of integer storage are also used. If a forward linear stagewise analysis is performed than an additional
elements of real storage are required.
9 Example
This example performs a LARS on a simulated dataset with observations and independent variables.
9.1 Program Text
Program Text (g02mace.c)
9.2 Program Data
Program Data (g02mace.d)
9.3 Program Results
Program Results (g02mace.r)
This example plot shows the regression coefficients () plotted against the scaled absolute sum of the parameter estimates ().