g02mbc performs Least Angle Regression (LARS), forward stagewise linear regression or Least Absolute Shrinkage and Selection Operator (LASSO) using cross-product matrices.
The function may be called by the names: g02mbc, nag_correg_lars_xtx or nag_lars_xtx.
3Description
g02mbc implements the LARS algorithm of Efron et al. (2004) as well as the modifications needed to perform forward stagewise linear regression and fit LASSO and positive LASSO models.
Given a vector of observed values,
and an design matrix , where the th column of , denoted , is a vector of length representing the th independent variable , standardized such that
, and
and a set of model parameters to be estimated from the observed values, the LARS algorithm can be summarised as:
1.Set and all coefficients to zero, that is .
2.Find the variable most correlated with , say . Add to the ‘most correlated’ set . If go to 8.
3.Take the largest possible step in the direction of (i.e., increase the magnitude of ) until some other variable, say , has the same correlation with the current residual, .
6.Proceed in the ‘least angle direction’, that is, the direction which is equiangular between all variables in , altering the magnitude of the parameter estimates of those variables in , until the th variable, , has the same correlation with the current residual.
As well as being a model selection process in its own right, with a small number of modifications the LARS algorithm can be used to fit the LASSO model of Tibshirani (1996), a positive LASSO model, where the independent variables enter the model in their defined direction, forward stagewise linear regression (Hastie et al. (2001)) and forward selection (Weisberg (1985)). Details of the required modifications in each of these cases are given in Efron et al. (2004).
for all values of , where . The positive LASSO model is the same as the standard LASSO model, given above, with the added constraint that
Unlike the standard LARS algorithm, when fitting either of the LASSO models, variables can be dropped as well as added to the set . Therefore, the total number of steps is no longer bounded by .
Forward stagewise linear regression is an iterative procedure of the form:
1.Initialize and the vector of residuals .
2.For each calculate . The value is, therefore, proportional to the correlation between the th independent variable and the vector of previous residual values, .
3.Calculate , the value of with the largest absolute value of .
If the largest possible step were to be taken, that is then forward stagewise linear regression reverts to the standard forward selection method as implemented in g02eec.
The LARS procedure results in models, one for each step of the fitting process. In order to aid in choosing which is the most suitable Efron et al. (2004) introduced a -type statistic given by
where is the approximate degrees of freedom for the th step and
One way of choosing a model is, therefore, to take the one with the smallest value of .
4References
Efron B, Hastie T, Johnstone I and Tibshirani R (2004) Least Angle Regression The Annals of Statistics (Volume 32)2 407–499
Hastie T, Tibshirani R and Friedman J (2001) The Elements of Statistical Learning: Data Mining, Inference and Prediction Springer (New York)
Tibshirani R (1996) Regression Shrinkage and Selection via the Lasso Journal of the Royal Statistics Society, Series B (Methodological) (Volume 58)1 267–288
Weisberg S (1985) Applied Linear Regression Wiley
5Arguments
1: – Nag_LARSModelTypeInput
On entry: indicates the type of model to fit.
LARS is performed.
Forward linear stagewise regression is performed.
LASSO model is fit.
A positive LASSO model is fit.
Constraint:
, , or .
2: – Nag_LARSPreProcessInput
On entry: indicates the type of preprocessing to perform on the cross-products involving the independent variables, i.e., those supplied in dtd and dty.
No preprocessing is performed.
Each independent variable is normalized, with the th variable scaled by . The scaling factor used by variable is returned in .
Constraint:
or .
3: – Nag_LARSPreProcessInput
On entry: indicates the type of data preprocessing that was perform on the dependent variable, , prior to calling this function.
No preprocessing was performed.
The dependent variable, , was mean centred.
Constraint:
or .
4: – IntegerInput
On entry: , the number of observations.
Constraint:
.
5: – IntegerInput
On entry: , the total number of independent variables.
Constraint:
.
6: – const doubleInput
Note: the dimension, dim, of the array dtd
must be at least
when
;
otherwise.
On entry: , the cross-product matrix, which along with isx, defines the design matrix cross-product .
If ,
must contain the cross-product of the th and th variable, for and . That is the cross-product stacked by columns as returned by g02buc, for example.
Otherwise
must contain the cross-product of the th and th variable, for and . It should be noted that, even though is symmetric, the full matrix must be supplied.
The matrix specified in dtd must be a valid cross-products matrix.
7: – IntegerInput
On entry: the stride separating row elements in the two-dimensional data stored in the array dtd.
Constraint:
or .
8: – const IntegerInput
Note: the dimension, dim, of the array isx
must be at least
, when .
On entry: indicates which independent variables from dtd will be included in the design matrix, .
If isx is NULL, all variables are included in the design matrix.
Otherwise must be set as follows, for :
To indicate that the th variable, as supplied in dtd, is included in the design matrix;
To indicate that the th variable, as supplied in dtd, is not included in the design matrix;
and .
Constraint:
or and at least one value of , for .
9: – const doubleInput
On entry: , the cross-product between the dependent variable, , and the independent variables .
10: – doubleInput
On entry: , the sums of squares of the dependent variable.
Constraint:
.
11: – IntegerInput
On entry: the maximum number of steps to carry out in the model fitting process.
If , the maximum number of steps the algorithm will take is if , otherwise .
If , the maximum number of steps the algorithm will take is likely to be several orders of magnitude more and is no longer bound by or .
If or , the maximum number of steps the algorithm will take lies somewhere between that of the LARS and forward linear stagewise regression, again it is no longer bound by or .
Constraint:
.
12: – Integer *Output
On exit: , number of parameter estimates.
If isx is NULL, , i.e., the number of variables in dtd.
On exit: , the actual number of steps carried out in the model fitting process.
14: – doubleOutput
Note: the dimension, dim, of the array b
must be at least
.
On exit: the parameter estimates, with , the parameter estimate for the th variable, at the th step of the model fitting process, .
By default, when the parameter estimates are rescaled prior to being returned. If the parameter estimates are required on the normalized scale, then this can be overridden via ropt.
The values held in the remaining part of b depend on the type of preprocessing performed.
for .
15: – IntegerInput
On entry: the stride separating row elements in the two-dimensional data stored in the array b.
Constraint:
, where is the number of parameter estimates as described in ip.
16: – doubleOutput
On exit: summaries of the model fitting process. When
, the sum of the absolute values of the parameter estimates for the th step of the modelling fitting process. If , the scaled parameter estimates are used in the summation.
, the residual sums of squares for the th step, where .
, approximate degrees of freedom for the th step.
, a -type statistic for the th step, where .
, correlation between the residual at step and the most correlated variable not yet in the active set , where the residual at step is .
, the step size used at step .
In addition
.
, the residual sums of squares for the null model, where .
, the degrees of freedom for the null model, where if and otherwise.
, a -type statistic for the null model, where .
, where and .
Although the statistics described above are returned when NW_LIMIT_REACHED they may not be meaningful due to the estimate not being based on the saturated model.
17: – const doubleInput
On entry: optional parameters to control various aspects of the LARS algorithm.
The default value will be used for if , therefore, setting will use the default values for all optional parameters and ropt need not be set and may be NULL. The default value will also be used if an invalid value is supplied for a particular argument, for example, setting will use the default value for argument .
The minimum step size that will be taken.
Default is is used, where is the machine precision returned by X02AJC.
General tolerance, used amongst other things, for comparing correlations.
Default is .
If set to , parameter estimates are rescaled before being returned. If set to , no rescaling is performed. This argument has no effect when .
Default is for the parameter estimates to be rescaled.
On entry, all values of isx are zero.
Constraint: at least one value of isx must be nonzero.
On entry, .
Constraint: or , for all .
NE_INTERNAL_ERROR
An internal error has occurred in this function. Check the function call and any array sizes. If the call is correct then please contact NAG for assistance.
See Section 7.5 in the Introduction to the NAG Library CL Interface for further information.
NE_MAX_STEP
On entry, .
Constraint: .
NE_NEG_ELEMENT
On entry, .
Constraint: diagonal elements of must be positive.
On entry, and .
Constraint: diagonal elements of must be positive.
NE_NEG_SX
A negative value for the residual sums of squares was obtained. Check the values of dtd, dty and yty.
NE_NO_LICENCE
Your licence key may have expired or may not have been installed correctly.
See Section 8 in the Introduction to the NAG Library CL Interface for further information.
NE_REAL
On entry, .
Constraint: .
NE_SYMM_MATRIX
The cross-product matrix supplied in dtd is not symmetric.
NW_LIMIT_REACHED
Fitting process did not finished in mnstep steps. Try increasing the size of mnstep and supplying larger output arrays. All output is returned as documented, up to step mnstep, however, and the statistics may not be meaningful.
NW_OVERFLOW_WARN
, therefore, sigma has been set to a large value. Output is returned as documented.
is approximately zero and hence the -type criterion cannot be calculated. All other output is returned as documented.
NW_POTENTIAL_PROBLEM
Degenerate model, no variables added and . Output is returned as documented.
7Accuracy
Not applicable.
8Parallelism and Performance
g02mbc is threaded by NAG for parallel execution in multithreaded implementations of the NAG Library.
g02mbc makes calls to BLAS and/or LAPACK routines, which may be threaded within the vendor library used by this implementation. Consult the documentation for the vendor library for further information.
Please consult the X06 Chapter Introduction for information on how to control and interrogate the OpenMP environment used within this function. Please also consult the Users' Note for your implementation for any additional implementation-specific information.
9Further Comments
The solution path to the LARS, LASSO and stagewise regression analysis is a continuous, piecewise linear. g02mbc returns the parameter estimates at various points along this path. g02mcc can be used to obtain estimates at different points along the path.
If you have the raw data values, that is and , then g02mac can be used instead of g02mbc.
10Example
This example performs a LARS on a simulated dataset with observations and independent variables.
The example uses g02buc to get the cross-products of the augmented matrix . The first elements of the (column packed) cross-products matrix returned, therefore, contain the elements of , the next elements contain and the last element .