NAG FL Interface
g02mbf (lars_​xtx)

Settings help

FL Name Style:

FL Specification Language:

1 Purpose

g02mbf performs Least Angle Regression (LARS), forward stagewise linear regression or Least Absolute Shrinkage and Selection Operator (LASSO) using cross-product matrices.

2 Specification

Fortran Interface
Subroutine g02mbf ( mtype, pred, intcpt, n, m, dtd, lddtd, isx, lisx, dty, yty, mnstep, ip, nstep, b, ldb, fitsum, ropt, lropt, ifail)
Integer, Intent (In) :: mtype, pred, intcpt, n, m, lddtd, isx(lisx), lisx, mnstep, ldb, lropt
Integer, Intent (Inout) :: ifail
Integer, Intent (Out) :: ip, nstep
Real (Kind=nag_wp), Intent (In) :: dtd(lddtd,*), dty(m), yty, ropt(lropt)
Real (Kind=nag_wp), Intent (Inout) :: b(ldb,*)
Real (Kind=nag_wp), Intent (Out) :: fitsum(6,mnstep+1)
C Header Interface
#include <nag.h>
void  g02mbf_ (const Integer *mtype, const Integer *pred, const Integer *intcpt, const Integer *n, const Integer *m, const double dtd[], const Integer *lddtd, const Integer isx[], const Integer *lisx, const double dty[], const double *yty, const Integer *mnstep, Integer *ip, Integer *nstep, double b[], const Integer *ldb, double fitsum[], const double ropt[], const Integer *lropt, Integer *ifail)
The routine may be called by the names g02mbf or nagf_correg_lars_xtx.

3 Description

g02mbf implements the LARS algorithm of Efron et al. (2004) as well as the modifications needed to perform forward stagewise linear regression and fit LASSO and positive LASSO models.
Given a vector of n observed values, y = {yi:i=1,2,,n} and an n×p design matrix X, where the jth column of X, denoted xj, is a vector of length n representing the jth independent variable xj, standardized such that i=1 n xij =0 , and i=1 n xij2 =1 and a set of model parameters β to be estimated from the observed values, the LARS algorithm can be summarised as:
  1. 1.Set k=1 and all coefficients to zero, that is β=0.
  2. 2.Find the variable most correlated with y, say xj1. Add xj1 to the ‘most correlated’ set A. If p=1 go to 8.
  3. 3.Take the largest possible step in the direction of xj1 (i.e., increase the magnitude of βj1) until some other variable, say xj2, has the same correlation with the current residual, y-xj1βj1.
  4. 4.Increment k and add xjk to A.
  5. 5.If |A|=p go to 8.
  6. 6.Proceed in the ‘least angle direction’, that is, the direction which is equiangular between all variables in A, altering the magnitude of the parameter estimates of those variables in A, until the kth variable, xjk, has the same correlation with the current residual.
  7. 7.Go to 4.
  8. 8.Let K=k.
As well as being a model selection process in its own right, with a small number of modifications the LARS algorithm can be used to fit the LASSO model of Tibshirani (1996), a positive LASSO model, where the independent variables enter the model in their defined direction, forward stagewise linear regression (Hastie et al. (2001)) and forward selection (Weisberg (1985)). Details of the required modifications in each of these cases are given in Efron et al. (2004).
The LASSO model of Tibshirani (1996) is given by
minimize α , βk p y-α-XTβk 2   subject to   βk1 tk  
for all values of tk, where α=y¯=n−1 i=1 n yi. The positive LASSO model is the same as the standard LASSO model, given above, with the added constraint that
βkj 0 ,   j=1,2,,p .  
Unlike the standard LARS algorithm, when fitting either of the LASSO models, variables can be dropped as well as added to the set A. Therefore, the total number of steps K is no longer bounded by p.
Forward stagewise linear regression is an iterative procedure of the form:
  1. 1.Initialize k=1 and the vector of residuals r0=y-α.
  2. 2.For each j=1,2,,p calculate cj=xjTrk-1. The value cj is, therefore, proportional to the correlation between the jth independent variable and the vector of previous residual values, rk.
  3. 3.Calculate jk = argmax j |cj| , the value of j with the largest absolute value of cj.
  4. 4.If |cjk|<ε then go to 7.
  5. 5.Update the residual values, with
    rk = rk-1 + δ ​ ​ sign(cjk) xjk  
    where δ is a small constant and sign(cjk)=−1 when cjk<0 and 1 otherwise.
  6. 6.Increment k and go to 2.
  7. 7.Set K=k.
If the largest possible step were to be taken, that is δ=|cjk| then forward stagewise linear regression reverts to the standard forward selection method as implemented in g02eef.
The LARS procedure results in K models, one for each step of the fitting process. In order to aid in choosing which is the most suitable Efron et al. (2004) introduced a Cp-type statistic given by
Cp(k) = y-XTβk 2 σ2 -n+2νk,  
where νk is the approximate degrees of freedom for the kth step and
σ2 = n-yTyνK .  
One way of choosing a model is, therefore, to take the one with the smallest value of Cp(k).

4 References

Efron B, Hastie T, Johnstone I and Tibshirani R (2004) Least Angle Regression The Annals of Statistics (Volume 32) 2 407–499
Hastie T, Tibshirani R and Friedman J (2001) The Elements of Statistical Learning: Data Mining, Inference and Prediction Springer (New York)
Tibshirani R (1996) Regression Shrinkage and Selection via the Lasso Journal of the Royal Statistics Society, Series B (Methodological) (Volume 58) 1 267–288
Weisberg S (1985) Applied Linear Regression Wiley

5 Arguments

1: mtype Integer Input
On entry: indicates the type of model to fit.
LARS is performed.
Forward linear stagewise regression is performed.
LASSO model is fit.
A positive LASSO model is fit.
Constraint: mtype=1, 2, 3 or 4.
2: pred Integer Input
On entry: indicates the type of preprocessing to perform on the cross-products involving the independent variables, i.e., those supplied in dtd and dty.
No preprocessing is performed.
Each independent variable is normalized, with the jth variable scaled by 1/xjTxj. The scaling factor used by variable j is returned in b(j,nstep+1).
Constraint: pred=0 or 2.
3: intcpt Integer Input
On entry: indicates the type of data preprocessing that was perform on the dependent variable, y, prior to calling this routine.
No preprocessing was performed.
The dependent variable, y, was mean centred.
Constraint: intcpt=0 or 1.
4: n Integer Input
On entry: n, the number of observations.
Constraint: n1.
5: m Integer Input
On entry: m, the total number of independent variables.
Constraint: m1.
6: dtd(lddtd,*) Real (Kind=nag_wp) array Input
Note: the second dimension of the array dtd must be at least m(m+1)/2 if lddtd=1, and at least m otherwise.
On entry: DTD, the cross-product matrix, which along with isx, defines the design matrix cross-product XTX.
If lddtd=1, dtd(1,i×(i-1)/2+j) must contain the cross-product of the ith and jth variable, for i=1,2,,m and j=1,2,,m. That is the cross-product stacked by columns as returned by g02buf, for example.
Otherwise dtd(i,j) must contain the cross-product of the ith and jth variable, for i=1,2,,m and j=1,2,,m. It should be noted that, even though DTD is symmetric, the full matrix must be supplied.
The matrix specified in dtd must be a valid cross-products matrix.
7: lddtd Integer Input
On entry: the first dimension of the array dtd as declared in the (sub)program from which g02mbf is called.
Constraint: lddtd=1 or lddtdm.
8: isx(lisx) Integer array Input
On entry: indicates which independent variables from dtd will be included in the design matrix, X.
If lisx=0, all variables are included in the design matrix and isx is not referenced.
If lisx=m,isx(j) must be set as follows, for j=1,2,,m:
To indicate that the jth variable, as supplied in dtd, is included in the design matrix;
To indicate that the jth variable, as supplied in dtd, is not included in the design matrix;
and p= j=1 m isx(j).
Constraint: if lisx=m, isx(j)=0 or 1 and at least one value of isx(j)0, for j=1,2,,m.
9: lisx Integer Input
On entry: length of the isx array.
Constraint: lisx=0 or m.
10: dty(m) Real (Kind=nag_wp) array Input
On entry: DTy, the cross-product between the dependent variable, y, and the independent variables D.
11: yty Real (Kind=nag_wp) Input
On entry: yTy, the sums of squares of the dependent variable.
Constraint: yty>0.0.
12: mnstep Integer Input
On entry: the maximum number of steps to carry out in the model fitting process.
If mtype=1, i.e., a LARS is being performed, the maximum number of steps the algorithm will take is min(p,n) if intcpt=0, otherwise min(p,n-1).
If mtype=2, i.e., a forward linear stagewise regression is being performed, the maximum number of steps the algorithm will take is likely to be several orders of magnitude more and is no longer bound by p or n.
If mtype=3 or 4, i.e., a LASSO or positive LASSO model is being fit, the maximum number of steps the algorithm will take lies somewhere between that of the LARS and forward linear stagewise regression, again it is no longer bound by p or n.
Constraint: mnstep1.
13: ip Integer Output
On exit: p, number of parameter estimates.
If lisx=0, p=m, i.e., the number of variables in dtd.
Otherwise p is the number of nonzero values in isx.
14: nstep Integer Output
On exit: K, the actual number of steps carried out in the model fitting process.
15: b(ldb,*) Real (Kind=nag_wp) array Output
Note: the second dimension of the array b must be at least mnstep+1.
On exit: β the parameter estimates, with b(j,k)=βkj, the parameter estimate for the jth variable, j=1,2,,p at the kth step of the model fitting process, k=1,2,,nstep.
By default, when pred=2 the parameter estimates are rescaled prior to being returned. If the parameter estimates are required on the normalized scale, then this can be overridden via ropt.
The values held in the remaining part of b depend on the type of preprocessing performed.
If ​pred=0 b(j,nstep+1) = 1, if ​pred=2 b(j,nstep+1) = 1/ xjTxj,
for j=1,2,p.
16: ldb Integer Input
On entry: the first dimension of the array b as declared in the (sub)program from which g02mbf is called.
Constraint: ldbp, where p is the number of parameter estimates as described in ip.
17: fitsum(6,mnstep+1) Real (Kind=nag_wp) array Output
On exit: summaries of the model fitting process. When k=1,2,,nstep
βk1, the sum of the absolute values of the parameter estimates for the kth step of the modelling fitting process. If pred=2, the scaled parameter estimates are used in the summation.
RSSk, the residual sums of squares for the kth step, where RSSk= y-XTβk 2 .
νk, approximate degrees of freedom for the kth step.
Cp(k), a Cp-type statistic for the kth step, where Cp(k)=RSSkσ2-n+2νk.
C^k, correlation between the residual at step k-1 and the most correlated variable not yet in the active set A, where the residual at step 0 is y.
γ^k, the step size used at step k.
In addition
RSS0, the residual sums of squares for the null model, where RSS0=yTy.
ν0, the degrees of freedom for the null model, where ν0=0 if intcpt=0 and ν0=1 otherwise.
Cp(0), a Cp-type statistic for the null model, where Cp(0)=RSS0σ2-n+2ν0.
σ2, where σ2=n-RSSKνK and K=nstep.
Although the Cp statistics described above are returned when ifail=122 they may not be meaningful due to the estimate σ2 not being based on the saturated model.
18: ropt(lropt) Real (Kind=nag_wp) array Input
On entry: optional parameters to control various aspects of the LARS algorithm.
The default value will be used for ropt(i) if lropt<i, therefore, setting lropt=0 will use the default values for all optional parameters and ropt need not be set. The default value will also be used if an invalid value is supplied for a particular argument, for example, setting ropt(i)=−1 will use the default value for argument i.
The minimum step size that will be taken.
Default is 100×eps is used, where eps is the machine precision returned by x02ajf.
General tolerance, used amongst other things, for comparing correlations.
Default is ropt(1).
If set to 1, parameter estimates are rescaled before being returned. If set to 0, no rescaling is performed. This argument has no effect when pred=0.
Default is for the parameter estimates to be rescaled.
  • ropt(1)>machine precision;
  • ropt(2)>machine precision.
19: lropt Integer Input
On entry: length of the options array ropt.
Constraint: 0lropt3.
20: ifail Integer Input/Output
On entry: ifail must be set to 0, −1 or 1 to set behaviour on detection of an error; these values have no effect when no error is detected.
A value of 0 causes the printing of an error message and program execution will be halted; otherwise program execution continues. A value of −1 means that an error message is printed while a value of 1 means that it is not.
If halting is not appropriate, the value −1 or 1 is recommended. If message printing is undesirable, then the value 1 is recommended. Otherwise, the value −1 is recommended since useful values can be provided in some output arguments even when ifail0 on exit. When the value -1 or 1 is used it is essential to test the value of ifail on exit.
On exit: ifail=0 unless the routine detects an error or a warning has been flagged (see Section 6).

6 Error Indicators and Warnings

If on entry ifail=0 or −1, explanatory error messages are output on the current error message unit (as defined by x04aaf).
Errors or warnings detected by the routine:
Note: in some cases g02mbf may return useful information.
On entry, mtype=value.
Constraint: mtype=1, 2, 3 or 4.
On entry, pred=value.
Constraint: pred=0 or 2.
On entry, intcpt=value.
Constraint: intcpt=0 or 1.
On entry, n=value.
Constraint: n1.
On entry, m=value.
Constraint: m1.
The cross-product matrix supplied in dtd is not symmetric.
On entry, dtd(1,value)=value.
Constraint: diagonal elements of DTD must be positive.
On entry, i=value and dtd(i,i)=value.
Constraint: diagonal elements of DTD must be positive.
On entry, lddtd=value and m=value
Constraint: lddtd=1 or lddtdm.
On entry, isx(value)=value.
Constraint: isx(i)=0 or 1, for all i.
On entry, all values of isx are zero.
Constraint: at least one value of isx must be nonzero.
On entry, lisx=value and m=value.
Constraint: lisx=0 or m.
On entry, yty=value.
Constraint: yty>0.0.
A negative value for the residual sums of squares was obtained. Check the values of dtd, dty and yty.
On entry, mnstep=value.
Constraint: mnstep1.
Fitting process did not finished in mnstep steps. Try increasing the size of mnstep and supplying larger output arrays.
All output is returned as documented, up to step mnstep, however, σ and the Cp statistics may not be meaningful.
On entry, ldb=value and m=value.
Constraint: if lisx=0 then ldbm.
On entry, ldb=value and p=value.
Constraint: if lisx=M, ldbp.
σ2 is approximately zero and hence the Cp-type criterion cannot be calculated. All other output is returned as documented.
νK=n, therefore, sigma has been set to a large value. Output is returned as documented.
Degenerate model, no variables added and nstep=0. Output is returned as documented.
On entry, lropt=value.
Constraint: 0lropt3.
An unexpected error has been triggered by this routine. Please contact NAG.
See Section 7 in the Introduction to the NAG Library FL Interface for further information.
Your licence key may have expired or may not have been installed correctly.
See Section 8 in the Introduction to the NAG Library FL Interface for further information.
Dynamic memory allocation failed.
See Section 9 in the Introduction to the NAG Library FL Interface for further information.

7 Accuracy

Not applicable.

8 Parallelism and Performance

Background information to multithreading can be found in the Multithreading documentation.
g02mbf is threaded by NAG for parallel execution in multithreaded implementations of the NAG Library.
g02mbf makes calls to BLAS and/or LAPACK routines, which may be threaded within the vendor library used by this implementation. Consult the documentation for the vendor library for further information.
Please consult the X06 Chapter Introduction for information on how to control and interrogate the OpenMP environment used within this routine. Please also consult the Users' Note for your implementation for any additional implementation-specific information.

9 Further Comments

The solution path to the LARS, LASSO and stagewise regression analysis is a continuous, piecewise linear. g02mbf returns the parameter estimates at various points along this path. g02mcf can be used to obtain estimates at different points along the path.
If you have the raw data values, that is D and y, then g02maf can be used instead of g02mbf.

10 Example

This example performs a LARS on a simulated dataset with 20 observations and 6 independent variables.
The example uses g02buf to get the cross-products of the augmented matrix [Dy]. The first m(m+1)/2 elements of the (column packed) cross-products matrix returned, therefore, contain the elements of DTD, the next m elements contain DTy and the last element yTy.

10.1 Program Text

Program Text (g02mbfe.f90)

10.2 Program Data

Program Data (g02mbfe.d)

10.3 Program Results

Program Results (g02mbfe.r)
This example plot shows the regression coefficients (βk) plotted against the scaled absolute sum of the parameter estimates (βk1).
GnuplotProduced by GNUPLOT 5.4 patchlevel 6 −1 0 1 2 3 4 0 20 40 60 80 100 120 140 160 180 200 220 Parameter Estimates (βkj) ||βk||1 bk1 βk1 bk2 βk2 bk3 βk3 bk4 βk4 bk5 βk5 bk6 βk6 Example Program Parameter estimates for a LARS model fitted to a simulated dataset