nagcpp::correg::lars (g02ma) : NAG Library, Mark 27

Given a vector of $n$ observed values, $y = \{y_{i} : i = 1, 2, \dots, n\}$ and an $n \times p$ design matrix $X$, where the $j$ th column of $X$, denoted $x_{j}$, is a vector of length $n$ representing the $j$ th independent variable $x_{j}$, standardized such that $\sum_{i = 1}^{n} x_{i j} = 0$ and $\sum_{i = 1}^{n} x_{i j}^{2} = 1$, and a set of model parameters $β$ to be estimated from the observed values, the LARS algorithm can be summarised as:

1. Set $k = 1$ and all coefficients to zero, that is $β = 0$.
2. Find the variable most correlated with $y$, say $x_{j_{1}}$. Add $x_{j_{1}}$ to the ‘most correlated’ set $A$. If $p = 1$ go to 8.
3. Take the largest possible step in the direction of $x_{j_{1}}$ (i.e., increase the magnitude of $β_{j_{1}}$) until some other variable, say $x_{j_{2}}$, has the same correlation with the current residual, $y - x_{j_{1}} β_{j_{1}}$.
4. Increment $k$ and add $x_{j_{k}}$ to $A$.
5. If $| A | = p$ go to 8.
6. Proceed in the ‘least angle direction’, that is, the direction which is equiangular between all variables in $A$, altering the magnitude of the parameter estimates of those variables in $A$, until the $k$ th variable, $x_{j_{k}}$, has the same correlation with the current residual.
7. Go to 4.
8. Let $K = k$.
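
The standardization conditions and step 2 of the summary above can be illustrated with a short, self-contained sketch. It does not call the NAG library: the helper names `standardize_column` and `most_correlated` are hypothetical, and storing the design matrix as a vector of columns is an assumption made purely for convenience.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Centre a column to zero mean and scale it to unit sum of squares,
// i.e. sum_i x_ij = 0 and sum_i x_ij^2 = 1.
// Assumes the column is not constant (its centred norm must be nonzero).
std::vector<double> standardize_column(std::vector<double> x) {
  assert(!x.empty());
  double mean = 0.0;
  for (double v : x) mean += v;
  mean /= static_cast<double>(x.size());
  double ss = 0.0;
  for (double &v : x) {
    v -= mean;
    ss += v * v;
  }
  assert(ss > 0.0);
  const double norm = std::sqrt(ss);
  for (double &v : x) v /= norm;
  return x;
}

// Step 2: index of the column with the largest absolute inner product
// with the residual r (the dependent variable at the start).
std::size_t most_correlated(const std::vector<std::vector<double>> &cols,
                            const std::vector<double> &r) {
  assert(!cols.empty());
  std::size_t best = 0;
  double best_abs = -1.0;
  for (std::size_t j = 0; j < cols.size(); ++j) {
    assert(cols[j].size() == r.size());
    double c = 0.0;
    for (std::size_t i = 0; i < r.size(); ++i) c += cols[j][i] * r[i];
    if (std::fabs(c) > best_abs) {
      best_abs = std::fabs(c);
      best = j;
    }
  }
  return best;
}
```

Because every column has unit norm after standardization, comparing absolute inner products with the residual is equivalent to comparing absolute correlations, which is why the selection loop needs no further scaling.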

mtype – types::f77_integer Input

On entry: indicates the type of model to fit.

$mtype = 1$: LARS is performed.
$mtype = 2$: Forward linear stagewise regression is performed.
$mtype = 3$: A LASSO model is fit.
$mtype = 4$: A positive LASSO model is fit.

Constraint: $mtype = 1$, $2$, $3$ or $4$.

d (n, m) – double array Input

On entry: $D$, the data, which along with pred and isx, defines the design matrix $X$. The $i$ th observation for the $j$ th variable must be supplied in $d (i - 1, j - 1)$, for $i = 1, 2, \dots, n$ and $j = 1, 2, \dots, m$.

isx (lisx) – types::f77_integer array Input

On entry: indicates which independent variables from d will be included in the design matrix, $X$.

If isx is nullptr, all variables are included in the design matrix.

Otherwise $isx (j - 1)$ must be set as follows, for $j = 1, 2, \dots, m$:

$isx (j - 1) = 1$: To indicate that the $j$ th variable, as supplied in d, is included in the design matrix;
$isx (j - 1) = 0$: To indicate that the $j$ th variable, as supplied in d, is not included in the design matrix;

and $p = \sum_{j = 1}^{m} isx (j - 1)$.

Constraint: if $lisx = m$, $isx (j - 1) = 0$ or $1$ and at least one value of $isx (j - 1) \neq 0$, for $j = 1, 2, \dots, m$.
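
The selection rule for isx can be sketched without the library as follows. The function name `included_columns` is hypothetical, and an empty vector is used here as a stand-in for "isx is nullptr"; that stand-in is an assumption of the sketch, not the interface's behaviour. The number of parameter estimates $p$ is the size of the returned index list.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Return the zero-based indices of the columns of d that enter X.
// An empty isx stands in for "isx is nullptr": every column is included.
std::vector<std::size_t> included_columns(const std::vector<int> &isx,
                                          std::size_t m) {
  std::vector<std::size_t> cols;
  if (isx.empty()) {
    for (std::size_t j = 0; j < m; ++j) cols.push_back(j);
    return cols;
  }
  assert(isx.size() == m);  // lisx = m
  for (std::size_t j = 0; j < m; ++j) {
    assert(isx[j] == 0 || isx[j] == 1);  // each flag must be 0 or 1
    if (isx[j] == 1) cols.push_back(j);
  }
  assert(!cols.empty());  // at least one variable must be included
  return cols;
}
```

For example, with $m = 4$ and flags 1, 0, 1, 1 the sketch selects columns 0, 2 and 3 of d, so $p = 3$.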

y (n) – double array Input

On entry: $y$, the observations on the dependent variable.

ip – types::f77_integer Output

On exit: $p$, number of parameter estimates.

If isx is nullptr, $p = m$, i.e., the number of variables in d.

Otherwise $p$ is the number of nonzero values in isx.

nstep – types::f77_integer Output

On exit: $K$, the actual number of steps carried out in the model fitting process.

b (vl_p, mnstep + 2) – double array Output

On exit: $β$, the parameter estimates, with $b (j - 1, k - 1) = β_{k j}$, the parameter estimate for the $j$ th variable, $j = 1, 2, \dots, p$, at the $k$ th step of the model fitting process, $k = 1, 2, \dots, nstep$.

By default, when $pred = 2$ or $3$ the parameter estimates are rescaled prior to being returned. If the parameter estimates are required on the normalized scale, then this can be overridden via ropt.

The values held in the remaining part of b depend on the type of preprocessing performed.

If $pred = 0$: $\begin{array}{lcl} b (j - 1, nstep) & = & 1 \\ b (j - 1, nstep + 1) & = & 0 \end{array}$
If $pred = 1$: $\begin{array}{lcl} b (j - 1, nstep) & = & 1 \\ b (j - 1, nstep + 1) & = & {\bar{x}}_{j} \end{array}$
If $pred = 2$: $\begin{array}{lcl} b (j - 1, nstep) & = & 1 / \sqrt{x_{j}^{T} x_{j}} \\ b (j - 1, nstep + 1) & = & 0 \end{array}$
If $pred = 3$: $\begin{array}{lcl} b (j - 1, nstep) & = & 1 / \sqrt{{(x_{j} - {\bar{x}}_{j})}^{T} (x_{j} - {\bar{x}}_{j})} \\ b (j - 1, nstep + 1) & = & {\bar{x}}_{j} \end{array}$

for $j = 1, 2, \dots, p$.
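
The four cases for the remaining part of b can be reproduced for a single column with the sketch below. It is an illustration of the tabulated formulas only, not the library's implementation; `preprocessing_entries` is a hypothetical name, and the column is assumed to have a nonzero (centred) norm whenever scaling is requested.

```cpp
#include <cassert>
#include <cmath>
#include <utility>
#include <vector>

// For one column x_j and a given pred, return the pair
// {b(j-1, nstep), b(j-1, nstep+1)}: the scaling factor and the mean used.
std::pair<double, double> preprocessing_entries(const std::vector<double> &x,
                                                int pred) {
  assert(!x.empty());
  assert(pred >= 0 && pred <= 3);
  double mean = 0.0;
  for (double v : x) mean += v;
  mean /= static_cast<double>(x.size());

  const bool centred = (pred == 1 || pred == 3);
  const bool scaled = (pred == 2 || pred == 3);

  double scale = 1.0;
  if (scaled) {
    // Sum of squares of the (optionally centred) column.
    double ss = 0.0;
    for (double v : x) {
      const double u = centred ? v - mean : v;
      ss += u * u;
    }
    assert(ss > 0.0);
    scale = 1.0 / std::sqrt(ss);
  }
  return {scale, centred ? mean : 0.0};
}
```

For the column (1, 2, 3), the sketch gives a scaling factor of $1 / \sqrt{14}$ for $pred = 2$ and $1 / \sqrt{2}$ with mean $2$ for $pred = 3$.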

fitsum (6, mnstep + 1) – double array Output

On exit: summaries of the model fitting process. When $k = 1, 2, \dots, nstep$:

$fitsum (0, k - 1)$: ${‖ β_{k} ‖}_{1}$, the sum of the absolute values of the parameter estimates for the $k$ th step of the model fitting process. If $pred = 2$ or $3$, the scaled parameter estimates are used in the summation.
$fitsum (1, k - 1)$: ${RSS}_{k}$, the residual sum of squares for the $k$ th step, where ${RSS}_{k} = {‖ y - X β_{k} ‖}^{2}$.
$fitsum (2, k - 1)$: $ν_{k}$ , approximate degrees of freedom for the $k$ th step.
$fitsum (3, k - 1)$: $C_{p}^{(k)}$ , a $C_{p}$ -type statistic for the $k$ th step, where $C_{p}^{(k)} = \frac{{RSS}_{k}}{σ^{2}} - n + 2 ν_{k}$ .
$fitsum (4, k - 1)$: ${\hat{C}}_{k}$ , correlation between the residual at step $k - 1$ and the most correlated variable not yet in the active set $A$ , where the residual at step $0$ is $y$ .
$fitsum (5, k - 1)$: ${\hat{γ}}_{k}$ , the step size used at step $k$ .

In addition

$fitsum (0, nstep)$: $α$ , with $α = \bar{y}$ if $prey = 1$ and $0$ otherwise.
$fitsum (1, nstep)$: ${RSS}_{0}$, the residual sum of squares for the null model, where ${RSS}_{0} = y^{T} y$ when $prey = 0$ and ${RSS}_{0} = {(y - \bar{y})}^{T} (y - \bar{y})$ otherwise.
$fitsum (2, nstep)$: $ν_{0}$ , the degrees of freedom for the null model, where $ν_{0} = 0$ if $prey = 0$ and $ν_{0} = 1$ otherwise.
$fitsum (3, nstep)$: $C_{p}^{(0)}$ , a $C_{p}$ -type statistic for the null model, where $C_{p}^{(0)} = \frac{{RSS}_{0}}{σ^{2}} - n + 2 ν_{0}$ .
$fitsum (4, nstep)$: $σ^{2}$ , where $σ^{2} = \frac{n - {RSS}_{K}}{ν_{K}}$ and $K = nstep$ .

Although the $C_{p}$ statistics described above are returned when $errorid = 112$, they may not be meaningful due to the estimate $σ^{2}$ not being based on the saturated model.
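
As a worked instance of the $C_{p}$-type formula, the sketch below evaluates $\frac{RSS}{σ^{2}} - n + 2 ν$ for given inputs. The function name `cp_statistic` is hypothetical, and $σ^{2}$ is taken as an input (for example the value held in $fitsum (4, nstep)$) rather than recomputed.

```cpp
#include <cassert>
#include <cstddef>

// C_p-type statistic: RSS / sigma^2 - n + 2 * nu.
// rss and nu are the residual sum of squares and degrees of freedom for
// the step (or null model) of interest; sigma2 must be positive.
double cp_statistic(double rss, double sigma2, std::size_t n, double nu) {
  assert(sigma2 > 0.0);
  return rss / sigma2 - static_cast<double>(n) + 2.0 * nu;
}
```

With $RSS = 10$, $σ^{2} = 2$, $n = 8$ and $ν = 3$, the statistic is $5 - 8 + 6 = 3$.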

opt – OptionalG02MA Input/Output

Optional parameter container, derived from Optional.

Container for:

pred – types::f77_integer

This optional parameter may be set using the method OptionalG02MA::pred and accessed via OptionalG02MA::get_pred.

Default: $3$

On entry: indicates the type of data preprocessing to perform on the independent variables supplied in d to comply with the standardized form of the design matrix.

$pred = 0$: No preprocessing is performed.
$pred = 1$: Each of the independent variables, $x_{j}$, for $j = 1, 2, \dots, p$, is mean centred prior to fitting the model. The means of the independent variables, $\bar{x}$, are returned in b, with ${\bar{x}}_{j} = b (j - 1, nstep + 1)$, for $j = 1, 2, \dots, p$.
$pred = 2$: Each independent variable is normalized, with the $j$ th variable scaled by $1 / \sqrt{x_{j}^{T} x_{j}}$ . The scaling factor used by variable $j$ is returned in $b (j - 1, nstep)$ .
$pred = 3$: As $pred = 1$ and $2$ , all of the independent variables are mean centred prior to being normalized.

Suggested value: $pred = 3$.

Constraint: $pred = 0$, $1$, $2$ or $3$.

prey – types::f77_integer

This optional parameter may be set using the method OptionalG02MA::prey and accessed via OptionalG02MA::get_prey.

Default: $1$

On entry: indicates the type of data preprocessing to perform on the dependent variable supplied in y.

$prey = 0$: No preprocessing is performed; this is equivalent to setting $α = 0$.
$prey = 1$: The dependent variable, $y$, is mean centred prior to fitting the model, so $α = \bar{y}$. This is equivalent to fitting a non-penalized intercept to the model, and the degrees of freedom etc. are adjusted accordingly.

The value of $α$ used is returned in $fitsum (0, nstep)$.

Suggested value: $prey = 1$.

Constraint: $prey = 0$ or $1$.

mnstep – types::f77_integer

This optional parameter may be set using the method OptionalG02MA::mnstep and accessed via OptionalG02MA::get_mnstep.

Default: if $mtype = 1$, $m$; otherwise $200 \times m$.

On entry: the maximum number of steps to carry out in the model fitting process.

If $mtype = 1$, i.e., a LARS is being performed, the maximum number of steps the algorithm will take is $\min (p, n)$ if $prey = 0$, otherwise $\min (p, n - 1)$.

If $mtype = 2$, i.e., a forward linear stagewise regression is being performed, the maximum number of steps the algorithm will take is likely to be several orders of magnitude more and is no longer bound by $p$ or $n$.

If $mtype = 3$ or $4$, i.e., a LASSO or positive LASSO model is being fit, the maximum number of steps the algorithm will take lies somewhere between that of the LARS and forward linear stagewise regression; again, it is no longer bound by $p$ or $n$.

Constraint: $mnstep \geq 1$.
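
The default for mnstep and the LARS step bound can be written out as two small helpers. This is only a restatement of the rules given for mnstep in code form; the names `default_mnstep` and `max_lars_steps` are hypothetical and are not part of the interface.

```cpp
#include <algorithm>
#include <cassert>

// Default mnstep: m when mtype = 1, otherwise 200 * m.
long default_mnstep(int mtype, long m) {
  assert(mtype >= 1 && mtype <= 4);
  assert(m >= 1);
  return mtype == 1 ? m : 200 * m;
}

// Upper bound on the number of LARS steps (mtype = 1 only):
// min(p, n) if prey = 0, otherwise min(p, n - 1).
long max_lars_steps(long p, long n, int prey) {
  assert(p >= 1 && n >= 1);
  assert(prey == 0 || prey == 1);
  return prey == 0 ? std::min(p, n) : std::min(p, n - 1);
}
```

For instance, with $p = 30$, $n = 20$ and $prey = 1$ a LARS fit can take at most 19 steps, so the default of $m$ steps already covers it whenever $m \geq 19$.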

ropt – vector<double> array

This optional parameter may be set using the method OptionalG02MA::ropt and accessed via OptionalG02MA::get_ropt.

On entry: optional parameters to control various aspects of the LARS algorithm.

The default value will be used for $ropt (i - 1)$ if the length of ropt is less than $i$; therefore, to use the default values for all optional parameters ropt need not be set. The default value will also be used if an invalid value is supplied for a particular argument, for example, setting $ropt (i - 1) = -1$ will use the default value for argument $i$.

$ropt (0)$: The minimum step size that will be taken.
Default is $100 \times eps$ , where $eps$ is the machine precision returned by precision.
$ropt (1)$: General tolerance, used amongst other things, for comparing correlations.
Default is $ropt (0)$ .
$ropt (2)$: If set to $1$ , parameter estimates are rescaled before being returned.
If set to $0$ , no rescaling is performed.

This argument has no effect when $pred = 0$ or $1$ .

Default is for the parameter estimates to be rescaled.
$ropt (3)$: If set to $1$ , it is assumed that the model contains an intercept during the model fitting process and when calculating the degrees of freedom.
If set to $0$ , no intercept is assumed.

This has no effect on the amount of preprocessing performed on y.

Default is to treat the model as having an intercept when $prey = 1$ and as not having an intercept when $prey = 0$ .
$ropt (4)$: As implemented, the LARS algorithm can either work directly with $y$ and $X$ , or it can work with the cross-product matrices, $X^{T} y$ and $X^{T} X$ . In most cases it is more efficient to work with the cross-product matrices. This flag allows you direct control over which method is used, however, the default value will usually be the best choice.
If $ropt (4) = 1$ , $y$ and $X$ are worked with directly.

If $ropt (4) = 0$ , the cross-product matrices are used.

Default is $1$ when $p \geq 500$ and $n < p$ and $0$ otherwise.

Constraints:

$ropt (0) > machine precision$ ;
$ropt (1) > machine precision$ ;
$ropt (2) = 0$ or $1$ ;
$ropt (3) = 0$ or $1$ ;
$ropt (4) = 0$ or $1$ .
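
The defaulting rules for ropt (a short array, or an invalid entry, falls back to the default) can be sketched as below. This is not the library's code: `RoptSettings` and `resolve_ropt` are hypothetical names, and `std::numeric_limits<double>::epsilon()` is used as a stand-in for the library's machine precision, which may be defined differently. The default for $ropt (1)$ is taken to be the resolved value of $ropt (0)$, which is also an assumption.

```cpp
#include <cassert>
#include <cstddef>
#include <limits>
#include <vector>

// Resolved settings, with hypothetical field names for readability.
struct RoptSettings {
  double min_step;   // ropt(0): minimum step size
  double tolerance;  // ropt(1): general tolerance
  bool rescale;      // ropt(2): rescale parameter estimates
  bool intercept;    // ropt(3): treat the model as having an intercept
  bool direct;       // ropt(4): true = work with y and X directly
};

// True if element i was supplied and holds exactly 0 or 1.
static bool is_flag(const std::vector<double> &ropt, std::size_t i) {
  return i < ropt.size() && (ropt[i] == 0.0 || ropt[i] == 1.0);
}

RoptSettings resolve_ropt(const std::vector<double> &ropt, long p, long n,
                          int prey) {
  assert(p >= 1 && n >= 1);
  assert(prey == 0 || prey == 1);
  // Assumption: stand-in for the library's machine precision.
  const double eps = std::numeric_limits<double>::epsilon();
  RoptSettings s;
  s.min_step = (ropt.size() > 0 && ropt[0] > eps) ? ropt[0] : 100.0 * eps;
  s.tolerance = (ropt.size() > 1 && ropt[1] > eps) ? ropt[1] : s.min_step;
  s.rescale = is_flag(ropt, 2) ? (ropt[2] == 1.0) : true;
  s.intercept = is_flag(ropt, 3) ? (ropt[3] == 1.0) : (prey == 1);
  s.direct = is_flag(ropt, 4) ? (ropt[4] == 1.0) : (p >= 500 && n < p);
  return s;
}
```

Passing an empty vector yields every default, while a vector such as (-1, 1e-8, 0) keeps the default minimum step, sets the tolerance to 1e-8 and turns rescaling off.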
