NAG Library Function Document

nag_step_regsn (g02eec)

nag_step_regsn (Nag_OrderType order, Integer *istep, Nag_IncludeMean mean, Integer n, Integer m, const double x[], Integer pdx, const char *var_names[], const Integer sx[], Integer maxip, const double y[], const double wt[], double fin, Nag_Boolean *addvar, const char *newvar[], double *chrss, double *f, const char *model[], Integer *nterm, double *rss, Integer *idf, Integer *ifr, const char *free_vars[], double exss[], double q[], Integer pdq, double p[], NagError *fail)

3 Description

One method of selecting a linear regression model from a given set of independent variables is by forward selection. The following procedure is used:

(i)	Select the best fitting independent variable, i.e., the independent variable which gives the smallest residual sum of squares. If the $F$ -test for this variable is greater than a chosen critical value, $F_{c}$ , then include the variable in the model, else stop.
(ii)	Find the independent variable that leads to the greatest reduction in the residual sum of squares when added to the current model.
(iii)	If the $F$ -test for this variable is greater than a chosen critical value, $F_{c}$ , then include the variable in the model and go to (ii), otherwise stop.

At any step the variables not in the model are known as the free terms.
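The test applied in steps (i) and (iii) can be sketched in C as follows. This is a minimal sketch, assuming the usual partial $F$ statistic, i.e., the reduction in the residual sum of squares divided by the new residual mean square; this document does not spell out the formula, and the helper names f_to_enter and should_add are hypothetical and not part of the NAG Library.

```c
#include <stdbool.h>

/* Hypothetical helper: partial F statistic for adding one variable.
 * rss_old : residual sum of squares of the current model
 * rss_new : residual sum of squares after adding the candidate
 * df_new  : residual degrees of freedom after adding the candidate
 * Assumes rss_new > 0 and df_new > 0. */
double f_to_enter(double rss_old, double rss_new, int df_new)
{
    double change = rss_old - rss_new;   /* reduction in RSS */
    return change / (rss_new / (double)df_new);
}

/* The candidate enters the model only if F exceeds the critical value. */
bool should_add(double f, double f_crit)
{
    return f > f_crit;
}
```

For instance, a candidate that reduces the residual sum of squares from 100 to 50 with 10 residual degrees of freedom remaining would, under this assumption, be compared against $F_{c}$ with should_add(f_to_enter(100.0, 50.0, 10), f_crit).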

nag_step_regsn (g02eec) allows you to specify some independent variables that must be in the model; these are known as forced variables.

The computational procedure involves the use of $QR$ decompositions, the $R$ and the $Q$ matrices being updated as each new variable is added to the model. In addition the matrix $Q^{T} X_{free}$, where $X_{free}$ is the matrix of variables not included in the model, is updated.

nag_step_regsn (g02eec) computes one step of the forward selection procedure at a call. The results produced at each step may be printed or used as inputs to nag_regsn_mult_linear_upd_model (g02ddc), in order to compute the regression coefficients for the model fitted at that step. Repeated calls to nag_step_regsn (g02eec) should be made until $F < F_{c}$ is indicated.
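The calling pattern described above might look like the following sketch. To keep it self-contained it does not call the NAG Library: step_fn is a hypothetical stand-in for one call to nag_step_regsn (g02eec), and run_forward_selection and demo_step are illustrative names only. The assumptions taken from the text are that istep starts at 0, is incremented by each call, and that the loop ends when no variable is added.

```c
#include <stdbool.h>

/* Hypothetical stand-in for one selection step. It must increment *istep
 * and set *addvar to true if a variable entered the model. A nonzero
 * return value signals an error. */
typedef int (*step_fn)(int *istep, double fin, bool *addvar, void *state);

/* Repeats the step until no variable is added or an error occurs.
 * Returns the number of variables added, or -1 on error. */
int run_forward_selection(step_fn step, double fin, void *state)
{
    int istep = 0;          /* istep = 0 initializes the process */
    int added = 0;
    bool addvar = true;

    while (addvar) {
        if (step(&istep, fin, &addvar, state) != 0)
            return -1;
        if (addvar)
            ++added;
    }
    return added;
}

/* Toy step used only to exercise the loop: state points to an int holding
 * the number of candidates that would still pass the F test. */
int demo_step(int *istep, double fin, bool *addvar, void *state)
{
    int *remaining = state;
    (void)fin;              /* the toy step ignores the critical value */
    ++*istep;
    *addvar = (*remaining > 0);
    if (*addvar)
        --*remaining;
    return 0;
}
```

In a real program the body of the step would be the call to nag_step_regsn (g02eec) itself, with all arguments other than fin left unchanged between calls.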

4 References

Draper N R and Smith H (1985) Applied Regression Analysis (2nd Edition) Wiley

Weisberg S (1985) Applied Linear Regression Wiley

5 Arguments

Note: after the initial call to nag_step_regsn (g02eec) with $istep = 0$, all arguments except fin must not be changed by you between calls.

1: order – Nag_OrderType Input

On entry: the order argument specifies the two-dimensional storage scheme being used, i.e., row-major ordering or column-major ordering. C language defined storage is specified by $order = Nag_RowMajor$. See Section 3.2.1.3 in the Essential Introduction for a more detailed explanation of the use of this argument.

Constraint: $order = Nag_RowMajor$ or $Nag_ColMajor$.

2: istep – Integer *Input/Output

On entry: indicates which step in the forward selection process is to be carried out.

$istep = 0$: The process is initialized.

Constraint: $istep \geq 0$.

On exit: istep is incremented by $1$.

3: mean – Nag_IncludeMean Input

On entry: indicates if a mean term is to be included.

$mean = Nag_MeanInclude$: A mean term, intercept, will be included in the model.
$mean = Nag_MeanZero$: The model will pass through the origin, zero-point.

Constraint: $mean = Nag_MeanInclude$ or $Nag_MeanZero$.

4: n – Integer Input

On entry: $n$, the number of observations.

Constraint: $n \geq 2$.

5: m – Integer Input

On entry: $m$, the total number of independent variables in the dataset.

Constraint: $m \geq 1$.

6: x[ $\dim$ ] – const double Input

Note: the dimension, dim, of the array x must be at least

$\max (1, pdx \times m)$ when $order = Nag_ColMajor$ ;
$\max (1, n \times pdx)$ when $order = Nag_RowMajor$ .

Where $X (i, j)$ appears in this document, it refers to the array element

$x [(j - 1) \times pdx + i - 1]$ when $order = Nag_ColMajor$ ;
$x [(i - 1) \times pdx + j - 1]$ when $order = Nag_RowMajor$ .

On entry: $X (i, j)$ must contain the $i$th observation for the $j$th independent variable, for $i = 1, 2, \dots, n$ and $j = 1, 2, \dots, m$.
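The two storage schemes for x can be illustrated with a short sketch. The enum storage_order and the helpers x_offset and x_min_dim are hypothetical names introduced here for illustration; they mirror the formulas given for this argument and are not part of the NAG Library.

```c
#include <stddef.h>

/* Illustrative stand-in for the storage order; not the NAG enum. */
typedef enum { COL_MAJOR, ROW_MAJOR } storage_order;

/* Offset into x of X(i, j), with i and j counted from 1. */
size_t x_offset(storage_order order, size_t i, size_t j, size_t pdx)
{
    return order == COL_MAJOR ? (j - 1) * pdx + (i - 1)
                              : (i - 1) * pdx + (j - 1);
}

/* Minimum length of x for n observations and m variables. */
size_t x_min_dim(storage_order order, size_t n, size_t m, size_t pdx)
{
    size_t len = order == COL_MAJOR ? pdx * m : n * pdx;
    return len > 1 ? len : 1;
}
```

With these helpers, the second observation of the third variable would be read as x[x_offset(order, 2, 3, pdx)] under either storage scheme.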

7: pdx – Integer Input

On entry: the stride separating row or column elements (depending on the value of order) in the array x.

Constraints:

if $order = Nag_ColMajor$ , $pdx \geq n$ ;
if $order = Nag_RowMajor$ , $pdx \geq m$ .

8: var_names[m] – const char * Input

On entry: $var_names [i - 1]$ must contain the name of the independent variable in column $i$ of x, for $i = 1, 2, \dots, m$.

9: sx[m] – const Integer Input

On entry: indicates which independent variables could be considered for inclusion in the regression.

$sx [j - 1] \geq 2$: The variable contained in the $j$th column of x is automatically included in the regression model, for $j = 1, 2, \dots, m$ .
$sx [j - 1] = 1$: The variable contained in the $j$th column of x is considered for inclusion in the regression model, for $j = 1, 2, \dots, m$ .
$sx [j - 1] = 0$: The variable in the $j$th column is not considered for inclusion in the model, for $j = 1, 2, \dots, m$ .

Constraint: $sx [j - 1] \geq 0$ and at least one value of $sx [j - 1] = 1$, for $j = 1, 2, \dots, m$.

10: maxip – Integer Input

On entry: the maximum number of independent variables to be included in the model.

Constraints:

if $mean = Nag_MeanInclude$ , $maxip \geq 1 +$ number of values of $sx > 0$ ;
if $mean = Nag_MeanZero$ , $maxip \geq$ number of values of $sx > 0$ .
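The smallest value of maxip allowed by these constraints can be computed from sx, as in the sketch below. The function names count_selectable and min_maxip are hypothetical, and plain int is used in place of the NAG Integer type so that the example stands alone.

```c
#include <stdbool.h>

/* Number of entries of sx that are greater than zero, i.e. variables that
 * are either forced into the model or candidates for selection. */
int count_selectable(const int sx[], int m)
{
    int count = 0;
    for (int j = 0; j < m; ++j)
        if (sx[j] > 0)
            ++count;
    return count;
}

/* Smallest maxip that satisfies the constraint: one extra term is needed
 * when a mean term is included. */
int min_maxip(const int sx[], int m, bool include_mean)
{
    return count_selectable(sx, m) + (include_mean ? 1 : 0);
}
```

For example, with sx = {2, 1, 0, 1} there are three values greater than zero, so maxip must be at least 4 when a mean term is included and at least 3 otherwise.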

11: y[n] – const double Input

On entry: the dependent variable.

12: wt[ $\dim$ ] – const double Input

Note: the dimension, dim, of the array wt must be at least $n$.

On entry: wt must contain the weights, $W$, to be used in the weighted regression.

If $wt [i - 1] = 0.0$, then the $i$th observation is not included in the model, in which case the effective number of observations is the number of observations with nonzero weights.

If weights are not provided then wt must be set to the null pointer, i.e., (double *)0, and the effective number of observations is n.

Constraint: if wt is not NULL, $wt [i] \geq 0.0$, for $i = 0, 1, \dots, n - 1$.
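The rule for the effective number of observations can be sketched as follows. The function effective_n is a hypothetical helper, not a NAG Library routine; returning -1 for a negative weight is a convention chosen for this example only.

```c
#include <stddef.h>

/* Effective number of observations: n if wt is NULL, otherwise the number
 * of nonzero weights. Returns -1 if any weight is negative. */
int effective_n(const double wt[], int n)
{
    if (wt == NULL)
        return n;

    int count = 0;
    for (int i = 0; i < n; ++i) {
        if (wt[i] < 0.0)
            return -1;      /* violates the constraint wt[i] >= 0.0 */
        if (wt[i] != 0.0)
            ++count;
    }
    return count;
}
```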

13: fin – double Input

On entry: the critical value of the $F$ statistic for the term to be included in the model, $F_{c}$.

Suggested value: $2.0$ is a commonly used value in exploratory modelling.

Constraint: $fin \geq 0.0$.

14: addvar – Nag_Boolean *Output

On exit: indicates if a variable has been added to the model.

$addvar = Nag_TRUE$: A variable has been added to the model.
$addvar = Nag_FALSE$: No variable had an $F$ value greater than $F_{c}$ and none were added to the model.

15: newvar[ $1$ ] – const char *Output

On exit: if $addvar = Nag_TRUE$, newvar contains the name of the variable added to the model.

16: chrss – double *Output

On exit: if $addvar = Nag_TRUE$, chrss contains the change in the residual sum of squares due to adding variable newvar.

17: f – double *Output

On exit: if $addvar = Nag_TRUE$, f contains the $F$ statistic for the inclusion of the variable in newvar.

18: model[maxip] – const char *Input/Output

On entry: if $istep = 0$, model need not be set.

If $istep \neq 0$, model must contain the values returned by the previous call to nag_step_regsn (g02eec).

On exit: the names of the variables in the current model.

19: nterm – Integer *Input/Output

On entry: if $istep = 0$, nterm need not be set.

If $istep \neq 0$, nterm must contain the value returned by the previous call to nag_step_regsn (g02eec).

On exit: the number of independent variables in the current model, not including the mean, if any.

20: rss – double *Input/Output

On entry: if $istep = 0$, rss need not be set.

If $istep \neq 0$, rss must contain the value returned by the previous call to nag_step_regsn (g02eec).

On exit: the residual sum of squares for the current model.

21: idf – Integer *Input/Output

On entry: if $istep = 0$, idf need not be set.

If $istep \neq 0$, idf must contain the value returned by the previous call to nag_step_regsn (g02eec).

On exit: the degrees of freedom for the residual sum of squares for the current model.

22: ifr – Integer *Input/Output

On entry: if $istep = 0$, ifr need not be set.

If $istep \neq 0$, ifr must contain the value returned by the previous call to nag_step_regsn (g02eec).

On exit: the number of free independent variables, i.e., the number of variables not in the model that are still being considered for selection.

23: free_vars[maxip] – const char *Input/Output

On entry: if $istep = 0$, free_vars need not be set.

If $istep \neq 0$, free_vars must contain the values returned by the previous call to nag_step_regsn (g02eec).

On exit: the first ifr values of free_vars contain the names of the free variables.

24: exss[maxip] – double Output

On exit: the first ifr values of exss contain what would be the change in regression sum of squares if the free variables had been added to the model, i.e., the extra sum of squares for the free variables. $exss [i - 1]$ contains what would be the change in regression sum of squares if the variable $free_vars [i - 1]$ had been added to the model.

25: q[ $\dim$ ] – double Input/Output

Note: the dimension, dim, of the array q must be at least

$\max (1, pdq \times (maxip + 2))$ when $order = Nag_ColMajor$ ;
$\max (1, n \times pdq)$ when $order = Nag_RowMajor$ .

The $(i, j)$th element of the matrix $Q$ is stored in

$q [(j - 1) \times pdq + i - 1]$ when $order = Nag_ColMajor$ ;
$q [(i - 1) \times pdq + j - 1]$ when $order = Nag_RowMajor$ .

On entry: if $istep = 0$, q need not be set.

If $istep \neq 0$, q must contain the values returned by the previous call to nag_step_regsn (g02eec).

On exit: the results of the $QR$ decomposition for the current model:

the first column of q contains $c = Q^{T} y$ (or $Q^{T} W^{\frac{1}{2}} y$ where $W$ is the vector of weights if used);
the upper triangular part of columns $2$ to $p + 1$ contain the $R$ matrix;
the strictly lower triangular part of columns $2$ to $p + 1$ contain details of the $Q$ matrix;
the remaining $p + 1$ to $p + ifr$ columns of q contain $Q^{T} X_{free}$ (or $Q^{T} W^{\frac{1}{2}} X_{free}$ ),

where $p = nterm$, or $p = nterm + 1$ if $mean = Nag_MeanInclude$.

26: pdq – Integer Input

On entry: the stride separating row or column elements (depending on the value of order) in the array q.

Constraints:

if $order = Nag_ColMajor$ , $pdq \geq n$ ;
if $order = Nag_RowMajor$ , $pdq \geq maxip + 2$ .
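A quick check of the stride and storage needed for q can be sketched as below. The helpers pdq_is_valid and q_min_dim are hypothetical names. The sketch assumes that q holds $maxip + 2$ columns, so that the column-major bound is $pdq \times (maxip + 2)$; this reading is consistent with the row-major constraint $pdq \geq maxip + 2$ but is an interpretation rather than a statement taken verbatim from the library.

```c
#include <stdbool.h>
#include <stddef.h>

/* col_major = true selects column-major storage, false row-major. */
bool pdq_is_valid(bool col_major, size_t pdq, size_t n, size_t maxip)
{
    return col_major ? pdq >= n : pdq >= maxip + 2;
}

/* Minimum length of q, assuming q has maxip + 2 columns. */
size_t q_min_dim(bool col_major, size_t pdq, size_t n, size_t maxip)
{
    size_t len = col_major ? pdq * (maxip + 2) : n * pdq;
    return len > 1 ? len : 1;
}
```

For $n = 10$ and $maxip = 4$, the tightest valid strides are $pdq = 10$ for column-major and $pdq = 6$ for row-major storage, and under the stated assumption both give a minimum length of 60.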

27: p[ $maxip + 1$ ] – double Input/Output

On entry: if $istep = 0$, p need not be set.

If $istep \neq 0$, p must contain the values returned by the previous call to nag_step_regsn (g02eec).

On exit: the first $p$ elements of p contain details of the $QR$ decomposition, where $p = nterm$, or $p = nterm + 1$ if $mean = Nag_MeanInclude$.

28: fail – NagError *Input/Output

The NAG error argument (see Section 3.6 in the Essential Introduction).

6 Error Indicators and Warnings

NE_ALLOC_FAIL: Dynamic memory allocation failed.
NE_BAD_PARAM: On entry, argument $⟨value⟩$ had an illegal value.
NE_DENOM_ZERO: Denominator of f statistic is $\leq 0.0$ .
NE_FREE_VARS: There are no free variables in the regression.
NE_FULL_RANK: Forced variables not of full rank.
NE_INT: On entry, $istep = ⟨value⟩$ .
Constraint: $istep \geq 0$ .
On entry, $m = ⟨value⟩$ .
Constraint: $m \geq 1$ .
On entry, $n = ⟨value⟩$ .
Constraint: $n \geq 2$ .
On entry, $pdq = ⟨value⟩$ .
Constraint: $pdq > 0$ .
On entry, $pdx = ⟨value⟩$ .
Constraint: $pdx > 0$ .
NE_INT_2: On entry, istep and nterm are inconsistent: $istep = ⟨value⟩$ and $nterm = ⟨value⟩$ .
On entry, $pdq = ⟨value⟩$ and $n = ⟨value⟩$ .
Constraint: $pdq \geq n$ .
On entry, $pdx = ⟨value⟩$ and $m = ⟨value⟩$ .
Constraint: $pdx \geq m$ .
On entry, $pdx = ⟨value⟩$ and $n = ⟨value⟩$ .
Constraint: $pdx \geq n$ .
NE_INT_ARRAY: On entry, maxip is too small for number of terms given by sx: $maxip = ⟨value⟩$ .
NE_INT_ARRAY_ELEM_CONS: On entry, $sx [⟨value⟩] < 0$ .
NE_INTERNAL_ERROR: An internal error has occurred in this function. Check the function call and any array sizes. If the call is correct then please contact NAG for assistance.
NE_REAL: On entry, $fin = ⟨value⟩$ .
Constraint: $fin \geq 0.0$ .
On entry, with nonzero istep, $rss \leq 0.0$ : $rss = ⟨value⟩$ .
NE_REAL_ARRAY_ELEM_CONS: On entry, $wt [⟨value⟩] < 0.0$ .
NE_ZERO_DF: Degrees of freedom for error will equal $0$ if new variable is added.
On entry, number of forced variables $\geq n$ , i.e., idf would be zero.
NE_ZERO_VARS: Maximum number of variables to be included is $0$ .

7 Accuracy

As nag_step_regsn (g02eec) uses a $QR$ transformation the results will often be more accurate than traditional algorithms using methods based on the cross-products of the dependent and independent variables.

8 Parallelism and Performance

nag_step_regsn (g02eec) is threaded by NAG for parallel execution in multithreaded implementations of the NAG Library.

nag_step_regsn (g02eec) makes calls to BLAS and/or LAPACK routines, which may be threaded within the vendor library used by this implementation. Consult the documentation for the vendor library for further information.

Please consult the Users' Note for your implementation for any additional implementation-specific information.

9 Further Comments

None.

10 Example

The data, from an oxygen uptake experiment, is given by Weisberg (1985). The names of the variables are as given in Weisberg (1985). The independent and dependent variables are read and nag_step_regsn (g02eec) is repeatedly called until $addvar = Nag_FALSE$. At each step the $F$ statistic, the free variables and their extra sum of squares are printed; also, except for when $addvar = Nag_FALSE$, the new variable, the change in the residual sum of squares and the terms in the model are printed.
