NAG CL Interface
g02dac (linregm_fit)
1
Purpose
g02dac performs a general multiple linear regression when the independent variables may be linearly dependent. Parameter estimates, standard errors, residuals and influence statistics are computed. g02dac may be used to perform a weighted regression.
2
Specification
void g02dac (Nag_IncludeMean mean,
             Integer n,
             const double x[],
             Integer tdx,
             Integer m,
             const Integer sx[],
             Integer ip,
             const double y[],
             const double wt[],
             double *rss,
             double *df,
             double b[],
             double se[],
             double cov[],
             double res[],
             double h[],
             double q[],
             Integer tdq,
             Nag_Boolean *svd,
             Integer *rank,
             double p[],
             double tol,
             double com_ar[],
             NagError *fail)
The function may be called by the names: g02dac, nag_correg_linregm_fit or nag_regsn_mult_linear.
3
Description
The general linear regression model is defined by
$$y = X\beta + \varepsilon,$$
where
- $y$ is a vector of $n$ observations on the dependent variable,
- $X$ is an $n$ by $p$ matrix of the independent variables of column rank $k$,
- $\beta$ is a vector of length $p$ of unknown arguments, and
- $\varepsilon$ is a vector of length $n$ of unknown random errors such that $\operatorname{var}\varepsilon = V\sigma^2$, where $V$ is a known diagonal matrix.
Note: the $p$ independent variables may be selected from a set of $m$ potential independent variables.
If $V = I$, the identity matrix, then least squares estimation is used.
If $V \ne I$, then for a given weight matrix $W \propto V^{-1}$, weighted least squares estimation is used.
The least squares estimates $\hat\beta$ of the arguments $\beta$ minimize $(y - X\beta)^T (y - X\beta)$ while the weighted least squares estimates minimize $(y - X\beta)^T W (y - X\beta)$.
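As a brief worked step from standard least squares theory (not specific to this function), setting the gradient of each objective to zero gives the normal equations. Here $y$ is the vector of observations, $X$ the matrix of independent variables, $\beta$ the vector of arguments and $W$ the weight matrix. g02dac does not form these products explicitly; it works from the decomposition described next.

```latex
\begin{aligned}
\frac{\partial}{\partial\beta}\,(y - X\beta)^{T}(y - X\beta)
  &= -2X^{T}(y - X\beta) = 0
  &&\Longrightarrow\quad X^{T}X\,\hat\beta = X^{T}y, \\
\frac{\partial}{\partial\beta}\,(y - X\beta)^{T}W(y - X\beta)
  &= -2X^{T}W(y - X\beta) = 0
  &&\Longrightarrow\quad X^{T}WX\,\hat\beta = X^{T}Wy.
\end{aligned}
```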
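g02dac finds a $QR$ decomposition of $X$ (or $W^{1/2}X$ in the weighted case), i.e.,
$$X = QR^* \quad (\text{or } W^{1/2}X = QR^*),$$
where $R^* = \begin{pmatrix} R \\ 0 \end{pmatrix}$ and $R$ is a $p$ by $p$ upper triangular matrix and $Q$ is an $n$ by $n$ orthogonal matrix.
If $R$ is of full rank, then $\hat\beta$ is the solution to
$$R\hat\beta = c_1,$$
where $c = Q^T y$ (or $Q^T W^{1/2} y$) and $c_1$ is the first $p$ elements of $c$.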
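In the full-rank case the triangular system can be solved by back-substitution. The following is a minimal sketch of that step only, not the library's implementation; the function name `back_substitute` and the row-major layout with stride `ldr` are assumptions made for illustration.

```c
#include <stddef.h>

/* Solve R * beta = c1 by back-substitution.
 * r holds a p-by-p upper triangular matrix stored by rows with stride ldr
 * (a hypothetical layout chosen for this sketch).
 * Returns 0 on success, -1 if a diagonal element is exactly zero,
 * i.e. R is not of full rank and this approach does not apply. */
int back_substitute(size_t p, const double *r, size_t ldr,
                    const double *c1, double *beta)
{
    for (size_t k = p; k-- > 0;) {
        double s = c1[k];
        /* Subtract the contribution of the already-solved unknowns. */
        for (size_t j = k + 1; j < p; ++j)
            s -= r[k * ldr + j] * beta[j];
        if (r[k * ldr + k] == 0.0)
            return -1;
        beta[k] = s / r[k * ldr + k];
    }
    return 0;
}
```

A nonzero return corresponds to the rank-deficient situation, for which the function uses a singular value decomposition instead.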
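If $R$ is not of full rank a solution is obtained by means of a singular value decomposition (SVD) of $R$,
$$R = Q_* \begin{pmatrix} D & 0 \\ 0 & 0 \end{pmatrix} P^T,$$
where $D$ is a $k$ by $k$ diagonal matrix with nonzero diagonal elements, $k$ being the rank of $R$, and $Q_*$ and $P$ are $p$ by $p$ orthogonal matrices. This gives the solution
$$\hat\beta = P_1 D^{-1} Q_{*1}^T c_1,$$
$P_1$ being the first $k$ columns of $P$, i.e., $P = (P_1 \; P_0)$, and $Q_{*1}$ being the first $k$ columns of $Q_*$.
Details of the SVD are made available, in the form of the matrix $P^*$:
$$P^* = \begin{pmatrix} D^{-1} P_1^T \\ P_0^T \end{pmatrix}.$$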
This will be only one of the possible solutions. Other estimates may be obtained by applying constraints to the arguments. These solutions can be obtained by using g02dkc after using g02dac. Only certain linear combinations of the arguments will have unique estimates; these are known as estimable functions.
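In practice the rank is estimated numerically. Section 5 describes the output rank as the number of singular values greater than tol times the largest singular value. A minimal sketch of that counting rule follows; the helper name `estimate_rank` is hypothetical and the library's internal test may differ in detail.

```c
#include <stddef.h>

/* Count the singular values strictly greater than
 * tol * (largest singular value). sv holds p non-negative values. */
size_t estimate_rank(size_t p, const double *sv, double tol)
{
    double largest = 0.0;
    for (size_t i = 0; i < p; ++i)
        if (sv[i] > largest)
            largest = sv[i];

    size_t rank = 0;
    for (size_t i = 0; i < p; ++i)
        if (sv[i] > tol * largest)
            ++rank;
    return rank;
}
```

With this rule a larger tol discards more small singular values and so reports a lower rank.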
The fit of the model can be examined by considering the residuals, $r_i = y_i - \hat{y}_i$, where $\hat{y} = X\hat\beta$ are the fitted values. The fitted values can be written as $Hy$ for an $n$ by $n$ matrix $H$. The $i$th diagonal element of $H$, $h_i$, gives a measure of the influence of the $i$th value of the independent variables on the fitted regression model. The values $h_i$ are sometimes known as leverages. Both $r_i$ and $h_i$ are provided by g02dac.
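For the simplest case, an unweighted straight-line fit with an intercept on a single variable $t$, the leverages have the closed form $h_i = 1/n + (t_i - \bar{t})^2 / \sum_j (t_j - \bar{t})^2$. The sketch below evaluates that formula to make the idea concrete. It is not how g02dac obtains the leverages, and the name `simple_leverages` is hypothetical.

```c
#include <stddef.h>

/* Leverages for the unweighted model y = b0 + b1 * t:
 *   h[i] = 1/n + (t[i] - mean)^2 / sum_j (t[j] - mean)^2.
 * Returns 0 on success, -1 if n is zero or all t values coincide. */
int simple_leverages(size_t n, const double *t, double *h)
{
    if (n == 0)
        return -1;

    double mean = 0.0;
    for (size_t i = 0; i < n; ++i)
        mean += t[i];
    mean /= (double)n;

    double sxx = 0.0;
    for (size_t i = 0; i < n; ++i)
        sxx += (t[i] - mean) * (t[i] - mean);
    if (sxx == 0.0)
        return -1;

    for (size_t i = 0; i < n; ++i)
        h[i] = 1.0 / (double)n + (t[i] - mean) * (t[i] - mean) / sxx;
    return 0;
}
```

The leverages of a full-rank model sum to the number of fitted arguments (two here), so points far from the mean of $t$ take a larger share.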
The output of g02dac also includes $\hat\beta$, the residual sum of squares and associated degrees of freedom, $(n - k)$, the standard errors of the parameter estimates and the variance-covariance matrix of the parameter estimates.
In many linear regression models the first term is taken as a mean term or an intercept, i.e., $X_{i,1} = 1$, for $i = 1, 2, \ldots, n$. This is provided as an option. Also note that not all the potential independent variables need to be included in a model; a facility to select variables to be included in the model is provided.
Details of the $QR$ decomposition and, if used, the SVD, are made available. These allow the regression to be updated by adding or deleting an observation using g02dcc, adding or deleting a variable using g02dec and g02dfc, or estimating and testing an estimable function using g02dnc.
4
References
Cook R D and Weisberg S (1982) Residuals and Influence in Regression Chapman and Hall
Draper N R and Smith H (1985) Applied Regression Analysis (2nd Edition) Wiley
Golub G H and Van Loan C F (1996) Matrix Computations (3rd Edition) Johns Hopkins University Press, Baltimore
Hammarling S (1985) The singular value decomposition in multivariate statistics SIGNUM Newsl. 20(3) 2–25
McCullagh P and Nelder J A (1983) Generalized Linear Models Chapman and Hall
Searle S R (1971) Linear Models Wiley
5
Arguments
- 1: mean – Nag_IncludeMean (Input)
  On entry: indicates if a mean term is to be included.
  - mean = Nag_MeanInclude: a mean term (intercept) will be included in the model.
  - mean = Nag_MeanZero: the model will pass through the origin (zero point).
  Constraint: mean = Nag_MeanInclude or Nag_MeanZero.
- 2: n – Integer (Input)
  On entry: the number of observations, $n$.
  Constraint: n ≥ 2.
- 3: x[n*tdx] – const double (Input)
  On entry: x[(i-1)*tdx + j-1] must contain the $i$th observation for the $j$th potential independent variable, for $i = 1, 2, \ldots, n$ and $j = 1, 2, \ldots, m$.
- 4: tdx – Integer (Input)
  On entry: the stride separating matrix column elements in the array x.
  Constraint: tdx ≥ m.
- 5: m – Integer (Input)
  On entry: the total number of independent variables in the dataset, $m$.
  Constraint: m ≥ 1.
- 6: sx[m] – const Integer (Input)
  On entry: indicates which of the potential independent variables are to be included in the model. If sx[j-1] > 0, then the variable contained in the corresponding column of x is included in the regression model.
  Constraints:
  - sx[j-1] ≥ 0, for $j = 1, 2, \ldots, m$;
  - if mean = Nag_MeanInclude, then exactly ip − 1 values of sx must be > 0;
  - if mean = Nag_MeanZero, then exactly ip values of sx must be > 0.
- 7: ip – Integer (Input)
  On entry: the number of independent variables in the model, $p$, including the mean or intercept if present.
  Constraints:
  - if mean = Nag_MeanInclude, 1 ≤ ip ≤ m + 1;
  - if mean = Nag_MeanZero, 1 ≤ ip ≤ m.
- 8: y[n] – const double (Input)
  On entry: observations on the dependent variable, $y$.
- 9: wt[n] – const double (Input)
  On entry: optionally, the weights to be used in the weighted regression.
  If wt[i-1] = 0.0, then the $i$th observation is not included in the model, in which case the effective number of observations is the number of observations with nonzero weights. The values of res and h will be set to zero for observations with zero weights.
  If weights are not provided then wt must be set to NULL and the effective number of observations is n.
  Constraint: if wt is not NULL, wt[i-1] ≥ 0.0, for $i = 1, 2, \ldots, n$.
- 10: rss – double * (Output)
  On exit: the residual sum of squares for the regression.
- 11: df – double * (Output)
  On exit: the degrees of freedom associated with the residual sum of squares.
- 12: b[ip] – double (Output)
  On exit: b[i-1], for $i = 1, 2, \ldots,$ ip, contain the least squares estimates of the arguments of the regression model, $\hat\beta$.
  If mean = Nag_MeanInclude, then b[0] will contain the estimate of the mean argument and b[i] will contain the coefficient of the variable contained in column $j$ of x, where sx[j-1] is the $i$th positive value in the array sx.
  If mean = Nag_MeanZero, then b[i-1] will contain the coefficient of the variable contained in column $j$ of x, where sx[j-1] is the $i$th positive value in the array sx.
- 13: se[ip] – double (Output)
  On exit: se[i-1], for $i = 1, 2, \ldots,$ ip, contains the standard errors of the ip parameter estimates given in b.
- 14: cov[ip*(ip+1)/2] – double (Output)
  On exit: the first ip*(ip+1)/2 elements of cov contain the upper triangular part of the variance-covariance matrix of the ip parameter estimates given in b. They are stored packed by column, i.e., the covariance between the parameter estimate given in b[i] and the parameter estimate given in b[j], $j \ge i$, is stored in cov[j*(j+1)/2 + i], for $i = 0, 1, \ldots,$ ip − 1 and $j = i, i+1, \ldots,$ ip − 1.
- 15: res[n] – double (Output)
  On exit: the (weighted) residuals, $r_i$.
- 16: h[n] – double (Output)
  On exit: the diagonal elements of $H$, $h_i$, the leverages.
- 17: q[n*tdq] – double (Output)
  Note: the $(i,j)$th element of the matrix $Q$ is stored in q[(i-1)*tdq + j-1].
  On exit: the results of the $QR$ decomposition: the first column of q contains $c$, the upper triangular part of columns 2 to ip + 1 contain the $R$ matrix, the strictly lower triangular part of columns 2 to ip + 1 contain details of the $Q$ matrix.
- 18: tdq – Integer (Input)
  On entry: the stride separating matrix column elements in the array q.
  Constraint: tdq ≥ ip + 1.
- 19: svd – Nag_Boolean * (Output)
  On exit: if a singular value decomposition has been performed then svd will be Nag_TRUE, otherwise svd will be Nag_FALSE.
- 20: rank – Integer * (Output)
  On exit: the rank of the independent variables.
  If svd = Nag_FALSE, rank = ip.
  If svd = Nag_TRUE, rank is an estimate of the rank of the independent variables.
  rank is calculated as the number of singular values greater than tol × (largest singular value). It is possible for the SVD to be carried out but rank to be returned as ip.
- 21: p[2*ip + ip*ip] – double (Output)
  On exit: details of the $QR$ decomposition and SVD if used.
  If svd = Nag_FALSE, only the first ip elements of p are used; these will contain details of the Householder vector in the $QR$ decomposition (see Sections 2.2.1 and 3.4.6 in the F08 Chapter Introduction).
  If svd = Nag_TRUE, the first ip elements of p will contain details of the Householder vector in the $QR$ decomposition and the next ip elements of p contain singular values. The following ip by ip elements contain the matrix $P^*$ stored by rows.
- 22: tol – double (Input)
  On entry: the value of tol is used to decide what is the rank of the independent variables. The smaller the value of tol the stricter the criterion for selecting the singular value decomposition. If tol = 0.0, then the singular value decomposition will never be used; this may cause run time errors or inaccurate results if the independent variables are not of full rank.
  Suggested value: tol = 0.000001.
  Constraint: tol ≥ 0.0.
- 23: com_ar[] – double (Output)
  On exit: if svd = Nag_TRUE, com_ar contains information which is needed by g02dgc.
- 24: fail – NagError * (Input/Output)
  The NAG error argument (see Section 7 in the Introduction to the NAG Library CL Interface).
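The storage conventions for x, q and cov described above can be captured in two small index helpers. This is a sketch only: the names `mat_offset` and `cov_offset` are hypothetical, and it assumes the row-major convention in which the (i, j)th element (counting from 1) sits at offset (i-1)*stride + (j-1), together with column-packed upper triangular storage for cov.

```c
#include <stddef.h>

/* Offset of the (i, j)th matrix element, i and j counted from 1,
 * in an array stored by rows with the given stride (tdx or tdq). */
size_t mat_offset(size_t i, size_t j, size_t stride)
{
    return (i - 1) * stride + (j - 1);
}

/* Offset in cov of the covariance between b[i] and b[j],
 * i and j counted from 0. The matrix is symmetric, so the
 * indices are swapped if needed to address the upper triangle. */
size_t cov_offset(size_t i, size_t j)
{
    if (i > j) {
        size_t tmp = i;
        i = j;
        j = tmp;
    }
    return j * (j + 1) / 2 + i;
}
```

For instance, the variance of the estimate in b[j] is at `cov_offset(j, j)`, so its square root should agree with se[j].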
6
Error Indicators and Warnings
- NE_2_INT_ARG_LT
-
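On entry, n = ⟨value⟩ while ip = ⟨value⟩. These arguments must satisfy n ≥ ip.
On entry, tdq = ⟨value⟩ while ip + 1 = ⟨value⟩. These arguments must satisfy tdq ≥ ip + 1.
On entry, tdx = ⟨value⟩ while m = ⟨value⟩. These arguments must satisfy tdx ≥ m.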
- NE_ALLOC_FAIL
-
Dynamic memory allocation failed.
- NE_BAD_PARAM
-
On entry, argument mean had an illegal value.
- NE_BAD_SX_OR_IP
-
Either a value of sx is < 0, or ip is incompatible with mean and sx, or ip > the effective number of observations.
- NE_INT_ARG_LT
-
On entry, ip = ⟨value⟩.
Constraint: ip ≥ 1.
On entry, m = ⟨value⟩.
Constraint: m ≥ 1.
On entry, n = ⟨value⟩.
Constraint: n ≥ 2.
On entry, sx[⟨value⟩] must not be less than 0: sx[⟨value⟩] = ⟨value⟩.
- NE_REAL_ARG_LT
-
On entry, tol must not be less than 0.0: tol = ⟨value⟩.
On entry, wt[⟨value⟩] must not be less than 0.0: wt[⟨value⟩] = ⟨value⟩.
- NE_SVD_NOT_CONV
-
The singular value decomposition has failed to converge.
- NE_ZERO_DOF_RESID
-
The degrees of freedom for the residuals are zero, i.e., the designated number of arguments = the effective number of observations. In this case the parameter estimates will be returned along with the diagonal elements of $H$, but neither standard errors nor the variance-covariance matrix will be calculated.
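The zero-degrees-of-freedom condition can be anticipated before fitting. The sketch below counts the effective number of observations as described for wt and subtracts the number of estimated arguments. The helper names are hypothetical, and treating the residual degrees of freedom as exactly this difference is an assumption based on the $(n - k)$ expression in Section 3.

```c
#include <stddef.h>

/* Effective number of observations: n if no weights are supplied
 * (wt == NULL), otherwise the number of nonzero weights. */
size_t effective_n(size_t n, const double *wt)
{
    if (wt == NULL)
        return n;
    size_t count = 0;
    for (size_t i = 0; i < n; ++i)
        if (wt[i] != 0.0)
            ++count;
    return count;
}

/* Assumed residual degrees of freedom: effective observations minus
 * the rank of the model. A result of zero corresponds to the
 * situation reported as NE_ZERO_DOF_RESID. */
long residual_df(size_t n, const double *wt, size_t rank)
{
    return (long)effective_n(n, wt) - (long)rank;
}
```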
7
Accuracy
The accuracy of this function is closely related to the accuracy of the $QR$ decomposition.
8
Parallelism and Performance
g02dac is not threaded in any implementation.
9
Further Comments
Function g02fac can be used to compute standardized residuals and further measures of influence. g02fac requires, in particular, the results stored in res and h.
9.1
Internal Changes
Internal changes have been made to this function as follows:
For details of all known issues which have been reported for the NAG Library please refer to the Known Issues.
10
Example
For this function two examples are presented. There is a single example program for g02dac, with a main program; the code to solve the two example problems is given in the functions ex1 and ex2.
Example 1 (ex1)
Data from an experiment with four treatments and three observations per treatment are read in. The treatments are represented by dummy variables. An unweighted model is fitted with a mean included in the model.
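A design of this kind can be set up as an indicator matrix with one column per treatment. The sketch below is illustrative only: the name `fill_dummies` and the ordering of the observations are assumptions, and the real example data file is not reproduced here.

```c
#include <stddef.h>

/* Fill an n-by-ngroups indicator (dummy variable) matrix, stored by
 * rows with stride tdx. group[i] is the treatment of observation i,
 * numbered from 0 to ngroups - 1. */
void fill_dummies(size_t n, size_t ngroups, const size_t *group,
                  double *x, size_t tdx)
{
    for (size_t i = 0; i < n; ++i)
        for (size_t j = 0; j < ngroups; ++j)
            x[i * tdx + j] = (group[i] == j) ? 1.0 : 0.0;
}
```

Each row of such a matrix sums to one, so when a mean term is also included the dummy columns add up to the intercept column and the model cannot be of full rank. This is the situation that the SVD-based solution of Section 3 is designed to handle.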
Example 2 (ex2)
This example program uses g02dac to find the coefficients of the degree $n$ polynomial
$$p(x) = a_n x^n + a_{n-1} x^{n-1} + \cdots + a_1 x + a_0$$
that fits the data, $p(x_i)$ to $y_i$, in a least squares sense.
In this example g02dac is called with both mean = Nag_MeanInclude and mean = Nag_MeanZero. The polynomial degree, the number of data points and the tolerance can be modified using the example data file.
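A polynomial fit becomes a linear regression once each power of the abscissa is treated as a separate independent variable. The following sketch fills such a matrix of powers; the name `fill_powers` is hypothetical, and whether the example program builds its data this way is an assumption.

```c
#include <stddef.h>

/* Fill an n-by-degree matrix, stored by rows with stride tdx, whose
 * row i holds t[i], t[i]^2, ..., t[i]^degree. The constant term is
 * left out here on the assumption that it is supplied by including
 * a mean term in the model. */
void fill_powers(size_t n, size_t degree, const double *t,
                 double *x, size_t tdx)
{
    for (size_t i = 0; i < n; ++i) {
        double power = 1.0;
        for (size_t j = 0; j < degree; ++j) {
            power *= t[i];
            x[i * tdx + j] = power;
        }
    }
}
```

High powers of closely spaced abscissae are nearly linearly dependent, which is one reason the tolerance matters in a fit of this kind.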