When selecting a linear regression model for a set of $n$ observations, a balance has to be found between the number of independent variables in the model and the fit as measured by the residual sum of squares. Including more variables never increases the residual sum of squares, so a small residual sum of squares alone does not identify a good model. Two statistics can help in selecting the best model.
-
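(a) $R^2$ represents the proportion of variation in the dependent variable that is explained by the independent variables.
$$R^2 = \frac{\text{Regression Sum of Squares}}{\text{Total Sum of Squares}},$$
where Total Sum of Squares $= \mathrm{tss} = \sum (y - \bar{y})^2$ (if mean is fitted, otherwise $\mathrm{tss} = \sum y^2$) and Regression Sum of Squares $= \mathrm{RegSS} = \mathrm{tss} - \mathrm{rss}$, where $\mathrm{rss}$ = residual sum of squares $= \sum (y - \hat{y})^2$.
The $R^2$-values can be examined to find a model with a high $R^2$-value but with a small number of independent variables.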
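As an illustration of the definition in (a), the following is a minimal sketch in plain C that does not call the NAG Library. The function names `total_ss` and `r_squared` and the flag `fit_mean` are hypothetical and chosen only for this example.

```c
#include <stddef.h>

/* Total sum of squares (hypothetical helper, not a NAG routine):
   sum((y - ybar)^2) when a mean is fitted,
   sum(y^2) when the model passes through the origin. */
double total_ss(const double *y, size_t n, int fit_mean)
{
    double ybar = 0.0, tss = 0.0;
    size_t i;

    if (fit_mean && n > 0) {
        for (i = 0; i < n; ++i)
            ybar += y[i];
        ybar /= (double)n;
    }
    for (i = 0; i < n; ++i) {
        double d = y[i] - ybar;
        tss += d * d;
    }
    return tss;
}

/* R^2 = (tss - rss) / tss. The caller must ensure tss > 0. */
double r_squared(double tss, double rss)
{
    return (tss - rss) / tss;
}
```

For the observations 1, 2, 3, 4 with a mean fitted, `total_ss` gives 5; a model with a residual sum of squares of 1.25 then explains three quarters of the variation.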
-
(b) $C_p$ statistic.
$$C_p = \frac{\mathrm{rss}}{\hat{\sigma}^2} - (n - 2p),$$
where $p$ is the number of parameters (including the mean) in the model and $\hat{\sigma}^2$ is an estimate of the true variance of the errors. This can often be obtained from fitting the full model.
A well-fitting model will have $C_p \simeq p$. $C_p$ is often plotted against $p$ to see which models are closest to the $C_p = p$ line.
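The statistic in (b) can be sketched in plain C as follows, assuming the usual Mallows form $C_p = \mathrm{rss}/\hat{\sigma}^2 - (n - 2p)$. The function name `mallows_cp` and its argument names are hypothetical; this is not the NAG implementation.

```c
/* Mallows' C_p for one model (hypothetical helper, not a NAG routine).
   nterms counts the independent variables without the mean;
   p = nterms + 1 when a mean is fitted, otherwise p = nterms.
   The caller must ensure sigsq > 0. */
double mallows_cp(double rss, double sigsq, int n, int nterms, int fit_mean)
{
    int p = nterms + (fit_mean ? 1 : 0);
    return rss / sigsq - (double)(n - 2 * p);
}
```

If the variance estimate is taken from the full model as $\mathrm{rss}_{\mathrm{full}}/(n - p_{\mathrm{full}})$, the full model itself has $C_p = p_{\mathrm{full}}$ exactly, so it always lies on the reference line.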
g02ecc may be called after g02eac, which calculates the residual sums of squares for all possible linear regression models.
-
1: mean – Nag_IncludeMean Input
On entry: indicates if a mean term is to be included.
- mean = Nag_MeanInclude: a mean term, intercept, will be included in the model.
- mean = Nag_MeanZero: the model will pass through the origin, zero-point.
Constraint: mean = Nag_MeanInclude or Nag_MeanZero.
-
2: n – Integer Input
On entry: $n$, the number of observations used in the regression model.
Constraint: n must be greater than $2 \times p_{\max}$, where $p_{\max}$ is the largest number of independent variables fitted (including the mean if fitted).
-
3: sigsq – double Input
On entry: the best estimate of the true variance of the errors, $\hat{\sigma}^2$.
Constraint: sigsq > 0.0.
-
4: tss – double Input
On entry: the total sum of squares for the regression model.
Constraint: tss > 0.0.
-
5: nmod – Integer Input
On entry: the number of regression models.
Constraint: nmod > 0.
-
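6: nterms[nmod] – const Integer Input
On entry: nterms[$i-1$] must contain the number of independent variables (not counting the mean) fitted to the $i$th model, for $i = 1, 2, \ldots, \mathrm{nmod}$.
-
7: rss[nmod] – const double Input
On entry: rss[$i-1$] must contain the residual sum of squares for the $i$th model.
Constraint: rss[$i-1$] $\le$ tss, for $i = 1, 2, \ldots, \mathrm{nmod}$.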
-
8: rsq[nmod] – double Output
On exit: rsq[$i-1$] contains the $R^2$-value for the $i$th model, for $i = 1, 2, \ldots, \mathrm{nmod}$.
-
9: cp[nmod] – double Output
On exit: cp[$i-1$] contains the $C_p$-value for the $i$th model, for $i = 1, 2, \ldots, \mathrm{nmod}$.
-
10: fail – NagError * Input/Output
The NAG error argument (see Section 7 in the Introduction to the NAG Library CL Interface).
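The input constraints listed for the arguments can be checked before a call. The sketch below is plain C and does not use the NAG Library. The name `check_inputs` and the status codes are hypothetical and are not NAG error codes. It assumes the constraint on the number of observations reads "greater than twice the largest number of fitted parameters", with the mean counted when fitted.

```c
/* Hypothetical status codes for this sketch; these are not NAG error codes. */
enum check_status {
    CHECK_OK = 0,
    CHECK_BAD_NMOD,   /* number of models must be positive */
    CHECK_BAD_SIGSQ,  /* variance estimate must be positive */
    CHECK_BAD_TSS,    /* total sum of squares must be positive */
    CHECK_BAD_RSS,    /* a residual sum of squares exceeds tss */
    CHECK_BAD_N       /* too few observations for the largest model */
};

/* Validate inputs against the documented constraints (assumed reading:
   n > 2 * p_max, where p_max includes the mean if fitted). */
enum check_status check_inputs(int fit_mean, long n, double sigsq, double tss,
                               long nmod, const long *nterms, const double *rss)
{
    long i, pmax = 0;

    if (nmod <= 0)
        return CHECK_BAD_NMOD;
    if (!(sigsq > 0.0))
        return CHECK_BAD_SIGSQ;
    if (!(tss > 0.0))
        return CHECK_BAD_TSS;

    for (i = 0; i < nmod; ++i) {
        long p = nterms[i] + (fit_mean ? 1 : 0);
        if (p > pmax)
            pmax = p;
        if (rss[i] > tss)
            return CHECK_BAD_RSS;
    }
    if (n <= 2 * pmax)
        return CHECK_BAD_N;
    return CHECK_OK;
}
```

Checking up front in this way separates data problems, such as a residual sum of squares larger than the total, from failures reported through the error argument.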
Accuracy is sufficient for all practical purposes.
None.
The data, from an oxygen uptake experiment, are given by Weisberg (1985). The independent and dependent variables are read and the residual sums of squares for all possible models are computed using g02eac. The values of $R^2$ and $C_p$ are then computed and printed along with the names of the variables in the models.