NAG Library Function Document

nag_simple_linear_regression (g02cac)

2 Specification

void nag_simple_linear_regression (Nag_SumSquare mean, Integer n, const double x[], const double y[], const double wt[], double *a, double *b, double *a_serr, double *b_serr, double *rsq, double *rss, double *df, NagError *fail)

3 Description

nag_simple_linear_regression (g02cac) fits a straight line model of the form

$E (y) = a + b x,$

where $E (y)$ is the expected value of the variable $y$, to the data points

$(x_{1}, y_{1}), (x_{2}, y_{2}), \dots, (x_{n}, y_{n}),$

such that

$y_{i} = a + b x_{i} + e_{i}, \quad i = 1, 2, \dots, n \; (n > 2),$

where the $e_{i}$ values are independent random errors. The $i$th data point may have an associated weight $w_{i}$. Weights may be used in three situations: when $\mathrm{var} (ε_{i}) = σ^{2} / w_{i}$, when observations are to be removed from the regression by giving them zero weight, or when observations have been recorded with frequency $w_{i}$.

The regression coefficient, $b$, and the regression constant, $a$, are estimated by minimizing

$\sum_{i = 1}^{n} w_{i} e_{i}^{2}.$

If the weights option is not selected then $w_{i} = 1.0$.

The following statistics are computed:

the estimate of regression constant $\hat{a} = \bar{y} - \hat{b} \bar{x}$ ,
the estimate of regression coefficient $\hat{b} = \frac{\sum w_{i} (x_{i} - \bar{x}) (y_{i} - \bar{y})}{\sum w_{i} {(x_{i} - \bar{x})}^{2}}$ ,
the residual sum of squares $r s s = \sum w_{i} {(y_{i} - {\hat{y}}_{i})}^{2}$ ,

where the weighted means $\bar{x}$ and $\bar{y}$ are

$\bar{x} = \frac{\sum w_{i} x_{i}}{\sum w_{i}}$ and $\bar{y} = \frac{\sum w_{i} y_{i}}{\sum w_{i}} .$
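The formulas for $\hat{a}$, $\hat{b}$ and $rss$ can be sketched directly in C. The following is a minimal illustration of the arithmetic only, not the NAG implementation: the type `line_fit` and the function `fit_line` are hypothetical names introduced here, and passing `wt == NULL` for unit weights merely mirrors the convention described for the wt argument.

```c
#include <stddef.h>

/* Result of a weighted straight-line fit (illustrative type, not a NAG type). */
typedef struct {
    double a;    /* estimate of the regression constant */
    double b;    /* estimate of the regression coefficient */
    double rss;  /* weighted residual sum of squares */
} line_fit;

/* Hypothetical helper: fits y = a + b x by weighted least squares.
   Pass wt == NULL for unit weights.
   Returns 0 on success, -1 if the sum of weights is not positive or
   the weighted x values show no variation. */
int fit_line(size_t n, const double x[], const double y[],
             const double wt[], line_fit *out)
{
    double sw = 0.0, swx = 0.0, swy = 0.0;
    double xbar, ybar, sxx = 0.0, sxy = 0.0, rss = 0.0;
    size_t i;

    /* First pass: weighted means of x and y. */
    for (i = 0; i < n; i++) {
        double w = wt ? wt[i] : 1.0;
        sw += w;
        swx += w * x[i];
        swy += w * y[i];
    }
    if (sw <= 0.0) return -1;
    xbar = swx / sw;
    ybar = swy / sw;

    /* Second pass: weighted sums of squares and products about the means. */
    for (i = 0; i < n; i++) {
        double w = wt ? wt[i] : 1.0;
        double dx = x[i] - xbar;
        sxx += w * dx * dx;
        sxy += w * dx * (y[i] - ybar);
    }
    if (sxx <= 0.0) return -1;

    out->b = sxy / sxx;
    out->a = ybar - out->b * xbar;

    /* Residual sum of squares about the fitted line. */
    for (i = 0; i < n; i++) {
        double w = wt ? wt[i] : 1.0;
        double r = y[i] - (out->a + out->b * x[i]);
        rss += w * r * r;
    }
    out->rss = rss;
    return 0;
}
```

For the three points (0, 0), (1, 1), (2, 1) with unit weights, this gives $\hat{b} = 1/2$, $\hat{a} = 1/6$ and $rss = 1/6$.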

The number of degrees of freedom associated with $rss$ is:

$d f = \sum w_{i} - 2$ when $mean = Nag_AboutMean$
$d f = \sum w_{i} - 1$ when $mean = Nag_AboutZero$

Note: the weights should be scaled to give the correct degrees of freedom in the case $\mathrm{var} (ε_{i}) = σ^{2} / w_{i}$.

The $R^{2}$ value, or coefficient of determination, is

$R^{2} = \frac{\sum w_{i} {({\hat{y}}_{i} - \bar{y})}^{2}}{\sum w_{i} {(y_{i} - \bar{y})}^{2}} = \frac{\sum w_{i} {(y_{i} - \bar{y})}^{2} - rss}{\sum w_{i} {(y_{i} - \bar{y})}^{2}} .$

This measures the proportion of the total variation about the mean $\bar{y}$ that can be explained by the regression.
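As a rough illustration of the second form of the $R^{2}$ formula, the sketch below computes the total weighted sum of squares about $\bar{y}$ and the residual sum of squares for a given line. The function name `r_squared` is hypothetical and not part of the NAG Library. The two forms of the formula coincide only when `a` and `b` are the least squares estimates.

```c
#include <math.h>
#include <stddef.h>

/* Hypothetical helper: coefficient of determination for the line
   y = a + b x, computed as (TSS - RSS) / TSS with weights wt
   (wt == NULL means unit weights).
   Returns NAN when the sum of weights is not positive or when y shows
   no variation, since R^2 is undefined in those cases. */
double r_squared(size_t n, const double x[], const double y[],
                 const double wt[], double a, double b)
{
    double sw = 0.0, swy = 0.0, ybar, tss = 0.0, rss = 0.0;
    size_t i;

    /* Weighted mean of y. */
    for (i = 0; i < n; i++) {
        double w = wt ? wt[i] : 1.0;
        sw += w;
        swy += w * y[i];
    }
    if (sw <= 0.0) return NAN;
    ybar = swy / sw;

    /* Total and residual weighted sums of squares. */
    for (i = 0; i < n; i++) {
        double w = wt ? wt[i] : 1.0;
        double dy = y[i] - ybar;
        double r = y[i] - (a + b * x[i]);
        tss += w * dy * dy;
        rss += w * r * r;
    }
    if (tss <= 0.0) return NAN;
    return (tss - rss) / tss;
}
```

For the points (0, 0), (1, 1), (2, 1) with the least squares line $\hat{a} = 1/6$, $\hat{b} = 1/2$, the total sum of squares is 2/3 and $rss = 1/6$, so $R^{2} = 0.75$.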

The standard error for the regression constant $\hat{a}$ is

$\mathrm{a\_serr} = \sqrt{\frac{rss}{d f} \left( \frac{1}{\sum w_{i}} + \frac{{\bar{x}}^{2}}{\sum w_{i} {(x_{i} - \bar{x})}^{2}} \right)} = \sqrt{\frac{rss}{d f} \frac{1}{\sum w_{i}} \frac{\sum w_{i} x_{i}^{2}}{\sum w_{i} {(x_{i} - \bar{x})}^{2}}} .$

The standard error for the regression coefficient $\hat{b}$ is

$\mathrm{b\_serr} = \sqrt{\frac{rss}{d f \sum w_{i} {(x_{i} - \bar{x})}^{2}}} .$

Similar formulae can be derived for the case when the line goes through the origin, that is, $a = 0$.
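For reference, the through-origin case can be worked out as follows. This is a sketch that assumes the same weighted least squares criterion is minimized with $a$ fixed at zero. The results are the standard textbook forms and are not quoted from the NAG documentation.

```latex
% Minimize S(b) = \sum w_i (y_i - b x_i)^2 with respect to b:
\frac{dS}{db} = -2 \sum w_{i} x_{i} (y_{i} - b x_{i}) = 0
\quad\Longrightarrow\quad
\hat{b} = \frac{\sum w_{i} x_{i} y_{i}}{\sum w_{i} x_{i}^{2}},

% Residual sum of squares and its degrees of freedom:
rss = \sum w_{i} {(y_{i} - \hat{b} x_{i})}^{2},
\qquad
d f = \sum w_{i} - 1,

% Standard error of the regression coefficient:
\mathrm{b\_serr} = \sqrt{\frac{rss}{d f \sum w_{i} x_{i}^{2}}} .
```

Compared with the formulas for the model with a constant term, the sums about $\bar{x}$ and $\bar{y}$ are replaced by sums about zero, and one fewer degree of freedom is used.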

4 References

Draper N R and Smith H (1985) Applied Regression Analysis (2nd Edition) Wiley

5 Arguments

1: mean – Nag_SumSquare Input

On entry: indicates whether nag_simple_linear_regression (g02cac) is to include a constant term in the regression.

$mean = Nag_AboutMean$: The regression constant $a$ is included.
$mean = Nag_AboutZero$: The regression constant $a$ is not included, i.e., $a = 0$ .

Constraint: $mean = Nag_AboutMean$ or $Nag_AboutZero$.

2: n – Integer Input

On entry: $n$, the number of observations.

Constraints:

if $mean = Nag_AboutMean$ , $n \geq 2$ ;
if $mean = Nag_AboutZero$ , $n \geq 1$ .

3: x[n] – const double Input

On entry: the values of the independent variable with the $i$th value stored in x[i - 1], for $i = 1, 2, \dots, n$.

Constraint: the values of x must not all be identical.

4: y[n] – const double Input

On entry: the values of the dependent variable with the $i$th value stored in y[i - 1], for $i = 1, 2, \dots, n$.

Constraint: the values of y must not all be identical.

5: wt[n] – const double Input

On entry: if weighted estimates are required then wt must contain the weights to be used in the weighted regression. Usually wt[i - 1] will be an integral value corresponding to the number of observations associated with the $i$th data point, or zero if the $i$th data point is to be ignored. The sum of the weights therefore represents the effective total number of observations used to create the regression line.

If weights are not provided then wt must be set to NULL and the effective number of observations is n.

Constraint: if wt is not NULL, $wt [i - 1] \geq 0.0$, for $i = 1, 2, \dots, n$.
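The interpretation of integral weights as observation counts can be checked with a small sketch. The function `weighted_slope` below is a hypothetical helper written for this illustration, not a NAG routine. It computes only the weighted estimate of the regression coefficient, and assumes a positive sum of weights and x values that are not all identical.

```c
#include <stddef.h>

/* Hypothetical helper: weighted least squares slope of y on x.
   wt == NULL means unit weights. Assumes the sum of weights is positive
   and the weighted x values are not all identical. */
double weighted_slope(size_t n, const double x[], const double y[],
                      const double wt[])
{
    double sw = 0.0, swx = 0.0, swy = 0.0;
    double xbar, ybar, sxx = 0.0, sxy = 0.0;
    size_t i;

    /* Weighted means. */
    for (i = 0; i < n; i++) {
        double w = wt ? wt[i] : 1.0;
        sw += w;
        swx += w * x[i];
        swy += w * y[i];
    }
    xbar = swx / sw;
    ybar = swy / sw;

    /* Weighted sums of squares and products about the means. */
    for (i = 0; i < n; i++) {
        double w = wt ? wt[i] : 1.0;
        double dx = x[i] - xbar;
        sxx += w * dx * dx;
        sxy += w * dx * (y[i] - ybar);
    }
    return sxy / sxx;
}
```

Fitting the points (1, 1), (2, 3), (3, 2) with weights 1, 2, 0 gives the same slope as fitting the unweighted points (1, 1), (2, 3), (2, 3): the weight 2 acts as a repeated observation and the weight 0 removes the third point.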

6: a – double * Output

On exit: if $mean = Nag_AboutMean$ then a is the regression constant $\hat{a}$, otherwise a is set to zero.

7: b – double * Output

On exit: the regression coefficient $\hat{b}$.

8: a_serr – double * Output

On exit: the standard error of the regression constant $\hat{a}$.

9: b_serr – double * Output

On exit: the standard error of the regression coefficient $\hat{b}$.

10: rsq – double * Output

On exit: the coefficient of determination, $R^{2}$.

11: rss – double *Output

On exit: the sum of squares of the residuals about the regression.

12: df – double *Output

On exit: the degrees of freedom associated with the residual sum of squares.

13: fail – NagError *Input/Output

The NAG error argument (see Section 3.6 in the Essential Introduction).

6 Error Indicators and Warnings

NE_BAD_PARAM: On entry, argument mean had an illegal value.
NE_INT_ARG_LT: On entry, $n = ⟨value⟩$ .
Constraint: $n \geq 1$ if $mean = Nag_AboutZero$ .
On entry, $n = ⟨value⟩$ .
Constraint: $n \geq 2$ if $mean = Nag_AboutMean$ .
NE_NEG_WEIGHT: On entry, at least one of the weights is negative.
NE_SW_LOW: On entry, the sum of elements of wt must be greater than 1.0 if $mean = Nag_AboutZero$ or greater than 2.0 if $mean = Nag_AboutMean$ .
NE_WT_LOW: On entry, wt must contain at least 1 positive element if $mean = Nag_AboutZero$ or at least 2 positive elements if $mean = Nag_AboutMean$ .
NE_X_OR_Y_IDEN: On entry, all elements of x and/or y are equal.
NE_ZERO_DOF_RESID: On entry, the degrees of freedom for the residual are zero, i.e., the designated number of arguments $=$ the effective number of observations.
NW_RSS_EQ_ZERO: Residual sum of squares is zero, i.e., a perfect fit was obtained.

7 Accuracy

The computations are believed to be stable.

8 Parallelism and Performance

Not applicable.

9 Further Comments

The time taken by the function depends on $n$. The function uses a two-pass algorithm.
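The document states only that a two-pass algorithm is used. The sketch below shows the generic two-pass idea for a corrected sum of squares (mean first, then squared deviations) next to the one-pass textbook formula, to indicate why the two-pass form is usually preferred. Both function names are hypothetical, unit weights are assumed for brevity, and this is not a description of the library's internal code.

```c
#include <stddef.h>

/* Two-pass corrected sum of squares: compute the mean first, then sum
   the squared deviations from it. Assumes n >= 1. */
double sumsq_two_pass(size_t n, const double x[])
{
    double mean = 0.0, ss = 0.0;
    size_t i;

    for (i = 0; i < n; i++)      /* pass 1: mean */
        mean += x[i];
    mean /= (double) n;

    for (i = 0; i < n; i++) {    /* pass 2: squared deviations */
        double d = x[i] - mean;
        ss += d * d;
    }
    return ss;
}

/* One-pass textbook formula: sum(x^2) - (sum x)^2 / n. It subtracts two
   large, nearly equal quantities, so it can lose most of its significant
   digits when the mean is large relative to the spread. Assumes n >= 1. */
double sumsq_one_pass(size_t n, const double x[])
{
    double s = 0.0, s2 = 0.0;
    size_t i;

    for (i = 0; i < n; i++) {
        s += x[i];
        s2 += x[i] * x[i];
    }
    return s2 - s * s / (double) n;
}
```

For data such as 1000000001, 1000000002, 1000000003 the deviations from the mean are small integers, so the two-pass result is exactly 2, whereas the one-pass formula works with squares of order $10^{18}$ that cannot all be represented exactly in double precision.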

10 Example

A program to calculate the regression constants, $\hat{a}$ and $\hat{b}$, the standard errors of the regression constants, the coefficient of determination and the degrees of freedom about the regression.
