Integer, Intent (In)	::	n, ldc, maxcal
Integer, Intent (Inout)	::	ifail
Real (Kind=nag_wp), Intent (In)	::	x(n), y(n), wt(*), u, tol
Real (Kind=nag_wp), Intent (Inout)	::	c(ldc,3), crit
Real (Kind=nag_wp), Intent (Out)	::	yhat(n), rss, df, res(n), h(n), rho, wk(7*(n+2))
Character (1), Intent (In)	::	method, weight

C Header Interface

#include <nag.h>

void

g10acf_ (const char *method, const char *weight, const Integer *n, const double x[], const double y[], const double wt[], double yhat[], double c[], const Integer *ldc, double *rss, double *df, double res[], double h[], double *crit, double *rho, const double *u, const double *tol, const Integer *maxcal, double wk[], Integer *ifail, const Charlen length_method, const Charlen length_weight)

The routine may be called by the names g10acf or nagf_smooth_fit_spline_parest.

3 Description

For a set of

n

observations

(x_{i}, y_{i})

, for

i = 1, 2, \dots, n

, the spline provides a flexible smooth function for situations in which a simple polynomial or nonlinear regression model is not suitable.

Cubic smoothing splines arise as the unique real-valued solution function

f

, with absolutely continuous first derivative and squared-integrable second derivative, which minimizes

\sum_{i = 1}^{n} w_{i} {(y_{i} - f (x_{i}))}^{2} + ρ \int_{- \infty}^{\infty} {(f^{''} (x))}^{2} d x,

where

w_{i}

is the (optional) weight for the

i

th observation and

ρ

is the smoothing parameter. This criterion consists of two parts: the first measures the fit of the curve and the second the smoothness of the curve. The value of the smoothing parameter

ρ

weights these two aspects; larger values of

ρ

give a smoother fitted curve but, in general, a poorer fit. For details of how the cubic spline can be fitted see Hutchinson and de Hoog (1985) and Reinsch (1967).

The fitted values,

\hat{y} = {({\hat{y}}_{1}, {\hat{y}}_{2}, \dots, {\hat{y}}_{n})}^{T}

, and weighted residuals,

r_{i}

, can be written as:

\hat{y} = H y and r_{i} = \sqrt{w_{i}} (y_{i} - {\hat{y}}_{i})

for a matrix

H

. The residual degrees of freedom for the spline is

trace (I - H)

and the diagonal elements of

H

are the leverages.

The parameter

ρ

can be estimated in a number of ways.

(i)The degrees of freedom for the spline can be specified, i.e., find $ρ$ such that $trace (H) = ν_{0}$ for given $ν_{0}$ .
(ii)Minimize the cross-validation (CV), i.e., find $ρ$ such that the CV is minimized, where

$CV = \frac{1}{\sum_{i = 1}^{n} w_{i}} \sum_{i = 1}^{n} {[\frac{r_{i}}{1 - h_{i i}}]}^{2} .$
(iii)Minimize the generalized cross-validation (GCV), i.e., find $ρ$ such that the GCV is minimized, where

$GCV = \frac{n^{2}}{\sum_{i = 1}^{n} w_{i}} [\frac{\sum_{i = 1}^{n} r_{i}^{2}}{{(\sum_{i = 1}^{n} (1 - h_{i i}))}^{2}}] .$

g10acf requires the

x_{i}

to be strictly increasing. If two or more observations have the same

x_{i}

value then they should be replaced by a single observation with

y_{i}

equal to the (weighted) mean of the

y

values and weight,

w_{i}

, equal to the sum of the weights. This operation can be performed by g10zaf.

The algorithm is based on Hutchinson (1986). c05azf is used to solve for

ρ

given

ν_{0}

and the method of e04abf/e04aba is used to minimize the GCV or CV.

4 References

Hastie T J and Tibshirani R J (1990) Generalized Additive Models Chapman and Hall

Hutchinson M F (1986) Algorithm 642: A fast procedure for calculating minimum cross-validation cubic smoothing splines ACM Trans. Math. Software 12 150–153

Hutchinson M F and de Hoog F R (1985) Smoothing noisy data with spline functions Numer. Math. 47 99–106

Reinsch C H (1967) Smoothing by spline functions Numer. Math. 10 177–183

5 Arguments

1: $method$ – Character(1) Input

On entry: indicates whether the smoothing parameter is to be found by minimization of the CV or GCV functions, or by finding the smoothing parameter corresponding to a specified degrees of freedom value.

$method ='C'$: Cross-validation is used.
$method ='D'$: The degrees of freedom are specified.
$method ='G'$: Generalized cross-validation is used.

Constraint:

method ='C'

'D'

'G'

2: $weight$ – Character(1) Input

On entry: indicates whether user-defined weights are to be used.

$weight ='W'$: User-defined weights should be supplied in wt.
$weight ='U'$: The data is treated as unweighted.

Constraint:

weight ='W'

'U'

3: $n$ – Integer Input

On entry:

n

, the number of observations.

Constraint:

n \geq 3

4: $x (n)$ – Real (Kind=nag_wp) array Input

On entry: the distinct and ordered values

x_{i}

, for

i = 1, 2, \dots, n

Constraint:

x (i) < x (i + 1)

, for

i = 1, 2, \dots, n - 1

5: $y (n)$ – Real (Kind=nag_wp) array Input

On entry: the values

y_{i}

, for

i = 1, 2, \dots, n

6: $wt (*)$ – Real (Kind=nag_wp) array Input

Note: the dimension of the array wt must be at least

n

weight ='W'

On entry: if

weight ='W'

, wt must contain the

n

weights. Otherwise wt is not referenced and unit weights are assumed.

Constraint: if

weight ='W'

wt (i) > 0.0

, for

i = 1, 2, \dots, n

7: $yhat (n)$ – Real (Kind=nag_wp) array Output

On exit: the fitted values,

{\hat{y}}_{i}

, for

i = 1, 2, \dots, n

8: $c (ldc, 3)$ – Real (Kind=nag_wp) array Output

On exit: the spline coefficients. More precisely, the value of the spline approximation at

t

is given by

((c (i, 3) \times d + c (i, 2)) \times d + c (i, 1)) \times d + {\hat{y}}_{i}

, where

x_{i} \leq t < x_{i + 1}

and

d = t - x_{i}

9: $ldc$ – Integer Input

On entry: the first dimension of the array c as declared in the (sub)program from which g10acf is called.

Constraint:

ldc \geq n - 1

10: $rss$ – Real (Kind=nag_wp) Output

On exit: the (weighted) residual sum of squares.

11: $df$ – Real (Kind=nag_wp) Output

On exit: the residual degrees of freedom. If

method ='D'

this will be

n - crit

to the required accuracy.

12: $res (n)$ – Real (Kind=nag_wp) array Output

On exit: the (weighted) residuals,

r_{i}

, for

i = 1, 2, \dots, n

13: $h (n)$ – Real (Kind=nag_wp) array Output

On exit: the leverages,

h_{i i}

, for

i = 1, 2, \dots, n

14: $crit$ – Real (Kind=nag_wp) Input/Output

On entry: if

method ='D'

, the required degrees of freedom for the spline.

method ='C'

'G'

, crit need not be set.

Constraint:

2.0 < crit \leq n

On exit: if

method ='C'

, the value of the cross-validation, or if

method ='G'

, the value of the generalized cross-validation function, evaluated at the value of

ρ

returned in rho.

15: $rho$ – Real (Kind=nag_wp) Output

On exit: the smoothing parameter,

ρ

16: $u$ – Real (Kind=nag_wp) Input

On entry: the upper bound on the smoothing parameter. If

u \leq tol

u = 1000.0

will be used instead. See Section 9 for details on how this argument is used.

17: $tol$ – Real (Kind=nag_wp) Input

On entry: the accuracy to which the smoothing parameter rho is required. tol should preferably be not much less than

\sqrt{ε}

, where

ε

is the machine precision. If

tol < ε

tol = \sqrt{ε}

will be used instead.

18: $maxcal$ – Integer Input

On entry: the maximum number of spline evaluations to be used in finding the value of

ρ

. If

maxcal < 3

maxcal = 100

will be used instead.

19: $wk (7 \times (n + 2))$ – Real (Kind=nag_wp) array Workspace

20: $ifail$ – Integer Input/Output

On entry: ifail must be set to

0

−1

1

to set behaviour on detection of an error; these values have no effect when no error is detected.

A value of

0

causes the printing of an error message and program execution will be halted; otherwise program execution continues. A value of

−1

means that an error message is printed while a value of

1

means that it is not.

If halting is not appropriate, the value

−1

1

is recommended. If message printing is undesirable, then the value

1

is recommended. Otherwise, the value

0

is recommended. When the value $- 1$ or $1$ is used it is essential to test the value of ifail on exit.

On exit:

ifail = 0

unless the routine detects an error or a warning has been flagged (see Section 6).

6 Error Indicators and Warnings

If on entry

ifail = 0

−1

, explanatory error messages are output on the current error message unit (as defined by x04aaf).

Errors or warnings detected by the routine:

$ifail = 1$: On entry, $crit = ⟨ value ⟩$ .
Constraint: if $method ='D'$ , $crit > 2.0$ .

On entry, $crit = ⟨ value ⟩$ .
Constraint: if $method ='D'$ , $crit \leq n$ .

On entry, $ldc = ⟨ value ⟩$ and $n = ⟨ value ⟩$ .
Constraint: $ldc \geq n - 1$ .

On entry, method is not valid: $method = ⟨ value ⟩$ .

On entry, $n = ⟨ value ⟩$ .
Constraint: $n \geq 3$ .

On entry, $weight = ⟨ value ⟩$ .
Constraint: $weight ='W'$ or $'U'$ .

$ifail = 2$: On entry, at least one element of $wt \leq 0.0$ .

$ifail = 3$: On entry, x is not a strictly ordered array.

$ifail = 4$: For the specified degrees of freedom, $rho > u$ : $u = ⟨ value ⟩$ .

$ifail = 5$: Accuracy of tol cannot be achieved: $tol = ⟨ value ⟩$ .

$ifail = 6$: maxcal iterations have been performed.

$ifail = 7$: Optimum value of rho lies above u: $u = ⟨ value ⟩$ .

$ifail = - 99$: An unexpected error has been triggered by this routine. Please contact NAG.
See Section 7 in the Introduction to the NAG Library FL Interface for further information.

$ifail = - 399$: Your licence key may have expired or may not have been installed correctly.
See Section 8 in the Introduction to the NAG Library FL Interface for further information.

$ifail = - 999$: Dynamic memory allocation failed.
See Section 9 in the Introduction to the NAG Library FL Interface for further information.

7 Accuracy

When minimizing the cross-validation or generalized cross-validation, the error in the estimate of

ρ

should be within

\pm 3 (tol \times rho + tol)

. When finding

ρ

for a fixed number of degrees of freedom the error in the estimate of

ρ

should be within

\pm 2 \times tol \times \max (1, rho)

Given the value of

ρ

, the accuracy of the fitted spline depends on the value of

ρ

and the position of the

x

values. The values of

x_{i} - x_{i - 1}

and

w_{i}

are scaled and

ρ

is transformed to avoid underflow and overflow problems.

8 Parallelism and Performance

Background information to multithreading can be found in the Multithreading documentation.

g10acf is not threaded in any implementation.

9 Further Comments

The time to fit the spline for a given value of

ρ

is of order

n

When finding the value of

ρ

that gives the required degrees of freedom, the algorithm examines the interval

0.0

to u. For small degrees of freedom the value of

ρ

can be large, as in the theoretical case of two degrees of freedom when the spline reduces to a straight line and

ρ

is infinite. If the CV or GCV is to be minimized then the algorithm searches for the minimum value in the interval

0.0

to u. If the function is decreasing in that range then the boundary value of u will be returned. In either case, the larger the value of u the more likely is the interval to contain the required solution, but the process will be less efficient.

In extreme cases, the conditioning of the sub-problems being solved can be adversely affected by small relative weights, small relative differences

x_{i} - x_{i - 1}

, or a combination of the two. The resultant low accuracy in the solutions to sub-problems can lead to spurious results and error exits. For example, for small degrees of freedom the correct value of

ρ

can be extremely large, but an inaccurate smaller value may be returned for moderate values of u and an error return of

ifail = 4

for larger values of u. In such cases it is advised to try using e02bef for a fixed large smoothing value, or attempt to remove data with very small relative weights to improve the conditioning of the problem.

Regression splines with a small

(< n)

number of knots can be fitted by e02baf and e02bef.

10 Example

This example uses the data given by Hastie and Tibshirani (1990), which consists of the age,

x_{i}

, and C-peptide concentration (pmol/ml),

y_{i}

, from a study of the factors affecting insulin-dependent diabetes mellitus in children. The data is input, reduced to a strictly ordered set by g10zaf and a spline with

5

degrees of freedom is fitted by g10acf. The fitted values and residuals are printed.

g10ac: FL CL CPP AD PY MB

NAG FL Interfaceg10acf (fit_​spline_​parest)

▸▿ Contents

1 Purpose

2 Specification

3 Description

4 References

5 Arguments

6 Error Indicators and Warnings

7 Accuracy

8 Parallelism and Performance

9 Further Comments

10 Example

10.1 Program Text

10.2 Program Data

10.3 Program Results

NAG FL Interface
g10acf (fit_spline_parest)