g10ac: Smoothing in Statistics (NAG Toolbox)

Cubic smoothing splines arise as the unique real-valued solution function f, with absolutely continuous first derivative and squared-integrable second derivative, which minimizes

\sum_{i = 1}^{n} w_{i} {(y_{i} - f (x_{i}))}^{2} + ρ \int_{- \infty}^{\infty} {(f^{''} (x))}^{2} d x,

where w_{i} is the (optional) weight for the ith observation and ρ is the smoothing argument. This criterion consists of two parts: the first measures the fit of the curve and the second the smoothness of the curve. The value of the smoothing argument ρ weights these two aspects; larger values of ρ give a smoother fitted curve but, in general, a poorer fit. For details of how the cubic spline can be fitted see Hutchinson and de Hoog (1985) and Reinsch (1967).

The fitted values, \hat{y} = {({\hat{y}}_{1}, {\hat{y}}_{2}, \dots, {\hat{y}}_{n})}^{T}, and weighted residuals, r_{i}, can be written as

\hat{y} = H y and r_{i} = \sqrt{w_{i}} (y_{i} - {\hat{y}}_{i})

for a matrix H. The residual degrees of freedom for the spline is trace (I - H) and the diagonal elements of H, h_{i i}, are the leverages.
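The quantities above can be illustrated with a small stand-in smoother. The sketch below is not the algorithm used by g10ac: it assumes a discrete second-difference penalty in place of the integral of the squared second derivative, and the data, weights and value of rho are purely illustrative. It only shows how the fitted values, weighted residuals, leverages and residual degrees of freedom follow from a hat matrix H.

```matlab
% Stand-in linear smoother (NOT the g10ac algorithm): a discrete
% second-difference penalty replaces the integral of f''(x)^2.
y   = [3.0; 3.9; 3.4; 3.7; 3.9; 5.1; 4.2; 4.6];   % illustrative data
w   = ones(size(y));                               % unit weights
rho = 2.0;                                         % arbitrary smoothing value
n   = numel(y);

W = diag(w);
D = diff(eye(n), 2);              % (n-2)-by-n second-difference operator
H = (W + rho * (D' * D)) \ W;     % hat matrix, so that yhat = H*y

yhat   = H * y;                   % fitted values
r      = sqrt(w) .* (y - yhat);   % weighted residuals r_i
h      = diag(H);                 % leverages h_ii
res_df = trace(eye(n) - H);       % residual degrees of freedom

fprintf('trace(H) = %.3f, residual df = %.3f\n', trace(H), res_df);
```

By construction trace(H) and the residual degrees of freedom sum to n, which is the relationship used when a target number of degrees of freedom is specified.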

The parameter ρ can be estimated in a number of ways.

(i) The degrees of freedom for the spline can be specified, i.e., find ρ such that trace (H) = ν_{0} for given ν_{0}.

(ii) Minimize the cross-validation (CV), i.e., find ρ such that the CV is minimized, where

CV = \frac{1}{\sum_{i = 1}^{n} w_{i}} \sum_{i = 1}^{n} {[\frac{r_{i}}{1 - h_{i i}}]}^{2} .

(iii) Minimize the generalized cross-validation (GCV), i.e., find ρ such that the GCV is minimized, where

GCV = \frac{n^{2}}{\sum_{i = 1}^{n} w_{i}} [\frac{\sum_{i = 1}^{n} r_{i}^{2}}{{(\sum_{i = 1}^{n} (1 - h_{i i}))}^{2}}] .
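As a minimal sketch of the two criteria, the CV and GCV formulas can be evaluated directly from the weighted residuals r, the leverages h and the weights w. The function handles cv and gcv and the input values below are hypothetical and chosen only for illustration; they are not part of the NAG Toolbox.

```matlab
% CV and GCV as defined above, written as anonymous functions of the
% weighted residuals r, leverages h (diagonal of H) and weights w.
cv  = @(r, h, w) sum((r ./ (1 - h)).^2) / sum(w);
gcv = @(r, h, w) numel(r)^2 * sum(r.^2) / (sum(w) * sum(1 - h)^2);

% Illustrative inputs (not taken from any fitted spline)
r = [0.2; -0.1; 0.3; -0.4];
h = [0.5; 0.5; 0.5; 0.5];
w = ones(4, 1);

fprintf('CV  = %.4f\n', cv(r, h, w));
fprintf('GCV = %.4f\n', gcv(r, h, w));
```

With unit weights and equal leverages the two criteria coincide (both are 0.3 for these inputs); they differ once the leverages vary across observations, because GCV replaces each h_{i i} by the average leverage.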

nag_smooth_fit_spline_parest (g10ac) requires the x_{i} to be strictly increasing. If two or more observations have the same x_{i} value then they should be replaced by a single observation with y_{i} equal to the (weighted) mean of the y values and weight, w_{i}, equal to the sum of the weights. This operation can be performed by nag_smooth_data_order (g10za).
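The tie-handling rule can be sketched in plain MATLAB as follows. This is only an illustration of the rule described above, not the implementation of g10za; the five observations are taken from the example data, with unit weights assumed.

```matlab
% Merge observations that share an x value: weighted mean of y,
% summed weights, and x sorted into strictly increasing order.
x = [5.2; 4.8; 5.2; 1.0; 4.8];
y = [4.8; 4.5; 4.9; 3.9; 3.9];
w = ones(size(x));                          % unit weights (assumed)

[xu, ~, idx] = unique(x);                   % sorted distinct x and group index
idx = idx(:);                               % force a column of group indices
wu  = accumarray(idx, w);                   % sum of weights per distinct x
yu  = accumarray(idx, w .* y) ./ wu;        % weighted mean of y per distinct x

disp([xu yu wu]);
```

The merged observations at x = 4.8 and x = 5.2 have y = 4.2 and y = 4.85 with weight 2, matching the corresponding rows of the example output.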

References

Parameters

Compulsory Input Parameters

Optional Input Parameters

Output Parameters

Error Indicators and Warnings

Cases prefixed with W are classified as warnings and do not generate an error of type NAG:error_n. See nag_issue_warnings.

Accuracy

When minimizing the cross-validation or generalized cross-validation, the error in the estimate of ρ should be within \pm 3 (tol \times rho + tol). When finding ρ for a fixed number of degrees of freedom the error in the estimate of ρ should be within \pm 2 \times tol \times \max (1, rho).

Given the value of ρ, the accuracy of the fitted spline depends on the value of ρ and the position of the x values. The values of x_{i} - x_{i - 1} and w_{i} are scaled and ρ is transformed to avoid underflow and overflow problems.

Further Comments

When finding the value of ρ that gives the required degrees of freedom, the algorithm examines the interval 0.0 to u. For small degrees of freedom the value of ρ can be large, as in the theoretical case of two degrees of freedom when the spline reduces to a straight line and ρ is infinite. If the CV or GCV is to be minimized then the algorithm searches for the minimum value in the interval 0.0 to u. If the function is decreasing in that range then the boundary value of u will be returned. In either case, the larger the value of u, the more likely the interval is to contain the required solution, but the process will be less efficient.
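The role of the upper limit u can be illustrated with a simple bisection search for the value of ρ that gives a required number of degrees of freedom. This is a sketch under stated assumptions: the search method used inside g10ac is not described here, bisection is used only for illustration, and the smoother is a stand-in with a discrete second-difference penalty and unit weights. The values of n, nu0, u and tol are hypothetical.

```matlab
% Bisection for rho in [0, u] such that trace(H(rho)) = nu0, using a
% stand-in smoother (second-difference penalty), NOT the g10ac algorithm.
n   = 10;
W   = eye(n);                                  % unit weights (assumed)
D   = diff(eye(n), 2);                         % second-difference operator
dof = @(rho) trace((W + rho * (D' * D)) \ W);  % trace(H) as a function of rho

nu0 = 5;       % required degrees of freedom (illustrative)
u   = 1000;    % upper limit of the search interval (illustrative)
tol = 1e-8;    % tolerance on rho (illustrative)

lo = 0;
hi = u;
if dof(hi) > nu0
    % trace(H) decreases as rho grows, so the target lies beyond u
    error('The interval [0, u] does not contain the solution; increase u.');
end
while hi - lo > tol * max(1, hi)
    mid = 0.5 * (lo + hi);
    if dof(mid) > nu0
        lo = mid;    % too many degrees of freedom: more smoothing needed
    else
        hi = mid;    % too few degrees of freedom: less smoothing needed
    end
end
rho = 0.5 * (lo + hi);

fprintf('rho = %.6f gives trace(H) = %.6f\n', rho, dof(rho));
```

In this sketch trace(H) falls from n at ρ = 0 towards 2 as ρ grows, so a small nu0 needs a large u to bracket the solution, while a larger u costs more bisection steps.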

Example

This example uses the data given by Hastie and Tibshirani (1990), which consists of the age, x_{i}, and C-peptide concentration (pmol/ml), y_{i}, from a study of the factors affecting insulin-dependent diabetes mellitus in children. The data is input, reduced to a strictly ordered set by nag_smooth_data_order (g10za) and a spline with 12 degrees of freedom (method = 'D', crit = 12) is fitted by nag_smooth_fit_spline_parest (g10ac). The fitted values and leverages are printed. The degrees of freedom reported in the output, 25.00, are the residual degrees of freedom, trace (I - H), for the 37 distinct x values.

function g10ac_example


fprintf('g10ac example results\n\n');

x =  [ 5.2  8.8 10.5 10.6 10.4  1.8 12.7 15.6  5.8  1.9 ...
       2.2  4.8  7.9  5.2  0.9 11.8  7.9 11.5 10.6  8.5 ...
      11.1 12.8 11.3  1.0 14.5 11.9  8.1 13.8 15.5  9.8 ...
      11.0 12.4 11.1  5.1  4.8  4.2  6.9 13.2  9.9 12.5 ...
      13.2  8.9 10.8];
y =  [ 4.8  4.1  5.2  5.5  5.0  3.4  3.4  4.9  5.6  3.7 ...
       3.9  4.5  4.8  4.9  3.0  4.6  4.8  5.5  4.5  5.3 ...
       4.7  6.6  5.1  3.9  5.7  5.1  5.2  3.7  4.9  4.8 ...
       4.4  5.2  5.1  4.6  3.9  5.1  5.1  6.0  4.9  4.1 ...
       4.6  4.9  5.1];

% Reorder x, remove ties and weight accordingly
[n, x, y, wt, rss, ifail] = g10za( ...
                                   x, y);
x = x(1:n);
y = y(1:n);

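% Control parameters: with method = 'D', crit is the required
% degrees of freedom for the spline, trace(H)
crit = 12;

% Fit cubic smoothing spline for the specified degrees of freedom
method = 'D';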
[yhat, c, rss, df, res, h, crit, rho, ifail] = ...
  g10ac( ...
         method, x, y, crit, 'wt', wt);

%  Display results
fprintf('Residual sum of squares     = %10.2f\n', rss);
fprintf('Degrees of freedom          = %10.2f\n', df);
fprintf('rho                         = %10.2f\n', rho);
fprintf('\n     Input data                Output results\n');
fprintf('   i     x       y            yhat      h\n');
ivar = double(1:n)';
fprintf('%4d%8.3f%8.3f%14.3f%8.3f\n', [ivar x y yhat h]');

g10ac example results

Residual sum of squares     =      10.35
Degrees of freedom          =      25.00
rho                         =       2.68

     Input data                Output results
   i     x       y            yhat      h
   1   0.900   3.000         3.373   0.534
   2   1.000   3.900         3.406   0.427
   3   1.800   3.400         3.642   0.313
   4   1.900   3.700         3.686   0.313
   5   2.200   3.900         3.839   0.448
   6   4.200   5.100         4.614   0.564
   7   4.800   4.200         4.576   0.442
   8   5.100   4.600         4.715   0.189
   9   5.200   4.850         4.783   0.407
  10   5.800   5.600         5.193   0.455
  11   6.900   5.100         5.184   0.592
  12   7.900   4.800         4.958   0.530
  13   8.100   5.200         4.931   0.235
  14   8.500   5.300         4.845   0.245
  15   8.800   4.100         4.763   0.271
  16   8.900   4.900         4.748   0.292
  17   9.800   4.800         4.850   0.301
  18   9.900   4.900         4.875   0.277
  19  10.400   5.000         4.970   0.173
  20  10.500   5.200         4.977   0.154
  21  10.600   5.000         4.979   0.285
  22  10.800   5.100         4.970   0.136
  23  11.000   4.400         4.961   0.137
  24  11.100   4.900         4.964   0.284
  25  11.300   5.100         4.975   0.162
  26  11.500   5.500         4.975   0.186
  27  11.800   4.600         4.930   0.213
  28  11.900   5.100         4.911   0.220
  29  12.400   5.200         4.852   0.206
  30  12.500   4.100         4.857   0.196
  31  12.700   3.400         4.900   0.189
  32  12.800   6.600         4.932   0.193
  33  13.200   5.300         4.955   0.488
  34  13.800   3.700         4.797   0.408
  35  14.500   5.700         5.076   0.559
  36  15.500   4.900         4.979   0.445
  37  15.600   4.900         4.946   0.535

