h05aa: Operations Research (NAG Toolbox)

The solutions are found using a branch and bound method, where each node of the tree is a subset of $Ω$. Assuming that (1) holds then a particular node, defined by subset $S_{i}$, can be trimmed from the tree if

$$f (S_{i}) < \hat{f} (S_{on})$$

where $\hat{f} (S_{on})$ is the $n$th highest score we have observed so far for a subset of size $p$, i.e., our current best guess of the score for the $n$th best subset. In addition, because of (1) we can also drop all nodes defined by any subset $S_{j}$ where $S_{j} \subseteq S_{i}$, thus avoiding the need to enumerate the whole tree. Similar short cuts can be taken if (2) holds. A full description of this branch and bound algorithm can be found in Ridout (1988).
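The trimming rule above can be illustrated with a minimal sketch. It assumes a toy score, the sum of non-negative weights, chosen only because it satisfies a monotonicity property of the kind the text refers to as (1). The names `w`, `f`, `observed`, `f_hat` and `can_trim` are hypothetical and are not part of the h05aa interface.

```matlab
% Toy score: adding a feature never lowers the score (weights are >= 0).
% The weights are made-up illustrative values.
w = [4 1 3 2 5];
f = @(S) sum(w(S));

p = 2;                                      % target subset size
n = 2;                                      % number of best subsets wanted
observed = [f([1 5]), f([3 5]), f([1 3])];  % size-p scores seen so far: 9, 8, 7

% Current guess of the score of the n-th best subset:
% the n-th highest score observed so far for a subset of size p
sorted_scores = sort(observed, 'descend');
f_hat = sorted_scores(n);

% Candidate node S_i; every node below it is a subset of S_i
S_i = [2 3 4];
can_trim = f(S_i) < f_hat;

% Brute-force check on this toy problem: no size-p subset of S_i
% reaches the bound, so nothing is lost by trimming the node
sub = nchoosek(S_i, p);
best_below = max(arrayfun(@(r) f(sub(r, :)), 1:size(sub, 1)));
```

Here `can_trim` is true and `best_below` stays under `f_hat`, which is the situation in which the whole branch below $S_{i}$ can be skipped.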

Rather than calculate the score at a given node of the tree, nag_best_subset_given_size_revcomm (h05aa) utilizes the fast branch and bound algorithm of Somol et al. (2004), and attempts to estimate the score where possible. For each feature, $x_{i}$, two values are stored: a count $c_{i}$ and $\hat{μ}_{i}$, an estimate of the contribution of that feature. An initial value of zero is used for both $c_{i}$ and $\hat{μ}_{i}$. At any stage of the algorithm where both $f (S)$ and $f (S - \{x_{i}\})$ have been calculated (as opposed to estimated), the estimated contribution of the feature $x_{i}$ is updated to

$$\frac{c_{i} \hat{μ}_{i} + [f (S) - f (S - \{x_{i}\})]}{c_{i} + 1}$$

and $c_{i}$ is incremented by $1$. Therefore, at each stage $\hat{μ}_{i}$ is the mean contribution of $x_{i}$ observed so far and $c_{i}$ is the number of observations used to calculate that mean.
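The update described above is an ordinary running mean, which the following sketch illustrates for a single feature. The variable names (`c_i`, `mu_i`, `pairs`) and the score values are made up for illustration. This is not the internal bookkeeping of h05aa, only the arithmetic of the formula.

```matlab
% Hypothetical bookkeeping for one feature x_i
c_i  = 0;   % number of calculated (not estimated) contributions seen so far
mu_i = 0;   % running mean of those contributions

% Made-up pairs [f(S), f(S - {x_i})], both calculated rather than estimated
pairs = [10.0 7.0;
          8.0 6.0;
          9.0 5.0];

for k = 1:size(pairs, 1)
  contribution = pairs(k, 1) - pairs(k, 2);        % f(S) - f(S - {x_i})
  mu_i = (c_i * mu_i + contribution) / (c_i + 1);  % updated mean contribution
  c_i  = c_i + 1;                                  % one more observation
end

% Direct mean of the same contributions, for comparison
direct_mean = mean(pairs(:, 1) - pairs(:, 2));
```

After the loop `mu_i` agrees with `direct_mean`, and `c_i` equals the number of pairs processed.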

References

Parameters

Compulsory Input Parameters

Optional Input Parameters

Output Parameters

Error Indicators and Warnings

Cases prefixed with W are classified as warnings and do not generate an error of type NAG:error_n. See nag_issue_warnings.

Accuracy

Further Comments

Example

function h05aa_example


fprintf('h05aa example results\n\n');

% Data required by the scoring function
n      = int64(40);
m      = int64(14);
[x,y]  = gen_data(n,m);

% Initialize parameters and get communication array lengths
irevcm = int64(0);
mincr  = int64(0);
ip     = int64(5);
drop   = int64(0);
lz     = int64(0);
mip    = m - ip;
z      = zeros(mip, 1, 'int64');
la     = int64(0);
nbest  = int64(3);
a      = zeros(max(nbest, m), 1, 'int64');
bscore = zeros(max(nbest, m), 1);
bz     = zeros(mip, nbest, 'int64');
mincnt = int64(-1);
gamma  = -1;
acc    = [0, 0];
icomm  = zeros(2, 1, 'int64');
rcomm  = zeros(0, 0);


warning('off', 'NAG:warning');
[irevcm, drop, lz, z, la, a, bscore, bz, icomm, rcomm, ifail] = ...
    h05aa(...
          irevcm, mincr, m, ip, drop, lz, z, la, a, ...
          bscore, bz, mincnt, gamma, acc, icomm, rcomm);
warning('on', 'NAG:warning');

% Ignore the warning message - required size of communication arrays now
% stored in icomm
rcomm = zeros(icomm(2), 1);
icomm = zeros(icomm(1), 1, 'int64');

% Initialization call
cnt = 0;
[irevcm, drop, lz, z, la, a, bscore, bz, icomm, rcomm, ifail] = ...
    h05aa(...
          irevcm, mincr, m, ip, drop, lz, z, la, a, ...
          bscore, bz, mincnt, gamma, acc, icomm, rcomm);

% Reverse communication loop for best subset routine, terminates on irevcm = 0.
while not(irevcm == 0)
  % Calculate and return the score for the required models and keep track
  % of the number of subsets evaluated
  cnt = cnt + max(1,la);
  [bscore] = calc_subset_score(m, drop, lz, z, la, a, x, y, bscore);

  [irevcm, drop, lz, z, la, a, bscore, bz, icomm, rcomm, ifail] = ...
      h05aa(...
            irevcm, mincr, m, ip, drop, lz, z, la, a, ...
            bscore, bz, mincnt, gamma, acc, icomm, rcomm);
end

% Display the best subsets and corresponding scores. 
% h05aa returns a list of features excluded from the best subsets;
% this is inverted to give the set of features included in each subset.
fprintf('\n    Score        Feature Subset\n');
fprintf('    -----        --------------\n');
ibz = 1:m;
for i = 1:la
  mask = ones(1, m, 'int64');
  mask(bz(1:mip, i)) = 0;
  fprintf('%12.5e %5d %5d %5d %5d %5d\n', bscore(i), ibz(logical(mask)));
end

fprintf('\n%d subsets evaluated in total\n', cnt);



function [bscore] = calc_subset_score(m, drop, lz, z, la, a, x, y, bscore)
  % Set up the initial feature set.
  % If drop = 0, this is the Null set (i.e. no features).
  % If drop = 1 then this is the full set (i.e. all features)
  if drop == 0
    isx = zeros(m, 1, 'int64');
  else
    isx = ones(m, 1, 'int64');
  end

  % Add (if drop = 0) or remove (if drop = 1) all the features specified in z
  inv_drop = not(drop);
  isx(z(1:lz)) = inv_drop;

  for i=1:max(la, 1)
    if (la > 0)
      if (i > 1)
        % Reset the feature altered at the last iteration
        isx(a(i-1)) = drop;
      end

      %  Add or drop the i'th feature in a
      isx(a(i)) = inv_drop;
    end

    ip = int64(sum(isx));

    % Fit the regression model
    rss = g02da('z', x, isx, ip, y);

    % Return the score (the residual sums of squares)
    bscore(i) = rss;
  end
  
function [x, y] = gen_data(n,m)
  x = zeros(n,m);
  genid = int64(3);
  subid = int64(1);
  seed(1) = int64(23124124);
  [state, ifail] = g05kf(genid, subid, seed);
  for i = 1:m
    [state, x(1:n,i), ifail] = g05sk( ...
                                      n, 0, sqrt(3), state);
  end
  [state, b, ifail] = g05sk(m, 1.5, 3, state);
  [state, y, ifail] = g05sk(n, 0, 1, state);
  y = x*b + y;
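
The example reports the included features by inverting the list of excluded features returned in bz. The sketch below isolates that step on a made-up column of excluded indices (the variable `bz1` and its values are hypothetical) and compares the mask approach used in the example with an equivalent call to `setdiff`.

```matlab
m   = 14;                               % total number of features, as in the example
bz1 = [2; 3; 5; 7; 8; 9; 11; 12; 14];   % made-up excluded features (m - ip = 9 entries)

% Mask approach, as in the example: start with every feature included,
% then switch off the excluded ones
mask = true(1, m);
mask(bz1) = false;
included_mask = find(mask);

% Equivalent set-difference approach
included_diff = setdiff(1:m, bz1');
```

Both variables hold the same five feature indices. The mask form is convenient when a logical index is needed later, while `setdiff` is shorter when only the list of indices is required.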

NAG Toolbox: nag_best_subset_given_size_revcomm (h05aa)
