NAG Library Routine Document

Integer, Intent (In)	::	mincr, m, ip, nbest, mincnt, licomm, lrcomm
Integer, Intent (Inout)	::	irevcm, drop, lz, z(m-ip), la, a(max(nbest,m)), bz(m-ip,nbest), icomm(licomm), ifail
Real (Kind=nag_wp), Intent (In)	::	gamma, acc(2)
Real (Kind=nag_wp), Intent (Inout)	::	bscore(max(nbest,m)), rcomm(lrcomm)

C Header Interface

#include <nagmk26.h>

void h05aaf_ (Integer *irevcm, const Integer *mincr, const Integer *m, const Integer *ip, const Integer *nbest, Integer *drop, Integer *lz, Integer z[], Integer *la, Integer a[], double bscore[], Integer bz[], const Integer *mincnt, const double *gamma, const double acc[], Integer icomm[], const Integer *licomm, double rcomm[], const Integer *lrcomm, Integer *ifail)

3 Description

Given $Ω = \{x_{i} : i \in ℤ, 1 \leq i \leq m\}$, a set of $m$ unique features, and a scoring mechanism $f (S)$ defined for all $S \subseteq Ω$, h05aaf is designed to find $S_{o 1} \subseteq Ω$, $|S_{o 1}| = p$, an optimal subset of size $p$. Here $|S_{o 1}|$ denotes the cardinality of $S_{o 1}$, the number of elements in the set.

The definition of the optimal subset depends on the properties of the scoring mechanism. If

$$f (S_{i}) \leq f (S_{j}), \quad \text{for all } S_{j} \subseteq Ω \text{ and } S_{i} \subseteq S_{j} \qquad (1)$$

then the optimal subset is defined as one of the solutions to

$$\underset{S \subseteq Ω}{\text{maximize}} \; f (S) \quad \text{subject to} \quad |S| = p$$

else if

$$f (S_{i}) \geq f (S_{j}), \quad \text{for all } S_{j} \subseteq Ω \text{ and } S_{i} \subseteq S_{j} \qquad (2)$$

then the optimal subset is defined as one of the solutions to

$$\underset{S \subseteq Ω}{\text{minimize}} \; f (S) \quad \text{subject to} \quad |S| = p .$$

If neither of these properties holds then h05aaf cannot be used.
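Properties (1) and (2) can be checked exhaustively when $m$ is small. The following is a minimal sketch, not part of the NAG Library: the function name `classify_score` and the example score functions are assumptions made for illustration. It compares each subset with every subset obtained by removing one feature, which is sufficient because any nested pair of subsets is linked by a chain of single removals.

```python
from itertools import combinations

def classify_score(m, score):
    """Return 1 if property (1) holds, 0 if property (2) holds, else None.

    Hypothetical helper: `score` takes a frozenset of feature indices 1..m.
    A constant score satisfies both properties; 1 is returned in that case.
    """
    features = range(1, m + 1)
    increasing = decreasing = True
    for size in range(1, m + 1):
        for s in combinations(features, size):
            big = frozenset(s)
            f_big = score(big)
            for x in s:
                f_small = score(big - {x})
                if f_small > f_big:
                    increasing = False   # violates f(S_i) <= f(S_j)
                if f_small < f_big:
                    decreasing = False   # violates f(S_i) >= f(S_j)
    if increasing:
        return 1
    if decreasing:
        return 0
    return None

# Illustrative scores only (assumed, not taken from the routine document).
weights = {1: 0.5, 2: 2.0, 3: 1.0}
print(classify_score(3, lambda s: sum(weights[i] for i in s)))    # increasing
print(classify_score(3, lambda s: -sum(weights[i] for i in s)))   # decreasing
print(classify_score(3, lambda s: (len(s) - 2) ** 2))             # neither
```

The returned value matches the convention of the mincr argument, so a score classified as `None` is one for which h05aaf cannot be used.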

As well as returning the optimal subset, $S_{o 1}$, h05aaf can return the best $n$ solutions of size $p$. If $S_{o i}$ denotes the $i$th best subset, for $i = 1, 2, \dots, n - 1$, then the $(i + 1)$th best subset is defined as the solution to either

$$\underset{S \subseteq Ω - \{S_{o j} : j \in ℤ, 1 \leq j \leq i\}}{\text{maximize}} \; f (S) \quad \text{subject to} \quad |S| = p$$

or

$$\underset{S \subseteq Ω - \{S_{o j} : j \in ℤ, 1 \leq j \leq i\}}{\text{minimize}} \; f (S) \quad \text{subject to} \quad |S| = p$$

depending on the properties of $f$.
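For a small problem the definition of the $n$ best subsets can be made concrete by scoring every subset of size $p$ and sorting. The sketch below is an exhaustive reference, not the method used by h05aaf; the function name and the weight-based score are assumptions chosen so that property (1) holds.

```python
from itertools import combinations

def best_subsets_brute_force(m, p, n, score, maximize=True):
    """Score every size-p subset of features 1..m and return the n best."""
    scored = [(score(frozenset(s)), s)
              for s in combinations(range(1, m + 1), p)]
    scored.sort(key=lambda t: t[0], reverse=maximize)
    return scored[:n]

# Hypothetical increasing score: a sum of non-negative per-feature weights,
# so adding a feature never lowers the score.
weights = {1: 0.5, 2: 2.0, 3: 1.0, 4: 4.0}

def score(subset):
    return sum(weights[i] for i in subset)

top = best_subsets_brute_force(m=4, p=2, n=2, score=score)
print(top)  # [(6.0, (2, 4)), (5.0, (3, 4))]
```

Such an enumeration is only practical for small $m$, but it is a convenient way to check the output of a branch and bound search on test problems.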

The solutions are found using a branch and bound method, where each node of the tree is a subset of $Ω$. Assuming that (1) holds then a particular node, defined by subset $S_{i}$, can be trimmed from the tree if $f (S_{i}) < \hat{f} (S_{o n})$, where $\hat{f} (S_{o n})$ is the $n$th highest score we have observed so far for a subset of size $p$, i.e., our current best guess of the score for the $n$th best subset. In addition, because of (1) we can also drop all nodes defined by any subset $S_{j}$ where $S_{j} \subseteq S_{i}$, thus avoiding the need to enumerate the whole tree. Similar short cuts can be taken if (2) holds. A full description of this branch and bound algorithm can be found in Ridout (1988).
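The trimming rule can be illustrated with a simplified branch and bound search for the case where (1) holds. This is a sketch under stated assumptions, not the algorithm of Ridout (1988) nor the implementation in h05aaf: features are dropped in increasing index order, no reordering of features is attempted, and the function name and example score are hypothetical.

```python
import heapq

def branch_and_bound(m, p, n, score):
    """n best size-p subsets of {1..m} for a score satisfying property (1)."""
    full = frozenset(range(1, m + 1))
    best = []        # min-heap; best[0] holds the n-th highest score so far
    evaluated = 0    # number of score evaluations, including trimmed nodes

    def visit(dropped, last):
        nonlocal evaluated
        subset = full - dropped
        value = score(subset)
        evaluated += 1
        # Trim: every descendant is a subset of `subset`, so under (1)
        # none of them can score higher than `value`.
        if len(best) == n and value < best[0][0]:
            return
        if len(subset) == p:
            entry = (value, tuple(sorted(subset)))
            if len(best) < n:
                heapq.heappush(best, entry)
            else:
                heapq.heappushpop(best, entry)
            return
        still_to_drop = len(subset) - p
        # Drop features in increasing index order so that each subset is
        # generated exactly once; stop early enough to leave room for the
        # remaining drops.
        for j in range(last + 1, m - still_to_drop + 2):
            visit(dropped | {j}, j)

    visit(frozenset(), 0)
    return sorted(best, reverse=True), evaluated

# Hypothetical increasing score built from non-negative weights.
weights = {1: 0.5, 2: 2.0, 3: 1.0, 4: 4.0}
top, evaluated = branch_and_bound(4, 2, 2, lambda s: sum(weights[i] for i in s))
print(top, evaluated)
```

In this tiny example no internal node is trimmed, so the search evaluates more subsets than there are subsets of size $p$; the benefit of trimming appears on larger problems where whole branches are discarded.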

Rather than calculate the score at a given node of the tree, h05aaf utilizes the fast branch and bound algorithm of Somol et al. (2004), and attempts to estimate the score where possible. For each feature, $x_{i}$, two values are stored: a count $c_{i}$ and ${\hat{μ}}_{i}$, an estimate of the contribution of that feature. An initial value of zero is used for both $c_{i}$ and ${\hat{μ}}_{i}$. At any stage of the algorithm where both $f (S)$ and $f (S - \{x_{i}\})$ have been calculated (as opposed to estimated), the estimated contribution of the feature $x_{i}$ is updated to

$$\frac{c_{i} {\hat{μ}}_{i} + [f (S) - f (S - \{x_{i}\})]}{c_{i} + 1}$$

and $c_{i}$ is incremented by $1$. Therefore, at each stage ${\hat{μ}}_{i}$ is the mean contribution of $x_{i}$ observed so far and $c_{i}$ is the number of observations used to calculate that mean.
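The update above is an ordinary running mean. A minimal sketch follows; the function name and the sample scores are assumptions made for illustration.

```python
def update_contribution(c_i, mu_i, f_S, f_S_minus_xi):
    """Fold one observed contribution f(S) - f(S - {x_i}) into the mean."""
    new_mu = (c_i * mu_i + (f_S - f_S_minus_xi)) / (c_i + 1)
    return c_i + 1, new_mu

# Both values start at zero, as described in the text.
c, mu = 0, 0.0
# Hypothetical pairs (f(S), f(S - {x_i})) giving contributions 2, 1 and 3.
for f_S, f_without in [(10.0, 8.0), (7.0, 6.0), (9.0, 6.0)]:
    c, mu = update_contribution(c, mu, f_S, f_without)

print(c, mu)  # 3 2.0
```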

As long as $c_{i} \geq k$, for the user-supplied constant $k$, then rather than calculating $f (S - \{x_{i}\})$ this routine estimates it using $\hat{f} (S - \{x_{i}\}) = f (S) - γ {\hat{μ}}_{i}$, or $\hat{f} (S) - γ {\hat{μ}}_{i}$ if $f (S)$ has been estimated, where $γ$ is a user-supplied scaling factor. An estimated score is never used to trim a node or returned as the optimal score.
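The choice between estimating and calculating a score can be sketched as below. The function name, the dictionaries holding $c_{i}$ and ${\hat{μ}}_{i}$, and the `compute` callback are assumptions made for illustration; the returned flag records whether the value is an estimate, since an estimated score must not be used to trim a node.

```python
def score_without_feature(i, f_S, counts, means, k, gamma, compute):
    """Return (value, estimated) for the subset S - {x_i}.

    counts[i] and means[i] play the roles of c_i and mu_hat_i.
    `compute(i)` is a hypothetical callback that evaluates f(S - {x_i}).
    With k = 0 the score is always calculated, never estimated.
    """
    if k > 0 and counts[i] >= k:
        return f_S - gamma * means[i], True
    return compute(i), False

counts = {1: 3, 2: 0}
means = {1: 2.0, 2: 0.0}

def compute(i):
    return 5.5   # stand-in for a real evaluation

print(score_without_feature(1, 10.0, counts, means, k=2, gamma=1.0, compute=compute))
print(score_without_feature(2, 10.0, counts, means, k=2, gamma=1.0, compute=compute))
```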

Setting $k = 0$ in this routine will cause the algorithm to always calculate the scores, returning to the branch and bound algorithm of Ridout (1988). In most cases it is preferable to use the fast branch and bound algorithm, by setting $k > 0$, unless the score function is iterative in nature, i.e., $f (S)$ must have been calculated before $f (S - \{x_{i}\})$ can be calculated.

4 References

Narendra P M and Fukunaga K (1977) A branch and bound algorithm for feature subset selection IEEE Transactions on Computers 9 917–922

Ridout M S (1988) Algorithm AS 233: An improved branch and bound algorithm for feature subset selection Journal of the Royal Statistical Society, Series C (Applied Statistics) (Volume 37) 1 139–147

Somol P, Pudil P and Kittler J (2004) Fast branch and bound algorithms for optimal feature selection IEEE Transactions on Pattern Analysis and Machine Intelligence (Volume 26) 7 900–912

5 Arguments

Note: this routine uses reverse communication. Its use involves an initial entry, intermediate exits and re-entries, and a final exit, as indicated by the argument irevcm. Between intermediate exits and re-entries, all arguments other than bscore must remain unchanged.

1: $irevcm$ – Integer Input/Output

On initial entry: must be set to $0$.

On intermediate exit: $irevcm = 1$ and before re-entry the scores associated with la subsets must be calculated and returned in bscore.

The la subsets are constructed as follows:

$drop = 1$: The $j$th subset is constructed by dropping the features specified in the first lz elements of z and the single feature given in $a (j)$ from the full set of features, $Ω$. The subset will therefore contain $m - lz - 1$ features.
$drop = 0$: The $j$th subset is constructed by adding the features specified in the first lz elements of z and the single feature specified in $a (j)$ to the empty set, $\emptyset$. The subset will therefore contain $lz + 1$ features.

In both cases the individual features are referenced by the integers $1$ to m, with $1$ indicating the first feature, $2$ the second, etc., for some arbitrary ordering of the features. The same ordering must be used in all calls to h05aaf.

If $la = 0$, the score for a single subset should be returned. This subset is constructed by adding or removing only those features specified in the first lz elements of z. If $lz = 0$, this subset will be either $Ω$ or $\emptyset$.

The score associated with the $j$th subset must be returned in $bscore (j)$.

On intermediate re-entry: irevcm must remain unchanged.

On final exit: $irevcm = 0$, and the algorithm has terminated.

Constraint: $irevcm = 0$ or $1$.

Note: any values you return to h05aaf as part of the reverse communication procedure should not include floating-point NaN (Not a Number) or infinity values, since these are not handled by h05aaf. If your code does inadvertently return any NaNs or infinities, h05aaf is likely to produce unexpected results.
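The construction of the subsets requested on an intermediate exit can be sketched as follows. This is an illustration only: the function name is an assumption, and Python lists are indexed from zero, so `z[:lz]` and `a[:la]` stand for the first lz elements of z and the first la elements of a.

```python
def subsets_to_score(m, drop, lz, z, la, a):
    """List the feature subsets (as sorted lists) whose scores are requested."""
    base = set(z[:lz])
    if la > 0:
        # One subset per element of a, each combined with the features in z.
        variants = [base | {a[j]} for j in range(la)]
    else:
        # la = 0: a single subset defined by z alone.
        variants = [base]
    if drop == 1:
        full = set(range(1, m + 1))
        return [sorted(full - v) for v in variants]   # drop from the full set
    return [sorted(v) for v in variants]              # add to the empty set

# Hypothetical values as they might appear on an intermediate exit.
print(subsets_to_score(m=5, drop=1, lz=1, z=[2, 0, 0], la=2, a=[4, 5, 0, 0, 0]))
print(subsets_to_score(m=5, drop=0, lz=1, z=[2, 0, 0], la=2, a=[4, 5, 0, 0, 0]))
```

The score of the $j$th list returned would then be stored in $bscore (j)$ before re-entry.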

2: $mincr$ – Integer Input

On entry: flag indicating whether the scoring function $f$ is increasing or decreasing.

$mincr = 1$: $f (S_{i}) \leq f (S_{j})$, i.e., the subsets with the largest score will be selected.
$mincr = 0$: $f (S_{i}) \geq f (S_{j})$, i.e., the subsets with the smallest score will be selected.

For all $S_{j} \subseteq Ω$ and $S_{i} \subseteq S_{j}$.

Constraint: $mincr = 0$ or $1$.

3: $m$ – Integer Input

On entry: $m$, the number of features in the full feature set.

Constraint: $m \geq 2$.

4: $ip$ – Integer Input

On entry: $p$, the number of features in the subset of interest.

Constraint: $1 \leq ip \leq m$.

5: $nbest$ – Integer Input

On entry: $n$, the maximum number of best subsets required. The actual number of subsets returned is given by la on final exit. If on final exit $la \neq nbest$ then $ifail = 53$ is returned.

Constraint: $nbest \geq 1$.

6: $drop$ – Integer Input/Output

On initial entry: drop need not be set.

On intermediate exit: flag indicating whether the intermediate subsets should be constructed by dropping features from the full set ($drop = 1$) or adding features to the empty set ($drop = 0$). See irevcm for details.

On intermediate re-entry: drop must remain unchanged.

On final exit: drop is undefined.

7: $lz$ – Integer Input/Output

On initial entry: lz need not be set.

On intermediate exit: the number of features stored in z.

On intermediate re-entry: lz must remain unchanged.

On final exit: lz is undefined.

8: $z (m - ip)$ – Integer array Input/Output

On initial entry: z need not be set.

On intermediate exit: $z (i)$, for $i = 1, 2, \dots, lz$, contains the list of features which, along with those specified in a, define the subsets whose score is required. See irevcm for additional details.

On intermediate re-entry: z must remain unchanged.

On final exit: z is undefined.

9: $la$ – Integer Input/Output

On initial entry: la need not be set.

On intermediate exit: if $la > 0$, the number of subsets for which a score must be returned. If $la = 0$, the score for a single subset should be returned. See irevcm for additional details.

On intermediate re-entry: la must remain unchanged.

On final exit: the number of best subsets returned.

10: $a (\max (nbest, m))$ – Integer array Input/Output

On initial entry: a need not be set.

On intermediate exit: $a (j)$, for $j = 1, 2, \dots, la$, contains the list of features which, along with those specified in z, define the subsets whose score is required. See irevcm for additional details.

On intermediate re-entry: a must remain unchanged.

On final exit: a is undefined.

11: $bscore (\max (nbest, m))$ – Real (Kind=nag_wp) array Input/Output

On initial entry: bscore need not be set.

On intermediate exit: bscore is undefined.

On intermediate re-entry: $bscore (j)$ must hold the score for the $j$th subset as described in irevcm.

On final exit: holds the score for the la best subsets returned in bz.

12: $bz (m - ip, nbest)$ – Integer array Input/Output

On initial entry: bz need not be set.

On intermediate exit: bz is used for storage between calls to h05aaf.

On intermediate re-entry: bz must remain unchanged.

On final exit: the $j$th best subset is constructed by dropping the features specified in $bz (i, j)$, for $i = 1, 2, \dots, m - ip$ and $j = 1, 2, \dots, la$, from the set of all features, $Ω$. The score for the $j$th best subset is given in $bscore (j)$.
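Because bz lists the features that were dropped, the features retained in the $j$th best subset are the complement of column $j$. A minimal sketch, assuming the column has already been copied into a Python list (the function name is hypothetical):

```python
def kept_features(m, dropped_column):
    """Features 1..m that remain after dropping those in one column of bz."""
    dropped = set(dropped_column)
    return [i for i in range(1, m + 1) if i not in dropped]

# Hypothetical column of bz for m = 6 and ip = 4 (two features dropped).
print(kept_features(6, [2, 5]))  # [1, 3, 4, 6]
```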

13: $mincnt$ – Integer Input

On entry: $k$, the minimum number of times the effect of each feature, $x_{i}$, must have been observed before $f (S - \{x_{i}\})$ is estimated from $f (S)$ as opposed to being calculated directly.

If $k = 0$ then $f (S - \{x_{i}\})$ is never estimated. If $mincnt < 0$ then $k$ is set to $1$.

14: $gamma$ – Real (Kind=nag_wp) Input

On entry: $γ$, the scaling factor used when estimating scores. If $gamma < 0$ then $γ = 1$ is used.

15: $acc (2)$ – Real (Kind=nag_wp) array Input

On entry: a measure of the accuracy of the scoring function, $f$.

Letting $a_{i} = ε_{1} |f (S_{i})| + ε_{2}$, then when confirming whether the scoring function is strictly increasing or decreasing (as described in mincr), or when assessing whether a node defined by subset $S_{i}$ can be trimmed, any values in the range $f (S_{i}) \pm a_{i}$ are treated as being numerically equivalent.

If $0 \leq acc (1) \leq 1$ then $ε_{1} = acc (1)$, otherwise $ε_{1} = 0$.

If $acc (2) \geq 0$ then $ε_{2} = acc (2)$, otherwise $ε_{2} = 0$.

In most situations setting both $ε_{1}$ and $ε_{2}$ to zero should be sufficient. Using a nonzero value, when one is not required, can significantly increase the number of subsets that need to be evaluated.
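The tolerance $a_{i}$ and the resulting equivalence test can be sketched as below. The function names are assumptions, and `acc` is a two-element sequence whose entries `acc[0]` and `acc[1]` stand for $acc (1)$ and $acc (2)$.

```python
def tolerance(f_value, acc):
    """a_i = eps1 * |f(S_i)| + eps2, with out-of-range acc values treated as 0."""
    eps1 = acc[0] if 0.0 <= acc[0] <= 1.0 else 0.0
    eps2 = acc[1] if acc[1] >= 0.0 else 0.0
    return eps1 * abs(f_value) + eps2

def numerically_equivalent(f_i, other, acc):
    """True if `other` lies in the range f_i +/- a_i."""
    return abs(other - f_i) <= tolerance(f_i, acc)

print(tolerance(-8.0, (0.25, 0.5)))                     # 2.5
print(numerically_equivalent(-8.0, -6.0, (0.25, 0.5)))  # True
print(numerically_equivalent(-8.0, -6.0, (0.0, 0.0)))   # False
```

A wider tolerance makes more scores count as ties, which is consistent with the remark that a nonzero value can increase the number of subsets evaluated.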

16: $icomm (licomm)$ – Integer array Communication Array

On initial entry: icomm need not be set.

On intermediate exit: icomm is used for storage between calls to h05aaf.

On intermediate re-entry: icomm must remain unchanged.

On final exit: icomm is not defined. The first two elements, $icomm (1)$ and $icomm (2)$, contain the minimum required value for licomm and lrcomm respectively.

17: $licomm$ – Integer Input

On entry: the length of the array icomm. If licomm is too small and $licomm \geq 2$ then $ifail = 172$ is returned and the minimum values for licomm and lrcomm are given by $icomm (1)$ and $icomm (2)$ respectively.

Constraints:

if $mincnt = 0$ , $licomm \geq 2 \times \max (nbest, m) + m (m + 2) + (m + 1) \times \max (m - ip, 1) + 27$ ;
otherwise $licomm \geq 2 \times \max (nbest, m) + m (m + 3) + (2 m + 1) \times \max (m - ip, 1) + 25$ .

18: $rcomm (lrcomm)$ – Real (Kind=nag_wp) array Communication Array

On initial entry: rcomm need not be set.

On intermediate exit: rcomm is used for storage between calls to h05aaf.

On intermediate re-entry: rcomm must remain unchanged.

On final exit: rcomm is not defined.

19: $lrcomm$ – Integer Input

On entry: the length of the array rcomm. If lrcomm is too small and $licomm \geq 2$ then $ifail = 172$ is returned and the minimum values for licomm and lrcomm are given by $icomm (1)$ and $icomm (2)$ respectively.

Constraints:

if $mincnt = 0$ , $lrcomm \geq 9 + nbest + m \times \max (m - ip, 1)$ ;
otherwise $lrcomm \geq 8 + m + nbest + m \times \max (m - ip, 1)$ .
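The constraints on licomm and lrcomm can be evaluated ahead of the first call to size the communication arrays. The sketch below transcribes the two pairs of formulas given for licomm and lrcomm; the function name and the example dimensions are assumptions.

```python
def min_workspace(m, ip, nbest, mincnt):
    """Return the minimum (licomm, lrcomm) from the documented constraints."""
    mx = max(nbest, m)
    depth = max(m - ip, 1)
    if mincnt == 0:
        licomm = 2 * mx + m * (m + 2) + (m + 1) * depth + 27
        lrcomm = 9 + nbest + m * depth
    else:
        licomm = 2 * mx + m * (m + 3) + (2 * m + 1) * depth + 25
        lrcomm = 8 + m + nbest + m * depth
    return licomm, lrcomm

# Hypothetical problem: 14 features, subsets of size 5, 3 best subsets.
print(min_workspace(m=14, ip=5, nbest=3, mincnt=0))  # (414, 138)
print(min_workspace(m=14, ip=5, nbest=3, mincnt=1))  # (552, 151)
```

Alternatively, the values returned in $icomm (1)$ and $icomm (2)$ after an $ifail = 172$ exit give the same information without evaluating the formulas by hand.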

20: $ifail$ – Integer Input/Output

On initial entry: ifail must be set to $0$, $-1$ or $1$. If you are unfamiliar with this argument you should refer to Section 3.4 in How to Use the NAG Library and its Documentation for details.

For environments where it might be inappropriate to halt program execution when an error is detected, the value $-1$ or $1$ is recommended. If the output of error messages is undesirable, then the value $1$ is recommended. Otherwise, because for this routine the values of the output arguments may be useful even if $ifail \neq 0$ on exit, the recommended value is $-1$. When the value $-1$ or $1$ is used it is essential to test the value of ifail on exit.

On final exit: $ifail = 0$ unless the routine detects an error or a warning has been flagged (see Section 6).

6 Error Indicators and Warnings

If on entry $ifail = 0$ or $-1$, explanatory error messages are output on the current error message unit (as defined by x04aaf).

Errors or warnings detected by the routine:

$ifail = 11$: On entry, $irevcm = 〈value〉$ .
Constraint: $irevcm = 0$ or $1$ .

$ifail = 21$: On entry, $mincr = 〈value〉$ .
Constraint: $mincr = 0$ or $1$ .

$ifail = 22$: mincr has changed between calls.
On intermediate entry, $mincr = 〈value〉$ .
On initial entry, $mincr = 〈value〉$ .

$ifail = 31$: On entry, $m = 〈value〉$ .
Constraint: $m \geq 2$ .

$ifail = 32$: m has changed between calls.
On intermediate entry, $m = 〈value〉$ .
On initial entry, $m = 〈value〉$ .

$ifail = 41$: On entry, $ip = 〈value〉$ and $m = 〈value〉$ .
Constraint: $1 \leq ip \leq m$ .

$ifail = 42$: ip has changed between calls.
On intermediate entry, $ip = 〈value〉$ .
On initial entry, $ip = 〈value〉$ .

$ifail = 51$: On entry, $nbest = 〈value〉$ .
Constraint: $nbest \geq 1$ .

$ifail = 52$: nbest has changed between calls.
On intermediate entry, $nbest = 〈value〉$ .
On initial entry, $nbest = 〈value〉$ .

$ifail = 53$: On entry, $nbest = 〈value〉$ .
But only $〈value〉$ best subsets could be calculated.

$ifail = 61$: drop has changed between calls.
On intermediate entry, $drop = 〈value〉$ .
On initial entry, $drop = 〈value〉$ .

$ifail = 71$: lz has changed between calls.
On entry, $lz = 〈value〉$ .
On previous exit, $lz = 〈value〉$ .

$ifail = 91$: la has changed between calls.
On entry, $la = 〈value〉$ .
On previous exit, $la = 〈value〉$ .

$ifail = 111$: $bscore (〈value〉) = 〈value〉$ , which is inconsistent with the score for the parent node. Score for the parent node is $〈value〉$ .

$ifail = 131$: mincnt has changed between calls.
On intermediate entry, $mincnt = 〈value〉$ .
On initial entry, $mincnt = 〈value〉$ .

$ifail = 141$: gamma has changed between calls.
On intermediate entry, $gamma = 〈value〉$ .
On initial entry, $gamma = 〈value〉$ .

$ifail = 151$: $acc (1)$ has changed between calls.
On intermediate entry, $acc (1) = 〈value〉$ .
On initial entry, $acc (1) = 〈value〉$ .

$ifail = 152$: $acc (2)$ has changed between calls.
On intermediate entry, $acc (2) = 〈value〉$ .
On initial entry, $acc (2) = 〈value〉$ .

$ifail = 161$: icomm has been corrupted between calls.

$ifail = 171$: On entry, $licomm = 〈value〉$ , $lrcomm = 〈value〉$ .
Constraint: $licomm \geq 〈value〉$ , $lrcomm \geq 〈value〉$ .
icomm is too small to return the required array sizes.

$ifail = 172$: On entry, $licomm = 〈value〉$ , $lrcomm = 〈value〉$ .
Constraint: $licomm \geq 〈value〉$ , $lrcomm \geq 〈value〉$ .
The minimum required values for licomm and lrcomm are returned in $icomm (1)$ and $icomm (2)$ respectively.

$ifail = 181$: rcomm has been corrupted between calls.

$ifail = - 99$: An unexpected error has been triggered by this routine. Please contact NAG.
See Section 3.9 in How to Use the NAG Library and its Documentation for further information.

$ifail = - 399$: Your licence key may have expired or may not have been installed correctly.
See Section 3.8 in How to Use the NAG Library and its Documentation for further information.

$ifail = - 999$: Dynamic memory allocation failed.
See Section 3.7 in How to Use the NAG Library and its Documentation for further information.

7 Accuracy

The subsets returned by h05aaf are guaranteed to be optimal up to the accuracy of your calculated scores.

8 Parallelism and Performance

h05aaf is threaded by NAG for parallel execution in multithreaded implementations of the NAG Library.

Please consult the X06 Chapter Introduction for information on how to control and interrogate the OpenMP environment used within this routine. Please also consult the Users' Note for your implementation for any additional implementation-specific information.

9 Further Comments

The maximum number of unique subsets of size $p$ from a set of $m$ features is $N = \frac{m!}{(m - p)! p!}$. The efficiency of the branch and bound algorithm implemented in h05aaf comes from evaluating subsets at internal nodes of the tree, that is subsets with more than $p$ features, and where possible trimming branches of the tree based on the scores at these internal nodes as described in Narendra and Fukunaga (1977). Because of this it is possible, in some circumstances, for more than $N$ subsets to be evaluated. This will tend to happen when most of the features have a similar effect on the subset score.
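The value of $N$ is a binomial coefficient and can be computed directly to gauge the size of the search space. A short sketch, using hypothetical values of $m$ and $p$ and `math.comb` from the Python standard library (Python 3.8 or later):

```python
from math import comb, factorial

m, p = 14, 5   # hypothetical problem size

# N = m! / ((m - p)! p!)
n_subsets = comb(m, p)
print(n_subsets)  # 2002

# The same value written out from the factorial formula.
n_from_factorials = factorial(m) // (factorial(m - p) * factorial(p))
```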

If multiple optimal subsets exist with the same score, and nbest is too small to return them all, then the choice of which of these optimal subsets is returned is arbitrary.

10 Example

This example finds the three linear regression models, with five variables, that have the smallest residual sums of squares when fitted to a supplied dataset. The data used in this example was simulated.

h05aaf (best_subset_given_size_revcomm)
