NAG FL Interface
h05abf (best_subset_given_size)
1
Purpose
Given a set of $m$ features and a scoring mechanism for any subset of those features, h05abf selects the best $n$ subsets of size $p$ using a direct communication branch and bound algorithm.
2
Specification
Fortran Interface
Subroutine h05abf (mincr, m, ip, nbest, la, bscore, bz, f, mincnt, gamma, acc, iuser, ruser, ifail)
Integer, Intent (In) :: mincr, m, ip, nbest, mincnt
Integer, Intent (Inout) :: iuser(*), ifail
Integer, Intent (Out) :: la, bz(m-ip,nbest)
Real (Kind=nag_wp), Intent (In) :: gamma, acc(2)
Real (Kind=nag_wp), Intent (Inout) :: ruser(*)
Real (Kind=nag_wp), Intent (Out) :: bscore(nbest)
External :: f
C Header Interface
#include <nag.h>
void h05abf_ (const Integer *mincr, const Integer *m, const Integer *ip, const Integer *nbest, Integer *la, double bscore[], Integer bz[], void (NAG_CALL *f)(const Integer *m, const Integer *drop, const Integer *lz, const Integer z[], const Integer *la, const Integer a[], double score[], Integer iuser[], double ruser[], Integer *info), const Integer *mincnt, const double *gamma, const double acc[], Integer iuser[], double ruser[], Integer *ifail)
C++ Header Interface
#include <nag.h>
extern "C" {
void h05abf_ (const Integer &mincr, const Integer &m, const Integer &ip, const Integer &nbest, Integer &la, double bscore[], Integer bz[], void (NAG_CALL *f)(const Integer &m, const Integer &drop, const Integer &lz, const Integer z[], const Integer &la, const Integer a[], double score[], Integer iuser[], double ruser[], Integer &info), const Integer &mincnt, const double &gamma, const double acc[], Integer iuser[], double ruser[], Integer &ifail)
}
The routine may be called by the names h05abf or nagf_mip_best_subset_given_size.
3
Description
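Given $\Omega = \{x_i : i \in \mathbb{Z}, 1 \le i \le m\}$, a set of $m$ unique features, and a scoring mechanism $f(S)$ defined for all $S \subseteq \Omega$, h05abf is designed to find $S_{o1} \subseteq \Omega$, $|S_{o1}| = p$, an optimal subset of size $p$. Here $|S_{o1}|$ denotes the cardinality of $S_{o1}$, the number of elements in the set.
The definition of the optimal subset depends on the properties of the scoring mechanism. If
$$f(S_i) \le f(S_j), \quad \text{for all } S_j \subseteq \Omega \text{ and } S_i \subseteq S_j \tag{1}$$
then the optimal subset is defined as one of the solutions to
$$\underset{S \subseteq \Omega}{\text{maximize}} \; f(S) \quad \text{subject to} \quad |S| = p,$$
else if
$$f(S_i) \ge f(S_j), \quad \text{for all } S_j \subseteq \Omega \text{ and } S_i \subseteq S_j \tag{2}$$
then the optimal subset is defined as one of the solutions to
$$\underset{S \subseteq \Omega}{\text{minimize}} \; f(S) \quad \text{subject to} \quad |S| = p.$$
If neither of these properties holds then h05abf cannot be used.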
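Whether a scoring function has the increasing property (1) or the decreasing property (2) can be checked by brute force on a small feature set before the routine is used. The following is a minimal sketch, not part of the NAG Library: it assumes subsets are encoded as bitmasks, and `classify_monotonicity` and the three toy score functions are hypothetical names introduced for illustration. Comparing each subset with its one-feature extensions is sufficient, because larger inclusions follow by transitivity.

```c
/* Illustrative only: subsets of m <= 20 features are encoded as bitmasks. */
typedef double (*subset_score_fn)(unsigned mask);

/* Number of features present in a subset. */
int subset_size(unsigned mask)
{
    int n = 0;
    for (; mask != 0u; mask >>= 1) n += (int)(mask & 1u);
    return n;
}

/* Hypothetical toy scores: increasing, decreasing, and neither. */
double score_increasing(unsigned mask) { return (double)subset_size(mask); }
double score_decreasing(unsigned mask) { return 16.0 / (1.0 + subset_size(mask)); }
double score_neither(unsigned mask)    { return (double)(subset_size(mask) % 2); }

/* Returns 1 if adding a feature never lowers the score (property (1)),
 * 0 if adding a feature never raises it (property (2)), and -1 if
 * neither holds. A constant score satisfies both and is reported as 1. */
int classify_monotonicity(subset_score_fn f, int m)
{
    int increasing = 1, decreasing = 1;
    unsigned full = (1u << m) - 1u;
    for (unsigned s = 0; s <= full; ++s) {
        for (int i = 0; i < m; ++i) {
            unsigned bit = 1u << i;
            if (s & bit) continue;          /* feature already in the subset */
            double smaller = f(s);          /* score of S_i */
            double larger = f(s | bit);     /* score of S_j = S_i plus one feature */
            if (smaller > larger) increasing = 0;
            if (smaller < larger) decreasing = 0;
        }
    }
    if (increasing) return 1;
    if (decreasing) return 0;
    return -1;
}
```

The cost grows as $m 2^m$ score evaluations, so a check of this kind is only practical for a handful of features.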
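As well as returning the optimal subset, $S_{o1}$, h05abf can return the best $n$ solutions of size $p$. If $S_{oi}$ denotes the $i$th best subset, for $i = 1, 2, \ldots, n-1$, then the $(i+1)$th best subset is defined as the solution to either
$$\underset{S \subseteq \Omega - \{S_{oj} : j \in \mathbb{Z}, 1 \le j \le i\}}{\text{maximize}} \; f(S) \quad \text{subject to} \quad |S| = p$$
or
$$\underset{S \subseteq \Omega - \{S_{oj} : j \in \mathbb{Z}, 1 \le j \le i\}}{\text{minimize}} \; f(S) \quad \text{subject to} \quad |S| = p$$
depending on the properties of $f$.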
The solutions are found using a branch and bound method, where each node of the tree is a subset of $\Omega$. Assuming that (1) holds then a particular node, defined by subset $S_i$, can be trimmed from the tree if $f(S_i) < \hat{f}(S_{on})$, where $\hat{f}(S_{on})$ is the $n$th highest score we have observed so far for a subset of size $p$, i.e., our current best guess of the score for the $n$th best subset. In addition, because of (1) we can also drop all nodes defined by any subset $S_j$ where $S_j \subseteq S_i$, thus avoiding the need to enumerate the whole tree. Similar short cuts can be taken if (2) holds. A full description of this branch and bound algorithm can be found in Ridout (1988).
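The trimming test for the increasing case can be sketched as follows. This is an illustration under stated assumptions, not the routine's internal code: `best_list`, `record_leaf` and `can_trim` are hypothetical names, the list holds the scores of size-$p$ subsets seen so far in descending order, and no tolerance is applied to the comparison.

```c
#define MAX_BEST 16

/* Scores of the best size-p subsets observed so far (largest first). */
typedef struct {
    double score[MAX_BEST];
    int found;   /* number of scores currently stored */
    int n;       /* number of best subsets requested, n <= MAX_BEST */
} best_list;

/* Record the score of a newly evaluated subset of size p. */
void record_leaf(best_list *b, double s)
{
    int pos = b->found;
    if (b->found == b->n) {
        if (s <= b->score[b->n - 1]) return;  /* not among the n best */
        pos = b->n - 1;                       /* replace the current nth best */
    } else {
        b->found++;
    }
    /* Shift smaller scores down to keep the list sorted. */
    while (pos > 0 && b->score[pos - 1] < s) {
        b->score[pos] = b->score[pos - 1];
        --pos;
    }
    b->score[pos] = s;
}

/* A node can be trimmed once n candidates exist and its (calculated)
 * score is below the nth highest score seen so far. */
int can_trim(const best_list *b, double node_score)
{
    return b->found == b->n && node_score < b->score[b->n - 1];
}
```

Until $n$ candidate subsets have been recorded there is no bound to compare against, so `can_trim` reports that nothing can be trimmed.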
Rather than calculate the score at a given node of the tree, h05abf utilizes the fast branch and bound algorithm of Somol et al. (2004), and attempts to estimate the score where possible. For each feature, $x_i$, two values are stored: a count, $c_i$, and $\hat{\mu}_i$, an estimate of the contribution of that feature. An initial value of zero is used for both $c_i$ and $\hat{\mu}_i$. At any stage of the algorithm where both $f(S)$ and $f(S - \{x_i\})$ have been calculated (as opposed to estimated), the estimated contribution of the feature $x_i$ is updated to
$$\frac{c_i \hat{\mu}_i + [f(S) - f(S - \{x_i\})]}{c_i + 1}$$
and $c_i$ is incremented by $1$. Therefore, at each stage $\hat{\mu}_i$ is the mean contribution of $x_i$ observed so far and $c_i$ is the number of observations used to calculate that mean.
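The per-feature bookkeeping amounts to a running mean. A minimal sketch is given below; `feature_stat` and `observe_contribution` are hypothetical names used only for illustration, and the contribution is taken to be the parent score minus the score with the feature removed.

```c
/* Per-feature statistics: the count and the mean contribution so far. */
typedef struct {
    int count;     /* number of observed contributions */
    double mean;   /* mean of the observed contributions */
} feature_stat;

/* Update the statistics for one feature after both the parent score and
 * the score with that feature removed have been calculated directly. */
void observe_contribution(feature_stat *st, double f_parent, double f_child)
{
    double contribution = f_parent - f_child;
    st->mean = (st->count * st->mean + contribution) / (st->count + 1);
    st->count += 1;
}
```

Starting from a zero count and zero mean, two observations with contributions 3 and 1 leave a mean of 2 and a count of 2.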
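As long as $c_i \ge k$, for the user-supplied constant $k$, then rather than calculating $f(S - \{x_i\})$ this routine estimates it using $\hat{f}(S - \{x_i\}) = f(S) - \gamma \hat{\mu}_i$, or $\hat{f}(S) - \gamma \hat{\mu}_i$ if $f(S)$ has been estimated, where $\gamma$ is a user-supplied scaling factor. An estimated score is never used to trim a node or returned as the optimal score.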
Setting $k = 0$ in this routine will cause the algorithm to always calculate the scores, returning to the branch and bound algorithm of Ridout (1988). In most cases it is preferable to use the fast branch and bound algorithm, by setting $k > 0$, unless the score function is iterative in nature, i.e., $f(S)$ must have been calculated before $f(S - \{x_i\})$ can be calculated.
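The decision between estimating and calculating a score can be sketched as below. This is an illustration only: `feature_stat` and `estimate_child` are hypothetical names, and the sketch assumes the estimate is the parent score minus the scaling factor times the mean contribution, with a threshold of zero disabling estimation altogether.

```c
/* Per-feature statistics: observation count and mean contribution. */
typedef struct {
    int count;
    double mean;
} feature_stat;

/* Try to estimate the score obtained by removing one feature from a
 * subset whose (calculated or estimated) score is parent_score.
 * Returns 1 and writes the estimate if enough observations exist,
 * otherwise returns 0 so that the caller calculates the score directly. */
int estimate_child(const feature_stat *st, int k, double gamma,
                   double parent_score, double *estimate)
{
    if (k <= 0 || st->count < k) {
        return 0;   /* k = 0: never estimate; otherwise too few observations */
    }
    *estimate = parent_score - gamma * st->mean;
    return 1;
}
```

A caller would still need to calculate the true score before using a node for trimming or reporting it as a result, since estimates are never used for either purpose.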
h05abf is a direct communication version of h05aaf.
4
References
Narendra P M and Fukunaga K (1977) A branch and bound algorithm for feature subset selection IEEE Transactions on Computers 9 917–922
Ridout M S (1988) Algorithm AS 233: An improved branch and bound algorithm for feature subset selection Journal of the Royal Statistical Society, Series C (Applied Statistics) (Volume 37) 1 139–147
Somol P, Pudil P and Kittler J (2004) Fast branch and bound algorithms for optimal feature selection IEEE Transactions on Pattern Analysis and Machine Intelligence (Volume 26) 7 900–912
5
Arguments
1: mincr – Integer Input
On entry: flag indicating whether the scoring function $f$ is increasing or decreasing.
- mincr = 1: $f(S_i) \le f(S_j)$, i.e., the subsets with the largest score will be selected.
- mincr = 0: $f(S_i) \ge f(S_j)$, i.e., the subsets with the smallest score will be selected.
For all $S_j \subseteq \Omega$ and $S_i \subseteq S_j$.
Constraint: mincr = 0 or 1.
2: m – Integer Input
On entry: $m$, the number of features in the full feature set.
Constraint: m ≥ 2.
3: ip – Integer Input
On entry: $p$, the number of features in the subset of interest.
Constraint: 1 ≤ ip ≤ m.
4: nbest – Integer Input
On entry: $n$, the maximum number of best subsets required. The actual number of subsets returned is given by la on final exit. If on final exit la ≠ nbest then ifail = 53 is returned.
Constraint: nbest ≥ 1.
5: la – Integer Output
On exit: the number of best subsets returned.
6: bscore(nbest) – Real (Kind=nag_wp) array Output
On exit: holds the score for the la best subsets returned in bz.
7: bz(m-ip,nbest) – Integer array Output
On exit: the $j$th best subset is constructed by dropping the features specified in bz(i,j), for i = 1, 2, …, m − ip and j = 1, 2, …, la, from the set of all features, $\Omega$. The score for the $j$th best subset is given in bscore(j).
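Because bz lists the features that were dropped, the retained features have to be derived by the caller. A minimal sketch, assuming bz is viewed from C as a flat array in Fortran column-major order with m − ip entries per subset, and using plain `int` as a stand-in for the library's `Integer` type; `kept_features` is a hypothetical helper name.

```c
#include <stddef.h>

/* Write the features retained in the j-th best subset (j counted from 1)
 * into kept[] in ascending order and return how many there are.
 * bz is the flat column-major image of a (m-ip) by nbest array. */
int kept_features(int m, int ip, const int bz[], int j, int kept[])
{
    int ndrop = m - ip;
    const int *col = bz + (size_t)(j - 1) * (size_t)ndrop;  /* column j */
    int nkept = 0;
    for (int feat = 1; feat <= m; ++feat) {
        int dropped = 0;
        for (int i = 0; i < ndrop; ++i) {
            if (col[i] == feat) { dropped = 1; break; }
        }
        if (!dropped) kept[nkept++] = feat;
    }
    return nkept;
}
```

For example, with five features, subsets of size three, and a first column holding 2 and 4, the retained features are 1, 3 and 5.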
8: f – Subroutine, supplied by the user. External Procedure
f must evaluate the scoring function $f$.
The specification of f is:
Fortran Interface
Subroutine f (m, drop, lz, z, la, a, score, iuser, ruser, info)
Integer, Intent (In) :: m, drop, lz, z(lz), la, a(la)
Integer, Intent (Inout) :: iuser(*), info
Real (Kind=nag_wp), Intent (Inout) :: ruser(*)
Real (Kind=nag_wp), Intent (Out) :: score(max(la,1))
C Header Interface
void f_ (const Integer *m, const Integer *drop, const Integer *lz, const Integer z[], const Integer *la, const Integer a[], double score[], Integer iuser[], double ruser[], Integer *info)
C++ Header Interface
#include <nag.h>
extern "C" {
void f_ (const Integer &m, const Integer &drop, const Integer &lz, const Integer z[], const Integer &la, const Integer a[], double score[], Integer iuser[], double ruser[], Integer &info)
}
    1: m – Integer Input
    On entry: $m$, the number of features in the full feature set.
    2: drop – Integer Input
    On entry: flag indicating whether the intermediate subsets should be constructed by dropping features from the full set (drop = 1) or adding features to the empty set (drop = 0). See score for additional details.
    3: lz – Integer Input
    On entry: the number of features stored in z.
    4: z(lz) – Integer array Input
    On entry: z(i), for i = 1, 2, …, lz, contains the list of features which, along with those specified in a, define the subsets whose score is required. See score for additional details.
    5: la – Integer Input
    On entry: if la > 0, the number of subsets for which a score must be returned.
    If la = 0, the score for a single subset should be returned. See score for additional details.
    6: a(la) – Integer array Input
    On entry: a(j), for j = 1, 2, …, la, contains the list of features which, along with those specified in z, define the subsets whose score is required. See score for additional details.
    7: score(max(la,1)) – Real (Kind=nag_wp) array Output
    On exit: the value $f(S_j)$, for j = 1, 2, …, la, the score associated with the $j$th subset. $S_j$ is constructed as follows:
    - drop = 1: $S_j$ is constructed by dropping the features specified in the first lz elements of z and the single feature given in a(j) from the full set of features, $\Omega$. The subset will therefore contain m − lz − 1 features.
    - drop = 0: $S_j$ is constructed by adding the features specified in the first lz elements of z and the single feature specified in a(j) to the empty set, $\emptyset$. The subset will therefore contain lz + 1 features.
    In both cases the individual features are referenced by the integers 1 to m, with 1 indicating the first feature, 2 the second, etc., for some arbitrary ordering of the features, chosen by you prior to calling h05abf. For example, 1 might refer to the first variable in a particular set of data, 2 the second, etc.
    If la = 0, the score for a single subset should be returned. This subset is constructed by adding or removing only those features specified in the first lz elements of z. If lz = 0, this subset will either be $\Omega$ or $\emptyset$.
    8: iuser(*) – Integer array User Workspace
    9: ruser(*) – Real (Kind=nag_wp) array User Workspace
    f is called with the arguments iuser and ruser as supplied to h05abf. You should use the arrays iuser and ruser to supply information to f.
    10: info – Integer Input/Output
    On entry: info = 0.
    On exit: set info to a nonzero value if you wish h05abf to terminate with ifail = 82.
f must either be a module subprogram USEd by, or declared as EXTERNAL in, the (sub)program from which h05abf is called. Arguments denoted as Input must not be changed by this procedure.
Note: f should not return floating-point NaN (Not a Number) or infinity values, since these are not handled by h05abf. If your code inadvertently does return any NaNs or infinities, h05abf is likely to produce unexpected results.
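A scoring callback with the argument layout of the C header interface might look like the sketch below. It is a toy example under stated assumptions, not code from the NAG Library: the score of a subset is the sum of per-feature weights that the caller is assumed to have placed in ruser, `toy_score` is a hypothetical name, `Integer` is declared locally as a stand-in for the type defined in nag.h, and the `NAG_CALL` qualifier is omitted so that the block compiles on its own. With non-negative weights this score is increasing in the sense required for selecting the largest scores.

```c
typedef int Integer;   /* stand-in only; nag.h defines the real type */

/* Toy additive score: ruser[i-1] holds the assumed weight of feature i. */
void toy_score(const Integer *m, const Integer *drop, const Integer *lz,
               const Integer z[], const Integer *la, const Integer a[],
               double score[], Integer iuser[], double ruser[], Integer *info)
{
    (void)iuser;   /* not needed by this toy score */

    /* Score of the subset defined by z alone. */
    double base = 0.0;
    if (*drop == 1) {
        for (Integer i = 0; i < *m; ++i) base += ruser[i];            /* full set */
        for (Integer i = 0; i < *lz; ++i) base -= ruser[z[i] - 1];    /* drop z */
    } else {
        for (Integer i = 0; i < *lz; ++i) base += ruser[z[i] - 1];    /* add z */
    }

    if (*la == 0) {
        score[0] = base;   /* single subset defined by z only */
    } else {
        for (Integer j = 0; j < *la; ++j) {
            double w = ruser[a[j] - 1];
            /* Each subset additionally drops (or adds) the one feature a(j). */
            score[j] = (*drop == 1) ? base - w : base + w;
        }
    }
    *info = 0;   /* a nonzero value would request termination */
}
```

A real scoring function, such as a residual sum of squares, would usually be far more expensive, which is why the number of calls to f matters in practice.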
9: mincnt – Integer Input
On entry: $k$, the minimum number of times the effect of each feature, $x_i$, must have been observed before $f(S - \{x_i\})$ is estimated from $f(S)$ as opposed to being calculated directly.
If $k = 0$ then $f(S - \{x_i\})$ is never estimated. If mincnt < 0 then $k$ is set to $1$.
10: gamma – Real (Kind=nag_wp) Input
On entry: $\gamma$, the scaling factor used when estimating scores. If gamma < 0 then $\gamma = 1$ is used.
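The way out-of-range inputs are mapped to effective values can be written as two small helpers. This is a sketch under the assumption that a negative mincnt is replaced by 1 and a negative gamma by 1.0; the function names are hypothetical and are not part of the NAG Library.

```c
/* Effective observation threshold: 0 disables estimation, and a
 * negative input is assumed to select the value 1. */
int effective_mincnt(int mincnt)
{
    return mincnt < 0 ? 1 : mincnt;
}

/* Effective scaling factor: a negative input is assumed to select 1.0. */
double effective_gamma(double gamma)
{
    return gamma < 0.0 ? 1.0 : gamma;
}
```

Passing a negative value for either argument is therefore a convenient way to request the routine's own choice.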
11: acc(2) – Real (Kind=nag_wp) array Input
On entry: a measure of the accuracy of the scoring function, $f$.
Letting $a_i = \epsilon_1 |f(S_i)| + \epsilon_2$, then when confirming whether the scoring function is strictly increasing or decreasing (as described in mincr), or when assessing whether a node defined by subset $S_i$ can be trimmed, any values in the range $f(S_i) \pm a_i$ are treated as being numerically equivalent.
If 0 ≤ acc(1) ≤ 1 then $\epsilon_1$ = acc(1), otherwise $\epsilon_1 = 0$.
If acc(2) ≥ 0 then $\epsilon_2$ = acc(2), otherwise $\epsilon_2 = 0$.
In most situations setting both $\epsilon_1$ and $\epsilon_2$ to zero should be sufficient. Using a nonzero value, when one is not required, can significantly increase the number of subsets that need to be evaluated.
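The accuracy band combines a relative and an absolute term. A minimal sketch follows, assuming that a relative term outside the range 0 to 1 and a negative absolute term are both replaced by zero; `score_tolerance` and `numerically_equivalent` are hypothetical names used only for illustration.

```c
/* Absolute value without relying on the maths library. */
double abs_value(double x) { return x < 0.0 ? -x : x; }

/* Half-width of the band around a score within which other values are
 * treated as numerically equivalent. acc[0] is the relative term and
 * acc[1] the absolute term; invalid entries are replaced by zero. */
double score_tolerance(const double acc[2], double score)
{
    double eps1 = (acc[0] >= 0.0 && acc[0] <= 1.0) ? acc[0] : 0.0;
    double eps2 = (acc[1] >= 0.0) ? acc[1] : 0.0;
    return eps1 * abs_value(score) + eps2;
}

/* Returns 1 if other lies within the band around score_i. */
int numerically_equivalent(const double acc[2], double score_i, double other)
{
    return abs_value(other - score_i) <= score_tolerance(acc, score_i);
}
```

A wider band makes more scores look equivalent, which weakens the trimming test and is consistent with the warning that unnecessary nonzero values increase the number of subsets evaluated.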
12: iuser(*) – Integer array User Workspace
13: ruser(*) – Real (Kind=nag_wp) array User Workspace
iuser and ruser are not used by h05abf, but are passed directly to f and may be used to pass information to this routine.
14: ifail – Integer Input/Output
On entry: ifail must be set to 0, -1 or 1 to set behaviour on detection of an error; these values have no effect when no error is detected.
A value of 0 causes the printing of an error message and program execution will be halted; otherwise program execution continues. A value of -1 means that an error message is printed while a value of 1 means that it is not.
If halting is not appropriate, the value -1 or 1 is recommended. If message printing is undesirable, then the value 1 is recommended. Otherwise, the value 0 is recommended.
When the value -1 or 1 is used it is essential to test the value of ifail on exit.
On exit: ifail = 0 unless the routine detects an error or a warning has been flagged (see Section 6).
6
Error Indicators and Warnings
If on entry ifail = 0 or -1, explanatory error messages are output on the current error message unit (as defined by x04aaf).
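Errors or warnings detected by the routine:
ifail = 11
On entry, mincr = ⟨value⟩.
Constraint: mincr = 0 or 1.
ifail = 21
On entry, m = ⟨value⟩.
Constraint: m ≥ 2.
ifail = 31
On entry, ip = ⟨value⟩ and m = ⟨value⟩.
Constraint: 1 ≤ ip ≤ m.
ifail = 41
On entry, nbest = ⟨value⟩.
Constraint: nbest ≥ 1.
ifail = 53
On entry, nbest = ⟨value⟩.
But only ⟨value⟩ best subsets could be calculated.
ifail = 81
On exit from f, score(⟨value⟩) = ⟨value⟩, which is inconsistent with the score for the parent node. Score for the parent node is ⟨value⟩.
ifail = 82
A nonzero value for info has been returned: info = ⟨value⟩.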
ifail = -99
An unexpected error has been triggered by this routine. Please contact NAG.
See Section 7 in the Introduction to the NAG Library FL Interface for further information.
ifail = -399
Your licence key may have expired or may not have been installed correctly.
See Section 8 in the Introduction to the NAG Library FL Interface for further information.
ifail = -999
Dynamic memory allocation failed.
See Section 9 in the Introduction to the NAG Library FL Interface for further information.
7
Accuracy
The subsets returned by h05abf are guaranteed to be optimal up to the accuracy of the calculated scores.
8
Parallelism and Performance
h05abf is threaded by NAG for parallel execution in multithreaded implementations of the NAG Library.
Please consult the X06 Chapter Introduction for information on how to control and interrogate the OpenMP environment used within this routine. Please also consult the Users' Note for your implementation for any additional implementation-specific information.
9
Further Comments
The maximum number of unique subsets of size $p$ from a set of $m$ features is $N = \frac{m!}{(m-p)! \, p!}$. The efficiency of the branch and bound algorithm implemented in h05abf comes from evaluating subsets at internal nodes of the tree, that is subsets with more than $p$ features, and where possible trimming branches of the tree based on the scores at these internal nodes as described in Narendra and Fukunaga (1977). Because of this it is possible, in some circumstances, for more than $N$ subsets to be evaluated. This will tend to happen when most of the features have a similar effect on the subset score.
If multiple optimal subsets exist with the same score, and nbest is too small to return them all, then the choice of which of these optimal subsets is returned is arbitrary.
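The number of distinct subsets of a given size is a binomial coefficient, which gives a feel for the size of an exhaustive search. The helper below is an illustrative sketch with a hypothetical name; it uses the multiplicative form so that every intermediate value is itself an exact binomial coefficient, and it assumes the result fits in an unsigned 64-bit integer.

```c
/* Number of distinct subsets of size p drawn from m features. */
unsigned long long count_subsets(int m, int p)
{
    if (p < 0 || p > m) return 0ULL;
    if (p > m - p) p = m - p;               /* use the smaller of p and m-p */
    unsigned long long result = 1ULL;
    for (int i = 1; i <= p; ++i) {
        /* After this step result equals C(m-p+i, i), so the division is exact. */
        result = result * (unsigned long long)(m - p + i) / (unsigned long long)i;
    }
    return result;
}
```

For instance, choosing 5 features from 14 gives 2002 candidate subsets, which is the figure a branch and bound search aims to stay well below in terms of score evaluations.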
10
Example
This example finds the three linear regression models, with five variables, that have the smallest residual sums of squares when fitted to a supplied dataset. The data used in this example was simulated.
10.1
Program Text
10.2
Program Data
10.3
Program Results