Given a set of $m$ features and a scoring mechanism for any subset of those features, h05aac selects the best $n$ subsets of size $p$ using a reverse communication branch and bound algorithm.
The function may be called by the names: h05aac, nag_mip_best_subset_given_size_revcomm or nag_best_subset_given_size_revcomm.
3 Description
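Given $\Omega = \{x_i : i \in \mathbb{Z}, 1 \le i \le m\}$, a set of $m$ unique features, and a scoring mechanism $f(S)$ defined for all $S \subseteq \Omega$, h05aac is designed to find $S_{o1} \subseteq \Omega$, $|S_{o1}| = p$, an optimal subset of size $p$. Here $|S_{o1}|$ denotes the cardinality of $S_{o1}$, the number of elements in the set.
The definition of the optimal subset depends on the properties of the scoring mechanism. If
$$f(S_i) \le f(S_j), \quad \text{for all } S_j \subseteq \Omega \text{ and } S_i \subseteq S_j, \tag{1}$$
then the optimal subset is defined as one of the solutions to
$$\underset{S \subseteq \Omega}{\text{maximize}} \; f(S) \quad \text{subject to} \quad |S| = p,$$
else if
$$f(S_i) \ge f(S_j), \quad \text{for all } S_j \subseteq \Omega \text{ and } S_i \subseteq S_j, \tag{2}$$
then the optimal subset is defined as one of the solutions to
$$\underset{S \subseteq \Omega}{\text{minimize}} \; f(S) \quad \text{subject to} \quad |S| = p.$$
If neither of these properties holds then h05aac cannot be used.
As well as returning the optimal subset, $S_{o1}$, h05aac can return the best $n$ solutions of size $p$. If $S_{oi}$ denotes the $i$th best subset, for $i = 1, 2, \ldots, n-1$, then the $(i+1)$th best subset is defined as the solution to either
$$\underset{S \subseteq \Omega - \{S_{oj} : j \in \mathbb{Z}, 1 \le j \le i\}}{\text{maximize}} \; f(S) \quad \text{subject to} \quad |S| = p$$
or
$$\underset{S \subseteq \Omega - \{S_{oj} : j \in \mathbb{Z}, 1 \le j \le i\}}{\text{minimize}} \; f(S) \quad \text{subject to} \quad |S| = p,$$
depending on the properties of $f$.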
The solutions are found using a branch and bound method, where each node of the tree is a subset of $\Omega$. Assuming that (1) holds, a particular node, defined by subset $S_i$, can be trimmed from the tree if
$$f(S_i) < \hat{f}(S_{on}),$$
where $\hat{f}(S_{on})$ is the $n$th highest score observed so far for a subset of size $p$, i.e., the current best guess of the score for the $n$th best subset. In addition, because of (1), all nodes defined by any subset $S_j$ where $S_j \subseteq S_i$ can also be dropped, thus avoiding the need to enumerate the whole tree. Similar shortcuts can be taken if (2) holds. A full description of this branch and bound algorithm can be found in Ridout (1988).
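The trimming rule can be illustrated with a much simplified search. The following is only a sketch, not the algorithm implemented in h05aac: it assumes $n = 1$, uses a hypothetical additive score that satisfies (1), and visits nodes in a naive order rather than the ordering of Ridout (1988). The names score, visit and best_subset are illustrative and are not part of the NAG Library.

```c
#include <assert.h>

/* Hypothetical score: sum of the non-negative weights of the kept features
   (bit i of mask set means feature i+1 is kept). Adding a feature never
   lowers the score, so property (1) holds. */
static double score(unsigned mask, const double *w, int m)
{
    double s = 0.0;
    for (int i = 0; i < m; i++)
        if (mask & (1u << i)) s += w[i];
    return s;
}

/* Number of features kept in mask. */
static int subset_size(unsigned mask)
{
    int c = 0;
    for (; mask; mask >>= 1) c += (int)(mask & 1u);
    return c;
}

/* Visit the node `mask`; its children drop one more feature with index >= start. */
static void visit(unsigned mask, int start, int m, int p, const double *w,
                  int *found, double *best, unsigned *best_mask)
{
    double f = score(mask, w, m);
    if (*found && f < *best) return; /* trim: by (1) no subset of mask beats best */
    if (subset_size(mask) == p) {
        if (!*found || f > *best) { *best = f; *best_mask = mask; *found = 1; }
        return;
    }
    for (int i = start; i < m; i++)
        visit(mask & ~(1u << i), i + 1, m, p, w, found, best, best_mask);
}

/* Return the best score over subsets of size p; the winning subset is
   written to *best_mask. */
double best_subset(const double *w, int m, int p, unsigned *best_mask)
{
    int found = 0;
    double best = 0.0;
    assert(m >= 1 && m < 32 && p >= 1 && p <= m);
    *best_mask = 0;
    visit((1u << m) - 1u, 0, m, p, w, &found, &best, best_mask);
    return best;
}
```

Calling best_subset with weights {5, 1, 4, 2, 3}, $m = 5$ and $p = 2$ selects the first and third features. Because the comparison is strict, nodes that tie with the current best are still explored.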
Rather than calculating the score at every node of the tree, h05aac utilizes the fast branch and bound algorithm of Somol et al. (2004) and attempts to estimate the score where possible. For each feature, $x_i$, two values are stored: a count, $c_i$, and $\hat{\mu}_i$, an estimate of the contribution of that feature. An initial value of zero is used for both $c_i$ and $\hat{\mu}_i$. At any stage of the algorithm where both $f(S)$ and $f(S - \{x_i\})$ have been calculated (as opposed to estimated), the estimated contribution of the feature $x_i$ is updated to
$$\frac{c_i \hat{\mu}_i + [f(S) - f(S - \{x_i\})]}{c_i + 1}$$
and $c_i$ is incremented by $1$. Therefore, at each stage, $\hat{\mu}_i$ is the mean contribution of $x_i$ observed so far and $c_i$ is the number of observations used to calculate that mean.
As long as $c_i \ge k$, for the user-supplied constant $k$, then rather than calculating $f(S - \{x_i\})$ this function estimates it using $\hat{f}(S - \{x_i\}) = f(S) - \gamma \hat{\mu}_i$, or $\hat{f}(S) - \gamma \hat{\mu}_i$ if $f(S)$ has been estimated, where $\gamma$ is a user-supplied scaling factor. An estimated score is never used to trim a node or returned as the optimal score.
Setting $k = 0$ in this function will cause the algorithm to always calculate the scores, returning to the branch and bound algorithm of Ridout (1988). In most cases it is preferable to use the fast branch and bound algorithm, by setting $k > 0$, unless the score function is iterative in nature, i.e., $f(S)$ must have been calculated before $f(S - \{x_i\})$ can be calculated.
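The bookkeeping described above amounts to a running mean per feature. The sketch below shows one way it could be written; the type FeatureStat and the functions update_contribution and estimate_score are hypothetical names used for illustration and do not reflect how h05aac stores this information internally.

```c
#include <assert.h>

/* Hypothetical per-feature record: c is the number of observed
   contributions and mu is their running mean. Both start at zero. */
typedef struct {
    int c;
    double mu;
} FeatureStat;

/* Fold one calculated pair f(S), f(S - {x_i}) into the running mean. */
void update_contribution(FeatureStat *st, double f_s, double f_s_minus_xi)
{
    assert(st != 0 && st->c >= 0);
    st->mu = (st->c * st->mu + (f_s - f_s_minus_xi)) / (st->c + 1);
    st->c += 1;
}

/* Return 1 and write the estimate f(S) - gamma * mu to *est when k > 0 and
   the feature has been observed at least k times; return 0 when the score
   must be calculated instead. */
int estimate_score(const FeatureStat *st, int k, double gamma, double f_s,
                   double *est)
{
    if (k <= 0 || st->c < k) return 0;
    *est = f_s - gamma * st->mu;
    return 1;
}
```

For example, after observing contributions of 2 and 4 for a feature, the stored mean is 3, so with $\gamma = 0.5$ and $f(S) = 10$ the estimate of $f(S - \{x_i\})$ is 8.5.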
4 References
Narendra P M and Fukunaga K (1977) A branch and bound algorithm for feature subset selection IEEE Transactions on Computers 9 917–922
Ridout M S (1988) Algorithm AS 233: An improved branch and bound algorithm for feature subset selection Journal of the Royal Statistical Society, Series C (Applied Statistics) (Volume 37) 1 139–147
Somol P, Pudil P and Kittler J (2004) Fast branch and bound algorithms for optimal feature selection IEEE Transactions on Pattern Analysis and Machine Intelligence (Volume 26) 7 900–912
5 Arguments
Note: this function uses reverse communication. Its use involves an initial entry, intermediate exits and re-entries, and a final exit, as indicated by the argument irevcm. Between intermediate exits and re-entries, all arguments other than bscore must remain unchanged.
1: irevcm – Integer * Input/Output
On initial entry: must be set to $0$.
On intermediate exit: irevcm $= 1$, and before re-entry the scores associated with la subsets must be calculated and returned in bscore.
If drop $= 1$, the $j$th subset is constructed by dropping the features specified in the first lz elements of z and the single feature given in a[$j-1$] from the full set of features, $\Omega$. The subset will, therefore, contain m $-$ lz $- 1$ features.
If drop $= 0$, the $j$th subset is constructed by adding the features specified in the first lz elements of z and the single feature specified in a[$j-1$] to the empty set, $\emptyset$. The subset will, therefore, contain lz $+ 1$ features.
In both cases the individual features are referenced by the integers $1$ to m, with $1$ indicating the first feature, $2$ the second, etc., for some arbitrary ordering of the features. The same ordering must be used in all calls to h05aac.
If la $= 0$, the score for a single subset should be returned. This subset is constructed by adding or removing only those features specified in the first lz elements of z.
If lz $= 0$, this subset will be either $\Omega$ or $\emptyset$.
The score associated with the $j$th subset must be returned in bscore[$j-1$].
On intermediate re-entry: irevcm must remain unchanged.
On final exit: irevcm $= 0$, and the algorithm has terminated.
Constraint: irevcm $= 0$ or $1$.
Note: any values you return to h05aac as part of the reverse communication procedure should not include floating-point NaN (Not a Number) or infinity values, since these are not handled by h05aac. If your code inadvertently does return any NaNs or infinities, h05aac is likely to produce unexpected results.
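The subset construction rule described for irevcm can be sketched as a small helper that turns z, a and drop into a membership array before the user's scoring code is called. This is an illustration only: build_subset is a hypothetical name, and the sketch assumes that a nonzero drop means features are removed from the full set and that drop equal to zero means features are added to the empty set.

```c
#include <assert.h>

/* Fill in_subset[0..m-1] (1 = feature present, 0 = absent) for the j-th
   requested subset, where j is 1-based. When la == 0 only the features in
   z are used and j must be passed as 0. Features are numbered from 1. */
void build_subset(int drop, int m, int lz, const int *z,
                  int la, const int *a, int j, int *in_subset)
{
    int start = drop ? 1 : 0; /* assumed: nonzero drop starts from the full set */
    assert(m >= 1 && lz >= 0 && lz <= m);
    assert(la == 0 ? j == 0 : (j >= 1 && j <= la));
    for (int i = 0; i < m; i++) in_subset[i] = start;
    for (int i = 0; i < lz; i++) in_subset[z[i] - 1] = !start;
    if (la > 0) in_subset[a[j - 1] - 1] = !start;
}
```

A scoring loop would call this once for each $j = 1, 2, \ldots,$ la (or once with $j = 0$ when la is zero), evaluate the score of the resulting subset, and store it in bscore[$j-1$] before re-entering h05aac.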
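2: mincr – Integer Input
On entry: flag indicating whether the scoring function $f$ is increasing or decreasing.
mincr $= 1$: $f(S_i) \le f(S_j)$, i.e., the subsets with the largest score will be selected.
mincr $= 0$: $f(S_i) \ge f(S_j)$, i.e., the subsets with the smallest score will be selected.
For all $S_j \subseteq \Omega$ and $S_i \subseteq S_j$.
Constraint: mincr $= 0$ or $1$.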
3: m – Integer Input
On entry: $m$, the number of features in the full feature set.
Constraint: m $\ge 2$.
4: ip – Integer Input
On entry: $p$, the number of features in the subset of interest.
Constraint: $1 \le$ ip $\le$ m.
5: nbest – Integer Input
On entry: $n$, the maximum number of best subsets required. The actual number of subsets returned is given by la on final exit. If on final exit la $\ne$ nbest then NE_TOO_MANY is returned.
6: drop – Integer * Input/Output
On intermediate exit: flag indicating whether the intermediate subsets should be constructed by dropping features from the full set (drop $= 1$) or adding features to the empty set (drop $= 0$). See irevcm for details.
On intermediate re-entry: drop must remain unchanged.
8: z[m $-$ ip] – Integer Input/Output
On intermediate exit: z[$i-1$], for $i = 1, 2, \ldots,$ lz, contains the list of features which, along with those specified in a, define the subsets whose score is required. See irevcm for additional details.
On intermediate re-entry: z must remain unchanged.
10: a[$\max($nbest, m$)$] – Integer Input/Output
On intermediate exit: a[$j-1$], for $j = 1, 2, \ldots,$ la, contains the list of features which, along with those specified in z, define the subsets whose score is required. See irevcm for additional details.
On intermediate re-entry: a must remain unchanged.
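12: bz[(m $-$ ip) $\times$ nbest] – Integer Input/Output
On intermediate exit: bz is used for storage between calls to h05aac.
On intermediate re-entry: bz must remain unchanged.
On final exit: the $j$th best subset is constructed by dropping the features specified in bz[$(j-1) \times ($m $-$ ip$) + i - 1$], for $i = 1, 2, \ldots,$ m $-$ ip and $j = 1, 2, \ldots,$ la, from the set of all features, $\Omega$. The score for the $j$th best subset is given in bscore[$j-1$].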
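On final exit the dropped-feature lists can be converted back into membership arrays. The sketch below assumes that the lists for successive subsets are stored one after another in bz, with m $-$ ip entries per subset and features numbered from 1; that layout is an assumption made for this illustration and decode_best_subset is a hypothetical helper, not a NAG Library function.

```c
#include <assert.h>

/* Fill in_subset[0..m-1] (1 = feature kept) for the j-th best subset,
   where j is 1-based. Assumes bz holds m - ip dropped features per subset,
   stored consecutively, with features numbered from 1. */
void decode_best_subset(const int *bz, int m, int ip, int j, int *in_subset)
{
    int ndrop = m - ip;
    assert(ndrop >= 0 && j >= 1);
    for (int i = 0; i < m; i++) in_subset[i] = 1;
    for (int i = 0; i < ndrop; i++)
        in_subset[bz[(j - 1) * ndrop + i] - 1] = 0;
}
```

With $m = 5$, $p = 3$ and bz holding {2, 5, 1, 4}, the first best subset keeps features 1, 3 and 4, and the second keeps features 2, 3 and 5.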
13: mincnt – Integer Input
On entry: $k$, the minimum number of times the effect of each feature, $x_i$, must have been observed before $f(S - \{x_i\})$ is estimated from $f(S)$ as opposed to being calculated directly.
If $k = 0$ then $f(S - \{x_i\})$ is never estimated. If mincnt $< 0$ then $k$ is set to $1$.
14: gamma – double Input
On entry: $\gamma$, the scaling factor used when estimating scores. If gamma $< 0$ then $\gamma = 1$ is used.
15: acc[$2$] – const double Input
On entry: a measure of the accuracy of the scoring function, $f$.
Letting $a_i = \epsilon_1 |f(S_i)| + \epsilon_2$, then when confirming whether the scoring function is strictly increasing or decreasing (as described in mincr), or when assessing whether a node defined by subset $S_i$ can be trimmed, any values in the range $f(S_i) \pm a_i$ are treated as being numerically equivalent.
If $0 \le$ acc[$0$] $\le 1$ then $\epsilon_1 =$ acc[$0$], otherwise $\epsilon_1 = 0$.
If acc[$1$] $\ge 0$ then $\epsilon_2 =$ acc[$1$], otherwise $\epsilon_2 = 0$.
In most situations setting both $\epsilon_1$ and $\epsilon_2$ to zero should be sufficient. Using a nonzero value when one is not required can significantly increase the number of subsets that need to be evaluated.
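The tolerance band can be sketched as follows. This is an illustration only: tolerance and numerically_equal are hypothetical names, and the rule used to fall back to zero (a relative term outside $[0, 1]$ or a negative absolute term is ignored) is an assumption made for this sketch.

```c
#include <assert.h>

/* Absolute value without pulling in the maths library. */
static double abs_d(double x) { return x < 0.0 ? -x : x; }

/* a_i = eps1 * |f(S_i)| + eps2, with each eps taken from acc[] when it lies
   in the assumed valid range and set to zero otherwise. */
double tolerance(const double acc[2], double f_i)
{
    double eps1, eps2;
    assert(acc != 0);
    eps1 = (acc[0] >= 0.0 && acc[0] <= 1.0) ? acc[0] : 0.0;
    eps2 = (acc[1] >= 0.0) ? acc[1] : 0.0;
    return eps1 * abs_d(f_i) + eps2;
}

/* Return 1 when f_other lies within f_i +/- a_i, 0 otherwise. */
int numerically_equal(const double acc[2], double f_i, double f_other)
{
    return abs_d(f_other - f_i) <= tolerance(acc, f_i);
}
```

With both tolerance terms set to zero, only exactly equal scores are treated as equivalent. A wider band makes more scores count as ties, so fewer nodes can be trimmed, which is consistent with the warning above.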
16: icomm[licomm] – Integer Communication Array
On intermediate exit: icomm is used for storage between calls to h05aac.
On intermediate re-entry: icomm must remain unchanged.
On final exit: icomm is not defined. The first two elements, icomm[$0$] and icomm[$1$], contain the minimum required values for licomm and lrcomm respectively.
17: licomm – Integer Input
On entry: the length of the array icomm. If licomm is too small and licomm $\ge 2$ then NE_ARRAY_SIZE is returned and the minimum values for licomm and lrcomm are given by icomm[$0$] and icomm[$1$] respectively.
19: lrcomm – Integer Input
On entry: the length of the array rcomm. If lrcomm is too small and licomm $\ge 2$ then NE_ARRAY_SIZE is returned and the minimum values for licomm and lrcomm are given by icomm[$0$] and icomm[$1$] respectively.
Constraints:
if , ;
otherwise .
20: fail – NagError * Input/Output
The NAG error argument (see Section 7 in the Introduction to the NAG Library CL Interface).
6 Error Indicators and Warnings
NE_ALLOC_FAIL
Dynamic memory allocation failed.
See Section 3.1.2 in the Introduction to the NAG Library CL Interface for further information.
NE_ARRAY_SIZE
On entry, licomm = ⟨value⟩, lrcomm = ⟨value⟩.
Constraint: licomm ≥ ⟨value⟩, lrcomm ≥ ⟨value⟩.
The minimum required values for licomm and lrcomm are returned in icomm[0] and icomm[1] respectively.
NE_INT_CHANGED
drop has changed between calls.
On intermediate entry, drop = ⟨value⟩.
On initial entry, drop = ⟨value⟩.
ip has changed between calls.
On intermediate entry, ip = ⟨value⟩.
On initial entry, ip = ⟨value⟩.
la has changed between calls.
On entry, la = ⟨value⟩.
On previous exit, la = ⟨value⟩.
lz has changed between calls.
On entry, lz = ⟨value⟩.
On previous exit, lz = ⟨value⟩.
m has changed between calls.
On intermediate entry, m = ⟨value⟩.
On initial entry, m = ⟨value⟩.
mincnt has changed between calls.
On intermediate entry, mincnt = ⟨value⟩.
On initial entry, mincnt = ⟨value⟩.
mincr has changed between calls.
On intermediate entry, mincr = ⟨value⟩.
On initial entry, mincr = ⟨value⟩.
nbest has changed between calls.
On intermediate entry, nbest = ⟨value⟩.
On initial entry, nbest = ⟨value⟩.
NE_INTERNAL_ERROR
An internal error has occurred in this function. Check the function call and any array sizes. If the call is correct then please contact NAG for assistance.
See Section 7.5 in the Introduction to the NAG Library CL Interface for further information.
NE_NO_LICENCE
Your licence key may have expired or may not have been installed correctly.
See Section 8 in the Introduction to the NAG Library CL Interface for further information.
NE_REAL
bscore[⟨value⟩] = ⟨value⟩, which is inconsistent with the score for the parent node. Score for the parent node is ⟨value⟩.
NE_REAL_CHANGED
acc[0] has changed between calls.
On intermediate entry, acc[0] = ⟨value⟩.
On initial entry, acc[0] = ⟨value⟩.
acc[1] has changed between calls.
On intermediate entry, acc[1] = ⟨value⟩.
On initial entry, acc[1] = ⟨value⟩.
gamma has changed between calls.
On intermediate entry, gamma = ⟨value⟩.
On initial entry, gamma = ⟨value⟩.
NE_TOO_MANY
On entry, nbest = ⟨value⟩.
But only ⟨value⟩ best subsets could be calculated.
NE_TOO_SMALL
On entry, licomm = ⟨value⟩, lrcomm = ⟨value⟩.
Constraint: licomm ≥ ⟨value⟩, lrcomm ≥ ⟨value⟩. icomm is too small to return the required array sizes.
7 Accuracy
The subsets returned by h05aac are guaranteed to be optimal up to the accuracy of your calculated scores.
8 Parallelism and Performance
h05aac is threaded by NAG for parallel execution in multithreaded implementations of the NAG Library.
Please consult the X06 Chapter Introduction for information on how to control and interrogate the OpenMP environment used within this function. Please also consult the Users' Note for your implementation for any additional implementation-specific information.
9 Further Comments
The maximum number of unique subsets of size $p$ from a set of $m$ features is $N = \frac{m!}{(m-p)!\,p!}$. The efficiency of the branch and bound algorithm implemented in h05aac comes from evaluating subsets at internal nodes of the tree, that is, subsets with more than $p$ features, and where possible trimming branches of the tree based on the scores at these internal nodes, as described in Narendra and Fukunaga (1977). Because of this it is possible, in some circumstances, for more than $N$ subsets to be evaluated. This will tend to happen when most of the features have a similar effect on the subset score.
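To judge how large an exhaustive search would be, the number of subsets of a given size can be computed without forming factorials. The helper below is an illustrative sketch; n_subsets is a hypothetical name and the result is only reliable while it fits in an unsigned long long.

```c
#include <assert.h>

/* Number of distinct subsets of size p drawn from m features, i.e. the
   binomial coefficient m! / ((m - p)! p!). After step i the running value
   equals the binomial coefficient C(m - p + i, i), so every division is
   exact. */
unsigned long long n_subsets(int m, int p)
{
    unsigned long long n = 1;
    assert(m >= 0 && p >= 0 && p <= m);
    if (p > m - p) p = m - p; /* use the smaller of p and m - p */
    for (int i = 1; i <= p; i++)
        n = n * (unsigned long long)(m - p + i) / (unsigned long long)i;
    return n;
}
```

For instance, choosing 5 features from 10 gives 252 candidate subsets, while choosing 10 from 20 already gives 184756.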
If multiple optimal subsets exist with the same score, and nbest is too small to return them all, then the choice of which of these optimal subsets is returned is arbitrary.
10 Example
This example finds the three linear regression models, with five variables, that have the smallest residual sums of squares when fitted to a supplied dataset. The data used in this example was simulated.