naginterfaces.library.mip.best_subset_given_size_revcomm

naginterfaces.library.mip.best_subset_given_size_revcomm(irevcm, mincr, m, ip, nbest, drop, lz, z, la, a, bscore, bz, mincnt, gamma, acc, comm, io_manager=None)

Given a set of $m$ features and a scoring mechanism for any subset of those features, best_subset_given_size_revcomm selects the best $n$ subsets of size $p$ using a reverse communication branch and bound algorithm.

For full information please refer to the NAG Library document for h05aa

https://support.nag.com/numeric/nl/nagdoc_31.1/flhtml/h/h05aaf.html

Parameters

irevcmint

On initial entry: must be set to $0$ .

On intermediate entry: $i r e v c m$ must remain unchanged.

mincrint

Flag indicating whether the scoring function $f$ is increasing or decreasing.

$m i n c r = 1$

$f (S_{i}) \leq f (S_{j})$ , i.e., the subsets with the largest score will be selected.

$m i n c r = 0$

$f (S_{i}) \geq f (S_{j})$ , i.e., the subsets with the smallest score will be selected.

In both cases the inequality must hold for all $S_{j} \subseteq Ω$ and $S_{i} \subseteq S_{j}$ .
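
For a small feature set, the monotonicity condition above can be checked by brute force before choosing a value for $m i n c r$ . The following is a minimal sketch, not part of naginterfaces: `classify_score` is a hypothetical helper, and the scoring function is assumed to accept a `frozenset` of 1-based feature indices. It compares every subset with each of its one-feature-smaller children, which is sufficient because the inequalities chain.

```python
from itertools import combinations


def classify_score(m, score):
    """Return 1 if score is increasing, 0 if decreasing, None if neither.

    Hypothetical helper for illustration only. It evaluates all 2**m
    subsets, so it is only practical for small m.
    """
    increasing = decreasing = True
    for size in range(1, m + 1):
        for combo in combinations(range(1, m + 1), size):
            parent = frozenset(combo)
            f_parent = score(parent)
            for x in parent:
                f_child = score(parent - {x})
                if f_child > f_parent:
                    increasing = False  # violates f(S_i) <= f(S_j)
                if f_child < f_parent:
                    decreasing = False  # violates f(S_i) >= f(S_j)
    if increasing:
        return 1
    if decreasing:
        return 0
    return None
```

A return value of `None` corresponds to the case in which neither property holds and the routine cannot be used.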

mint

$m$ , the number of features in the full feature set.

ipint

$p$ , the number of features in the subset of interest.

nbestint

$n$ , the maximum number of best subsets required. The actual number of subsets returned is given by $l a$ on final exit. If on final exit $l a \neq n b e s t$ then $e r r n o$ = 53 is returned.

dropint

On initial entry: $d r o p$ need not be set.

On intermediate entry: $d r o p$ must remain unchanged.

lzint

On initial entry: $l z$ need not be set.

On intermediate entry: $l z$ must remain unchanged.

zint, ndarray, shape $(m - i p)$ , modified in place

On initial entry: $z$ need not be set.

On intermediate exit: $z [i - 1]$ , for $i = 1, 2, \dots, l z$ , contains the list of features which, along with those specified in $a$ , define the subsets whose score is required. See $i r e v c m$ for additional details.

On intermediate entry: $z$ must remain unchanged.

On final exit: $z$ is undefined.

laint

On initial entry: $l a$ need not be set.

On intermediate entry: $l a$ must remain unchanged.

aint, ndarray, shape $(max (n b e s t, m))$ , modified in place

On initial entry: $a$ need not be set.

On intermediate exit: $a [j - 1]$ , for $j = 1, 2, \dots, l a$ , contains the list of features which, along with those specified in $z$ , define the subsets whose score is required. See $i r e v c m$ for additional details.

On intermediate entry: $a$ must remain unchanged.

On final exit: $a$ is undefined.

bscorefloat, ndarray, shape $(max (n b e s t, m))$ , modified in place

On initial entry: $b s c o r e$ need not be set.

On intermediate exit: $b s c o r e$ is undefined.

On intermediate entry: $b s c o r e [j - 1]$ must hold the score for the $j$ th subset as described in $i r e v c m$ .

On final exit: holds the score for the $l a$ best subsets returned in $b z$ .

bzint, ndarray, shape $(m - i p, n b e s t)$ , modified in place

On initial entry: $b z$ need not be set.

On intermediate exit: $b z$ is used for storage between calls to best_subset_given_size_revcomm.

On intermediate entry: $b z$ must remain unchanged.

On final exit: the $j$ th best subset is constructed by dropping the features specified in $b z [i - 1, j - 1]$ , for $j = 1, 2, \dots, l a$ , for $i = 1, 2, \dots, m - i p$ , from the set of all features, $Ω$ . The score for the $j$ th best subset is given in $b s c o r e [j - 1]$ .
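
The final-exit layout of $b z$ can be turned into explicit feature lists as follows. This is a minimal sketch under stated assumptions: `decode_best_subsets` is a hypothetical helper (not part of naginterfaces), and `bz` is assumed to be indexable as `bz[i][j]`, which holds for both nested lists and two-dimensional NumPy arrays.

```python
def decode_best_subsets(m, la, bz):
    """Return the la best subsets as sorted lists of 1-based feature ids.

    Column j of bz lists the features dropped from the full set to
    obtain the (j + 1)th best subset.
    """
    full = set(range(1, m + 1))
    n_drop = len(bz)  # number of rows, i.e. m - ip
    subsets = []
    for j in range(la):
        dropped = {int(bz[i][j]) for i in range(n_drop)}
        subsets.append(sorted(full - dropped))
    return subsets
```

For example, with `m = 5`, `ip = 3` and two returned subsets, a `bz` of `[[1, 2], [2, 4]]` decodes to the subsets `[3, 4, 5]` and `[1, 3, 5]`.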

mincntint

$k$ , the minimum number of times the effect of each feature, $x_{i}$ , must have been observed before $f (S - \{x_{i}\})$ is estimated from $f (S)$ as opposed to being calculated directly.

If $k = 0$ then $f (S - \{x_{i}\})$ is never estimated.

If $m i n c n t < 0$ then $k$ is set to $1$ .

gammafloat

$γ$ , the scaling factor used when estimating scores. If $g a m m a < 0$ then $γ = 1$ is used.

accfloat, array-like, shape $(2)$

A measure of the accuracy of the scoring function, $f$ .

Letting $a_{i} = ϵ_{1} | f (S_{i}) | + ϵ_{2}$ , then when confirming whether the scoring function is strictly increasing or decreasing (as described in $m i n c r$ ), or when assessing whether a node defined by subset $S_{i}$ can be trimmed, then any values in the range $f (S_{i}) \pm a_{i}$ are treated as being numerically equivalent.

If $0 \leq a c c [0] \leq 1$ then $ϵ_{1} = a c c [0]$ , otherwise $ϵ_{1} = 0$ .

If $a c c [1] \geq 0$ then $ϵ_{2} = a c c [1]$ , otherwise $ϵ_{2} = 0$ .

In most situations setting both $ϵ_{1}$ and $ϵ_{2}$ to zero should be sufficient.

Using a nonzero value, when one is not required, can significantly increase the number of subsets that need to be evaluated.
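
The tolerance rule above can be written out directly. The sketch below is illustrative only: the helper names are hypothetical, and treating the range $f (S_{i}) \pm a_{i}$ as inclusive of its end points is an assumption.

```python
def effective_eps(acc):
    """Map the two acc entries to (eps1, eps2) using the documented rules."""
    eps1 = acc[0] if 0.0 <= acc[0] <= 1.0 else 0.0
    eps2 = acc[1] if acc[1] >= 0.0 else 0.0
    return eps1, eps2


def numerically_equivalent(f_i, other, acc):
    """True if other lies within f_i +/- a_i, with a_i = eps1*|f_i| + eps2.

    Assumption: the end points of the range count as equivalent.
    """
    eps1, eps2 = effective_eps(acc)
    a_i = eps1 * abs(f_i) + eps2
    return abs(other - f_i) <= a_i
```

With `acc = (0.0, 0.0)` only exactly equal scores are treated as equivalent, which matches the recommended default.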

commdict, communication object, modified in place

Communication structure.

On initial entry: need not be set.

io_managerFileObjManager, optional

Manager for I/O in this routine.

Returns

irevcmint

On intermediate exit: $i r e v c m = 1$ and before re-entry the scores associated with $l a$ subsets must be calculated and returned in $b s c o r e$ .

The $l a$ subsets are constructed as follows:

$d r o p = 1$

The $j$ th subset is constructed by dropping the features specified in the first $l z$ elements of $z$ and the single feature given in $a [j - 1]$ from the full set of features, $Ω$ . The subset will, therefore, contain $m - l z - 1$ features.

$d r o p = 0$

The $j$ th subset is constructed by adding the features specified in the first $l z$ elements of $z$ and the single feature specified in $a [j - 1]$ to the empty set, $\emptyset$ . The subset will, therefore, contain $l z + 1$ features.

In both cases the individual features are referenced by the integers $1$ to $m$ with $1$ indicating the first feature, $2$ the second, etc., for some arbitrary ordering of the features.

The same ordering must be used in all calls to best_subset_given_size_revcomm.

If $l a = 0$ , the score for a single subset should be returned.

This subset is constructed by adding or removing only those features specified in the first $l z$ elements of $z$ .

If $l z = 0$ , this subset will either be $Ω$ or $\emptyset$ .

The score associated with the $j$ th subset must be returned in $b s c o r e [j - 1]$ .

On final exit: $i r e v c m = 0$ , and the algorithm has terminated.
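
The construction rules for the intermediate subsets can be sketched in plain Python. `subsets_to_score` is a hypothetical helper, not part of naginterfaces; it only interprets the values of $d r o p$ , $l z$ , $z$ , $l a$ and $a$ as described above and does not call the library.

```python
def subsets_to_score(m, drop, lz, z, la, a):
    """Return the subsets whose scores are requested on intermediate exit.

    Each subset is a sorted list of 1-based feature ids. When la == 0 a
    single subset, defined by the first lz elements of z alone, is returned.
    """
    full = set(range(1, m + 1))
    base = {int(v) for v in z[:lz]}
    extras = [int(a[j]) for j in range(la)] if la > 0 else [None]
    result = []
    for extra in extras:
        listed = set(base)
        if extra is not None:
            listed.add(extra)
        # drop == 1: remove the listed features from the full set;
        # drop == 0: the listed features are added to the empty set.
        subset = full - listed if drop == 1 else listed
        result.append(sorted(subset))
    return result
```

The score of the $j$ th list returned by this helper would then be stored in $b s c o r e [j - 1]$ before re-entry.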

dropint

On intermediate exit: flag indicating whether the intermediate subsets should be constructed by dropping features from the full set ( $d r o p = 1$ ) or adding features to the empty set ( $d r o p = 0$ ). See $i r e v c m$ for details.

On final exit: $d r o p$ is undefined.

lzint

On intermediate exit: the number of features stored in $z$ .

On final exit: $l z$ is undefined.

laint

On intermediate exit: if $l a > 0$ , the number of subsets for which a score must be returned.

If $l a = 0$ , the score for a single subset should be returned.

See $i r e v c m$ for additional details.

On final exit: the number of best subsets returned.

Raises

NagValueError

(errno $11$ )

On entry, $i r e v c m = ⟨ v a l u e ⟩$ .

Constraint: $i r e v c m = 0$ or $1$ .

(errno $21$ )

On entry, $m i n c r = ⟨ v a l u e ⟩$ .

Constraint: $m i n c r = 0$ or $1$ .

(errno $22$ )

$m i n c r$ has changed between calls.

On intermediate entry, $m i n c r = ⟨ v a l u e ⟩$ .

On initial entry, $m i n c r = ⟨ v a l u e ⟩$ .

(errno $31$ )

On entry, $m = ⟨ v a l u e ⟩$ .

Constraint: $m \geq 2$ .

(errno $32$ )

$m$ has changed between calls.

On intermediate entry, $m = ⟨ v a l u e ⟩$ .

On initial entry, $m = ⟨ v a l u e ⟩$ .

(errno $41$ )

On entry, $i p = ⟨ v a l u e ⟩$ and $m = ⟨ v a l u e ⟩$ .

Constraint: $1 \leq i p \leq m$ .

(errno $42$ )

$i p$ has changed between calls.

On intermediate entry, $i p = ⟨ v a l u e ⟩$ .

On initial entry, $i p = ⟨ v a l u e ⟩$ .

(errno $51$ )

On entry, $n b e s t = ⟨ v a l u e ⟩$ .

Constraint: $n b e s t \geq 1$ .

(errno $52$ )

$n b e s t$ has changed between calls.

On intermediate entry, $n b e s t = ⟨ v a l u e ⟩$ .

On initial entry, $n b e s t = ⟨ v a l u e ⟩$ .

(errno $61$ )

$d r o p$ has changed between calls.

On intermediate entry, $d r o p = ⟨ v a l u e ⟩$ .

On initial entry, $d r o p = ⟨ v a l u e ⟩$ .

(errno $71$ )

$l z$ has changed between calls.

On entry, $l z = ⟨ v a l u e ⟩$ .

On previous exit, $l z = ⟨ v a l u e ⟩$ .

(errno $91$ )

$l a$ has changed between calls.

On entry, $l a = ⟨ v a l u e ⟩$ .

On previous exit, $l a = ⟨ v a l u e ⟩$ .

(errno $111$ )

$b s c o r e [⟨ v a l u e ⟩] = ⟨ v a l u e ⟩$ , which is inconsistent with the score for the parent node. Score for the parent node is $⟨ v a l u e ⟩$ .

(errno $131$ )

$m i n c n t$ has changed between calls.

On intermediate entry, $m i n c n t = ⟨ v a l u e ⟩$ .

On initial entry, $m i n c n t = ⟨ v a l u e ⟩$ .

(errno $141$ )

$g a m m a$ has changed between calls.

On intermediate entry, $g a m m a = ⟨ v a l u e ⟩$ .

On initial entry, $g a m m a = ⟨ v a l u e ⟩$ .

(errno $151$ )

$a c c [0]$ has changed between calls.

On intermediate entry, $a c c [0] = ⟨ v a l u e ⟩$ .

On initial entry, $a c c [0] = ⟨ v a l u e ⟩$ .

(errno $152$ )

$a c c [1]$ has changed between calls.

On intermediate entry, $a c c [1] = ⟨ v a l u e ⟩$ .

On initial entry, $a c c [1] = ⟨ v a l u e ⟩$ .

(errno $161$ )

$c o m m$ [‘icomm’] has been corrupted between calls.

(errno $181$ )

$c o m m$ [‘rcomm’] has been corrupted between calls.

Warns

NagAlgorithmicWarning

(errno $53$ )

On entry, $n b e s t = ⟨ v a l u e ⟩$ .

But only $⟨ v a l u e ⟩$ best subsets could be calculated.

Notes

Given $Ω = \{x_{i} : i \in \mathbb{Z}, 1 \leq i \leq m\}$ , a set of $m$ unique features, and a scoring mechanism $f (S)$ defined for all $S \subseteq Ω$ , best_subset_given_size_revcomm is designed to find $S_{o 1} \subseteq Ω, | S_{o 1} | = p$ , an optimal subset of size $p$ . Here $| S_{o 1} |$ denotes the cardinality of $S_{o 1}$ , the number of elements in the set.

The definition of the optimal subset depends on the properties of the scoring mechanism. If

$$f (S_{i}) \leq f (S_{j}), \quad \text{for all } S_{j} \subseteq Ω \text{ and } S_{i} \subseteq S_{j}$$

then the optimal subset is defined as one of the solutions to

$$\underset{S \subseteq Ω}{\operatorname{maximize}} \; f (S) \quad \text{subject to} \quad | S | = p$$

else if

$$f (S_{i}) \geq f (S_{j}), \quad \text{for all } S_{j} \subseteq Ω \text{ and } S_{i} \subseteq S_{j}$$

then the optimal subset is defined as one of the solutions to

$$\underset{S \subseteq Ω}{\operatorname{minimize}} \; f (S) \quad \text{subject to} \quad | S | = p .$$

If neither of these properties holds then best_subset_given_size_revcomm cannot be used.

As well as returning the optimal subset, $S_{o 1}$ , best_subset_given_size_revcomm can return the best $n$ solutions of size $p$ . If $S_{o i}$ denotes the $i$ th best subset, for $i = 1, 2, \dots, n - 1$ , then the $(i + 1)$ th best subset is defined as the solution to either

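$$\underset{S \subseteq Ω - \{S_{o j} : j \in \mathbb{Z}, 1 \leq j \leq i\}}{\operatorname{maximize}} \; f (S) \quad \text{subject to} \quad | S | = p$$

or

$$\underset{S \subseteq Ω - \{S_{o j} : j \in \mathbb{Z}, 1 \leq j \leq i\}}{\operatorname{minimize}} \; f (S) \quad \text{subject to} \quad | S | = p$$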

depending on the properties of $f$ .
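
As a reference point for what "the best $n$ subsets of size $p$" means, the definition can be evaluated by exhaustive enumeration on small problems. This is a sketch for illustration, not the algorithm used by the routine: `best_subsets_exhaustive` is a hypothetical name, and ties are broken by enumeration order, which is an arbitrary choice.

```python
from itertools import combinations


def best_subsets_exhaustive(m, p, n, score, maximize=True):
    """Score every size-p subset of features 1..m and keep the best n.

    Returns a list of (score, subset) pairs, best first. The cost grows
    as C(m, p), so this is only useful for checking small cases.
    """
    scored = [
        (score(frozenset(combo)), combo)
        for combo in combinations(range(1, m + 1), p)
    ]
    # Stable sort on the score alone; ties keep enumeration order.
    scored.sort(key=lambda pair: pair[0], reverse=maximize)
    return scored[:n]
```

Setting `maximize=True` corresponds to the increasing case and `maximize=False` to the decreasing case.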

The solutions are found using a branch and bound method, where each node of the tree is a subset of $Ω$ . Assuming that the first property holds, i.e., $f (S_{i}) \leq f (S_{j})$ for all $S_{i} \subseteq S_{j}$ , a particular node, defined by subset $S_{i}$ , can be trimmed from the tree if $f (S_{i}) < \hat{f} (S_{o n})$ , where $\hat{f} (S_{o n})$ is the $n$ th highest score observed so far for a subset of size $p$ , i.e., the current best guess of the score for the $n$ th best subset. In addition, because of the same property, all nodes defined by any subset $S_{j}$ where $S_{j} \subseteq S_{i}$ can also be dropped, thus avoiding the need to enumerate the whole tree. Similar shortcuts can be taken if the second property, $f (S_{i}) \geq f (S_{j})$ , holds. A full description of this branch and bound algorithm can be found in Ridout (1988).
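
The trimming rule can be illustrated with a basic branch and bound search for the increasing case. This is a simplified sketch, not the implementation behind the routine: it omits the node ordering and other refinements described by Ridout (1988), scores every node directly, and uses hypothetical names throughout.

```python
import heapq


def best_subsets_branch_and_bound(m, p, n, score):
    """Return up to n (score, subset) pairs of size p, highest score first.

    Assumes score is increasing: score(S_i) <= score(S_j) whenever
    S_i is a subset of S_j.
    """
    best = []  # min-heap of the n best size-p subsets found so far

    def visit(subset, start):
        s = score(subset)
        # Trim: no subset of this node can beat the current nth best score.
        if len(best) == n and s < best[0][0]:
            return
        if len(subset) == p:
            item = (s, tuple(sorted(subset)))
            if len(best) < n:
                heapq.heappush(best, item)
            elif s > best[0][0]:
                heapq.heapreplace(best, item)
            return
        # Drop features in increasing index order so that each subset is
        # reached exactly once, leaving enough features for later drops.
        to_drop = len(subset) - p
        for x in range(start, m - to_drop + 2):
            visit(subset - {x}, x + 1)

    visit(frozenset(range(1, m + 1)), 1)
    return sorted(best, reverse=True)
```

For the decreasing case the same search can be run on the negated score, since negating a decreasing function gives an increasing one.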

Rather than calculate the score at a given node of the tree, best_subset_given_size_revcomm utilizes the fast branch and bound algorithm of Somol et al. (2004), and attempts to estimate the score where possible. For each feature, $x_{i}$ , two values are stored, a count $c_{i}$ and $\hat{μ}_{i}$ , an estimate of the contribution of that feature. An initial value of zero is used for both $c_{i}$ and $\hat{μ}_{i}$ . At any stage of the algorithm where both $f (S)$ and $f (S - \{x_{i}\})$ have been calculated (as opposed to estimated), the estimated contribution of the feature $x_{i}$ is updated to

$$\frac{c_{i} \hat{μ}_{i} + [f (S) - f (S - \{x_{i}\})]}{c_{i} + 1}$$

and $c_{i}$ is incremented by $1$ . Therefore, at each stage $\hat{μ}_{i}$ is the mean contribution of $x_{i}$ observed so far and $c_{i}$ is the number of observations used to calculate that mean.
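
The update above is an ordinary running mean. A minimal sketch, with a hypothetical function name and plain floats standing in for the internally stored values:

```python
def update_contribution(c_i, mu_i, f_S, f_S_minus_xi):
    """Fold one observed contribution f(S) - f(S - {x_i}) into the mean.

    Returns the incremented count and the updated mean contribution.
    """
    mu_new = (c_i * mu_i + (f_S - f_S_minus_xi)) / (c_i + 1)
    return c_i + 1, mu_new


# Two calculated pairs for the same feature: contributions 3 and 5.
c, mu = 0, 0.0
c, mu = update_contribution(c, mu, 10.0, 7.0)  # c = 1, mu = 3.0
c, mu = update_contribution(c, mu, 8.0, 3.0)   # c = 2, mu = 4.0
```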

As long as $c_{i} \geq k$ , for the user-supplied constant $k$ , then rather than calculating $f (S - \{x_{i}\})$ this function estimates it using $\hat{f} (S - \{x_{i}\}) = f (S) - γ \hat{μ}_{i}$ , or $\hat{f} (S) - γ \hat{μ}_{i}$ if $f (S)$ has been estimated, where $γ$ is a user-supplied scaling factor. An estimated score is never used to trim a node or returned as the optimal score.

Setting $k = 0$ in this function will cause the algorithm to always calculate the scores, returning to the branch and bound algorithm of Ridout (1988). In most cases it is preferable to use the fast branch and bound algorithm, by setting $k > 0$ , unless the score function is iterative in nature, i.e., $f (S)$ must have been calculated before $f (S - \{x_{i}\})$ can be calculated.
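
The decision between estimating and calculating a score, together with the documented defaults for negative $m i n c n t$ and $g a m m a$ , can be sketched as follows. The helper names are hypothetical, and returning `None` to mean "calculate the score directly" is a convention chosen for this example only.

```python
def effective_k(mincnt):
    """k is set to 1 when mincnt is negative."""
    return 1 if mincnt < 0 else mincnt


def effective_gamma(gamma):
    """gamma = 1 is used when a negative value is supplied."""
    return 1.0 if gamma < 0 else gamma


def estimate_score(f_S, c_i, mu_i, mincnt, gamma):
    """Estimate f(S - {x_i}) from f(S), or return None if it must be calculated.

    f_S may itself be an estimate; the same formula applies in that case.
    """
    k = effective_k(mincnt)
    if k == 0 or c_i < k:
        return None  # k = 0 disables estimation; otherwise too few observations
    return f_S - effective_gamma(gamma) * mu_i
```

Per the text above, a value produced this way would only guide the search; it is never used to trim a node or reported as a final score.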

References

Narendra, P M and Fukunaga, K, 1977, A branch and bound algorithm for feature subset selection, IEEE Transactions on Computers (9), 917–922

Ridout, M S, 1988, Algorithm AS 233: An improved branch and bound algorithm for feature subset selection, Journal of the Royal Statistical Society, Series C (Applied Statistics) (Volume 37) (1), 139–147

Somol, P, Pudil, P and Kittler, J, 2004, Fast branch and bound algorithms for optimal feature selection, IEEE Transactions on Pattern Analysis and Machine Intelligence (Volume 26) (7), 900–912
