The routine may be called by the names g03eff or nagf_mv_cluster_kmeans.
3Description
Given objects with variables measured on each object, , for and , g03eff allocates each object to one of groups or clusters to minimize the within-cluster sum of squares:
where is the set of objects in the th cluster and is the mean for the variable over cluster . This is often known as -means clustering.
In addition to the data matrix, a matrix giving the initial cluster centres for the clusters is required. The objects are then initially allocated to the cluster with the nearest cluster mean. Given the initial allocation, the procedure is to iteratively search for the -partition with locally optimal within-cluster sum of squares by moving points from one cluster to another.
Optionally, weights for each object, , can be used so that the clustering is based on within-cluster weighted sums of squares:
where is the weighted mean for variable over cluster .
On entry: the total number of variables in array x.
Constraint:
.
4: – Real (Kind=nag_wp) arrayInput
On entry: must contain the value of the th variable for the th object, for and .
5: – IntegerInput
On entry: the first dimension of the array x as declared in the (sub)program from which g03eff is called.
Constraint:
.
6: – Integer arrayInput
On entry: indicates whether or not the th variable is to be included in the analysis. If , the variable contained in the th column of x is included, for .
On entry: , the number of variables included in the sums of squares calculations.
Constraint:
.
8: – IntegerInput
On entry: , the number of clusters.
Constraint:
.
9: – Real (Kind=nag_wp) arrayInput/Output
On entry: must contain the value of the th variable for the th initial cluster centre, for and .
On exit: contains the value of the th variable for the th computed cluster centre, for and .
10: – IntegerInput
On entry: the first dimension of the array cmeans as declared in the (sub)program from which g03eff is called.
Constraint:
.
11: – Real (Kind=nag_wp) arrayInput
Note: the dimension of the array wt
must be at least
if , and at least otherwise.
On entry: if , the first elements of wt must contain the weights to be used.
If , the th observation is not included in the analysis. The effective number of observation is the sum of the weights.
If , wt is not referenced and the effective number of observations is .
Constraint:
if ,
and for at least two values of , for .
12: – Integer arrayOutput
On exit: contains the cluster to which the th object has been allocated, for .
13: – Integer arrayOutput
On exit: contains the number of objects in the th cluster, for .
14: – Real (Kind=nag_wp) arrayOutput
On exit: contains the within-cluster (weighted) sum of squares of the th cluster, for .
15: – Real (Kind=nag_wp) arrayOutput
On exit: contains the within-cluster sum of weights of the th cluster, for . If , the sum of weights is the number of objects in the cluster.
16: – IntegerInput
On entry: the maximum number of iterations allowed in the analysis.
Suggested value:
.
Constraint:
.
17: – Integer arrayWorkspace
18: – Real (Kind=nag_wp) arrayWorkspace
19: – IntegerInput/Output
On entry: ifail must be set to , or to set behaviour on detection of an error; these values have no effect when no error is detected.
A value of causes the printing of an error message and program execution will be halted; otherwise program execution continues. A value of means that an error message is printed while a value of means that it is not.
If halting is not appropriate, the value or is recommended. If message printing is undesirable, then the value is recommended. Otherwise, the value is recommended. When the value or is used it is essential to test the value of ifail on exit.
On exit: unless the routine detects an error or a warning has been flagged (see Section 6).
6Error Indicators and Warnings
If on entry or , explanatory error messages are output on the current error message unit (as defined by x04aaf).
On entry, and values of .
Constraint: exactly nvar elements of .
At least one cluster is empty after the initial assignment. Try a different set of initial cluster centres in cmeans and also consider decreasing the value of k. The empty clusters may be found by examining the values in nic.
Convergence has not been achieved within the maximum number of iterations . Try increasing maxit and, if possible, use the returned values in cmeans as the initial cluster centres.
An unexpected error has been triggered by this routine. Please
contact NAG.
See Section 7 in the Introduction to the NAG Library FL Interface for further information.
Your licence key may have expired or may not have been installed correctly.
See Section 8 in the Introduction to the NAG Library FL Interface for further information.
Dynamic memory allocation failed.
See Section 9 in the Introduction to the NAG Library FL Interface for further information.
7Accuracy
g03eff produces clusters that are locally optimal; the within-cluster sum of squares may not be decreased by transferring a point from one cluster to another, but different partitions may have the same or smaller within-cluster sum of squares.
8Parallelism and Performance
Background information to multithreading can be found in the Multithreading documentation.
g03eff is not threaded in any implementation.
9Further Comments
The time per iteration is approximately proportional to .
10Example
The data consists of observations of five variables on twenty soils (see Hartigan and Wong (1979)). The data is read in, the -means clustering performed and the results printed.