The function may be called by the names: g03efc, nag_mv_cluster_kmeans or nag_mv_kmeans_cluster_analysis.
3Description
Given objects with variables measured on each object, for and , g03efc allocates each object to one of groups or clusters to minimize the within-cluster sum of squares:
where is the set of objects in the th cluster and is the mean for the variable over cluster . This is often known as -means clustering.
In addition to the data matrix, a matrix giving the initial cluster centres for the clusters is required. The objects are then initially allocated to the cluster with the nearest cluster mean. Given the initial allocation, the procedure is to iteratively search for the -partition with locally optimal within-cluster sum of squares by moving points from one cluster to another.
Optionally, weights for each object, , can be used so that the clustering is based on within-cluster weighted sums of squares:
where is the weighted mean for variable over cluster .
On entry: the number of variables included in the sum of squares calculations, .
Constraint:
.
7: – IntegerInput
On entry: the number of clusters, .
Constraint:
.
8: – doubleInput/Output
On entry: must contain the value of the th variable for the th initial cluster centre, for and .
On exit: contains the value of the th variable for the th computed cluster centre, for and .
9: – IntegerInput
On entry: the stride separating matrix column elements in the array cmeans.
Constraint:
.
10: – const doubleInput
On entry: the elements of wt must contain the weights to be used in the analysis. The effective number of observations is the sum of the weights. If then the th observation is not included in the analysis.
If weights are not provided then wt must be set to NULL and the effective number of observations is n.
Constraints:
, for ;
for at least two values of .
11: – IntegerOutput
On exit: contains the cluster to which the th object has been allocated, for .
12: – IntegerOutput
On exit: contains the number of objects in the th cluster, for .
13: – doubleOutput
On exit: contains the within-cluster (weighted) sum of squares of the th cluster, for .
14: – doubleOutput
On exit: contains the within-cluster sum of weights of the th cluster, for . If the sum of weights is the number of objects in the cluster.
15: – IntegerInput
On entry: the maximum number of iterations allowed in the analysis.
Suggested value:
.
Constraint:
.
16: – NagError *Input/Output
The NAG error argument (see Section 7 in the Introduction to the NAG Library CL Interface).
6Error Indicators and Warnings
NE_2_INT_ARG_LT
On entry, while . These arguments must satisfy .
On entry, while . These arguments must satisfy .
On entry, while . These arguments must satisfy .
NE_ALLOC_FAIL
Dynamic memory allocation failed.
NE_CLUSTER_EMPTY
At least one cluster is empty after the initial assignment.
Try a different set of initial cluster centres in cmeans and also consider decreasing the value of k. The empty clusters may be found by examining the values in nic.
NE_INT_ARG_LE
On entry, .
Constraint: .
NE_INT_ARG_LT
On entry, .
Constraint: .
On entry, .
Constraint: .
On entry, .
Constraint: .
NE_INTERNAL_ERROR
An internal error has occurred in this function. Check the function call
and any array sizes. If the call is correct then please contact NAG for
assistance.
NE_NEG_WEIGHT_ELEMENT
On entry, .
Constraint: When referenced, all elements of wt must be non-negative.
NE_TOO_MANY
Too many iterations (). Convergence has not been achieved within the maximum number of iterations given by maxit. Try increasing maxit and, if possible, use the returned values in cmeans as the initial cluster centres.
NE_VAR_INCL_INDICATED
The number of variables, nvar in the analysis , while number of variables included in the analysis via array .
Constraint: these two numbers must be the same.
NE_WT_ZERO
At least two elements of wt must be greater than zero.
7Accuracy
g03efc produces clusters that are locally optimal; the within-cluster sum of squares may not be decreased by transferring a point from one cluster to another, but different partitions may have the same or smaller within-cluster sum of squares.
8Parallelism and Performance
g03efc is not threaded in any implementation.
9Further Comments
The time per iteration is approximately proportional to .
10Example
The data consists of observations of five variables on twenty soils (Kendall and Stuart (1976)). The data is read in, the -means clustering performed and the results printed.