NAG Library Function Document
nag_mv_kmeans_cluster_analysis (g03efc)
1 Purpose
nag_mv_kmeans_cluster_analysis (g03efc) performs -means cluster analysis.
2 Specification
#include <nag.h> |
#include <nagg03.h> |
void |
nag_mv_kmeans_cluster_analysis (Integer n,
Integer m,
const double x[],
Integer tdx,
const Integer isx[],
Integer nvar,
Integer k,
double cmeans[],
Integer tdc,
const double wt[],
Integer inc[],
Integer nic[],
double css[],
double csw[],
Integer maxit,
NagError *fail) |
|
3 Description
Given
objects with
variables measured on each object,
for
and
, nag_mv_kmeans_cluster_analysis (g03efc) allocates each object to one of
groups or clusters to minimize the within-cluster sum of squares:
where
is the set of objects in the
th cluster and
is the mean for the variable
over cluster
. This is often known as
-means clustering.
In addition to the data matrix, a by matrix giving the initial cluster centres for the clusters is required. The objects are then initially allocated to the cluster with the nearest cluster mean. Given the initial allocation, the procedure is to iteratively search for the -partition with locally optimal within-cluster sum of squares by moving points from one cluster to another.
Optionally, weights for each object,
, can be used so that the clustering is based on within-cluster weighted sums of squares:
where
is the weighted mean for variable
over cluster
.
The function is based on the algorithm of
Hartigan and Wong (1979).
4 References
Everitt B S (1974) Cluster Analysis Heinemann
Hartigan J A and Wong M A (1979) Algorithm AS 136: A K-means clustering algorithm Appl. Statist. 28 100–108
Kendall M G and Stuart A (1976) The Advanced Theory of Statistics (Volume 3) (3rd Edition) Griffin
Krzanowski W J (1990) Principles of Multivariate Analysis Oxford University Press
5 Arguments
- 1:
n – IntegerInput
On entry: the number of observations, .
Constraint:
.
- 2:
m – IntegerInput
On entry: the number of variables in the array
x.
Constraint:
.
- 3:
x[] – const doubleInput
-
On entry: must contain the value of th variable for the th object, for and .
- 4:
tdx – IntegerInput
-
On entry: the stride separating matrix column elements in the array
x.
Constraint:
.
- 5:
isx[m] – const IntegerInput
-
On entry:
indicates whether or not the
th variable is to be included in the analysis.
If
, then the
th variable contained in the
th column of
x is included, for
.
Constraint:
for
nvar values of
.
- 6:
nvar – IntegerInput
On entry: the number of variables included in the sum of squares calculations, .
Constraint:
.
- 7:
k – IntegerInput
On entry: the number of clusters, .
Constraint:
.
- 8:
cmeans[] – doubleInput/Output
-
On entry: must contain the value of the th variable for the th initial cluster centre, for and .
On exit: contains the value of the th variable for the th computed cluster centre, for and .
- 9:
tdc – IntegerInput
-
On entry: the stride separating matrix column elements in the array
cmeans.
Constraint:
.
- 10:
wt[n] – const doubleInput
-
On entry: the elements of
wt must contain the weights to be used in the analysis. The effective number of observations is the sum of the weights. If
then the
th observation is not included in the analysis.
If weights are not provided then
wt must be set to
NULL and the effective number of observations is
n.
Constraints:
- , for ;
- for at least two values of .
- 11:
inc[n] – IntegerOutput
-
On exit: contains the cluster to which the th object has been allocated, for .
- 12:
nic[k] – IntegerOutput
-
On exit: contains the number of objects in the th cluster, for .
- 13:
css[k] – doubleOutput
-
On exit: contains the within-cluster (weighted) sum of squares of the th cluster, for .
- 14:
csw[k] – doubleOutput
-
On exit: contains the within-cluster sum of weights of the th cluster, for . If the sum of weights is the number of objects in the cluster.
- 15:
maxit – IntegerInput
-
On entry: the maximum number of iterations allowed in the analysis.
Suggested value:
.
Constraint:
.
- 16:
fail – NagError *Input/Output
-
The NAG error argument (see
Section 3.6 in the Essential Introduction).
6 Error Indicators and Warnings
- NE_2_INT_ARG_LT
-
On entry, while . These arguments must satisfy .
On entry, while . These arguments must satisfy .
On entry, while . These arguments must satisfy .
- NE_ALLOC_FAIL
-
Dynamic memory allocation failed.
- NE_CLUSTER_EMPTY
-
At least one cluster is empty after the initial assignment.
Try a different set of initial cluster centres in
cmeans and also consider decreasing the value of
k. The empty clusters may be found by examining the values in
nic.
- NE_INT_ARG_LE
-
On entry, .
Constraint: .
- NE_INT_ARG_LT
-
On entry, .
Constraint: .
On entry, .
Constraint: .
On entry, .
Constraint: .
- NE_INTERNAL_ERROR
-
An internal error has occurred in this function. Check the function call
and any array sizes. If the call is correct then please contact
NAG for
assistance.
- NE_NEG_WEIGHT_ELEMENT
-
On entry,
.
Constraint: When referenced, all elements of
wt must be non-negative.
- NE_TOO_MANY
-
Too many iterations (
). Convergence has not been achieved within the maximum number of iterations given by
maxit. Try increasing
maxit and, if possible, use the returned values in
cmeans as the initial cluster centres.
- NE_VAR_INCL_INDICATED
-
The number of variables,
nvar in the analysis
, while number of variables included in the analysis via array
.
Constraint: these two numbers must be the same.
- NE_WT_ZERO
-
At least two elements of
wt must be greater than zero.
7 Accuracy
nag_mv_kmeans_cluster_analysis (g03efc) produces clusters that are locally optimal; the within-cluster sum of squares may not be decreased by transferring a point from one cluster to another, but different partitions may have the same or smaller within-cluster sum of squares.
8 Parallelism and Performance
Not applicable.
The time per iteration is approximately proportional to .
10 Example
The data consists of observations of five variables on twenty soils (
Kendall and Stuart (1976)). The data is read in, the
-means clustering performed and the results printed.
10.1 Program Text
Program Text (g03efce.c)
10.2 Program Data
Program Data (g03efce.d)
10.3 Program Results
Program Results (g03efce.r)