void	g03efc (Integer n, Integer m, const double x[], Integer tdx, const Integer isx[], Integer nvar, Integer k, double cmeans[], Integer tdc, const double wt[], Integer inc[], Integer nic[], double css[], double csw[], Integer maxit, NagError *fail)

The function may be called by the names: g03efc, nag_mv_cluster_kmeans or nag_mv_kmeans_cluster_analysis.

3 Description

Given

n

objects with

p

variables measured on each object,

x_{i j}

for

i = 1, 2, \dots, n

and

j = 1, 2, \dots, p

, g03efc allocates each object to one of

K

groups or clusters to minimize the within-cluster sum of squares:

\sum_{k = 1}^{K} \sum_{i \in S_{k}} \sum_{j = 1}^{p} {(x_{i j} - {\bar{x}}_{k j})}^{2},

where

S_{k}

is the set of objects in the

k

th cluster and

{\bar{x}}_{k j}

is the mean for the variable

j

over cluster

k

. This is often known as

K

-means clustering.

In addition to the data matrix, a

K

p

matrix giving the initial cluster centres for the

K

clusters is required. The objects are then initially allocated to the cluster with the nearest cluster mean. Given the initial allocation, the procedure is to iteratively search for the

K

-partition with locally optimal within-cluster sum of squares by moving points from one cluster to another.

Optionally, weights for each object,

w_{i}

, can be used so that the clustering is based on within-cluster weighted sums of squares:

\sum_{k = 1}^{K} \sum_{i \in S_{k}} \sum_{j = 1}^{p} w_{i} {(x_{i j} - {\tilde{x}}_{k j})}^{2},

where

{\tilde{x}}_{k j}

is the weighted mean for variable

j

over cluster

k

The function is based on the algorithm of Hartigan and Wong (1979).

4 References

Everitt B S (1974) Cluster Analysis Heinemann

Hartigan J A and Wong M A (1979) Algorithm AS 136: A K-means clustering algorithm Appl. Statist. 28 100–108

Kendall M G and Stuart A (1976) The Advanced Theory of Statistics (Volume 3) (3rd Edition) Griffin

Krzanowski W J (1990) Principles of Multivariate Analysis Oxford University Press

5 Arguments

1: $n$ – Integer Input

On entry: the number of observations,

n

Constraint:

n \geq 2

2: $m$ – Integer Input

On entry: the number of variables in the array x.

Constraint:

m \geq nvar

3: $x [n \times tdx]$ – const double Input

On entry:

x [(i - 1) \times tdx + j - 1]

must contain the value of

j

th variable for the

i

th object, for

i = 1, 2, \dots, n

and

j = 1, 2, \dots, m

4: $tdx$ – Integer Input

On entry: the stride separating matrix column elements in the array x.

Constraint:

tdx \geq m

5: $isx [m]$ – const Integer Input

On entry:

isx [j - 1]

indicates whether or not the

j

th variable is to be included in the analysis.

isx [j - 1] > 0

, then the

j

th variable contained in the

j

th column of x is included, for

j = 1, 2, \dots, m

Constraint:

isx [j - 1] > 0

for nvar values of

j

6: $nvar$ – Integer Input

On entry: the number of variables included in the sum of squares calculations,

p

Constraint:

1 \leq nvar \leq m

7: $k$ – Integer Input

On entry: the number of clusters,

K

Constraint:

k \geq 2

8: $cmeans [k \times tdc]$ – double Input/Output

On entry:

cmeans [(i - 1) \times tdc + j - 1]

must contain the value of the

j

th variable for the

i

th initial cluster centre, for

i = 1, 2, \dots, K

and

j = 1, 2, \dots, p

On exit:

cmeans [(i - 1) \times tdc + j - 1]

contains the value of the

j

th variable for the

i

th computed cluster centre, for

i = 1, 2, \dots, K

and

j = 1, 2, \dots, p

9: $tdc$ – Integer Input

On entry: the stride separating matrix column elements in the array cmeans.

Constraint:

tdc \geq nvar

10: $wt [n]$ – const double Input

On entry: the elements of wt must contain the weights to be used in the analysis. The effective number of observations is the sum of the weights. If

wt [i - 1] = 0.0

then the

i

th observation is not included in the analysis.

If weights are not provided then wt must be set to NULL and the effective number of observations is n.

Constraints:

$wt [i - 1] \geq 0.0$ , for $i = 1, 2, \dots, n$ ;
$wt [i - 1] > 0.0$ for at least two values of $i$ .

11: $inc [n]$ – Integer Output

On exit:

inc [i - 1]

contains the cluster to which the

i

th object has been allocated, for

i = 1, 2, \dots, n

12: $nic [k]$ – Integer Output

On exit:

nic [i - 1]

contains the number of objects in the

i

th cluster, for

i = 1, 2, \dots, K

13: $css [k]$ – double Output

On exit:

css [i - 1]

contains the within-cluster (weighted) sum of squares of the

i

th cluster, for

i = 1, 2, \dots, K

14: $csw [k]$ – double Output

On exit:

csw [i - 1]

contains the within-cluster sum of weights of the

i

th cluster, for

i = 1, 2, \dots, K

. If

wt = NULL

the sum of weights is the number of objects in the cluster.

15: $maxit$ – Integer Input

On entry: the maximum number of iterations allowed in the analysis.

Suggested value:

maxit = 10

Constraint:

maxit > 0

16: $fail$ – NagError * Input/Output

The NAG error argument (see Section 7 in the Introduction to the NAG Library CL Interface).

6 Error Indicators and Warnings

NE_2_INT_ARG_LT: On entry, $m = 〈value〉$ while $nvar = 〈value〉$ . These arguments must satisfy $m \geq nvar$ .

On entry, $tdc = 〈value〉$ while $nvar = 〈value〉$ . These arguments must satisfy $tdc \geq nvar$ .

On entry, $tdx = 〈value〉$ while $m = 〈value〉$ . These arguments must satisfy $tdx \geq m$ .
NE_ALLOC_FAIL: Dynamic memory allocation failed.
NE_CLUSTER_EMPTY: At least one cluster is empty after the initial assignment.
Try a different set of initial cluster centres in cmeans and also consider decreasing the value of k. The empty clusters may be found by examining the values in nic.
NE_INT_ARG_LE: On entry, $maxit = 〈value〉$ .
Constraint: $maxit > 0$ .
NE_INT_ARG_LT: On entry, $k = 〈value〉$ .
Constraint: $k \geq 2$ .

On entry, $n = 〈value〉$ .
Constraint: $n \geq 2$ .

On entry, $nvar = 〈value〉$ .
Constraint: $nvar \geq 1$ .
NE_INTERNAL_ERROR: An internal error has occurred in this function. Check the function call and any array sizes. If the call is correct then please contact NAG for assistance.
NE_NEG_WEIGHT_ELEMENT: On entry, $wt [〈value〉] = 〈value〉$ .
Constraint: When referenced, all elements of wt must be non-negative.
NE_TOO_MANY: Too many iterations ( $〈value〉$ ). Convergence has not been achieved within the maximum number of iterations given by maxit. Try increasing maxit and, if possible, use the returned values in cmeans as the initial cluster centres.
NE_VAR_INCL_INDICATED: The number of variables, nvar in the analysis $= 〈value〉$ , while number of variables included in the analysis via array $isx = 〈value〉$ .
Constraint: these two numbers must be the same.
NE_WT_ZERO: At least two elements of wt must be greater than zero.

7 Accuracy

g03efc produces clusters that are locally optimal; the within-cluster sum of squares may not be decreased by transferring a point from one cluster to another, but different partitions may have the same or smaller within-cluster sum of squares.

8 Parallelism and Performance

g03efc is not threaded in any implementation.

9 Further Comments

The time per iteration is approximately proportional to

n p K

10 Example

The data consists of observations of five variables on twenty soils (Kendall and Stuart (1976)). The data is read in, the

K

-means clustering performed and the results printed.

Interfaces: FL CL AD

NAG CL Interface Introduction

G03 (Mv) Chapter Contents

G03 (Mv) Chapter Introduction

g03ef: FL CL AD

NAG CL Interfaceg03efc (cluster_​kmeans)

▸▿ Contents

1 Purpose

2 Specification

3 Description

4 References

5 Arguments

6 Error Indicators and Warnings

7 Accuracy

8 Parallelism and Performance

9 Further Comments

10 Example

10.1 Program Text

10.2 Program Data

10.3 Program Results

NAG CL Interface
g03efc (cluster_kmeans)