NAG FL Interface
g03eff (cluster_kmeans)
1
Purpose
g03eff performs -means cluster analysis.
2
Specification
Fortran Interface
Subroutine g03eff ( |
weight, n, m, x, ldx, isx, nvar, k, cmeans, ldc, wt, inc, nic, css, csw, maxit, iwk, wk, ifail) |
Integer, Intent (In) |
:: |
n, m, ldx, isx(m), nvar, k, ldc, maxit |
Integer, Intent (Inout) |
:: |
ifail |
Integer, Intent (Out) |
:: |
inc(n), nic(k), iwk(n+3*k) |
Real (Kind=nag_wp), Intent (In) |
:: |
x(ldx,m), wt(*) |
Real (Kind=nag_wp), Intent (Inout) |
:: |
cmeans(ldc,nvar) |
Real (Kind=nag_wp), Intent (Out) |
:: |
css(k), csw(k), wk(n+2*k) |
Character (1), Intent (In) |
:: |
weight |
|
C Header Interface
#include <nag.h>
void |
g03eff_ (const char *weight, const Integer *n, const Integer *m, const double x[], const Integer *ldx, const Integer isx[], const Integer *nvar, const Integer *k, double cmeans[], const Integer *ldc, const double wt[], Integer inc[], Integer nic[], double css[], double csw[], const Integer *maxit, Integer iwk[], double wk[], Integer *ifail, const Charlen length_weight) |
|
C++ Header Interface
#include <nag.h> extern "C" {
void |
g03eff_ (const char *weight, const Integer &n, const Integer &m, const double x[], const Integer &ldx, const Integer isx[], const Integer &nvar, const Integer &k, double cmeans[], const Integer &ldc, const double wt[], Integer inc[], Integer nic[], double css[], double csw[], const Integer &maxit, Integer iwk[], double wk[], Integer &ifail, const Charlen length_weight) |
}
|
The routine may be called by the names g03eff or nagf_mv_cluster_kmeans.
3
Description
Given
objects with
variables measured on each object,
, for
and
,
g03eff allocates each object to one of
groups or clusters to minimize the within-cluster sum of squares:
where
is the set of objects in the
th cluster and
is the mean for the variable
over cluster
. This is often known as
-means clustering.
In addition to the data matrix, a by matrix giving the initial cluster centres for the clusters is required. The objects are then initially allocated to the cluster with the nearest cluster mean. Given the initial allocation, the procedure is to iteratively search for the -partition with locally optimal within-cluster sum of squares by moving points from one cluster to another.
Optionally, weights for each object,
, can be used so that the clustering is based on within-cluster weighted sums of squares:
where
is the weighted mean for variable
over cluster
.
The routine is based on the algorithm of
Hartigan and Wong (1979).
4
References
Everitt B S (1974) Cluster Analysis Heinemann
Hartigan J A and Wong M A (1979) Algorithm AS 136: A K-means clustering algorithm Appl. Statist. 28 100–108
Kendall M G and Stuart A (1976) The Advanced Theory of Statistics (Volume 3) (3rd Edition) Griffin
Krzanowski W J (1990) Principles of Multivariate Analysis Oxford University Press
5
Arguments
-
1:
– Character(1)
Input
-
On entry: indicates if weights are to be used.
- No weights are used.
- Weights are used and must be supplied in wt.
Constraint:
or .
-
2:
– Integer
Input
-
On entry: , the number of objects.
Constraint:
.
-
3:
– Integer
Input
-
On entry: the total number of variables in array
x.
Constraint:
.
-
4:
– Real (Kind=nag_wp) array
Input
-
On entry: must contain the value of the th variable for the th object, for and .
-
5:
– Integer
Input
-
On entry: the first dimension of the array
x as declared in the (sub)program from which
g03eff is called.
Constraint:
.
-
6:
– Integer array
Input
-
On entry:
indicates whether or not the
th variable is to be included in the analysis. If
, the variable contained in the
th column of
x is included, for
.
Constraint:
for
nvar values of
.
-
7:
– Integer
Input
-
On entry: , the number of variables included in the sums of squares calculations.
Constraint:
.
-
8:
– Integer
Input
-
On entry: , the number of clusters.
Constraint:
.
-
9:
– Real (Kind=nag_wp) array
Input/Output
-
On entry: must contain the value of the th variable for the th initial cluster centre, for and .
On exit: contains the value of the th variable for the th computed cluster centre, for and .
-
10:
– Integer
Input
-
On entry: the first dimension of the array
cmeans as declared in the (sub)program from which
g03eff is called.
Constraint:
.
-
11:
– Real (Kind=nag_wp) array
Input
-
Note: the dimension of the array
wt
must be at least
if
, and at least
otherwise.
On entry: if
, the first
elements of
wt must contain the weights to be used.
If , the th observation is not included in the analysis. The effective number of observation is the sum of the weights.
If
,
wt is not referenced and the effective number of observations is
.
Constraint:
if ,
and for at least two values of , for .
-
12:
– Integer array
Output
-
On exit: contains the cluster to which the th object has been allocated, for .
-
13:
– Integer array
Output
-
On exit: contains the number of objects in the th cluster, for .
-
14:
– Real (Kind=nag_wp) array
Output
-
On exit: contains the within-cluster (weighted) sum of squares of the th cluster, for .
-
15:
– Real (Kind=nag_wp) array
Output
-
On exit: contains the within-cluster sum of weights of the th cluster, for . If , the sum of weights is the number of objects in the cluster.
-
16:
– Integer
Input
-
On entry: the maximum number of iterations allowed in the analysis.
Suggested value:
.
Constraint:
.
-
17:
– Integer array
Workspace
-
-
18:
– Real (Kind=nag_wp) array
Workspace
-
-
19:
– Integer
Input/Output
-
On entry:
ifail must be set to
,
or
to set behaviour on detection of an error; these values have no effect when no error is detected.
A value of causes the printing of an error message and program execution will be halted; otherwise program execution continues. A value of means that an error message is printed while a value of means that it is not.
If halting is not appropriate, the value
or
is recommended. If message printing is undesirable, then the value
is recommended. Otherwise, the value
is recommended.
When the value or is used it is essential to test the value of ifail on exit.
On exit:
unless the routine detects an error or a warning has been flagged (see
Section 6).
6
Error Indicators and Warnings
If on entry
or
, explanatory error messages are output on the current error message unit (as defined by
x04aaf).
Errors or warnings detected by the routine:
-
On entry, .
Constraint: .
On entry, and .
Constraint: .
On entry, and .
Constraint: .
On entry, and .
Constraint: .
On entry, .
Constraint: .
On entry, .
Constraint: .
On entry, .
Constraint: .
On entry, .
Constraint: or .
-
On entry, and .
Constraint: .
On entry,
wt has less than two positive values.
-
On entry,
and
values of
.
Constraint: exactly
nvar elements of
.
-
At least one cluster is empty after the initial assignment. Try a different set of initial cluster centres in
cmeans and also consider decreasing the value of
k. The empty clusters may be found by examining the values in
nic.
-
Convergence has not been achieved within the maximum number of iterations
. Try increasing
maxit and, if possible, use the returned values in
cmeans as the initial cluster centres.
An unexpected error has been triggered by this routine. Please
contact
NAG.
See
Section 7 in the Introduction to the NAG Library FL Interface for further information.
Your licence key may have expired or may not have been installed correctly.
See
Section 8 in the Introduction to the NAG Library FL Interface for further information.
Dynamic memory allocation failed.
See
Section 9 in the Introduction to the NAG Library FL Interface for further information.
7
Accuracy
g03eff produces clusters that are locally optimal; the within-cluster sum of squares may not be decreased by transferring a point from one cluster to another, but different partitions may have the same or smaller within-cluster sum of squares.
8
Parallelism and Performance
g03eff is not threaded in any implementation.
The time per iteration is approximately proportional to .
10
Example
The data consists of observations of five variables on twenty soils (see
Hartigan and Wong (1979)). The data is read in, the
-means clustering performed and the results printed.
10.1
Program Text
10.2
Program Data
10.3
Program Results