PDF version (NAG web site
, 64-bit version, 64-bit version)
NAG Toolbox: nag_mv_cluster_hier (g03ec)
Purpose
nag_mv_cluster_hier (g03ec) performs hierarchical cluster analysis.
Syntax
Description
Given a distance or dissimilarity matrix for
objects (see
nag_mv_distance_mat (g03ea)), cluster analysis aims to group the
objects into a number of more or less homogeneous groups or clusters. With agglomerative clustering methods, a hierarchical tree is produced by starting with
clusters, each with a single object and then at each of
stages, merging two clusters to form a larger cluster, until all objects are in a single cluster. This process may be represented by a dendrogram (see
nag_mv_cluster_hier_dendrogram (g03eh)).
At each stage, the clusters that are nearest are merged, methods differ as to how the distances between the new cluster and other clusters are computed. For three clusters
,
and
let
,
and
be the number of objects in each cluster and let
,
and
be the distances between the clusters. Let clusters
and
be merged to give cluster
, then the distance from cluster
to cluster
,
can be computed in the following ways.
1. |
Single link or nearest neighbour : . |
2. |
Complete link or furthest neighbour : . |
3. |
Group average : . |
4. |
Centroid : . |
5. |
Median : . |
6. |
Minimum variance : . |
If the clusters are numbered then, for convenience, if clusters and , , merge then the new cluster will be referred to as cluster . Information on the clustering history is given by the values of , and for each of the clustering steps. In order to produce a dendrogram, the ordering of the objects such that the clusters that merge are adjacent is required. This ordering is computed so that the first element is . The associated distances with this ordering are also computed.
References
Everitt B S (1974) Cluster Analysis Heinemann
Krzanowski W J (1990) Principles of Multivariate Analysis Oxford University Press
Parameters
Compulsory Input Parameters
- 1:
– int64int32nag_int scalar
-
Indicates which clustering method is used.
- Single link.
- Complete link.
- Group average.
- Centroid.
- Median.
- Minimum variance.
Constraint:
, , , , or .
- 2:
– int64int32nag_int scalar
-
, the number of objects.
Constraint:
.
- 3:
– double array
-
The strictly lower triangle of the distance matrix. must be stored packed by rows, i.e., , must contain .
Constraint:
, for .
Optional Input Parameters
None.
Output Parameters
- 1:
– double array
-
Is overwritten.
- 2:
– int64int32nag_int array
-
contains the number,
, of the cluster merged with cluster
(see
iuc),
, at step
, for
.
- 3:
– int64int32nag_int array
-
contains the number, , of the cluster merged with cluster , , at step , for .
- 4:
– double array
-
contains the distance , between clusters and , , merged at step , for .
- 5:
– int64int32nag_int array
-
The objects in dendrogram order.
- 6:
– double array
-
The clustering distances corresponding to the order in
iord.
contains the distance at which cluster
and
merge, for
.
contains the maximum distance.
- 7:
– int64int32nag_int scalar
unless the function detects an error (see
Error Indicators and Warnings).
Error Indicators and Warnings
Errors or warnings detected by the function:
-
-
On entry, | , , , , or , |
or | . |
-
-
On entry, | for some . |
-
-
A true dendrogram cannot be formed because the distances at which clusters have merged are not increasing for all steps, i.e., for some . This can occur for the median and centroid methods.
-
An unexpected error has been triggered by this routine. Please
contact
NAG.
-
Your licence key may have expired or may not have been installed correctly.
-
Dynamic memory allocation failed.
Accuracy
For slight rounding errors may occur in the calculations of the updated distances. These would not normally significantly affect the results, however there may be an effect if distances are (almost) equal.
If at a stage, two distances and , () or (), and , are equal then clusters and will be merged rather than clusters and . For single link clustering this choice will only affect the order of the objects in the dendrogram. However, for other methods the choice of rather than may affect the shape of the dendrogram. If either of the distances and is affected by rounding errors then their equality, and hence the dendrogram, may be affected.
Further Comments
The dendrogram may be formed using
nag_mv_cluster_hier_dendrogram (g03eh). Groupings based on the clusters formed at a given distance can be computed using
nag_mv_cluster_hier_indicator (g03ej).
Example
Data consisting of three variables on five objects are read in. Euclidean squared distances based on two variables are computed using
nag_mv_distance_mat (g03ea), the objects are clustered using
nag_mv_cluster_hier (g03ec) and the dendrogram computed using
nag_mv_cluster_hier_dendrogram (g03eh). The dendrogram is then printed.
Open in the MATLAB editor:
g03ec_example
function g03ec_example
fprintf('g03ec example results\n\n');
x = [1, 5, 2;
2, 1, 1;
3, 4, 3;
4, 1, 2;
5, 5, 0];
[n,m] = size(x);
isx = ones(m,1,'int64');
isx(1) = int64(0);
s = ones(m,1);
ld = (n*(n-1))/2;
d = zeros(ld,1);
update = 'I';
dist = 'S';
scal = 'U';
[s, d, ifail] = g03ea( ...
update, dist, scal, x, isx, s, d);
method = int64(5);
n = int64(n);
[d, ilc, iuc, cd, iord, dord, ifail] = ...
g03ec(method, n, d);
row = {'A'; 'B'; 'C'; 'D'; 'E'};
fprintf(' Distance Clusters Joined\n\n');
for i = 1:n-1
fprintf('%10.3f %s %s\n', cd(i), row{ilc(i)}, row{iuc(i)})
end
g03ec example results
Distance Clusters Joined
1.000 B D
2.000 A C
6.500 A E
14.125 A B
PDF version (NAG web site
, 64-bit version, 64-bit version)
© The Numerical Algorithms Group Ltd, Oxford, UK. 2009–2015