The function may be called by the names: g03ecc, nag_mv_cluster_hier or nag_mv_hierar_cluster_analysis.
3 Description
Given a distance or dissimilarity matrix for $n$ objects (see g03eac), cluster analysis aims to group the $n$ objects into a number of more or less homogeneous groups or clusters. With agglomerative clustering methods, a hierarchical tree is produced by starting with $n$ clusters, each containing a single object, and then, at each of $n-1$ stages, merging two clusters to form a larger cluster, until all objects are in a single cluster. This process may be represented by a dendrogram (see g03ehc).
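At each stage, the two nearest clusters are merged; methods differ in how the distance between the new cluster and the other clusters is computed. For three clusters $i$, $j$ and $k$, let $n_i$, $n_j$ and $n_k$ be the number of objects in each cluster and let $d_{ij}$, $d_{ik}$ and $d_{jk}$ be the distances between the clusters. Let clusters $j$ and $k$ be merged to give cluster $jk$; then the distance from cluster $i$ to cluster $jk$, $d_{i.jk}$, can be computed in the following ways:
1. Single link: $d_{i.jk} = \min(d_{ij}, d_{ik})$.
2. Complete link: $d_{i.jk} = \max(d_{ij}, d_{ik})$.
3. Group average: $d_{i.jk} = \frac{n_j}{n_j+n_k} d_{ij} + \frac{n_k}{n_j+n_k} d_{ik}$.
4. Centroid: $d_{i.jk} = \frac{n_j}{n_j+n_k} d_{ij} + \frac{n_k}{n_j+n_k} d_{ik} - \frac{n_j n_k}{(n_j+n_k)^2} d_{jk}$.
5. Median: $d_{i.jk} = \frac{1}{2} d_{ij} + \frac{1}{2} d_{ik} - \frac{1}{4} d_{jk}$.
6. Minimum variance: $d_{i.jk} = \frac{(n_i+n_j) d_{ij} + (n_i+n_k) d_{ik} - n_i d_{jk}}{n_i+n_j+n_k}$.
For further details see Everitt (1974) or Krzanowski (1990).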
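The six update rules can be collected into one function. The following is a minimal sketch of the standard update formulas for these methods, not the library's implementation; the `Method` enum and the name `updated_distance` are hypothetical and are not part of the NAG interface.

```c
/* Hypothetical enum for this sketch; it is not the NAG Nag_ClusterMethod type. */
typedef enum {
    SINGLE_LINK, COMPLETE_LINK, GROUP_AVERAGE, CENTROID, MEDIAN, MIN_VARIANCE
} Method;

/* Distance from cluster i to the cluster formed by merging j and k.
   ni, nj, nk are the object counts; dij, dik, djk the current distances. */
double updated_distance(Method m, double ni, double nj, double nk,
                        double dij, double dik, double djk)
{
    switch (m) {
    case SINGLE_LINK:
        return dij < dik ? dij : dik;
    case COMPLETE_LINK:
        return dij > dik ? dij : dik;
    case GROUP_AVERAGE:
        return (nj * dij + nk * dik) / (nj + nk);
    case CENTROID:
        return (nj * dij + nk * dik) / (nj + nk)
               - nj * nk * djk / ((nj + nk) * (nj + nk));
    case MEDIAN:
        return 0.5 * dij + 0.5 * dik - 0.25 * djk;
    case MIN_VARIANCE:
        return ((ni + nj) * dij + (ni + nk) * dik - ni * djk)
               / (ni + nj + nk);
    }
    return -1.0; /* not reached for a valid method */
}
```

Only the minimum variance rule uses the size of cluster $i$; the single link and complete link rules ignore all three sizes and the distance between the merged clusters.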
If the clusters are numbered $1, 2, \ldots, n$ then, for convenience, if clusters $j$ and $k$, $j < k$, merge then the new cluster will be referred to as cluster $j$. Information on the clustering history is given by the values of $j$, $k$ and $d_{jk}$ for each of the $n-1$ clustering steps. To produce a dendrogram, an ordering of the objects is required in which the clusters that merge are adjacent. This ordering is computed so that the first element is $1$. The distances associated with this ordering are also computed.
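To make the clustering history concrete, the sketch below runs a naive single link agglomeration on a packed strictly lower triangular distance array and records, for each step, the smaller label, the larger label and the merge distance. It is an illustration under stated assumptions, not the algorithm used by g03ecc: the names `packed` and `single_link_history` and the work array `alive` are hypothetical, and only the single link update is implemented.

```c
#include <stddef.h>

/* Position of d_ij (1-based labels, i > j) in the row-packed lower triangle. */
static size_t packed(size_t i, size_t j)
{
    return (i - 1) * (i - 2) / 2 + j - 1;
}

/* Naive single link agglomeration. d has n(n-1)/2 entries and is overwritten.
   alive is a work array of length n. On return ilc, iuc and cd each hold
   n-1 entries: the smaller label, the larger label and the merge distance. */
void single_link_history(size_t n, double d[], int alive[],
                         int ilc[], int iuc[], double cd[])
{
    for (size_t i = 0; i < n; i++)
        alive[i] = 1;

    for (size_t step = 0; step + 1 < n; step++) {
        size_t bj = 0, bk = 0;
        double best = 0.0;
        int found = 0;

        /* Find the nearest pair of live clusters; the strict comparison
           keeps the first pair encountered when distances tie. */
        for (size_t k = 2; k <= n; k++) {
            if (!alive[k - 1]) continue;
            for (size_t j = 1; j < k; j++) {
                if (!alive[j - 1]) continue;
                double v = d[packed(k, j)];
                if (!found || v < best) {
                    best = v; bj = j; bk = k; found = 1;
                }
            }
        }
        ilc[step] = (int)bj;
        iuc[step] = (int)bk;
        cd[step] = best;

        /* The merged cluster keeps the smaller label bj: store
           min(d_ij, d_ik) in the slot for the pair (i, bj). */
        for (size_t i = 1; i <= n; i++) {
            if (!alive[i - 1] || i == bj || i == bk) continue;
            size_t pij = i > bj ? packed(i, bj) : packed(bj, i);
            size_t pik = i > bk ? packed(i, bk) : packed(bk, i);
            if (d[pik] < d[pij]) d[pij] = d[pik];
        }
        alive[bk - 1] = 0; /* label bk is retired */
    }
}
```

For four points on a line at 0, 1, 3 and 7 the packed distances are {1, 3, 2, 7, 6, 4}; the sketch merges clusters (1, 2), then (1, 3), then (1, 4), at distances 1, 2 and 4.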
4 References
Everitt B S (1974) Cluster Analysis Heinemann
Krzanowski W J (1990) Principles of Multivariate Analysis Oxford University Press
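5 Arguments
1: method – Nag_ClusterMethod Input
On entry: indicates which clustering method is used.
method = Nag_SingleLink: Single link.
method = Nag_CompleteLink: Complete link.
method = Nag_GroupAverage: Group average.
method = Nag_Centroid: Centroid.
method = Nag_Median: Median.
method = Nag_MinVariance: Minimum variance.
Constraint: method = Nag_SingleLink, Nag_CompleteLink, Nag_GroupAverage, Nag_Centroid, Nag_Median or Nag_MinVariance.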
2: n – Integer Input
On entry: the number of objects, $n$.
Constraint: n ≥ 2.
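3: d[n(n-1)/2] – double Input/Output
On entry: the strictly lower triangle of the distance matrix, $D$. $D$ must be stored packed by rows, i.e., d[(i-1)(i-2)/2+j-1], $i > j$, must contain $d_{ij}$.
On exit: d is overwritten.
Constraint: d[i-1] ≥ 0.0, for $i = 1, 2, \ldots, n(n-1)/2$.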
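4: ilc[n-1] – Integer Output
On exit: ilc[l-1] contains the number, $j$, of the cluster merged with cluster $k$ (see iuc), $j < k$, at step $l$, for $l = 1, 2, \ldots, n-1$.
5: iuc[n-1] – Integer Output
On exit: iuc[l-1] contains the number, $k$, of the cluster merged with cluster $j$, $j < k$, at step $l$, for $l = 1, 2, \ldots, n-1$.
6: cd[n-1] – double Output
On exit: cd[l-1] contains the distance $d_{jk}$ between clusters $j$ and $k$, $j < k$, merged at step $l$, for $l = 1, 2, \ldots, n-1$.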
7: iord[n] – Integer Output
On exit: the objects in dendrogram order.
8: dord[n] – double Output
On exit: the clustering distances corresponding to the order in iord. dord[i-1] contains the distance at which clusters iord[i-1] and iord[i] merge, for $i = 1, 2, \ldots, n-1$. dord[n-1] contains the maximum distance.
9: fail – NagError * Input/Output
The NAG error argument (see Section 7 in the Introduction to the NAG Library CL Interface).
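The row-packed storage of the strictly lower triangle can be indexed with a small helper. Row $i$ (for $i \ge 2$) holds $i-1$ entries and is preceded by $(i-1)(i-2)/2$ entries from the earlier rows, which gives the offset used below. This is a sketch only; `packed_length` and `packed_index` are hypothetical names, not NAG functions.

```c
#include <assert.h>
#include <stddef.h>

/* Number of stored distances for n objects. */
size_t packed_length(size_t n)
{
    return n * (n - 1) / 2;
}

/* Zero-based position of d_ij in the packed array, for 1-based i > j >= 1. */
size_t packed_index(size_t i, size_t j)
{
    assert(i > j && j >= 1);
    return (i - 1) * (i - 2) / 2 + j - 1;
}
```

For $n = 4$ the stored order is $d_{21}, d_{31}, d_{32}, d_{41}, d_{42}, d_{43}$, so `packed_index(4, 3)` is the last of the six positions.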
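6 Error Indicators and Warnings
NE_DENDROGRAM
A true dendrogram cannot be formed because the distances at which clusters have merged are not increasing for all steps, i.e., cd[l-1] < cd[l-2] for some $l = 2, 3, \ldots, n-1$. This can occur for the median and centroid methods (method = Nag_Median or Nag_Centroid).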
NE_INT_ARG_LT
On entry, n = ⟨value⟩.
Constraint: n ≥ 2.
NE_INTERNAL_ERROR
An internal error has occurred in this function. Check the function call and any array sizes. If the call is correct then please contact NAG for assistance.
NE_REALARR
On entry, d[⟨value⟩] = ⟨value⟩.
Constraint: d[i-1] ≥ 0.0, for $i = 1, 2, \ldots, n(n-1)/2$.
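The condition behind the dendrogram error, merge distances that fail to increase from one step to the next, can be checked directly on an array of merge distances. The helper below is a hypothetical sketch (the name `first_inversion` is not part of the library) that reports the first step at which the distance drops.

```c
/* cd holds the n-1 merge distances in step order.
   Returns the first 1-based step l (2 <= l <= n-1) with cd[l-1] < cd[l-2],
   or 0 if the distances never decrease. */
int first_inversion(int n, const double cd[])
{
    for (int l = 2; l <= n - 1; l++) {
        if (cd[l - 1] < cd[l - 2])
            return l;
    }
    return 0;
}
```

A result of 0 means the merge heights are non-decreasing, so a dendrogram can be drawn with every join above the joins it contains.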
7 Accuracy
For methods other than single link or complete link, slight rounding errors may occur in the calculation of the updated distances. These would not normally affect the results significantly; however, there may be an effect if distances are (almost) equal.
If, at a stage, two distances $d_{ij}$ and $d_{kl}$, with $i < k$, or $i = k$ and $j < l$, are equal then clusters $i$ and $j$ will be merged rather than clusters $k$ and $l$. For single link clustering this choice will only affect the order of the objects in the dendrogram. However, for other methods the choice of $(i, j)$ rather than $(k, l)$ may affect the shape of the dendrogram. If either of the distances $d_{ij}$ or $d_{kl}$ is affected by rounding errors then their equality, and hence the dendrogram, may be affected.
8 Parallelism and Performance
Background information to multithreading can be found in the Multithreading documentation.
g03ecc is not threaded in any implementation.
9 Further Comments
The dendrogram may be formed using g03ehc. Groupings based on the clusters formed at a given distance can be computed using g03ejc.
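The dendrogram ordering and its associated distances are enough to recover groupings at a chosen distance. The sketch below assumes a valid (non-decreasing) dendrogram and treats a merge at a distance less than or equal to the chosen level as already formed; it is not the algorithm of g03ejc, whose handling of the boundary case may differ. The name `cut_dendrogram` and the `group` output array are hypothetical.

```c
/* iord[0..n-1]: 1-based object labels in dendrogram order.
   dord[i-1]:    distance at which the objects at positions i and i+1 join.
   group[obj-1] receives a 1-based group label for each object.
   Returns the number of groups at the given level. */
int cut_dendrogram(int n, const int iord[], const double dord[],
                   double level, int group[])
{
    int g = 1;
    group[iord[0] - 1] = g;
    for (int i = 1; i < n; i++) {
        /* A join above the level separates neighbouring positions. */
        if (dord[i - 1] > level)
            g++;
        group[iord[i] - 1] = g;
    }
    return g;
}
```

With the order 1, 2, 3, 4 and distances 1, 2, 4, 4, a level of 1.5 gives three groups ({1, 2}, {3}, {4}) and a level of 2 gives two ({1, 2, 3}, {4}).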
10 Example
Data consisting of three variables on five objects are read in. Squared Euclidean distances based on two of the variables are computed using g03eac, the objects are clustered using g03ecc, and the dendrogram is computed using g03ehc. The dendrogram is then printed.
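The distance step of the example can be sketched without the library. The function below fills a row-packed strictly lower triangle with squared Euclidean distances computed from a chosen subset of variables. It is an illustration only: `squared_euclidean_packed` and the `use` selector array are hypothetical and do not mirror the interface of g03eac, which offers further options that are not shown here.

```c
#include <stddef.h>

/* x is an n-by-m data matrix stored row by row; use[v] != 0 selects
   variable v. d must have room for n(n-1)/2 values and receives
   d_21, d_31, d_32, d_41, ... in that order. */
void squared_euclidean_packed(size_t n, size_t m, const double x[],
                              const int use[], double d[])
{
    size_t p = 0;
    for (size_t i = 1; i < n; i++) {
        for (size_t j = 0; j < i; j++) {
            double s = 0.0;
            for (size_t v = 0; v < m; v++) {
                if (use[v]) {
                    double t = x[i * m + v] - x[j * m + v];
                    s += t * t;
                }
            }
            d[p++] = s;
        }
    }
}
```

With three made-up objects (0, 0, 9), (3, 4, 9) and (3, 0, 1) and only the first two variables selected, the packed result is {25, 9, 16}.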