NAG FL Interface
g03ecf (cluster_hier)
1
Purpose
g03ecf performs hierarchical cluster analysis.
2
Specification
Fortran Interface
Subroutine g03ecf (method, n, d, ilc, iuc, cd, iord, dord, iwk, ifail)
Integer, Intent (In) :: method, n
Integer, Intent (Inout) :: ifail
Integer, Intent (Out) :: ilc(n-1), iuc(n-1), iord(n), iwk(2*n)
Real (Kind=nag_wp), Intent (Inout) :: d(n*(n-1)/2)
Real (Kind=nag_wp), Intent (Out) :: cd(n-1), dord(n)

C Header Interface
#include <nag.h>
void 
g03ecf_ (const Integer *method, const Integer *n, double d[], Integer ilc[], Integer iuc[], double cd[], Integer iord[], double dord[], Integer iwk[], Integer *ifail) 

C++ Header Interface
#include <nag.h>
extern "C" {
void 
g03ecf_ (const Integer &method, const Integer &n, double d[], Integer ilc[], Integer iuc[], double cd[], Integer iord[], double dord[], Integer iwk[], Integer &ifail) 
}

The routine may be called by the names g03ecf or nagf_mv_cluster_hier.
3
Description
Given a distance or dissimilarity matrix for $n$ objects (see g03eaf), cluster analysis aims to group the $n$ objects into a number of more or less homogeneous groups or clusters. With agglomerative clustering methods, a hierarchical tree is produced by starting with $n$ clusters, each containing a single object, and then, at each of $n-1$ stages, merging two clusters to form a larger cluster until all objects are in a single cluster. This process may be represented by a dendrogram (see g03ehf).
At each stage, the two nearest clusters are merged; methods differ in how the distances between the new cluster and the other clusters are computed. For three clusters $i$, $j$ and $k$, let ${n}_{i}$, ${n}_{j}$ and ${n}_{k}$ be the number of objects in each cluster and let ${d}_{ij}$, ${d}_{ik}$ and ${d}_{jk}$ be the distances between the clusters. If clusters $j$ and $k$ are merged to give cluster $jk$, then the distance from cluster $i$ to cluster $jk$, ${d}_{i.jk}$, can be computed in the following ways.

1. Single link or nearest neighbour: ${d}_{i.jk}=\min\left({d}_{ij},{d}_{ik}\right)$.

2. Complete link or furthest neighbour: ${d}_{i.jk}=\max\left({d}_{ij},{d}_{ik}\right)$.

3. Group average: ${d}_{i.jk}=\frac{{n}_{j}}{{n}_{j}+{n}_{k}}{d}_{ij}+\frac{{n}_{k}}{{n}_{j}+{n}_{k}}{d}_{ik}$.

4. Centroid: ${d}_{i.jk}=\frac{{n}_{j}}{{n}_{j}+{n}_{k}}{d}_{ij}+\frac{{n}_{k}}{{n}_{j}+{n}_{k}}{d}_{ik}-\frac{{n}_{j}{n}_{k}}{{\left({n}_{j}+{n}_{k}\right)}^{2}}{d}_{jk}$.

5. Median: ${d}_{i.jk}=\frac{1}{2}{d}_{ij}+\frac{1}{2}{d}_{ik}-\frac{1}{4}{d}_{jk}$.

6. Minimum variance: ${d}_{i.jk}=\left\{\left({n}_{i}+{n}_{j}\right){d}_{ij}+\left({n}_{i}+{n}_{k}\right){d}_{ik}-{n}_{i}{d}_{jk}\right\}/\left({n}_{i}+{n}_{j}+{n}_{k}\right)$.
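As an illustration only, the six update rules listed above can be written as a single Python function. The function name and argument order are hypothetical and are not part of the NAG interface; note that the final term of the centroid, median and minimum variance rules is subtracted.

```python
def updated_distance(method, n_i, n_j, n_k, d_ij, d_ik, d_jk):
    """Distance from cluster i to the merged cluster jk.

    method follows the numbering of the list above (1 = single link, ...,
    6 = minimum variance). Hypothetical helper, not a NAG routine.
    """
    n_jk = n_j + n_k
    if method == 1:    # single link
        return min(d_ij, d_ik)
    if method == 2:    # complete link
        return max(d_ij, d_ik)
    if method == 3:    # group average
        return (n_j * d_ij + n_k * d_ik) / n_jk
    if method == 4:    # centroid
        return (n_j * d_ij + n_k * d_ik) / n_jk - n_j * n_k * d_jk / n_jk ** 2
    if method == 5:    # median
        return 0.5 * d_ij + 0.5 * d_ik - 0.25 * d_jk
    if method == 6:    # minimum variance
        return ((n_i + n_j) * d_ij + (n_i + n_k) * d_ik - n_i * d_jk) / (n_i + n_j + n_k)
    raise ValueError("method must be 1, 2, 3, 4, 5 or 6")
```

For three singleton clusters with $d_{ij}=2$, $d_{ik}=4$ and $d_{jk}=1$, the centroid and median rules both give $1+2-0.25=2.75$.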
If the clusters are numbered $1,2,\dots ,n$ then, for convenience, if clusters $j$ and $k$, $j<k$, merge then the new cluster will be referred to as cluster $j$. Information on the clustering history is given by the values of $j$, $k$ and ${d}_{jk}$ for each of the $n-1$ clustering steps. In order to produce a dendrogram, the ordering of the objects such that the clusters that merge are adjacent is required. This ordering is computed so that the first element is $1$. The associated distances with this ordering are also computed.
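The clustering history described above can be illustrated with a naive single-link sketch in Python. It is not the algorithm used by g03ecf: the function name is hypothetical, it takes a full symmetric matrix rather than packed storage, and it breaks ties in favour of the smallest pair $(j,k)$, which need not match the routine's behaviour for equal distances.

```python
def single_link_history(dist):
    """Naive single-link clustering of a symmetric n x n distance matrix.

    Returns (ilc, iuc, cd): for each of the n-1 steps, the 1-based numbers
    j < k of the merged clusters and the distance d_jk at which they merged.
    """
    n = len(dist)
    d = [row[:] for row in dist]      # working copy; the input is left untouched
    live = list(range(n))             # 0-based numbers of clusters still present
    ilc, iuc, cd = [], [], []
    for _ in range(n - 1):
        # Nearest pair of live clusters with j < k; ties go to the smallest (j, k).
        best, j, k = min((d[a][b], b, a) for a in live for b in live if b < a)
        ilc.append(j + 1)
        iuc.append(k + 1)
        cd.append(best)
        live.remove(k)                # the merged cluster keeps the lower number j
        for i in live:
            if i != j:
                # Single-link update: d_i.jk = min(d_ij, d_ik).
                d[i][j] = d[j][i] = min(d[i][j], d[i][k])
    return ilc, iuc, cd
```

For four points at positions 0, 1, 5 and 6 on a line, the history is: clusters 1 and 2 merge at distance 1, clusters 3 and 4 merge at distance 1, and clusters 1 and 3 merge at distance 4.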
4
References
Everitt B S (1974) Cluster Analysis Heinemann
Krzanowski W J (1990) Principles of Multivariate Analysis Oxford University Press
5
Arguments

1:
$\mathbf{method}$ – Integer
Input

On entry: indicates which clustering method is used.
 ${\mathbf{method}}=1$
 Single link.
 ${\mathbf{method}}=2$
 Complete link.
 ${\mathbf{method}}=3$
 Group average.
 ${\mathbf{method}}=4$
 Centroid.
 ${\mathbf{method}}=5$
 Median.
 ${\mathbf{method}}=6$
 Minimum variance.
Constraint:
${\mathbf{method}}=1$, $2$, $3$, $4$, $5$ or $6$.

2:
$\mathbf{n}$ – Integer
Input

On entry: $n$, the number of objects.
Constraint:
${\mathbf{n}}\ge 2$.

3:
$\mathbf{d}\left({\mathbf{n}}\times \left({\mathbf{n}}-1\right)/2\right)$ – Real (Kind=nag_wp) array
Input/Output

On entry: the strictly lower triangle of the distance matrix $D$. $D$ must be stored packed by rows, i.e., ${\mathbf{d}}\left(\left(i-1\right)\left(i-2\right)/2+j\right)$, $i>j$, must contain ${d}_{ij}$.
On exit: d is overwritten.
Constraint:
${\mathbf{d}}\left(\mathit{i}\right)\ge 0.0$, for $\mathit{i}=1,2,\dots ,n\left(n-1\right)/2$.
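The packed storage scheme for d, in which ${d}_{ij}$ with $i>j$ sits at position $(i-1)(i-2)/2+j$, can be checked with a short Python sketch. Both helper names are hypothetical; the index returned is 1-based to match the Fortran array, while the Python lists themselves are 0-based.

```python
def packed_index(i, j):
    """1-based position of d_ij (i > j >= 1) in the packed strictly lower triangle."""
    if not i > j >= 1:
        raise ValueError("require i > j >= 1")
    return (i - 1) * (i - 2) // 2 + j


def pack_lower(full):
    """Pack the strictly lower triangle of a square matrix by rows."""
    n = len(full)
    return [full[i][j] for i in range(1, n) for j in range(i)]
```

For $n=4$ the packed order is ${d}_{21},{d}_{31},{d}_{32},{d}_{41},{d}_{42},{d}_{43}$, so for example `packed_index(4, 2)` is 5.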

4:
$\mathbf{ilc}\left({\mathbf{n}}-1\right)$ – Integer array
Output

On exit: ${\mathbf{ilc}}\left(\mathit{l}\right)$ contains the number, $j$, of the cluster merged with cluster $k$ (see iuc), $j<k$, at step $\mathit{l}$, for $\mathit{l}=1,2,\dots ,n-1$.

5:
$\mathbf{iuc}\left({\mathbf{n}}-1\right)$ – Integer array
Output

On exit: ${\mathbf{iuc}}\left(\mathit{l}\right)$ contains the number, $k$, of the cluster merged with cluster $j$, $j<k$, at step $\mathit{l}$, for $\mathit{l}=1,2,\dots ,n-1$.

6:
$\mathbf{cd}\left({\mathbf{n}}-1\right)$ – Real (Kind=nag_wp) array
Output

On exit: ${\mathbf{cd}}\left(\mathit{l}\right)$ contains the distance ${d}_{jk}$, between clusters $j$ and $k$, $j<k$, merged at step $\mathit{l}$, for $\mathit{l}=1,2,\dots ,n-1$.

7:
$\mathbf{iord}\left({\mathbf{n}}\right)$ – Integer array
Output

On exit: the objects in dendrogram order.

8:
$\mathbf{dord}\left({\mathbf{n}}\right)$ – Real (Kind=nag_wp) array
Output

On exit: the clustering distances corresponding to the order in iord. ${\mathbf{dord}}\left(\mathit{l}\right)$ contains the distance at which cluster ${\mathbf{iord}}\left(\mathit{l}\right)$ and ${\mathbf{iord}}\left(\mathit{l}+1\right)$ merge, for $\mathit{l}=1,2,\dots ,n-1$. ${\mathbf{dord}}\left(n\right)$ contains the maximum distance.
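One way to obtain an ordering with the stated properties from the merge history is sketched below in Python. This is an assumption-laden illustration, not the routine's internal method: the function name is hypothetical, and g03ecf may return a different (equally valid) ordering. Each merge appends the objects of cluster $k$ to those of cluster $j$ and records the merge distance at the join.

```python
def dendrogram_order(n, ilc, iuc, cd):
    """Build (iord, dord) from a merge history given as 1-based cluster numbers."""
    members = {c: [c] for c in range(1, n + 1)}   # objects of each live cluster, in order
    gaps = {c: [] for c in range(1, n + 1)}       # merge distance between adjacent objects
    for j, k, dist in zip(ilc, iuc, cd):
        # Cluster k is appended to cluster j, which keeps the lower number.
        gaps[j] = gaps[j] + [dist] + gaps.pop(k)
        members[j] = members[j] + members.pop(k)
    iord = members[1]                             # cluster 1 survives every merge
    dord = gaps[1] + [max(cd)]                    # final entry holds the maximum distance
    return iord, dord
```

With the history (2,3) at distance 1, (1,4) at distance 2 and (1,2) at distance 3, this gives the order 1, 4, 2, 3 with distances 2, 3, 1 and a final entry of 3.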

9:
$\mathbf{iwk}\left(2\times {\mathbf{n}}\right)$ – Integer array
Workspace


10:
$\mathbf{ifail}$ – Integer
Input/Output

On entry: ifail must be set to $0$, $-1$ or $1$. If you are unfamiliar with this argument you should refer to Section 4 in the Introduction to the NAG Library FL Interface for details.
For environments where it might be inappropriate to halt program execution when an error is detected, the value $-1$ or $1$ is recommended. If the output of error messages is undesirable, then the value $1$ is recommended. Otherwise, if you are not familiar with this argument, the recommended value is $0$.
When the value $\mathbf{-1}$ or $\mathbf{1}$ is used it is essential to test the value of ifail on exit.
On exit: ${\mathbf{ifail}}={\mathbf{0}}$ unless the routine detects an error or a warning has been flagged (see Section 6).
6
Error Indicators and Warnings
If on entry ${\mathbf{ifail}}=0$ or $-1$, explanatory error messages are output on the current error message unit (as defined by x04aaf).
Errors or warnings detected by the routine:
 ${\mathbf{ifail}}=1$

On entry, ${\mathbf{method}}=\langle \mathit{\text{value}}\rangle$.
Constraint: ${\mathbf{method}}=1$, $2$, $3$, $4$, $5$ or $6$.
On entry, ${\mathbf{n}}=\langle \mathit{\text{value}}\rangle$.
Constraint: ${\mathbf{n}}\ge 2$.
 ${\mathbf{ifail}}=2$

On entry, at least one element of
d is negative.
 ${\mathbf{ifail}}=3$

A true dendrogram cannot be formed because the distances at which clusters have merged are not increasing for all steps, i.e., ${\mathbf{cd}}\left(l\right)<{\mathbf{cd}}\left(l-1\right)$ for some $l=2,3,\dots ,n-1$. This can occur for the median and centroid methods.
 ${\mathbf{ifail}}=-99$
An unexpected error has been triggered by this routine. Please contact NAG.
See Section 7 in the Introduction to the NAG Library FL Interface for further information.
 ${\mathbf{ifail}}=-399$
Your licence key may have expired or may not have been installed correctly.
See Section 8 in the Introduction to the NAG Library FL Interface for further information.
 ${\mathbf{ifail}}=-999$
Dynamic memory allocation failed.
See Section 9 in the Introduction to the NAG Library FL Interface for further information.
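The condition reported by ${\mathbf{ifail}}=3$ above amounts to a decrease somewhere in the sequence of merge distances. A minimal Python check on a returned cd array might look as follows; the function name is hypothetical and the routine performs its own test internally.

```python
def first_inversion(cd):
    """Return the first 1-based step l with cd(l) < cd(l-1), or None if there is none."""
    for l in range(1, len(cd)):
        if cd[l] < cd[l - 1]:
            return l + 1        # convert the 0-based position to a 1-based step number
    return None
```

As a worked case for the centroid method, take three singleton clusters with squared distances ${d}_{12}=1.0$ and ${d}_{13}={d}_{23}=1.1$. Clusters 1 and 2 merge at 1.0, and the updated distance to cluster 3 is $0.55+0.55-0.25=0.85$, which is smaller than the previous merge distance.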
7
Accuracy
For ${\mathbf{method}}\ge 3$, slight rounding errors may occur in the calculation of the updated distances. These would not normally affect the results significantly; however, there may be an effect if distances are (almost) equal.
If, at some stage, two distances ${d}_{ij}$ and ${d}_{kl}$ are equal, where either $i<k$, or $i=k$ and $j<l$, then clusters $k$ and $l$ will be merged rather than clusters $i$ and $j$. For single link clustering this choice will only affect the order of the objects in the dendrogram. However, for other methods the choice of $kl$ rather than $ij$ may affect the shape of the dendrogram. If either of the distances ${d}_{ij}$ and ${d}_{kl}$ is affected by rounding errors then their equality, and hence the dendrogram, may be affected.
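Because (almost) equal distances can change the result, it may be useful to look for near-ties in the input before clustering. The Python sketch below is one possible check and is not part of g03ecf; the function name and the relative tolerance are assumptions, and only values that are adjacent after sorting are compared.

```python
def near_ties(d, rel_tol=1e-12):
    """Return pairs of 0-based positions in d whose values agree to within rel_tol.

    Only values that are neighbours in sorted order are compared.
    """
    order = sorted(range(len(d)), key=lambda p: d[p])
    pairs = []
    for a, b in zip(order, order[1:]):
        scale = max(abs(d[a]), abs(d[b]), 1.0)
        if abs(d[b] - d[a]) <= rel_tol * scale:
            pairs.append((a, b))
    return pairs
```

For example, the values `0.1 + 0.2` and `0.3` differ in double precision, so an exact comparison treats them as distinct while this check reports them as a near-tie.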
8
Parallelism and Performance
g03ecf is threaded by NAG for parallel execution in multithreaded implementations of the NAG Library.
Please consult the X06 Chapter Introduction for information on how to control and interrogate the OpenMP environment used within this routine. Please also consult the Users' Note for your implementation for any additional implementation-specific information.
9
Further Comments
The dendrogram may be formed using g03ehf. Groupings based on the clusters formed at a given distance can be computed using g03ejf.
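To show what a grouping at a given distance means in terms of the outputs ilc, iuc and cd, the following Python sketch replays the merge history up to a threshold. It is an illustration under stated assumptions, not the g03ejf algorithm: the function name is hypothetical, the labelling convention of g03ejf may differ, and cd is assumed to be non-decreasing (that is, ${\mathbf{ifail}}=0$ on exit from g03ecf).

```python
def groups_at_distance(n, ilc, iuc, cd, threshold):
    """Label objects 1..n by cluster after applying all merges with distance <= threshold."""
    label = list(range(1, n + 1))        # label[o - 1] is the cluster of object o
    for j, k, dist in zip(ilc, iuc, cd):
        if dist > threshold:
            break                        # relies on cd being non-decreasing
        # Every object currently in cluster k moves to cluster j.
        label = [j if c == k else c for c in label]
    return label
```

With the history (2,3) at distance 1, (1,4) at distance 2 and (1,2) at distance 3, a threshold of 2.5 leaves two groups: objects 1 and 4 in cluster 1, and objects 2 and 3 in cluster 2.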
10
Example
Data consisting of three variables on five objects are read in. Euclidean squared distances based on two variables are computed using g03eaf, the objects are clustered using g03ecf and the dendrogram computed using g03ehf. The dendrogram is then printed.
10.1
Program Text
10.2
Program Data
10.3
Program Results