For a set of $n$ observations classified by two variables, with $r$ and $c$ levels respectively, a two-way table of frequencies with $r$ rows and $c$ columns can be computed.
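Two statistics that can be used to measure the association between the two classification variables are the Pearson $\chi^2$ statistic,
$$\sum_{i=1}^{r}\sum_{j=1}^{c}\frac{(n_{ij}-f_{ij})^2}{f_{ij}},$$
and the likelihood ratio test statistic,
$$2\sum_{i=1}^{r}\sum_{j=1}^{c} n_{ij}\log\left(\frac{n_{ij}}{f_{ij}}\right),$$
where $n_{ij}$ is the observed frequency in cell $(i,j)$ and $f_{ij}$ are the fitted values from the model that assumes the effects due to the classification variables are additive, i.e., there is no association. These values are the expected cell frequencies and are given by
$$f_{ij} = \frac{n_{i\cdot}\,n_{\cdot j}}{n},$$
where $n_{i\cdot}=\sum_{j} n_{ij}$ and $n_{\cdot j}=\sum_{i} n_{ij}$ are the row and column totals and $n$ is the total number of observations.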
Under the hypothesis of no association between the two classification variables, both these statistics have, approximately, a $\chi^2$-distribution with $(c-1)(r-1)$ degrees of freedom. This distribution is arrived at under the assumption that the expected cell frequencies, $f_{ij}$, are not too small. For a discussion of this point see Everitt (1977). He concludes by saying, ‘... in the majority of cases the chi-square criterion may be used for tables with expectations in excess of $0.5$ in the smallest cell’.
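The quantities described above can be sketched in a few lines of Python. This is an independent illustration, not the library routine: the function name `association_statistics` and the 2-by-3 table of counts are hypothetical, and the sketch assumes every row and column total is positive so that no expected frequency is zero.

```python
import math


def association_statistics(table):
    """Return (expected, pearson, g, df) for a two-way table of counts.

    Hypothetical helper for illustration; assumes all row and column
    totals are positive.
    """
    r, c = len(table), len(table[0])
    row_totals = [sum(row) for row in table]
    col_totals = [sum(table[i][j] for i in range(r)) for j in range(c)]
    n = sum(row_totals)

    # Expected frequencies under no association: f_ij = n_i. * n_.j / n
    expected = [[row_totals[i] * col_totals[j] / n for j in range(c)]
                for i in range(r)]

    # Pearson statistic: sum of (n_ij - f_ij)^2 / f_ij over all cells
    pearson = sum((table[i][j] - expected[i][j]) ** 2 / expected[i][j]
                  for i in range(r) for j in range(c))

    # Likelihood ratio statistic: 2 * sum of n_ij * log(n_ij / f_ij).
    # Empty cells are skipped because x*log(x) tends to 0 as x tends to 0.
    g = 2.0 * sum(table[i][j] * math.log(table[i][j] / expected[i][j])
                  for i in range(r) for j in range(c) if table[i][j] > 0)

    df = (r - 1) * (c - 1)
    return expected, pearson, g, df


# Hypothetical counts, chosen only so the arithmetic is easy to follow.
counts = [[10, 20, 30],
          [20, 20, 20]]
expected, pearson, g, df = association_statistics(counts)
print(expected, pearson, g, df)
```

For this table both rows have the same expected frequencies (15, 20 and 25), the Pearson statistic is 16/3 and there are 2 degrees of freedom. Converting either statistic to a significance level needs the upper tail of the $\chi^2$-distribution, which is left out of the sketch.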
In the case of the $2\times 2$ table, i.e., $c=2$ and $r=2$, the $\chi^2$ approximation can be improved by using Yates' continuity correction factor. This decreases the absolute value of $(n_{ij}-f_{ij})$ by $1/2$. For $2\times 2$ tables with a small value of $n$ the exact probabilities from Fisher's test are computed. These are based on the hypergeometric distribution and are computed using g01blf. A two tail probability is computed as $\min(1, 2p_u, 2p_l)$, where $p_u$ and $p_l$ are the upper and lower one-tail probabilities from the hypergeometric distribution.
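A minimal sketch of the two adjustments for a two-by-two table follows, assuming the usual textbook definitions. It does not call g01blf: the hypergeometric probabilities are evaluated directly with `math.comb`, and the one-tail probabilities are assumed to include the observed cell value in both tails. The function names and the table of counts are hypothetical, and flooring the corrected deviation at zero is an assumption of this sketch.

```python
from math import comb


def yates_chi_square(table):
    """Pearson statistic for a 2x2 table with Yates' continuity correction."""
    (a, b), (c, d) = table
    n = a + b + c + d
    row_totals = (a + b, c + d)
    col_totals = (a + c, b + d)
    stat = 0.0
    for i, row in enumerate(table):
        for j, n_ij in enumerate(row):
            f_ij = row_totals[i] * col_totals[j] / n
            # Reduce |n_ij - f_ij| by 1/2; floored at 0 (assumption).
            dev = max(abs(n_ij - f_ij) - 0.5, 0.0)
            stat += dev ** 2 / f_ij
    return stat


def fisher_two_tail(table):
    """Two-tail probability min(1, 2*p_u, 2*p_l) for a 2x2 table."""
    (a, b), (c, d) = table
    n = a + b + c + d
    row1, col1 = a + b, a + c
    # Range of the top-left cell with both margins held fixed.
    lo, hi = max(0, row1 + col1 - n), min(row1, col1)
    denom = comb(n, col1)
    pmf = {k: comb(row1, k) * comb(n - row1, col1 - k) / denom
           for k in range(lo, hi + 1)}
    p_lower = sum(p for k, p in pmf.items() if k <= a)  # P(X <= a)
    p_upper = sum(p for k, p in pmf.items() if k >= a)  # P(X >= a)
    return min(1.0, 2.0 * p_upper, 2.0 * p_lower)


# Hypothetical counts with every margin equal to 4.
table = [[3, 1],
         [1, 3]]
corrected = yates_chi_square(table)
two_tail = fisher_two_tail(table)
print(corrected, two_tail)
```

Here every expected frequency is 2, so the correction shrinks each deviation from 1 to 1/2 and the statistic falls from 2.0 to 0.5. The upper one-tail probability is 17/70, giving a two tail probability of 34/70.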
1: nrow – Integer, Input
On entry: $r$, the number of rows in the contingency table.
Constraint: nrow $\geq 2$.
2: ncol – Integer, Input
On entry: $c$, the number of columns in the contingency table.
Constraint: ncol $\geq 2$.
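3: nobs(ldnobs, ncol) – Integer array, Input
On entry: the contingency table. nobs(i, j) must contain $n_{ij}$, for $i = 1, 2, \dots, r$ and $j = 1, 2, \dots, c$.
Constraint: nobs(i, j) $\geq 0$, for $i = 1, 2, \dots, r$ and $j = 1, 2, \dots, c$.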
4: ldnobs – Integer, Input
On entry: the first dimension of the arrays nobs, expt and chist as declared in the (sub)program from which g11aaf is called.
Constraint: ldnobs $\geq$ nrow.
5: expt(ldnobs, ncol) – Real (Kind=nag_wp) array, Output
On exit: the table of expected values. expt(i, j) contains $f_{ij}$, for $i = 1, 2, \dots, r$ and $j = 1, 2, \dots, c$.
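6: chist(ldnobs, ncol) – Real (Kind=nag_wp) array, Output
On exit: the table of $\chi^2$ contributions. chist(i, j) contains $(n_{ij}-f_{ij})^2/f_{ij}$, for $i = 1, 2, \dots, r$ and $j = 1, 2, \dots, c$.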
7: prob – Real (Kind=nag_wp), Output
On exit: if $c=2$, $r=2$ and $n \leq 40$ then prob contains the two tail significance level for Fisher's exact test, otherwise prob contains the significance level from the Pearson $\chi^2$ statistic.
8: chi – Real (Kind=nag_wp), Output
On exit: the Pearson $\chi^2$ statistic.
9: g – Real (Kind=nag_wp), Output
On exit: the likelihood ratio test statistic.
10: df – Real (Kind=nag_wp), Output
On exit: the degrees of freedom for the statistics.
11: ifail – Integer, Input/Output
On entry: ifail must be set to $0$, $-1$ or $1$ to set behaviour on detection of an error; these values have no effect when no error is detected.
A value of $0$ causes the printing of an error message and program execution will be halted; otherwise program execution continues. A value of $-1$ means that an error message is printed while a value of $1$ means that it is not.
If halting is not appropriate, the value $-1$ or $1$ is recommended. If message printing is undesirable, then the value $1$ is recommended. Otherwise, the value $-1$ is recommended since useful values can be provided in some output arguments even when ifail $\neq 0$ on exit.
When the value $-1$ or $1$ is used it is essential to test the value of ifail on exit.
On exit: ifail $= 0$ unless the routine detects an error or a warning has been flagged (see Section 6).
If on entry ifail $= 0$ or $-1$, explanatory error messages are output on the current error message unit (as defined by x04aaf).
For the accuracy of the probabilities for Fisher's exact test see g01blf.
Background information to multithreading can be found in the Multithreading documentation.
The routine g01aff allows for the automatic amalgamation of rows and columns. In most circumstances this is not recommended; see Everitt (1977).
Multidimensional contingency tables can be analysed using log-linear models fitted by g02gbf.
The data below, taken from Everitt (1977), is from $141$ patients with brain tumours. The row classification variable is the site of the tumour: frontal lobes, temporal lobes and other cerebral areas. The column classification variable is the type of tumour: benign, malignant and other cerebral tumours.
The data is read in and the statistics computed and printed.