The routines in this chapter are for the analysis of discrete multivariate data. One suite of routines computes tables while other routines are for the analysis of two-way contingency tables, conditional logistic models and one-factor analysis of binary data.
Routines in Chapter G02 may be used to fit generalized linear models to discrete data including binary data and contingency tables.
2 Background to the Problems
2.1 Discrete Data
Discrete variables can be defined as variables which take a limited range of values. Discrete data can be usefully categorized into three types.
Binary data. The variables can take one of two values: for example, yes or no. The data may be grouped: for example, the number of yes responses in ten questions.
Categorical data. The variables can take one of two or more values or levels, but the values are not considered to have any ordering: for example, the values may be red, green, blue or brown.
Ordered categorical data. This is similar to categorical data but an ordering can be placed on the levels: for example, poor, average or good.
Data containing discrete variables can be analysed by computing summaries and measures of association and by fitting models.
2.2 Tabulation
The basic summary for multivariate discrete data is the multidimensional table in which each dimension is specified by a discrete variable. If the cells of the table are the number of observations with the corresponding values of the discrete variables then it is a contingency table. The discrete variables that can be used to classify a table are known as factors. For example, the factor sex would have the levels male and female. These can be coded as, for example, 1 and 2 respectively. Given several factors a multi-way table can be constructed such that each cell of the table represents one level from each factor. For example, a sample of 120 observations with the two factors sex and habitat, habitat having three levels (inner-city, suburban and rural), would give the contingency table
                    Habitat
Sex       Inner-city  Suburban  Rural
Male              32        27     15
Female            21        19      6
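As a minimal sketch of how such a table of counts is formed from coded factors, the following Python fragment counts the observations falling in each cell. The handful of observations and the integer coding (sex: 1 = male, 2 = female; habitat: 1 = inner-city, 2 = suburban, 3 = rural) are illustrative assumptions and are not the 120 observations summarized above; the function name is hypothetical and is not a routine from this chapter.

```python
from collections import Counter

# Illustrative (sex, habitat) pairs under the assumed coding
# sex: 1 = male, 2 = female; habitat: 1 = inner-city, 2 = suburban, 3 = rural.
observations = [(1, 1), (1, 2), (2, 1), (1, 1), (2, 3), (2, 2), (1, 3), (2, 1)]


def contingency_table(pairs, n_rows, n_cols):
    """Count the observations for every (row level, column level) combination."""
    counts = Counter(pairs)
    return [
        [counts[(i, j)] for j in range(1, n_cols + 1)]
        for i in range(1, n_rows + 1)
    ]


table = contingency_table(observations, n_rows=2, n_cols=3)
for row in table:
    print(row)
```

Each row of the result corresponds to one level of the first factor and each column to one level of the second factor.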
If the sample also contains continuous variables such as age, the average for the observations in each cell could be computed:
                    Habitat
Sex       Inner-city  Suburban  Rural
Male            25.5      30.3   35.6
Female          23.2      29.1   30.4
Other summary statistics could be computed for each cell in the same way.
Given a table, the totals or means for rows, columns etc. may be required. Thus the above contingency table with marginal totals is
                    Habitat
Sex       Inner-city  Suburban  Rural  Total
Male              32        27     15     74
Female            21        19      6     46
Total             53        46     21    120
Note that the set of marginal totals for the columns is itself a table. Other summary statistics, such as means or medians, could also be used to produce the marginal tables. Having computed the marginal tables, the cells of the original table may be expressed in terms of the margins; for example, the cells of the above table could be expressed as percentages of the column totals.
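The marginal totals and the column percentages for the sex by habitat table can be sketched in a few lines of Python. The counts are those given in the table above; the variable names are chosen for illustration only.

```python
# Cell counts from the sex (rows) by habitat (columns) table above.
counts = [
    [32, 27, 15],  # male
    [21, 19, 6],   # female
]

# Marginal totals: one per row, one per column, and the grand total.
row_totals = [sum(row) for row in counts]
col_totals = [sum(col) for col in zip(*counts)]
grand_total = sum(row_totals)

# Each cell expressed as a percentage of its column total.
col_percentages = [
    [100.0 * cell / col_totals[j] for j, cell in enumerate(row)]
    for row in counts
]

print(row_totals, col_totals, grand_total)
for row in col_percentages:
    print([round(value, 1) for value in row])
```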
2.3 Discrete Response Variables and Logistic Regression
A second important categorization in addition to that given in Section 2.1 is whether one of the discrete variables can be considered as a response variable or whether it is just the association between the discrete variables that is being considered.
If the response variable is binary, for example, success or failure, then a logistic or probit regression model can be used. The logistic regression model relates the logarithm of the odds of success to a linear model. So if $p$ is the probability of success, the model relates $\log(p/(1-p))$ to the explanatory variables. If the responses are independent then these models are special cases of the generalized linear model with binomial errors. However, there are cases when the binomial model is not suitable. For example, in a case-control study a number of cases (successes) and a number of controls (failures) are chosen for each of a number of case-control sets. In this situation a conditional logistic analysis is required.
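The link between a probability of success and a linear predictor in the logistic model can be illustrated as follows. This is a sketch only: the coefficients `beta0` and `beta1` and the explanatory value `x` are hypothetical numbers chosen for illustration, not estimates from any data.

```python
import math


def logit(p):
    """Log-odds of a probability p, with 0 < p < 1."""
    return math.log(p / (1.0 - p))


def inverse_logit(eta):
    """Probability of success corresponding to the linear predictor eta."""
    return 1.0 / (1.0 + math.exp(-eta))


# Hypothetical linear model with one explanatory variable.
beta0, beta1 = -1.0, 0.5
x = 2.0

eta = beta0 + beta1 * x   # linear predictor on the log-odds scale
p = inverse_logit(eta)    # implied probability of success
print(eta, p, logit(p))
```

Applying `logit` to the resulting probability recovers the linear predictor, which is the sense in which the model is linear on the log-odds scale.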
Handling a categorical or ordered categorical response variable is more complex; for a discussion of the appropriate models see McCullagh and Nelder (1983). These models generally use a Poisson distribution.
Note that if the response variable is a continuous variable and it is only the explanatory variables that are discrete then the regression models described in Chapter G02 should be used.
2.4 Contingency Tables
If there is no response variable then, to investigate the association between discrete variables, a contingency table can be computed and a suitable test performed on the table. The simplest case is the two-way table formed when considering two discrete variables. For a dataset of $n$ observations classified by the two variables with $r$ and $c$ levels respectively, a two-way table of frequencies or counts, $n_{ij}$, with $r$ rows and $c$ columns can be computed.
If $p_{ij}$ is the probability of an observation in cell $ij$ then the model which assumes no association between the two variables is the model
$$p_{ij} = p_{i.} p_{.j}$$
where $p_{i.}$ is the marginal probability for the row variable and $p_{.j}$ is the marginal probability for the column variable, the marginal probability being the probability of observing a particular value of the variable ignoring all other variables. The appropriateness of this model can be assessed by two commonly used statistics:
the Pearson $\chi^2$ statistic
$$\sum_{i=1}^{r} \sum_{j=1}^{c} \frac{(n_{ij} - f_{ij})^2}{f_{ij}}$$
and the likelihood ratio test statistic
$$2 \sum_{i=1}^{r} \sum_{j=1}^{c} n_{ij} \log(n_{ij}/f_{ij}).$$
The $f_{ij}$ are the fitted values from the model; these values are the expected cell frequencies and are given by
$$f_{ij} = n \hat{p}_{ij} = n \hat{p}_{i.} \hat{p}_{.j} = n (n_{i.}/n)(n_{.j}/n) = n_{i.} n_{.j}/n,$$
where $n_{i.}$ and $n_{.j}$ denote the totals for row $i$ and column $j$ respectively.
Under the hypothesis of no association between the two classification variables, both these statistics have, approximately, a $\chi^2$-distribution with $(r-1)(c-1)$ degrees of freedom. This distribution is arrived at under the assumption that the expected cell frequencies, $f_{ij}$, are not too small.
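The two statistics can be computed directly from a table of counts. The following is a minimal sketch in Python, applied to the sex by habitat counts of Section 2.2; the function name is hypothetical and the code is not the implementation used by any routine in this chapter. Cells with a zero count are taken to contribute nothing to the likelihood ratio statistic, which is the usual convention.

```python
import math


def independence_statistics(counts):
    """Return (Pearson statistic, likelihood ratio statistic, degrees of freedom)
    for the no-association model in a two-way table of counts."""
    n = sum(sum(row) for row in counts)
    row_totals = [sum(row) for row in counts]
    col_totals = [sum(col) for col in zip(*counts)]

    pearson = 0.0
    likelihood_ratio = 0.0
    for i, row in enumerate(counts):
        for j, n_ij in enumerate(row):
            # Expected cell frequency under no association.
            f_ij = row_totals[i] * col_totals[j] / n
            pearson += (n_ij - f_ij) ** 2 / f_ij
            if n_ij > 0:  # a zero count contributes zero to this sum
                likelihood_ratio += 2.0 * n_ij * math.log(n_ij / f_ij)

    dof = (len(counts) - 1) * (len(counts[0]) - 1)
    return pearson, likelihood_ratio, dof


counts = [[32, 27, 15], [21, 19, 6]]
pearson, likelihood_ratio, dof = independence_statistics(counts)
print(pearson, likelihood_ratio, dof)
```

Each statistic would then be referred to a chi-square distribution with `dof` degrees of freedom to obtain a significance level.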
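In the case of the $2 \times 2$ table, i.e., two rows ($r = 2$) and two columns ($c = 2$), the $\chi^2$ approximation can be improved by using Yates's continuity correction factor. This decreases the absolute value of each difference between observed and fitted cell frequency, $(n_{ij} - f_{ij})$, by $1/2$. For $2 \times 2$ tables with a small value of $n$, the total number of observations, the exact probabilities can be computed; this is known as Fisher's exact test.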
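Both refinements for the two-by-two table can be sketched as follows. Several points are assumptions made for illustration: the example table is hypothetical; the corrected difference is not allowed to go below zero, a common safeguard that is not stated in the text; and the two-sided probability sums the probabilities of all tables no more probable than the observed one, which is one of several conventions in use. The function names are not routines from this chapter.

```python
import math


def yates_chi_square(table):
    """Pearson statistic for a 2x2 table with Yates's continuity correction."""
    (a, b), (c, d) = table
    n = a + b + c + d
    row_totals = (a + b, c + d)
    col_totals = (a + c, b + d)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, n_ij in enumerate(row):
            f_ij = row_totals[i] * col_totals[j] / n
            # Reduce |n_ij - f_ij| by 1/2 (clamped at zero: an assumed safeguard).
            diff = max(abs(n_ij - f_ij) - 0.5, 0.0)
            statistic += diff ** 2 / f_ij
    return statistic


def fisher_exact_two_sided(table):
    """Two-sided Fisher exact probability for a 2x2 table with fixed margins."""
    (a, b), (c, d) = table
    r1, r2, c1 = a + b, c + d, a + c
    n = r1 + r2

    def prob(x):
        # Hypergeometric probability that the top-left cell equals x.
        return math.comb(r1, x) * math.comb(r2, c1 - x) / math.comb(n, c1)

    p_observed = prob(a)
    low, high = max(0, c1 - r2), min(r1, c1)
    return sum(
        prob(x) for x in range(low, high + 1)
        if prob(x) <= p_observed * (1.0 + 1e-7)  # tolerance for rounding
    )


example = [[3, 1], [1, 3]]  # hypothetical small table
print(yates_chi_square(example), fisher_exact_two_sided(example))
```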
An alternative approach, which can easily be generalized to more than two variables, is to use log-linear models. A log-linear model for two variables can be written as
$$\log(p_{ij}) = \log(p_{i.}) + \log(p_{.j}).$$
A model like this can be fitted as a generalized linear model with Poisson error with the cell counts, $n_{ij}$, as the response variable.
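For the two-variable model with no association the fitted values have a closed form, so the additive structure on the log scale can be illustrated without an iterative fit. The sketch below uses the sex by habitat counts of Section 2.2 and builds the fitted cell frequencies from the sum of a row term and a column term on the log scale; a Poisson generalized linear model with row and column main effects would be expected to reproduce these values. The variable names are illustrative only.

```python
import math

counts = [[32, 27, 15], [21, 19, 6]]
n = sum(sum(row) for row in counts)
row_totals = [sum(row) for row in counts]
col_totals = [sum(col) for col in zip(*counts)]

# On the log scale the fitted value is a row term plus a column term:
# log f_ij = log(n_i.) + log(n_.j) - log(n).
log_fitted = [
    [math.log(r_i) + math.log(c_j) - math.log(n) for c_j in col_totals]
    for r_i in row_totals
]
fitted = [[math.exp(value) for value in row] for row in log_fitted]

for row in fitted:
    print([round(value, 3) for value in row])
```

The fitted values reproduce the observed row and column totals, as is the case for the margins included in a log-linear model.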
2.5 Latent Variable Models
Latent variable models play an important role in the analysis of multivariate data. They have arisen in response to practical needs in many sciences, especially in psychology, educational testing and other social sciences.
Large-scale statistical enquiries, such as social surveys, generate much more information than can be easily absorbed without condensation. Elementary statistical methods help to summarize the data by looking at individual variables or the relationship between a small number of variables. However, with many variables it may still be difficult to see any pattern of inter-relationships. Our ability to visualize relationships is limited to two or three dimensions, which puts us under strong pressure to reduce the dimensionality of the data while preserving as much of the structure as possible. The question is thus one of how to replace the many variables with which we start by a much smaller number, with as little loss of information as possible.
One approach to the problem is to set up a model in which the dependence between the observed variables is accounted for by one or more latent variables. Such a model links the large number of observable variables with a much smaller number of latent variables.
Factor analysis, as described in Chapter G03, is based on a linear model of this kind when the observed variables are continuous. Here we consider the case where the observed variables are binary (e.g., coded 0/1 or true/false) and where there is one latent variable. In educational testing this is known as latent trait analysis, but, more generally, as factor analysis of binary data.
A variety of methods and models have been proposed for this problem. The models used here are derived from the general approach of Bartholomew (1980) and Bartholomew (1984). You are referred to Bartholomew (1980) for further information on the models and to Bartholomew (1987) for details of the method and application.
3 Recommendations on Choice and Use of Available Routines
3.1 Tabulation
The following routines can be used to perform the tabulation of discrete data:
g11baf computes a multidimensional table from a set of discrete variables or classification factors. The cells of the table may be counts or a summary statistic (total, mean, variance, largest or smallest) computed for an associated continuous variable. Alternatively, g11baf will update an existing table with further data.
g11bbf computes a multidimensional table from a set of discrete variables or classification factors where the cells are the percentile or quantile for an associated variable. For example, g11bbf can be used to produce a table of medians.
g11bcf computes a marginal table from a table computed by g11baf or g11bbf using a summary statistic (total, mean, median, variance, largest or smallest).
3.2 Analysis of Contingency Tables
g11aaf computes the Pearson and likelihood ratio statistics for a two-way contingency table. For $2 \times 2$ tables Yates's correction factor is used and for small samples Fisher's exact test is used.
In addition, g02gcf can be used to fit a log-linear model to a contingency table.
3.3 Binary Data
The following routines can be used to analyse binary data:
g11saf fits a latent variable model to binary data. The frequency distribution of score patterns is required as input data. If your data is in the form of individual score patterns, then the service routine g11sbf may be used to calculate the frequency distribution.
g11caf estimates the parameters for a conditional logistic model.
In addition, g02gbf fits generalized linear models to binary data.
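The frequency distribution of score patterns mentioned above for g11saf can be illustrated with a short sketch. It shows only the idea of collapsing individual score patterns into distinct patterns with frequencies; it is not the interface or the implementation of g11sbf, and the responses below are hypothetical.

```python
from collections import Counter

# Hypothetical individual score patterns: one tuple of 0/1 item scores per subject.
scores = [
    (1, 0, 1),
    (1, 0, 1),
    (0, 0, 0),
    (1, 1, 1),
    (0, 0, 0),
    (1, 0, 1),
]

# Collapse to the distinct score patterns and the number of subjects with each.
pattern_counts = Counter(scores)
patterns = sorted(pattern_counts)
frequencies = [pattern_counts[pattern] for pattern in patterns]

for pattern, frequency in zip(patterns, frequencies):
    print(pattern, frequency)
```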