NAG CL Interface
g22ycc (lm_design_matrix)
1
Purpose
g22ycc generates a design matrix from a data matrix and model description.
2
Specification
void g22ycc (void *hform,
             void *hddesc,
             const double dat[],
             Integer pddat,
             Integer sddat,
             void **hxdesc,
             double x[],
             Integer pdx,
             Integer sdx,
             Integer *mx,
             NagError *fail)
The function may be called by the names: g22ycc or nag_blgm_lm_design_matrix.
3
Description
g22ycc generates a design matrix from a data matrix and a model description. Design matrices encapsulate the observed values of the independent variables and the required model in a form that can be used by many of the model fitting functions available in the NAG Library, for example those in Chapter G02.
3.1
Notation
Let $D$ denote a data matrix with $n$ observations on $m_d$ independent variables, denoted by $V_j$, for $j = 1, 2, \ldots, m_d$. If $V_j$ is a categorical variable, let $L_j$ denote the number of levels associated with it. If $V_j$ is a binary, ordinal or continuous variable, let $L_j = 1$.
Let $v_{ij}$ denote the $i$th value of $V_j$.
Let $M$ denote a model made up of one or more terms, denoted by $T_t$. Each term consists of either a main effect or an interaction and hence can be described using one or more variable names and the interaction operator ‘$.$’. The operator ‘$+$’ is used to denote the addition of a term to the model. Therefore, $M = V_1 + V_2 + V_1.V_2$ denotes a model with three terms, the first two terms being the main effects for variables $V_1$ and $V_2$ and the last term the interaction between them. For simplicity we reorder the terms of the model by the number of variables in them, so main effects come first, then two-way interactions, then three-way interactions etc. By default it is assumed that the model contains a mean effect (or intercept term). If the mean effect is excluded, this will be denoted by ‘$-1$’, so $M = V_1$ is a model with one term and a mean effect and $M = V_1 - 1$ is the same model with the mean effect dropped.
g22ycc generates an $n$ by $m_x$ design matrix, $X$, from $D$ and $M$.
3.2
Dummy Variables
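When constructing a design matrix, we cannot work directly with categorical variables. Categorical variables must first be recoded into dummy variables. A categorical variable $V_j$ with $L_j$ levels requires $L_j$ dummy variables. Let $Z_j$ denote an $n$ by $L_j$ matrix of dummy variables for $V_j$ defined as
$$z_{jl}(i) = \begin{cases} 1 & \text{if } v_{ij} = l \\ 0 & \text{otherwise} \end{cases}$$
where $z_{jl}$ is the $l$th column of $Z_j$ and $z_{jl}(i)$ is the $i$th element of $z_{jl}$.
For a binary, ordinal or continuous variable, $Z_j = V_j$.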
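The recoding described above can be sketched in plain C as follows. This is an illustration only and does not call the NAG Library; the function name `make_dummies`, the integer coding of levels as 1, 2, ... and the column-major layout of the output are assumptions made for the example.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: fill z (n by nlevels, stored column by column)
   with the dummy variables for one categorical variable whose observed
   values are the integers 1..nlevels. */
void make_dummies(const int levels[], size_t n, int nlevels, double z[])
{
    size_t i;
    int l;
    for (l = 1; l <= nlevels; ++l) {
        for (i = 0; i < n; ++i) {
            assert(levels[i] >= 1 && levels[i] <= nlevels);
            /* element i of dummy column l is 1 when observation i is at level l */
            z[(size_t)(l - 1) * n + i] = (levels[i] == l) ? 1.0 : 0.0;
        }
    }
}
```

For each observation exactly one of the dummy columns is one, which is why the columns sum to a vector of ones.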
3.3
Full Design Matrix
Given a model, $M$, and the matrices of dummy variables, $Z_j$, constructing the full design matrix $X_f$ is trivial. Each term, $T_t$, is processed in order and
1. If term $T_t$ is a main effect, that is $T_t = V_j$ for some $j$, $Z_j$ is copied into $X_f$.
2. If term $T_t$ is a two-way interaction, that is $T_t = V_j.V_k$, for some $j$ and $k$, then
   (i) Loop over $l_j = 1, 2, \ldots, L_j$.
   (ii) Loop over $l_k = 1, 2, \ldots, L_k$.
   (iii) Add a column to $X_f$ corresponding to the element-wise product of the $l_j$th column of $Z_j$ and the $l_k$th column of $Z_k$.
3. Higher interaction terms are handled in a similar manner to the two-way interactions, by adding columns constructed from multiplying all combinations of the columns of the dummy-variable matrices that correspond to the variables involved. In all cases, the variables towards the right-hand side of a term are iterated over the quickest.
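The two-way interaction step above can be sketched in plain C as follows. This is a minimal illustration, not the library's implementation; the function name `interaction_columns` and the column-major storage of the dummy-variable matrices are assumptions made for the example.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: zj is n by lj and zk is n by lk (both stored column
   by column). Writes the lj*lk interaction columns into out (n by lj*lk)
   and returns the number of columns written. */
size_t interaction_columns(const double zj[], size_t lj,
                           const double zk[], size_t lk,
                           size_t n, double out[])
{
    size_t a, b, i, col = 0;
    assert(lj >= 1 && lk >= 1);
    for (a = 0; a < lj; ++a) {        /* left-hand variable: slowest */
        for (b = 0; b < lk; ++b) {    /* right-hand variable: quickest */
            for (i = 0; i < n; ++i)
                out[col * n + i] = zj[a * n + i] * zk[b * n + i];
            ++col;
        }
    }
    return col;
}
```

Because the right-hand variable is the inner loop, the columns for a fixed level of the left-hand variable are adjacent in the output.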
3.4
Contrasts
Using the full design matrix in an analysis can result in an overparameterized model. This is because the full design matrix, $X_f$, is often not of full rank, as the sum of all the dummy variables for a particular variable is a vector of ones. This source of overparameterization can be alleviated by using a design matrix where (some) dummy variables are replaced by contrasts. For a categorical variable $V_j$ with $L_j$ levels the contrasts are a set of $L_j - 1$ functionally independent linear combinations of the dummy variables.
Whilst the choice of contrasts used in term $T_t$ will affect the individual model coefficients (parameters), it has no effect on the overall contribution of $T_t$.
For a given variable $V_j$, the contrasts can be represented by an $L_j$ by $L_j - 1$ matrix, $C_j$. Each row of $C_j$ corresponds to a particular value of $V_j$ and the columns hold the values to use in the design matrix.
Six types of contrast are available in g22ycc: two types of treatment contrasts, two types of sum contrasts, Helmert contrasts and polynomial contrasts. Unless specified otherwise, the contrasts used by g22ycc are treatment contrasts relative to the first level. See the description of the optional parameter Contrast in g22yac for ways of changing the contrasts used.
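The role of a contrast matrix can be sketched in plain C: each observation selects the row of the matrix that matches its level, and that row supplies the values placed in the design matrix. The function name `apply_contrasts` and the storage layouts (contrast matrix row by row, output column by column) are assumptions made for this illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: c is an nlev by ncol contrast matrix stored row by
   row; levels holds the observed level (1..nlev) of each of the n
   observations; x (n by ncol, stored column by column) receives the
   contrast-coded columns. */
void apply_contrasts(const int levels[], size_t n,
                     const double c[], size_t nlev, size_t ncol,
                     double x[])
{
    size_t i, k;
    for (i = 0; i < n; ++i) {
        size_t row;
        assert(levels[i] >= 1 && (size_t)levels[i] <= nlev);
        row = (size_t)(levels[i] - 1);   /* row of c for this level */
        for (k = 0; k < ncol; ++k)
            x[k * n + i] = c[row * ncol + k];
    }
}
```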
3.4.1
Treatment Contrasts
Treatment contrasts are taken relative to either the first or last level of the variable. For example, if $L_j = 3$,
$$C_j = \begin{pmatrix} 0 & 0 \\ 1 & 0 \\ 0 & 1 \end{pmatrix}$$
would be the contrast matrix for $V_j$ using treatment contrasts relative to the first level. The contrast matrix obtained when using treatment contrasts relative to the last level is similar, but the row of zeros appears at the bottom and all other rows are shifted up one.
Strictly speaking, the term contrast implies that each column of the contrast matrix sums to zero. That is not the case for treatment contrasts; however, they are included as this coding is commonly used in practice.
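A treatment contrast matrix of the kind described above can be generated with a few lines of C. This is a sketch for illustration; the function name `treatment_contrasts` and the row-by-row storage of the result are assumptions.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: fill c (nlev by nlev-1, stored row by row) with
   treatment contrasts. If relative_to_first is nonzero the first row is
   the row of zeros, otherwise the last row is. */
void treatment_contrasts(size_t nlev, int relative_to_first, double c[])
{
    size_t r, k;
    size_t shift = relative_to_first ? 1 : 0;
    assert(nlev >= 2);
    for (r = 0; r < nlev; ++r)
        for (k = 0; k + 1 < nlev; ++k)
            c[r * (nlev - 1) + k] = (r == k + shift) ? 1.0 : 0.0;
}
```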
3.4.2
Sum Contrasts
Sum contrasts are similar to treatment contrasts and again can be taken relative to the first or last level of the variable. Unlike treatment contrasts, sum contrasts effectively constrain the coefficients related to the variable to sum to zero. For example, if $L_j = 3$,
$$C_j = \begin{pmatrix} 1 & 0 \\ 0 & 1 \\ -1 & -1 \end{pmatrix}$$
would be the contrast matrix for $V_j$ using sum contrasts relative to the last level. The contrast matrix obtained when using sum contrasts relative to the first level is similar, but the row of $-1$s appears at the top and all other rows are shifted down one.
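Sum contrasts can be generated in the same way as treatment contrasts, with the reference row filled with minus ones instead of zeros. The sketch below is for illustration only; the function name `sum_contrasts` and the row-by-row storage are assumptions.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: fill c (nlev by nlev-1, stored row by row) with sum
   contrasts. The reference row (first or last) holds -1 in every column,
   so each column of c sums to zero. */
void sum_contrasts(size_t nlev, int relative_to_first, double c[])
{
    size_t r, k;
    size_t shift = relative_to_first ? 1 : 0;
    size_t base = relative_to_first ? 0 : nlev - 1;
    assert(nlev >= 2);
    for (r = 0; r < nlev; ++r) {
        for (k = 0; k + 1 < nlev; ++k) {
            double v = (r == k + shift) ? 1.0 : 0.0;
            c[r * (nlev - 1) + k] = (r == base) ? -1.0 : v;
        }
    }
}
```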
3.4.3
Helmert Contrasts
With Helmert contrasts level $l$ of the variable is compared with the average effect of all previous levels. For example, if $L_j = 3$,
$$C_j = \begin{pmatrix} -1 & -1 \\ 1 & -1 \\ 0 & 2 \end{pmatrix}$$
would be the contrast matrix for $V_j$ using Helmert contrasts.
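A Helmert contrast matrix can be sketched in C as below. The unnormalized integer scaling used here is the conventional one from the S language and is an assumption; the text above does not state which scaling g22ycc applies. The function name `helmert_contrasts` and the row-by-row storage are likewise hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: fill c (nlev by nlev-1, stored row by row) with
   Helmert contrasts. Column k compares level k+2 with the k+1 levels
   before it: -1 for each earlier level, k+1 for the level itself and 0
   for later levels. */
void helmert_contrasts(size_t nlev, double c[])
{
    size_t r, k;
    assert(nlev >= 2);
    for (r = 0; r < nlev; ++r) {
        for (k = 0; k + 1 < nlev; ++k) {
            double v = 0.0;
            if (r <= k)
                v = -1.0;
            else if (r == k + 1)
                v = (double)(k + 1);
            c[r * (nlev - 1) + k] = v;
        }
    }
}
```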
3.4.4
Polynomial Contrasts
With polynomial contrasts the entries in the columns of $C_j$ correspond to linear, quadratic, cubic, quartic, etc. terms of a hypothetical underlying numeric variable that takes equally spaced values at each level. For example, if $L_j = 3$,
$$C_j = \begin{pmatrix} -1/\sqrt{2} & 1/\sqrt{6} \\ 0 & -2/\sqrt{6} \\ 1/\sqrt{2} & 1/\sqrt{6} \end{pmatrix}$$
would be the contrast matrix for $V_j$ using polynomial contrasts.
3.4.5
When Contrasts Can Be Used
Depending on the specifics of the model, $M$, it may not always be possible to replace the $L_j$ dummy variables with $L_j - 1$ contrasts for all variables in all terms and retain the same model. A simple example of this is a data matrix, $D$, with four observations and two variables, $V_1$ and $V_2$, which have two and three levels respectively.
For the sake of argument, assume that our model contains the main effect for each variable, but does not contain a mean effect (or intercept term). So using the notation established earlier, $M = V_1 + V_2 - 1$. The full design matrix, $X_f$, for this data matrix and model has five columns: the two dummy variables for $V_1$ followed by the three dummy variables for $V_2$.
However, $X_f$ is not of full rank (and hence is overparameterized) because the sum of the first two columns is a vector of ones, as is the sum of the last three columns.
In order to alleviate this we might try constructing $X_c$, where the dummy variables have been replaced by contrasts. Assuming treatment contrasts, relative to the first level, $X_c$ has three columns: one contrast for $V_1$ followed by two contrasts for $V_2$.
However, using $X_c$ makes an implicit assumption that the expected value of the dependent variable (the quantity being modelled) is zero when $V_1 = 1$ and $V_2 = 1$. This assumption was not made when we used $X_f$ and hence the two design matrices are not equivalent. One solution would be to use dummy variables for $V_1$ and contrasts for $V_2$, which would result in a design matrix, $X$, with four columns: the two dummy variables for $V_1$ followed by the two contrasts for $V_2$.
Using $X$ would give an equivalent model to using $X_f$.
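The argument above can be made concrete with a small worked example. The data values below are chosen here purely for illustration and are not taken from the library documentation; $X_f$ denotes the full design matrix (dummy variables throughout), $X_c$ the design matrix with treatment contrasts (relative to the first level) for both variables, and $X$ the design matrix with dummy variables for the first variable and contrasts for the second.

```latex
D = \begin{pmatrix} 1 & 1 \\ 2 & 3 \\ 1 & 2 \\ 2 & 1 \end{pmatrix}, \qquad
X_f = \begin{pmatrix}
  1 & 0 & 1 & 0 & 0 \\
  0 & 1 & 0 & 0 & 1 \\
  1 & 0 & 0 & 1 & 0 \\
  0 & 1 & 1 & 0 & 0
\end{pmatrix}, \qquad
X_c = \begin{pmatrix}
  0 & 0 & 0 \\
  1 & 0 & 1 \\
  0 & 1 & 0 \\
  1 & 0 & 0
\end{pmatrix}, \qquad
X = \begin{pmatrix}
  1 & 0 & 0 & 0 \\
  0 & 1 & 0 & 1 \\
  1 & 0 & 1 & 0 \\
  0 & 1 & 0 & 0
\end{pmatrix}
```

In this illustration the first row of $X_c$ is entirely zero, so any model fitted with $X_c$ predicts zero for the first observation; this is the implicit assumption referred to above. $X_f$ has five columns but rank four, $X$ has four linearly independent columns and so the same rank, whereas $X_c$ only has rank three.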
The algorithm used by g22ycc to decide which variables, in which terms, can be coded as contrasts and which need to be coded as dummy variables is described below.
Suppose $V_j$ is any variable that appears in term $T_t$, and let $T_{t(j)}$ denote the term obtained by dropping $V_j$ from $T_t$. For example, if $T_t = V_1.V_2.V_3$, $T_{t(2)} = V_1.V_3$. In this context, the empty term is taken to be the mean effect (or intercept term). We say that $T_{t(j)}$ appears in $M$ if there exists a term $T_s$, $s < t$, that contains all of the variables appearing in $T_{t(j)}$. In most cases $T_s = T_{t(j)}$, but this is not required. Note, as stated earlier, the terms in $M$ are ordered by the number of variables in them.
A variable, $V_j$, in term $T_t$ is coded by contrasts if $T_{t(j)}$ appears in $M$ and by dummy variables otherwise. It is therefore possible for variable $V_j$ to be coded by contrasts in some terms and dummy variables in others within the same $M$.
The above rule assumes the presence of a mean effect. If no such effect is present in the model, the main effect of the first categorical variable is coded by dummy variables to compensate. If no main effects appear in the model, the warning NW_POTENTIAL_PROBLEM is returned.
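The rule above can be sketched in plain C by representing each term as a bit mask of the variables it contains. This is an illustration under stated assumptions, not the library's code: the function name `coded_by_contrasts` and the bit-mask representation are hypothetical, the terms are assumed to be already ordered by the number of variables in them, and a mean effect is assumed to be present (the adjustment for models without a mean effect is not handled).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: terms[s] has bit j set when variable j (0-based) is
   in term s. Returns 1 if variable var in term t should be coded by
   contrasts and 0 if it should be coded by dummy variables. */
int coded_by_contrasts(const unsigned terms[], size_t t, unsigned var)
{
    unsigned reduced;
    size_t s;
    assert((terms[t] >> var) & 1u);        /* var must belong to term t */
    reduced = terms[t] & ~(1u << var);     /* term t with var dropped */
    if (reduced == 0u)
        return 1;                          /* empty term: the mean effect */
    for (s = 0; s < t; ++s)
        if ((terms[s] & reduced) == reduced)
            return 1;                      /* an earlier term contains it */
    return 0;
}
```

For a model with terms for the first variable, the second variable and their interaction, every variable is coded by contrasts in every term. If the main effect of the second variable is omitted, the first variable in the interaction term is coded by dummy variables, because the term left after dropping it (the second variable alone) has not appeared.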
A longer description and informal proof that the resulting $X$ is a suitable design matrix for the model of interest can be found in chapter two of Chambers and Hastie (1992).
3.5
Mean Effect
The mean effect (or intercept term) is included in a design matrix by adding a column of ones as the first column of $X$. However, many model fitting functions in the NAG Library handle the mean effect as a special case and do not require it to be explicitly added to the design matrix. Therefore, by default, g22ycc does not explicitly add the mean effect to the design matrix. This behaviour can be changed via the optional parameter Explicit Mean in g22yac.
4
References
Chambers J M and Hastie T J (1992) Statistical Models in S Wadsworth and Brooks/Cole Computer Science Series
5
Arguments
1: hform – void *  Input
On entry: a G22 handle to the internal data structure containing a description of the model, $M$, as returned in hform by g22yac.
2: hddesc – void *  Input
On entry: a G22 handle to the internal data structure containing a description of the data matrix, $D$, as returned in hddesc by g22ybc.
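3: dat[] – const double  Input
Note: the $(i,j)$th element of the matrix $D$ is stored in dat[(j-1)×pddat+i-1].
On entry: the data matrix, $D$. By default $D_{ij}$, the $i$th value for the $j$th variable, for $i = 1, 2, \ldots, n$ and $j = 1, 2, \ldots, m_d$, should be supplied in dat[(j-1)×pddat+i-1].
If the optional parameter Storage Order, described in g22ybc, is set to VAROBS, $D_{ij}$ should be supplied in dat[(i-1)×pddat+j-1].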
4: pddat – Integer  Input
On entry: the stride separating matrix row elements in the array dat.
Constraints:
- if the optional parameter Storage Order, described in g22ybc, is set to VAROBS, pddat ≥ $m_d$;
- otherwise pddat ≥ $n$.
5: sddat – Integer  Input
On entry: the secondary dimension of dat.
Constraints:
- if the optional parameter Storage Order, described in g22ybc, is set to VAROBS, sddat ≥ $n$;
- otherwise sddat ≥ $m_d$.
6: hxdesc – void **  Input/Output
On entry: must be set to NULL. Alternatively, an existing G22 handle may be supplied, in which case this function will destroy the supplied G22 handle as if g22zac had been called.
On exit: holds a G22 handle to the internal data structure containing a description of the design matrix, $X$. You must not change the G22 handle other than through the functions in Chapter G22.
7: x[] – double  Output
Note: the $(i,j)$th element of the matrix $X$ is stored in x[(j-1)×pdx+i-1].
On exit: the design matrix, $X$. By default $X_{ij}$, the $i$th value for the $j$th column, for $i = 1, 2, \ldots, n$ and $j = 1, 2, \ldots, m_x$, is returned in x[(j-1)×pdx+i-1].
If the optional parameter Storage Order, described in g22yac, is set to VAROBS, $X_{ij}$ is returned in x[(i-1)×pdx+j-1].
If pdx or sdx is too small to hold the design matrix, the number of columns required to hold it is returned in mx.
Under some conditions it is possible to use the data matrix in place of the design matrix. Specifically, this is the case if $D$ has no categorical variables, $M$ has only main effects, and $M$ either has no mean effect or the mean effect does not need to be explicitly added to the design matrix. If pdx or sdx is too small under such circumstances, NW_ALTERNATIVE is returned and hxdesc is set up in such a way as to allow dat to be used as the design matrix.
If pdx and sdx are both zero, x is not referenced and may be NULL.
8: pdx – Integer  Input
On entry: the stride separating matrix row elements in the array x.
Constraints:
- if the optional parameter Storage Order, described in g22yac, is set to VAROBS, pdx ≥ $m_x$;
- otherwise pdx ≥ $n$.
9: sdx – Integer  Input
On entry: the secondary dimension of x.
Constraints:
- if the optional parameter Storage Order, described in g22yac, is set to VAROBS, sdx ≥ $n$;
- otherwise sdx ≥ $m_x$.
10: mx – Integer *  Output
On exit: the minimum number of columns required to hold the design matrix.
In most cases this is the number of columns in the design matrix, $m_x$. The one exception is when fail.code = NW_ALTERNATIVE, that is, the size of x was too small but the data matrix given in dat can be used as the design matrix. In this case mx holds the number of columns that would be required if only the relevant parts of dat were copied into a new array.
11: fail – NagError *  Input/Output
The NAG error argument (see Section 7 in the Introduction to the NAG Library CL Interface).
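The two storage conventions used for dat and x can be sketched with a small C helper. This is an illustration only: the function name `element_index` is hypothetical, and the index formulas assume that the default layout stores element $(i,j)$ at offset $(j-1) \times pd + (i-1)$ and that the alternative layout stores it at $(i-1) \times pd + (j-1)$, where pd is the corresponding stride argument (pddat or pdx).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: offset of element (i, j), with 1-based observation
   index i and variable (or column) index j, in an array with stride pd.
   transposed == 0 : assumed default layout, observations vary fastest.
   transposed != 0 : assumed alternative layout, variables vary fastest. */
size_t element_index(size_t i, size_t j, size_t pd, int transposed)
{
    assert(i >= 1 && j >= 1);
    return transposed ? (i - 1) * pd + (j - 1)
                      : (j - 1) * pd + (i - 1);
}
```

Under these assumptions the stride must be at least the number of observations in the default layout and at least the number of variables (or columns) in the alternative layout, otherwise distinct elements would map to the same offset.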
6
Error Indicators and Warnings
- NE_ALLOC_FAIL
Dynamic memory allocation failed.
See Section 3.1.2 in the Introduction to the NAG Library CL Interface for further information.
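- NE_ARRAY_SIZE
On entry, pddat = ⟨value⟩ and $m_d$ = ⟨value⟩.
Constraint: pddat ≥ $m_d$.
On entry, pddat = ⟨value⟩ and $n$ = ⟨value⟩.
Constraint: pddat ≥ $n$.
On entry, sddat = ⟨value⟩ and $m_d$ = ⟨value⟩.
Constraint: sddat ≥ $m_d$.
On entry, sddat = ⟨value⟩ and $n$ = ⟨value⟩.
Constraint: sddat ≥ $n$.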
- NE_BAD_PARAM
On entry, argument ⟨value⟩ had an illegal value.
- NE_FIELD_UNKNOWN
A variable name used when creating hform is not present in hddesc.
Variable name: ⟨value⟩.
- NE_HANDLE
hddesc has not been initialized or is corrupt.
hddesc is not a G22 handle as generated by g22ybc.
hform has not been initialized or is corrupt.
hform is not a G22 handle as generated by g22yac.
On entry, hxdesc is not NULL or a recognised G22 handle.
- NE_INTERNAL_ERROR
An internal error has occurred in this function. Check the function call and any array sizes. If the call is correct then please contact NAG for assistance.
See Section 7.5 in the Introduction to the NAG Library CL Interface for further information.
- NE_NO_LICENCE
Your licence key may have expired or may not have been installed correctly.
See Section 8 in the Introduction to the NAG Library CL Interface for further information.
- NE_REAL_ARRAY
On entry, column $j$ of the data matrix, $D$, is not consistent with information supplied in hddesc, $j$ = ⟨value⟩.
- NW_ALTERNATIVE
On entry, the size of x is too small to hold the design matrix. dat can be used instead.
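- NW_ARRAY_SIZE
On entry, pdx = ⟨value⟩ and $m_x$ = ⟨value⟩.
Constraint: pdx ≥ $m_x$.
On entry, pdx = ⟨value⟩ and $n$ = ⟨value⟩.
Constraint: pdx ≥ $n$.
On entry, sdx = ⟨value⟩ and $m_x$ = ⟨value⟩.
Constraint: sdx ≥ $m_x$.
On entry, sdx = ⟨value⟩ and $n$ = ⟨value⟩.
Constraint: sdx ≥ $n$.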
- NW_POTENTIAL_PROBLEM
Column $j$ of the data matrix, $D$, required rounding more than expected when being treated as a categorical variable, $j$ = ⟨value⟩.
All output is returned using the rounded value(s).
The model contains categorical variables, but no intercept or main effects terms have been requested.
Please check that the design matrix returned matches the model you require.
7
Accuracy
Not applicable.
8
Parallelism and Performance
g22ycc is threaded by NAG for parallel execution in multithreaded implementations of the NAG Library.
g22ycc makes calls to BLAS and/or LAPACK routines, which may be threaded within the vendor library used by this implementation. Consult the documentation for the vendor library for further information.
Please consult the X06 Chapter Introduction for information on how to control and interrogate the OpenMP environment used within this function. Please also consult the Users' Note for your implementation for any additional implementation-specific information.
9
Further Comments
g22ydc can be used to obtain labels for the columns of the design matrix, $X$.
Many of the analysis functions that require a design matrix to be supplied allow submodels to be defined through the use of a vector of ones or zeros indicating whether a column of $X$ should be included in or excluded from the analyses (see for example sx in g02dac or g02gac). This allows nested models to be fitted without having to reconstruct the design matrix for each analysis. g22ydc offers a mechanism for constructing these vectors using submodels specified using g22yac.
10
Example
This example creates and outputs two design matrices for a simple linear regression model. The first design matrix uses sum contrasts for all variables and the second uses a combination of polynomial and Helmert contrasts. Column labels are generated using g22ydc.
See also the examples for g22yac, g22ybc and g22ydc.
10.1
Program Text
10.2
Program Data
10.3
Program Results
11
Optional Parameters
As well as the optional parameters common to all G22 handles described in g22zmc and g22znc, a number of additional optional parameters can be specified for a G22 handle holding the description of a design matrix as returned by g22ycc in hxdesc.
The value of an optional parameter can be queried using g22znc.
The remainder of this section can be skipped if you wish to use the default values for all optional parameters.
The following is a list of the optional parameters available. A full description of each optional parameter is provided in Section 11.1.
11.1
Description of the Optional Parameters
For each option, we give a summary line, a description of the optional parameter and details of constraints.
The summary line contains:
- a parameter value, where the letters $a$, $i$ and $r$ denote options that take character, integer and real values respectively.
Keywords and character values are case and white space insensitive.
This optional parameter returns a verbose formula string describing the model, $M$, used to create the design matrix. This formula will only contain variable names, the operators ‘$+$’ and ‘$.$’ and any contrast identifiers present.
This optional parameter returns the minimum number of columns required to hold the design matrix. In most cases this is the number of columns in the design matrix, $m_x$. The one exception is when g22ycc returned fail.code = NW_ALTERNATIVE, that is, the size of x was too small but the data matrix given in dat can be used as the design matrix. In this case, this optional parameter holds the number of columns that would be required if only the relevant parts of dat were copied into a new array.
This optional parameter returns $m_x$, the number of columns in the design matrix.
This optional parameter returns $n$, the number of observations in the design matrix.
This optional parameter returns how the design matrix, $X$, is stored in x.
If Storage Order = OBSVAR, $X_{ij}$, the value for the $j$th variable of the $i$th observation of the design matrix, is stored in x[(j-1)×pdx+i-1].
If Storage Order = VAROBS, $X_{ij}$, the value for the $j$th variable of the $i$th observation of the design matrix, is stored in x[(i-1)×pdx+j-1].
It should be noted that Storage Order is not writeable. If you wish to change the storage order of the design matrix you need to change Storage Order in hform, as described in Section 11 in g22yac, prior to calling g22ycc.