Integer, Intent (In)	::	lddat, sddat, ldx, sdx
Integer, Intent (Inout)	::	ifail
Integer, Intent (Out)	::	mx
Real (Kind=nag_wp), Intent (In)	::	dat(lddat,sddat)
Real (Kind=nag_wp), Intent (Inout)	::	x(ldx,sdx)
Type (c_ptr), Intent (In)	::	hform, hddesc
Type (c_ptr), Intent (Inout)	::	hxdesc

C Header Interface

#include <nag.h>

void g22ycf_ (void **hform, void **hddesc, const double dat[], const Integer *lddat, const Integer *sddat, void **hxdesc, double x[], const Integer *ldx, const Integer *sdx, Integer *mx, Integer *ifail)

The routine may be called by the names g22ycf or nagf_blgm_lm_design_matrix.

3 Description

g22ycf generates a design matrix from a data matrix and a model description. Design matrices encapsulate the observed values of the independent variables and the required model in a form that can be used by many of the model fitting routines available in the NAG Library, for example those in Chapter G02.

3.1 Notation

Let $D$ denote a data matrix with $n$ observations on $m_{d}$ independent variables, denoted by $V_{j}$, for $j = 1, 2, \dots, m_{d}$. If $V_{j}$ is a categorical variable, let $L_{j}$ denote the number of levels associated with it. If $V_{j}$ is a binary, ordinal or continuous variable, let $L_{j} = 1$. Let $V_{j i}$ denote the $i$th value of $V_{j}$.

Let $M$ denote a model made up of one or more terms, denoted by $T_{i}$. Each term consists of either a main effect or an interaction and hence can be described using one or more variable names $V_{j}$ and the interaction operator ‘$.$’. The operator ‘$+$’ is used to denote the addition of a term to the model. Therefore, $M = T_{1} + T_{2} + T_{3} = V_{1} + V_{2} + V_{1} . V_{2}$ denotes a model with three terms, the first two terms being the main effects for variables $V_{1}$ and $V_{2}$ and the last term the interaction between them. For simplicity we reorder the terms of the model by the number of variables in them, so main effects come first, then two-way interactions, then three-way interactions, etc. By default it is assumed that the model $M$ contains a mean effect (or intercept term). If the mean effect is excluded, this is denoted by ‘$- 1$’, so $M = T_{1}$ is a model with one term and a mean effect and $M = T_{1} - 1$ is the same model with the mean effect dropped.

g22ycf generates an $n \times m_{x}$ design matrix, $X$, from $D$ and $M$.

3.2 Dummy Variables

When constructing a design matrix, we cannot work directly with categorical variables. Categorical variables must first be recoded into dummy variables. A categorical variable $V_{j}$ requires $L_{j}$ dummy variables. Let $D^{j}$ denote an $n \times L_{j}$ matrix of dummy variables for $V_{j}$ defined as

$$D_{l i}^{j} = \begin{cases} 1, & \text{if } V_{j i} = l, \\ 0, & \text{otherwise,} \end{cases}$$

where $D_{l}^{j}$ is the $l$th column of $D^{j}$ and $D_{l i}^{j}$ is the $i$th element of $D_{l}^{j}$. For a binary, ordinal or continuous variable, $D_{1 i}^{j} = V_{j i}$.
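The recoding can be illustrated with a short Python sketch. It is not part of the NAG Library; the function name `dummy_matrix` and the convention that categorical levels are numbered $1, 2, \dots, L_{j}$ are assumptions made for illustration.

```python
def dummy_matrix(values, n_levels):
    """Return the n x L_j matrix D^j (as a list of rows) for one variable.

    values   -- observed values V_{ji}; for a categorical variable these
                are assumed to be level numbers 1..n_levels
    n_levels -- L_j; following the notation above, 1 means the variable
                is binary, ordinal or continuous
    """
    if n_levels == 1:
        # Non-categorical: the single column is the variable itself.
        return [[v] for v in values]
    # Categorical: column l holds 1 where the observation is at level l.
    return [[1 if v == l else 0 for l in range(1, n_levels + 1)]
            for v in values]


# A three-level categorical variable observed four times.
print(dummy_matrix([1, 3, 2, 2], 3))
# [[1, 0, 0], [0, 0, 1], [0, 1, 0], [0, 1, 0]]
```

Each row contains exactly one 1 for a categorical variable, so the columns of $D^{j}$ always sum to a vector of ones.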

3.3 Full Design Matrix

Given a model, $M$, and the matrices of dummy variables, constructing the full design matrix $X_{F}$ is trivial. Each term is processed in order:

1. If term $i$ is a main effect, that is $T_{i} = V_{j}$ for some $j$, $D^{j}$ is copied into $X_{F}$.
2. If term $i$ is a two-way interaction, that is $T_{i} = V_{j} . V_{k}$ for some $j \neq k$, then
   (i) Loop over $l_{j} = 1, 2, \dots, L_{j}$.
   (ii) Loop over $l_{k} = 1, 2, \dots, L_{k}$.
   (iii) Add a column to $X_{F}$ corresponding to the element-wise product of $D_{l_{j}}^{j}$ and $D_{l_{k}}^{k}$.
3. Higher interaction terms are handled in a similar manner to the two-way interactions, by adding columns constructed by multiplying all combinations of the columns of the $D$ matrices for the variables involved. In all cases, the variables towards the right-hand side of a term are iterated over the quickest.
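The steps above can be sketched in Python as follows. This is an illustration of the construction only, not the library's implementation; the names `dummy_columns` and `full_design_matrix`, and the use of dictionaries keyed by variable name, are assumptions.

```python
from itertools import product


def dummy_columns(values, n_levels):
    """Columns D^j_1..D^j_{L_j} for one variable (L_j = 1 if not categorical)."""
    if n_levels == 1:
        return [list(values)]
    return [[1 if v == l else 0 for v in values]
            for l in range(1, n_levels + 1)]


def full_design_matrix(data, levels, terms):
    """Build X_F as a list of columns.

    data   -- dict: variable name -> list of n observed values
    levels -- dict: variable name -> L_j
    terms  -- list of tuples of variable names, e.g. [("V1",), ("V1", "V2")]
    """
    columns = []
    for term in terms:
        per_variable = [dummy_columns(data[v], levels[v]) for v in term]
        # itertools.product varies its right-most argument fastest, which
        # matches the iteration order described in the text.
        for combo in product(*per_variable):
            col = [1] * len(combo[0])
            for c in combo:
                col = [a * b for a, b in zip(col, c)]
            columns.append(col)
    return columns


data = {"V1": [1, 2, 1, 2], "V2": [1, 3, 2, 2]}
levels = {"V1": 2, "V2": 3}
for row in zip(*full_design_matrix(data, levels, [("V1",), ("V2",)])):
    print(row)
```

A main effect is simply the one-variable case of the same loop, so no special handling is needed for it.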

3.4 Contrasts

Using the full design matrix $X_{F}$ in an analysis can result in an overparameterized model. This is due to $X_{F}$ often not being of full rank as the sum of all the dummy variables for a particular variable is a vector of ones. This source of overparameterization can be alleviated by using a design matrix $X$ where (some) dummy variables are replaced by contrasts. For a categorical variable $V_{j}$ the contrasts are a set of $L_{j} - 1$ functionally independent linear combinations of the dummy variables.

Whilst the choice of contrasts used in term $T_{i}$ will affect the individual model coefficients (parameters), it has no effect on the overall contribution of $T_{i}$.

For a given variable $V_{j}$, the contrasts can be represented by an $L_{j} \times (L_{j} - 1)$ matrix, $C_{j}$. Each row of $C_{j}$ corresponds to a particular value of $V_{j}$ and the columns correspond to the values to use in the design matrix.
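Because each row of $C_{j}$ belongs to one level of $V_{j}$, coding a variable by contrasts amounts to a row lookup (equivalently, the product $D^{j} C_{j}$). The Python sketch below illustrates this; the function name `code_with_contrasts` is hypothetical, levels are assumed to be numbered from 1, and the contrast matrix is supplied as a list of rows.

```python
def code_with_contrasts(values, contrast_matrix):
    """Return the n x (L_j - 1) block of design-matrix columns for one variable.

    values          -- observed level numbers (1-based) of a categorical variable
    contrast_matrix -- C_j as a list of L_j rows, each of length L_j - 1
    """
    # Observation i takes the row of C_j belonging to its level.
    return [list(contrast_matrix[v - 1]) for v in values]


# Treatment contrasts relative to the first level for a three-level variable.
c = [[0, 0],
     [1, 0],
     [0, 1]]
print(code_with_contrasts([1, 3, 2, 2], c))
# [[0, 0], [0, 1], [1, 0], [1, 0]]
```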

Six types of contrast are available in g22ycf; two types of treatment contrasts, two types of sum contrasts, Helmert contrasts and polynomial contrasts. Unless specified otherwise, the contrasts used by g22ycf are treatment contrasts relative to the first level. See the description of the optional parameter Contrast in g22yaf for ways of changing the contrasts used.

3.4.1 Treatment Contrasts

Treatment contrasts are taken relative to either the first or last level of the variable. For example, if $L_{j} = 4$,

$$C_{j} = \left(\begin{array}{rrr} 0 & 0 & 0 \\ 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{array}\right)$$

would be the contrast matrix for $V_{j}$ using treatment contrasts relative to the first level. The contrast matrix obtained when using treatment contrasts relative to the last level is similar, but the row of zeros appears at the bottom and all other rows are shifted up one.

Strictly speaking, the term contrast implies that each column in the contrast matrix sums to zero. That is not the case for treatment contrasts; however, they are included as this coding is commonly used in practice.
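A treatment contrast matrix is an identity matrix with a row of zeros added for the reference level. The following Python sketch generates it for any number of levels; the function name and the `base` argument are illustrative assumptions and do not correspond to a NAG interface.

```python
def treatment_contrasts(n_levels, base="first"):
    """Return the L_j x (L_j - 1) treatment contrast matrix as a list of rows.

    base -- "first" or "last": the level the contrasts are taken relative to
    """
    k = n_levels - 1
    identity = [[1 if r == c else 0 for c in range(k)] for r in range(k)]
    zeros = [0] * k
    # The reference level is the row of zeros.
    return [zeros] + identity if base == "first" else identity + [zeros]


for row in treatment_contrasts(4):
    print(row)
```

The column sums of this matrix are all 1, not 0, which is why treatment contrasts are not contrasts in the strict sense.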

3.4.2 Sum Contrasts

Sum contrasts are similar to treatment contrasts and again can be taken relative to the first or last level of the variable. Unlike treatment contrasts, sum contrasts effectively constrain the coefficients related to the variable to sum to zero. For example, if $L_{j} = 4$,

$$C_{j} = \left(\begin{array}{rrr} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \\ - 1 & - 1 & - 1 \end{array}\right)$$

would be the contrast matrix for $V_{j}$ using sum contrasts relative to the last level. The contrast matrix obtained when using sum contrasts relative to the first level is similar, but the row of $- 1$s appears at the top and all other rows are shifted down one.
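The same pattern can be generated programmatically. In this Python sketch the function name `sum_contrasts` and its `base` argument are assumptions for illustration only.

```python
def sum_contrasts(n_levels, base="last"):
    """Return the L_j x (L_j - 1) sum contrast matrix as a list of rows.

    base -- "first" or "last": the level that receives the row of -1s
    """
    k = n_levels - 1
    identity = [[1 if r == c else 0 for c in range(k)] for r in range(k)]
    minus_ones = [-1] * k
    return identity + [minus_ones] if base == "last" else [minus_ones] + identity


c = sum_contrasts(4)
for row in c:
    print(row)

# Every column sums to zero, so these are contrasts in the strict sense.
print([sum(col) for col in zip(*c)])   # [0, 0, 0]
```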

3.4.3 Helmert Contrasts

With Helmert contrasts level $l$ of the variable is compared with the average effect of all previous levels. For example, if $L_{j} = 4$,

$$C_{j} = \left(\begin{array}{rrr} - 1 & - 1 & - 1 \\ 1 & - 1 & - 1 \\ 0 & 2 & - 1 \\ 0 & 0 & 3 \end{array}\right)$$

would be the contrast matrix for $V_{j}$ using Helmert contrasts.
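The pattern in the matrix above generalizes directly: column $c$ holds $-1$ for the first $c$ levels, $c$ for level $c + 1$ and $0$ afterwards. The Python sketch below reproduces it; the function name is hypothetical and the scaling is simply the one shown in the example.

```python
def helmert_contrasts(n_levels):
    """Return the L_j x (L_j - 1) Helmert contrast matrix as a list of rows."""
    rows = []
    for r in range(n_levels):              # r is the 0-based level index
        row = []
        for c in range(n_levels - 1):      # c is the 0-based column index
            if r <= c:
                row.append(-1)             # earlier levels being averaged
            elif r == c + 1:
                row.append(r)              # the level being compared
            else:
                row.append(0)              # later levels play no part
        rows.append(row)
    return rows


for row in helmert_contrasts(4):
    print(row)
```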

3.4.4 Polynomial Contrasts

With polynomial contrasts the entries in the columns of $C_{j}$ correspond to linear, quadratic, cubic, quartic, etc. terms in a hypothetical underlying numeric variable that takes equally spaced values at each level. For example, if $L_{j} = 4$,

$$C_{j} = \left(\begin{array}{rrr} - 0.67 & 0.50 & - 0.22 \\ - 0.22 & - 0.50 & 0.67 \\ 0.22 & - 0.50 & - 0.67 \\ 0.67 & 0.50 & 0.22 \end{array}\right)$$

would be the contrast matrix for $V_{j}$ using polynomial contrasts.
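The values in this example are consistent with orthonormal polynomials evaluated at the equally spaced points $1, 2, \dots, L_{j}$. The Python sketch below obtains them by Gram-Schmidt orthogonalization of the powers of those points. The choice of Gram-Schmidt and the function name are assumptions made for illustration; how g22ycf computes the contrasts internally is not stated here.

```python
import math


def polynomial_contrasts(n_levels):
    """Return an L_j x (L_j - 1) matrix of orthonormal polynomial contrasts."""
    x = [float(i) for i in range(1, n_levels + 1)]
    # Start from the normalized constant vector so later columns are
    # orthogonal to it (and therefore sum to zero).
    basis = [[1.0 / math.sqrt(n_levels)] * n_levels]
    for degree in range(1, n_levels):
        col = [xi ** degree for xi in x]
        # Remove the components along all lower-degree vectors.
        for q in basis:
            proj = sum(a * b for a, b in zip(col, q))
            col = [a - proj * b for a, b in zip(col, q)]
        norm = math.sqrt(sum(a * a for a in col))
        basis.append([a / norm for a in col])
    # Drop the constant vector and return the matrix as a list of rows.
    return [list(row) for row in zip(*basis[1:])]


for row in polynomial_contrasts(4):
    print([round(v, 2) for v in row])
```

Rounded to two decimal places, the output for four levels agrees with the matrix shown above.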

3.4.5 When Contrasts Can Be Used

Depending on the specifics of the model, $M$, it may not always be possible to replace the $L_{j}$ dummy variables with $L_{j} - 1$ contrasts for all variables in all terms and retain the same model. A simple example of this is a data matrix, $D$, with four observations and two variables which have two and three levels respectively. This data matrix might look something like:

$$D = \left(\begin{array}{rr} 1 & 1 \\ 2 & 3 \\ 1 & 2 \\ 2 & 2 \end{array}\right)$$

For the sake of argument, assume that our model contains the main effect for each variable, but does not contain a mean effect (or intercept term). So using the notation established earlier, $M = V_{1} + V_{2} - 1$. The full design matrix, $X_{F}$, for this data matrix and model would be

$$X_{F} = \left(\begin{array}{rrrrr} 1 & 0 & 1 & 0 & 0 \\ 0 & 1 & 0 & 0 & 1 \\ 1 & 0 & 0 & 1 & 0 \\ 0 & 1 & 0 & 1 & 0 \end{array}\right)$$

However, $X_{F}$ is not of full rank (and hence $M$ is overparameterized) because the sum of the first two columns is a vector of ones as is the sum of the last three columns.

In order to alleviate this we might try constructing $X_{C}$ where the dummy variables have been replaced by contrasts. Assuming treatment contrasts, relative to the first level, we would have

$$X_{C} = \left(\begin{array}{rrr} 0 & 0 & 0 \\ 1 & 0 & 1 \\ 0 & 1 & 0 \\ 1 & 1 & 0 \end{array}\right)$$

However, using $X_{C}$ makes an implicit assumption that the expected value of the dependent variable (the quantity being modelled) is zero when $V_{1} = 1$ and $V_{2} = 1$. This assumption was not made when we used $X_{F}$ and hence the two design matrices are not equivalent. One solution would be to use dummy variables for $V_{1}$ and contrasts for $V_{2}$, which would result in a design matrix, $X$,

$$X = \left(\begin{array}{rrrr} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 1 \\ 1 & 0 & 1 & 0 \\ 0 & 1 & 1 & 0 \end{array}\right)$$

Using $X$ would give an equivalent model to using $X_{F}$.
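One way to check these claims numerically is to compare ranks. In this example every column of $X_{C}$ and of $X$ is also a column of $X_{F}$, so a matrix spans the same space as $X_{F}$ exactly when it has the same rank. The Python sketch below uses exact rational arithmetic; the helper name `rank` is hypothetical and the code is independent of the NAG Library.

```python
from fractions import Fraction


def rank(rows):
    """Rank of a matrix (list of rows) by Gaussian elimination over the rationals."""
    m = [[Fraction(v) for v in row] for row in rows]
    r = 0                                   # number of pivots found so far
    for c in range(len(m[0])):
        pivot = next((i for i in range(r, len(m)) if m[i][c] != 0), None)
        if pivot is None:
            continue                        # no pivot in this column
        m[r], m[pivot] = m[pivot], m[r]
        for i in range(r + 1, len(m)):
            factor = m[i][c] / m[r][c]
            m[i] = [a - factor * b for a, b in zip(m[i], m[r])]
        r += 1
        if r == len(m):
            break
    return r


x_f = [[1, 0, 1, 0, 0], [0, 1, 0, 0, 1], [1, 0, 0, 1, 0], [0, 1, 0, 1, 0]]
x_c = [[0, 0, 0], [1, 0, 1], [0, 1, 0], [1, 1, 0]]
x = [[1, 0, 0, 0], [0, 1, 0, 1], [1, 0, 1, 0], [0, 1, 1, 0]]

print(rank(x_f), rank(x_c), rank(x))   # 4 3 4
```

$X_{F}$ has five columns but rank four, $X$ has four columns and rank four, and $X_{C}$ has only rank three, so it cannot represent the same model.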

The algorithm used by g22ycf to decide which variables, in which terms, can be coded as contrasts and which need to be coded as dummy variables is described below.

Suppose $V_{j}$ is any variable that appears in term $T_{i}$, and let $T_{i (j)}$ denote the term obtained by dropping $V_{j}$ from $T_{i}$. For example, if $T_{3} = V_{1} . V_{2} . V_{3}$, then $T_{3 (2)} = V_{1} . V_{3}$. In this context, the empty term is taken to be the mean effect (or intercept term). We say that $T_{i (j)}$ appears in $M$ if there exists a term $T_{k}$, with $k < i$, that contains all of the variables appearing in $T_{i (j)}$. In most cases $T_{k} = T_{i (j)}$, but this is not required. Note, as stated earlier, the terms in $M$ are ordered by the number of variables in them.

A variable, $V_{j}$, in term $T_{i}$ is coded by contrasts if $T_{i (j)}$ appears in $M$ and by dummy variables otherwise. It is, therefore, possible for variable $V_{j}$ to be coded by contrasts in some terms and dummy variables in others within the same $X$.

The above rule assumes the presence of a mean effect. If no such effect is present in the model, the main effect of the first categorical variable is coded by dummy variables to compensate. If no main effects appear in the model, the warning $ifail = 14$ is returned.
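The rule can be sketched in Python as below. This is one reading of the description above, not the library's code: the function name `choose_coding` is hypothetical, every variable is assumed to be categorical, and "the first categorical variable" is taken to mean the first main-effect term in processing order. The $ifail = 14$ warning is not modelled.

```python
def choose_coding(terms, has_mean=True):
    """Decide, for each variable in each term, 'contrasts' or 'dummies'.

    terms    -- list of tuples of variable names (all assumed categorical)
    has_mean -- whether the model contains a mean effect
    Returns a list of (term, {variable: coding}) pairs in processing order.
    """
    # Terms are processed ordered by the number of variables they contain;
    # sorted() is stable, so ties keep the order supplied.
    ordered = sorted((tuple(t) for t in terms), key=len)
    result = []
    compensated = has_mean
    for i, term in enumerate(ordered):
        coding = {}
        for v in term:
            reduced = set(term) - {v}              # the term T_{i(j)}
            if not reduced:
                appears = True                     # empty term = mean effect
            else:
                appears = any(reduced <= set(t) for t in ordered[:i])
            coding[v] = "contrasts" if appears else "dummies"
        if not compensated and len(term) == 1:
            # No mean effect: the first main effect falls back to dummies.
            coding[term[0]] = "dummies"
            compensated = True
        result.append((term, coding))
    return result


# M = V1 + V2 - 1, the example used earlier in this section.
print(choose_coding([("V1",), ("V2",)], has_mean=False))
```

For $M = V_{1} + V_{2} - 1$ this selects dummy variables for $V_{1}$ and contrasts for $V_{2}$, matching the matrix $X$ constructed in the example. For $M = V_{1} + V_{1} . V_{2}$ with a mean effect, $V_{1}$ is coded by contrasts in its main effect but by dummy variables inside the interaction, because no earlier term contains $V_{2}$.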

A longer description and informal proof that the resulting $X$ is a suitable design matrix for the model of interest can be found in chapter two of Chambers and Hastie (1992).

3.5 Mean Effect

The mean effect (or intercept term) is included in a design matrix by adding a column of ones as the first column of $X$. However, many model fitting routines in the NAG Library handle the mean effect as a special case and do not require it to be explicitly added to the design matrix. Therefore, by default, g22ycf does not explicitly add the mean effect to the design matrix. This behaviour can be changed via the optional parameter Explicit Mean in g22yaf.
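As a minimal illustration of what an explicit mean effect looks like, the Python sketch below prepends a column of ones to a design matrix stored as a list of rows. The helper name `add_explicit_mean` is hypothetical; in practice the Explicit Mean optional parameter would be used rather than modifying the matrix by hand.

```python
def add_explicit_mean(x_rows):
    """Return a copy of the design matrix with a leading column of ones."""
    return [[1] + list(row) for row in x_rows]


x = [[0, 0, 0], [1, 0, 1], [0, 1, 0], [1, 1, 0]]
for row in add_explicit_mean(x):
    print(row)
```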

4 References

Chambers J M and Hastie T J (1992) Statistical Models in S Wadsworth and Brooks/Cole Computer Science Series

5 Arguments

1: $hform$ – Type (c_ptr) Input

On entry: a G22 handle to the internal data structure containing a description of the model $M$ as returned in hform by g22yaf.

2: $hddesc$ – Type (c_ptr) Input

On entry: a G22 handle to the internal data structure containing a description of the data matrix, $D$, as returned in hddesc by g22ybf.

3: $dat (lddat, sddat)$ – Real (Kind=nag_wp) array Input

On entry: the data matrix, $D$. By default $D_{i j}$, the $i$th value for the $j$th variable, for $i = 1, 2, \dots, n$ and $j = 1, 2, \dots, m_{d}$, should be supplied in $dat (i, j)$.

If the optional parameter Storage Order, described in g22ybf, is set to $VAROBS$, $D_{i j}$ should be supplied in $dat (j, i)$.

4: $lddat$ – Integer Input

On entry: the first dimension of the array dat as declared in the (sub)program from which g22ycf is called.

Constraints:

if the optional parameter Storage Order, described in g22ybf, is set to $VAROBS$ , $lddat \geq m_{d}$ ;
otherwise $lddat \geq n$ .

5: $sddat$ – Integer Input

On entry: the second dimension of the array dat as declared in the (sub)program from which g22ycf is called.

Constraints:

if the optional parameter Storage Order, described in g22ybf, is set to $VAROBS$ , $sddat \geq n$ ;
otherwise $sddat \geq m_{d}$ .

6: $hxdesc$ – Type (c_ptr) Input/Output

On entry: must be set to c_null_ptr. Alternatively, an existing G22 handle may be supplied, in which case this routine will destroy the supplied G22 handle as if g22zaf had been called.

On exit: holds a G22 handle to the internal data structure containing a description of the design matrix, $X$. You must not change the G22 handle other than through the routines in Chapter G22.

7: $x (ldx, sdx)$ – Real (Kind=nag_wp) array Output

On exit: the design matrix, $X$. By default $X_{i j}$, the $i$th value for the $j$th column, for $i = 1, 2, \dots, n$ and $j = 1, 2, \dots, m_{x}$, is returned in $x (i, j)$.

If the optional parameter Storage Order, described in g22yaf, is set to $VAROBS$, $X_{i j}$ is returned in $x (j, i)$.

If ldx or sdx are too small to hold x, the number of columns required to hold the design matrix is returned in mx.

Under some conditions it is possible to use the data matrix in place of the design matrix. Specifically, this is the case if $D$ has no categorical variables, $M$ has only main effects and either $M$ has no mean effect or the mean effect does not need to be explicitly added to the design matrix. If ldx or sdx are too small under such circumstances, $ifail = 71$ is returned and hxdesc is set up in such a way as to allow dat to be used as the design matrix.

8: $ldx$ – Integer Input

On entry: the first dimension of the array x as declared in the (sub)program from which g22ycf is called.

Constraints:

if the optional parameter Storage Order, described in g22yaf, is set to $VAROBS$ , $ldx \geq m_{x}$ ;
otherwise $ldx \geq n$ .

9: $sdx$ – Integer Input

On entry: the second dimension of the array x as declared in the (sub)program from which g22ycf is called.

Constraints:

if the optional parameter Storage Order, described in g22yaf, is set to $VAROBS$ , $sdx \geq n$ ;
otherwise $sdx \geq m_{x}$ .

10: $mx$ – Integer Output

On exit: the minimum number of columns required to hold the design matrix.

In most cases $mx = m_{x}$. The one exception is when $ifail = 71$, that is the size of x was too small but the data matrix given in dat can be used as the design matrix. In this case mx holds the number of columns that would be required if only the relevant parts of dat were copied into a new array.

11: $ifail$ – Integer Input/Output

On entry: ifail must be set to $0$, $- 1$ or $1$ to set behaviour on detection of an error; these values have no effect when no error is detected.

A value of $0$ causes the printing of an error message and program execution will be halted; otherwise program execution continues. A value of $- 1$ means that an error message is printed while a value of $1$ means that it is not.

If halting is not appropriate, the value $- 1$ or $1$ is recommended. If message printing is undesirable, then the value $1$ is recommended. Otherwise, the value $0$ is recommended. When the value $- 1$ or $1$ is used it is essential to test the value of ifail on exit.

On exit: $ifail = 0$ unless the routine detects an error or a warning has been flagged (see Section 6).

6 Error Indicators and Warnings

If on entry $ifail = 0$ or $- 1$, explanatory error messages are output on the current error message unit (as defined by x04aaf).

Errors or warnings detected by the routine:

$ifail = 11$: hform has not been initialized or is corrupt.

$ifail = 12$: hform is not a G22 handle as generated by g22yaf.

$ifail = 13$: A variable name used when creating hform is not present in hddesc.
Variable name: $⟨ value ⟩$ .

$ifail = 14$: The model contains categorical variables, but no intercept or main effects terms have been requested.
Please check the design matrix returned matches the model you require.

$ifail = 21$: hddesc has not been initialized or is corrupt.

$ifail = 22$: hddesc is not a G22 handle as generated by g22ybf.

$ifail = 31$: On entry, column $j$ of the data matrix, $D$ , is not consistent with information supplied in hddesc, $j = ⟨ value ⟩$ .

$ifail = 32$: Column $j$ of the data matrix, $D$ , required rounding more than expected when being treated as a categorical variable, $j = ⟨ value ⟩$ .
All output is returned using the rounded value(s).

$ifail = 41$: On entry, $n = ⟨ value ⟩$ and $lddat = ⟨ value ⟩$ .
Constraint: $lddat \geq n$ .

$ifail = 42$: On entry, $m_{d} = ⟨ value ⟩$ and $lddat = ⟨ value ⟩$ .
Constraint: $lddat \geq m_{d}$ .

$ifail = 51$: On entry, $m_{d} = ⟨ value ⟩$ and $sddat = ⟨ value ⟩$ .
Constraint: $sddat \geq m_{d}$ .

$ifail = 52$: On entry, $n = ⟨ value ⟩$ and $sddat = ⟨ value ⟩$ .
Constraint: $sddat \geq n$ .

$ifail = 61$: On entry, hxdesc is not c_null_ptr or a recognised G22 handle.

$ifail = 71$: On entry, the size of x is too small to hold the design matrix. dat can be used instead.

$ifail = 81$: On entry, $n = ⟨ value ⟩$ and $ldx = ⟨ value ⟩$ .
Constraint: $ldx \geq n$ .

$ifail = 82$: On entry, $m_{x} = ⟨ value ⟩$ and $ldx = ⟨ value ⟩$ .
Constraint: $ldx \geq m_{x}$ .

$ifail = 91$: On entry, $m_{x} = ⟨ value ⟩$ and $sdx = ⟨ value ⟩$ .
Constraint: $sdx \geq m_{x}$ .

$ifail = 92$: On entry, $n = ⟨ value ⟩$ and $sdx = ⟨ value ⟩$ .
Constraint: $sdx \geq n$ .

$ifail = - 99$: An unexpected error has been triggered by this routine. Please contact NAG.
See Section 7 in the Introduction to the NAG Library FL Interface for further information.

$ifail = - 399$: Your licence key may have expired or may not have been installed correctly.
See Section 8 in the Introduction to the NAG Library FL Interface for further information.

$ifail = - 999$: Dynamic memory allocation failed.
See Section 9 in the Introduction to the NAG Library FL Interface for further information.

7 Accuracy

Not applicable.

8 Parallelism and Performance

Background information to multithreading can be found in the Multithreading documentation.

g22ycf is threaded by NAG for parallel execution in multithreaded implementations of the NAG Library.

g22ycf makes calls to BLAS and/or LAPACK routines, which may be threaded within the vendor library used by this implementation. Consult the documentation for the vendor library for further information.

Please consult the X06 Chapter Introduction for information on how to control and interrogate the OpenMP environment used within this routine. Please also consult the Users' Note for your implementation for any additional implementation-specific information.

9 Further Comments

g22ydf can be used to obtain labels for the columns of the design matrix $X$.

Many of the analysis routines that require a design matrix to be supplied allow submodels to be defined through the use of a vector of ones or zeros indicating whether a column of $X$ should be included or excluded from the analyses (see for example isx in g02daf or g02gaf). This allows nested models to be fit without having to reconstruct the design matrix for each analysis. g22ydf offers a mechanism for constructing these vectors using submodels specified using g22yaf.
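The effect of such an inclusion vector can be pictured with the following Python sketch, which keeps only the flagged columns of a design matrix stored as a list of rows. The function name `select_columns` is hypothetical; the analysis routines apply the vector internally, so no explicit copy like this is needed when calling them.

```python
def select_columns(x_rows, isx):
    """Keep the columns of X whose entry in isx is 1; drop those flagged 0."""
    return [[v for v, keep in zip(row, isx) if keep == 1] for row in x_rows]


x = [[1, 0, 0, 0], [0, 1, 0, 1], [1, 0, 1, 0], [0, 1, 1, 0]]
isx = [1, 1, 0, 0]          # submodel using only the first two columns
for row in select_columns(x, isx):
    print(row)
```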

10 Example

This example creates and outputs two design matrices for a simple linear regression model. The first design matrix uses sum contrasts for all variables and the second uses a combination of polynomial and Helmert contrasts. Column labels are generated using g22ydf.

See also the examples for g22yaf, g22ybf and g22ydf.

11 Optional Parameters

As well as the optional parameters common to all G22 handles described in g22zmf and g22znf, a number of additional optional parameters can be specified for a G22 handle holding the description of a design matrix as returned by g22ycf in hxdesc.

The value of an optional parameter can be queried using g22znf.

The remainder of this section can be skipped if you wish to use the default values for all optional parameters.

The following is a list of the optional parameters available. A full description of each optional parameter is provided in Section 11.1.

Formula
Min Number of Columns
Number of Columns
Number of Observations
Storage Order

11.1 Description of the Optional Parameters

For each option, we give a summary line, a description of the optional parameter and details of constraints.

The summary line contains:

a parameter value, where the letters $a$ , $i$ and $r$ denote options that take character, integer and real values respectively;

Keywords and character values are case and white space insensitive.

Formula    $a$    Read Only

This optional parameter returns a verbose formula string describing the model, $M$, used to create the design matrix. This formula will only contain variable names, the operators ‘$+$’ and ‘$.$’ and any contrast identifiers present.

Min Number of Columns    $i$    Read Only

This optional parameter returns the minimum number of columns required to hold the design matrix, $X$. In most cases Min Number of Columns = Number of Columns. The one exception is when $ifail = 71$, that is the size of x was too small but the data matrix given in dat can be used as the design matrix. In this case, Number of Columns $= m_{x} = m_{d}$ and Min Number of Columns holds the number of columns that would be required if only the relevant parts of dat were copied into a new array.

Number of Columns    $i$    Read Only

This optional parameter returns $m_{x}$, the number of columns in the design matrix.

Number of Observations    $i$    Read Only

This optional parameter returns $n$, the number of observations in the design matrix.

Storage Order    $a$    Read Only

This optional parameter returns how the design matrix, $X$, is stored in x.

Storage Order = OBSVAR
    $X_{i j}$, the value for the $j$th variable of the $i$th observation of the design matrix, is stored in $x (i, j)$.

Storage Order = VAROBS
    $X_{i j}$, the value for the $j$th variable of the $i$th observation of the design matrix, is stored in $x (j, i)$.

It should be noted that Storage Order is not writeable. If you wish to change the storage order of the design matrix you need to change Storage Order in hform as described in Section 11 in g22yaf prior to calling g22ycf.

NAG Library Manual, Mark 28.6


NAG FL Interface g22ycf (lm_design_matrix)
