naginterfaces.library.blgm.lm_design_matrix

naginterfaces.library.blgm.lm_design_matrix(hform, hddesc, dat, hxdesc)

lm_design_matrix generates a design matrix from a data matrix and model description.

Note: this function uses optional algorithmic parameters, see also: optset(), optget().

For full information please refer to the NAG Library document for g22yc

https://support.nag.com/numeric/nl/nagdoc_30.3/flhtml/g22/g22ycf.html

Parameters

hformHandle

A G22 handle to the internal data structure containing a description of the model $M$ as returned in $hform$ by lm_formula().

hddescHandle

A G22 handle to the internal data structure containing a description of the data matrix, $D$ , as returned in $hddesc$ by lm_describe_data().

datfloat, array-like, shape $(:, :)$

The data matrix, $D$ . By default $D_{i j}$ , the $i$ th value for the $j$ th variable, for $j = 1, 2, \dots, m_{d}$ , for $i = 1, 2, \dots, n$ , should be supplied in $dat[i - 1, j - 1]$ .

If the option ‘Storage Order’, described in lm_describe_data(), is set to ‘VAROBS’, $D_{i j}$ should be supplied in $dat[j - 1, i - 1]$ .

hxdescHandle, modified in place

On entry: must be set to a null Handle. Alternatively, an existing G22 handle may be supplied, in which case this function will destroy the supplied G22 handle as if handle_free() had been called.

On exit: holds a G22 handle to the internal data structure containing a description of the design matrix, $X$ . You must not change the G22 handle other than through the functions in submodule blgm.

Returns

xfloat, ndarray, shape $(:, :)$

The design matrix, $X$ . By default $X_{i j}$ , the $i$ th value for the $j$ th column, for $j = 1, 2, \dots, m_{x}$ , for $i = 1, 2, \dots, n$ , is returned in $x [i - 1, j - 1]$ .

If the option ‘Storage Order’, described in lm_formula(), is set to ‘VAROBS’, $X_{i j}$ is returned in $x [j - 1, i - 1]$ .

Other Parameters

‘Formula’str

This option returns a verbose formula string describing the model, $M$ , used to create the design matrix. This formula will only contain variable names, the operators ‘ $+$ ’ and ‘ $.$ ’ and any contrast identifiers present.

‘Min Number of Columns’int

This option returns the minimum number of columns required to hold the design matrix, $X$ . In most cases ‘Min Number of Columns’ $=$ ‘Number of Columns’. The one exception is when $errno = 71$ , that is, the size of $x$ was too small but the data matrix given in $dat$ can be used as the design matrix. In this case, ‘Number of Columns’ $= m_{x} = m_{d}$ and ‘Min Number of Columns’ holds the number of columns that would be required if only the relevant parts of $dat$ were copied into a new array.

‘Number of Columns’int

This option returns $m_{x}$ , the number of columns in the design matrix.

‘Number of Observations’int

This option returns $n$ , the number of observations in the design matrix.

‘Storage Order’str

This option returns how the design matrix, $X$ , is stored in $x$ .

If ‘Storage Order’ $=$ ‘OBSVAR’, $X_{i j}$ , the value for the $j$ th variable of the $i$ th observation of the design matrix, is stored in $x[i - 1, j - 1]$ .

If ‘Storage Order’ $=$ ‘VAROBS’, $X_{i j}$ , the value for the $j$ th variable of the $i$ th observation of the design matrix, is stored in $x[j - 1, i - 1]$ .

Note that ‘Storage Order’ is not writeable. If you wish to change the storage order of the design matrix, you need to change ‘Storage Order’ in $hform$ , as described in Other Parameters for lm_formula, prior to calling lm_design_matrix.
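The two storage orders are transposes of one another. The following is a minimal NumPy sketch of that indexing relationship only; the array values are made up for illustration and no NAG function is called.

```python
import numpy as np

# Hypothetical 4 x 3 design matrix held observation-by-variable ('OBSVAR').
x_obsvar = np.array([[0., 0., 0.],
                     [1., 0., 1.],
                     [0., 1., 0.],
                     [1., 1., 0.]])

# The same matrix held variable-by-observation ('VAROBS') is its transpose.
x_varobs = np.ascontiguousarray(x_obsvar.T)

i, j = 2, 3  # 1-based observation and column indices, as in the text
print(x_obsvar[i - 1, j - 1], x_varobs[j - 1, i - 1])
```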

Raises

NagValueError

(errno $11$ )

$hform$ has not been initialized or is corrupt.

(errno $12$ )

$hform$ is not a G22 handle as generated by lm_formula().

(errno $13$ )

A variable name used when creating $hform$ is not present in $hddesc$ .

Variable name: $⟨value⟩$ .

(errno $21$ )

$hddesc$ has not been initialized or is corrupt.

(errno $22$ )

$hddesc$ is not a G22 handle as generated by lm_describe_data().

(errno $31$ )

On entry, column $j$ of the data matrix, $D$ , is not consistent with information supplied in $hddesc$ , $j = ⟨value⟩$ .

(errno $41$ )

On entry, $n = ⟨value⟩$ and $lddat = ⟨value⟩$ .

Constraint: $lddat \geq n$ .

(errno $42$ )

On entry, $m_{d} = ⟨value⟩$ and $lddat = ⟨value⟩$ .

Constraint: $lddat \geq m_{d}$ .

(errno $51$ )

On entry, $m_{d} = ⟨value⟩$ and $sddat = ⟨value⟩$ .

Constraint: $sddat \geq m_{d}$ .

(errno $52$ )

On entry, $n = ⟨value⟩$ and $sddat = ⟨value⟩$ .

Constraint: $sddat \geq n$ .

(errno $61$ )

On entry, $hxdesc$ is not a null Handle or a recognised G22 handle.

Warns

NagAlgorithmicWarning

(errno $14$ )

The model contains categorical variables, but no intercept or main effects terms have been requested.

Please check the design matrix returned matches the model you require.

(errno $32$ )

Column $j$ of the data matrix, $D$ , required rounding more than expected when being treated as a categorical variable, $j = ⟨value⟩$ .

Notes

lm_design_matrix generates a design matrix from a data matrix and a model description. Design matrices encapsulate the observed values of the independent variables and the required model in a form that can be used by many of the model fitting functions available in the NAG Library, for example those in submodule correg.

Notation

Let $D$ denote a data matrix with $n$ observations on $m_{d}$ independent variables, denoted by $V_{j}$ , for $j = 1, 2, \dots, m_{d}$ . If $V_{j}$ is a categorical variable, let $L_{j}$ denote the number of levels associated with it. If $V_{j}$ is a binary, ordinal or continuous variable, let $L_{j} = 1$ .

Let $V_{j i}$ denote the $i$ th value of $V_{j}$ .

Let $M$ denote a model made up of one or more terms, denoted by $T_{i}$ . Each term consists of either a main effect or an interaction and hence can be described using one or more variable names $V_{j}$ and the interaction operator ‘ $.$ ’. The operator ‘ $+$ ’ is used to denote the addition of a term to the model. Therefore, $M = T_{1} + T_{2} + T_{3} = V_{1} + V_{2} + V_{1} . V_{2}$ denotes a model with three terms, the first two terms being the main effects for variables $V_{1}$ and $V_{2}$ and the last term the interaction between them. For simplicity we reorder the terms of the model by the number of variables in them, so main effects come first, then two-way interactions, then three-way interactions, etc. By default it is assumed that the model $M$ contains a mean effect (or intercept term). If the mean effect is excluded, this is denoted by ‘ $- 1$ ’, so $M = T_{1}$ is a model with one term and a mean effect and $M = T_{1} - 1$ is the same model with the mean effect dropped.

lm_design_matrix generates an $n \times m_{x}$ design matrix, $X$ , from $D$ and $M$ .

Dummy Variables

When constructing a design matrix, we cannot work directly with categorical variables. Categorical variables must first be recoded into dummy variables. A categorical variable $V_{j}$ requires $L_{j}$ dummy variables. Let $D^{j}$ denote an $n \times L_{j}$ matrix of dummy variables for $V_{j}$ defined as

$$ D_{l i}^{j} = \begin{cases} 1, & \text{if } V_{j i} = l, \\ 0, & \text{otherwise,} \end{cases} $$

where $D_{l}^{j}$ is the $l$ th column of $D^{j}$ and $D_{l i}^{j}$ is the $i$ th element of $D_{l}^{j}$ .

For a binary, ordinal or continuous variable, $D_{1 i}^{j} = V_{j i}$ .
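As an illustration of the definition above, the following NumPy sketch recodes a categorical variable into its matrix of dummy variables. The helper name `dummy_matrix` is hypothetical and the levels are assumed to be coded as the integers $1, 2, \dots, L_{j}$ ; this is not how lm_design_matrix is implemented internally.

```python
import numpy as np

def dummy_matrix(v, n_levels):
    """Return the n x n_levels matrix whose (i, l-1) entry is 1 if v[i] == l, else 0.

    Hypothetical helper: assumes levels are coded 1, 2, ..., n_levels.
    """
    v = np.asarray(v)
    levels = np.arange(1, n_levels + 1)
    # Compare every observation against every level via broadcasting.
    return (v[:, None] == levels[None, :]).astype(float)

# A three-level variable observed four times.
D2 = dummy_matrix([1, 3, 2, 2], 3)
print(D2)
```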

Full Design Matrix

Given a model, $M$ , and the matrices of dummy variables, constructing the full design matrix $X_{F}$ is straightforward. Each term is processed in order:

If term $i$ is a main effect, that is $T_{i} = V_{j}$ for some $j$ , $D^{j}$ is copied into $X_{F}$ .
If term $i$ is a two-way interaction, that is $T_{i} = V_{j} . V_{k}$ , for some $j \neq k$ , then
1. Loop over $l_{j} = 1, 2, \dots, L_{j}$ .
2. Within that loop, loop over $l_{k} = 1, 2, \dots, L_{k}$ .
3. Add a column to $X_{F}$ corresponding to the element-wise product of $D_{l_{j}}^{j}$ and $D_{l_{k}}^{k}$ .

Higher interaction terms are handled in the same way as two-way interactions, by adding columns constructed from the element-wise products of all combinations of the columns of the matrices $D^{j}$ for the variables involved. In all cases, the variables towards the right-hand side of a term are iterated over the quickest.
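The construction above can be sketched in NumPy as follows. The helper names `dummy_matrix` and `term_columns` are hypothetical, levels are assumed to be coded $1, 2, \dots, L_{j}$ , and the sketch only mirrors the description in the text rather than the library's internals. `itertools.product` varies its rightmost argument fastest, which matches the column ordering described.

```python
from itertools import product

import numpy as np

def dummy_matrix(v, n_levels):
    """Hypothetical helper: 0/1 dummy matrix for a variable with levels 1..n_levels."""
    v = np.asarray(v)
    return (v[:, None] == np.arange(1, n_levels + 1)[None, :]).astype(float)

def term_columns(dummies):
    """Columns for one term, given the dummy matrices of its variables (left to right).

    Each column is the element-wise product of one column from every matrix;
    the rightmost variable is iterated over the quickest.
    """
    n = dummies[0].shape[0]
    cols = []
    for idx in product(*(range(d.shape[1]) for d in dummies)):
        col = np.ones(n)
        for d, l in zip(dummies, idx):
            col = col * d[:, l]
        cols.append(col)
    return np.column_stack(cols)

D1 = dummy_matrix([1, 2, 1, 2], 2)   # two-level variable V1
D2 = dummy_matrix([1, 3, 2, 2], 3)   # three-level variable V2

# Full design matrix for the two main effects V1 + V2.
X_F = np.hstack([term_columns([D1]), term_columns([D2])])
# Columns for the interaction term V1.V2 (2 * 3 = 6 columns).
X_int = term_columns([D1, D2])
print(X_F)
print(X_int)
```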

Contrasts

Using the full design matrix $X_{F}$ in an analysis can result in an overparameterized model. This is due to $X_{F}$ often not being of full rank as the sum of all the dummy variables for a particular variable is a vector of ones. This source of overparameterization can be alleviated by using a design matrix $X$ where (some) dummy variables are replaced by contrasts. For a categorical variable $V_{j}$ the contrasts are a set of $L_{j} - 1$ functionally independent linear combinations of the dummy variables.

Whilst the choice of contrasts used in term $T_{i}$ will affect the individual model coefficients (parameters), it has no effect on the overall contribution of $T_{i}$ .
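The claim that the choice of contrasts does not change the overall contribution of a term can be checked numerically. The sketch below is an illustration under stated assumptions: a single four-level variable, a model with a mean effect, made-up response values `y`, and contrast-coded columns formed as the product of the dummy matrix with a contrast matrix. It compares treatment and sum contrasts by fitting both with ordinary least squares.

```python
import numpy as np

# Hypothetical data: one four-level categorical variable and a made-up response.
v = np.array([1, 2, 3, 4, 2, 3, 1, 4])
y = np.array([1.0, 2.5, 3.0, 4.5, 2.0, 3.5, 1.5, 4.0])

D = (v[:, None] == np.arange(1, 5)[None, :]).astype(float)  # dummy variables

C_treat = np.delete(np.eye(4), 0, axis=1)          # treatment, relative to first level
C_sum = np.vstack([np.eye(3), -np.ones((1, 3))])   # sum, relative to last level

ones = np.ones((len(v), 1))                        # explicit mean effect
X_treat = np.hstack([ones, D @ C_treat])
X_sum = np.hstack([ones, D @ C_sum])

beta_treat = np.linalg.lstsq(X_treat, y, rcond=None)[0]
beta_sum = np.linalg.lstsq(X_sum, y, rcond=None)[0]

# The coefficients differ, but the fitted values agree.
fit_treat = X_treat @ beta_treat
fit_sum = X_sum @ beta_sum
print(np.allclose(fit_treat, fit_sum))
```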

For a given variable $V_{j}$ , the contrasts can be represented by an $L_{j} \times (L_{j} - 1)$ matrix, $C_{j}$ . Each row of $C_{j}$ corresponds to a particular value of $V_{j}$ and the columns hold the values to use in the design matrix.

Six types of contrast are available in lm_design_matrix; two types of treatment contrasts, two types of sum contrasts, Helmert contrasts and polynomial contrasts. Unless specified otherwise, the contrasts used by lm_design_matrix are treatment contrasts relative to the first level. See the description of the option ‘Contrast’ in lm_formula() for ways of changing the contrasts used.

Treatment Contrasts

Treatment contrasts are taken relative to either the first or last level of the variable. For example, if $L_{j} = 4$ ,

$$ C_{j} = \begin{pmatrix} 0 & 0 & 0 \\ 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{pmatrix} $$

would be the contrast matrix for $V_{j}$ using treatment contrasts relative to the first level. The contrast matrix obtained when using treatment contrasts relative to the last level is similar, but the row of zeros appears at the bottom and all other rows are shifted up one.

Strictly speaking, the term contrast implies that each column in the contrast matrix sums to zero. That is not the case for treatment contrasts; however, they are included as this coding is commonly used in practice.
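A treatment contrast matrix is an identity matrix with the column for the reference level removed. The sketch below uses a hypothetical helper, `treatment_contrasts`, to build both variants; it illustrates the matrices described above and is not part of the NAG interface.

```python
import numpy as np

def treatment_contrasts(n_levels, base="first"):
    """Hypothetical helper: L x (L-1) treatment contrast matrix.

    base="first" puts the row of zeros at the top, base="last" at the bottom.
    """
    drop = 0 if base == "first" else n_levels - 1
    # Deleting the reference level's column leaves a row of zeros for that level.
    return np.delete(np.eye(n_levels), drop, axis=1)

C_first = treatment_contrasts(4, base="first")
C_last = treatment_contrasts(4, base="last")
print(C_first)
print(C_last)
```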

Sum Contrasts

Sum contrasts are similar to treatment contrasts and again can be taken relative to the first or last level of the variable. Unlike treatment contrasts, sum contrasts effectively constrain the coefficients related to the variable to sum to zero. For example, if $L_{j} = 4$ ,

$$ C_{j} = \begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \\ -1 & -1 & -1 \end{pmatrix} $$

would be the contrast matrix for $V_{j}$ using sum contrasts relative to the last level. The contrast matrix obtained when using sum contrasts relative to the first level is similar, but the row of $- 1$ s appears at the top and all other rows are shifted down one.
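The sum contrast matrices can be sketched as an identity block stacked with a row of $- 1$ s. The helper name `sum_contrasts` is hypothetical. Every column sums to zero, which is the sense in which the coefficients are constrained.

```python
import numpy as np

def sum_contrasts(n_levels, base="last"):
    """Hypothetical helper: L x (L-1) sum contrast matrix.

    base="last" puts the row of -1s at the bottom, base="first" at the top.
    """
    eye = np.eye(n_levels - 1)
    minus = -np.ones((1, n_levels - 1))
    return np.vstack([eye, minus]) if base == "last" else np.vstack([minus, eye])

C_last = sum_contrasts(4, base="last")
C_first = sum_contrasts(4, base="first")
print(C_last)
print(C_last.sum(axis=0))  # each column sums to zero
```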

Helmert Contrasts

With Helmert contrasts level $l$ of the variable is compared with the average effect of all previous levels. For example, if $L_{j} = 4$ ,

$$ C_{j} = \begin{pmatrix} -1 & -1 & -1 \\ 1 & -1 & -1 \\ 0 & 2 & -1 \\ 0 & 0 & 3 \end{pmatrix} $$

would be the contrast matrix for $V_{j}$ using Helmert contrasts.
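The pattern in the Helmert matrix above generalizes directly: column $k$ has $- 1$ in its first $k$ rows and $k$ in row $k + 1$ . The following sketch uses a hypothetical helper, `helmert_contrasts`, to reproduce the $L_{j} = 4$ example.

```python
import numpy as np

def helmert_contrasts(n_levels):
    """Hypothetical helper: L x (L-1) Helmert contrast matrix."""
    c = np.zeros((n_levels, n_levels - 1))
    for k in range(1, n_levels):
        c[:k, k - 1] = -1.0      # all previous levels
        c[k, k - 1] = float(k)   # the level being compared
    return c

C_helmert = helmert_contrasts(4)
print(C_helmert)
```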

Polynomial Contrasts

With polynomial contrasts the entries in the columns of $C_{j}$ correspond in linear, quadratic, cubic, quartic, etc. terms to a hypothetical underlying numeric variable that takes equally spaced values at each level. For example, if $L_{j} = 4$ ,

$$ C_{j} = \begin{pmatrix} -0.67 & 0.50 & -0.22 \\ -0.22 & -0.50 & 0.67 \\ 0.22 & -0.50 & -0.67 \\ 0.67 & 0.50 & 0.22 \end{pmatrix} $$

would be the contrast matrix for $V_{j}$ using polynomial contrasts.
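One common way to obtain such a matrix, assumed here for illustration and not necessarily the method used by the library, is to orthonormalize the powers of equally spaced points with a QR factorization and discard the constant column. The helper name `polynomial_contrasts` and the sign convention (last row positive) are choices made for this sketch so that it reproduces the $L_{j} = 4$ example to two decimal places.

```python
import numpy as np

def polynomial_contrasts(n_levels):
    """Hypothetical helper: orthonormal polynomial contrasts for equally spaced levels."""
    x = np.arange(1, n_levels + 1, dtype=float)
    x -= x.mean()                                      # centre the level scores
    vander = np.vander(x, n_levels, increasing=True)   # columns 1, x, x^2, ...
    q, _ = np.linalg.qr(vander)                        # orthonormalize the columns
    c = q[:, 1:].copy()                                # drop the constant column
    c *= np.sign(c[-1, :])                             # fix signs: last row positive
    return c

C_poly = polynomial_contrasts(4)
print(np.round(C_poly, 2))
```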

When Contrasts Can Be Used

Depending on the specifics of the model, $M$ , it may not be possible to always replace the $L_{j}$ dummy variables with $L_{j} - 1$ contrasts for all variables in all terms and retain the same model. A simple example of this is a data matrix, $D$ , with four observations and two variables which have two and three levels respectively. This data matrix might look something like:

$$ D = \begin{pmatrix} 1 & 1 \\ 2 & 3 \\ 1 & 2 \\ 2 & 2 \end{pmatrix} $$

For the sake of argument, assume that our model contains the main effect for each variable, but does not contain a mean effect (or intercept term). So using the notation established earlier, $M = V_{1} + V_{2} - 1$ . The full design matrix, $X_{F}$ , for this data matrix and model would be

$$ X_{F} = \begin{pmatrix} 1 & 0 & 1 & 0 & 0 \\ 0 & 1 & 0 & 0 & 1 \\ 1 & 0 & 0 & 1 & 0 \\ 0 & 1 & 0 & 1 & 0 \end{pmatrix} $$

However, $X_{F}$ is not of full rank (and hence $M$ is overparameterized) because the sum of the first two columns is a vector of ones as is the sum of the last three columns.

In order to alleviate this we might try constructing $X_{C}$ where the dummy variables have been replaced by contrasts. Assuming treatment contrasts, relative to the first level, we would have

$$ X_{C} = \begin{pmatrix} 0 & 0 & 0 \\ 1 & 0 & 1 \\ 0 & 1 & 0 \\ 1 & 1 & 0 \end{pmatrix} $$

However, using $X_{C}$ makes an implicit assumption that the expected value of the dependent variable (the quantity being modelled) is zero when $V_{1} = 1$ and $V_{2} = 1$ . This assumption was not made when we used $X_{F}$ and hence the two design matrices are not equivalent. One solution would be to use dummy variables for $V_{1}$ and contrasts for $V_{2}$ , which would result in a design matrix, $X$ of

$$ X = \begin{pmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 1 \\ 1 & 0 & 1 & 0 \\ 0 & 1 & 1 & 0 \end{pmatrix} $$

Using $X$ would give an equivalent model to using $X_{F}$ .
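The statements about rank and equivalence in this example can be checked numerically. In the sketch below, two design matrices are treated as giving equivalent models when their columns span the same space; the helper `same_column_space` is a hypothetical name introduced for this check.

```python
import numpy as np

X_F = np.array([[1, 0, 1, 0, 0],
                [0, 1, 0, 0, 1],
                [1, 0, 0, 1, 0],
                [0, 1, 0, 1, 0]], dtype=float)
X_C = np.array([[0, 0, 0],
                [1, 0, 1],
                [0, 1, 0],
                [1, 1, 0]], dtype=float)
X = np.array([[1, 0, 0, 0],
              [0, 1, 0, 1],
              [1, 0, 1, 0],
              [0, 1, 1, 0]], dtype=float)

rank = np.linalg.matrix_rank

def same_column_space(a, b):
    """Hypothetical check: True if the columns of a and b span the same space."""
    return rank(a) == rank(b) == rank(np.hstack([a, b]))

print(X_F.shape[1], rank(X_F))     # five columns, but lower rank
print(same_column_space(X_F, X))   # X is equivalent to X_F
print(same_column_space(X_F, X_C)) # X_C is not
```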

The algorithm used by lm_design_matrix to decide which variables, in which terms, can be coded as contrasts and which need to be coded as dummy variables is described below.

Suppose $V_{j}$ is any variable that appears in term $T_{i}$ , let $T_{i (j)}$ denote the term obtained by dropping $V_{j}$ from $T_{i}$ . For example, if $T_{3} = V_{1} . V_{2} . V_{3}$ , $T_{3 (2)} = V_{1} . V_{3}$ . In this context, the empty term is taken to be the mean effect (or intercept term). We say that $T_{i (j)}$ appears in $M$ if there exists a term $T_{k}$ , $k < i$ , that contains all of the variables appearing in $T_{i (j)}$ . In most cases $T_{k} = T_{i (j)}$ , but this is not required. Note, as stated earlier, the terms in $M$ are ordered by the number of variables in them.

A variable, $V_{j}$ in term $T_{i}$ is coded by contrasts if $T_{i (j)}$ appears in $M$ and by dummy variables otherwise. It is, therefore, possible for variable $V_{j}$ to be coded by contrasts in some terms and dummy variables in others within the same $X$ .

The above rule assumes the presence of a mean effect. If no such effect is present in the model, the main effect of the first categorical variable is coded by dummy variables to compensate. If no main effects appear in the model, the warning (errno $14$ ) is issued.
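The coding rule can be sketched in plain Python. This is an illustration of the rule as stated, not the library's implementation: the function name `coding_plan` and the representation of terms as tuples of variable names are assumptions, every variable is assumed to be categorical, and the case in which no main effects are present (the errno $14$ warning) is not handled.

```python
def coding_plan(terms, has_mean=True):
    """Hypothetical sketch: decide 'contrasts' or 'dummy' for each variable in each term.

    terms is an iterable of tuples of variable names, e.g. ("V1", "V2") for V1.V2.
    Returns a dict keyed by (term, variable).
    """
    # Order terms by the number of variables in them (stable sort).
    ordered = sorted((tuple(t) for t in terms), key=len)
    plan = {}
    for i, term in enumerate(ordered):
        for var in term:
            rest = set(term) - {var}
            if not rest:
                # Dropping the only variable leaves the empty term (the mean
                # effect), which the rule assumes is present.
                appears = True
            else:
                # T_i(j) appears if an earlier term contains all of its variables.
                appears = any(rest <= set(prev) for prev in ordered[:i])
            plan[(term, var)] = "contrasts" if appears else "dummy"
    if not has_mean:
        # Compensate for the missing mean effect: the first main effect is
        # coded by dummy variables instead.
        for term in ordered:
            if len(term) == 1:
                plan[(term, term[0])] = "dummy"
                break
    return plan

# M = V1 + V2 - 1, as in the example above.
print(coding_plan([("V1",), ("V2",)], has_mean=False))
# M = V1 + V1.V2 with a mean effect.
print(coding_plan([("V1", "V2"), ("V1",)]))
```

For $M = V_{1} + V_{2} - 1$ the sketch codes $V_{1}$ by dummy variables and $V_{2}$ by contrasts, which corresponds to the matrix $X$ in the example above.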

A longer description and informal proof that the resulting $X$ is a suitable design matrix for the model of interest can be found in Chapter 2 of Chambers and Hastie (1992).

Mean Effect

The mean effect (or intercept term) is included in a design matrix by adding a column of ones as the first column of $X$ . However, many model fitting functions in the NAG Library handle the mean effect as a special case and do not require it to be explicitly added to the design matrix. Therefore, by default, lm_design_matrix does not explicitly add the mean effect to the design matrix. This behaviour can be changed via the option ‘Explicit Mean’ in lm_formula().
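If a fitting routine does require the mean effect to be present explicitly, the column of ones can also be added by hand. The following NumPy sketch shows that step on a made-up contrast-coded matrix; it is an alternative to, not a description of, the ‘Explicit Mean’ option.

```python
import numpy as np

# Hypothetical design matrix without an explicit mean effect.
X = np.array([[0., 0., 0.],
              [1., 0., 1.],
              [0., 1., 0.],
              [1., 1., 0.]])

# Prepend a column of ones as the first column.
X_with_mean = np.column_stack([np.ones(X.shape[0]), X])
print(X_with_mean)
```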

References: Chambers, J M and Hastie, T J, 1992, Statistical Models in S, Wadsworth and Brooks/Cole Computer Science Series
