naginterfaces.library.correg.linregm_fit

naginterfaces.library.correg.linregm_fit(x, isx, y, mean='M', wt=None, tol=1e-06)

linregm_fit performs a general multiple linear regression when the independent variables may be linearly dependent. Parameter estimates, standard errors, residuals and influence statistics are computed. linregm_fit may be used to perform a weighted regression.

For full information please refer to the NAG Library document for g02da

https://support.nag.com/numeric/nl/nagdoc_30.3/flhtml/g02/g02daf.html

Parameters

x : float, array-like, shape $(n, m)$

$x[i - 1, j - 1]$ must contain the $i$th observation for the $j$th independent variable, for $j = 1, 2, \dots, m$, for $i = 1, 2, \dots, n$.

isx : int, array-like, shape $(m)$

Indicates which independent variables are to be included in the model.

$isx[j - 1] > 0$

The variable contained in the $j$th column of $x$ is included in the regression model.

y : float, array-like, shape $(n)$

$y$, the observations on the dependent variable.

mean : str, length 1, optional

Indicates if a mean term is to be included.

$mean = 'M'$

A mean term (intercept) will be included in the model.

$mean = 'Z'$

The model will pass through the origin (zero-point).

wt : None or float, array-like, shape $(n)$, optional

If provided, $wt$ must contain the weights to be used with the model.

If $wt[i - 1] = 0.0$, the $i$th observation is not included in the model, in which case the effective number of observations is the number of observations with nonzero weights.

The values of $res$ and $h$ will be set to zero for observations with zero weights.

If $wt$ is not provided, the effective number of observations is $n$.
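The effect of zero weights can be illustrated without calling the NAG routine. The following is a minimal NumPy sketch, assuming the usual weighted least squares formulation (rows scaled by the square root of the weight); the data and variable names are made up for illustration and `numpy.linalg.lstsq` stands in for linregm_fit.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 8
# Hypothetical data: a mean term plus two independent variables.
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
y = rng.normal(size=n)
wt = np.array([1.0, 2.0, 0.0, 1.0, 0.5, 0.0, 1.0, 3.0])

# Effective number of observations: the count of nonzero weights.
n_eff = int(np.count_nonzero(wt))

# Weighted fit using every row, each scaled by sqrt(weight).
sw = np.sqrt(wt)
beta_all, *_ = np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)

# The same fit after physically dropping the zero-weight rows.
keep = wt > 0
beta_kept, *_ = np.linalg.lstsq(
    X[keep] * sw[keep][:, None], y[keep] * sw[keep], rcond=None
)
```

Both fits give the same parameter estimates, which is why an observation with zero weight does not count towards the effective number of observations.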

tol : float, optional

The value of $tol$ is used to decide whether the independent variables are of full rank and, if not, what their rank is. The smaller the value of $tol$, the stricter the criterion for selecting the singular value decomposition. If $tol = 0.0$, the singular value decomposition will never be used; this may cause run time errors or inaccurate results if the independent variables are not of full rank.

Returns

rss : float

The residual sum of squares for the regression.

idf : int

The degrees of freedom associated with the residual sum of squares.

b : float, ndarray, shape $(ip)$

$b[i - 1]$, $i = 1, 2, \dots, ip$ contains the least squares estimates of the parameters of the regression model, $\hat{β}$.

If $mean = 'M'$, $b[0]$ will contain the estimate of the mean parameter and $b[i]$ will contain the coefficient of the variable contained in column $j$ of $x$, where $isx[j - 1]$ is the $i$th positive value in the array $isx$.

If $mean = 'Z'$, $b[i - 1]$ will contain the coefficient of the variable contained in column $j$ of $x$, where $isx[j - 1]$ is the $i$th positive value in the array $isx$.
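The mapping from entries of b to columns of x can be sketched in a few lines. The helper below, `coefficient_columns`, is hypothetical (it is not part of naginterfaces) and simply follows the rule described above, using 0-based column indices as in Python.

```python
def coefficient_columns(isx, mean="M"):
    """Label each entry of b with the term it estimates.

    Hypothetical helper: columns of x with isx > 0 are taken in order,
    preceded by the mean term when mean == 'M'.
    """
    selected = [j for j, flag in enumerate(isx) if flag > 0]
    labels = ["mean"] if mean == "M" else []
    labels += [f"x[:, {j}]" for j in selected]
    return labels


labels_with_mean = coefficient_columns([1, 0, 1], mean="M")
labels_no_mean = coefficient_columns([1, 0, 1], mean="Z")
```

With `isx = [1, 0, 1]` and a mean term, b would have three entries: the mean, then the coefficients for columns 0 and 2 of x.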

se : float, ndarray, shape $(ip)$

$se[i - 1]$, $i = 1, 2, \dots, ip$ contains the standard errors of the $ip$ parameter estimates given in $b$.

cov : float, ndarray, shape $(ip \times (ip + 1) / 2)$

The first $ip \times (ip + 1) / 2$ elements of $cov$ contain the upper triangular part of the variance-covariance matrix of the $ip$ parameter estimates given in $b$. They are stored packed by column, i.e., the covariance between the parameter estimate given in $b[i - 1]$ and the parameter estimate given in $b[j - 1]$, $j \geq i$, is stored in $cov[j \times (j - 1) / 2 + i - 1]$.
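The packed storage can be expanded into a full symmetric matrix with the index formula given above. This is a minimal sketch; `unpack_cov` is a hypothetical helper name and the packed values are invented for illustration.

```python
import numpy as np


def unpack_cov(cov, ip):
    """Expand a packed-by-column upper triangle into a full ip x ip matrix.

    Uses the documented rule: element (i, j), j >= i (1-based), is stored
    at cov[j * (j - 1) / 2 + i - 1].
    """
    full = np.empty((ip, ip))
    for j in range(1, ip + 1):
        for i in range(1, j + 1):
            value = cov[j * (j - 1) // 2 + i - 1]
            full[i - 1, j - 1] = value
            full[j - 1, i - 1] = value  # fill the symmetric counterpart
    return full


# Illustrative packed values for ip = 3: c11, c12, c22, c13, c23, c33.
packed = np.array([1.0, 0.2, 2.0, 0.3, 0.4, 3.0])
full = unpack_cov(packed, 3)
```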

res : float, ndarray, shape $(n)$

The (weighted) residuals, $r_{i}$, for $i = 1, 2, \dots, n$.

h : float, ndarray, shape $(n)$

The diagonal elements of $H$, $h_{i}$, for $i = 1, 2, \dots, n$.

q : float, ndarray, shape $(n, ip + 1)$

The results of the $QR$ decomposition:

the first column of $q$ contains $c$;

the upper triangular part of columns $2$ to $ip + 1$ contains the $R$ matrix;

the strictly lower triangular part of columns $2$ to $ip + 1$ contains details of the $Q$ matrix.

svd : bool

If a singular value decomposition has been performed then $svd$ will be $True$, otherwise $svd$ will be $False$.

irank : int

The rank of the independent variables.

If $svd = False$, $irank = ip$.

If $svd = True$, $irank$ is an estimate of the rank of the independent variables.

$irank$ is calculated as the number of singular values greater than $tol \times$ (largest singular value).

It is possible for the SVD to be carried out but $irank$ to be returned as $ip$.
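The rank rule can be written directly in NumPy. This is a small sketch of the stated criterion only; `estimate_rank` is a hypothetical helper and the singular values are invented.

```python
import numpy as np


def estimate_rank(singular_values, tol):
    """Count singular values greater than tol * (largest singular value)."""
    s = np.asarray(singular_values, dtype=float)
    return int(np.count_nonzero(s > tol * s.max()))


# The third value falls below 1e-6 * 5.0, so the estimated rank is 2.
rank_deficient = estimate_rank([5.0, 1.0, 1e-9], tol=1e-6)
# All values clear the threshold, so the rank equals the number of values.
rank_full = estimate_rank([5.0, 1.0, 0.5], tol=1e-6)
```

The second call shows how an SVD can be carried out and still report full rank.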

p : float, ndarray, shape $(2 \times ip + ip \times ip)$

Details of the $QR$ decomposition and SVD if used.

If $svd = False$, only the first $ip$ elements of $p$ are used; these will contain the zeta values for the $QR$ decomposition (see lapackeig.dgeqrf for details).

If $svd = True$, the first $ip$ elements of $p$ will contain the zeta values for the $QR$ decomposition (see lapackeig.dgeqrf for details) and the next $ip$ elements of $p$ contain singular values.

The following $ip$ by $ip$ elements contain the matrix $P^{*}$ stored by columns.
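The layout of p can be pulled apart as follows. This is a sketch under one assumption: "stored by columns" is read as Fortran (column-major) order, so the reshape uses `order="F"`. The helper `split_p` and the sample array are hypothetical.

```python
import numpy as np


def split_p(p, ip, svd):
    """Split p into zeta values, singular values and the matrix P*.

    Assumes P* is laid out in column-major order after the first
    2 * ip elements. Returns None for the SVD parts when svd is False.
    """
    p = np.asarray(p, dtype=float)
    zeta = p[:ip]
    if not svd:
        return zeta, None, None
    singular_values = p[ip:2 * ip]
    p_star = p[2 * ip:2 * ip + ip * ip].reshape((ip, ip), order="F")
    return zeta, singular_values, p_star


# Made-up contents for ip = 2: two zeta values, two singular values,
# then four elements of P* listed column by column.
sample = np.array([0.1, 0.2, 3.0, 1.0, 1.0, 2.0, 3.0, 4.0])
zeta, sing, p_star = split_p(sample, ip=2, svd=True)
```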

wk : float, ndarray, shape $(\max(2, 5 \times (ip - 1) + ip \times ip))$

If on exit $svd = True$, $wk$ contains information which is needed by linregm_fit_newvar(); otherwise $wk$ is used as workspace.

Raises

NagValueError

(errno $1$)

On entry, $ip = ⟨value⟩$.

Constraint: $ip \geq 1$.

(errno $1$)

On entry, $tol = ⟨value⟩$.

Constraint: $tol \geq 0.0$.

(errno $1$)

On entry, $ip = ⟨value⟩$ and $n = ⟨value⟩$.

Constraint: $ip \leq n$.

(errno $1$)

On entry, $m = ⟨value⟩$.

Constraint: $m \geq 1$.

(errno $1$)

On entry, $n = ⟨value⟩$.

Constraint: $n \geq 2$.

(errno $2$)

On entry, $weight = ⟨value⟩$.

Constraint: $weight = 'U'$ or $'W'$.

(errno $2$)

On entry, $mean = ⟨value⟩$.

Constraint: $mean = 'M'$ or $'Z'$.

(errno $3$)

On entry, $wt[⟨value⟩] < 0.0$.

Constraint: $wt[i - 1] \geq 0.0$, for $i = 1, 2, \dots, n$.

(errno $4$)

On entry, $isx[⟨value⟩] < 0$.

Constraint: $isx[i - 1] \geq 0.0$, for $i = 1, 2, \dots, m$.

(errno $4$)

On entry, $ip = ⟨value⟩$.

Constraint: $ip$ must be compatible with the number of nonzero elements in $isx$.

(errno $6$)

SVD solution failed to converge.

Warns

NagAlgorithmicWarning

(errno $5$ ): The degrees of freedom for the residuals are zero, i.e., the designated number of arguments is equal to the effective number of observations. In this case the parameter estimates will be returned along with the diagonal elements of $H$ , but neither standard errors nor the variance-covariance matrix will be calculated.

Notes

In the NAG Library the traditional C interface for this routine uses a different algorithmic base. Please contact NAG if you have any questions about compatibility.

The general linear regression model is defined by

$$y = X β + ϵ,$$

where

$y$ is a vector of $n$ observations on the dependent variable,

$X$ is an $n \times p$ matrix of the independent variables of column rank $k$,

$β$ is a vector of length $p$ of unknown parameters, and

$ϵ$ is a vector of length $n$ of unknown random errors such that $\mathrm{var}(ϵ) = V σ^{2}$, where $V$ is a known diagonal matrix.

If $V = I$, the identity matrix, then least squares estimation is used. If $V \neq I$, then for a given weight matrix $W \propto V^{-1}$, weighted least squares estimation is used.

The least squares estimates $\hat{β}$ of the parameters $β$ minimize ${(y - X β)}^{T} (y - X β)$ while the weighted least squares estimates minimize ${(y - X β)}^{T} W (y - X β)$.
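The weighted criterion can be checked numerically. The sketch below uses NumPy only, with randomly generated data that are purely illustrative; it compares the weighted normal equations with an ordinary least squares fit on rows scaled by the square root of the weights, which is the transformation the QR step applies in the weighted case.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 10, 3
X = rng.normal(size=(n, p))          # illustrative independent variables
y = rng.normal(size=n)               # illustrative dependent variable
w = rng.uniform(0.5, 2.0, size=n)    # positive weights, the diagonal of W
W = np.diag(w)


def weighted_objective(beta):
    """Evaluate (y - X beta)^T W (y - X beta)."""
    r = y - X @ beta
    return float(r @ W @ r)


# Minimizer from the weighted normal equations X^T W X beta = X^T W y.
beta_normal = np.linalg.solve(X.T @ W @ X, X.T @ W @ y)

# Equivalent ordinary least squares problem on W^{1/2} X and W^{1/2} y.
sw = np.sqrt(w)
beta_scaled, *_ = np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)
```

Forming the normal equations is shown only for comparison; a QR-based approach avoids squaring the condition number of the problem.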

linregm_fit finds a $QR$ decomposition of $X$ (or $W^{1/2} X$ in the weighted case), i.e.,

$$X = Q R^{*} \quad (\text{or } W^{1/2} X = Q R^{*}),$$

where $R^{*} = (\begin{matrix} R \\ 0 \end{matrix})$, $R$ is a $p \times p$ upper triangular matrix and $Q$ is an $n \times n$ orthogonal matrix. If $R$ is of full rank, then $\hat{β}$ is the solution to

$$R \hat{β} = c_{1},$$

where $c = Q^{T} y$ (or $Q^{T} W^{1/2} y$) and $c_{1}$ is the first $p$ elements of $c$. If $R$ is not of full rank, a solution is obtained by means of a singular value decomposition (SVD) of $R$,

$$R = Q_{*} (\begin{matrix} D & 0 \\ 0 & 0 \end{matrix}) P^{T},$$

where $D$ is a $k \times k$ diagonal matrix with nonzero diagonal elements, $k$ being the rank of $R$, and $Q_{*}$ and $P$ are $p \times p$ orthogonal matrices. This gives the solution

$$\hat{β} = P_{1} D^{-1} Q_{*_{1}}^{T} c_{1},$$

$P_{1}$ being the first $k$ columns of $P$, i.e., $P = (\begin{matrix} P_{1} & P_{0} \end{matrix})$, and $Q_{*_{1}}$ being the first $k$ columns of $Q_{*}$.

Details of the SVD are made available in the form of the matrix $P^{*}$:

$$P^{*} = (\begin{matrix} D^{-1} P_{1}^{T} \\ P_{0}^{T} \end{matrix}).$$

This will be only one of the possible solutions. Other estimates may be obtained by applying constraints to the parameters. These solutions can be obtained by using linregm_constrain() after using linregm_fit. Only certain linear combinations of the parameters will have unique estimates; these are known as estimable functions.
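The QR-then-SVD procedure described in these Notes can be reproduced in NumPy for a deliberately rank-deficient design. This is an illustrative sketch, not the NAG implementation: it uses `numpy.linalg.qr` and `numpy.linalg.svd`, made-up data, and a tolerance chosen to match the default `tol`. The result is compared with the minimum-norm solution returned by `numpy.linalg.lstsq`.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 12
a = rng.normal(size=n)
b = rng.normal(size=n)
X = np.column_stack([a, b, a + b])   # third column is dependent: rank 2
y = rng.normal(size=n)
tol = 1e-6

# Reduced QR: X = Q1 R with R a p x p upper triangular matrix.
Q1, R = np.linalg.qr(X, mode="reduced")
c1 = Q1.T @ y                         # first p elements of c = Q^T y

# SVD of R: R = Qs diag(d) Pt, singular values in descending order.
Qs, d, Pt = np.linalg.svd(R)
k = int(np.count_nonzero(d > tol * d[0]))

# beta_hat = P1 D^{-1} Qs1^T c1, using the first k singular triplets.
beta_hat = Pt[:k].T @ ((Qs[:, :k].T @ c1) / d[:k])

# P* stacks D^{-1} P1^T on top of P0^T.
P_star = np.vstack([Pt[:k] / d[:k, None], Pt[k:]])

# Reference: minimum-norm least squares solution.
beta_ref = np.linalg.lstsq(X, y, rcond=tol)[0]
```

Here `beta_hat` is one particular solution (the minimum-norm one); any vector in the span of the rows of `P_star[k:]` can be added to it without changing the fitted values.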

The fit of the model can be examined by considering the residuals, $r_{i} = y_{i} - \hat{y}_{i}$, where $\hat{y} = X \hat{β}$ are the fitted values. The fitted values can be written as $H y$ for an $n \times n$ matrix $H$. The $i$th diagonal element of $H$, $h_{i}$, gives a measure of the influence of the $i$th values of the independent variables on the fitted regression model. The values $h_{i}$ are sometimes known as leverages. Both $r_{i}$ and $h_{i}$ are provided by linregm_fit.
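Residuals and leverages for an unweighted, full-rank model can be sketched with NumPy. The data are invented, and the identity $H = Q_1 Q_1^T$ (with $Q_1$ from a reduced QR factorization) is a standard result used here as an assumption about how such values may be computed; it is not a description of the NAG internals.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 10
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])  # mean + 2 variables
y = rng.normal(size=n)

Q1, R = np.linalg.qr(X, mode="reduced")
beta_hat = np.linalg.solve(R, Q1.T @ y)   # solve R beta = c1

fitted = X @ beta_hat
residuals = y - fitted

# Leverages: diagonal of H = Q1 Q1^T, i.e. squared row norms of Q1.
leverages = np.sum(Q1**2, axis=1)

# Explicit hat matrix, formed only to check the shortcut above.
H = X @ np.linalg.solve(X.T @ X, X.T)
```

For a full-rank model the leverages sum to the number of parameters, so values well above the average of p / n flag influential rows.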

The output of linregm_fit also includes $\hat{β}$, the residual sum of squares and associated degrees of freedom, $(n - k)$, the standard errors of the parameter estimates and the variance-covariance matrix of the parameter estimates.
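These summary quantities can be illustrated for an unweighted, full-rank model. The sketch below assumes the textbook formulas $\hat{σ}^{2} = \mathrm{rss} / (n - k)$ and covariance $\hat{σ}^{2} (X^{T} X)^{-1}$, which are not spelled out in this document; the data and coefficients are made up.

```python
import numpy as np

rng = np.random.default_rng(4)
n, p = 15, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
# Illustrative response with known coefficients plus small noise.
y = X @ np.array([1.0, 2.0, -0.5]) + 0.1 * rng.normal(size=n)

beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta_hat

rss = float(resid @ resid)                 # residual sum of squares
k = int(np.linalg.matrix_rank(X))          # column rank of X
idf = n - k                                # residual degrees of freedom

sigma2 = rss / idf                         # assumed variance estimate
cov_full = sigma2 * np.linalg.inv(X.T @ X) # assumed covariance formula
se = np.sqrt(np.diag(cov_full))            # standard errors
```

When `idf` is zero the division above is undefined, which corresponds to the errno 5 warning case where standard errors and the covariance matrix are not calculated.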

In many linear regression models the first term is taken as a mean term or an intercept, i.e., $X_{i, 1} = 1$, for $i = 1, 2, \dots, n$. This is provided as an option. Also, only some of the possible independent variables may be required in a model, so a facility to select the variables to be included is provided.

Details of the $QR$ decomposition and, if used, the SVD, are made available. These allow the regression to be updated by adding or deleting an observation using linregm_obs_edit(), adding or deleting a variable using linregm_var_add() and linregm_var_del(), or estimating and testing an estimable function using linregm_estfunc(). For the same matrix of independent variables, a new set of parameter estimates can be quickly calculated from a new vector of dependent variables using linregm_fit_newvar(). The details of the factorizations held in $q$, $p$ and $wk$ are only for use by this suite of functions and cannot be used by other functions that use such factorizations, e.g., lapackeig.dormqr, since these will expect a different storage scheme for the input factorization.

References

Cook, R D and Weisberg, S, 1982, Residuals and Influence in Regression, Chapman and Hall

Draper, N R and Smith, H, 1985, Applied Regression Analysis, (2nd Edition), Wiley

Golub, G H and Van Loan, C F, 1996, Matrix Computations, (3rd Edition), Johns Hopkins University Press, Baltimore

Hammarling, S, 1985, The singular value decomposition in multivariate statistics, SIGNUM Newsl. (20(3)), 2–25

McCullagh, P and Nelder, J A, 1983, Generalized Linear Models, Chapman and Hall

Searle, S R, 1971, Linear Models, Wiley
