G12 (Surviv)

Survival Analysis

This chapter is concerned with statistical techniques used in the analysis of survival/reliability/failure time data.

Other chapters contain routines which are also used to analyse this type of data. Chapter G02 contains generalized linear models, Chapter G07
contains routines to fit distribution models, and
Chapter G08 contains rank based methods.

This chapter is concerned with the analysis of the time,
$t$, to a single event. This type of analysis occurs commonly in two areas. In medical research it is known as survival analysis, and the time of interest is often that from the start of treatment to the occurrence of a particular condition or to death. In engineering it is concerned with reliability and the analysis of failure times, that is, how long a component can be used until it fails. In this chapter the time $t$ will be referred to as the **failure time**.

Let the probability density function of the failure time be $f\left(t\right)$; then the **survivor function**,
$S\left(t\right)$, which is the probability of surviving to at least time $t$, is given by

$$S\left(t\right)=\int_{t}^{\infty}f\left(\tau \right)d\tau =1-F\left(t\right)$$

where $F\left(t\right)$ is the cumulative distribution function. The
**hazard function**, $\lambda \left(t\right)$, is the instantaneous rate of failure at time $t$ given that the individual survived up to time $t$, and is given by

$$\lambda \left(t\right)=f\left(t\right)/S\left(t\right)\text{.}$$

The **cumulative hazard** rate is defined as

$$\Lambda \left(t\right)=\int_{0}^{t}\lambda \left(\tau \right)d\tau \text{,}$$

hence $S\left(t\right)={e}^{-\Lambda \left(t\right)}$.
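The relationships between $f$, $S$, $\lambda$ and $\Lambda$ can be checked numerically. The following is a minimal sketch, assuming a Weibull failure time with $\Lambda(t)=\lambda t^{\gamma}$; the function names and parameter values are illustrative only and are not part of any routine in this chapter.

```python
import math

def weibull_pdf(t, lam, gamma):
    """Density f(t) of a Weibull failure time with Lambda(t) = lam * t**gamma."""
    return lam * gamma * t ** (gamma - 1) * math.exp(-lam * t ** gamma)

def weibull_survivor(t, lam, gamma):
    """Survivor function S(t) = 1 - F(t)."""
    return math.exp(-lam * t ** gamma)

def hazard(t, lam, gamma):
    """Hazard function lambda(t) = f(t) / S(t)."""
    return weibull_pdf(t, lam, gamma) / weibull_survivor(t, lam, gamma)

def cumulative_hazard(t, lam, gamma, steps=20000):
    """Integrate the hazard from 0 to t with the midpoint rule."""
    h = t / steps
    return h * sum(hazard((k + 0.5) * h, lam, gamma) for k in range(steps))

# Illustrative parameter values (assumed, not taken from any data set).
lam, gamma, t = 0.5, 1.5, 2.0
print(math.exp(-cumulative_hazard(t, lam, gamma)), weibull_survivor(t, lam, gamma))
```

The two printed values should agree to within the accuracy of the numerical integration, illustrating $S(t)=e^{-\Lambda(t)}$.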

It is common in survival analysis for some of the data to be
**right-censored**. That is, the exact failure time is not known, only that failure occurred after a known time. This may be due to the experiment being terminated before all the individuals have failed, or an individual being removed from the experiment for a reason not connected with effects being tested in the experiment. The presence of censored data leads to complications in the analysis.

There are a number of different rank statistics described in the literature for comparing the survivor functions of $g$ groups, the most common being the logrank statistic. All of these statistics are designed to test the null hypothesis

- ${H}_{0}:{S}_{1}\left(t\right)={S}_{2}\left(t\right)=\cdots ={S}_{g}\left(t\right),\forall t\le \tau $

against the alternative hypothesis

- ${H}_{1}:$ at least one of the ${S}_{j}\left(t\right)$ differs, for some $t\le \tau $.

A rank statistic $T$ is calculated as follows:

Let
${t}_{i}$, for $i=1,2,\dots ,{n}_{d}$, denote the list of distinct failure times across all $g$ groups and ${w}_{i}$ a series of ${n}_{d}$ weights.

Let ${d}_{ij}$ denote the number of failures at time ${t}_{i}$ in group $j$ and ${n}_{ij}$ denote the number of observations in the group $j$ that are known to have not failed prior to time ${t}_{i}$, i.e., the size of the risk set for group $j$ at time ${t}_{i}$. If a censored observation occurs at time ${t}_{i}$ then that observation is treated as if the censoring had occurred slightly after ${t}_{i}$ and, therefore, the observation is counted as being part of the risk set at time ${t}_{i}$.

Finally let

$${d}_{i}=\sum _{j=1}^{g}{d}_{ij}\quad \text{and}\quad {n}_{i}=\sum _{j=1}^{g}{n}_{ij}\text{.}$$

The (weighted) number of observed failures in the $j$th group, ${O}_{j}$, is, therefore, given by

$${O}_{j}=\sum _{i=1}^{{n}_{d}}{w}_{i}{d}_{ij}$$

and the (weighted) number of expected failures in the $j$th group, ${E}_{j}$, by

$${E}_{j}=\sum _{i=1}^{{n}_{d}}{w}_{i}\frac{{n}_{ij}{d}_{i}}{{n}_{i}}\text{.}$$

If $x$ denotes the vector of differences $x=({O}_{1}-{E}_{1},{O}_{2}-{E}_{2},\dots ,{O}_{g}-{E}_{g})$ and

$${V}_{jk}=\sum _{i=1}^{{n}_{d}}{w}_{i}^{2}\left(\frac{{d}_{i}({n}_{i}-{d}_{i})({n}_{i}{n}_{ik}{I}_{jk}-{n}_{ij}{n}_{ik})}{{n}_{i}^{2}({n}_{i}-1)}\right)$$

where
${I}_{jk}=1$
if $j=k$ and $0$ otherwise, then the rank statistic, $T$, is calculated as

$$T=x{V}^{-}{x}^{T}$$

where ${V}^{-}$ denotes a generalized inverse of the matrix $V$.

Under the null hypothesis,
$T\sim {\chi}_{\nu}^{2}$ where the degrees of freedom, $\nu $, is taken as the rank of the matrix $V$.

The different rank statistics are defined by using different weights in the above calculations, for example

| Statistic | Weight |
| --- | --- |
| logrank statistic | ${w}_{i}=1$ |
| Wilcoxon rank statistic | ${w}_{i}={n}_{i}$ |
| Tarone–Ware rank statistic | ${w}_{i}=\sqrt{{n}_{i}}$ |
| Peto–Peto rank statistic | ${w}_{i}=\tilde{S}\left({t}_{i}\right)$ where $\tilde{S}\left({t}_{i}\right)={\displaystyle \prod _{{t}_{j}\le {t}_{i}}}\,\frac{{n}_{j}-{d}_{j}+1}{{n}_{j}+1}$ |
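The calculation of $T$ described above can be sketched in a few lines of Python using NumPy. This is an illustrative sketch only, not the algorithm used by any routine in this chapter: the function name `rank_statistic` and its arguments are hypothetical, the Moore–Penrose pseudo-inverse is used as one possible choice of generalized inverse, and the `weight` callable covers only weights that depend on ${n}_{i}$ and ${d}_{i}$ at the current time (logrank, Wilcoxon, Tarone–Ware), not the Peto–Peto weights.

```python
import numpy as np

def rank_statistic(time, failed, group, weight=lambda n_i, d_i: 1.0):
    """Return (T, nu) for right-censored data split into groups.

    time   : observed times (failure or censoring)
    failed : truthy where the time is a failure, falsy where censored
    group  : group label of each observation
    weight : w_i as a function of (n_i, d_i); the default gives the logrank test
    """
    time = np.asarray(time, dtype=float)
    failed = np.asarray(failed, dtype=bool)
    group = np.asarray(group)
    labels = np.unique(group)
    g = len(labels)

    observed = np.zeros(g)
    expected = np.zeros(g)
    V = np.zeros((g, g))

    for t_i in np.unique(time[failed]):          # distinct failure times
        # Observations censored at t_i remain in the risk set at t_i.
        at_risk = time >= t_i
        fails_now = failed & (time == t_i)
        n_ij = np.array([(at_risk & (group == lab)).sum() for lab in labels], dtype=float)
        d_ij = np.array([(fails_now & (group == lab)).sum() for lab in labels], dtype=float)
        n_i, d_i = n_ij.sum(), d_ij.sum()
        w = weight(n_i, d_i)

        observed += w * d_ij
        expected += w * n_ij * d_i / n_i
        if n_i > 1:                              # the V term is 0/0 when n_i = 1
            scale = w ** 2 * d_i * (n_i - d_i) / (n_i ** 2 * (n_i - 1))
            V += scale * (n_i * np.diag(n_ij) - np.outer(n_ij, n_ij))

    x = observed - expected
    T = float(x @ np.linalg.pinv(V) @ x)         # pseudo-inverse as generalized inverse
    nu = int(np.linalg.matrix_rank(V))           # degrees of freedom
    return T, nu

# Small made-up example: two groups, no censoring.
T, nu = rank_statistic([1, 3, 2, 4], [1, 1, 1, 1], ["A", "A", "B", "B"])
print(T, nu)
```

Passing `weight=lambda n_i, d_i: n_i` or `weight=lambda n_i, d_i: n_i ** 0.5` would give the Wilcoxon and Tarone–Ware variants under the same assumptions.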

The most common estimate of the survivor function for censored data is the **Kaplan–Meier** or **product-limit**
estimate,

$$\hat{S}\left(t\right)=\prod _{j=1}^{i}\left(\frac{{n}_{j}-{d}_{j}}{{n}_{j}}\right)\text{,}\quad {t}_{i}\le t<{t}_{i+1}$$

where ${d}_{j}$ is the number of failures occurring at time ${t}_{j}$ out of ${n}_{j}$ surviving to ${t}_{j}$. This is a step function with steps at each failure time but not at censored times.
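The product-limit formula can be illustrated with a short pure-Python sketch. The function name `kaplan_meier` and the data are hypothetical; the sketch returns only the estimate at each failure time and does not compute standard deviations.

```python
def kaplan_meier(time, failed):
    """Return [(t_i, S_hat(t_i)), ...] for each distinct failure time t_i.

    time   : observed times (failure or censoring)
    failed : True where the time is a failure, False where it is censored
    """
    data = sorted(zip(time, failed))
    n_at_risk = len(data)       # n_j: number still under observation
    s_hat = 1.0
    steps = []
    i = 0
    while i < len(data):
        t = data[i][0]
        d = c = 0               # failures and censorings at time t
        while i < len(data) and data[i][0] == t:
            if data[i][1]:
                d += 1
            else:
                c += 1
            i += 1
        if d > 0:               # the estimate steps only at failure times
            s_hat *= (n_at_risk - d) / n_at_risk
            steps.append((t, s_hat))
        n_at_risk -= d + c      # censored observations leave after time t
    return steps

# Made-up data: failures at 1, 2 and 3; censorings at 2 and 4.
steps = kaplan_meier([1, 2, 2, 3, 4], [True, True, False, True, False])
print(steps)  # steps at t = 1, 2, 3 with estimates of about 0.8, 0.6, 0.3
```

An observation censored at the same time as a failure is counted in the risk set for that failure, matching the convention used for the rank statistics.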

As $S\left(t\right)={e}^{-\Lambda \left(t\right)}$ the cumulative hazard rate can be estimated by

$$\hat{\Lambda}\left(t\right)=-\mathrm{log}\left(\hat{S}\left(t\right)\right)\text{.}$$

A plot of $\hat{\Lambda}\left(t\right)$ or $\mathrm{log}\left(\hat{\Lambda}\left(t\right)\right)$ against $t$ or $\mathrm{log}\left(t\right)$ is often useful in identifying a suitable parametric model for the survival times. The following relationships can be used in the identification.

- (a) Exponential distribution: $\Lambda \left(t\right)=\lambda t$.
- (b) Weibull distribution: $\mathrm{log}\left(\Lambda \left(t\right)\right)=\mathrm{log}\lambda +\gamma \mathrm{log}\left(t\right)$.
- (c) Gompertz distribution: $\mathrm{log}\left(\Lambda \left(t\right)\right)=\mathrm{log}\lambda +\gamma t$.
- (d) Extreme value (smallest) distribution: $\mathrm{log}\left(\Lambda \left(t\right)\right)=\lambda (t-\gamma )$.
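As a rough numerical counterpart to the plot for relationship (b), a straight line can be fitted to $\mathrm{log}(\hat{\Lambda}(t))$ against $\mathrm{log}(t)$. The sketch below assumes that estimates of the survivor function are already available at a set of failure times; the helper names are hypothetical, and an ordinary least squares line is used only as an informal check, not as a recommended way of fitting a Weibull model to censored data.

```python
import math

def fit_line(x, y):
    """Ordinary least squares fit y = intercept + slope * x; returns (slope, intercept)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def weibull_from_survivor(times, s_hat):
    """Estimate (lambda, gamma) from log(Lambda) = log(lambda) + gamma * log(t).

    Points with S_hat equal to 0 or 1 are dropped because log(-log(S_hat))
    is undefined there.
    """
    points = [(math.log(t), math.log(-math.log(s)))
              for t, s in zip(times, s_hat) if t > 0 and 0 < s < 1]
    slope, intercept = fit_line([p[0] for p in points], [p[1] for p in points])
    return math.exp(intercept), slope

# Survivor values generated from an exact Weibull curve (illustrative only).
times = [0.5, 1.0, 2.0, 4.0]
s_hat = [math.exp(-0.5 * t ** 1.5) for t in times]
print(weibull_from_survivor(times, s_hat))
```

If the points lie close to a straight line the Weibull model is plausible; a slope near 1 suggests the exponential special case (a).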

Often in the analysis of survival data the relationship between the hazard function and a number of explanatory variables or covariates is modelled. The covariates may be, for example, group or treatment indicators or measures of the state of the individual at the start of the observational period. There are two types of covariate: time independent covariates, such as those described above, which do not change value during the observational period, and time dependent covariates. The latter can be classified as either external covariates, in which case they are not directly involved with the failure mechanism, or as internal covariates, which are time dependent measurements taken on the individual.

The most common function relating the covariates to the hazard function is the proportional hazard function

$$\lambda (t,z)={\lambda}_{0}\left(t\right)\mathrm{exp}\left({\beta}^{\mathrm{T}}z\right)$$

where ${\lambda}_{0}\left(t\right)$ is a baseline hazard function,
$z$ is a vector of covariates and $\beta $ is a vector of unknown parameters. The assumption is that the covariates have a multiplicative effect on the hazard.
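One consequence of the multiplicative assumption is that the ratio of the hazards for two individuals, $\mathrm{exp}({\beta}^{\mathrm{T}}({z}_{1}-{z}_{2}))$, does not depend on $t$. The sketch below illustrates this with a hypothetical Weibull baseline and made-up values of $\beta$ and $z$; none of these values come from the text.

```python
import math

def hazard(t, z, beta, baseline):
    """Proportional hazard lambda(t, z) = lambda_0(t) * exp(beta^T z)."""
    linear_predictor = sum(b * zk for b, zk in zip(beta, z))
    return baseline(t) * math.exp(linear_predictor)

def baseline(t):
    """Hypothetical Weibull baseline hazard lambda_0(t) = 0.5 * 1.5 * t**0.5."""
    return 0.5 * 1.5 * t ** 0.5

beta = [0.7, -0.2]          # assumed coefficients
z_treated = [1, 3.0]        # treatment indicator = 1, second covariate = 3.0
z_control = [0, 3.0]        # treatment indicator = 0, same second covariate

ratios = [hazard(t, z_treated, beta, baseline) / hazard(t, z_control, beta, baseline)
          for t in (0.5, 1.0, 4.0)]
print(ratios, math.exp(0.7))
```

Each ratio equals $\mathrm{exp}(0.7)$ up to rounding, whatever the time at which it is evaluated.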

The form of ${\lambda}_{0}\left(t\right)$ can be one of the distributions considered above or a nonparametric function. In the case of the exponential, Weibull and extreme value distributions the proportional hazard model can be fitted to censored data using the method described by Aitkin and Clayton (1980) which uses a generalized linear model with Poisson errors. Other possible models are the gamma distribution and the log-normal distribution.

Rather than using a specified form for the hazard function, Cox (1972) considered the case when ${\lambda}_{0}\left(t\right)$ was an unspecified function of time. To fit such a model assuming fixed covariates, a marginal likelihood is used. For each of the times at which a failure occurred,
${t}_{i}$, the set of those who were still in the study is considered; this includes any that were censored at ${t}_{i}$. This set is known as the risk set for time ${t}_{i}$ and is denoted by $R\left({t}_{i}\right)$. Given the risk set, the probability that, out of all possible sets of ${d}_{i}$ subjects that could have failed, the actual observed ${d}_{i}$ cases failed can be written as

$$\frac{\mathrm{exp}\left({s}_{i}^{\mathrm{T}}\beta \right)}{\sum \mathrm{exp}\left({z}_{l}^{\mathrm{T}}\beta \right)} \tag{1}$$

where ${s}_{i}$ is the sum of the covariates of the ${d}_{i}$ individuals observed to fail at ${t}_{i}$ and the summation is over all distinct sets of ${d}_{i}$ individuals drawn from $R\left({t}_{i}\right)$, with ${z}_{l}$ the sum of the covariates in the $l$th such set. This leads to a complex likelihood. If there are no ties in failure times the likelihood reduces to

$$L=\prod _{i=1}^{{n}_{d}}\frac{\mathrm{exp}\left({z}_{i}^{\mathrm{T}}\beta \right)}{\left[{\sum}_{l\in R\left({t}_{i}\right)}\mathrm{exp}\left({z}_{l}^{\mathrm{T}}\beta \right)\right]} \tag{2}$$

where ${n}_{d}$ is the number of distinct failure times. For cases where there are ties the following approximation, due to Peto [2], can be used:

$$L=\prod _{i=1}^{{n}_{d}}\frac{\mathrm{exp}\left({s}_{i}^{\mathrm{T}}\beta \right)}{{\left[{\sum}_{l\in R\left({t}_{i}\right)}\mathrm{exp}\left({z}_{l}^{\mathrm{T}}\beta \right)\right]}^{{d}_{i}}}\text{.} \tag{3}$$
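The logarithm of the approximate likelihood (3) is straightforward to evaluate directly. The following pure-Python sketch does so for a given $\beta$; the function name `cox_log_likelihood` and the data are hypothetical, no maximization is performed, and it is not a description of how any routine in this chapter is implemented. When there are no tied failure times the value coincides with the logarithm of (2).

```python
import math

def cox_log_likelihood(beta, time, failed, z):
    """Log of likelihood (3) for fixed covariates.

    beta   : list of p parameter values
    time   : observed times (failure or censoring)
    failed : True where the time is a failure, False where it is censored
    z      : list of covariate vectors, one list of length p per individual
    """
    def linear_predictor(z_l):
        return sum(b * v for b, v in zip(beta, z_l))

    loglik = 0.0
    failure_times = sorted({t for t, f in zip(time, failed) if f})
    for t_i in failure_times:
        # Risk set R(t_i): everyone still in the study at t_i,
        # including any observation censored at t_i.
        risk_sum = sum(math.exp(linear_predictor(z_l))
                       for t, z_l in zip(time, z) if t >= t_i)
        failing = [z_l for t, f, z_l in zip(time, failed, z) if f and t == t_i]
        d_i = len(failing)
        s_i_beta = sum(linear_predictor(z_l) for z_l in failing)   # s_i^T beta
        loglik += s_i_beta - d_i * math.log(risk_sum)
    return loglik

# Made-up data with one covariate and a tie at time 2.
time = [1, 2, 2, 3]
failed = [True, True, True, False]
z = [[0.0], [1.0], [0.0], [1.0]]
print(cox_log_likelihood([0.0], time, failed, z))
print(cox_log_likelihood([0.5], time, failed, z))
```

With $\beta =0$ every term reduces to $-{d}_{i}\mathrm{log}({n}_{i})$, which gives a simple check on the bookkeeping of the risk sets.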

Having fitted the model, an estimate of the baseline survivor function (derived from ${\lambda}_{0}\left(t\right)$ and the residuals) can be computed to examine the suitability of the model, in particular the proportional hazard assumption.

The following routines are available.

g12aaf computes Kaplan–Meier estimates of the survivor function and their standard deviations.

g12abf performs a comparison of survival curves using rank statistics.

g12baf fits the Cox proportional hazards model for fixed covariates.

g12zaf creates the risk sets associated with the Cox proportional hazards model for fixed covariates.

Depending on the rank statistic required, it may be necessary to call g12abf twice: once to calculate the number of failures (${d}_{i}$) and the total number of observations (${n}_{i}$) at time ${t}_{i}$, to facilitate the computation of the required weights, and once to calculate the required rank statistic.
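The reason a first pass can be needed is that some weights, such as the Peto–Peto weights, are running products over ${d}_{i}$ and ${n}_{i}$. The sketch below computes those weights in plain Python from lists of ${n}_{i}$ and ${d}_{i}$ ordered by failure time; it does not use or imitate the g12abf interface, and the function name and input values are hypothetical.

```python
def peto_peto_weights(n, d):
    """Return w_i = prod over t_j <= t_i of (n_j - d_j + 1) / (n_j + 1).

    n : risk set sizes n_i at the distinct failure times, in time order
    d : numbers of failures d_i at the same times
    """
    weights = []
    running = 1.0
    for n_j, d_j in zip(n, d):
        running *= (n_j - d_j + 1) / (n_j + 1)
        weights.append(running)
    return weights

# Made-up counts at three distinct failure times.
print(peto_peto_weights([4, 3, 2], [1, 1, 1]))  # approximately 0.8, 0.6, 0.4
```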

The following routines from other chapters may also be useful in the analysis of survival data.

g01mbf computes the reciprocal of Mills' Ratio, that is the hazard rate for the Normal distribution.

g02gcf fits a generalized linear model with Poisson errors (see Aitkin and Clayton (1980)).

g02gdf fits a generalized linear model with gamma errors.

g07bbf fits a Normal distribution to censored data.

g07bef fits a Weibull distribution to censored data.

g08rbf fits a linear model using likelihood based on ranks to censored data
(see Kalbfleisch and Prentice (1980)).

g11caf fits a conditional logistic model. When applied to the risk sets generated by g12zaf it fits the Cox proportional hazards model by exact marginal likelihood in the presence of tied observations.

- Cox's proportional hazard model:
  - create the risk sets: g12zaf
  - parameter estimates and other statistics: g12baf
- Survival:
  - rank statistics: g12abf
  - survivor function: g12aaf


Aitkin M and Clayton D (1980) The fitting of exponential, Weibull and extreme value distributions to complex censored survival data using GLIM *Appl. Statist.* **29** 156–163

Cox D R (1972) Regression models in life tables (with discussion) *J. Roy. Statist. Soc. Ser. B* **34** 187–220

Gross A J and Clark V A (1975) *Survival Distributions: Reliability Applications in the Biomedical Sciences* Wiley

Kalbfleisch J D and Prentice R L (1980) *The Statistical Analysis of Failure Time Data* Wiley