
Survival analysis is a statistical method used to analyze the time until an event of interest occurs. It is applied in many fields, including medical research, engineering, and the social sciences. In this post, we will cover the basics of survival analysis: the survival function, the hazard function, the Kaplan-Meier estimator, the log-rank test, and the Cox proportional hazards model.

Definition of Survival Function

The survival function is a fundamental concept in survival analysis. It describes the probability that an individual survives beyond a certain time. Mathematically, the survival function is defined as:

\[S(t) = P(T > t)\]

where $T$ is the random variable representing the time until the event of interest occurs, and $t$ is a specific time point. The survival function can also be interpreted as the proportion of individuals who survive beyond time $t$.
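As a concrete illustration, when every event time is fully observed (no censoring), the survival function can be estimated directly as the fraction of times exceeding $t$. A minimal sketch in Python; the data below are hypothetical:

```python
import numpy as np

def empirical_survival(times, t):
    """Estimate S(t) = P(T > t) as the fraction of observed
    event times that exceed t (valid only without censoring)."""
    return np.mean(np.asarray(times, dtype=float) > t)

# Hypothetical event times, e.g., months until machine failure.
times = [2, 3, 5, 8, 8, 12, 15, 20]
print(empirical_survival(times, 6))   # 5 of 8 times exceed 6 -> 0.625
```

Real survival data almost always contain censored observations, which is why the Kaplan-Meier estimator discussed below is needed in practice.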

Hazard Function

The hazard function is another important concept in survival analysis. It describes the instantaneous rate of occurrence of the event of interest at time $t$, given that the individual has survived up to time $t$. The hazard function is defined as:

\[h(t) = \lim_{\Delta t \rightarrow 0} \frac{P(t \leq T < t + \Delta t | T \geq t)}{\Delta t}\]

The hazard function is a rate rather than a probability: for a small interval of length $\Delta t$, the quantity $h(t) \, \Delta t$ approximates the probability that the event occurs in $[t, t + \Delta t)$, given that the individual has survived up to time $t$.
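A classic example is the exponential distribution, whose hazard is constant. The sketch below approximates $h(t)$ numerically from the limit definition above and confirms it matches the rate parameter; the function names and rate value are illustrative choices, not from the text:

```python
import numpy as np

def surv_exp(t, lam):
    """Survival function S(t) = exp(-lam * t) of an exponential event time."""
    return np.exp(-lam * t)

def hazard_numeric(surv, t, dt=1e-6):
    """Finite-difference approximation of the hazard at t:
    P(t <= T < t + dt | T >= t) / dt = (S(t) - S(t + dt)) / (S(t) * dt)."""
    return (surv(t) - surv(t + dt)) / (surv(t) * dt)

lam = 0.3  # illustrative rate parameter
for t in (0.5, 2.0, 10.0):
    # The exponential hazard is constant: approximately lam at every t.
    print(t, hazard_numeric(lambda u: surv_exp(u, lam), t))
```

This "memoryless" constant hazard is what makes the exponential distribution a common baseline model for event times.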

Kaplan-Meier Estimator

The Kaplan-Meier estimator is a non-parametric method used to estimate the survival function. It is particularly useful because it makes no assumption about the distribution of survival times and it accommodates right-censored observations (individuals whose event had not yet occurred when follow-up ended). Let $t_1 < t_2 < \dots < t_n$ be the distinct observed event times, and let $d_1, d_2, \dots, d_n$ be the corresponding numbers of events (e.g., deaths) at those times. The Kaplan-Meier estimator is defined as:

\[\hat{S}(t) = \prod_{i:t_i \leq t} \left( 1 - \frac{d_i}{n_i} \right)\]

where $n_i$ is the number of individuals at risk just before time $t_i$, that is, those who have neither experienced the event nor been censored before $t_i$. The estimator is the product, over all event times up to $t$, of the conditional probabilities of surviving each of those times. Censored individuals still contribute to the risk sets $n_i$ for all event times before their censoring time.
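The product-limit formula translates directly into code. Below is a minimal sketch in Python with hypothetical right-censored data; in practice a library such as lifelines (its `KaplanMeierFitter`) would be used instead:

```python
import numpy as np

def kaplan_meier(times, events):
    """Kaplan-Meier product-limit estimator (minimal sketch).

    times  : observed times (event or censoring)
    events : 1 if the event occurred, 0 if the observation was censored
    Returns the distinct event times and the survival estimates at them.
    """
    times = np.asarray(times, float)
    events = np.asarray(events, int)
    event_times = np.unique(times[events == 1])
    s, estimates = 1.0, []
    for t in event_times:
        n_i = np.sum(times >= t)                      # at risk just before t
        d_i = np.sum((times == t) & (events == 1))    # events at time t
        s *= 1.0 - d_i / n_i                          # conditional survival
        estimates.append(s)
    return event_times, np.array(estimates)

# Hypothetical follow-up data: events at 3, 5, 8, 12; censoring at 5 and 10.
times = [3, 5, 5, 8, 10, 12]
events = [1, 1, 0, 1, 0, 1]
ts, s_hat = kaplan_meier(times, events)
for t, s in zip(ts, s_hat):
    print(t, round(float(s), 3))
```

Note how the individual censored at time 5 still counts toward the risk sets at times 3 and 5, which is exactly how the estimator uses the partial information in censored observations.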

Log-Rank Test

The log-rank test is a statistical test used to compare the survival distributions of two or more groups. The null hypothesis is that the survival distributions are the same across all groups. The test is based on the difference between the observed and expected number of events in each group, assuming that the null hypothesis is true. The test statistic is given by:

\[\chi^2 = \sum_{i=1}^{k} \frac{(O_i - E_i)^2}{E_i}\]

where $O_i$ is the observed number of events in group $i$ and $E_i$ is the expected number of events in group $i$ under the null hypothesis, both accumulated over the distinct event times. Under the null hypothesis, this statistic approximately follows a chi-squared distribution with $k-1$ degrees of freedom, where $k$ is the number of groups being compared.
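For the common two-group case, the statistic is usually computed in the form $(O_1 - E_1)^2 / V$, where $V$ is the variance of $O_1$ under the null hypothesis accumulated over the event times; this is slightly more accurate than dividing by the expected counts. A self-contained sketch, with the function name and data being hypothetical:

```python
import math
import numpy as np

def logrank_two_groups(t1, e1, t2, e2):
    """Two-sample log-rank test (minimal sketch).

    t1, e1 : times and event indicators (1 = event, 0 = censored), group 1
    t2, e2 : the same for group 2
    Returns the chi-squared statistic (1 df) and its p-value.
    """
    times = np.concatenate([np.asarray(t1, float), np.asarray(t2, float)])
    events = np.concatenate([np.asarray(e1, int), np.asarray(e2, int)])
    group1 = np.concatenate([np.ones(len(t1), bool), np.zeros(len(t2), bool)])
    O1 = E1 = V = 0.0
    for t in np.unique(times[events == 1]):          # distinct event times
        at_risk = times >= t
        n, n1 = at_risk.sum(), (at_risk & group1).sum()
        d = ((times == t) & (events == 1)).sum()     # events at t, both groups
        d1 = ((times == t) & (events == 1) & group1).sum()
        O1 += d1
        E1 += d * n1 / n                             # expected group-1 events
        if n > 1:                                    # hypergeometric variance
            V += d * (n1 / n) * (1 - n1 / n) * (n - d) / (n - 1)
    stat = (O1 - E1) ** 2 / V
    p = math.erfc(math.sqrt(stat / 2.0))             # chi-squared(1) tail prob
    return stat, p
```

The statistic is symmetric in the two groups, since $(O_1 - E_1) = -(O_2 - E_2)$ when there are only two groups.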

Cox Proportional Hazards Model

The Cox proportional hazards model is a widely used semi-parametric method in survival analysis. It is used to model the relationship between covariates (i.e., explanatory variables) and the hazard function, while assuming that the hazard ratios are constant over time. The hazard ratio represents the relative risk of the event of interest for individuals with a certain value of the covariate, compared to individuals with a reference value of the covariate. The Cox model assumes that the hazard function is given by:

\[h(t, \boldsymbol{X}) = h_0(t) \exp(\beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_p X_p)\]

where $h_0(t)$ is the baseline hazard function (the hazard for an individual whose covariates are all zero, i.e., at the reference level), $\boldsymbol{X}$ is a vector of $p$ covariates, and $\beta_1, \beta_2, \dots, \beta_p$ are the corresponding regression coefficients. The Cox model makes no assumption about the shape of the baseline hazard function, which makes it more flexible than fully parametric models.
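Interpreting the coefficients is straightforward: because $h_0(t)$ cancels in the ratio of two hazards, a one-unit increase in covariate $X_j$ multiplies the hazard by $\exp(\beta_j)$. A small numeric illustration, with the coefficient value being a hypothetical fitted estimate:

```python
import numpy as np

# Hypothetical fitted coefficient for a binary treatment covariate.
beta_treatment = -0.69

# Because h_0(t) cancels in the ratio, the hazard ratio for treated
# versus control individuals is simply exp(beta):
hazard_ratio = np.exp(beta_treatment)
print(round(float(hazard_ratio), 3))   # ~0.502: treatment roughly halves the hazard
```

A hazard ratio below 1 indicates a protective effect, above 1 an increased risk, under the proportional-hazards assumption that this ratio does not change over time.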

The partial likelihood for the Cox model, in which the baseline hazard $h_0(t)$ cancels out, can be written as:

\[L(\beta_1, \beta_2, \ldots, \beta_p) = \prod_{i=1}^n \left( \frac{\exp(\beta_1 X_{i1} + \beta_2 X_{i2} + \cdots + \beta_p X_{ip})}{\sum_{j \in R_i} \exp(\beta_1 X_{j1} + \beta_2 X_{j2} + \cdots + \beta_p X_{jp})} \right)^{\delta_i}\]

where $X_{i1}, X_{i2}, \dots, X_{ip}$ are the covariate values for individual $i$, $\delta_i$ is the event indicator for individual $i$ ($\delta_i = 1$ if the event of interest is observed for individual $i$, and $\delta_i = 0$ if the observation is censored), and $R_i$ is the set of individuals still at risk at individual $i$'s event time. The regression coefficients are estimated by maximizing this partial likelihood (in practice, its logarithm).
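The partial likelihood can be evaluated directly. Below is a minimal sketch of its logarithm in Python, assuming no tied event times; the data and function name are illustrative:

```python
import numpy as np

def cox_log_partial_likelihood(beta, X, times, events):
    """Log partial likelihood of the Cox model (assumes no tied event times).

    beta   : (p,) coefficient vector
    X      : (n, p) covariate matrix
    times  : (n,) observed times
    events : (n,) 1 if the event was observed, 0 if censored
    """
    beta = np.asarray(beta, float)
    X = np.asarray(X, float)
    times = np.asarray(times, float)
    events = np.asarray(events, int)
    eta = X @ beta                          # linear predictor for each subject
    ll = 0.0
    for i in np.where(events == 1)[0]:
        risk_set = times >= times[i]        # R_i: subjects still at risk at t_i
        ll += eta[i] - np.log(np.sum(np.exp(eta[risk_set])))
    return ll

# With beta = 0 every subject is exchangeable, so each factor reduces to
# 1 / (size of the risk set): log L = -(log 3 + log 2 + log 1) here.
ll = cox_log_partial_likelihood([0.0], [[1.0], [2.0], [3.0]], [1, 2, 3], [1, 1, 1])
print(ll)   # -log(6), about -1.7918
```

Maximizing this function, for example by minimizing its negative with `scipy.optimize.minimize`, yields the coefficient estimates; production code would typically use a library such as lifelines' `CoxPHFitter` instead.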


Survival analysis is a powerful tool for analyzing time-to-event data. The survival function and hazard function are important concepts in survival analysis, and the Kaplan-Meier estimator is a useful non-parametric method for estimating the survival function. The log-rank test and Cox proportional hazards model are commonly used statistical methods for comparing survival distributions and modeling the relationship between covariates and the hazard function, respectively. By understanding these concepts and methods, researchers can gain valuable insights into the time-to-event outcomes in their data.
