A Glimpse of Survival Analysis
Survival analysis is the toolkit for time-to-event questions—where the “event” could be death, but in computer systems it’s more often:
- time until a server fails
- time until a customer churns
- time until an incident is resolved
- time until a job completes
- time until a cache entry expires (or until a key becomes “cold”)
What makes survival analysis special is that it handles incomplete observation windows correctly. In real engineering datasets, you frequently do not observe the event for everyone before you stop collecting data:
- A customer hasn’t churned yet when you export the dataset.
- A disk hasn’t failed yet when you stop the experiment.
- A request hasn’t completed yet when you cut off a trace.
Treating those as “missing” or dropping them biases you toward shorter times and overconfident conclusions. Survival analysis is designed to keep that partial “last-seen” information and remain statistically consistent.
Survival data in systems: what you actually record
A time-to-event dataset typically has at least two columns:
- duration: how long you observed the unit (user/machine/request)
- event: whether the event occurred within your observation window (1) or not (0)
A concrete toy example: time-to-churn (days) for 6 trial users.
| user | duration (days) | event? | meaning |
|---|---|---|---|
| A | 4 | 1 | churned on day 4 |
| B | 7 | 0 | still active at day 7 (not observed to churn yet) |
| C | 2 | 1 | churned on day 2 |
| D | 10 | 0 | still active at day 10 (not observed to churn yet) |
| E | 6 | 1 | churned on day 6 |
| F | 3 | 1 | churned on day 3 |
Key idea (from the reference video): the “event=0” rows are not “unknown.” They tell you the unit lasted at least that long. [survival-video]
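To make the table concrete, here is one way you might encode it, assuming pandas is available; the column names simply mirror the table and are otherwise arbitrary.

```python
# Toy churn dataset from the table above, as a pandas DataFrame.
# (pandas is an assumption here; any tabular representation works.)
import pandas as pd

df = pd.DataFrame({
    "user":     ["A", "B", "C", "D", "E", "F"],
    "duration": [4, 7, 2, 10, 6, 3],  # days observed from the chosen time origin
    "event":    [1, 0, 1, 0, 1, 1],   # 1 = churned; 0 = still active when last seen
})
print(df)
```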
You also need to be explicit about your time origin:
- For churn: trial start, first purchase, first session, last renewal?
- For failures: installation time, last maintenance, last reboot?
- For incidents: page time, incident creation time, first alert time?
Changing the origin changes the interpretation of the curves and coefficients.
Two core functions: survival and hazard (with intuition)
Survival function answers: “What fraction lasts beyond time t?”
\[S(t) = P(T > t)\]
For churn, you can read $S(30)=0.8$ as “80% of users are still active after 30 days.” [survival-analysis]
Hazard function answers a different question: “Given you’ve made it to time t, how ‘risky’ is right now?”
Formally:
\[h(t) = \lim_{\Delta t \rightarrow 0}\frac{P(t \le T < t+\Delta t \mid T \ge t)}{\Delta t}\]
In systems terms, hazard is often closer to what you want operationally:
- “Given a server has been up 20 days, what is its instantaneous failure rate now?”
- “Given a user is still active at week 4, what is the instantaneous churn pressure now?”
A practical relationship to remember:
- high hazard around time t means the survival curve drops steeply around time t
- if hazard increases with time, you’re in “wear-out” behavior (common in hardware); if it decreases, you may be seeing “early-life failures” (infant mortality) or onboarding drop-off
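One standard family that exhibits both behaviors is the Weibull distribution; this is an illustration of mine, not something the analyses in this post depend on. Shape $k > 1$ gives a hazard that rises with time, $k < 1$ one that falls:

```python
# Weibull hazard h(t) = (k/lam) * (t/lam)**(k-1); scale lam fixed at 1 here.
# k > 1: hazard rises over time ("wear-out"); k < 1: hazard falls ("infant mortality").
def weibull_hazard(t, k, lam=1.0):
    return (k / lam) * (t / lam) ** (k - 1)

for t in [0.5, 1.0, 2.0]:
    print(t, weibull_hazard(t, k=2.0), weibull_hazard(t, k=0.5))
# k=2.0 column increases (1.0 -> 2.0 -> 4.0); k=0.5 column decreases (~0.71 -> 0.5 -> ~0.35)
```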
A common derived quantity is the cumulative hazard:
\[H(t) = \int_0^t h(u)\,du \quad \text{and} \quad S(t) = e^{-H(t)}\]
This becomes useful when you want to add hazards over stages or interpret models on a log scale. [survival-analysis]
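As a quick numeric check of $S(t) = e^{-H(t)}$, take the simplest case of a constant hazard (the exponential model); the rate of 0.1 per day below is purely illustrative:

```python
# Constant hazard (exponential model): H(t) = h * t, so S(t) = exp(-h * t).
import math

h = 0.1                        # illustrative hazard: 0.1 per day
for t in [1, 7, 30]:
    H = h * t                  # cumulative hazard up to day t
    print(t, math.exp(-H))     # S(1) ~ 0.905, S(7) ~ 0.497, S(30) ~ 0.050
```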
The Kaplan–Meier curve: a survival curve from observed events plus “last-seen” records
If you do not want to assume a distribution, the Kaplan–Meier (KM) estimator gives a non-parametric estimate of $S(t)$ using (a) observed event times and (b) “last-seen active” times for units that did not experience the event within the window. [kaplan-meier]
The estimator is:
\[\hat{S}(t) = \prod_{i:t_i \le t}\left(1-\frac{d_i}{n_i}\right)\]
where at each event time $t_i$:
- $n_i$ = number “at risk” just before $t_i$ (still being observed, event not yet happened)
- $d_i$ = number of events at $t_i$
Toy KM computation (churn example).
Using the 6-user table above, sort by time and update the risk set:
Event times are 2, 3, 4, 6 (and we have “last-seen” times at 7 and 10).
- At day 2: $n=6$, $d=1$ → multiply by $(1-1/6)=5/6$
- At day 3: $n=5$, $d=1$ → multiply by $(1-1/5)=4/5$
- At day 4: $n=4$, $d=1$ → multiply by $(1-1/4)=3/4$
- At day 6: $n=3$, $d=1$ → multiply by $(1-1/3)=2/3$
So:
- $\hat S(2)=5/6 \approx 0.833$
- $\hat S(3)=(5/6)(4/5)=4/6 \approx 0.667$
- $\hat S(4)=(5/6)(4/5)(3/4)=3/6 = 0.5$
- $\hat S(6)=(5/6)(4/5)(3/4)(2/3)=2/6 \approx 0.333$
Where “last-seen” matters: users B and D remain “at risk” up to days 7 and 10, respectively, contributing correctly to $n_i$ up to those times. KM is effectively saying: “they didn’t churn before 7/10, and that information counts.” [kaplan-meier]
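The same computation in code, spelled out directly from the product-limit formula; in practice you would likely reach for a maintained implementation such as lifelines’ KaplanMeierFitter instead.

```python
# Hand-rolled Kaplan–Meier for the 6-user table; reproduces 5/6, 4/6, 3/6, 2/6.
from collections import Counter

durations = [4, 7, 2, 10, 6, 3]
events    = [1, 0, 1, 0, 1, 1]   # 1 = churned, 0 = last-seen active

d = Counter(t for t, e in zip(durations, events) if e == 1)  # events per time
c = Counter(t for t, e in zip(durations, events) if e == 0)  # last-seen exits per time

n_at_risk = len(durations)
survival = 1.0
for t in sorted(set(durations)):
    if d[t]:                                  # the curve only steps at event times
        survival *= 1 - d[t] / n_at_risk
        print(f"S({t}) = {survival:.3f}")
    n_at_risk -= d[t] + c[t]                  # every exit shrinks the risk set
```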
A highly practical workflow for product/systems:
- Plot KM curves for cohorts (e.g., different regions, hardware batches, onboarding variants).
- Look for when curves separate: early vs late differences are operationally distinct (onboarding vs retention; infant mortality vs wear-out).
Comparing groups: the log-rank test (what it is and when it lies)
To test whether two groups have different survival curves, the log-rank test compares observed vs expected events over time under the null that the groups share the same survival distribution. [logrank-test]
The intuition:
- At each event time, if Group A has 60% of the risk set, then under the null it “should” get ~60% of the events.
- The log-rank statistic aggregates how much reality deviates from that expectation across time.
When it is useful:
- quick sanity check that a cohort difference is not noise
- comparing two versions of a system or product feature rollout
When it can mislead:
- if the curves cross (effects change over time)
- if observation windows differ systematically between groups (e.g., one region has shorter follow-up windows)
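To see the mechanics concretely, here is a hedged sketch assuming the lifelines library; group A is the toy cohort from above, and group B’s data is fabricated purely for illustration:

```python
# Log-rank test with lifelines (group B invented for illustration).
from lifelines.statistics import logrank_test

durations_a = [4, 7, 2, 10, 6, 3]            # the toy churn cohort from above
events_a    = [1, 0, 1, 0, 1, 1]
durations_b = [9, 12, 8, 15, 11, 10]         # a hypothetical second cohort
events_b    = [1, 0, 1, 0, 0, 1]

result = logrank_test(durations_a, durations_b,
                      event_observed_A=events_a,
                      event_observed_B=events_b)
print(result.p_value)   # a small p-value suggests the survival curves differ
```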
Cox proportional hazards: turning covariates into “risk multipliers”
The Cox proportional hazards model connects covariates to hazard without specifying the baseline hazard shape. [cox-ph]
\[h(t \mid \mathbf{X}) = h_0(t)\exp(\beta^\top \mathbf{X})\]
How to read it in practice:
- $\exp(\beta_j)$ is a hazard ratio for a 1-unit increase in $X_j$.
- Hazard ratio > 1 means “riskier” (event happens sooner on average), < 1 means “protective.”
Toy interpretation (churn).
If your Cox model yields:
- $\exp(\beta_{\text{annual\_plan}})=0.7$
Then, holding other covariates fixed, annual-plan users have 30% lower instantaneous churn hazard than monthly-plan users at any given time.
The key assumption: proportional hazards means those hazard ratios are roughly constant over time (i.e., the curves differ by a multiplicative factor in hazard, not by shape). When this is false, you may need:
- time-varying covariates/effects
- stratified Cox
- an accelerated failure time (AFT) model for “time scaling” instead of “hazard scaling” [survival-analysis]
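A minimal Cox fit, again assuming lifelines; the covariate and all values are invented, and a dataset this small is for illustration only:

```python
# Toy Cox proportional hazards fit with lifelines (data invented; real fits
# need far more rows for the hazard ratios to be meaningful).
import pandas as pd
from lifelines import CoxPHFitter

df = pd.DataFrame({
    "duration":    [4, 7, 2, 10, 6, 3, 9, 12],
    "event":       [1, 0, 1, 0, 1, 1, 1, 0],
    "annual_plan": [0, 1, 0, 1, 0, 1, 1, 0],   # 1 = annual plan, 0 = monthly
})

cph = CoxPHFitter()
cph.fit(df, duration_col="duration", event_col="event")
cph.print_summary()              # the exp(coef) column is the hazard ratio
# cph.check_assumptions(df) probes the proportional-hazards assumption
```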
Practical checklist for engineers
1) Define observation windows explicitly
Write down, in the experiment doc (not just in code), what “duration” means and when observation stops (end of study, export date, decommission time, trace cutoff).
2) Check whether observation stopping is correlated with the event
Ask simple pipeline questions:
- Do some users “disappear” because tracking stops (not because behavior changed)?
- Do some machines leave the dataset because they were proactively replaced (possibly due to warning signs)?
If “stopping observation” is correlated with the event, naive analyses are biased; in the proactive-replacement example, removing machines that show warning signs inflates the estimated survival.
3) Start with KM curves before fitting models
KM gives shape intuition: early drop, long tail, crossing hazards, etc. [kaplan-meier]
4) Use Cox when you need covariate-adjusted answers
Examples:
- adjust failure risk for load, temperature, and batch
- adjust churn risk for acquisition channel and engagement
5) Operationalize outputs
Good time-to-event analysis ends with a decision:
- “Which cohort should we target, and when?”
- “What burn-in period reduces infant mortality?”
- “Which covariate is the strongest early-warning signal?”
A simple operational trick: evaluate $S(t)$ at business-relevant horizons (e.g., day 1, day 7, day 30) and report those, not just “the curve.”
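Concretely, with a fitted KM curve you can read those horizons off directly (lifelines assumed; note that KM carries its last value forward past the final observed time):

```python
# Report S(t) at business horizons from a Kaplan–Meier fit (lifelines assumed).
from lifelines import KaplanMeierFitter

kmf = KaplanMeierFitter()
kmf.fit(durations=[4, 7, 2, 10, 6, 3], event_observed=[1, 0, 1, 0, 1, 1])

for horizon in [1, 7, 30]:
    # predict() evaluates the estimated survival function at a time point;
    # beyond the last observed time (day 10 here) KM holds its last value.
    print(horizon, float(kmf.predict(horizon)))
```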
References
- Survival analysis (overview). [survival-analysis]
- Kaplan–Meier estimator. [kaplan-meier]
- Log-rank test. [logrank-test]
- Proportional hazards / Cox model. [cox-ph]
- Introduction to survival data and incomplete observation windows (video). [survival-video]