A Glimpse of Survival Analysis
Survival analysis is the toolkit for time-to-event questions—where the “event” could be death, but in computer systems it’s more often:
- time until a server fails
- time until a customer churns
- time until an incident is resolved
- time until a job completes
- time until a cache entry expires (or until a key becomes “cold”)
What makes survival analysis special is that it handles incomplete observation windows correctly. In real engineering datasets, you frequently do not observe the event for everyone before you stop collecting data:
- A customer hasn’t churned yet when you export the dataset.
- A disk hasn’t failed yet when you stop the experiment.
- A request hasn’t completed yet when you cut off a trace.
Treating those as “missing” or dropping them biases you toward shorter times and overconfident conclusions. Survival analysis is designed to keep that partial “last-seen” information and remain statistically consistent.
Survival data in systems: what you actually record
A time-to-event dataset typically has at least two columns:
- duration: how long you observed the unit (user/machine/request)
- event: whether the event occurred within your observation window (1) or not (0)
A concrete toy example: time-to-churn (days) for 6 trial users.
| user | duration (days) | event? | meaning |
|---|---|---|---|
| A | 4 | 1 | churned on day 4 |
| B | 7 | 0 | still active at day 7 (not observed to churn yet) |
| C | 2 | 1 | churned on day 2 |
| D | 10 | 0 | still active at day 10 (not observed to churn yet) |
| E | 6 | 1 | churned on day 6 |
| F | 3 | 1 | churned on day 3 |
Key idea (from the reference video): the “event=0” rows are not “unknown.” They tell you the unit lasted at least that long. [survival-video]
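To make the table concrete, here is one way you might encode it, assuming pandas is available; the column names simply mirror the table and are otherwise arbitrary.

```python
# Toy churn dataset from the table above, as a pandas DataFrame.
# (pandas is an assumption here; any tabular representation works.)
import pandas as pd

df = pd.DataFrame({
    "user":     ["A", "B", "C", "D", "E", "F"],
    "duration": [4, 7, 2, 10, 6, 3],  # days observed from the chosen time origin
    "event":    [1, 0, 1, 0, 1, 1],   # 1 = churned; 0 = still active when last seen
})
print(df)
```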
You also need to be explicit about your time origin:
- For churn: trial start, first purchase, first session, last renewal?
- For failures: installation time, last maintenance, last reboot?
- For incidents: page time, incident creation time, first alert time?
Changing the origin changes the interpretation of the curves and coefficients.
Two core functions: survival and hazard (with intuition)
Survival function answers: “What fraction lasts beyond time t?”
\[S(t) = P(T > t)\]
For churn, you can read $S(30)=0.8$ as “80% of users are still active after 30 days.” [survival-analysis]
Hazard function answers a different question: “Given you’ve made it to time t, how ‘risky’ is right now?”
Formally:
\[h(t) = \lim_{\Delta t \rightarrow 0}\frac{P(t \le T < t+\Delta t \mid T \ge t)}{\Delta t}\]
In systems terms, hazard is often closer to what you want operationally:
- “Given a server has been up 20 days, what is its instantaneous failure rate now?”
- “Given a user is still active at week 4, what is the instantaneous churn pressure now?”
A practical relationship to remember:
- high hazard around time t means the survival curve drops steeply around time t
- if hazard increases with time, you’re in “wear-out” behavior (common in hardware); if it decreases, you may be seeing “early-life failures” (infant mortality) or onboarding drop-off
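One standard family that exhibits both behaviors is the Weibull distribution; this is an illustration of mine, not something the analyses in this post depend on. Shape $k > 1$ gives a hazard that rises with time, $k < 1$ one that falls:

```python
# Weibull hazard h(t) = (k/lam) * (t/lam)**(k-1); scale lam fixed at 1 here.
# k > 1: hazard rises over time ("wear-out"); k < 1: hazard falls ("infant mortality").
def weibull_hazard(t, k, lam=1.0):
    return (k / lam) * (t / lam) ** (k - 1)

for t in [0.5, 1.0, 2.0]:
    print(t, weibull_hazard(t, k=2.0), weibull_hazard(t, k=0.5))
# k=2.0 column increases (1.0 -> 2.0 -> 4.0); k=0.5 column decreases (~0.71 -> 0.5 -> ~0.35)
```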
A common derived quantity is the cumulative hazard:
\[H(t) = \int_0^t h(u)\,du \quad \text{and} \quad S(t) = e^{-H(t)}\]
This becomes useful when you want to add hazards over stages or interpret models on a log scale. [survival-analysis]
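As a quick numeric check of $S(t) = e^{-H(t)}$, take the simplest case of a constant hazard (the exponential model); the rate of 0.1 per day below is purely illustrative:

```python
# Constant hazard (exponential model): H(t) = h * t, so S(t) = exp(-h * t).
import math

h = 0.1                        # illustrative hazard: 0.1 per day
for t in [1, 7, 30]:
    H = h * t                  # cumulative hazard up to day t
    print(t, math.exp(-H))     # S(1) ~ 0.905, S(7) ~ 0.497, S(30) ~ 0.050
```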
The Kaplan–Meier curve: a survival curve from observed events plus “last-seen” records
If you do not want to assume a distribution, the Kaplan–Meier (KM) estimator gives a non-parametric estimate of $S(t)$ using (a) observed event times and (b) “last-seen active” times for units that did not experience the event within the window. [kaplan-meier]
The estimator is:
\[\hat{S}(t) = \prod_{i:t_i \le t}\left(1-\frac{d_i}{n_i}\right)\]
where at each event time $t_i$:
- $n_i$ = number “at risk” just before $t_i$ (still being observed, event not yet happened)
- $d_i$ = number of events at $t_i$
Toy KM computation (churn example).
Using the 6-user table above, sort by time and update the risk set:
Event times are 2, 3, 4, 6 (and we have “last-seen” times at 7 and 10).
- At day 2: $n=6$, $d=1$ → multiply by $(1-1/6)=5/6$
- At day 3: $n=5$, $d=1$ → multiply by $(1-1/5)=4/5$
- At day 4: $n=4$, $d=1$ → multiply by $(1-1/4)=3/4$
- At day 6: $n=3$, $d=1$ → multiply by $(1-1/3)=2/3$
So:
- $\hat S(2)=5/6 \approx 0.833$
- $\hat S(3)=(5/6)(4/5)=4/6 \approx 0.667$
- $\hat S(4)=(5/6)(4/5)(3/4)=3/6 = 0.5$
- $\hat S(6)=(5/6)(4/5)(3/4)(2/3)=2/6 \approx 0.333$
Where “last-seen” matters: users B and D remain “at risk” up to days 7 and 10, respectively, contributing correctly to $n_i$ up to those times. KM is effectively saying: “they didn’t churn before 7/10, and that information counts.” [kaplan-meier]
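The same computation in code, spelled out directly from the product-limit formula; in practice you would likely reach for a maintained implementation such as lifelines’ KaplanMeierFitter instead.

```python
# Hand-rolled Kaplan–Meier for the 6-user table; reproduces 5/6, 4/6, 3/6, 2/6.
from collections import Counter

durations = [4, 7, 2, 10, 6, 3]
events    = [1, 0, 1, 0, 1, 1]   # 1 = churned, 0 = last-seen active

d = Counter(t for t, e in zip(durations, events) if e == 1)  # events per time
c = Counter(t for t, e in zip(durations, events) if e == 0)  # last-seen exits per time

n_at_risk = len(durations)
survival = 1.0
for t in sorted(set(durations)):
    if d[t]:                                  # the curve only steps at event times
        survival *= 1 - d[t] / n_at_risk
        print(f"S({t}) = {survival:.3f}")
    n_at_risk -= d[t] + c[t]                  # every exit shrinks the risk set
```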
A highly practical workflow for product/systems:
- Plot KM curves for cohorts (e.g., different regions, hardware batches, onboarding variants).
- Look for when curves separate: early vs late differences are operationally distinct (onboarding vs retention; infant mortality vs wear-out).
Comparing groups: the log-rank test (what it is and when it lies)
To test whether two groups have different survival curves, the log-rank test compares observed vs expected events over time under the null that the groups share the same survival distribution. [logrank-test]
The intuition:
- At each event time, if Group A has 60% of the risk set, then under the null it “should” get ~60% of the events.
- The log-rank statistic aggregates how much reality deviates from that expectation across time.
When it is useful:
- quick sanity check that a cohort difference is not noise
- comparing two versions of a system or product feature rollout
When it can mislead:
- if the curves cross (effects change over time)
- if observation windows differ systematically between groups (e.g., one region has shorter follow-up windows)
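To see the mechanics concretely, here is a hedged sketch assuming the lifelines library; group A is the toy cohort from above, and group B’s data is fabricated purely for illustration:

```python
# Log-rank test with lifelines (group B invented for illustration).
from lifelines.statistics import logrank_test

durations_a = [4, 7, 2, 10, 6, 3]            # the toy churn cohort from above
events_a    = [1, 0, 1, 0, 1, 1]
durations_b = [9, 12, 8, 15, 11, 10]         # a hypothetical second cohort
events_b    = [1, 0, 1, 0, 0, 1]

result = logrank_test(durations_a, durations_b,
                      event_observed_A=events_a,
                      event_observed_B=events_b)
print(result.p_value)   # a small p-value suggests the survival curves differ
```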
Cox proportional hazards: turning covariates into “risk multipliers”
The Cox proportional hazards model connects covariates to hazard without specifying the baseline hazard shape. [cox-ph]
\[h(t \mid \mathbf{X}) = h_0(t)\exp(\beta^\top \mathbf{X})\]
How to read it in practice:
- $\exp(\beta_j)$ is a hazard ratio for a 1-unit increase in $X_j$.
- Hazard ratio > 1 means “riskier” (event happens sooner on average), < 1 means “protective.”
Toy interpretation (churn).
If your Cox model yields:
- $\exp(\beta_{\text{annual\_plan}})=0.7$
Then, holding other covariates fixed, annual-plan users have 30% lower instantaneous churn hazard than monthly-plan users at any given time.
The key assumption: proportional hazards means those hazard ratios are roughly constant over time (i.e., the curves differ by a multiplicative factor in hazard, not by shape). When this is false, you may need:
- time-varying covariates/effects
- stratified Cox
- an accelerated failure time (AFT) model for “time scaling” instead of “hazard scaling” [survival-analysis]
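A minimal Cox fit, again assuming lifelines; the covariate and all values are invented, and a dataset this small is for illustration only:

```python
# Toy Cox proportional hazards fit with lifelines (data invented; real fits
# need far more rows for the hazard ratios to be meaningful).
import pandas as pd
from lifelines import CoxPHFitter

df = pd.DataFrame({
    "duration":    [4, 7, 2, 10, 6, 3, 9, 12],
    "event":       [1, 0, 1, 0, 1, 1, 1, 0],
    "annual_plan": [0, 1, 0, 1, 0, 1, 1, 0],   # 1 = annual plan, 0 = monthly
})

cph = CoxPHFitter()
cph.fit(df, duration_col="duration", event_col="event")
cph.print_summary()              # the exp(coef) column is the hazard ratio
# cph.check_assumptions(df) probes the proportional-hazards assumption
```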
Practical checklist for engineers
1) Define observation windows explicitly
Write down, in the experiment doc (not just in code), what “duration” means and when observation stops (end of study, export date, decommission time, trace cutoff).
2) Check whether observation stopping is correlated with the event
Ask simple pipeline questions:
- Do some users “disappear” because tracking stops (not because behavior changed)?
- Do some machines leave the dataset because they were proactively replaced (possibly due to warning signs)?
If “stopping observation” is correlated with the event, naive analyses are biased; in the proactive-replacement example, removing machines that show warning signs inflates the estimated survival.
3) Start with KM curves before fitting models
KM gives shape intuition: early drop, long tail, crossing hazards, etc. [kaplan-meier]
4) Use Cox when you need covariate-adjusted answers
Examples:
- adjust failure risk for load, temperature, and batch
- adjust churn risk for acquisition channel and engagement
5) Operationalize outputs
Good time-to-event analysis ends with a decision:
- “Which cohort should we target, and when?”
- “What burn-in period reduces infant mortality?”
- “Which covariate is the strongest early-warning signal?”
A simple operational trick: evaluate $S(t)$ at business-relevant horizons (e.g., day 1, day 7, day 30) and report those, not just “the curve.”
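Concretely, with a fitted KM curve you can read those horizons off directly (lifelines assumed; note that KM carries its last value forward past the final observed time):

```python
# Report S(t) at business horizons from a Kaplan–Meier fit (lifelines assumed).
from lifelines import KaplanMeierFitter

kmf = KaplanMeierFitter()
kmf.fit(durations=[4, 7, 2, 10, 6, 3], event_observed=[1, 0, 1, 0, 1, 1])

for horizon in [1, 7, 30]:
    # predict() evaluates the estimated survival function at a time point;
    # beyond the last observed time (day 10 here) KM holds its last value.
    print(horizon, float(kmf.predict(horizon)))
```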
References
- Survival analysis (overview). [survival-analysis]
- Kaplan–Meier estimator. [kaplan-meier]
- Log-rank test. [logrank-test]
- Proportional hazards / Cox model. [cox-ph]
- Introduction to survival data and incomplete observation windows (video). [survival-video]