Queuing Theory: Understanding Waiting Lines
Some people prefer the term “queueing theory” over “queuing theory.” Here, I use the latter for simplicity.
What Is Queuing Theory, and Why Should You Care?
In computer systems, latency is usually waiting, not doing. Requests wait for CPU time, for a DB connection, for a disk, for a lock, for a thread pool, for a downstream dependency, etc. Queuing theory is the toolkit for reasoning about that waiting:
- When does latency “blow up” as load increases?
- Which component is the bottleneck?
- How many workers/threads/servers do we need to hit an SLO?
- Why do averages look fine while p99/p999 gets ugly?
A useful mental model from systems performance practice is the “utilization knee”: as utilization approaches 100%, queues grow superlinearly and tail latency spikes.
The Basics: How Queues Map to Computer Systems
A queuing system has three parts:
- Customers: requests, jobs, packets, RPCs, tasks.
- Servers: CPU cores, threads, DB connections, GPU slots, NICs, disks.
- Queue/buffer: ready queue, thread pool backlog, socket accept queue, message broker topic, request backlog at a load balancer.
A minimal “single service station” view of many components:
```
arrivals (λ) ---> [ queue ] ---> server(s) with rate μ ---> departures
```
Key performance metrics you typically care about:
- Utilization: how busy the servers are.
- Queue length: how much work is waiting.
- Waiting time: how long work sits before service starts.
- Response time (sojourn time): waiting + service.
- Throughput: completed work per unit time.
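To make these metrics concrete, here is a minimal simulation sketch in Python of the single station above. Everything in it is illustrative (the function name is mine; Poisson arrivals and exponential service are assumed purely to keep the example short), not a production load model:

```python
import random

def simulate_station(lam: float, mu: float, n: int = 200_000, seed: int = 1):
    """Single-server FIFO station: Poisson arrivals at rate lam,
    exponential service at rate mu. Returns the metrics listed above."""
    rng = random.Random(seed)
    arrival = 0.0      # arrival time of the current request
    free_at = 0.0      # time at which the server next goes idle
    busy = 0.0         # total time the server spends serving
    total_wait = 0.0   # total time spent waiting before service starts
    total_resp = 0.0   # total response (sojourn) time: waiting + service
    for _ in range(n):
        arrival += rng.expovariate(lam)   # next inter-arrival gap
        start = max(arrival, free_at)     # FIFO: wait if the server is busy
        service = rng.expovariate(mu)
        free_at = start + service
        busy += service
        total_wait += start - arrival
        total_resp += free_at - arrival
    return {
        "utilization": busy / free_at,
        "mean_wait": total_wait / n,
        "mean_response": total_resp / n,
        "throughput": n / free_at,
        "mean_in_system": (total_resp / n) * (n / free_at),
    }

# 80 req/s into a server that can do 100 req/s: expect utilization ~0.8
print(simulate_station(lam=80.0, mu=100.0))
```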
Kendall’s Notation: A Handy Shorthand
Kendall’s notation is a concise way to name common models: A/S/C.
| Symbol | Meaning |
|---|---|
| $A$ | Inter-arrival distribution (e.g., $M$ = Poisson/exponential) |
| $S$ | Service-time distribution (e.g., $M$ = exponential, $D$ = deterministic, $G$ = general) |
| $C$ | Number of parallel servers |
Examples:
- M/M/1: Poisson arrivals, exponential service, 1 server.
- M/M/c: same, but $c$ parallel servers (also called Erlang-C). [mmc-queue]
- G/G/1: “anything goes” arrivals and service times. [gg1-queue]
Queue discipline (FIFO, priority, shortest-job-first, processor-sharing) also matters for tail latency and fairness. [queueing-theory]
Essential Formulas (With Toy Computer-Systems Examples)
1. Little’s Law (works shockingly often)
\[L = \lambda W\]
- $L$: average number of requests in the system (waiting + in service)
- $\lambda$: effective arrival rate (throughput in steady state)
- $W$: average time in system (response time)
Little’s Law is extremely general: it does not require Poisson arrivals or exponential service—just a stable, long-run average system. [littles-law]
Example: “How many requests are in-flight?”
You run a service doing $1000$ requests/sec steady-state. Average end-to-end latency is $50$ ms.
- $\lambda = 1000$ req/s
- $W = 0.050$ s
- $L = \lambda W = 1000 \times 0.050 = 50$
So, on average, ~50 requests are concurrently in-flight across your service (queued + running). That’s often directly actionable:
- If your service uses one DB connection per in-flight request, you will need O(50) connections (plus headroom).
- If your service has a max in-flight limit of 30, you will either throttle to ~600 req/s or push queuing upstream.
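A sketch of the arithmetic, including the hypothetical 30-in-flight cap from the second bullet:

```python
lam = 1000.0  # steady-state throughput, req/s
w = 0.050     # mean end-to-end latency, s

print(lam * w)  # Little's Law: ~50 requests in flight (queued + running)

# Rearranged, a hard in-flight cap bounds achievable throughput instead:
max_in_flight = 30
print(max_in_flight / w)  # ~600 req/s before queuing is pushed upstream
```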
2. Utilization and the stability condition
For $c$ identical servers with service rate $\mu$ each: \(\rho = \frac{\lambda}{c\mu}\)
Interpretation: $\rho$ is the average fraction of total server capacity that is in use.
Critical insight: if $\rho \ge 1$, the queue is unstable (it grows without bound). In real systems, “without bound” manifests as timeouts, retries, OOMs, thread exhaustion, and cascading failure.
Example: single-threaded handler
A single-threaded worker processes requests with mean service time 10 ms.
- $\mu = 1/0.010 = 100$ req/s
- If arrivals are $\lambda = 80$ req/s, then $\rho = 0.8$ (stable)
- If arrivals rise to $\lambda = 110$ req/s, then $\rho = 1.1$ (unstable)
This is why “CPU at 100%” is often a crisis: you have no slack to absorb randomness (bursts, long requests, GC pauses).
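As a sketch (the helper name and messages are mine):

```python
def utilization(lam: float, mu: float, c: int = 1) -> float:
    """rho = lam / (c * mu); rho >= 1 means the queue grows without bound."""
    rho = lam / (c * mu)
    status = "UNSTABLE" if rho >= 1 else f"stable, {1 - rho:.0%} headroom"
    print(f"rho = {rho:.2f}: {status}")
    return rho

mu = 1 / 0.010             # 10 ms mean service time -> 100 req/s capacity
utilization(80.0, mu)      # rho = 0.80: stable, 20% headroom
utilization(110.0, mu)     # rho = 1.10: UNSTABLE
```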
3. M/M/1: the simplest queue with a sharp lesson
For M/M/1, mean response time is: \(W = \frac{1}{\mu - \lambda}\) and mean waiting time in queue: \(W_q = W - \frac{1}{\mu}\)
Example: an API server with one worker thread
Same service time: 10 ms $\Rightarrow \mu = 100$ req/s.
If $\lambda = 80$ req/s:
- $W = 1/(100-80) = 1/20 = 0.05$ s = 50 ms
- Service time is 10 ms, so $W_q = 50 - 10 = 40$ ms of waiting
- By Little’s Law, $L = \lambda W = 80 \times 0.05 = 4$ requests in system, on average
Operational takeaway: Even at 80% utilization, average response time is 5× the service time (because most time is waiting). As $\rho \rightarrow 1$, $W$ diverges rapidly—this is the “latency knee.”
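The knee is easy to see numerically. A sketch sweeping utilization for the same 10 ms worker:

```python
mu = 100.0  # req/s (10 ms mean service time)

# M/M/1 mean response time W = 1 / (mu - lam), at increasing utilization
for rho in (0.5, 0.8, 0.9, 0.95, 0.99):
    lam = rho * mu
    w_ms = 1000.0 / (mu - lam)
    print(f"rho={rho:.2f}  W={w_ms:7.1f} ms  ({w_ms / 10:5.1f}x service time)")
```

At $\rho = 0.5$, $W$ is 2× the service time; at $\rho = 0.99$ it is 100×.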
4. M/M/c (Erlang-C): adding workers reduces waiting—sometimes dramatically
For an M/M/c queue, the probability an arrival must wait (all servers busy) is given by the Erlang-C formula (often written in terms of offered load $a=\lambda/\mu$ and utilization $\rho=\lambda/(c\mu)$). [mmc-queue]
Rather than stare at the algebra, use it to answer a capacity question.
Example: sizing a thread pool / worker pool
Suppose:
- Arrival rate: $\lambda = 200$ req/s
- Each worker averages 12.5 ms per request $\Rightarrow \mu = 80$ req/s per worker
Case A: 3 workers
- $\rho = 200/(3 \times 80) = 0.833$
Erlang-C results (for these numbers):
- $P(\text{wait}) \approx 0.70$
- Mean queuing delay $W_q \approx 17.6$ ms
- Mean response time $W \approx 30.1$ ms
Case B: 4 workers
- $\rho = 200/(4 \times 80) = 0.625$
Now:
- $P(\text{wait}) \approx 0.32$
- $W_q \approx 2.7$ ms
- $W \approx 15.2$ ms
Interpretation: Adding one worker (3 → 4) increased capacity by only 33%, but it cut mean response time roughly in half and mean queuing delay by more than 6×, by pulling utilization away from the knee and collapsing the queue.
This is a recurring theme in real services: once you are near saturation, small capacity changes can create huge latency changes.
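The formula itself is short to code. A sketch that reproduces the numbers above (the function name is mine):

```python
from math import factorial

def erlang_c(lam: float, mu: float, c: int):
    """M/M/c: return (P(wait), mean Wq, mean W) via the Erlang-C formula."""
    a = lam / mu                      # offered load
    rho = a / c                       # utilization, must be < 1
    assert rho < 1, "unstable: lam >= c * mu"
    top = (a ** c / factorial(c)) / (1 - rho)
    p_wait = top / (sum(a ** k / factorial(k) for k in range(c)) + top)
    wq = p_wait / (c * mu - lam)      # mean time spent queued
    return p_wait, wq, wq + 1 / mu    # W = Wq + mean service time

for c in (3, 4):
    p, wq, w = erlang_c(lam=200.0, mu=80.0, c=c)
    print(f"c={c}: P(wait)={p:.2f}  Wq={wq * 1e3:.1f} ms  W={w * 1e3:.1f} ms")
# c=3: P(wait)=0.70  Wq=17.6 ms  W=30.1 ms
# c=4: P(wait)=0.32  Wq=2.7 ms   W=15.2 ms
```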
5. Variability matters (a lot): the VUT / Kingman approximation
Real workloads are rarely “memoryless exponential everything.” Variability in arrivals (bursts) and service (slow queries, GC, cache misses) is what inflates tail latency.
A common, practical approximation for G/G/1 is Kingman’s formula (often called the VUT equation): [kingman-formula] \(\mathbb{E}[W_q] \approx \left(\frac{\rho}{1-\rho}\right)\left(\frac{c_a^2 + c_s^2}{2}\right)\tau\)
- $\rho$: utilization
- $c_a$: coefficient of variation of inter-arrival times
- $c_s$: coefficient of variation of service times
- $\tau$: mean service time
Example: same mean service time, different variability
Let:
- $\rho = 0.8$
- $\tau = 10$ ms
- arrivals roughly Poisson $\Rightarrow c_a \approx 1$
Scenario 1: highly variable service ($c_s = 1$)
- $\mathbb{E}[W_q] \approx (0.8/0.2) \times ((1^2+1^2)/2) \times 10 \text{ ms}$
- $= 4 \times 1 \times 10 = 40$ ms
Scenario 2: more predictable service ($c_s = 0.2$)
- $\mathbb{E}[W_q] \approx 4 \times ((1 + 0.04)/2) \times 10$
- $= 4 \times 0.52 \times 10 = 20.8$ ms
Same utilization and same mean service time, yet taming service-time variability ($c_s$ from 1 down to 0.2) roughly halves queuing delay; the residual wait comes from arrival variability, since $c_a \approx 1$ still contributes.
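A sketch of both scenarios:

```python
def kingman_wq(rho: float, ca: float, cs: float, tau: float) -> float:
    """Mean G/G/1 queue wait via Kingman's VUT approximation."""
    return (rho / (1 - rho)) * ((ca ** 2 + cs ** 2) / 2) * tau

for cs in (1.0, 0.2):  # highly variable vs. predictable service
    wq = kingman_wq(rho=0.8, ca=1.0, cs=cs, tau=10.0)
    print(f"cs={cs}: Wq ~= {wq:.1f} ms")
# cs=1.0: Wq ~= 40.0 ms
# cs=0.2: Wq ~= 20.8 ms
```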
Operational takeaway: performance work that reduces variance (caching, eliminating stop-the-world pauses, bounding query time, isolating noisy neighbors) often improves tail latency more than shaving a millisecond off the mean. [kingman-formula]
Queues in Real Computer Systems: Patterns You’ll Actually See
1) A queue at every boundary
Common “queue boundaries” include:
- Load balancer backlog (requests waiting to be accepted)
- Thread pool queue (waiting for a worker)
- DB connection pool (waiting for a connection)
- Downstream service (RPCs queued behind other callers)
- Disk / network (I/O queues)
A key practice: apply Little’s Law per boundary to estimate concurrency needs and identify bottlenecks. [littles-law]
2) Bottlenecks dominate end-to-end latency
If a request must traverse multiple queues (a small queuing network), the slowest / most utilized stage tends to dominate waiting. Queuing networks are a standard lens for multi-tier services. [queueing-theory]
3) Queue discipline changes p99 behavior
FIFO is fair-ish, but it lets one slow job delay many fast ones. Alternative disciplines:
- Priority: protect latency-sensitive traffic (but risk starving background jobs). [queueing-theory]
- Shortest-job-first / SRPT: improves mean latency; may be unfair without safeguards. [queueing-theory]
- Processor sharing: approximates time-slicing, common in some network and CPU models. [queueing-theory]
This is why “just add a queue” is not neutral; it encodes policy.
Practical Modeling Workflow (Minimal but Useful)
- Pick the boundary you're modeling (e.g., DB connection pool).
- Measure:
  - $\lambda$: arrival/throughput into that boundary
  - $\tau$: mean service time
  - Variability proxies: p50/p95/p99 service times; burstiness of arrivals
- Compute utilization $\rho$ and check headroom.
- Use a simple model first:
  - M/M/1 or M/M/c for "first-order" capacity decisions. [mmc-queue]
  - Kingman (G/G/1) when variability is clearly the story. [kingman-formula]
- Validate against production (does the predicted $L$ match the observed in-flight count? Does the predicted knee match where p99 explodes?).
The goal is not perfect fidelity; it is a defensible mental model that prevents naive mistakes (like running everything at 95–99% utilization and being surprised by timeouts).
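As a sketch, the whole workflow for a single-server boundary fits in one helper (the function name and example numbers are hypothetical; the variability term is the Kingman correction from section 5):

```python
def first_order_model(lam: float, tau: float, ca: float = 1.0, cs: float = 1.0):
    """Measured arrival rate lam (req/s), mean service time tau (s), and
    variability (ca, cs) -> utilization plus first-order wait estimates."""
    rho = lam * tau                            # single server: rho = lam / mu
    if rho >= 1:
        return {"rho": rho, "verdict": "unstable: add capacity or shed load"}
    wq_mm1 = rho / (1 - rho) * tau             # M/M/1 baseline (ca = cs = 1)
    wq_gg1 = wq_mm1 * (ca ** 2 + cs ** 2) / 2  # Kingman variability correction
    return {"rho": rho, "wq_mm1": wq_mm1, "wq_gg1": wq_gg1,
            "w_gg1": wq_gg1 + tau}             # predicted response time

# e.g., a DB connection pool boundary: 60 req/s, 12 ms mean query, bursty
print(first_order_model(lam=60.0, tau=0.012, ca=1.2, cs=2.0))
```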
Limitations (What the Formulas Don’t Tell You)
- Non-stationarity: diurnal traffic, deployments, incident conditions.
- Correlated arrivals: synchronized retries, fan-out bursts, cache stampedes.
- Heavy-tailed service times: long-tail queries or rare slow paths can dominate p99/p999.
- Finite buffers and backpressure: real queues drop, throttle, or shed load.
- Retries change $\lambda$: a timeout often creates more work, not less.
Queuing theory still helps here, but you often combine it with measurement, tracing, and load testing.
Wrapping It All Up
If you remember only a few points:
- Little’s Law turns latency and throughput into concurrency: $L=\lambda W$. [littles-law]
- Utilization is destiny: near 100%, queues (and tail latency) explode.
- Variability inflates waiting: controlling variance can be more valuable than improving the mean. [kingman-formula]
- Adding capacity near the knee can cut latency disproportionately (Erlang-C intuition). [mmc-queue]
References
- Little’s Law (background). [littles-law]
- System performance and queueing intuition (talk). [srecon24]
- M/M/c (Erlang-C) model (background). [mmc-queue]
- Queueing theory in practice for performance modeling (talk). [lisa17]
- Kingman’s approximation / VUT equation (background). [kingman-formula]
- Harchol-Balter, M. (2013). Performance Modeling and Design of Computer Systems: Queueing Theory in Action. Cambridge University Press.
- Gross, D., & Harris, C. M. (1998). Fundamentals of Queueing Theory. John Wiley & Sons, Inc.
- Kleinrock, L. (1975). Queueing Systems, Volume 1: Theory. John Wiley & Sons, Inc.
- Sutton, C., & Jordan, M. I. (2010). Bayesian inference for queueing networks and modeling of internet services. [sutton-jordan2010]