Sigmoda Blog
Incident runbooks for LLM products
On‑call for LLM features is stressful in a specific way: the output is the bug. You don’t get a neat stack trace. You get “it answered something unsafe,” “it got slow,” or “why did this route cost $X this hour?”
That’s why you want a runbook. Not for process theatre—because when the pager goes off, you want a short, boring path to: stop the bleeding, understand the blast radius, and ship a targeted fix.
The three incident classes (and what to measure)
- Quality regressions: hallucinations, policy violations, unsafe content, prompt leakage.
- Performance/cost spikes: latency jumps, token explosion, retry loops, increased volume.
- Provider issues: elevated 5xx/429s, degraded streaming, regional outages, model deprecations.
Instrumenting a few fields ahead of time makes all three classes debuggable: route, env, model, tokens in/out, duration, and a simple status/flag signal.
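A minimal sketch of that per-call event, assuming one structured log record per LLM call (the field names and logging sink here are illustrative, not a required schema):
# One record per LLM call; field names are illustrative, match them to your own logging stack.
from dataclasses import dataclass, asdict
import json, time

@dataclass
class LLMEvent:
    route: str        # e.g. "support.reply"
    env: str          # "prod" / "staging"
    model: str        # the model that actually served the call
    tokens_in: int
    tokens_out: int
    duration_ms: int
    status: str       # "ok" | "flagged" | "error"

def log_event(event: LLMEvent) -> None:
    # Structured JSON makes the incident-time filters below cheap to run.
    print(json.dumps({"ts": time.time(), **asdict(event)}))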
First 10 minutes: stop the bleeding
- Confirm the symptom: is it quality, latency/cost, or provider?
- Identify blast radius: which route(s), which env(s), which customer tier(s)?
- Flip the kill switch for the affected route if user impact is high (see the kill-switch sketch after this list).
- If provider errors: enforce retry limits and fail fast with a safe fallback message.
- Capture a small set of example event IDs (5–10) for the incident thread.
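For the kill-switch and fail-fast steps above, even a config flag plus a canned reply is enough. A minimal sketch, assuming a per-route flag you can flip without a deploy (the in-memory dict and the call_llm callable are placeholders):
# Per-route kill switch: when flipped, skip the model call and return a safe canned reply.
# In production the flags would live in your config service or feature-flag system.
KILL_SWITCH = {"support.reply": False}
SAFE_FALLBACK = "Sorry, this feature is temporarily unavailable. Please try again shortly."

def handle_request(route: str, prompt: str, call_llm) -> str:
    if KILL_SWITCH.get(route, False):
        return SAFE_FALLBACK  # stop the bleeding: no model call, no retries
    try:
        return call_llm(prompt)
    except Exception:
        # Provider trouble: fail fast with a safe message rather than hanging the request.
        return SAFE_FALLBACK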
Triage: a checklist per incident type
1) Quality regression (bad outputs)
- Check recent deploys and prompt changes for the affected route.
- Filter events to the last 30–60 minutes; compare flagged rate vs baseline.
- Sample 20–30 outputs; label good/bad and note the failure pattern.
- If the pattern is narrow (e.g., a banned phrase), patch guardrails immediately (stopgap sketch below).
- If the pattern is broad, downshift the feature: shorten context, switch to a safer prompt, or temporarily disable the route.
// Useful filters to have in your UI or logs:
(status = "flagged"
  OR contains_banned = true
  OR tokens_out > threshold)
AND metadata.route = "support.reply"
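If the pattern really is that narrow, the stopgap can be a post-generation check wrapped around the route's response. A sketch, assuming a regex-based phrase list (the phrases and refusal text are placeholders):
import re

# Stopgap guardrail: block outputs containing known-bad phrases until the real prompt fix ships.
BANNED_PHRASES = [r"internal use only", r"system prompt"]  # placeholders for the phrases you found
_BANNED = re.compile("|".join(BANNED_PHRASES), re.IGNORECASE)

def check_output(text: str) -> tuple[bool, str]:
    # Returns (ok, text); swaps in a safe refusal if a banned phrase slipped through.
    if _BANNED.search(text):
        return False, "Sorry, I can't help with that request."
    return True, text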
status = "flagged"
OR contains_banned = true
OR tokens_out > threshold
AND metadata.route = "support.reply"2) Latency spike
- Compare p95 duration by route + model (is one model slow or everything?).
- Check tokens-in/out distributions: token inflation often drives latency.
- Look for retry loops (client retries + server retries can multiply).
- If available, reroute to a faster model/provider for the affected route.
- Clamp max output tokens temporarily; it often stabilizes the system quickly.
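A minimal sketch of those last two mitigations, assuming per-route request overrides you control (the route names, model names, and request shape are placeholders, not a specific provider SDK):
# Temporary per-route overrides during an incident: cap output tokens and optionally swap the model.
ROUTE_OVERRIDES = {
    "support.reply": {"max_output_tokens": 256, "model": "smaller-faster-model"},  # placeholder values
}

def build_request(route: str, prompt: str, default_model: str, default_max_tokens: int = 1024) -> dict:
    override = ROUTE_OVERRIDES.get(route, {})
    return {
        "model": override.get("model", default_model),
        "prompt": prompt,
        # Clamping output tokens bounds worst-case latency (and cost) while you debug the real cause.
        "max_tokens": override.get("max_output_tokens", default_max_tokens),
    }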
3) Cost spike
- Check call volume by route (did traffic change or did behavior change?).
- Check tokens-in/out: a prompt bug can double cost instantly.
- Check model mix: did a route switch to a more expensive model?
- Apply a budget clamp (max tokens) and/or switch to a smaller model temporarily.
- Audit top prompts by total cost in the last hour; you’ll usually find a small set dominating.
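That audit is a small aggregation if your events carry model and token counts. A sketch, assuming a list of event dicts from the last hour (the price table is illustrative, not real pricing):
from collections import defaultdict

PRICE_PER_1K = {"big-model": {"in": 0.01, "out": 0.03}}  # illustrative per-1K-token rates

def cost_by_route(events: list[dict]) -> list[tuple[str, float]]:
    # Sum estimated cost per route and return routes sorted by spend, biggest first.
    totals: dict[str, float] = defaultdict(float)
    for e in events:
        rates = PRICE_PER_1K.get(e["model"], {"in": 0.0, "out": 0.0})
        totals[e["route"]] += e["tokens_in"] / 1000 * rates["in"] + e["tokens_out"] / 1000 * rates["out"]
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)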
4) Provider incident (429/5xx)
- Confirm error codes and affected regions/models.
- Enable exponential backoff + a strict retry cap (avoid thundering herds); see the backoff sketch after this list.
- Fail fast with a clear user message rather than hanging requests.
- If you have multi-provider support, reroute critical routes first.
- Log provider response metadata so you can confirm recovery and identify recurrence.
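A minimal sketch of backoff with a strict cap, assuming a call_provider callable that raises on 429/5xx responses (all names here are placeholders):
import random, time

MAX_RETRIES = 3  # strict cap: after this, give up and let the caller return the safe fallback

def call_with_backoff(call_provider, prompt: str) -> str:
    for attempt in range(MAX_RETRIES + 1):
        try:
            return call_provider(prompt)
        except Exception:
            if attempt == MAX_RETRIES:
                raise  # fail fast instead of hanging the request indefinitely
            # Exponential backoff with jitter so synchronized clients don't stampede the provider.
            time.sleep((2 ** attempt) + random.uniform(0, 1))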
What to have before the incident
- A dashboard per route: flagged rate, p95 duration, p95 tokens_out, and total cost.
- An alert when flagged rate or p95 tokens_out jumps (regressions show up here early); a sketch of the check follows this list.
- A kill switch per route (even a config flag is fine).
- A way to label outputs quickly (good/bad + note) so incidents create data.
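The alert check itself can be a few lines: compare a recent window against a baseline window using the fields above. A sketch, assuming lists of event dicts; the 2x multipliers are arbitrary starting points to tune per route:
def flagged_rate(events: list[dict]) -> float:
    return sum(e["status"] == "flagged" for e in events) / max(len(events), 1)

def p95(values: list[int]) -> float:
    ordered = sorted(values)
    return float(ordered[int(0.95 * (len(ordered) - 1))]) if ordered else 0.0

def should_alert(recent: list[dict], baseline: list[dict]) -> bool:
    # Fire when the last window looks much worse than baseline; tune multipliers per route.
    return (
        flagged_rate(recent) > 2 * flagged_rate(baseline) + 0.01
        or p95([e["tokens_out"] for e in recent]) > 2 * p95([e["tokens_out"] for e in baseline])
    )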
What to capture for the postmortem
If you capture the right data during the incident, your fix will be smaller and you’ll prevent repeats. Here’s the minimum set that tends to pay off:
- The exact route(s) impacted and the time window.
- The model/provider used and any routing decisions.
- A sample of event IDs with labels (good/bad) and notes.
- Before/after metrics: flagged rate, p95 latency, avg tokens-in/out, cost per call.
- The mitigation actions taken (kill switch flips, model changes, token clamps).
A runbook template you can copy
# LLM Incident Runbook
## Summary
- Symptom:
- Start time:
- Affected routes:
- Affected customers:
- Current mitigation:
## Triage
- Recent deploy or prompt change? (link)
- Model/provider changes? (what/when)
- Metrics deltas (baseline vs now):
- flagged_rate:
- p95_duration_ms:
- p95_tokens_out:
- cost_per_call:
## Mitigation options
- Kill switch route:
- Clamp max tokens:
- Switch model:
- Switch provider:
- Disable retries / set cap:
## Evidence
- Example event IDs:
- Labels + notes:
## Follow-ups
- Permanent fix:
- Tests/evals to add:
- Guardrails to tighten:
Keep runbooks short and specific. If it reads like a textbook, it won’t be used. Your runbook should tell a tired engineer what to do next and what to look at to confirm they’re fixing the right thing.