Sigmoda Blog
LLM guardrails that don’t break shipping velocity
Week one is all momentum: you get something working, you ship, you high‑five. Week two is when reality shows up. A user pastes something weird. A prompt change doubles latency. Someone asks, “Wait… are we storing this?”
The reflex is to bolt on a “policy layer” above the app. It sounds responsible, but it often turns into a second product to maintain: exceptions pile up, people route around it, and the rules become a source of noise instead of safety.
A simpler approach: put guardrails where the data already is—your LLM event pipeline. If you’re already logging route, tokens, latency, and output, you’re sitting on enough signal to start. Begin with cheap checks, make them debuggable, then tighten over time.
What you’re aiming for
You’re not trying to make the model “perfect.” You’re trying to make failures obvious and containable: you can spot regressions quickly, see the blast radius (which route, which env, which tier), and ship a targeted fix instead of freezing the whole roadmap.
That only works if guardrails match how engineers debug: scoped by route (or feature), environment, tier, and release. Global rules sound tidy, but they usually become false‑positive factories—and once people stop trusting the signal, you’ve lost.
Start with two boring limits
If you only ship two guardrails at the start, make them boring: max output tokens and max duration. They’re easy to reason about, easy to tune, and they stop the two failure modes users notice immediately: “why is this so slow?” and “why won’t it stop talking?”
- Cap `tokens_out` per route (or per feature if that’s how your app is structured).
- Cap `duration_ms`. If you can’t hard‑enforce yet, alert when p95 drifts upward.
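If it helps to see the shape, here’s a minimal sketch of those two checks in TypeScript. The `RouteBudget` type and the field names are illustrative, not Sigmoda’s API:

// Minimal per-route budget check (names are illustrative).
type RouteBudget = { maxOutputTokens: number; maxDurationMs: number };
type LlmEvent = { route: string; tokensOut: number; durationMs: number };

// Returns a list of flag reasons; empty means the event is within budget.
function checkBudgets(event: LlmEvent, budget: RouteBudget): string[] {
  const reasons: string[] = [];
  if (event.tokensOut > budget.maxOutputTokens) {
    reasons.push(`tokens_out ${event.tokensOut} > cap ${budget.maxOutputTokens}`);
  }
  if (event.durationMs > budget.maxDurationMs) {
    reasons.push(`duration_ms ${event.durationMs} > cap ${budget.maxDurationMs}`);
  }
  return reasons;
}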
Pick initial values using your own traffic. If your p95 output is ~250 tokens, don’t set the cap to 4,000 “just in case.” That’s how you pay for a bad day. A cap around 2–3× p95 is usually generous without being reckless.
Practical default
A good starting point: max output tokens ≈ 2–3× p95 and max duration ≈ 2× p95. Tighten slowly. A ceiling you can defend in code review is more valuable than a “perfect” number nobody trusts.
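As a sketch, turning observed p95s into starting caps is one multiplication per budget. The multipliers mirror the rule of thumb above; the function name is made up:

// Turn observed p95s into starting budgets.
function defaultBudget(p95TokensOut: number, p95DurationMs: number) {
  return {
    maxOutputTokens: Math.ceil(p95TokensOut * 2.5), // ~2–3× p95
    maxDurationMs: Math.ceil(p95DurationMs * 2),    // ~2× p95
  };
}

// e.g. a p95 of ~250 output tokens and ~3,000 ms gives caps of 625 tokens and 6,000 ms.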
Scope rules to the surface area you ship
Once budgets exist, the next problem is drift. A route that used to be cheap slowly accumulates context. Someone “temporarily” upgrades the model during an incident. The incident ends. The temporary fix becomes the default.
Avoid drift by making model choice an explicit policy, not a code accident. An allowlist per route is enough. The important part is logging the decision with the event so you can answer “why is this expensive now?” without spelunking through commits.
// Example policy shape (config, not hard-coded)
{
  "route": "support.reply",
  "allowedModels": ["small-model", "large-model"],
  "maxOutputTokens": 700,
  "maxDurationMs": 8000
}
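And a sketch of applying that policy at call time, recording the routing decision on the event so overrides stay visible. The types and names here are illustrative, not Sigmoda’s API:

// Apply a route policy and record the decision for the event log.
type RoutePolicy = {
  route: string;
  allowedModels: string[];
  maxOutputTokens: number;
  maxDurationMs: number;
};

function chooseModel(requested: string, policy: RoutePolicy) {
  const allowed = policy.allowedModels.includes(requested);
  return {
    model: allowed ? requested : policy.allowedModels[0], // fall back to the first allowed model
    routingDecision: allowed ? `allowed:${requested}` : `denied:${requested}`, // log this with the event
  };
}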
Prefer checks you can explain (at first)
Early on, avoid “magic” safety scores that nobody can explain at 2am. Start with checks you can explain in a PR review: scoped banned phrases, obvious secret patterns, and redaction.
- Keep banned phrase lists scoped. Global lists start clean and end noisy.
- Add patterns for obvious leakage: API keys, OAuth tokens, private URLs, internal hostnames.
- Redact before storing where you can. If you can’t, mark and quarantine the event.
The point isn’t to catch everything. It’s to catch the predictable failures immediately and build a foundation you can extend later (classifiers, LLM‑as‑judge, richer policy) once you have real data to tune against.
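A rough sketch of what “checks you can explain” can look like in code. The phrase list and patterns are examples only; you’d scope and extend them for your own routes:

// Explainable checks: scoped banned phrases and obvious secret patterns.
const BANNED_PHRASES: Record<string, string[]> = {
  "support.reply": ["as an ai language model"], // scoped to one route, not global
};

const SECRET_PATTERNS: RegExp[] = [
  /sk-[A-Za-z0-9]{20,}/g,          // API-key-shaped strings
  /AKIA[0-9A-Z]{16}/g,             // AWS access key IDs
  /https?:\/\/\S*\.internal\S*/g,  // internal hostnames
];

function redactAndFlag(route: string, output: string) {
  const reasons: string[] = [];
  let redacted = output;

  for (const pattern of SECRET_PATTERNS) {
    const next = redacted.replace(pattern, "[REDACTED]");
    if (next !== redacted) {
      reasons.push(`secret_pattern:${pattern.source}`);
      redacted = next;
    }
  }
  for (const phrase of BANNED_PHRASES[route] ?? []) {
    if (redacted.toLowerCase().includes(phrase)) {
      reasons.push(`banned_phrase:${phrase}`);
    }
  }
  // Store `redacted`; if redaction isn't possible, mark and quarantine instead.
  return { redacted, reasons };
}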
Flag first. Earn enforcement.
If you block outputs on day one, you’ll either (a) ship nothing, or (b) disable the guardrail the first time it’s annoying. A better pattern is: flag loudly, review quickly, and only then enforce the small subset of rules that keep proving themselves.
The feedback loop matters more than the rule. In Sigmoda today we mark events as `flagged` when they exceed a budget or hit a route’s banned phrase list, then let humans label outputs (`is_bad` + a short `note`). That turns “I think we have a problem” into something you can verify, fix, and later automate.
- Phase 1: Log only (no user impact).
- Phase 2: Flag + review (internal queue, fast labeling).
- Phase 3: Soft block (fallback response + clear messaging).
- Phase 4: Hard block (only when the false-positive rate is low).
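One way to encode those phases is a per-route enforcement mode, so tightening is a config change rather than a rewrite. This is a sketch with made-up names, building on the `reasons` list from the checks above:

// Enforcement phases as a per-route mode.
type EnforcementMode = "log" | "flag" | "soft_block" | "hard_block";

const FALLBACK_RESPONSE = "Sorry, we couldn't generate a reply for this request.";

function applyEnforcement(mode: EnforcementMode, reasons: string[], output: string) {
  if (reasons.length === 0) return { output, flagged: false, reasons };
  switch (mode) {
    case "log":        return { output, flagged: false, reasons };                   // phase 1: record only
    case "flag":       return { output, flagged: true, reasons };                    // phase 2: internal review queue
    case "soft_block": return { output: FALLBACK_RESPONSE, flagged: true, reasons }; // phase 3: fallback + clear messaging
    case "hard_block": return { output: null, flagged: true, reasons };              // phase 4: only with a low false-positive rate
  }
}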
“Flag + review” isn’t a compromise. It’s how you avoid turning guardrails into a permanent argument between product and safety.
The underrated requirement: a kill switch
If mitigation requires a code deploy, it won’t get used at 2am. A route‑level kill switch (feature flag or config) lets on‑call stop the bleeding immediately, then follow up with a clean fix in daylight hours.
If you add a kill switch, record its state in event metadata. Otherwise you’ll waste time asking “why do outputs look different today?”
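A sketch of what that can look like: a route-level flag read from config, with its state written into event metadata. The names are illustrative:

// Route-level kill switch, recorded on the event.
type KillSwitches = Record<string, boolean>; // route -> disabled?

function routeKilled(route: string, killSwitches: KillSwitches, eventMetadata: Record<string, unknown>): boolean {
  const killed = killSwitches[route] === true;
  eventMetadata["kill_switch"] = killed; // answers "why do outputs look different today?"
  return killed; // if true, the caller returns a canned fallback instead of calling the model
}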
What to log (so guardrails are debuggable)
You don’t need a complex system. You need consistent fields so every guardrail has the same context. At minimum, capture:
- Route (or feature), environment, and customer tier.
- Model + provider (and any routing decision you made).
- Tokens in/out + duration (so budgets are enforceable).
- A “flag reason” when something is out of policy (not just “flagged”).
- A stable way to label outputs (good/bad + short note).
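Concretely, that might look like one event shape that every guardrail reads and writes. The field names here are illustrative, not a required schema:

// One event shape, so every guardrail sees the same context.
interface GuardrailEvent {
  route: string;               // or feature
  env: "dev" | "staging" | "prod";
  tier: string;                // customer tier
  model: string;
  provider: string;
  routingDecision?: string;    // why this model was chosen (allowlist, fallback, override)
  tokensIn: number;
  tokensOut: number;
  durationMs: number;
  flagged: boolean;
  flagReasons: string[];       // "tokens_out over budget", not just "flagged"
  label?: { isBad: boolean; note: string }; // human review: good/bad + short note
}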
A quick litmus test
If a guardrail creates more questions than answers, it’s not ready. A good guardrail doesn’t just detect a problem; it makes the next action obvious: what changed, who’s affected, and what knob you can turn right now.