Sigmoda Blog

Shadow evaluations: compare prompts/models on real traffic without risking users

2/4/2026·12 min read

Most LLM regressions ship because teams evaluate the wrong thing. They test prompts on a handful of examples, flip the switch, and only then learn what production traffic really looks like.

Shadow evaluation (sometimes called shadow traffic or dark launch) is the pragmatic fix: run the candidate prompt/model on real production inputs, log the result, but do not show it to the user. You learn what would have happened without taking the risk.

Canary vs shadow: pick the right tool

Canaries change user-visible behavior for a small percent of traffic. Shadow mode does not change user-visible behavior at all. If you are unsure about quality, start in shadow. If you are confident and optimizing for rollout speed, canary.

  • Shadow: safest, but costs extra (you are running additional calls).
  • Canary: cheaper than shadow, but it can harm users if the variant is bad.
  • Offline eval: cheapest, but it rarely matches production distribution.

Step 1: add experiment metadata you can actually query

Shadow mode only works if you can join control and candidate events and filter them cleanly. Use consistent metadata fields for every run.

JSON
// Suggested event metadata for shadow evals
{
  "route": "support.reply",
  "env": "prod",
  "request_id": "7b2c7d0e-2f3e-4a7e-a2bd-8c0f2a9d8c93",
  "experiment": "support.reply.shadow-2026-02-04",
  "variant": "control", // or "shadow"
  "is_shadow": false,   // true for the shadow run
  "prompt_version": "support.reply@2026-02-04.1"
}

One rule

Pick one join key (request_id, trace_id, or parent_event_id) and use it everywhere. If control and shadow cannot be joined deterministically, your analysis turns into guesswork.
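To make the rule concrete, here is a minimal sketch of a deterministic join on request_id, assuming a simplified event shape (the Event type and field names are illustrative, not a real SDK):

```typescript
// Hypothetical event shape: only the fields needed for the join.
type Variant = "control" | "shadow";
type Event = {
  metadata: { request_id: string; variant: Variant };
  output_text: string;
};

// Group events by request_id so each entry holds the control/shadow pair.
function pairByRequestId(
  events: Event[]
): Map<string, { control?: Event; shadow?: Event }> {
  const pairs = new Map<string, { control?: Event; shadow?: Event }>();
  for (const e of events) {
    const entry = pairs.get(e.metadata.request_id) ?? {};
    entry[e.metadata.variant] = e;
    pairs.set(e.metadata.request_id, entry);
  }
  return pairs;
}
```

Entries missing one side (e.g. a shadow call that errored out) surface naturally as pairs with an undefined half, which is itself a useful stability signal.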

Step 2: run shadow calls without hurting p95

Shadow calls should not add latency to the user path. Run them async, cap their budgets, and sample aggressively. The goal is signal, not perfection.

TypeScript
// Pseudocode: control is returned to the user; shadow runs asynchronously.
async function handleSupportReply(input: string) {
  const requestId = crypto.randomUUID();

  const control = await runLLM({
    model: "gpt-5-mini",
    input,
    metadata: {
      route: "support.reply",
      env: "prod",
      request_id: requestId,
      experiment: "support.reply.shadow-2026-02-04",
      variant: "control",
      is_shadow: false,
    },
  });

  // 1% sample; fire-and-forget; cap output tokens to limit cost.
  if (Math.random() < 0.01) {
    void runLLM({
      model: "gpt-5.2",
      input,
      max_output_tokens: 256,
      metadata: {
        route: "support.reply",
        env: "prod",
        request_id: requestId,
        experiment: "support.reply.shadow-2026-02-04",
        variant: "shadow",
        is_shadow: true,
      },
    }).catch(() => {});
  }

  return control.output_text;
}

Cost reality check

Shadow mode can double spend if you run it on 100% traffic. Start at 0.5-2% sampling, clamp output tokens, and expand only when you know what you are paying for.
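A quick back-of-envelope helps before turning the dial. This sketch computes the fractional spend increase from shadow sampling; all the per-call cost numbers below are assumptions, not real pricing:

```typescript
// Fractional increase in spend over the control-only baseline.
// sampleRate: e.g. 0.01 for 1% sampling.
// shadowCostPerCall: candidate model, with clamped output tokens.
// controlCostPerCall: the model you already pay for on every request.
function shadowSpendIncrease(
  sampleRate: number,
  shadowCostPerCall: number,
  controlCostPerCall: number
): number {
  return (sampleRate * shadowCostPerCall) / controlCostPerCall;
}
```

At 1% sampling with a candidate that costs roughly 3x the control per call, the increase is about 3% of baseline spend; at 100% sampling the same candidate would triple it.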

Shadow must be side-effect free

If your LLM route can take actions (tools, function calls, emails, refunds, database writes), a naive shadow run is dangerous. In shadow mode, the candidate must never perform side effects. Run it with tools disabled, or route tool calls to a dry-run stub that only records what it would have done.

  • Disable tools/actions entirely for shadow runs when possible.
  • If you must keep tools enabled, make them no-op in shadow and log intended actions.
  • Never let a shadow run send user-visible messages (email, push, Slack).
TypeScript
// Pseudocode: disable tools in shadow so the candidate cannot take actions.
const isShadow = metadata.is_shadow === true;

await runLLM({
  model: "gpt-5.2",
  input,
  tools: isShadow ? [] : tools,
  metadata,
});

Step 3: compare the right metrics (not vibes)

Your first pass should be quantitative and boring: cost, latency, error rate, and guardrail signals. Then do a small, focused human review for true quality.

  • Cost: avg and p95 cost_estimate by variant.
  • Latency: p50/p95 duration_ms by variant.
  • Stability: error rate and retry rate by variant.
  • Guardrails: flagged rate, too_long, contains_banned deltas.
  • Output length: tokens_out distribution (often correlates with user annoyance and cost).
SQL
-- Example rollup if you have DB access (Postgres + JSONB metadata)
SELECT
  metadata->>'variant' AS variant,
  COUNT(*) AS n,
  AVG(cost_estimate) AS avg_cost,
  percentile_cont(0.95) WITHIN GROUP (ORDER BY duration_ms) AS p95_ms,
  AVG(CASE WHEN status = 'flagged' THEN 1 ELSE 0 END) AS flagged_rate
FROM events
WHERE metadata->>'experiment' = 'support.reply.shadow-2026-02-04'
  AND metadata->>'env' = 'prod'
GROUP BY 1
ORDER BY n DESC;

Step 4: do paired human review (fast and fair)

Aggregates do not catch tone drift, missing citations, or subtle failure patterns. The fix is paired review: show a human both answers for the same request without revealing which is control.

  1. Sample 30-60 request_ids from production shadow traffic.
  2. Render control and shadow outputs side-by-side, randomized.
  3. Label: wins/losses/ties + a short note (why).
  4. Look for pattern failures (not one-off weirdness).
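The randomization in step 2 is worth getting right: if control always renders on the left, reviewers learn the pattern. A minimal sketch of a blind review item, with a hidden answer key (the ReviewItem shape and field names are assumptions for illustration):

```typescript
type Side = "control" | "shadow";

type ReviewItem = {
  request_id: string;
  left: string;
  right: string;
  // Which output is on which side; store this, but never show it to reviewers.
  key: { left: Side; right: Side };
};

// Coin-flip which side shows control; the flip can be injected for testing.
function makeBlindItem(
  requestId: string,
  controlOut: string,
  shadowOut: string,
  flip: boolean = Math.random() < 0.5
): ReviewItem {
  return flip
    ? { request_id: requestId, left: shadowOut, right: controlOut, key: { left: "shadow", right: "control" } }
    : { request_id: requestId, left: controlOut, right: shadowOut, key: { left: "control", right: "shadow" } };
}
```

After labeling, join verdicts back through the key to count shadow wins and losses.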

Make it repeatable

If every evaluation is a bespoke spreadsheet, you will not do it consistently. Standardize labels and store them with the event so you can trend quality over time.
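One way to standardize is a fixed label record attached to the event; the field names here are a suggestion, not a required schema:

```typescript
// A standardized paired-review label, stored alongside the event it judges.
type PairedReviewLabel = {
  request_id: string;
  experiment: string;
  verdict: "win" | "loss" | "tie"; // shadow relative to control
  note: string;                    // short "why", for spotting pattern failures later
  reviewer: string;
  reviewed_at: string;             // ISO 8601 timestamp
};

const example: PairedReviewLabel = {
  request_id: "7b2c7d0e-2f3e-4a7e-a2bd-8c0f2a9d8c93",
  experiment: "support.reply.shadow-2026-02-04",
  verdict: "win",
  note: "cites the right KB article; control hallucinated a URL",
  reviewer: "alice",
  reviewed_at: "2026-02-05T10:30:00Z",
};
```

With the same fields every time, win rates become a query instead of a spreadsheet archaeology session.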

Step 5: promote the winner with a rollback plan

Shadow mode should end with a decision. If the candidate wins (or is an acceptable tradeoff), move to a canary. If it loses, keep the data and iterate with a new prompt/model version.

  • Promote: 1% canary -> 10% -> 50% -> 100%, with stop conditions.
  • Rollback: keep a last-known-good prompt_version and model as a one-click switch.
  • Record: experiment name, decision, and notes so future you knows why.
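The rollout plan above can be written down as data rather than tribal knowledge. A sketch, where the stage durations, thresholds, and the rollback prompt_version are all illustrative placeholders:

```typescript
type Stage = { trafficPct: number; minHours: number };
type StopCondition = {
  metric: "p95_ms" | "flagged_rate" | "error_rate";
  maxDelta: number; // candidate minus control; exceeding this halts the rollout
};

const rolloutPlan = {
  experiment: "support.reply.shadow-2026-02-04",
  stages: [
    { trafficPct: 1, minHours: 24 },
    { trafficPct: 10, minHours: 24 },
    { trafficPct: 50, minHours: 24 },
    { trafficPct: 100, minHours: 0 },
  ] as Stage[],
  stop: [
    { metric: "flagged_rate", maxDelta: 0.005 },
    { metric: "p95_ms", maxDelta: 500 },
  ] as StopCondition[],
  // Last-known-good, hypothetical values: the one-click rollback target.
  rollback: { prompt_version: "support.reply@known-good", model: "gpt-5-mini" },
};

// True when any candidate-vs-control delta exceeds its allowed threshold.
function shouldHalt(
  deltas: Record<string, number>,
  stop: StopCondition[]
): boolean {
  return stop.some((s) => (deltas[s.metric] ?? 0) > s.maxDelta);
}
```

Evaluating shouldHalt at each stage boundary keeps the promote/rollback decision mechanical instead of a judgment call made under pressure.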

A checklist you can paste into a PR

Markdown
## Shadow eval checklist
- Route:
- Experiment:
- Control: model + prompt_version
- Candidate: model + prompt_version
- Join key (request_id/trace_id):
- Sampling %:
- Budget clamps (max_output_tokens, timeout):
- Side effects: tools disabled / dry-run stub
- Metrics to compare (cost, p95 latency, flagged rate):
- Paired review size + result:
- Decision + rollout plan + rollback knob:

Shadow evaluations are not fancy. They are a way to stop shipping regressions by learning from production traffic before users do. Once you have the metadata and a repeatable review loop, model and prompt upgrades become boring again.

Tags: evals, release, reliability