RZ AI Learning

The Dispatcher: How Automated Evaluation Infrastructure Makes Multi-Agent Quality Measurable

Fourth in a series. Previous posts: (1) AI Agent Teams for Analytics — the 17-agent architecture and lifecycle. (2) Best Practices To Keep AI Agents on Track — how enforcement mechanism matters more than rule content. (3) Agent Discussion: The Quality Layer That Harness Engineering Can’t Replace — how structured agent-to-agent discussion catches judgment-dependent quality issues.

Summary

  1. You cannot improve what you cannot measure repeatably. The compliance gains reported in posts 1 and 2 (~58% → ~65% → ~77% → ~90%+) were only possible because every spec change could be evaluated against the same 24-run battery overnight, with programmatic scoring across all outputs. Without that, every spec change is an opinion.
  2. The Dispatcher compresses a 7-10 day evaluation cycle into one overnight run. Manual evaluation caps at 3-5 runs/day (human attention is the bottleneck, not compute). Automated evaluation runs 12-24/day with zero babysitting — launch before bed, analyze results next morning.
  3. Three rounds per task turns anecdotes into measurements. A rule at 90% true compliance looks identical to 100% on a single run. Three rounds per task is the minimum to distinguish systematic compliance from random variation in non-deterministic agent systems.
  4. Systematic runs surface failure patterns ad-hoc testing never finds. A 100% correlation between a specific agent routing pattern and missing deliverables — visible only across dozens of automated runs — drove a routing architecture change that would never have been discovered through one-off testing.
  5. The Dispatcher adapts to any hardware automatically. An adaptive resource monitor with SIGSTOP/SIGCONT freeze-thaw scales concurrency to available RAM — from a laptop running 1 pipeline to a server running 6+ — without configuration changes.
  6. The Dispatcher is a general pattern, not a one-off script. Any multi-agent system that needs quality measurement benefits from: queue management, controlled-comparison evaluation, adaptive resource monitoring, completion detection, and programmatic compliance scoring.

1. Introduction

The previous two posts described what to change in a multi-agent system: enforcement mechanisms for rule compliance, structured discussion for judgment quality. This post addresses a prerequisite question: how do you know your changes actually worked?

When the system is one prompt and one model call, the answer is simple: re-run the prompt, eyeball the output. When the system is eight agents coordinating across 60-90 minutes of work, producing dozens of intermediate artifacts and a final deliverable, the question becomes operationally hard. A single run consumes hours of wall-clock time and tens of gigabytes of RAM. A single observation tells you nothing about variance. Manual scoring of compliance across 16 rules and 24 runs is a full day of human labor.

We hit this wall in early 2026. We had ideas for spec improvements but no way to evaluate them at the pace ideas arrived. The bottleneck was not LLM inference, not API quota, not engineering capacity to design changes — it was the human in the loop, manually launching runs and manually comparing outputs. A spec change took 5-7 days to evaluate, and most of that time was waiting for someone to babysit the next test.

The Dispatcher is the infrastructure that removed the human from the inner loop. It is a single bash script that manages a queue of pipeline runs, launches them in parallel with adaptive resource management, detects completion, and produces structured outputs amenable to programmatic compliance scoring. With it, a spec change takes one overnight cycle to evaluate — launch before bed, analyze results next morning.

This post covers the measurement problem the Dispatcher solves, the design that makes it reliable at scale, the discovery value it provides, and the principles that generalize to any multi-agent quality program.


2. The Problem: Quality Improvement Requires Repeatable Measurement

Observation 1: A single run tells you almost nothing

Non-deterministic agent systems produce variable outputs even with identical inputs. The same pipeline, given the same question on the same spec version, will:

  • Catch a methodology issue on one run and miss it on another
  • Produce a 7-bullet TL;DR on one run and a 12-bullet TL;DR on another
  • Reach a checkpoint deadlock on one run and pass it cleanly on the next

If you measure compliance on a single run, your measurement noise is comparable to or larger than the effect you’re trying to detect. A rule at 85% true compliance has an 85% chance of looking like 100% in a single run, and a 15% chance of looking like 0%. A single run can only read 0% or 100% for a given rule, so the most likely reading is 100% for any rule above 50% and 0% for any rule below it. The measurement floor is much higher than people intuit.

Anthropic’s January 2026 evaluation engineering writeup [Anthropic 2026] arrived at the same conclusion from a different angle, and formalized the methodology we converged on independently. They recommend reporting two complementary multi-trial metrics: pass@k (probability of at least one success in k attempts — the optimistic capability ceiling) and pass^k (probability that all k trials succeed — the reliability floor). The gap between the two quantifies behavioral consistency. Their most striking validation point: an internal benchmark that initially scored a frontier model at 42% rose to 95% (+53 percentage points) once grading bugs, ambiguous task specs, and stochastic confounds were resolved. The headline 42% number was almost entirely an artifact of eval infrastructure, not agent capability. They state plainly that this kind of correction is only reachable when “each trial is isolated by starting from a clean environment” and outputs are scored systematically across many runs. Our 3-rounds-per-task design and structured per-run directories are a concrete implementation of the same principle; the Anthropic writeup is independent validation that eval infrastructure quality directly determines eval validity.
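
For reference, the two metrics have simple closed forms under an idealized assumption that is ours, not the writeup's: every trial is independent and succeeds with the same probability p. Real agent runs can be correlated, so treat this as a sketch of the relationship rather than a formula for production reporting.

```latex
\[
\text{pass@}k = 1 - (1 - p)^{k}, \qquad \text{pass}^{k} = p^{k}
\]
```

With p = 0.9 and k = 3, pass@3 = 0.999 and pass^3 = 0.729. The gap between the two is what the consistency comparison measures.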

Table 1: Single-Run vs. 3-Round Measurement Confidence

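| True Compliance Rate | Probability 1-run = 100% | Probability 1-run = 0% | 3-round 95% CI half-width (normal approximation) |
| --- | --- | --- | --- |
| 100% | 100% | 0% | ±0 |
| 90% | 90% | 10% | ±34pp |
| 75% | 75% | 25% | ±49pp |
| 50% | 50% | 50% | ±57pp |
| 25% | 25% | 75% | ±49pp |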

Three rounds per task is not statistical rigor — it is the minimum to distinguish “systematic” from “occasional.” Five or ten rounds would be tighter, but cost scales linearly and the marginal information drops rapidly past three. We picked three as the floor and have not regretted it.
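
To make the "systematic versus occasional" distinction concrete, the sketch below computes how often a rule with true compliance p shows at least one failure across n rounds. It assumes rounds are independent, which is an idealization. The function name is hypothetical and is not part of the Dispatcher.

```shell
#!/usr/bin/env bash
# Hypothetical helper, not part of the Dispatcher.
# Chance (in %) that at least one of n rounds fails for a rule whose true
# compliance is p, assuming independent rounds: 1 - p^n.
prob_any_failure_pct() {
  awk -v p="$1" -v n="$2" 'BEGIN { printf "%.1f\n", (1 - p ^ n) * 100 }'
}
```

For a rule at 90% true compliance, `prob_any_failure_pct 0.9 1` prints 10.0 and `prob_any_failure_pct 0.9 3` prints 27.1. Three rounds do not make the failure certain to appear, but they nearly triple the chance of seeing it.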

Observation 2: Controlled comparison requires identical inputs across versions

Comparing V3 to V4 compliance is only meaningful if the same questions are asked, the same evaluation criteria are applied, and the only variable is the spec change. In ad-hoc evaluation this rarely holds: the evaluator picks a different question for V4 because V3’s question “wasn’t a great test case,” or applies stricter scoring because they noticed a failure mode they hadn’t tracked before. Each comparison is contaminated by selection bias.

The Dispatcher’s task list is fixed across versions. The same 8 tasks (covering simple trend, premise refutation, experiment readout, surface comparison, multi-metric ranking, data challenge, root-cause investigation, comprehensive report) run on every version. The same scorers grep the same artifacts for the same patterns. A V3→V4 comparison reflects spec change effects, not evaluator drift.
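
A fixed task list crossed with a fixed number of rounds can be turned into a queue mechanically. The sketch below is one way to do that. The `|` separator and the function name are assumptions, and the field order follows the queue format described in Section 3.

```shell
#!/usr/bin/env bash
# Hypothetical sketch: emit one pending queue line per (task, round) pair.
# Assumed line format: task_id|round|status|start_ts|end_ts|run_dir
build_queue() {
  local rounds="$1" task round
  shift
  for task in "$@"; do
    round=1
    while [ "$round" -le "$rounds" ]; do
      # Timestamps and run directory stay empty until the job is launched.
      printf '%s|%d|pending|||\n' "$task" "$round"
      round=$((round + 1))
    done
  done
}
```

Calling `build_queue 3` with eight task IDs yields 24 lines. Because the same call is reused for every spec version, the inputs stay identical across versions by construction.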

Observation 3: Manual evaluation caps far below compute capacity

A single pipeline run takes 60-90 minutes of wall-clock time. The number of safe concurrent runs depends on available hardware — on our development server (~200GB RAM), 2-3 concurrent runs are safe, yielding 24-32 runs in a 16-hour overnight window. Machines with more or less RAM would have different caps, but the Dispatcher’s memory monitor automatically adapts to whatever hardware is available. In manual operation, the practical cap was 3-5 runs per day regardless of hardware: the bottleneck is human attention, not compute.

The Dispatcher closes that gap by operating while the human sleeps. The ratio of compute-capable runs to manual runs is roughly 6:1.

Table 2: Throughput Comparison

| Workflow Stage | Manual Time | Automated Time |
| --- | --- | --- |
| Per-run setup (terminal, question entry, config) | ~5 min | 0 (queue-driven) |
| Per-run monitoring (permission prompts, stalls, OOM) | ~10 min of active attention | 0 (auto-detection) |
| Per-run cleanup (save output, organize files) | ~3 min | 0 (organized in run directory) |
| Practical daily cap | 3-5 runs | 12-24 runs |
| Overnight capacity | 0 runs | 12-16 runs |
| Total evaluation cycle (24 runs) | 5-7 days | 1 overnight cycle |

Observation 4: Programmatic scoring requires structured artifacts

Even if 24 runs complete, scoring them manually takes a full day: open each output file, check each rule, record pass/fail in a spreadsheet. Across 16 rules × 24 runs that’s 384 assessments per evaluation cycle. The Dispatcher’s per-run directory structure (one folder per run, all artifacts inside) makes programmatic scoring feasible: a scoring script grep-walks every run directory in seconds and produces a compliance matrix.

This only works because the Dispatcher enforces directory structure. Manual runs scatter outputs across the filesystem, and programmatic scoring becomes a parsing problem rather than a measurement problem.
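
The grep-walk can be sketched in a few lines. This is an illustration, not the actual scoring script: the rules-file format (rule ID, artifact filename, and extended regex separated by tabs) and the function name are assumptions made for the example.

```shell
#!/usr/bin/env bash
# Hypothetical scorer sketch. Assumes:
#   - one sub-directory per run under the results root
#   - a rules file with lines: rule_id<TAB>artifact_filename<TAB>extended_regex
score_runs() {
  local results_root="$1" rules_file="$2"
  local tab run_dir rule_id artifact pattern verdict
  tab="$(printf '\t')"
  for run_dir in "$results_root"/*/; do
    [ -d "$run_dir" ] || continue
    while IFS="$tab" read -r rule_id artifact pattern; do
      [ -n "$rule_id" ] || continue
      # A missing artifact counts as FAIL, same as a pattern that is not found.
      if grep -Eq -- "$pattern" "${run_dir}${artifact}" 2>/dev/null; then
        verdict=PASS
      else
        verdict=FAIL
      fi
      printf '%s\t%s\t%s\n' "$(basename "$run_dir")" "$rule_id" "$verdict"
    done < "$rules_file"
  done
}
```

Each output line is one cell of the compliance matrix (run, rule, verdict), so 16 rules across 24 runs produce the 384 assessments without anyone opening a file.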


3. Architecture: A Queue, a Memory Monitor, and a Completion Detector

The Dispatcher’s core is ~600 lines of bash. The simplicity is intentional — it has to be debuggable at 3 AM when something goes wrong overnight. Three primitives carry the design.

flowchart LR
    Q["Task Queue"] --> L["Launch"]
    L --> M["Memory Monitor"]
    M --> C["Completion Detect"]
    C -->|done| S["Score"]
    C -->|failed| Q
    S --> R["Compliance Matrix"]

    style Q fill:#e3f2fd,stroke:#1565C0
    style M fill:#fff3e0,stroke:#E65100
    style S fill:#c8e6c9,stroke:#2E7D32

Primitive 1: Queue Management

The queue is a flat file. Each line is one job: task ID, round number, status (pending/running/completed/failed), start timestamp, end timestamp, run directory. The Dispatcher loop iterates the queue every 2 minutes:

  1. For each running job: check if the run directory has the completion markers (final-analysis.md + deliverable-url.md). If yes, mark completed. If no, check tmux session liveness. If session is gone but no completion marker, mark failed.
  2. For each pending job: if running count < max_parallel AND memory is healthy, launch the next job. Update status to running, record start timestamp, create the tmux session, send the question, attach the permission auto-accept handler.
  3. For each failed job: log the failure reason, decide whether to re-queue (transient: OOM, stall) or abandon (logic error, infinite loop).

The queue is durable: if the Dispatcher itself crashes, restarting it picks up where it left off. Running jobs remain attached to their tmux sessions; the Dispatcher rediscovers them on restart by matching session names to queue entries.
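
The flat-file queue needs only a handful of operations. The helpers below are a minimal sketch under stated assumptions: a `|`-separated line format in the field order given above, and hypothetical function names. The real script may differ. The status update writes to a temporary file and renames it, which is one common way to keep the queue readable if the Dispatcher dies mid-update.

```shell
#!/usr/bin/env bash
# Hypothetical flat-file queue helpers.
# Assumed line format: task_id|round|status|start_ts|end_ts|run_dir

# Count jobs currently in a given status.
count_status() {
  awk -F'|' -v s="$2" '$3 == s { n++ } END { print n + 0 }' "$1"
}

# Print the line number of the first pending job (prints nothing if none).
next_pending() {
  awk -F'|' '$3 == "pending" { print NR; exit }' "$1"
}

# Rewrite one job's status via temp file + rename.
set_status() {
  local queue="$1" line_no="$2" status="$3" tmp
  tmp="$(mktemp "${queue}.XXXXXX")" || return 1
  awk -F'|' -v OFS='|' -v n="$line_no" -v s="$status" \
    'NR == n { $3 = s } { print }' "$queue" > "$tmp" && mv "$tmp" "$queue"
}

# Parallel-cap half of the launch gate in step 2 (the memory check is separate).
can_launch() {
  [ "$(count_status "$1" running)" -lt "$2" ]
}
```

Because all state lives in the file, a restarted Dispatcher can rebuild its view of the world by reading the queue again.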

Primitive 2: Memory Safety with 3-Tier Thresholds

Each pipeline run spawns 8+ agent processes. RAM usage per run varies by hardware and model, but is substantial — on a typical development server, ~40GB per run at steady state with spikes during context-heavy phases. The safe number of concurrent runs depends entirely on available RAM. The Dispatcher doesn’t hardcode this — it monitors swap pressure dynamically and adapts:

  • Above the safe threshold: launch new runs freely
  • Approaching pressure: pause new launches, let running runs complete
  • Critical pressure: freeze the lightest run (lowest RSS) via SIGSTOP, resume it via SIGCONT when pressure subsides

This adaptive approach means the same Dispatcher script works on a 64GB laptop (1 concurrent run) or a 512GB server (6+ concurrent runs) without configuration changes.

The monitor that implements this runs every cycle and applies three tiers:

Table 3: Memory Threshold Behavior

| Swap Free | Action | Reason |
| --- | --- | --- |
| >25% | Normal: launch new runs as queue allows | Healthy headroom |
| 15-25% | Pause new launches; let running runs complete | Approaching pressure, no new commitments |
| <15% | SIGSTOP the lightest run; SIGCONT when recovered | Imminent OOM; freeze rather than kill |
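
The tier decision in Table 3 reduces to one measurement and one comparison. The sketch below takes its thresholds from the table. Reading `/proc/meminfo` is a Linux-specific assumption, the function names are hypothetical, and treating a machine with no swap as healthy is our choice for the example, not something the post specifies.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of the tier decision; thresholds taken from Table 3.

# Integer percentage of swap still free. The optional argument allows a
# meminfo-format file other than /proc/meminfo (useful for testing).
swap_free_pct() {
  awk '/^SwapTotal:/ { total = $2 }
       /^SwapFree:/  { free = $2 }
       END {
         # Assumption: a machine with no swap configured is reported as 100% free.
         if (total > 0) printf "%d\n", free * 100 / total; else print 100
       }' "${1:-/proc/meminfo}"
}

# Map a swap-free percentage to one of the three actions.
memory_tier() {
  local pct="$1"
  if [ "$pct" -gt 25 ]; then
    echo normal   # launch as the queue allows
  elif [ "$pct" -ge 15 ]; then
    echo pause    # no new launches
  else
    echo freeze   # SIGSTOP the lightest run
  fi
}
```

A dispatcher loop would call `memory_tier "$(swap_free_pct)"` once per cycle and branch on the result.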

The SIGSTOP/SIGCONT mechanism is the safety net. When swap drops below 15%, the Dispatcher identifies the run with the lowest RSS (typically the most recently launched, still building context), sends SIGSTOP to all its agent processes (freezes them without losing state), and waits for the older run to complete. When memory recovers, SIGCONT thaws the frozen run, and it resumes from exactly where it paused.

This pattern prevents memory cascades entirely. The freeze-thaw mechanism preserves run state without data loss — the alternative (letting the kernel pick a victim) destroys 60-90 minutes of work per kill.
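
Freeze and thaw are thin wrappers around `kill`. The sketch below assumes each run keeps a file listing the PIDs of its agent processes, one per line. That PID file and the function names are illustrative assumptions; the post does not say how the real script tracks processes.

```shell
#!/usr/bin/env bash
# Hypothetical freeze/thaw helpers. Assumption: each run keeps a file listing
# the PIDs of its agent processes, one per line.

# Send one signal to every PID in the file; PIDs that already exited are skipped.
signal_run() {
  local sig="$1" pid_file="$2" pid
  while read -r pid; do
    [ -n "$pid" ] || continue
    kill -s "$sig" "$pid" 2>/dev/null || true
  done < "$pid_file"
}

# SIGSTOP cannot be caught or ignored, so the process pauses with its memory intact.
freeze_run() { signal_run STOP "$1"; }

# SIGCONT resumes the paused processes from where they stopped.
thaw_run() { signal_run CONT "$1"; }
```

A frozen process still holds its memory, so freezing does not release RAM by itself. It stops the frozen run from growing while the older run finishes and releases its own.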

Primitive 3: Completion Detection That Distinguishes Working / Stuck / Crashed / Done

“Did the run complete?” sounds trivial but isn’t. The Dispatcher must distinguish four states:

flowchart TD
    Check["Check run status"] --> W{"Files recently modified?"}
    W -->|yes| Working["✅ Working"]
    W -->|no| S{"Session alive?"}
    S -->|no| Crashed["❌ Crashed → re-queue"]
    S -->|yes| P{"Progress in 15 min?"}
    P -->|no| Stuck["⚠️ Stuck → mark failed"]
    P -->|yes| Working
    Check --> D{"Both markers exist?"}
    D -->|yes| Done["✅ Done"]

    style Working fill:#c8e6c9,stroke:#2E7D32
    style Done fill:#c8e6c9,stroke:#2E7D32
    style Crashed fill:#ffcdd2,stroke:#C62828
    style Stuck fill:#fff9c4,stroke:#F9A825

  1. Working: agents are active, files are being produced, last modification time is recent
  2. Stuck: agents are alive but making no progress; no file has been modified for more than 15 minutes
  3. Crashed: tmux session is gone, partial output remains
  4. Done: all required artifacts present, agents idle

The completion criteria are deliberately strict: a run is Done only if both final-analysis.md AND deliverable-url.md exist in the run directory. A run that ends with either file missing is marked Failed and re-queued. This stringency caught a systematic failure that drove a major architecture change.

Strict completion criteria surface systematic patterns that lenient criteria hide. Across dozens of runs, the Dispatcher revealed a 100% correlation between a specific agent routing pattern and incomplete deliverables: runs where the orchestrator used persistent core agents always produced complete output, while runs using disposable sub-agents consistently missed the final deliverable step. This pattern, visible only because the Dispatcher tracked completion criteria across many runs, drove the routing guard hook described in post 2.

Stall detection uses a similar mechanism: if the run directory has no file modifications for >15 minutes AND no completion markers, the Dispatcher checks tmux pane activity. A silent pane with a live process is “stuck”; a dead process is “crashed.” Both trigger failure marking and re-queuing.
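
The four states can be derived from three checks: marker files, session liveness, and recent file activity. The sketch below is a simplification. It uses file modification times only and skips the tmux pane-activity check described above. The marker names and the 15-minute threshold come from this post; the function name and arguments are assumptions.

```shell
#!/usr/bin/env bash
# Hypothetical classifier: prints done, crashed, working, or stuck.
# Simplification: progress is judged from file mtimes only, not pane activity.
classify_run() {
  local run_dir="$1" session="$2" stall_min="${3:-15}"

  # Done: both completion markers exist.
  if [ -f "$run_dir/final-analysis.md" ] && [ -f "$run_dir/deliverable-url.md" ]; then
    echo "done"
    return
  fi

  # Crashed: the tmux session is gone but the markers are missing.
  if ! tmux has-session -t "$session" 2>/dev/null; then
    echo "crashed"
    return
  fi

  # Working: at least one file changed within the stall window; otherwise stuck.
  if [ -n "$(find "$run_dir" -type f -mmin "-$stall_min" 2>/dev/null | head -n 1)" ]; then
    echo "working"
  else
    echo "stuck"
  fi
}
```

The markers are checked first so that a run which finished and then lost its session is still counted as done rather than crashed.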


4. The Iteration Cycle

The whole point is the cycle. With the Dispatcher in place, the spec-improvement loop became:

1. Change spec (e.g., add routing guard hook, update Watchdog spec)
2. Launch: bash dispatcher.sh v5 2 120
     # v5  = spec version label
     # 2   = max parallel runs
     # 120 = max wall-clock minutes per run
3. Go to sleep
4. Next morning: review overnight monitor log (one file, ~200 lines/hour)
5. Run programmatic scorer: bash score-compliance.sh v5
6. Analyze compliance matrix across 24 runs
7. Identify regressions and improvements; design next spec change
8. Repeat
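
The third launch argument bounds each run's wall-clock time. A minimal sketch of such a guard follows, assuming the queue stores start timestamps as Unix epoch seconds (the post does not say which format the real script uses). The function name is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical wall-clock guard for the "max minutes per run" argument.
# Succeeds (exit 0) when the run has used up its allowance.
run_timed_out() {
  local start_epoch="$1" max_minutes="$2" now="${3:-$(date +%s)}"
  [ $(( (now - start_epoch) / 60 )) -ge "$max_minutes" ]
}
```

The optional third argument lets a caller pass the current time explicitly, which keeps the check deterministic when it is exercised outside the live loop.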

One human-minute to launch. Zero babysitting. Results ready next day. A single human can run an evaluation cycle every 24 hours, sustained, for months.

A concrete cycle: we added the agent discussion architecture, an Auditor checkpoint protocol with a 5-item agenda, and launched the Dispatcher overnight with 24 runs. The next morning, the compliance scorer showed the discussion-dependent rules improved dramatically (+40-60pp on cross-source validation and challenge calibration). But the same scorer flagged a format rule regression: the richer discussion outputs violated a bullet-hierarchy constraint. Within a day, we tightened the linter from WARN to BLOCK, relaunched overnight, and confirmed the regression was fixed without rolling back the discussion gains. Change, measure, fix, measure again, all in 48 hours. Before the Dispatcher, this cycle would have taken 2-3 weeks.

For comparison, the pre-Dispatcher cycle was:

1. Change spec
2. Open terminal, set up environment, enter question, monitor
3. ~75 minutes later, save output
4. Repeat steps 2-3 four more times (5 runs is a good day)
5. Save outputs, switch to next task
6. Repeat steps 2-5 for 7 more tasks
7. ~7 days later, do this 2 more times for variance
8. Manually score every output, fill spreadsheet
9. ~10 days total per evaluation cycle

The Dispatcher compresses 10 days into 1 day, and removes the most failure-prone step (manual scoring) by making programmatic scoring trivial.


5. What the Dispatcher Enabled: 100+ Runs Across 5 Versions

Over 5 evaluation versions, the Dispatcher ran 100+ automated pipeline runs. Each version’s compliance was measured the same way: scan every output file from every run for every rule, compute pass rate per rule per version.

flowchart LR
    V2["V2: ~65%<br/>Harness"] --> V3["V3: ~71%<br/>Rule restructuring"]
    V3 --> V4["V4: ~77%<br/>Discussion"]
    V4 --> V5["V5: ~90%+<br/>Routing + Watchdog"]

    style V2 fill:#ffcdd2,stroke:#C62828
    style V3 fill:#fff9c4,stroke:#F9A825
    style V4 fill:#fff9c4,stroke:#F9A825
    style V5 fill:#c8e6c9,stroke:#2E7D32

Table 4: Versions, Runs, and System-Wide Compliance

| Version | Spec Changes | Runs | Headline Finding | System-Wide Compliance |
| --- | --- | --- | --- | --- |
| V2 | Harness engineering: hooks, warm-rules, periodic refresh | 24 | Hook rules at 99-100%; standalone text rules at 50-70% | ~65% |
| V3 | Rule restructuring: 52 rules deleted, 43 deduped, format linter extensions | 24 | Format linters +10-25pp per rule when promoted to BLOCK | ~71% |
| V4 | Discussion architecture: Auditor checkpoint, 5-item agenda, reflections | 24+8 | Cross-source validation +63pp, challenge calibration +66pp | ~77% |
| V5 | Routing guard + Watchdog v2 + larger context | 12 | Core-agent routing enforced; checkpoint deadlocks caught proactively | ~90%+ (projected) |

Each row is one overnight cycle’s worth of data, evaluated programmatically. Without the Dispatcher, this 4-version progression would have taken roughly 6 weeks of human time. With the Dispatcher, it took 5 calendar nights of compute and a few hours of analysis per morning. The bottleneck shifted from running tests to deciding what to test next.

The progression from 65% → 77% → 90%+ was driven by the Dispatcher’s ability to:

  • Run the same 8 tasks on every spec version (controlled comparison)
  • Run 3 rounds per task (variance estimation)
  • Produce all artifacts in structured per-run directories (handoffs, charts, documents)
  • Complete a full evaluation cycle overnight

Posts 1 and 2 reported the compliance numbers. This post is about why those numbers exist at all.


6. Discovery Value: Patterns Only Systematic Evaluation Reveals

Beyond enabling measurement, the Dispatcher surfaced architectural insights that ad-hoc testing could never reveal. Each pattern became visible only because systematic, repeated runs with consistent logging exposed correlations across dozens of data points.

This pattern (systematic evaluation revealing dominant failure modes invisible to spot-checking) recurs across the academic benchmark literature. AgentBench [Liu et al. 2024], which evaluated 29 LLMs across 8 diverse environments, found that Task Limit Exceeded (TLE) accounted for up to 82.5% of failures in the Lateral Thinking Puzzles environment and 67.9% in Knowledge Graphs. No one set out to discover that “running out of steps” was the dominant failure mode; it emerged because systematic evaluation across many tasks made the distribution visible. Our parallel finding, that 100% of sub-agent-spawning runs in V3 failed to produce documents (Discovery 1 below), has exactly the same epistemological shape. A failure pattern is only “obvious in hindsight” once enough runs exist to expose the correlation. Ad-hoc testing produces a string of individual bug reports; systematic dispatching produces a distribution that names the dominant mode.

Discovery 1: Agent Routing Architecture (100% correlation, dozens of runs)

The Dispatcher’s structured per-run directories made routing patterns visible across many runs. With one-off testing, an incomplete deliverable looks like an individual bug. With dozens of runs tagged by completion outcome and indexed by file presence, the systematic pattern emerges: a 100% correlation between a specific agent routing approach and incomplete output, across multiple task types and rounds.

This pattern drove the routing guard hook described in post 2 — a PreToolUse hook that blocks orchestrator-level sub-agent spawns for core roles and forces routing through persistent agents.

Discovery 2: Task Diversity Reveals Hidden Dependencies

Different task types stress different parts of the pipeline. The Dispatcher’s 8-task diversity (spanning simple trends, premise refutation, experiment readout, cross-platform comparison, and comprehensive reports) revealed that some tasks consistently pushed certain agents to their resource limits while others ran comfortably. This task-type-specific behavior was invisible in single-task testing — testing only one type would lead to either false confidence or false alarm.

The Dispatcher’s controlled runs across diverse task types enabled data-driven decisions about resource allocation: which agent roles need the most headroom, which task types are the hardest on the pipeline, and where optimization effort should focus.

Discovery 3: Pipeline Liveness Requires Proactive Monitoring

The Dispatcher’s per-run activity logging revealed that multi-agent pipelines can stall silently at inter-agent handoff points — one agent completes its work and sends a message, but the receiving agent doesn’t act on it. Without systematic logging with timestamps, such stalls look like generic “didn’t finish” timeouts. With the Dispatcher’s activity log, the exact handoff point and the specific agent responsible are immediately identifiable.

This insight drove the Watchdog v2 design (post 2, Layer 4): proactive monitoring that polls agent activity every 5 minutes and alerts the orchestrator when a handoff response is overdue.

Discovery 4: Adaptive Resource Management Enables Safe Concurrency

Cross-log correlation (memory logs vs system logs) revealed that resource pressure in multi-agent systems is non-linear — the system can go from healthy headroom to critical pressure inside a single monitoring cycle. This insight drove the adaptive 3-tier resource monitor: generous thresholds with SIGSTOP/SIGCONT freeze-thaw that preserves run state rather than losing work. The result: the Dispatcher safely maximizes concurrency on any hardware without manual tuning.

Discovery 5: Regression Detection Before Users See It

Spec changes can have unintended consequences. A change designed to improve quality (adding agent discussion) can cause a format rule to regress because the richer discussion produces longer outputs that violate formatting constraints. Without automated measurement across all runs, such regressions are invisible until a user notices — potentially weeks later.

The Dispatcher’s programmatic scoring catches regressions the morning after a spec change ships. The automated compliance matrix answers “which rules regressed?” before any user sees the output. The fix (tightening a linter from WARN to BLOCK) can be applied and re-verified within 48 hours — change, measure, fix, measure again.
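
Answering "which rules regressed?" is a per-rule comparison of pass rates between two versions. The sketch below assumes each version's scores are stored as tab-separated lines (run ID, rule ID, PASS or FAIL). That file format, the function names, and the default 10-point threshold are assumptions for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical regression check between two score files.
# Assumed input lines: run_id<TAB>rule_id<TAB>PASS|FAIL

# Print "rule_id<TAB>pass_rate_percent" (integer, truncated), sorted by rule.
pass_rates() {
  awk -F'\t' '{ total[$2]++; if ($3 == "PASS") pass[$2]++ }
    END { for (r in total) printf "%s\t%d\n", r, pass[r] * 100 / total[r] }' "$1" |
    LC_ALL=C sort
}

# List rules whose pass rate dropped by at least min_drop percentage points.
regressions() {
  local old="$1" new="$2" min_drop="${3:-10}" tab old_rates new_rates
  tab="$(printf '\t')"
  old_rates="$(mktemp)"; new_rates="$(mktemp)"
  pass_rates "$old" > "$old_rates"
  pass_rates "$new" > "$new_rates"
  # join pairs the two rate columns by rule_id; rules missing from either side are dropped.
  LC_ALL=C join -t "$tab" "$old_rates" "$new_rates" |
    awk -F'\t' -v d="$min_drop" '$2 - $3 >= d { printf "%s\t%d -> %d\n", $1, $2, $3 }'
  rm -f "$old_rates" "$new_rates"
}
```

With only three rounds per task, a single flipped run moves a rule's rate noticeably, so a threshold on the drop helps separate real regressions from noise.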


7. Key Design Decisions and the Reasoning

Decision 1: Tasks × Rounds, not just Tasks

A high compliance rate (e.g., 95%+) looks identical to 100% on any single observation. The occasional failure is invisible unless you run the same task multiple times. Without three rounds per task, a rule that occasionally fails would either look perfect (if the failure round didn’t run) or broken (if only the failure round ran).

The cost is linear: 3× the runs, 3× the compute. The benefit is qualitative: claims about systematic compliance become defensible rather than anecdotal.

Decision 2: Max Parallel = 2

A pipeline run uses substantial RAM across its 8+ agent processes, so the safe concurrency level depends on available hardware. The max-parallel argument is a ceiling, not a target: the adaptive memory monitor decides how many runs are actually active below it, so effective parallelism is an outcome of available resources rather than a tuned setting.

The SIGSTOP/SIGCONT freeze mechanism lets the Dispatcher run at the ceiling when memory has headroom and pull back when it doesn’t (on our development server, 3 concurrent runs versus 2), without losing work. This is the right model: a hard upper cap with elastic operation below it, not a fixed parallelism number.

Decision 3: Completion = Final Analysis AND document

A pipeline that writes a markdown file but no document is not done; it has failed silently. The strictness of “both artifacts must exist” is what caught the sub-agent bypass pattern (Discovery 1 above). If completion criteria had been “any output file present,” a significant fraction of runs would have been mismarked as successful, and the routing pattern would have been invisible.

The general principle: define completion in terms of the outputs the user actually needs, not the outputs the pipeline happens to produce. If the deliverable is a document, “document exists” is the completion criterion. Anything weaker is wishful thinking.

Decision 4: Overnight Operation as the Standard Cycle

The Dispatcher runs in its own tmux session with a monitoring companion (overnight-monitor.sh) that logs every 5 minutes:

  • Memory: swap %, agent RSS, process count
  • Per-run: file count, framework/final/deliverable status, latest file modification age
  • OOM kills: scans dmesg for kernel OOM events
  • Dispatcher alive: confirms the dispatcher session is still active

The human reviews a single log file the next morning. A 12-hour run produces ~150 log entries, and scanning them for ALERT lines and anomalies takes 5 minutes. The full overnight history is in one place and one format, available whenever the human needs to look back.
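
The morning scan can itself be a short function. This sketch assumes alert entries contain the literal word ALERT, as described above, and that the log has one entry per line. The function name and the 20-line preview limit are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical morning triage for the monitor log.
# Prints the entry count, the alert count, then the first alert lines with
# their line numbers so they can be located in the full log.
summarize_log() {
  local log="$1"
  printf 'entries: %s\n' "$(wc -l < "$log" | tr -d ' ')"
  printf 'alerts: %s\n' "$(grep -c 'ALERT' "$log")"
  grep -n 'ALERT' "$log" | head -n 20
}
```

An alert count of zero is the common case and means the remaining review is a quick skim for anomalies.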


8. Industry Landscape: Where the Dispatcher Sits

The Dispatcher was designed from operational requirements — queue durability, memory safety, completion detection — before we mapped it against the published literature on agent evaluation. The convergence is reassuring — most of the load-bearing design decisions are independently endorsed elsewhere — but the gaps reveal where the Dispatcher contributes something the published work does not.

8.1 Anthropic (Jan 2026): Isolated Multi-Trial Evaluation with Resource Isolation

Anthropic’s evaluation engineering writeup [Anthropic 2026] is the closest published analogue to our design philosophy. The headline recommendations:

  • Isolated multi-trial evaluation with pass@k (capability ceiling) and pass^k (reliability floor) as the reported metrics.
  • Clean per-trial environments: “Each trial should be isolated by starting from a clean environment. Unnecessary shared state between runs can cause correlated failures.”
  • Resource isolation as a correctness requirement, not a nice-to-have: “If multiple distinct trials fail because of the same limitation in the environment (like limited CPU memory), these trials are not independent because they are affected by the same factor, and the eval results become unreliable for measuring agent performance.”
  • 0% pass@100 is usually a broken task, not an incapable agent: a signature for distinguishing eval-infrastructure failure from genuine model failure.

The Dispatcher’s per-run directory structure (isolated artifact namespace), 3-rounds-per-task design (variance estimation), and SIGSTOP/SIGCONT memory monitor (resource isolation) directly correspond. We arrived at these independently from operational pain; Anthropic’s framing gives the same decisions a statistical justification we did not initially articulate.

8.2 OpenAI (Jan 2026): Lightweight Two-Layer Grading

OpenAI’s developer-blog post on evaluating agent skills [OpenAI 2026] is more practitioner-focused and lands on a different stack:

  • Start small: “You don’t need a large benchmark to get value from evals. For a single skill, a small set of 10-20 prompts is enough to surface regressions and confirm improvements early.”
  • Two-layer grading: lightweight deterministic checks (did the agent run the right command, write the expected file) plus model-assisted rubric grading with structured output schemas.
  • JSONL output streams (codex exec --json) so downstream graders can run automatically.

Our 8-task list (vs. OpenAI’s “10-20 per skill”) and our programmatic compliance scoring (grep-based pattern matching, Section 4 and Appendix D) are direct instances of the same pattern. We did not implement a model-assisted second layer — every rule the Dispatcher scores is deterministic — but the architecture supports it cleanly: a second-layer scoring script would walk the same per-run directories and call out to a model for rubric items that resist regex.

8.3 Academic Benchmarks: Execution-Based Scoring at Scale

Five canonical agent benchmarks anchor the academic side:

| Benchmark | Source, Year | Tasks | Scoring Approach |
| --- | --- | --- | --- |
| AgentBench [Liu et al. 2024] | ICLR 2024 | 8 environments, 29 LLMs evaluated | Per-environment automated scoring; surfaces TLE as dominant failure mode |
| SWE-bench [Jimenez et al. 2024] | ICLR 2024 | 2,294 real GitHub issues | Execution-based: patches run against repository test suites (median 51 regression tests per instance) |
| GAIA [Mialon et al. 2023] | arXiv 2023 | 466 multi-step questions | Deterministic exact-match; no human raters in the loop |
| WebArena [Zhou et al. 2024] | CMU 2024 | 812 web tasks across 241 templates | Per-task programmatic functional-correctness validators on self-hosted sites |
| AgentBoard [Ma et al. 2024] | NeurIPS 2024 (Oral) | 1,013 environments, 9 categories | Continuous progress rate metric [0,1] tracking incremental completion, not just final success |

The common thread: execution-based, programmatic scoring. None of these benchmarks rely on human grading for the inner loop; all of them produce structured outputs amenable to automation. The Dispatcher’s compliance-scoring stage is the same pattern applied to a different layer (spec compliance rather than task correctness). AgentBoard’s progress-rate idea is particularly relevant — it is the academic version of our per-rule pass/fail matrix, where partial credit and per-step quality are first-class measurements rather than collapsed into a single pass/fail.

8.4 The Gap the Dispatcher Fills

Across the published work, no single system combines:

  1. Autonomous dispatching (queue-driven launch, completion detection, overnight operation without a human in the loop).
  2. Resource safety (memory monitoring, SIGSTOP/SIGCONT freeze-thaw, OOM cascade prevention).
  3. Compliance scoring (spec-rule pass/fail across many runs, with per-rule regression detection between versions).

Anthropic specifies the statistical requirements (multi-trial, resource isolation) but does not publish a dispatcher. OpenAI specifies the grading stack (deterministic + model-assisted) but treats orchestration as out of scope. Academic benchmarks specify the task and scoring methodology but assume an evaluation harness exists. AgentBoard tracks per-step progress but does not address the resource-management problem that emerges when running many concurrent agent processes on shared hardware.

The Dispatcher is the practical glue. It is not novel in any single dimension — every component has a published analogue — but the integration is, to our knowledge, not documented elsewhere. The strongest evidence that this gap is real: Anthropic’s open question of “how do production teams manage OOM, CPU, and memory when running 100+ concurrent agent evaluations?” has no published answer, even though Anthropic itself names the problem. The SIGSTOP/SIGCONT freeze-thaw mechanism (Section 3, Primitive 2) is our concrete answer; we have not found a published equivalent.


9. Comparison: Manual vs. Automated Evaluation

Table 5: End-to-End Comparison

| Aspect | Manual (before) | Automated (Dispatcher) |
| --- | --- | --- |
| Per-run setup | ~5 min (terminal, question, config) | 0 (queue-driven) |
| Per-run monitoring | ~10 min active attention | 0 (auto-detect) |
| Runs per day | 3-5 (human attention limited) | 12-24 (compute limited) |
| Overnight runs | Not possible (no unattended operation) | Standard workflow |
| Memory management | Hope it doesn’t crash | 3-tier monitor (pause/freeze/thaw) |
| Failure recovery | Manual restart, lost context | Auto-detect, re-queue |
| Compliance scoring | Manual review per output | Programmatic scan across all runs |
| Regression detection | Accidental discovery | Systematic comparison per version |
| Statistical confidence | 1 run per task (high noise) | 3 rounds per task (variance estimable) |
| Spec-change evaluation cycle | 7-10 days | 1 overnight cycle |

The transformation is not “Dispatcher does the same thing faster.” The Dispatcher makes a different class of work possible. Variance estimation requires repeated runs; repeated runs require automation. Controlled comparison across versions requires identical inputs across versions; identical inputs require a fixed task list. Programmatic scoring requires structured output directories; structured directories require queue-driven launching.

Each affordance enables others. The system is a stack, like the compliance stack in post 2.


10. Principles for Building Multi-Agent Evaluation Infrastructure

Six principles derived from the design and the failure patterns the Dispatcher surfaced.

Principle 1: Automate the full cycle, not just the inner step

It is tempting to automate the slow part (the pipeline run) and leave the surrounding workflow manual. This is the worst of both worlds: you pay the full cost of building the automation, but the human is still on the hook for launch, monitoring, completion detection, and cleanup, so throughput stays at the manual cap.

Automate end-to-end: queue management → launch → monitoring → completion detection → cleanup → scoring. Each step is simple in isolation. The compounding benefit comes from removing the human from every step.
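The queue end of that cycle can be sketched as below. This is a minimal illustration, not the Dispatcher's actual code: the plain-text queue file, the function names `seed_queue` and `pop_next_run`, and the reuse of the `t<task>-r<round>` naming from Appendix A as queue entries are all assumptions.

```bash
# Hypothetical file-backed queue: one run ID per line, e.g. "t3-r2".
# The state lives on disk, so a restarted dispatcher can resume from it.

# seed_queue QUEUE_FILE TASKS ROUNDS: write TASKS x ROUNDS run IDs.
seed_queue() {
  queue_file=$1; tasks=$2; rounds=$3
  : > "$queue_file"
  t=1
  while [ "$t" -le "$tasks" ]; do
    r=1
    while [ "$r" -le "$rounds" ]; do
      printf 't%d-r%d\n' "$t" "$r" >> "$queue_file"
      r=$((r + 1))
    done
    t=$((t + 1))
  done
}

# pop_next_run QUEUE_FILE: print the first run ID and remove it from the queue.
# Returns non-zero when the queue is empty.
pop_next_run() {
  queue_file=$1
  [ -s "$queue_file" ] || return 1
  next=$(head -n 1 "$queue_file")
  # Rewrite through a temp file plus mv so the queue is never half-written.
  tail -n +2 "$queue_file" > "$queue_file.tmp" && mv "$queue_file.tmp" "$queue_file"
  printf '%s\n' "$next"
}
```

Seeding with 8 tasks and 3 rounds produces the 24-run battery; the launch step would then call `pop_next_run` whenever a slot is free.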

Principle 2: Run diverse tasks; task diversity is test coverage

A pipeline that works on one question type may fail on another. Our 8-task list spans simple trend analysis, premise refutation, experiment readout, surface comparison, multi-metric ranking, data challenge, root-cause investigation, and comprehensive report. The context-exhaustion pattern (Pattern 2 above) was visible only because two of these tasks had radically different context profiles.

If your task list is one type (“simple analysis questions”), your evaluation tells you about that type. The pipeline may be catastrophically broken on other types and you will not know.

Principle 3: Run multiple rounds; non-deterministic systems require it

Three is the floor. A rule at 67% true compliance passes 3-of-3 with about 30% probability (0.67^3) and passes 0-of-3 with about 4% probability (0.33^3). Five rounds tighten the estimate and ten tighten it further, but cost scales linearly and the marginal information per round drops quickly past three.
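The arithmetic behind those figures, as a small sketch. It assumes every round is an independent trial with the same pass probability; the helper name `all_pass_prob` is ours, introduced only for illustration.

```bash
# all_pass_prob P N: probability that N independent rounds all succeed
# when each round succeeds with probability P, i.e. P^N, to 3 decimals.
all_pass_prob() {
  awk -v p="$1" -v n="$2" 'BEGIN { printf "%.3f\n", p ^ n }'
}

all_pass_prob 0.67 3   # clean 3-of-3 at 67% true compliance
all_pass_prob 0.33 3   # all three rounds fail at 67% true compliance
all_pass_prob 0.90 1   # a single run at 90% true compliance
```

The first two calls reproduce the roughly 30% and 4% figures. The third shows why one run cannot separate a 90% rule from a 100% rule: it passes nine times out of ten.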

The corollary: never report a compliance number from a single run. Either report “ran once, here is what happened” (a case study, not a measurement) or run repeatedly and report the rate.

There is a second, less obvious requirement: the multiple rounds must be statistically independent, which means the runs must not share resources that could cause correlated failure. Anthropic [Anthropic 2026] is explicit on this point: “If multiple distinct trials fail because of the same limitation in the environment (like limited CPU memory), these trials are not independent because they are affected by the same factor, and the eval results become unreliable for measuring agent performance.” Running three rounds under shared resource pressure does not buy three observations — it buys one observation with a multiplicity error in the reported confidence. The Dispatcher’s resource-aware scheduling (Section 3) and per-run directory isolation (Appendix A) exist in part to honor this independence requirement. Multi-round evaluation without resource isolation is theatre.

Principle 4: Monitor resources, not just outputs

Memory, context window, process liveness. The Dispatcher’s memory monitoring prevented dozens of OOM crashes that would have wasted hours of compute each. The Pattern 4 cascade incident is the proof: when the resource budget is tight, normal-looking system state is one cycle away from catastrophic failure.

Output monitoring catches what completed. Resource monitoring catches what is about to fail. Both are necessary.

This is also a statistical correctness requirement, not just an operational one. Anthropic [Anthropic 2026] states that when trials share a constrained resource (CPU memory, disk, network), observed failure rates conflate agent incapability with infrastructure limitations, and the “eval results become unreliable for measuring agent performance.” Our SIGSTOP/SIGCONT freeze-thaw mechanism (Section 3, Primitive 2) is a concrete implementation of this requirement: when free swap drops below 15%, freezing the most-recently-launched run preserves its state until memory recovers, rather than letting the kernel OOM-kill some arbitrary victim. The alternative, co-located runs evicting each other under memory pressure, produces failure rates that look like agent behavior but actually measure scheduling artifacts. Resource monitoring is the precondition under which the compliance numbers in posts 1 and 2 are valid measurements rather than environmental noise.
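At the process level, freeze-thaw amounts to two signals. The sketch below assumes each run can be addressed by a single process ID; the helper names `freeze_run` and `thaw_run` are illustrative, and the real Dispatcher may signal a whole process group or tmux pane instead.

```bash
# freeze_run PID: suspend the process without killing it. SIGSTOP cannot be
# caught or ignored, and the process keeps its memory and open files.
freeze_run() {
  kill -s STOP "$1"
}

# thaw_run PID: resume a previously frozen process where it left off.
thaw_run() {
  kill -s CONT "$1"
}

# Demo on a throwaway background process.
sleep 30 &
demo_pid=$!
freeze_run "$demo_pid"                 # stopped, but still alive
thaw_run "$demo_pid"                   # continues from the same point
kill "$demo_pid"                       # clean up the demo process
wait "$demo_pid" 2>/dev/null || true   # reap it quietly
```

A frozen process stops allocating but does not release what it already holds, so freezing halts further memory growth rather than freeing memory immediately.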

Principle 5: Log everything, review selectively

The overnight monitor produces ~200 log entries per hour. Most are uninteresting. The human scans for ALERT lines and anomalies — a 5-minute pass per overnight cycle. The cost of comprehensive logging is negligible (disk is cheap, parsing is selective). The cost of missing a failure mode is a re-run cycle (a full day).

The bias should be aggressive logging at low priority and selective review at high attention. Never the reverse.
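The selective-review half can be as small as a filter over the overnight log. This sketch assumes the log is a plain-text file in which alert entries contain the literal string ALERT; the function name `summarize_log` is hypothetical.

```bash
# summarize_log LOG_FILE: print every ALERT line with its line number,
# then a one-line count, so the morning review starts from the anomalies.
summarize_log() {
  grep -n 'ALERT' "$1" || true
  printf 'alerts: %s\n' "$(grep -c 'ALERT' "$1" || true)"
}
```

Everything else stays in the log untouched, available when an alert needs surrounding context.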

Principle 6: Controlled comparison across versions, with the only variable being the spec change

Same tasks, same prompts, same evaluation criteria, same scoring scripts. The only thing that changes between V3 and V4 is the spec. This is what makes the V2 → V3 → V4 → V5 progression meaningful.

Comparison gets contaminated easily: a “small improvement to the question wording” between cycles invalidates the comparison. A “better scoring criterion” changes what is measured, not what changed in the system. Treat the evaluation harness as production code: change-controlled, version-tagged, reviewed.
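One cheap way to notice contamination is to fingerprint the harness before each cycle and compare it with the previous cycle's value. The sketch below assumes the task list, prompts, and scoring scripts all live under one directory; the function name `harness_fingerprint` and that layout are assumptions, not the Dispatcher's actual mechanism.

```bash
# harness_fingerprint DIR: print a checksum line covering every file under
# DIR. Any edit, rename, addition, or deletion changes the result.
harness_fingerprint() {
  (
    cd "$1" || exit 1
    find . -type f | LC_ALL=C sort | while IFS= read -r f; do
      printf '%s\n' "$f"   # relative path, so renames are detected
      cksum < "$f"         # content checksum and size
    done | cksum
  )
}
```

If the fingerprint recorded for V3 differs from the one computed before the V4 run, something other than the spec changed and the comparison should not be trusted.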


11. The Dispatcher as a General Pattern

The Dispatcher is specific to our multi-agent pipeline, but the pattern generalizes to any multi-agent system that needs systematic quality measurement.

A/B testing spec changes: Run the same tasks on two spec versions, compare compliance rates per rule. Hold task list, prompts, and scoring constant. Vary only the spec under test.

Regression detection: Run after every spec change to catch quality regressions before they reach users. Before the Dispatcher, V4’s W3 regression (Pattern 5 above) would have surfaced only when the first user hit it; with the Dispatcher, it was caught the morning after the spec change and patched before any user saw it.

Task diversity for coverage: Maintain a task list that spans the full surface of intended use. Periodically audit: are there task types the system is never tested on? Add them. Are there task types that always pass? Reduce the round count for those and spend the budget elsewhere.

Failure mode discovery: Systematic runs with diverse tasks surface failure patterns targeted testing misses. The sub-agent bypass pattern, the context exhaustion pattern, the checkpoint deadlock pattern — all were discovered by the Dispatcher, not by anyone who set out to look for them.

Compliance scoring across many runs: Programmatic scoring is feasible only when outputs are in structured locations. Enforce the structure at launch time, not at scoring time.

Resource safety as a first-class concern: For multi-agent systems consuming significant memory, OOM is the dominant operational failure mode. Build the 3-tier monitor before you need it. The freeze-thaw mechanism (SIGSTOP/SIGCONT) is the single highest-leverage safety mechanism we added.
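The A/B comparison and regression check described above reduce to a per-rule diff of two score files. The sketch assumes each version's scores are stored one rule per line as `rule_id passes total`; that file format and the function name `find_regressions` are illustrative assumptions.

```bash
# find_regressions OLD_SCORES NEW_SCORES: print every rule whose pass rate
# is lower in NEW_SCORES than in OLD_SCORES.
find_regressions() {
  awk '
    # First file: remember the old pass rate per rule.
    NR == FNR { if ($3 > 0) old[$1] = $2 / $3; next }
    # Second file: report rules whose rate dropped.
    ($1 in old) && $3 > 0 && ($2 / $3 < old[$1]) {
      printf "%s: %.1f%% -> %.1f%%\n", $1, 100 * old[$1], 100 * $2 / $3
    }
  ' "$1" "$2"
}
```

The output is one line per regressed rule, which is the list to read first the morning after a spec change.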


12. Conclusion

The second post in this series argued that enforcement mechanism matters more than rule content. The third argued that some rules require judgment, and that structured agent-to-agent discussion is the only way to enforce them. This post adds the prerequisite: none of those insights is reachable without infrastructure that makes evaluation cheap, repeatable, and systematic.

The compliance progression (~58% → ~65% → ~77% → ~90%+) looks like a story about spec changes. It is also, and more fundamentally, a story about measurement infrastructure. Each spec change was an experiment. Each experiment required 24 runs to evaluate. Each evaluation required programmatic scoring. Without a Dispatcher, none of this can happen at the cadence at which ideas arrive.

The Dispatcher is one bash script. It is not clever. Its design is dominated by mundane concerns: queue durability, memory thresholds, completion criteria, log structure. The cleverness is in the cycle it enables: change spec, sleep, analyze, repeat. Daily iteration on multi-agent quality is the unlock. Without it, the system improves at human-attention pace, which is much slower than the rate at which ideas arrive.

The general lesson: in any system where the unit of work is large, non-deterministic, and resource-intensive, the bottleneck to improvement is rarely the system itself. It is the apparatus around the system that makes its behavior measurable. Build the apparatus first. The improvements follow.


Appendix

A. Per-Run Directory Structure

Every run gets its own directory, named by task and round (e.g., t3-r2/). The structure inside is fixed:

t3-r2/
├── event-log.md              # Captain's chronological event log
├── framework.md              # Phase-2 framework
├── handoffs/                 # Per-agent handoffs (one per phase)
├── chat-transcripts/         # Identical text to what user saw at gates
├── final-analysis.md         # COMPLETION MARKER 1
├── deliverable-url.md        # COMPLETION MARKER 2
├── self-learn-scratch.md     # Improve's observations
├── memory-log.txt            # Dispatcher's memory snapshots
└── tmux-pane-snapshots/      # Periodic captures for stall detection

The strict structure is what makes programmatic scoring trivial. A scoring script walks t*-r*/ directories and applies rule-specific greps to known file paths. Adding a new compliance rule means writing one grep and re-running the scorer; the corpus is already structured.
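Creating the skeleton at launch time might look like the sketch below. The function name `init_run_dir` and its arguments are hypothetical; the directory names are the ones listed in the tree above.

```bash
# init_run_dir BASE TASK ROUND: create the fixed skeleton for one run and
# print its path. Files such as final-analysis.md are written later by the
# pipeline itself, so only the subdirectories are created here.
init_run_dir() {
  run_dir="$1/t$2-r$3"
  mkdir -p "$run_dir/handoffs" \
           "$run_dir/chat-transcripts" \
           "$run_dir/tmux-pane-snapshots" || return 1
  printf '%s\n' "$run_dir"
}
```

Because the completion markers are deliberately not created here, their later appearance remains a meaningful signal for completion detection.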

B. Memory Monitor Pseudocode

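# Memory monitor loop; the helper functions are defined elsewhere in the Dispatcher.
# swap_free_pct must be an integer, because [ -lt ] only compares integers.
while dispatcher_running; do
  swap_free_pct=$(get_swap_free_pct)
  agent_total_rss=$(get_agent_total_rss)
  running_runs=$(get_running_runs_count)

  log_memory_snapshot "$swap_free_pct" "$agent_total_rss" "$running_runs"

  if [ "$swap_free_pct" -lt 15 ]; then
    # Freeze tier: block new launches too, in case swap fell past 25% in one poll.
    set_launch_paused
    lightest=$(get_lightest_running_project)
    if [ -n "$lightest" ]; then
      sigstop_project "$lightest"
      log_alert "FROZEN: $lightest at swap=$swap_free_pct%"
    fi
  elif [ "$swap_free_pct" -lt 25 ]; then
    # Pause tier: keep current runs going but launch nothing new.
    set_launch_paused
    log_warn "PAUSED launches at swap=$swap_free_pct%"
  else
    # Recovered: SIGCONT anything frozen and resume launching.
    if any_frozen_project; then
      thaw_frozen_projects
      log_info "THAWED at swap=$swap_free_pct%"
    fi
    clear_launch_paused
  fi

  sleep 120   # poll every two minutes
done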
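The pseudocode leaves `get_swap_free_pct` undefined. One possible implementation is sketched below, under the loud assumption that the host is Linux and exposes `/proc/meminfo`; other systems need a different source. Reporting 100 when no swap is configured is our choice for the sketch, not documented Dispatcher behavior.

```bash
# get_swap_free_pct [MEMINFO_FILE]: print free swap as an integer percentage.
# Reads the Linux /proc/meminfo format (SwapTotal and SwapFree, in kB).
# The optional argument exists so the parser can be tested on a sample file.
get_swap_free_pct() {
  awk '
    $1 == "SwapTotal:" { total = $2 }
    $1 == "SwapFree:"  { free = $2 }
    END {
      if (total > 0) printf "%d\n", 100 * free / total   # truncates to an integer
      else print 100                                     # no swap configured
    }
  ' "${1:-/proc/meminfo}"
}
```

The integer output matters: the monitor compares the value with `[ -lt ]`, which rejects fractional numbers.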

C. Completion Detection Pseudocode

# Classify each in-flight run as completed, failed, or still running.
for run_dir in $(get_running_run_dirs); do
  has_final_analysis=$([ -f "$run_dir/final-analysis.md" ] && echo 1 || echo 0)
  has_deliverable_url=$([ -f "$run_dir/deliverable-url.md" ] && echo 1 || echo 0)
  session_alive=$(tmux_session_alive "$(get_session_name "$run_dir")" && echo 1 || echo 0)
  minutes_since_last_write=$(get_minutes_since_last_file_write "$run_dir")

  if [ "$has_final_analysis" = 1 ] && [ "$has_deliverable_url" = 1 ]; then
    # Both completion markers exist: the run is done, whatever the session state.
    mark_run_completed "$run_dir"
  elif [ "$session_alive" = 0 ]; then
    # Markers missing and the tmux session is gone.
    mark_run_failed "$run_dir" "session_dead"
  elif [ "$minutes_since_last_write" -gt 15 ]; then
    # Session alive but nothing written for over 15 minutes: inspect the pane.
    if pane_has_silent_alive_process "$run_dir"; then
      mark_run_failed "$run_dir" "stalled"
    elif pane_has_dead_process "$run_dir"; then
      mark_run_failed "$run_dir" "crashed"
    fi
  fi
done
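The write-recency check behind the 15-minute stall rule can be sketched with `find`. Rather than returning a minute count like `get_minutes_since_last_file_write`, this hypothetical variant `has_recent_write` answers the yes/no question directly. It relies on the `-mmin` test, which GNU and BSD `find` provide but POSIX does not require.

```bash
# has_recent_write RUN_DIR MINUTES: succeed if any file under RUN_DIR was
# modified less than MINUTES minutes ago, fail otherwise.
has_recent_write() {
  [ -n "$(find "$1" -type f -mmin "-$2" 2>/dev/null | head -n 1)" ]
}
```

A run whose session is alive but for which `has_recent_write "$run_dir" 15` fails is a stall candidate.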

D. Programmatic Compliance Scoring

For each version, the scoring script walks every t*-r*/ directory and applies per-rule grep patterns:

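# score_rule RULE_ID PATTERN TARGET_FILE
# Counts completed runs whose TARGET_FILE matches PATTERN.
# $version is expected to be set by the caller.
score_rule() {
  local rule_id="$1"
  local rule_pattern="$2"
  local rule_target_file="$3"

  local passes=0; local total=0
  for run_dir in $(get_completed_run_dirs "$version"); do
    total=$((total + 1))
    # -e keeps a pattern that begins with "-" from being parsed as an option.
    if grep -q -e "$rule_pattern" "$run_dir/$rule_target_file" 2>/dev/null; then
      passes=$((passes + 1))
    fi
  done

  # Guard the division: bc aborts on a zero divisor when no runs completed.
  if [ "$total" -eq 0 ]; then
    printf "%s: 0/0 (n/a)\n" "$rule_id"
    return
  fi
  # scale=2 lets printf round to one decimal instead of bc truncating.
  printf "%s: %d/%d (%.1f%%)\n" "$rule_id" "$passes" "$total" \
    "$(echo "scale=2; 100 * $passes / $total" | bc)"
}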

The full compliance matrix for a version (16 rules × 24 runs) computes in under 10 seconds. This is what makes the morning-after analysis tractable. Without it, the post-overnight workflow would be dominated by manual scoring — a full day per cycle — and the iteration cadence would degrade back toward manual-evaluation speeds.
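Driving the whole matrix from a rules file keeps the cost of a new rule at one added line. The sketch below is self-contained and does not reuse `score_rule`. The `rule_id|target_file|grep_pattern` file format and the function name `score_matrix` are assumptions made for illustration.

```bash
# score_matrix RULES_FILE RUNS_ROOT: for each rule line in RULES_FILE, count
# how many t*-r*/ run directories under RUNS_ROOT contain a match.
score_matrix() {
  # The last field takes the rest of the line, so patterns may contain "|".
  while IFS='|' read -r rule_id target pattern; do
    [ -n "$rule_id" ] || continue          # skip blank lines
    passes=0; total=0
    for run_dir in "$2"/t*-r*/; do
      [ -d "$run_dir" ] || continue        # the glob matched nothing
      total=$((total + 1))
      if grep -q -e "$pattern" "$run_dir$target" 2>/dev/null; then
        passes=$((passes + 1))
      fi
    done
    printf '%s: %d/%d\n' "$rule_id" "$passes" "$total"
  done < "$1"
}
```

A missing target file counts as a failure for that run, which is the desired behavior when the rule is "this artifact must exist and contain X".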

E. Operational Lessons by Frequency

Table A1: Dispatcher Operational Issues, Ranked by Frequency

| Issue | Frequency | Mitigation |
| --- | --- | --- |
| Memory pressure (swap <25%) | Multiple times per overnight cycle | 3-tier monitor with SIGSTOP/SIGCONT |
| Permission-prompt blocking launch | ~10% of launches | Auto-accept handler attached at session start |
| Pipeline stall at checkpoint | ~5% of runs | Stall detection + re-queue + Watchdog (post 2) |
| Tmux session crash | ~2% of runs | Session liveness check, mark failed, re-queue |
| Memory pressure spike | Rare (when concurrency exceeds hardware capacity) | 3-tier monitor with adaptive concurrency prevents recurrence |
| Disk fill from log accumulation | Rare | Log rotation per evaluation cycle |
| Dispatcher process crash | Rare | Queue is durable; restart resumes from queue state |

The frequency distribution is informative: the dominant operational concern is memory, not logic. Most engineering effort on the Dispatcher went into the 3-tier monitor and the queue durability, not the launching logic. This is the right allocation given the failure mode distribution.


References

The following published work is referenced throughout this post. Where our findings echo this literature, we arrived at the design independently from operational pressure; the convergence is independent validation, not derivation.

  • [Liu et al. 2024] Liu, X., Yu, H., Zhang, H., et al. AgentBench: Evaluating LLMs as Agents. ICLR 2024. arXiv:2308.03688. — 29 LLMs across 8 environments; established TLE as the dominant failure mode (up to 82.5% of failures in some environments) and the 4.5x commercial-vs-open-source gap.

  • [Jimenez et al. 2024] Jimenez, C. E., Yang, J., Wettig, A., et al. SWE-bench: Can Language Models Resolve Real-World GitHub Issues? ICLR 2024. arXiv:2310.06770. — 2,294 real GitHub issues with execution-based grading against repository test suites; canonical example of automated execution-based agent evaluation at scale.

  • [Mialon et al. 2023] Mialon, G., Fourrier, C., Swift, C., et al. GAIA: A Benchmark for General AI Assistants. arXiv:2311.12983, 2023 (HuggingFace). — 466 multi-step questions with deterministic exact-match scoring; 92% human accuracy vs 15% GPT-4-with-plugins at publication.

  • [Zhou et al. 2024] Zhou, S., Xu, F. F., Zhu, H., et al. WebArena: A Realistic Web Environment for Building Autonomous Agents. CMU, arXiv:2307.13854, 2024. — 812 web tasks on self-hosted environments (Reddit, GitLab, shopping) with programmatic functional-correctness validators.

  • [Ma et al. 2024] Ma, C., Zhang, J., Zhu, Z., et al. AgentBoard: An Analytical Evaluation Board of Multi-Turn LLM Agents. NeurIPS 2024 (Oral). arXiv:2401.13178. — 1,013 environments across 9 categories; introduced the continuous progress-rate metric [0,1] for fine-grained per-step quality tracking beyond binary success.

  • [Anthropic 2026] Anthropic Engineering. Demystifying Evals for AI Agents. anthropic.com/engineering/demystifying-evals-for-ai-agents, January 2026. — pass@k / pass^k multi-trial framework; the 42%→95% (+53pp) score correction after fixing grading bugs and environmental confounds; resource-isolation requirement for trial independence.

  • [OpenAI 2026] OpenAI Developer Blog. Evaluating Agent Skills. developers.openai.com/blog/eval-skills, January 2026. — Two-layer grading pipeline (deterministic + model-assisted rubric); “10-20 prompts per skill” as the practical starting point for regression detection; codex exec --json for automation-friendly output streams.