Agent Discussion: The Quality Layer That Harness Engineering Can't Replace
This is the third post in a series about building a multi-agent system for complex analytics. AI Agent Teams for Analytics covers the architecture. Best Practices To Keep AI Agents on Track covers harness engineering for rule compliance. This one covers what happens when the harness hits its ceiling.
Summary
- Harness engineering hits a ceiling at ~65% system-wide compliance. Hooks and linters enforce programmatic rules at 99-100%, but rules requiring judgment (challenge calibration, cross-source validation, methodology review) plateau at 4-33% with harness alone.
- Adding structured agent-to-agent discussion raises system-wide compliance to ~77% (+12pp). The largest single-rule gains come from discussion: Challenge Calibration +66pp, Cross-Source Validation +63pp. No harness change produced gains this large.
- Unconditional discussion degrades performance. Eo et al. (2025) measured a 26pp accuracy drop on StrategyQA when debate was mandated on every task. Our architecture gates discussion selectively: hooks for programmatic rules, structured checkpoints for judgment rules.
- Discussion requires routing architecture to be reliable. Persistent agents who maintain context and follow checkpoint protocols produce substantive reviews. Disposable sub-agents skip discussion entirely. A routing guard hook ensures all core work flows through persistent agents.
- Proactive monitoring completes the stack. A watchdog agent self-polls every 5 minutes, detects checkpoint deadlocks, and alerts the orchestrator. Combined with harness, discussion, and routing, the full stack is projected (not yet measured) to reach 90%+ system-wide compliance.
- No existing framework combines all three capabilities: harness enforcement, structured discussion, and routing that ensures discussion happens. CrewAI has 2 inter-agent primitives and no structured debate. Anthropic’s evaluator-optimizer pattern covers one interaction mode. Across 22 evaluated frameworks, the performance plateau is orchestration-driven, not architecture-driven (Rasheed et al. 2026).
1. Introduction
In Best Practices To Keep AI Agents on Track, we established that the enforcement mechanism for a rule matters more than the rule’s content. Hooks achieved 99-100% compliance, warm-start injection ~90%, and standalone text 50-70%. We improved system-wide compliance from ~58% to ~65% by promoting rules up the enforcement ladder.
But we identified a class of rules that resist promotion. No hook can verify whether a reviewer’s challenge was well-calibrated. No linter can check whether a cross-source validation was substantive or superficial. No gate can assess whether a methodology review caught the right issues. These rules require judgment — the ability to evaluate reasoning in context, weigh tradeoffs, and produce calibrated critique.
The cognitive science literature explains why. Mercier and Sperber (2011) showed that confirmation bias is asymmetric: humans are biased when producing arguments but more objective when evaluating others’. On the Wason selection task, individual reasoning produces 10% accuracy; group debate raises it to 80%. The same asymmetry applies to LLM agents: an agent generating an analysis focuses on supporting evidence; a separate reviewing agent focuses on weaknesses. This is not a workaround — it is how reasoning produces reliable results.
This post presents the data from adding structured agent-to-agent discussion to the harness. The key finding: harness engineering and agent discussion are complementary, not competing. Hooks handle what machines can verify. Discussions handle what requires judgment. The two combined, with a routing architecture that ensures discussion actually happens and a monitoring layer that catches pipeline stalls, produce a compliance stack that addresses the full rule surface.
A concrete example: in one analysis, the reviewing agent caught during a checkpoint discussion that the analyst was using registered users as the denominator for an engagement ratio, instead of active users. This single substitution inflates the denominator by 3-5x, silently suppressing the engagement metric by the same factor. No linter can detect this — both are valid COUNT(DISTINCT userid) queries that return real numbers. The reviewer asked: “Your denominator includes all registered accounts. Shouldn’t this be filtered to active users for an engagement ratio?” The analyst revised, and the final number was correct. This is the class of error that discussion catches and harness engineering cannot.
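A minimal sketch of why this error is invisible to a linter. All counts and names below are hypothetical and chosen only for illustration: both denominators are valid distinct-user counts, yet the resulting ratio shrinks by the ratio of the two populations.

```python
# Hypothetical counts for illustration only; not data from the analysis above.
engaged_users = 12_000       # users who performed the engagement action
active_users = 40_000        # appropriate denominator for an engagement ratio
registered_users = 160_000   # wrong denominator: every registered account


def engagement_ratio(engaged: int, denominator: int) -> float:
    """Share of the denominator population that engaged."""
    if denominator <= 0:
        raise ValueError("denominator must be positive")
    return engaged / denominator


correct = engagement_ratio(engaged_users, active_users)
wrong = engagement_ratio(engaged_users, registered_users)

# The suppression factor equals registered_users / active_users.
suppression_factor = correct / wrong

print(f"correct ratio: {correct:.1%}")
print(f"wrong ratio:   {wrong:.1%}")
print(f"suppression:   {suppression_factor:.1f}x")
```

Both calls succeed and return plausible percentages, which is why only a reviewer who asks about the population definition can tell them apart.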
The progression:
flowchart LR
V2["V2: Harness only<br/>~65%"] --> V4["V4: + Discussion<br/>~77%"]
V4 --> V5["V5: + Routing + Monitoring<br/>~90%+"]
style V2 fill:#ffcdd2,stroke:#C62828
style V4 fill:#fff9c4,stroke:#F9A825
style V5 fill:#c8e6c9,stroke:#2E7D32
2. The Problem: Harness Engineering Hits a Ceiling
Observation 1: Some rules resist programmatic enforcement
Our system tracks 16 compliance rules across analytical methodology (A), review quality (B), data handling (D/E), and writing format (W). After applying the full harness toolkit, the compliance distribution was bimodal:
Table 1: Compliance by Enforcement Type (V2, harness only)
| Rule | Enforcement Type | Compliance |
|---|---|---|
| Step 0 Premise Verification | Hook/gate | 100% |
| Evidence Classification | Warm-rule | 100% |
| Metric Definitions file | Gate | 100% |
| Framework Coverage section | Gate | 100% |
| No Data Fabrication | Hook | 100% |
| Data Sensitivity Flagging | Hook | 100% |
| Structural-Unblock Companion | Warm-rule | 83% |
| Summary Narrative First | Format-check linter | 83% |
| Bullet Hierarchy <=7 L0 | Linter (WARN) | 79% |
| Retention Check | Warm-rule | 75% |
| No Bold Table Data | Linter (WARN) | 71% |
| Header Spacing | Linter (partial) | 58% |
| Cross-Source Validation | Standalone text | 33% |
| Challenge Calibration | Standalone text | 21% |
| Knowledge Reuse | Standalone text | 13% |
| Gap Classification | Standalone text | 4% |
The pattern is clear. Hook/gate rules: 100%. Warm-rules: 75-100%. Standalone text: 4-33%. The bottom four rules share a property that distinguishes them from the top group: they require one agent to exercise judgment about another agent’s work.
flowchart TD
subgraph High["99-100% Compliance"]
H1["Hooks & Gates"]
end
subgraph Mid["75-90% Compliance"]
M1["Warm-rules & Linters"]
end
subgraph Low["4-33% Compliance"]
L1["Standalone text<br/>(judgment rules)"]
end
H1 ~~~ M1
M1 ~~~ L1
style High fill:#c8e6c9,stroke:#2E7D32
style Mid fill:#fff9c4,stroke:#F9A825
style Low fill:#ffcdd2,stroke:#C62828
Observation 2: Discussion-dependent rules are the majority of remaining compliance gaps
Of the 10 rules below 100% in V2, six require judgment no programmatic check can verify: challenge calibration (21%), cross-source validation (33%), gap classification (4%), knowledge reuse (13%), retention checks (75%), and structural-unblock companions (83%). Each asks whether an agent exercised judgment appropriately — not whether it produced output, but whether that output was substantive.
These six rules account for 76% of the total compliance gap (weighted by distance from 100%). The remaining four below-100% rules are format rules where the linter is configured as WARN rather than BLOCK — a harness tuning issue with a known fix.
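The gap share can be recomputed from Table 1. The sketch below assumes "weighted by distance from 100%" means summing (100 - compliance) per rule; with the rounded table values it gives roughly 77%, close to the 76% quoted above.

```python
# V2 compliance percentages copied from Table 1 (rules below 100% only).
judgment_rules = {
    "Challenge Calibration": 21,
    "Cross-Source Validation": 33,
    "Gap Classification": 4,
    "Knowledge Reuse": 13,
    "Retention Check": 75,
    "Structural-Unblock Companion": 83,
}
format_rules = {
    "Summary Narrative First": 83,
    "Bullet Hierarchy <=7 L0": 79,
    "No Bold Table Data": 71,
    "Header Spacing": 58,
}


def total_gap(rules: dict) -> int:
    """Sum of percentage points missing from 100% across the given rules."""
    return sum(100 - pct for pct in rules.values())


judgment_gap = total_gap(judgment_rules)
format_gap = total_gap(format_rules)
judgment_share = judgment_gap / (judgment_gap + format_gap)

print(f"judgment gap: {judgment_gap}pp, format gap: {format_gap}pp")
print(f"judgment share of total gap: {judgment_share:.1%}")
```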
Observation 3: Discussion quality depends on agent persistence
Our system uses persistent core agents (maintain state across the pipeline) and disposable sub-agents (spawned for parallel work). The compliance difference is stark:
Table 2: Compliance by Agent Type
| Metric | Persistent Core Agent | Disposable Sub-Agent |
|---|---|---|
| Reviewer challenge depth | Inlined denominator/date/grain patterns | Generic “looks good” approvals |
| Cross-source validation | Substantive comparison with discrepancy analysis | Surface-level “validated against Table B” |
| Checkpoint protocol followed | Full 5-item agenda | Partial or skipped |
| Output completeness | Full deliverable chain | Partial output only |
A freshly spawned sub-agent has no context about the pipeline’s accumulated findings, no relationship with the reviewer, and no checkpoint protocol state. It completes its narrow task and terminates. The discussion that would have caught quality issues never happens.
Observation 4: Unconditional discussion can degrade performance
Eo et al. (2025) tested standard Multi-Agent Debate on StrategyQA: 44.54% accuracy vs. 70.74% for single-agent Chain-of-Thought — a 26pp degradation from unconditional debate. Their DOWN framework, which selectively activates debate based on confidence, reduced agent calls from 9 to 1.4 (6x efficiency) while preserving accuracy.
Li et al. (2024) found that sparse debate (degree D=2/5) achieved 66.0% vs. 64.0% for fully-connected debate with 41.5% cost savings. On hard problems, seeing more reference solutions misleads agents into converging on incorrect answers. Discussion must be conditional and structured, not universal.
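Conditional activation can be sketched as a confidence gate. This is a simplified illustration of the idea, not the DOWN implementation from Eo et al.; the `Answer` type, the `run_debate` callable, and the 0.8 threshold are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Tuple


@dataclass
class Answer:
    text: str
    confidence: float  # assumed to lie in [0, 1]


def answer_with_gated_debate(
    initial: Answer,
    run_debate: Callable[[Answer], Answer],
    threshold: float = 0.8,  # hypothetical cut-off, tune per task
) -> Tuple[Answer, bool]:
    """Return the final answer and whether debate was triggered."""
    if initial.confidence >= threshold:
        # Confident single-agent answer: skip debate entirely.
        return initial, False
    # Low confidence: pay for the multi-agent debate.
    return run_debate(initial), True


def stub_debate(answer: Answer) -> Answer:
    """Stand-in for a real debate round, for demonstration only."""
    return Answer(answer.text + " (revised)", 0.9)


print(answer_with_gated_debate(Answer("yes", 0.95), stub_debate))
print(answer_with_gated_debate(Answer("yes", 0.40), stub_debate))
```

The design choice is that the expensive path runs only when the cheap path signals uncertainty, which is what keeps the average number of agent calls low.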
Observation 5: Discussion without structured protocol is worse than no discussion
MATEval (Li et al. 2024) tested multi-agent discussion for text evaluation. Their SR+CoT protocol achieved a +67% relative improvement. But removing the feedback mechanism dropped correlation from 0.391 to 0.259. Removing explanations dropped it to 0.011 — worse than no discussion. Partial discussion protocols actively degrade below automated baselines.
3. Root Cause: Why Harness Alone Can’t Solve Judgment Rules
The Two-Bucket Framework
Bucket 1 — Harness-Enforceable: A programmatic check verifies compliance with >95% accuracy. Examples: “Does the metric definitions file exist?” “Did Step 0 run before Step 1?”
Bucket 2 — Discussion-Dependent: The check itself requires judgment — evaluating reasoning quality, contextual appropriateness, or cross-agent consistency. Examples: “Was the challenge calibrated?” “Is the cross-source validation substantive?”
Table 3: Two-Bucket Classification with Compliance Data
| Rule | Bucket | V2 (Harness) | V4 (+ Discussion) | Delta |
|---|---|---|---|---|
| Step 0 Premise | 1: Harness | 100% | 100% | 0 |
| Evidence Classification | 1: Harness | 100% | 96% | -4pp |
| Metric Definitions file | 1: Harness | 100% | 100% | 0 |
| Framework Coverage | 1: Harness | 100% | 100% | 0 |
| No Data Fabrication | 1: Harness | 100% | 100% | 0 |
| Data Sensitivity Flagging | 1: Harness | 100% | 100% | 0 |
| Summary Narrative First | 1: Harness | 83% | 100% | +17pp |
| No Bold Table Data | 1: Harness | 71% | 78% | +7pp |
| Header Spacing | 1: Harness | 58% | 70% | +12pp |
| Bullet Hierarchy <=7 L0 | 1: Harness | 79% | 43% | -36pp |
| Challenge Calibration | 2: Discussion | 21% | 87% | +66pp |
| Cross-Source Validation | 2: Discussion | 33% | 96% | +63pp |
| Retention Check | 2: Discussion | 75% | 100% | +25pp |
| Structural-Unblock | 2: Discussion | 83% | 100% | +17pp |
Bucket 1 rules average about 89% in V2. Bucket 2 rules average 38% in V2 when all six judgment rules from Table 1 are counted (including Gap Classification and Knowledge Reuse, which were not measured in V4), or 53% for the four rules listed here. After adding structured discussion in V4, those four average 96%: a +43pp gain on a like-for-like basis (+58pp against the six-rule baseline), from the mechanism that matches their nature.
The Bullet Hierarchy (W3) regression (-36pp in V4) is instructive: this is a Bucket 1 rule whose linter was not tightened to BLOCK, and the additional discussion context led the writing agent to produce longer outputs that violated the constraint more often. Discussion is not a substitute for a proper linter.
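The bucket averages can be recomputed directly from the tables. A short sketch, using unweighted means over the rounded percentages: the four Bucket 2 rows in Table 3 average 53% in V2, and the figure drops to about 38% only when the two V2-only judgment rules from Table 1 are included.

```python
# (V2, V4) compliance percentages copied from Table 3.
bucket1 = {
    "Step 0 Premise": (100, 100),
    "Evidence Classification": (100, 96),
    "Metric Definitions file": (100, 100),
    "Framework Coverage": (100, 100),
    "No Data Fabrication": (100, 100),
    "Data Sensitivity Flagging": (100, 100),
    "Summary Narrative First": (83, 100),
    "No Bold Table Data": (71, 78),
    "Header Spacing": (58, 70),
    "Bullet Hierarchy <=7 L0": (79, 43),
}
bucket2 = {
    "Challenge Calibration": (21, 87),
    "Cross-Source Validation": (33, 96),
    "Retention Check": (75, 100),
    "Structural-Unblock": (83, 100),
}
# Judgment rules measured in V2 only (Table 1); they do not appear in Table 3.
v2_only_judgment = {"Knowledge Reuse": 13, "Gap Classification": 4}


def mean(values) -> float:
    values = list(values)
    return sum(values) / len(values)


b1_v2 = mean(v2 for v2, _ in bucket1.values())
b2_v2_four = mean(v2 for v2, _ in bucket2.values())
b2_v4_four = mean(v4 for _, v4 in bucket2.values())
b2_v2_six = mean(
    [v2 for v2, _ in bucket2.values()] + list(v2_only_judgment.values())
)

print(f"Bucket 1, V2: {b1_v2:.1f}%")
print(f"Bucket 2, V2 (four rules): {b2_v2_four:.1f}%")
print(f"Bucket 2, V2 (six rules):  {b2_v2_six:.1f}%")
print(f"Bucket 2, V4 (four rules): {b2_v4_four:.1f}%")
```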
The Cognitive Science Parallel
Mercier and Sperber (2011) showed that confirmation bias is asymmetric: strong when producing arguments but more objective when evaluating others’. On the Wason selection task, individual performance hovers around 10% correct; group debate raises it to 80%. This asymmetry is exactly what makes multi-agent discussion work: the generating agent focuses on supporting evidence, the reviewing agent focuses on weaknesses.
Bacchelli and Bird (ICSE 2013) found the same pattern in software: only 14% of code review comments addressed defects; 29% addressed code improvements — alternative approaches and subtle issues the author missed. The highest value of review is not bug-catching (which automated tests handle) but improvements no automated check would flag.
4. Solution: Conditional Discussion Architecture
Our compliance stack has four layers. Each addresses a distinct failure mode.
flowchart TD
L1["Layer 1: Harness<br/>Hooks, gates, linters<br/>99-100% for programmatic rules"] --> L2["Layer 2: Discussion<br/>Structured checkpoints<br/>85-100% for judgment rules"]
L2 --> L3["Layer 3: Routing<br/>Guard hook ensures discussion happens<br/>Prevents bypass"]
L3 --> L4["Layer 4: Monitoring<br/>Watchdog detects stalls<br/>Prevents silent hangs"]
style L1 fill:#e3f2fd,stroke:#1565C0
style L2 fill:#fff3e0,stroke:#E65100
style L3 fill:#f3e5f5,stroke:#7B1FA2
style L4 fill:#e8f5e9,stroke:#2E7D32
Layer 1: Harness Enforcement (Hooks, Gates, Linters)
Unchanged from the Best Practices To Keep AI Agents on Track. Handles Bucket 1 rules at 99-100%. Pre-execution hooks block progression until conditions are met. Post-execution gates verify outputs before handoff. Format linters check structural rules. Warm-start injection provides reminders at agent startup (~90%). This layer is the foundation — everything that can be verified programmatically.
Layer 2: Structured Discussion at Checkpoints
Handles Bucket 2 rules. This is the new layer that produces the +12pp system-wide gain.
Two-Phase Checkpoint Protocol
Every analysis step passes through a two-phase review:
- Quick Check (BLOCKING, 3-5 minutes): The reviewing agent evaluates the step output against a structured agenda before the pipeline advances. The working agent must respond to each finding. Verdict: PASS, REVISE, or BLOCK.
- Deep Review (PARALLEL with next step): Extended review runs in the background while the next step begins. Findings feed into cross-step synthesis. This prevents the review from becoming a pipeline bottleneck while still catching issues.
5-Item Reviewer Agenda
The Quick Check uses a structured agenda, not open-ended review:
| Agenda Item | Failure Mode It Catches | Example |
|---|---|---|
| 1. Methodology | Wrong analytical approach | Using ratio metrics when absolute counts are needed |
| 2. Population | Wrong denominator or filter | Eligible users vs. daily active users conflation |
| 3. Data Quality | Stale data, JOIN issues | Querying a table whose partitions expired days ago |
| 4. Evidence Labels | Unsupported claims | “Users clearly prefer X” without statistical test |
| 5. Prescribed Method | Skipped required steps | No retention check before query execution |
Why a structured agenda beats open-ended review: MATEval showed that removing structured feedback drops correlation from 0.391 to 0.259. The agenda ensures every review covers the known failure modes.
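One way to make the agenda binding is to refuse a verdict until every item has been reviewed. The sketch below is illustrative, not the system's actual implementation: the class names, the `blocking` flag, and the mapping from unresolved findings to REVISE or BLOCK are assumptions.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional, Set

# The five agenda items named in the table above.
AGENDA = (
    "Methodology",
    "Population",
    "Data Quality",
    "Evidence Labels",
    "Prescribed Method",
)


class Verdict(Enum):
    PASS = "PASS"
    REVISE = "REVISE"
    BLOCK = "BLOCK"


@dataclass
class Finding:
    agenda_item: str
    note: str
    blocking: bool = False  # hypothetical severity flag
    resolved: bool = False  # set once the working agent fixes or defends


@dataclass
class QuickCheck:
    covered: Set[str] = field(default_factory=set)
    findings: List[Finding] = field(default_factory=list)

    def review(self, agenda_item: str, finding: Optional[Finding] = None) -> None:
        """Mark an agenda item as reviewed, optionally attaching a finding."""
        if agenda_item not in AGENDA:
            raise ValueError(f"unknown agenda item: {agenda_item}")
        self.covered.add(agenda_item)
        if finding is not None:
            self.findings.append(finding)

    def verdict(self) -> Verdict:
        """Issue a verdict only after every agenda item was reviewed."""
        missing = [item for item in AGENDA if item not in self.covered]
        if missing:
            raise RuntimeError(f"agenda items not reviewed: {missing}")
        unresolved = [f for f in self.findings if not f.resolved]
        if any(f.blocking for f in unresolved):
            return Verdict.BLOCK
        return Verdict.REVISE if unresolved else Verdict.PASS
```

With this shape, a reviewer that checks only two or three items cannot produce a PASS, because `verdict()` raises before any verdict is returned.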
What Discussion Actually Catches: Beyond Compliance to Correctness
The compliance metrics capture whether discussion happened and known rules were followed. But the highest-value contribution is catching errors that no rule anticipated:
| Error Type | Example | Impact If Missed | Detectable by Linter? |
|---|---|---|---|
| Wrong denominator | Registered users instead of active users | 3-5x silent metric suppression | No — both are valid queries |
| Simpson’s paradox | Metric rises in every sub-group but falls in aggregate | Contradictory recommendation | No — both numbers are correct |
| Stale data window | Expired partitions, outdated numbers | All findings reflect wrong period | Partially |
| Superficial cross-validation | “Validated against Source B” without discrepancy analysis | False confidence | No — keyword present |
| Inappropriate filter | Using experiment eligibility as a population filter | Biased denominator | No — valid SQL |
Only another agent asking “what was the discrepancy between sources, and how did you reconcile it?” forces substantive answers.
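The Simpson's paradox row is easiest to see with numbers. The counts below are invented for illustration: the rate rises in both segments, yet the aggregate falls because the mix of users shifts toward the low-rate segment.

```python
# Hypothetical (conversions, users) per segment and period.
data = {
    "segment_a": {"before": (80, 100), "after": (17, 20)},
    "segment_b": {"before": (4, 20), "after": (25, 100)},
}


def rate(conversions: int, users: int) -> float:
    return conversions / users


def aggregate_rate(period: str) -> float:
    """Pooled rate across all segments for one period."""
    conversions = sum(seg[period][0] for seg in data.values())
    users = sum(seg[period][1] for seg in data.values())
    return conversions / users


for name, seg in data.items():
    print(f"{name}: {rate(*seg['before']):.2f} -> {rate(*seg['after']):.2f}")
print(f"aggregate: {aggregate_rate('before'):.2f} -> {aggregate_rate('after'):.2f}")
```

Every number here is a correct query result, so no linter can flag the contradiction; a reviewer has to ask which level of aggregation the recommendation rests on.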
Layer 3: Routing Architecture — Harness Enabling Discussion
Discussion only works when it actually happens. Layer 3 ensures it does. This is the key architectural insight: a harness mechanism (Layer 1) that enables discussion (Layer 2). The two approaches are symbiotic.
- Persistent core agents: 8 agents that persist across the pipeline, accumulate context, and follow checkpoint protocols. When the reviewer evaluates Step 3, it has context from Steps 1-2.
- Routing guard hook: Intercepts every agent spawn, checks if the task maps to a core agent role, and blocks spawn if so — routing through the persistent agent instead.
- Sub-agent scope: Sub-agents handle parallel, independent work (schema lookups, partition checks). Core agents dispatch sub-agents and incorporate results into the discussion flow.
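The routing guard described above can be sketched as a pure decision function called before any spawn. The task-type names, the role mapping, and the `SpawnDecision` type are hypothetical; the real hook and its integration point are not shown in this post.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical mapping from task type to the persistent core agent that owns it.
CORE_ROLES = {
    "analysis_step": "analyst",
    "step_review": "reviewer",
    "final_writeup": "writer",
}


@dataclass(frozen=True)
class SpawnDecision:
    allow_spawn: bool
    route_to: Optional[str]
    reason: str


def routing_guard(task_type: str) -> SpawnDecision:
    """Block sub-agent spawns for work that belongs to a core agent."""
    owner = CORE_ROLES.get(task_type)
    if owner is not None:
        return SpawnDecision(
            allow_spawn=False,
            route_to=owner,
            reason=f"'{task_type}' is core work; route to persistent agent '{owner}'",
        )
    return SpawnDecision(
        allow_spawn=True,
        route_to=None,
        reason=f"'{task_type}' is independent work; sub-agent allowed",
    )


print(routing_guard("step_review"))
print(routing_guard("schema_lookup"))
```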
Layer 4: Proactive Monitoring (Watchdog)
Discussion introduces checkpoint liveness as a concern. If Agent A sends a message to Agent B and Agent B has stalled, the pipeline hangs indefinitely. No hook fires because no rule was violated.
The watchdog agent addresses this:
- Self-polls every 5 minutes
- Checks each agent’s last activity timestamp
- Detects stalls: >10 minutes silence during active pipeline
- Alerts the orchestrator with diagnosis and recovery recommendation
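The stall check itself is small. A sketch under the thresholds stated above (5-minute poll, more than 10 minutes of silence); the function name and the shape of the activity map are assumptions, and the polling loop and alert delivery are left out.

```python
from datetime import datetime, timedelta
from typing import Dict, List

POLL_INTERVAL = timedelta(minutes=5)     # from the text: self-poll cadence
STALL_THRESHOLD = timedelta(minutes=10)  # from the text: silence counted as a stall


def find_stalled_agents(
    last_activity: Dict[str, datetime],
    now: datetime,
    pipeline_active: bool,
) -> List[str]:
    """Return agents silent for longer than STALL_THRESHOLD, sorted by name."""
    if not pipeline_active:
        # Silence is expected when no pipeline is running.
        return []
    return sorted(
        name
        for name, timestamp in last_activity.items()
        if now - timestamp > STALL_THRESHOLD
    )


now = datetime(2025, 1, 1, 12, 0)
activity = {
    "analyst": now - timedelta(minutes=3),
    "reviewer": now - timedelta(minutes=12),
}
print(find_stalled_agents(activity, now, pipeline_active=True))
```

A watchdog would call this once per `POLL_INTERVAL` and forward any non-empty result to the orchestrator.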
Composition: Compliance at Each Layer
Table 4: System-Wide Compliance by Architecture Version
| Version | Layers Active | System-Wide Avg | Top Improvement |
|---|---|---|---|
| V2 | Harness only | ~65% | — |
| V4 | Harness + Discussion | ~77% | Challenge Calibration +66pp |
| V5 (est.) | All four layers | ~90%+ | Routing compliance +35pp |
flowchart LR
V2["V2: ~65%<br/>Harness only"] --> V4["V4: ~77%<br/>+ Discussion<br/>(+12pp)"]
V4 --> V5["V5: ~90%+<br/>+ Routing + Monitoring<br/>(+13pp)"]
style V2 fill:#ffcdd2,stroke:#C62828
style V4 fill:#fff9c4,stroke:#F9A825
style V5 fill:#c8e6c9,stroke:#2E7D32
Table 5: Per-Rule Compliance Across All Versions
| Rule | Code | V2 | V4 | V5 (est.) | Primary Layer |
|---|---|---|---|---|---|
| Step 0 Premise | A1 | 100% | 100% | 100% | L1: Hook |
| Evidence Classification | A3 | 100% | 96% | 96% | L1: Warm-rule |
| Metric Definitions | A11 | 100% | 100% | 100% | L1: Gate |
| Framework Coverage | A12 | 100% | 100% | 100% | L1: Gate |
| No Data Fabrication | E6 | 100% | 100% | 100% | L1: Hook |
| Data Sensitivity Flagging | D6 | 100% | 100% | 100% | L1: Hook |
| Summary Narrative First | W2 | 83% | 100% | 100% | L1: Linter + L2: Writer-Reviewer |
| No Bold Table Data | W5 | 71% | 78% | ~80% | L1: Linter (needs BLOCK) |
| Bullet Hierarchy <=7 L0 | W3 | 79% | 43% | ~65% | L1: Linter + L3: Routing |
| Header Spacing | W6 | 58% | 70% | ~75% | L1: Linter (needs BLOCK) |
| Challenge Calibration | B2 | 21% | 87% | ~92% | L2: Discussion |
| Cross-Source Validation | B4 | 33% | 96% | ~98% | L2: Discussion |
| Retention Check | D2 | 75% | 100% | 100% | L2: Discussion |
| Structural-Unblock | A6 | 83% | 100% | 100% | L1: Warm-rule + L2: Discussion |
| Reviewer verdict (DISC-1) | DISC-1 | 0% | 100% | 100% | L2: Discussion |
| 5-item agenda (DISC-2) | DISC-2 | 0% | 35% | ~70% | L2: Discussion + L3: Routing |
| Recovery log | NEW | – | 87% | ~90% | L2: Discussion |
| Pipeline completion | – | 100% | 96% | ~98% | L1: Gate + L4: Watchdog |
| Document deliverable | – | N/M | N/M | ~90% | L3: Routing |
V5 numbers are simulated estimates based on root cause analysis of V4 failures, not measured results.
DISC-2 at 35% in V4 illustrates the layer interaction: the persistent reviewer knows the full 5-item agenda, but without routing enforcement, review sometimes went through less-context-aware paths that checked only 2-3 items. The routing guard (L3) ensures the persistent reviewer — with full agenda context — always handles review.
5. Industry Landscape
Table 6: Agent Discussion Mechanisms Across Frameworks
| Framework | Discussion Mechanism | Structured Agenda | Routing Control | Monitoring |
|---|---|---|---|---|
| Our system | 2-phase checkpoint, 5-item agenda | Yes | Routing guard hook | Watchdog (5-min poll) |
| Anthropic patterns | Evaluator-Optimizer loop | Partial | Manual | No |
| CrewAI | 2 primitives (Delegate, Ask) | No | allow_delegation=False | No |
| 22-framework eval (Rasheed et al. 2026) | Task + Verification Agent | No | Per-framework | No |
The Gap No Framework Fills
Three capabilities exist independently. No framework combines all three.
- Harness enforcement (hooks, gates, linters): Present in most frameworks. Well-understood.
- Structured multi-agent discussion: Extensively studied academically. Rarely implemented in production. CrewAI’s two primitives support handoff, not debate. Anthropic’s evaluator-optimizer covers one interaction mode, not a multi-item checkpoint protocol.
- Routing architecture ensuring discussion happens: No framework includes a routing guard. Most assume agents will follow designed workflows. Our data shows they don’t unless the harness enforces it.
Why Frameworks Plateau
Rasheed et al. (2026) evaluated 22 frameworks: 12 clustered within 1.4pp (74.57-75.94%). Architectural category did not differentiate. Failures were orchestration-driven: one failed after 11 days from context growth; another consumed $1,434/day from retry loops; a third exhausted API quotas through interactions that increased prompt length without improving answers. Routing guards and watchdogs are orchestration mechanisms — they ensure existing capabilities are reliably activated.
Connection to Academic Research
Our checkpoint protocol implements structured debate with sparse topology: only relevant agents discuss (Li et al. 2024: sparse D=2/5 outperforms fully-connected), discussion activates only for Bucket 2 rules (Eo et al. 2025: adaptive gating reduced calls from 9 to 1.4), and every review follows a structured agenda (MATEval: removing feedback drops correlation from 0.391 to 0.259).
The closest human-systems analogue is the WHO Surgical Safety Checklist (Haynes et al., NEJM 2009): a structured protocol where independent verification by a second party reduced death rates by 47% and complications by 36%. Our checkpoint protocol is the agent-system equivalent: a structured moment where a second agent independently verifies the first’s work before the pipeline advances.
6. Principles for Designing Agent Discussion
Principle 1: Classify every rule into two buckets first
If you can write a programmatic check that verifies compliance with >95% accuracy, it’s Bucket 1. Otherwise, it’s Bucket 2. Misclassification wastes effort both ways: discussion on Bucket 1 rules adds latency without improving compliance; hooks on Bucket 2 rules produce false confidence while substantive violations pass through.
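The classification rule reduces to one comparison. A minimal sketch, assuming you have measured how accurately the best available programmatic check verifies a rule; the function name and the example accuracies are hypothetical.

```python
from typing import Optional

ACCURACY_THRESHOLD = 0.95  # from the text: ">95% accuracy"


def classify_rule(check_accuracy: Optional[float]) -> str:
    """Return the bucket for a rule.

    check_accuracy is the measured accuracy of the best programmatic check
    for the rule, or None when no such check can be written.
    """
    if check_accuracy is not None and check_accuracy > ACCURACY_THRESHOLD:
        return "Bucket 1: harness-enforceable"
    return "Bucket 2: discussion-dependent"


# Hypothetical accuracies for illustration.
print(classify_rule(0.99))  # e.g. "does the metric definitions file exist?"
print(classify_rule(0.60))  # e.g. a keyword check for cross-source validation
print(classify_rule(None))  # no programmatic check available
```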
Principle 2: Never use unconditional discussion
Mandating debate on every task cost 26pp of accuracy relative to a single agent (Eo et al. 2025). In our system, discussion activates at two points: after each analysis step (reviewer Quick Check) and during final formatting (writer-reviewer conversation). It does not activate for data discovery, query execution, or schema lookups.
Principle 3: Structure every discussion with trigger, agenda, and verdict
Three components: a trigger (event that activates review), an agenda (fixed list of items), and a verdict (PASS/REVISE/BLOCK). Without all three, discussion degrades. An agent told to “review this work” without an agenda produces generic approval. An agent with an agenda but no verdict flags issues without forcing resolution.
Principle 4: Persistent agents for discussion, disposable agents for parallel work
Persistent agents accumulate context and produce calibrated reviews. Disposable sub-agents are stateless. Discussion happens between persistent agents. Sub-agents handle independent parallel work.
Principle 5: Gate discussion at the routing layer, not the agent layer
Relying on agents to self-enforce discussion produces 50-70% compliance. The routing guard enforces discussion at the harness level. The agent cannot choose to skip because the harness structurally prevents the bypass path.
Principle 6: Monitor discussion liveness
Discussion introduces silent stalls: Agent A messages Agent B, Agent B stalls, no hook fires, the pipeline hangs. A watchdog that polls agent activity is essential. Without this, a single stalled checkpoint blocks the pipeline indefinitely.
Principle 7: Discussion catches what linters can’t
The highest-value items are methodology issues no programmatic check detects: wrong denominators (3-5x silent bias), Simpson’s paradox, inappropriate population filters, superficial cross-source claims. A linter checking “did the analyst mention cross-source validation” shows 100% while the validation is superficial. Only another agent asking “what was the discrepancy, and how did you reconcile it?” verifies substantive compliance.
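To illustrate the false-confidence problem, here is a hypothetical keyword linter of the kind described above. Both report snippets are invented; the linter passes both, although only the second contains a discrepancy analysis.

```python
import re

# Hypothetical linter: passes if the report mentions cross-source validation.
PATTERN = re.compile(r"cross-source validation", re.IGNORECASE)


def keyword_linter_passes(report: str) -> bool:
    return PATTERN.search(report) is not None


superficial = "Cross-source validation: validated against Source B."
substantive = (
    "Cross-source validation: Source A reports 10,412 users and Source B "
    "reports 10,390. The 22-user discrepancy traces to late-arriving events."
)

for label, report in (("superficial", superficial), ("substantive", substantive)):
    print(label, keyword_linter_passes(report))
```

The linter cannot distinguish the two, which is why this rule belongs in Bucket 2.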
7. Conclusion
The Best Practices To Keep AI Agents on Track established that enforcement mechanism matters more than rule content. This post adds a second finding: some rules can’t be enforced by any programmatic mechanism — they require judgment.
The complete compliance stack:
- Hooks and gates for programmatic rules: 99-100%. The foundation.
- Structured discussion at checkpoints for judgment rules: 85-100%. The quality layer.
- Routing architecture to ensure discussion happens: prevents bypass.
- Proactive monitoring to catch liveness failures: prevents silent stalls.
Each layer addresses a failure mode the others cannot. Hooks can’t evaluate methodology quality. Discussion can’t catch missing files. Routing can’t detect stalled agents. Monitoring can’t verify analytical reasoning. The layers compose.
flowchart LR
A["Hooks<br/>catch missing files"] ~~~ B["Discussion<br/>catches wrong methodology"]
B ~~~ C["Routing<br/>ensures discussion happens"]
C ~~~ D["Monitoring<br/>catches stalled agents"]
style A fill:#e3f2fd,stroke:#1565C0
style B fill:#fff3e0,stroke:#E65100
style C fill:#f3e5f5,stroke:#7B1FA2
style D fill:#e8f5e9,stroke:#2E7D32
The remaining gap (~90% to 100%) is dominated by format rules where the fix is straightforward harness tuning. The judgment rules that motivated this investigation — challenge calibration, cross-source validation, methodology review — are at 87-100% with the full stack.
But compliance is only the measurable proxy. The deeper value of agent discussion is output correctness — catching wrong denominators, Simpson’s paradox, stale data windows, and superficial cross-validations that silently corrupt conclusions. Mercier and Sperber’s asymmetry — biased in production, objective in evaluation — is not just a theoretical parallel. It is the mechanism by which the system produces reliable results.
Appendix
A. The Checkpoint Discussion Protocol
Quick Check Phase (BLOCKING)
TRIGGER: Analysis step completes, working agent signals ready-for-review
PARTICIPANTS: Working agent (defender), Reviewing agent (challenger)
TIME BUDGET: 3-5 minutes
PROTOCOL:
1. Reviewer receives step output + accumulated pipeline context
2. Reviewer evaluates against 5-item agenda:
[1] Methodology: Is the analytical approach appropriate?
[2] Population: Is the denominator correct?
[3] Data Quality: Partitions fresh? JOIN risks? Missing dimensions?
[4] Evidence Labels: Claims labeled PROVEN/SUPPORTED/SPECULATIVE?
[5] Prescribed Method: All required steps followed?
3. Reviewer issues findings with specific references
4. Working agent responds: accept + fix, or defend with evidence
5. Reviewer issues verdict: PASS / REVISE / BLOCK
Deep Review Phase (PARALLEL)
TRIGGER: Quick Check issues PASS verdict
RUNS: In background, parallel with next step
COVERS: Cross-step consistency, cross-source validation depth,
methodology alternatives
OUTPUT: Feeds into cross-step synthesis at pipeline end
B. Full Per-Rule Compliance Table
Table B1: Compliance Across All Measured Versions
| Rule | Code | Category | V2 (Harness) | V4 (+ Discussion) | V5 (+ Routing + Watchdog, est.) | Enforcement |
|---|---|---|---|---|---|---|
| Step 0 Premise | A1 | Methodology | 100% | 100% | 100% | Hook/gate |
| Evidence Classification | A3 | Methodology | 100% | 96% | 96% | Warm-rule |
| Structural-Unblock | A6 | Methodology | 83% | 100% | 100% | Warm-rule + Discussion |
| Gap Classification | A9 | Methodology | 4% | – | – | Standalone (debt) |
| Knowledge Reuse | A10 | Methodology | 13% | – | – | Standalone (debt) |
| Metric Definitions | A11 | Methodology | 100% | 100% | 100% | Gate |
| Framework Coverage | A12 | Methodology | 100% | 100% | 100% | Gate |
| Challenge Calibration | B2 | Review | 21% | 87% | ~92% | Discussion |
| Cross-Source Validation | B4 | Review | 33% | 96% | ~98% | Discussion |
| Retention Check | D2 | Data | 75% | 100% | 100% | Discussion |
| Data Sensitivity Flagging | D6 | Data | 100% | 100% | 100% | Hook |
| No Data Fabrication | E6 | Data | 100% | 100% | 100% | Hook |
| Summary Narrative First | W2 | Format | 83% | 100% | 100% | Linter + Discussion |
| Bullet Hierarchy <=7 L0 | W3 | Format | 79% | 43% | ~65% | Linter (needs BLOCK) |
| No Bold Table Data | W5 | Format | 71% | 78% | ~80% | Linter (needs BLOCK) |
| Header Spacing | W6 | Format | 58% | 70% | ~75% | Linter (needs BLOCK) |
| Reviewer verdict (DISC-1) | DISC-1 | Discussion | 0% | 100% | 100% | Discussion |
| 5-item agenda (DISC-2) | DISC-2 | Discussion | 0% | 35% | ~70% | Discussion + Routing |
| Recovery log | NEW | Execution | – | 87% | ~90% | Discussion |
| Pipeline completion | – | Liveness | 100% | 96% | ~98% | Gate + Watchdog |
| Document deliverable | – | Output | N/M | N/M | ~90% | Routing |
V5 estimates are based on root cause analysis of V4 failure modes, not measured production results.
C. Testing Methodology
Table C1: Evaluation Setup
| Parameter | Value |
|---|---|
| Total analysis runs evaluated | V2: 24 runs, V4: 23 runs |
| Rule categories | 4 (Methodology, Review, Data, Format) |
| Rules tracked | 16 base + 3 discussion-specific + 2 system metrics |
| Evaluation method | Manual review of pipeline logs, agent outputs, and final deliverables |
| V5 estimation method | Root cause analysis of V4 compliance gaps, projected improvement rates |
| Model | Frontier LLM (model version varied across evaluation period) |
| Pipeline type | Multi-step analytical investigation (3-7 steps per run) |
| Discussion protocol | V2: None. V4: 2-phase checkpoint with 5-item agenda |
| Routing | V2-V4: Agent-level routing. V5: Harness-level routing guard |
| Monitoring | V2-V4: None. V5: Watchdog (5-min self-poll) |
D. References
- Du, Y., et al. (2023). Improving Factuality and Reasoning in Language Models through Multiagent Debate. arXiv:2305.14325.
- Li, Y., et al. (2024). Improving Multi-Agent Debate with Sparse Communication Topology. arXiv:2406.11776.
- Eo, S., et al. (2025). Debate ON Demand: Adaptive Activation of Multi-Agent Debate. arXiv:2504.05047.
- Li, Y., et al. (2024). MATEval: A Multi-Agent Discussion Framework for Open-Ended Text Evaluation. arXiv:2403.19305.
- Rasheed, B., et al. (2026). Evaluating Multi-Agent Frameworks: A Comprehensive Study of 22 Agentic Systems. arXiv:2604.16646.
- Anthropic. (2024). Building Effective Agents. anthropic.com/engineering/building-effective-agents.
- Mercier, H. & Sperber, D. (2011). Why Do Humans Reason? Behavioral and Brain Sciences, 34(2), 57-74.
- Bacchelli, A. & Bird, C. (2013). Expectations, Outcomes, and Challenges of Modern Code Review. ICSE 2013.
- Haynes, A.B., et al. (2009). A Surgical Safety Checklist to Reduce Morbidity and Mortality. NEJM, 360(5), 491-499.
- Shinn, N., et al. (2023). Reflexion: Language Agents with Verbal Reinforcement Learning. NeurIPS 2023.
This is the third post in a series. Future posts will cover lessons learned and performance metrics.