Agent Discussion: The Quality Layer That Harness Engineering Can't Replace

May 29, 2026

This is the third post in a series about building a multi-agent system for complex analytics. AI Agent Teams for Analytics covers the architecture. Best Practices To Keep AI Agents on Track covers harness engineering for rule compliance. This post covers what happens when the harness hits its ceiling.

Summary

Harness engineering hits a ceiling at ~65% system-wide compliance. Hooks and linters enforce programmatic rules at 99-100%, but rules requiring judgment (challenge calibration, cross-source validation, methodology review) plateau at 4-33% with harness alone.
Adding structured agent-to-agent discussion raises system-wide compliance to ~77% (+12pp). The largest single-rule gains come from discussion: Challenge Calibration +66pp, Cross-Source Validation +63pp. No harness change produced gains this large.
Unconditional discussion degrades performance. Research shows a 26pp accuracy drop from mandating debate on every task (Eo et al. 2025). Our architecture gates discussion selectively: hooks for programmatic rules, structured checkpoints for judgment rules.
Discussion requires routing architecture to be reliable. Persistent agents that maintain context and follow checkpoint protocols produce substantive reviews. Disposable sub-agents skip discussion entirely. A routing guard hook ensures all core work flows through persistent agents.
Proactive monitoring completes the stack. A watchdog agent self-polls every 5 minutes, detects checkpoint deadlocks, and alerts the orchestrator. Combined with harness + discussion + routing, this projects ~90%+ system-wide compliance.
No existing framework combines all three. CrewAI has 2 inter-agent primitives and no structured debate. Anthropic’s evaluator-optimizer pattern covers one interaction mode. Across 22 evaluated frameworks, the performance plateau is orchestration-driven, not architecture-driven (Rasheed et al. 2026).

1. Introduction

In Best Practices To Keep AI Agents on Track, we established that the enforcement mechanism for a rule matters more than the rule’s content. Hooks achieved 99-100% compliance, warm-start injection ~90%, and standalone text 50-70%. We improved system-wide compliance from ~58% to ~65% by promoting rules up the enforcement ladder.

But we identified a class of rules that resist promotion. No hook can verify whether a reviewer’s challenge was well-calibrated. No linter can check whether a cross-source validation was substantive or superficial. No gate can assess whether a methodology review caught the right issues. These rules require judgment — the ability to evaluate reasoning in context, weigh tradeoffs, and produce calibrated critique.

The cognitive science literature explains why. Mercier and Sperber (2011) showed that confirmation bias is asymmetric: humans are biased when producing arguments but more objective when evaluating others’. On the Wason selection task, individual reasoning produces 10% accuracy; group debate raises it to 80%. The same asymmetry applies to LLM agents: an agent generating an analysis focuses on supporting evidence; a separate reviewing agent focuses on weaknesses. This is not a workaround — it is how reasoning produces reliable results.

This post presents the data from adding structured agent-to-agent discussion to the harness. The key finding: harness engineering and agent discussion are complementary, not competing. Hooks handle what machines can verify. Discussions handle what requires judgment. The two combined, with a routing architecture that ensures discussion actually happens and a monitoring layer that catches pipeline stalls, produce a compliance stack that addresses the full rule surface.

A concrete example: in one analysis, the reviewing agent caught during a checkpoint discussion that the analyst was using registered users as the denominator for an engagement ratio, instead of active users. This single substitution inflates the denominator by 3-5x, silently suppressing the engagement metric by the same factor. No linter can detect this — both are valid COUNT(DISTINCT userid) queries that return real numbers. The reviewer asked: “Your denominator includes all registered accounts. Shouldn’t this be filtered to active users for an engagement ratio?” The analyst revised, and the final number was correct. This is the class of error that discussion catches and harness engineering cannot.
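To make the size of this effect concrete, here is a small worked example. The counts are hypothetical and chosen only so that the registered-user base is 4x the active-user base, which sits inside the 3-5x range mentioned above; they are not figures from the analysis.

```python
# Hypothetical counts for illustration only; not data from the analysis.
engaged_users = 1_200      # users who performed the engagement action
active_users = 4_000       # the intended denominator
registered_users = 16_000  # the denominator used by mistake (4x larger)


def engagement_ratio(numerator: int, denominator: int) -> float:
    """Share of the denominator population that engaged."""
    if denominator <= 0:
        raise ValueError("denominator must be positive")
    return numerator / denominator


correct = engagement_ratio(engaged_users, active_users)
suppressed = engagement_ratio(engaged_users, registered_users)
suppression_factor = correct / suppressed

print(f"active-user denominator:     {correct:.1%}")
print(f"registered-user denominator: {suppressed:.1%}")
print(f"suppression factor:          {suppression_factor:.1f}x")
```

Both ratios are computed from valid inputs by the same function, which is why a syntactic check has nothing to flag. Only the choice of population differs.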

The progression:

flowchart LR
    V2["V2: Harness only<br/>~65%"] --> V4["V4: + Discussion<br/>~77%"]
    V4 --> V5["V5: + Routing + Monitoring<br/>~90%+"]

    style V2 fill:#ffcdd2,stroke:#C62828
    style V4 fill:#fff9c4,stroke:#F9A825
    style V5 fill:#c8e6c9,stroke:#2E7D32

2. The Problem: Harness Engineering Hits a Ceiling

Observation 1: Some rules resist programmatic enforcement

Our system tracks 16 compliance rules across analytical methodology (A), review quality (B), data handling (D/E), and writing format (W). After applying the full harness toolkit, the compliance distribution was bimodal:

Table 1: Compliance by Enforcement Type (V2, harness only)

Rule	Enforcement Type	Compliance
Step 0 Premise Verification	Hook/gate	100%
Evidence Classification	Warm-rule	100%
Metric Definitions file	Gate	100%
Framework Coverage section	Gate	100%
No Data Fabrication	Hook	100%
Data Sensitivity Flagging	Hook	100%
Structural-Unblock Companion	Warm-rule	83%
Summary Narrative First	Format-check linter	83%
Bullet Hierarchy <=7 L0	Linter (WARN)	79%
Retention Check	Warm-rule	75%
No Bold Table Data	Linter (WARN)	71%
Header Spacing	Linter (partial)	58%
Cross-Source Validation	Standalone text	33%
Challenge Calibration	Standalone text	21%
Knowledge Reuse	Standalone text	13%
Gap Classification	Standalone text	4%

The pattern is clear. Hook/gate rules: 100%. Warm-rules: 75-100%. Standalone text: 4-33%. The bottom four rules share a property that distinguishes them from the top group: they require one agent to exercise judgment about another agent’s work.

flowchart TD
    subgraph High["99-100% Compliance"]
        H1["Hooks & Gates"]
    end
    subgraph Mid["75-90% Compliance"]
        M1["Warm-rules & Linters"]
    end
    subgraph Low["4-33% Compliance"]
        L1["Standalone text<br/>(judgment rules)"]
    end

    H1 ~~~ M1
    M1 ~~~ L1

    style High fill:#c8e6c9,stroke:#2E7D32
    style Mid fill:#fff9c4,stroke:#F9A825
    style Low fill:#ffcdd2,stroke:#C62828

Observation 2: Discussion-dependent rules are the majority of remaining compliance gaps

Of the 10 rules below 100% in V2, six require judgment no programmatic check can verify: challenge calibration (21%), cross-source validation (33%), gap classification (4%), knowledge reuse (13%), retention checks (75%), and structural-unblock companions (83%). Each asks whether an agent exercised judgment appropriately — not whether it produced output, but whether that output was substantive.

These six rules account for roughly 77% of the total compliance gap (weighted by distance from 100%: 371 of the 480 percentage points missing across the ten rules in Table 1). The remaining four below-100% rules are format rules where the linter is configured as WARN rather than BLOCK, a harness tuning issue with a known fix.
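The gap-share figure can be reproduced from Table 1. The sketch below sums each rule's distance from 100% using the rounded percentages in the table, so it comes out at about 77%; small differences from any unrounded calculation are to be expected. The grouping into judgment and format rules follows the paragraph above.

```python
# V2 compliance (%) from Table 1 for the ten rules below 100%.
judgment_rules = {
    "Challenge Calibration": 21,
    "Cross-Source Validation": 33,
    "Gap Classification": 4,
    "Knowledge Reuse": 13,
    "Retention Check": 75,
    "Structural-Unblock Companion": 83,
}
format_rules = {
    "Summary Narrative First": 83,
    "Bullet Hierarchy <=7 L0": 79,
    "No Bold Table Data": 71,
    "Header Spacing": 58,
}


def total_gap(rules: dict) -> int:
    """Sum of each rule's distance from 100%, in percentage points."""
    return sum(100 - compliance for compliance in rules.values())


judgment_gap = total_gap(judgment_rules)
format_gap = total_gap(format_rules)
judgment_share = judgment_gap / (judgment_gap + format_gap)

print(f"judgment gap: {judgment_gap}pp, format gap: {format_gap}pp")
print(f"judgment share of total gap: {judgment_share:.1%}")
```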

Observation 3: Discussion quality depends on agent persistence

Our system uses persistent core agents (maintain state across the pipeline) and disposable sub-agents (spawned for parallel work). The compliance difference is stark:

Table 2: Compliance by Agent Type

Metric	Persistent Core Agent	Disposable Sub-Agent
Reviewer challenge depth	Inlined denominator/date/grain patterns	Generic “looks good” approvals
Cross-source validation	Substantive comparison with discrepancy analysis	Surface-level “validated against Table B”
Checkpoint protocol followed	Full 5-item agenda	Partial or skipped
Output completeness	Full deliverable chain	Partial output only

A freshly spawned sub-agent has no context about the pipeline’s accumulated findings, no relationship with the reviewer, and no checkpoint protocol state. It completes its narrow task and terminates. The discussion that would have caught quality issues never happens.

Observation 4: Unconditional discussion can degrade performance

Eo et al. (2025) tested standard Multi-Agent Debate on StrategyQA: 44.54% accuracy vs. 70.74% for single-agent Chain-of-Thought — a 26pp degradation from unconditional debate. Their DOWN framework, which selectively activates debate based on confidence, reduced agent calls from 9 to 1.4 (6x efficiency) while preserving accuracy.

Li et al. (2024) found that sparse debate (degree D=2/5) achieved 66.0% vs. 64.0% for fully-connected debate with 41.5% cost savings. On hard problems, seeing more reference solutions misleads agents into converging on incorrect answers. Discussion must be conditional and structured, not universal.
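The idea of activating debate only when it is likely to help can be sketched as a confidence gate. This is an illustration of the general technique, not the DOWN implementation and not our own gating logic (which is described in Section 4). The `Answer` type, the stub agents, and the 0.8 threshold are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Callable, Tuple


@dataclass
class Answer:
    text: str
    confidence: float  # assumed to be a calibrated score in [0, 1]


def answer_with_gated_debate(
    question: str,
    solve: Callable[[str], Answer],
    debate: Callable[[str, Answer], Answer],
    threshold: float = 0.8,  # hypothetical cut-off, not a value from Eo et al.
) -> Tuple[Answer, bool]:
    """Return (answer, debated). Debate runs only when confidence is low."""
    first = solve(question)
    if first.confidence >= threshold:
        return first, False  # confident enough: skip the debate entirely
    return debate(question, first), True


# Stub agents so the sketch runs without any model calls.
def stub_solve(question: str) -> Answer:
    confidence = 0.95 if "schema" in question else 0.40
    return Answer(text=f"draft answer to: {question}", confidence=confidence)


def stub_debate(question: str, draft: Answer) -> Answer:
    return Answer(text=draft.text + " (revised after review)", confidence=0.85)


easy, easy_debated = answer_with_gated_debate("schema lookup", stub_solve, stub_debate)
hard, hard_debated = answer_with_gated_debate("why did retention fall?", stub_solve, stub_debate)
print(easy_debated, hard_debated)  # False True
```

The gate is what keeps the average number of agent calls low: most tasks return after a single call, and only the uncertain ones pay for a second opinion.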

Observation 5: Discussion without structured protocol is worse than no discussion

MATEval (Li et al. 2024) tested multi-agent discussion for text evaluation. Their SR+CoT protocol (self-reflection plus chain-of-thought) achieved a +67% relative improvement. But removing the feedback mechanism dropped correlation from 0.391 to 0.259, and removing explanations dropped it to 0.011, which is worse than no discussion. A partial discussion protocol can fall below the automated baseline.

3. Root Cause: Why Harness Alone Can’t Solve Judgment Rules

The Two-Bucket Framework

Bucket 1 — Harness-Enforceable: A programmatic check verifies compliance with >95% accuracy. Examples: “Does the metric definitions file exist?” “Did Step 0 run before Step 1?”

Bucket 2 — Discussion-Dependent: The check itself requires judgment — evaluating reasoning quality, contextual appropriateness, or cross-agent consistency. Examples: “Was the challenge calibrated?” “Is the cross-source validation substantive?”
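The classification rule can be written down directly. In this minimal sketch, `check_accuracy` is an assumed input: the measured accuracy of the best programmatic check you can write for a rule, or `None` when no such check exists. How that accuracy gets measured is outside the sketch.

```python
from enum import Enum
from typing import Optional


class Bucket(Enum):
    HARNESS = 1      # Bucket 1: harness-enforceable
    DISCUSSION = 2   # Bucket 2: discussion-dependent


ACCURACY_THRESHOLD = 0.95  # the ">95% accuracy" criterion from the text


def classify_rule(check_accuracy: Optional[float]) -> Bucket:
    """Classify a rule by how well a programmatic check can verify it."""
    if check_accuracy is not None and check_accuracy > ACCURACY_THRESHOLD:
        return Bucket.HARNESS
    return Bucket.DISCUSSION


# Hypothetical accuracies, for illustration only.
print(classify_rule(0.99))  # a file-existence check
print(classify_rule(None))  # "was the challenge calibrated?" has no such check
```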

Table 3: Two-Bucket Classification with Compliance Data

Rule	Bucket	V2 (Harness)	V4 (+ Discussion)	Delta
Step 0 Premise	1: Harness	100%	100%	0
Evidence Classification	1: Harness	100%	96%	-4pp
Metric Definitions file	1: Harness	100%	100%	0
Framework Coverage	1: Harness	100%	100%	0
No Data Fabrication	1: Harness	100%	100%	0
Data Sensitivity Flagging	1: Harness	100%	100%	0
Summary Narrative First	1: Harness	83%	100%	+17pp
No Bold Table Data	1: Harness	71%	78%	+7pp
Header Spacing	1: Harness	58%	70%	+12pp
Bullet Hierarchy <=7 L0	1: Harness	79%	43%	-36pp
Challenge Calibration	2: Discussion	21%	87%	+66pp
Cross-Source Validation	2: Discussion	33%	96%	+63pp
Retention Check	2: Discussion	75%	100%	+25pp
Structural-Unblock	2: Discussion	83%	100%	+17pp

Bucket 1 rules average about 89% in V2. Bucket 2 rules average 38% in V2 across all six judgment rules from Table 1, or 53% across the four rules in Table 3 that were also measured in V4 (Gap Classification and Knowledge Reuse have no V4 measurement). After adding structured discussion in V4, those four rules average 96%, a +43pp gain on a like-for-like basis, from the mechanism that matches their nature.

The Bullet Hierarchy (W3) regression (-36pp in V4) is instructive: this is a Bucket 1 rule whose linter was not tightened to BLOCK, and the additional discussion context led the writing agent to produce longer outputs that violated the constraint more often. Discussion is not a substitute for a proper linter.
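The bucket averages can be recomputed from Table 3, which makes the comparison basis explicit. The sketch uses the rounded table values. It also shows that the 38% figure corresponds to averaging all six judgment rules from Table 1, including the two that have no V4 measurement, while the four rules measured in both versions start from 53%.

```python
from statistics import mean

# (V2, V4) compliance in percent, copied from Table 3.
bucket1 = {
    "Step 0 Premise": (100, 100),
    "Evidence Classification": (100, 96),
    "Metric Definitions file": (100, 100),
    "Framework Coverage": (100, 100),
    "No Data Fabrication": (100, 100),
    "Data Sensitivity Flagging": (100, 100),
    "Summary Narrative First": (83, 100),
    "No Bold Table Data": (71, 78),
    "Header Spacing": (58, 70),
    "Bullet Hierarchy <=7 L0": (79, 43),
}
bucket2 = {
    "Challenge Calibration": (21, 87),
    "Cross-Source Validation": (33, 96),
    "Retention Check": (75, 100),
    "Structural-Unblock": (83, 100),
}
# Judgment rules from Table 1 that were measured in V2 only.
bucket2_v2_only = {"Gap Classification": 4, "Knowledge Reuse": 13}

b1_v2 = mean([v2 for v2, _ in bucket1.values()])
b1_v4 = mean([v4 for _, v4 in bucket1.values()])
b2_v2 = mean([v2 for v2, _ in bucket2.values()])
b2_v4 = mean([v4 for _, v4 in bucket2.values()])
b2_v2_all_six = mean(
    [v2 for v2, _ in bucket2.values()] + list(bucket2_v2_only.values())
)

print(f"Bucket 1: V2 {b1_v2:.1f}%, V4 {b1_v4:.1f}%")
print(f"Bucket 2 (4 rules in Table 3): V2 {b2_v2:.1f}%, V4 {b2_v4:.1f}%")
print(f"Bucket 2 (all 6 judgment rules), V2 only: {b2_v2_all_six:.1f}%")
```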

The Cognitive Science Parallel

Mercier and Sperber (2011) showed that confirmation bias is asymmetric: strong when producing arguments but more objective when evaluating others’. On the Wason selection task, individual performance hovers around 10% correct; group debate raises it to 80%. This asymmetry is exactly what makes multi-agent discussion work: the generating agent focuses on supporting evidence, the reviewing agent focuses on weaknesses.

Bacchelli and Bird (ICSE 2013) found the same pattern in software: only 14% of code review comments addressed defects; 29% addressed code improvements — alternative approaches and subtle issues the author missed. The highest value of review is not bug-catching (which automated tests handle) but improvements no automated check would flag.

4. Solution: Conditional Discussion Architecture

Our compliance stack has four layers. Each addresses a distinct failure mode.

flowchart TD
    L1["Layer 1: Harness<br/>Hooks, gates, linters<br/>99-100% for programmatic rules"] --> L2["Layer 2: Discussion<br/>Structured checkpoints<br/>85-100% for judgment rules"]
    L2 --> L3["Layer 3: Routing<br/>Guard hook ensures discussion happens<br/>Prevents bypass"]
    L3 --> L4["Layer 4: Monitoring<br/>Watchdog detects stalls<br/>Prevents silent hangs"]

    style L1 fill:#e3f2fd,stroke:#1565C0
    style L2 fill:#fff3e0,stroke:#E65100
    style L3 fill:#f3e5f5,stroke:#7B1FA2
    style L4 fill:#e8f5e9,stroke:#2E7D32

Layer 1: Harness Enforcement (Hooks, Gates, Linters)

Unchanged from the Best Practices To Keep AI Agents on Track. Handles Bucket 1 rules at 99-100%. Pre-execution hooks block progression until conditions are met. Post-execution gates verify outputs before handoff. Format linters check structural rules. Warm-start injection provides reminders at agent startup (~90%). This layer is the foundation — everything that can be verified programmatically.

Layer 2: Structured Discussion at Checkpoints

Handles Bucket 2 rules. This is the new layer that produces the +12pp system-wide gain.

Two-Phase Checkpoint Protocol

Every analysis step passes through a two-phase review:

Quick Check (BLOCKING, 3-5 minutes): The reviewing agent evaluates the step output against a structured agenda before the pipeline advances. The working agent must respond to each finding. Verdict: PASS, REVISE, or BLOCK.
Deep Review (PARALLEL with next step): Extended review runs in the background while the next step begins. Findings feed into cross-step synthesis. This prevents the review from becoming a pipeline bottleneck while still catching issues.

5-Item Reviewer Agenda

The Quick Check uses a structured agenda, not open-ended review:

Agenda Item	Failure Mode It Catches	Example
1. Methodology	Wrong analytical approach	Using ratio metrics when absolute counts are needed
2. Population	Wrong denominator or filter	Eligible users vs. daily active users conflation
3. Data Quality	Stale data, JOIN issues	Querying a table whose partitions expired days ago
4. Evidence Labels	Unsupported claims	“Users clearly prefer X” without statistical test
5. Prescribed Method	Skipped required steps	No retention check before query execution

Why structured agenda > open-ended review: MATEval showed that removing structured feedback drops correlation from 0.391 to 0.259. The agenda ensures every review covers the known failure modes.
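One way to encode the Quick Check so that it cannot silently skip agenda items is sketched below. The agenda names and the three verdicts come from this post. The `Finding` fields and the mapping from open findings to a verdict are assumptions made for illustration, not a description of our production protocol.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Iterable, List


class Verdict(Enum):
    PASS = "PASS"
    REVISE = "REVISE"
    BLOCK = "BLOCK"


AGENDA = (
    "Methodology",
    "Population",
    "Data Quality",
    "Evidence Labels",
    "Prescribed Method",
)


@dataclass
class Finding:
    item: str               # which agenda item raised it
    issue: str              # what the reviewer found
    blocking: bool = False  # hypothetical severity flag
    resolved: bool = False  # set once the working agent fixes or defends it


def quick_check_verdict(reviewed_items: Iterable[str], findings: List[Finding]) -> Verdict:
    """Refuse to issue a verdict unless every agenda item was reviewed."""
    reviewed = set(reviewed_items)
    missing = [item for item in AGENDA if item not in reviewed]
    if missing:
        raise ValueError(f"agenda items not reviewed: {missing}")
    open_findings = [f for f in findings if not f.resolved]
    if any(f.blocking for f in open_findings):
        return Verdict.BLOCK
    if open_findings:
        return Verdict.REVISE
    return Verdict.PASS


findings = [Finding("Population", "denominator includes all registered accounts")]
print(quick_check_verdict(AGENDA, findings).value)  # REVISE
```

Raising on a missing agenda item, rather than defaulting to PASS, is the design choice that matters here: an incomplete review becomes a visible failure instead of a quiet approval.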

What Discussion Actually Catches: Beyond Compliance to Correctness

The compliance metrics capture whether discussion happened and known rules were followed. But the highest-value contribution is catching errors that no rule anticipated:

Error Type	Example	Impact If Missed	Detectable by Linter?
Wrong denominator	Registered users instead of active users	3-5x silent metric suppression	No — both are valid queries
Simpson’s paradox	Metric rises in every sub-group but falls in aggregate	Contradictory recommendation	No — both numbers are correct
Stale data window	Expired partitions, outdated numbers	All findings reflect wrong period	Partially
Superficial cross-validation	“Validated against Source B” without discrepancy analysis	False confidence	No — keyword present
Inappropriate filter	Using experiment eligibility as a population filter	Biased denominator	No — valid SQL

Only another agent asking “what was the discrepancy between sources, and how did you reconcile it?” forces substantive answers.
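The Simpson's paradox row is the least intuitive entry in the table, so here is a small numeric illustration. The segment names and counts are hypothetical; they are constructed so that the rate rises in each segment while the aggregate falls, because the mix shifts toward the low-rate segment.

```python
# (converted, total) per segment; hypothetical numbers for illustration only.
before = {"segment_a": (800, 1000), "segment_b": (40, 200)}
after = {"segment_a": (170, 200), "segment_b": (250, 1000)}


def rate(converted: int, total: int) -> float:
    return converted / total


def aggregate_rate(period: dict) -> float:
    """Pooled rate across all segments in a period."""
    converted = sum(c for c, _ in period.values())
    total = sum(t for _, t in period.values())
    return converted / total


for segment in before:
    print(
        f"{segment}: {rate(*before[segment]):.0%} -> {rate(*after[segment]):.0%}"
    )
print(f"aggregate: {aggregate_rate(before):.0%} -> {aggregate_rate(after):.0%}")
```

Every number printed here is correct, which is the point: the contradiction only surfaces when a reviewer asks why the segment-level and aggregate stories disagree.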

Layer 3: Routing Architecture — Harness Enabling Discussion

Discussion only works when it actually happens. Layer 3 ensures it does. This is the key architectural insight: a harness mechanism (Layer 1) that enables discussion (Layer 2). The two approaches are symbiotic.

Persistent core agents: 8 agents that persist across the pipeline, accumulate context, and follow checkpoint protocols. When the reviewer evaluates Step 3, it has context from Steps 1-2.
Routing guard hook: Intercepts every agent spawn, checks if the task maps to a core agent role, and blocks spawn if so — routing through the persistent agent instead.
Sub-agent scope: Sub-agents handle parallel, independent work (schema lookups, partition checks). Core agents dispatch sub-agents and incorporate results into the discussion flow.
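The routing guard described above can be sketched as a pure function over a spawn request. This is a simplified illustration rather than our hook's actual code: the role names in `CORE_AGENT_ROLES`, and the `SpawnRequest` and `RoutingDecision` types, are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical role names; the real system has 8 core agents, not listed here.
CORE_AGENT_ROLES = {"analyst", "reviewer", "writer"}


@dataclass(frozen=True)
class SpawnRequest:
    role: str  # role the new agent would play
    task: str  # short description of the work


@dataclass(frozen=True)
class RoutingDecision:
    allow_spawn: bool
    route_to: Optional[str]  # persistent agent that should take the task
    reason: str


def routing_guard(request: SpawnRequest) -> RoutingDecision:
    """Block spawns that duplicate a core role; allow independent helper work."""
    if request.role in CORE_AGENT_ROLES:
        return RoutingDecision(
            allow_spawn=False,
            route_to=request.role,
            reason=f"'{request.role}' is a core role; use the persistent agent",
        )
    return RoutingDecision(
        allow_spawn=True,
        route_to=None,
        reason="independent parallel work; sub-agent is fine",
    )


print(routing_guard(SpawnRequest("reviewer", "review step 3 output")))
print(routing_guard(SpawnRequest("schema-lookup", "describe events table")))
```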

Layer 4: Proactive Monitoring (Watchdog)

Discussion introduces checkpoint liveness as a concern. If Agent A sends a message to Agent B and Agent B has stalled, the pipeline hangs indefinitely. No hook fires because no rule was violated.

The watchdog agent addresses this:

Self-polls every 5 minutes
Checks each agent’s last activity timestamp
Detects stalls: >10 minutes silence during active pipeline
Alerts the orchestrator with diagnosis and recovery recommendation
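The stall check itself is simple, as the following sketch suggests. The 5-minute poll interval and the 10-minute threshold come from the list above. The function names, the agent names, and the wording of the alert are assumptions for illustration.

```python
from datetime import datetime, timedelta
from typing import Dict, List

POLL_INTERVAL = timedelta(minutes=5)     # how often the watchdog wakes up
STALL_THRESHOLD = timedelta(minutes=10)  # silence longer than this is a stall


def find_stalled_agents(
    last_activity: Dict[str, datetime],
    now: datetime,
    pipeline_active: bool,
) -> List[str]:
    """Return agents silent for more than the threshold during an active pipeline."""
    if not pipeline_active:
        return []  # idle agents are expected when nothing is running
    return sorted(
        name for name, seen in last_activity.items()
        if now - seen > STALL_THRESHOLD
    )


def build_alert(stalled: List[str], last_activity: Dict[str, datetime], now: datetime) -> str:
    """Format a diagnosis for the orchestrator (message wording is hypothetical)."""
    lines = []
    for name in stalled:
        minutes = int((now - last_activity[name]).total_seconds() // 60)
        lines.append(f"{name}: silent for {minutes} min; recommend re-sending the checkpoint message")
    return "\n".join(lines)


now = datetime(2026, 5, 29, 12, 0)
activity = {
    "analyst": now - timedelta(minutes=2),
    "reviewer": now - timedelta(minutes=14),
}
stalled = find_stalled_agents(activity, now, pipeline_active=True)
print(build_alert(stalled, activity, now))
```

With a 5-minute poll and a 10-minute threshold, a stall is detected at most about 15 minutes after the agent's last activity.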

Composition: Compliance at Each Layer

Table 4: System-Wide Compliance by Architecture Version

Version	Layers Active	System-Wide Avg	Top Improvement
V2	Harness only	~65%	—
V4	Harness + Discussion	~77%	Challenge Calibration +66pp
V5 (est.)	All four layers	~90%+	Routing compliance +35pp

flowchart LR
    V2["V2: ~65%<br/>Harness only"] --> V4["V4: ~77%<br/>+ Discussion<br/>(+12pp)"]
    V4 --> V5["V5: ~90%+<br/>+ Routing + Monitoring<br/>(+13pp)"]

    style V2 fill:#ffcdd2,stroke:#C62828
    style V4 fill:#fff9c4,stroke:#F9A825
    style V5 fill:#c8e6c9,stroke:#2E7D32

Table 5: Per-Rule Compliance Across All Versions

Rule	Code	V2	V4	V5 (est.)	Primary Layer
Step 0 Premise	A1	100%	100%	100%	L1: Hook
Evidence Classification	A3	100%	96%	96%	L1: Warm-rule
Metric Definitions	A11	100%	100%	100%	L1: Gate
Framework Coverage	A12	100%	100%	100%	L1: Gate
No Data Fabrication	E6	100%	100%	100%	L1: Hook
Data Sensitivity Flagging	D6	100%	100%	100%	L1: Hook
Summary Narrative First	W2	83%	100%	100%	L1: Linter + L2: Writer-Reviewer
No Bold Table Data	W5	71%	78%	~80%	L1: Linter (needs BLOCK)
Bullet Hierarchy <=7 L0	W3	79%	43%	~65%	L1: Linter + L3: Routing
Header Spacing	W6	58%	70%	~75%	L1: Linter (needs BLOCK)
Challenge Calibration	B2	21%	87%	~92%	L2: Discussion
Cross-Source Validation	B4	33%	96%	~98%	L2: Discussion
Retention Check	D2	75%	100%	100%	L2: Discussion
Structural-Unblock	A6	83%	100%	100%	L1: Warm-rule + L2: Discussion
Reviewer verdict (DISC-1)	DISC-1	0%	100%	100%	L2: Discussion
5-item agenda (DISC-2)	DISC-2	0%	35%	~70%	L2: Discussion + L3: Routing
Recovery log	NEW	–	87%	~90%	L2: Discussion
Pipeline completion	–	100%	96%	~98%	L1: Gate + L4: Watchdog
Document deliverable	–	N/M	N/M	~90%	L3: Routing

V5 numbers are simulated estimates based on root cause analysis of V4 failures, not measured results.

DISC-2 at 35% in V4 illustrates the layer interaction: the persistent reviewer knows the full 5-item agenda, but without routing enforcement, review sometimes went through less-context-aware paths that checked only 2-3 items. The routing guard (L3) ensures the persistent reviewer — with full agenda context — always handles review.

5. Industry Landscape

Table 6: Agent Discussion Mechanisms Across Frameworks

Framework	Discussion Mechanism	Structured Agenda	Routing Control	Monitoring
Our system	2-phase checkpoint, 5-item agenda	Yes	Routing guard hook	Watchdog (5-min poll)
Anthropic patterns	Evaluator-Optimizer loop	Partial	Manual	No
CrewAI	2 primitives (Delegate, Ask)	No	`allow_delegation=False`	No
22-framework eval (Rasheed et al. 2026)	Task + Verification Agent	No	Per-framework	No

The Gap No Framework Fills

Three capabilities exist independently. No framework combines all three.

Harness enforcement (hooks, gates, linters): Present in most frameworks. Well-understood.
Structured multi-agent discussion: Extensively studied academically. Rarely implemented in production. CrewAI’s two primitives support handoff, not debate. Anthropic’s evaluator-optimizer covers one interaction mode, not a multi-item checkpoint protocol.
Routing architecture ensuring discussion happens: No framework includes a routing guard. Most assume agents will follow designed workflows. Our data shows they don’t unless the harness enforces it.

Why Frameworks Plateau

Rasheed et al. (2026) evaluated 22 frameworks: 12 clustered within 1.4pp (74.57-75.94%). Architectural category did not differentiate. Failures were orchestration-driven: one failed after 11 days from context growth; another consumed $1,434/day from retry loops; a third exhausted API quotas through interactions that increased prompt length without improving answers. Routing guards and watchdogs are orchestration mechanisms — they ensure existing capabilities are reliably activated.

Connection to Academic Research

Our checkpoint protocol implements structured debate with sparse topology: only relevant agents discuss (Li et al. 2024: sparse D=2/5 outperforms fully-connected), discussion activates only for Bucket 2 rules (Eo et al. 2025: adaptive gating reduced calls from 9 to 1.4), and every review follows a structured agenda (MATEval: removing feedback drops correlation from 0.391 to 0.259).

The closest human-systems analogue is the WHO Surgical Safety Checklist (Haynes et al., NEJM 2009): a structured protocol where independent verification by a second party reduced death rates by 47% and complications by 36%. Our checkpoint protocol is the agent-system equivalent: a structured moment where a second agent independently verifies the first’s work before the pipeline advances.

6. Principles for Designing Agent Discussion

Principle 1: Classify every rule into two buckets first

If you can write a programmatic check that verifies compliance with >95% accuracy, it’s Bucket 1. Otherwise, it’s Bucket 2. Misclassification wastes effort both ways: discussion on Bucket 1 rules adds latency without improving compliance; hooks on Bucket 2 rules produce false confidence while substantive violations pass through.

Principle 2: Never use unconditional discussion

Mandating debate on every task cost 26pp relative to single-agent reasoning (Eo et al. 2025). In our system, discussion activates at two points: after each analysis step (reviewer Quick Check) and during final formatting (writer-reviewer conversation). It does not activate for data discovery, query execution, or schema lookups.

Principle 3: Structure every discussion with trigger, agenda, and verdict

Three components: a trigger (event that activates review), an agenda (fixed list of items), and a verdict (PASS/REVISE/BLOCK). Without all three, discussion degrades. An agent told to “review this work” without an agenda produces generic approval. An agent with an agenda but no verdict flags issues without forcing resolution.

Principle 4: Persistent agents for discussion, disposable agents for parallel work

Persistent agents accumulate context and produce calibrated reviews. Disposable sub-agents are stateless. Discussion happens between persistent agents. Sub-agents handle independent parallel work.

Principle 5: Gate discussion at the routing layer, not the agent layer

Relying on agents to self-enforce discussion produces 50-70% compliance. The routing guard enforces discussion at the harness level. The agent cannot choose to skip because the harness structurally prevents the bypass path.

Principle 6: Monitor discussion liveness

Discussion introduces silent stalls: Agent A messages Agent B, Agent B stalls, no hook fires, the pipeline hangs. A watchdog that polls agent activity is essential. Without this, a single stalled checkpoint blocks the pipeline indefinitely.

Principle 7: Discussion catches what linters can’t

The highest-value items are methodology issues no programmatic check detects: wrong denominators (3-5x silent bias), Simpson’s paradox, inappropriate population filters, superficial cross-source claims. A linter checking “did the analyst mention cross-source validation” shows 100% while the validation is superficial. Only another agent asking “what was the discrepancy, and how did you reconcile it?” verifies substantive compliance.

7. Conclusion

The Best Practices To Keep AI Agents on Track established that enforcement mechanism matters more than rule content. This post adds a second finding: some rules can’t be enforced by any programmatic mechanism — they require judgment.

The complete compliance stack:

Hooks and gates for programmatic rules: 99-100%. The foundation.
Structured discussion at checkpoints for judgment rules: 85-100%. The quality layer.
Routing architecture to ensure discussion happens: prevents bypass.
Proactive monitoring to catch liveness failures: prevents silent stalls.

Each layer addresses a failure mode the others cannot. Hooks can’t evaluate methodology quality. Discussion can’t catch missing files. Routing can’t detect stalled agents. Monitoring can’t verify analytical reasoning. The layers compose.

flowchart LR
    A["Hooks<br/>catch missing files"] ~~~ B["Discussion<br/>catches wrong methodology"]
    B ~~~ C["Routing<br/>ensures discussion happens"]
    C ~~~ D["Monitoring<br/>catches stalled agents"]

    style A fill:#e3f2fd,stroke:#1565C0
    style B fill:#fff3e0,stroke:#E65100
    style C fill:#f3e5f5,stroke:#7B1FA2
    style D fill:#e8f5e9,stroke:#2E7D32

The remaining gap (~90% to 100%) is dominated by format rules where the fix is straightforward harness tuning. The judgment rules that motivated this investigation (challenge calibration, cross-source validation, methodology review) measured 87-100% in V4 and are projected at ~92-100% with the full stack.

But compliance is only the measurable proxy. The deeper value of agent discussion is output correctness — catching wrong denominators, Simpson’s paradox, stale data windows, and superficial cross-validations that silently corrupt conclusions. Mercier and Sperber’s asymmetry — biased in production, objective in evaluation — is not just a theoretical parallel. It is the mechanism by which the system produces reliable results.

Appendix

A. The Checkpoint Discussion Protocol

Quick Check Phase (BLOCKING)

TRIGGER:  Analysis step completes, working agent signals ready-for-review
PARTICIPANTS:  Working agent (defender), Reviewing agent (challenger)
TIME BUDGET:  3-5 minutes
PROTOCOL:
  1. Reviewer receives step output + accumulated pipeline context
  2. Reviewer evaluates against 5-item agenda:
     [1] Methodology: Is the analytical approach appropriate?
     [2] Population: Is the denominator correct?
     [3] Data Quality: Partitions fresh? JOIN risks? Missing dimensions?
     [4] Evidence Labels: Claims labeled PROVEN/SUPPORTED/SPECULATIVE?
     [5] Prescribed Method: All required steps followed?
  3. Reviewer issues findings with specific references
  4. Working agent responds: accept + fix, or defend with evidence
  5. Reviewer issues verdict: PASS / REVISE / BLOCK

Deep Review Phase (PARALLEL)

TRIGGER:  Quick Check issues PASS verdict
RUNS:     In background, parallel with next step
COVERS:   Cross-step consistency, cross-source validation depth,
          methodology alternatives
OUTPUT:   Feeds into cross-step synthesis at pipeline end

B. Full Per-Rule Compliance Table

Table B1: Compliance Across All Measured Versions

Rule	Code	Category	V2 (Harness)	V4 (+ Discussion)	V5 (+ Routing + Watchdog, est.)	Enforcement
Step 0 Premise	A1	Methodology	100%	100%	100%	Hook/gate
Evidence Classification	A3	Methodology	100%	96%	96%	Warm-rule
Structural-Unblock	A6	Methodology	83%	100%	100%	Warm-rule + Discussion
Gap Classification	A9	Methodology	4%	–	–	Standalone (debt)
Knowledge Reuse	A10	Methodology	13%	–	–	Standalone (debt)
Metric Definitions	A11	Methodology	100%	100%	100%	Gate
Framework Coverage	A12	Methodology	100%	100%	100%	Gate
Challenge Calibration	B2	Review	21%	87%	~92%	Discussion
Cross-Source Validation	B4	Review	33%	96%	~98%	Discussion
Retention Check	D2	Data	75%	100%	100%	Discussion
Data Sensitivity Flagging	D6	Data	100%	100%	100%	Hook
No Data Fabrication	E6	Data	100%	100%	100%	Hook
Summary Narrative First	W2	Format	83%	100%	100%	Linter + Discussion
Bullet Hierarchy <=7 L0	W3	Format	79%	43%	~65%	Linter (needs BLOCK)
No Bold Table Data	W5	Format	71%	78%	~80%	Linter (needs BLOCK)
Header Spacing	W6	Format	58%	70%	~75%	Linter (needs BLOCK)
Reviewer verdict (DISC-1)	DISC-1	Discussion	0%	100%	100%	Discussion
5-item agenda (DISC-2)	DISC-2	Discussion	0%	35%	~70%	Discussion + Routing
Recovery log	NEW	Execution	–	87%	~90%	Discussion
Pipeline completion	–	Liveness	100%	96%	~98%	Gate + Watchdog
Document deliverable	–	Output	N/M	N/M	~90%	Routing

V5 estimates are based on root cause analysis of V4 failure modes, not measured production results.

C. Testing Methodology

Table C1: Evaluation Setup

Parameter	Value
Total analysis runs evaluated	V2: 24 runs, V4: 23 runs
Rule categories	4 (Methodology, Review, Data, Format)
Rules tracked	16 base + 3 discussion-specific + 2 system metrics
Evaluation method	Manual review of pipeline logs, agent outputs, and final deliverables
V5 estimation method	Root cause analysis of V4 compliance gaps, projected improvement rates
Model	Frontier LLM (model version varied across evaluation period)
Pipeline type	Multi-step analytical investigation (3-7 steps per run)
Discussion protocol	V2: None. V4: 2-phase checkpoint with 5-item agenda
Routing	V2-V4: Agent-level routing. V5: Harness-level routing guard
Monitoring	V2-V4: None. V5: Watchdog (5-min self-poll)

D. References

Du, Y., et al. (2023). Improving Factuality and Reasoning in Language Models through Multiagent Debate. arXiv:2305.14325.
Li, Y., et al. (2024). Improving Multi-Agent Debate with Sparse Communication Topology. arXiv:2406.11776.
Eo, S., et al. (2025). Debate ON Demand: Adaptive Activation of Multi-Agent Debate. arXiv:2504.05047.
Li, Y., et al. (2024). MATEval: A Multi-Agent Discussion Framework for Open-Ended Text Evaluation. arXiv:2403.19305.
Rasheed, B., et al. (2026). Evaluating Multi-Agent Frameworks: A Comprehensive Study of 22 Agentic Systems. arXiv:2604.16646.
Anthropic. (2024). Building Effective Agents. anthropic.com/engineering/building-effective-agents.
Mercier, H. & Sperber, D. (2011). Why Do Humans Reason? Behavioral and Brain Sciences, 34(2), 57-74.
Bacchelli, A. & Bird, C. (2013). Expectations, Outcomes, and Challenges of Modern Code Review. ICSE 2013.
Haynes, A.B., et al. (2009). A Surgical Safety Checklist to Reduce Morbidity and Mortality. NEJM, 360(5), 491-499.
Shinn, N., et al. (2023). Reflexion: Language Agents with Verbal Reinforcement Learning. NeurIPS 2023.

This is the third post in a series. Future posts will cover lessons learned and performance metrics.