RZ AI Learning

Agent Discussion: The Quality Layer That Harness Engineering Can't Replace

This is the third post in a series about building a multi-agent system for complex analytics. The first post, AI Agent Teams for Analytics, covers the architecture. The second, Best Practices To Keep AI Agents on Track, covers harness engineering for rule compliance. This one covers what happens when the harness hits its ceiling.


Summary

  1. Harness engineering hits a ceiling at ~65% system-wide compliance. Hooks and linters enforce programmatic rules at 99-100%, but rules requiring judgment (challenge calibration, cross-source validation, methodology review) plateau at 4-33% with harness alone.
  2. Adding structured agent-to-agent discussion raises system-wide compliance to ~77% (+12pp). The largest single-rule gains come from discussion: Challenge Calibration +66pp, Cross-Source Validation +63pp. No harness change produced gains this large.
  3. Unconditional discussion degrades performance. Research shows a 26pp accuracy drop from mandating debate on every task (Eo et al. 2025). Our architecture gates discussion selectively: hooks for programmatic rules, structured checkpoints for judgment rules.
  4. Discussion requires routing architecture to be reliable. Persistent agents who maintain context and follow checkpoint protocols produce substantive reviews. Disposable sub-agents skip discussion entirely. A routing guard hook ensures all core work flows through persistent agents.
  5. Proactive monitoring completes the stack. A watchdog agent self-polls every 5 minutes, detects checkpoint deadlocks, and alerts the orchestrator. With harness, discussion, and routing also in place, we project ~90%+ system-wide compliance (a simulated estimate, not a measured result).
  6. No existing framework combines all three. CrewAI has 2 inter-agent primitives and no structured debate. Anthropic’s evaluator-optimizer pattern covers one interaction mode. Across 22 evaluated frameworks, the performance plateau is orchestration-driven, not architecture-driven (Rasheed et al. 2026).

1. Introduction

In the previous post, Best Practices To Keep AI Agents on Track, we established that the enforcement mechanism for a rule matters more than the rule’s content. Hooks achieved 99-100% compliance, warm-start injection ~90%, and standalone text 50-70%. We improved system-wide compliance from ~58% to ~65% by promoting rules up the enforcement ladder.

But we identified a class of rules that resist promotion. No hook can verify whether a reviewer’s challenge was well-calibrated. No linter can check whether a cross-source validation was substantive or superficial. No gate can assess whether a methodology review caught the right issues. These rules require judgment — the ability to evaluate reasoning in context, weigh tradeoffs, and produce calibrated critique.

The cognitive science literature explains why. Mercier and Sperber (2011) showed that confirmation bias is asymmetric: humans are biased when producing arguments but more objective when evaluating others’. On the Wason selection task, individual reasoning produces 10% accuracy; group debate raises it to 80%. The same asymmetry applies to LLM agents: an agent generating an analysis focuses on supporting evidence; a separate reviewing agent focuses on weaknesses. This is not a workaround — it is how reasoning produces reliable results.

This post presents the data from adding structured agent-to-agent discussion to the harness. The key finding: harness engineering and agent discussion are complementary, not competing. Hooks handle what machines can verify. Discussions handle what requires judgment. The two combined, with a routing architecture that ensures discussion actually happens and a monitoring layer that catches pipeline stalls, produce a compliance stack that addresses the full rule surface.

A concrete example: in one analysis, the reviewing agent caught during a checkpoint discussion that the analyst was using registered users as the denominator for an engagement ratio, instead of active users. This single substitution inflates the denominator by 3-5x, silently suppressing the engagement metric by the same factor. No linter can detect this — both are valid COUNT(DISTINCT userid) queries that return real numbers. The reviewer asked: “Your denominator includes all registered accounts. Shouldn’t this be filtered to active users for an engagement ratio?” The analyst revised, and the final number was correct. This is the class of error that discussion catches and harness engineering cannot.
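To make the size of that error concrete, here is a minimal sketch with invented numbers (the counts and variable names are illustrative assumptions, not data from the analysis described above). Both ratios come from valid distinct-user counts; only the choice of denominator differs.

```python
# Illustrative counts only; none of these numbers come from the real analysis.
engaged_users = 12_000       # users who performed the engagement action
active_users = 40_000        # users active in the period (intended denominator)
registered_users = 160_000   # all registered accounts (denominator used by mistake)

correct_ratio = engaged_users / active_users      # 0.30
wrong_ratio = engaged_users / registered_users    # 0.075

# The metric is suppressed by the same factor the denominator was inflated by.
inflation = registered_users / active_users
print(f"correct={correct_ratio:.3f} wrong={wrong_ratio:.3f} inflation={inflation:.1f}x")
```

Both numbers look plausible in isolation, which is why a format or syntax check has nothing to flag.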

The progression:

flowchart LR
    V2["V2: Harness only<br/>~65%"] --> V4["V4: + Discussion<br/>~77%"]
    V4 --> V5["V5: + Routing + Monitoring<br/>~90%+"]

    style V2 fill:#ffcdd2,stroke:#C62828
    style V4 fill:#fff9c4,stroke:#F9A825
    style V5 fill:#c8e6c9,stroke:#2E7D32

2. The Problem: Harness Engineering Hits a Ceiling

Observation 1: Some rules resist programmatic enforcement

Our system tracks 16 compliance rules across analytical methodology (A), review quality (B), data handling (D/E), and writing format (W). After applying the full harness toolkit, the compliance distribution was bimodal:

Table 1: Compliance by Enforcement Type (V2, harness only)

| Rule | Enforcement Type | Compliance |
|---|---|---|
| Step 0 Premise Verification | Hook/gate | 100% |
| Evidence Classification | Warm-rule | 100% |
| Metric Definitions file | Gate | 100% |
| Framework Coverage section | Gate | 100% |
| No Data Fabrication | Hook | 100% |
| Data Sensitivity Flagging | Hook | 100% |
| Structural-Unblock Companion | Warm-rule | 83% |
| Summary Narrative First | Format-check linter | 83% |
| Bullet Hierarchy <=7 L0 | Linter (WARN) | 79% |
| Retention Check | Warm-rule | 75% |
| No Bold Table Data | Linter (WARN) | 71% |
| Header Spacing | Linter (partial) | 58% |
| Cross-Source Validation | Standalone text | 33% |
| Challenge Calibration | Standalone text | 21% |
| Knowledge Reuse | Standalone text | 13% |
| Gap Classification | Standalone text | 4% |

The pattern is clear. Hook/gate rules: 100%. Warm-rules: 75-100%. Standalone text: 4-33%. The bottom four rules share a property that distinguishes them from the top group: they require one agent to exercise judgment about another agent’s work.

flowchart TD
    subgraph High["99-100% Compliance"]
        H1["Hooks & Gates"]
    end
    subgraph Mid["75-90% Compliance"]
        M1["Warm-rules & Linters"]
    end
    subgraph Low["4-33% Compliance"]
        L1["Standalone text<br/>(judgment rules)"]
    end

    H1 ~~~ M1
    M1 ~~~ L1

    style High fill:#c8e6c9,stroke:#2E7D32
    style Mid fill:#fff9c4,stroke:#F9A825
    style Low fill:#ffcdd2,stroke:#C62828

Observation 2: Discussion-dependent rules are the majority of remaining compliance gaps

Of the 10 rules below 100% in V2, six require judgment no programmatic check can verify: challenge calibration (21%), cross-source validation (33%), gap classification (4%), knowledge reuse (13%), retention checks (75%), and structural-unblock companions (83%). Each asks whether an agent exercised judgment appropriately — not whether it produced output, but whether that output was substantive.

These six rules account for 76% of the total compliance gap (weighted by distance from 100%). The remaining four below-100% rules are format rules where the linter is configured as WARN rather than BLOCK — a harness tuning issue with a known fix.
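The gap share can be checked directly from Table 1. The sketch below sums each rule's distance from 100% using the rounded table values; with those inputs the judgment rules come out at roughly 77% of the gap, close to the 76% quoted above (the small difference likely reflects rounding in the table values).

```python
# V2 compliance (%) copied from Table 1, grouped as in Observation 2.
judgment_rules = {
    "Challenge Calibration": 21,
    "Cross-Source Validation": 33,
    "Gap Classification": 4,
    "Knowledge Reuse": 13,
    "Retention Check": 75,
    "Structural-Unblock Companion": 83,
}
format_rules = {
    "Summary Narrative First": 83,
    "Bullet Hierarchy <=7 L0": 79,
    "No Bold Table Data": 71,
    "Header Spacing": 58,
}

def total_gap(rules: dict) -> int:
    """Sum of each rule's distance from 100% compliance, in percentage points."""
    return sum(100 - pct for pct in rules.values())

judgment_gap = total_gap(judgment_rules)
format_gap = total_gap(format_rules)
share = judgment_gap / (judgment_gap + format_gap)
print(f"judgment gap={judgment_gap}pp, format gap={format_gap}pp, share={share:.1%}")
```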

Observation 3: Discussion quality depends on agent persistence

Our system uses persistent core agents (maintain state across the pipeline) and disposable sub-agents (spawned for parallel work). The compliance difference is stark:

Table 2: Compliance by Agent Type

| Metric | Persistent Core Agent | Disposable Sub-Agent |
|---|---|---|
| Reviewer challenge depth | Inlined denominator/date/grain patterns | Generic “looks good” approvals |
| Cross-source validation | Substantive comparison with discrepancy analysis | Surface-level “validated against Table B” |
| Checkpoint protocol followed | Full 5-item agenda | Partial or skipped |
| Output completeness | Full deliverable chain | Partial output only |

A freshly spawned sub-agent has no context about the pipeline’s accumulated findings, no relationship with the reviewer, and no checkpoint protocol state. It completes its narrow task and terminates. The discussion that would have caught quality issues never happens.

Observation 4: Unconditional discussion can degrade performance

Eo et al. (2025) tested standard Multi-Agent Debate on StrategyQA: 44.54% accuracy vs. 70.74% for single-agent Chain-of-Thought — a 26pp degradation from unconditional debate. Their DOWN framework, which selectively activates debate based on confidence, reduced agent calls from 9 to 1.4 (6x efficiency) while preserving accuracy.

Li et al. (2024) found that sparse debate (degree D=2/5) achieved 66.0% vs. 64.0% for fully-connected debate with 41.5% cost savings. On hard problems, seeing more reference solutions misleads agents into converging on incorrect answers. Discussion must be conditional and structured, not universal.
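The idea of conditional activation can be sketched in a few lines. This is not the DOWN implementation from Eo et al.; it is a minimal illustration that assumes the first agent returns a confidence score alongside its answer. The function names, the stub agents, and the 0.8 threshold are all hypothetical.

```python
from typing import Callable, Tuple

def answer_with_optional_debate(
    question: str,
    single_agent: Callable[[str], Tuple[str, float]],
    debate: Callable[[str, str], str],
    confidence_threshold: float = 0.8,  # hypothetical value, tune per task
) -> Tuple[str, bool]:
    """Return (answer, debated); debate runs only when the draft is low-confidence."""
    draft, confidence = single_agent(question)
    if confidence >= confidence_threshold:
        return draft, False  # confident draft: skip debate and its extra agent calls
    return debate(question, draft), True

# Stub agents for illustration; real agents would call a model.
confident_agent = lambda q: ("answer A", 0.93)
unsure_agent = lambda q: ("answer B", 0.41)
stub_debate = lambda q, draft: f"{draft} (revised after debate)"

print(answer_with_optional_debate("q1", confident_agent, stub_debate))
print(answer_with_optional_debate("q2", unsure_agent, stub_debate))
```

The design choice is that the cheap path is the default and debate is the exception, which is where the reduction in agent calls comes from.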

Observation 5: Discussion without structured protocol is worse than no discussion

MATEval (Li et al. 2024) tested multi-agent discussion for text evaluation. Their SR+CoT protocol achieved a +67% relative improvement. But removing the feedback mechanism dropped correlation from 0.391 to 0.259. Removing explanations dropped it to 0.011 — worse than no discussion. Partial discussion protocols actively degrade below automated baselines.


3. Root Cause: Why Harness Alone Can’t Solve Judgment Rules

The Two-Bucket Framework

Bucket 1 — Harness-Enforceable: A programmatic check verifies compliance with >95% accuracy. Examples: “Does the metric definitions file exist?” “Did Step 0 run before Step 1?”

Bucket 2 — Discussion-Dependent: The check itself requires judgment — evaluating reasoning quality, contextual appropriateness, or cross-agent consistency. Examples: “Was the challenge calibrated?” “Is the cross-source validation substantive?”
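The classification rule can be written down as a small function. This is a sketch under assumptions: the `Rule` type and the accuracy values are hypothetical, and in practice the accuracy of a programmatic check would have to be measured against labeled runs.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Rule:
    name: str
    # Measured accuracy of the best available programmatic check,
    # or None when no programmatic check exists for the rule.
    check_accuracy: Optional[float]

def classify(rule: Rule, threshold: float = 0.95) -> str:
    """Bucket 1 if a programmatic check exceeds the accuracy threshold, else Bucket 2."""
    if rule.check_accuracy is not None and rule.check_accuracy > threshold:
        return "Bucket 1: Harness"
    return "Bucket 2: Discussion"

# Accuracy values below are illustrative, not measurements.
print(classify(Rule("Metric Definitions file", 1.0)))
print(classify(Rule("Challenge Calibration", None)))
```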

Table 3: Two-Bucket Classification with Compliance Data

| Rule | Bucket | V2 (Harness) | V4 (+ Discussion) | Delta |
|---|---|---|---|---|
| Step 0 Premise | 1: Harness | 100% | 100% | 0 |
| Evidence Classification | 1: Harness | 100% | 96% | -4pp |
| Metric Definitions file | 1: Harness | 100% | 100% | 0 |
| Framework Coverage | 1: Harness | 100% | 100% | 0 |
| No Data Fabrication | 1: Harness | 100% | 100% | 0 |
| Data Sensitivity Flagging | 1: Harness | 100% | 100% | 0 |
| Summary Narrative First | 1: Harness | 83% | 100% | +17pp |
| No Bold Table Data | 1: Harness | 71% | 78% | +7pp |
| Header Spacing | 1: Harness | 58% | 70% | +12pp |
| Bullet Hierarchy <=7 L0 | 1: Harness | 79% | 43% | -36pp |
| Challenge Calibration | 2: Discussion | 21% | 87% | +66pp |
| Cross-Source Validation | 2: Discussion | 33% | 96% | +63pp |
| Retention Check | 2: Discussion | 75% | 100% | +25pp |
| Structural-Unblock | 2: Discussion | 83% | 100% | +17pp |

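Bucket 1 rules average about 89% in V2. The four Bucket 2 rules in Table 3 average 53% in V2; including Gap Classification and Knowledge Reuse, which have no V4 measurement, the six judgment rules average 38%. After adding structured discussion in V4, the four measured Bucket 2 rules average 96%, a +43pp gain on a like-for-like basis from the mechanism that matches their nature.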
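The Bucket 2 averages can be reproduced from Tables 1 and 3. The sketch below uses the rounded table values and shows that a 38% V2 average corresponds to all six judgment rules, while the four rules that also have V4 measurements average 53% in V2 and about 96% in V4.

```python
from statistics import mean

# (V2, V4) compliance in % for the four Bucket 2 rules listed in Table 3.
bucket2 = {
    "Challenge Calibration": (21, 87),
    "Cross-Source Validation": (33, 96),
    "Retention Check": (75, 100),
    "Structural-Unblock": (83, 100),
}
# Judgment rules from Table 1 that have no V4 measurement.
v2_only = {"Gap Classification": 4, "Knowledge Reuse": 13}

v2_four = mean([v2 for v2, _ in bucket2.values()])
v4_four = mean([v4 for _, v4 in bucket2.values()])
v2_six = mean([v2 for v2, _ in bucket2.values()] + list(v2_only.values()))

print(f"V2, four measured rules: {v2_four:.1f}%")
print(f"V4, four measured rules: {v4_four:.1f}%")
print(f"V2, all six judgment rules: {v2_six:.1f}%")
print(f"like-for-like gain: {v4_four - v2_four:+.1f}pp")
```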

The Bullet Hierarchy (W3) regression (-36pp in V4) is instructive: this is a Bucket 1 rule where the linter was not tightened to BLOCK, and the additional discussion context led the writing agent to produce longer outputs that violated the constraint more often. Discussion is not a substitute for a proper linter.

The Cognitive Science Parallel

Mercier and Sperber (2011) showed that confirmation bias is asymmetric: strong when producing arguments but more objective when evaluating others’. On the Wason selection task, individual performance hovers around 10% correct; group debate raises it to 80%. This asymmetry is exactly what makes multi-agent discussion work: the generating agent focuses on supporting evidence, the reviewing agent focuses on weaknesses.

Bacchelli and Bird (ICSE 2013) found the same pattern in software: only 14% of code review comments addressed defects; 29% addressed code improvements — alternative approaches and subtle issues the author missed. The highest value of review is not bug-catching (which automated tests handle) but improvements no automated check would flag.


4. Solution: Conditional Discussion Architecture

Our compliance stack has four layers. Each addresses a distinct failure mode.

flowchart TD
    L1["Layer 1: Harness<br/>Hooks, gates, linters<br/>99-100% for programmatic rules"] --> L2["Layer 2: Discussion<br/>Structured checkpoints<br/>85-100% for judgment rules"]
    L2 --> L3["Layer 3: Routing<br/>Guard hook ensures discussion happens<br/>Prevents bypass"]
    L3 --> L4["Layer 4: Monitoring<br/>Watchdog detects stalls<br/>Prevents silent hangs"]

    style L1 fill:#e3f2fd,stroke:#1565C0
    style L2 fill:#fff3e0,stroke:#E65100
    style L3 fill:#f3e5f5,stroke:#7B1FA2
    style L4 fill:#e8f5e9,stroke:#2E7D32

Layer 1: Harness Enforcement (Hooks, Gates, Linters)

Unchanged from the previous post, Best Practices To Keep AI Agents on Track. Handles Bucket 1 rules at 99-100%. Pre-execution hooks block progression until conditions are met. Post-execution gates verify outputs before handoff. Format linters check structural rules. Warm-start injection provides reminders at agent startup (~90%). This layer is the foundation: everything that can be verified programmatically.

Layer 2: Structured Discussion at Checkpoints

Handles Bucket 2 rules. This is the new layer that produces the +12pp system-wide gain.

Two-Phase Checkpoint Protocol

Every analysis step passes through a two-phase review:

  1. Quick Check (BLOCKING, 3-5 minutes): The reviewing agent evaluates the step output against a structured agenda before the pipeline advances. The working agent must respond to each finding. Verdict: PASS, REVISE, or BLOCK.
  2. Deep Review (PARALLEL with next step): Extended review runs in the background while the next step begins. Findings feed into cross-step synthesis. This prevents the review from becoming a pipeline bottleneck while still catching issues.
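The two phases can be sketched with a thread pool: the Quick Check blocks the loop, while each Deep Review is submitted to run in the background. This is a simplified illustration, not our production orchestrator. The function names are hypothetical, and a non-PASS verdict simply raises here instead of entering a revise-and-recheck loop.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List

def run_pipeline(
    steps: List[Callable[[], str]],
    quick_check: Callable[[str], str],   # returns "PASS", "REVISE", or "BLOCK"
    deep_review: Callable[[str], str],
) -> List[str]:
    """Run steps in order; block on each Quick Check, run Deep Reviews in the background."""
    background = []
    with ThreadPoolExecutor(max_workers=2) as pool:
        for step in steps:
            output = step()
            verdict = quick_check(output)            # Phase 1: blocking
            if verdict != "PASS":
                raise RuntimeError(f"Quick Check returned {verdict}; step must be revised")
            background.append(pool.submit(deep_review, output))  # Phase 2: parallel
        # Collect Deep Review findings for cross-step synthesis at pipeline end.
        return [future.result() for future in background]

findings = run_pipeline(
    steps=[lambda: "step-1 output", lambda: "step-2 output"],
    quick_check=lambda output: "PASS",
    deep_review=lambda output: f"deep review of {output}",
)
print(findings)
```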

5-Item Reviewer Agenda

The Quick Check uses a structured agenda, not open-ended review:

| Agenda Item | Failure Mode It Catches | Example |
|---|---|---|
| 1. Methodology | Wrong analytical approach | Using ratio metrics when absolute counts are needed |
| 2. Population | Wrong denominator or filter | Eligible users vs. daily active users conflation |
| 3. Data Quality | Stale data, JOIN issues | Querying a table whose partitions expired days ago |
| 4. Evidence Labels | Unsupported claims | “Users clearly prefer X” without statistical test |
| 5. Prescribed Method | Skipped required steps | No retention check before query execution |

Why structured agenda > open-ended review: MATEval showed that removing structured feedback drops correlation from 0.391 to 0.259. The agenda ensures every review covers the known failure modes.
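One way to make the agenda enforceable is to refuse a verdict until every item has been addressed. The sketch below is a minimal illustration with hypothetical type names. It derives only PASS or REVISE from the findings; deciding when a finding warrants BLOCK is left to the reviewer and is not modeled.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, List

AGENDA = ("Methodology", "Population", "Data Quality", "Evidence Labels", "Prescribed Method")

class Verdict(Enum):
    PASS = "PASS"
    REVISE = "REVISE"
    BLOCK = "BLOCK"

@dataclass
class Review:
    # Maps agenda item -> findings; an empty list means "checked, no issues".
    findings: Dict[str, List[str]] = field(default_factory=dict)

    def missing_items(self) -> List[str]:
        """Agenda items the reviewer never addressed."""
        return [item for item in AGENDA if item not in self.findings]

    def verdict(self) -> Verdict:
        if self.missing_items():
            raise ValueError(f"Review incomplete, missing: {self.missing_items()}")
        has_findings = any(self.findings[item] for item in AGENDA)
        return Verdict.REVISE if has_findings else Verdict.PASS

review = Review({item: [] for item in AGENDA})
review.findings["Population"].append("Denominator includes all registered accounts")
print(review.verdict())
```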

What Discussion Actually Catches: Beyond Compliance to Correctness

The compliance metrics capture whether discussion happened and known rules were followed. But the highest-value contribution is catching errors that no rule anticipated:

| Error Type | Example | Impact If Missed | Detectable by Linter? |
|---|---|---|---|
| Wrong denominator | Registered users instead of active users | 3-5x silent metric suppression | No (both are valid queries) |
| Simpson’s paradox | Metric rises in every sub-group but falls in aggregate | Contradictory recommendation | No (both numbers are correct) |
| Stale data window | Expired partitions, outdated numbers | All findings reflect wrong period | Partially |
| Superficial cross-validation | “Validated against Source B” without discrepancy analysis | False confidence | No (keyword present) |
| Inappropriate filter | Using experiment eligibility as a population filter | Biased denominator | No (valid SQL) |

Only another agent asking “what was the discrepancy between sources, and how did you reconcile it?” forces substantive answers.
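The Simpson’s paradox row is easy to reproduce with a toy dataset. The numbers below are invented for illustration: the rate rises in both segments between the two periods, yet the aggregate rate falls because the segment mix shifts toward the lower-rate segment.

```python
# (successes, users) per segment; all numbers are invented for illustration.
before = {"segment_a": (80, 100), "segment_b": (20, 100)}
after = {"segment_a": (9, 10), "segment_b": (57, 190)}

def rate(successes: int, users: int) -> float:
    return successes / users

def aggregate(period: dict) -> float:
    """Overall rate across all segments in a period."""
    total_successes = sum(s for s, _ in period.values())
    total_users = sum(u for _, u in period.values())
    return total_successes / total_users

for segment in before:
    print(segment, rate(*before[segment]), "->", rate(*after[segment]))  # rises in both
print("aggregate", aggregate(before), "->", aggregate(after))            # falls overall
```

Every number here is a correct query result, so the contradiction only surfaces when a reviewer asks how the segment mix changed.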

Layer 3: Routing Architecture — Harness Enabling Discussion

Discussion only works when it actually happens. Layer 3 ensures it does. This is the key architectural insight: a harness mechanism (Layer 1) that enables discussion (Layer 2). The two approaches are symbiotic.

  • Persistent core agents: 8 agents that persist across the pipeline, accumulate context, and follow checkpoint protocols. When the reviewer evaluates Step 3, it has context from Steps 1-2.
  • Routing guard hook: Intercepts every agent spawn, checks if the task maps to a core agent role, and blocks spawn if so — routing through the persistent agent instead.
  • Sub-agent scope: Sub-agents handle parallel, independent work (schema lookups, partition checks). Core agents dispatch sub-agents and incorporate results into the discussion flow.
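The routing guard described in these bullets can be sketched as a pure function over spawn requests. The role names and types below are hypothetical (the real system has eight core agents, not listed here), and an actual hook would be wired into whatever spawn mechanism the agent runtime provides.

```python
from dataclasses import dataclass

# Hypothetical role names; the eight core roles are not listed in this post.
CORE_ROLES = {"analyst", "reviewer", "writer"}

@dataclass
class SpawnRequest:
    role: str
    task: str

@dataclass
class RoutingDecision:
    allow_spawn: bool
    route_to: str  # persistent agent role, or "sub-agent" when the spawn is allowed

def routing_guard(request: SpawnRequest) -> RoutingDecision:
    """Block spawns whose role belongs to a core agent and route them to that agent."""
    if request.role in CORE_ROLES:
        return RoutingDecision(allow_spawn=False, route_to=request.role)
    return RoutingDecision(allow_spawn=True, route_to="sub-agent")

print(routing_guard(SpawnRequest("reviewer", "review step 3 output")))
print(routing_guard(SpawnRequest("schema-lookup", "describe table partitions")))
```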

Layer 4: Proactive Monitoring (Watchdog)

Discussion introduces checkpoint liveness as a concern. If Agent A sends a message to Agent B and Agent B has stalled, the pipeline hangs indefinitely. No hook fires because no rule was violated.

The watchdog agent addresses this:

  1. Self-polls every 5 minutes
  2. Checks each agent’s last activity timestamp
  3. Detects stalls: >10 minutes silence during active pipeline
  4. Alerts the orchestrator with diagnosis and recovery recommendation
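The stall check in steps 2 and 3 reduces to a timestamp comparison. The sketch below uses the 5-minute poll and 10-minute threshold from the list above; the function name and the shape of the activity map are assumptions, and the alerting step is omitted.

```python
from datetime import datetime, timedelta
from typing import Dict, List

POLL_INTERVAL = timedelta(minutes=5)     # how often the watchdog wakes up
STALL_THRESHOLD = timedelta(minutes=10)  # silence longer than this counts as a stall

def find_stalled_agents(
    last_activity: Dict[str, datetime],
    now: datetime,
    pipeline_active: bool = True,
) -> List[str]:
    """Return agents silent for longer than STALL_THRESHOLD while the pipeline is active."""
    if not pipeline_active:
        return []
    return sorted(
        agent for agent, seen in last_activity.items()
        if now - seen > STALL_THRESHOLD
    )

now = datetime(2025, 1, 1, 12, 0)
activity = {
    "analyst": now - timedelta(minutes=3),
    "reviewer": now - timedelta(minutes=14),
}
print(find_stalled_agents(activity, now))
```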

Composition: Compliance at Each Layer

Table 4: System-Wide Compliance by Architecture Version

| Version | Layers Active | System-Wide Avg | Top Improvement |
|---|---|---|---|
| V2 | Harness only | ~65% | |
| V4 | Harness + Discussion | ~77% | Challenge Calibration +66pp |
| V5 (est.) | All four layers | ~90%+ | Routing compliance +35pp |

flowchart LR
    V2["V2: ~65%<br/>Harness only"] --> V4["V4: ~77%<br/>+ Discussion<br/>(+12pp)"]
    V4 --> V5["V5: ~90%+<br/>+ Routing + Monitoring<br/>(+13pp)"]

    style V2 fill:#ffcdd2,stroke:#C62828
    style V4 fill:#fff9c4,stroke:#F9A825
    style V5 fill:#c8e6c9,stroke:#2E7D32

Table 5: Per-Rule Compliance Across All Versions

| Rule | Code | V2 | V4 | V5 (est.) | Primary Layer |
|---|---|---|---|---|---|
| Step 0 Premise | A1 | 100% | 100% | 100% | L1: Hook |
| Evidence Classification | A3 | 100% | 96% | 96% | L1: Warm-rule |
| Metric Definitions | A11 | 100% | 100% | 100% | L1: Gate |
| Framework Coverage | A12 | 100% | 100% | 100% | L1: Gate |
| No Data Fabrication | E6 | 100% | 100% | 100% | L1: Hook |
| Data Sensitivity Flagging | D6 | 100% | 100% | 100% | L1: Hook |
| Summary Narrative First | W2 | 83% | 100% | 100% | L1: Linter + L2: Writer-Reviewer |
| No Bold Table Data | W5 | 71% | 78% | ~80% | L1: Linter (needs BLOCK) |
| Bullet Hierarchy <=7 L0 | W3 | 79% | 43% | ~65% | L1: Linter + L3: Routing |
| Header Spacing | W6 | 58% | 70% | ~75% | L1: Linter (needs BLOCK) |
| Challenge Calibration | B2 | 21% | 87% | ~92% | L2: Discussion |
| Cross-Source Validation | B4 | 33% | 96% | ~98% | L2: Discussion |
| Retention Check | D2 | 75% | 100% | 100% | L2: Discussion |
| Structural-Unblock | A6 | 83% | 100% | 100% | L1: Warm-rule + L2: Discussion |
| Reviewer verdict (DISC-1) | DISC-1 | 0% | 100% | 100% | L2: Discussion |
| 5-item agenda (DISC-2) | DISC-2 | 0% | 35% | ~70% | L2: Discussion + L3: Routing |
| Recovery log | NEW | | 87% | ~90% | L2: Discussion |
| Pipeline completion | | 100% | 96% | ~98% | L1: Gate + L4: Watchdog |
| Document deliverable | | N/M | N/M | ~90% | L3: Routing |

V5 numbers are simulated estimates based on root cause analysis of V4 failures, not measured results.

DISC-2 at 35% in V4 illustrates the layer interaction: the persistent reviewer knows the full 5-item agenda, but without routing enforcement, review sometimes went through less-context-aware paths that checked only 2-3 items. The routing guard (L3) ensures the persistent reviewer — with full agenda context — always handles review.


5. Industry Landscape

Table 6: Agent Discussion Mechanisms Across Frameworks

| Framework | Discussion Mechanism | Structured Agenda | Routing Control | Monitoring |
|---|---|---|---|---|
| Our system | 2-phase checkpoint, 5-item agenda | Yes | Routing guard hook | Watchdog (5-min poll) |
| Anthropic patterns | Evaluator-Optimizer loop | Partial | Manual | No |
| CrewAI | 2 primitives (Delegate, Ask) | No | allow_delegation=False | No |
| 22-framework eval (Rasheed et al. 2026) | Task + Verification Agent | No | Per-framework | No |

The Gap No Framework Fills

Three capabilities exist independently. No framework combines all three.

  1. Harness enforcement (hooks, gates, linters): Present in most frameworks. Well-understood.
  2. Structured multi-agent discussion: Extensively studied academically. Rarely implemented in production. CrewAI’s two primitives support handoff, not debate. Anthropic’s evaluator-optimizer covers one interaction mode, not a multi-item checkpoint protocol.
  3. Routing architecture ensuring discussion happens: No framework includes a routing guard. Most assume agents will follow designed workflows. Our data shows they don’t unless the harness enforces it.

Why Frameworks Plateau

Rasheed et al. (2026) evaluated 22 frameworks: 12 clustered within 1.4pp (74.57-75.94%). Architectural category did not differentiate. Failures were orchestration-driven: one failed after 11 days from context growth; another consumed $1,434/day from retry loops; a third exhausted API quotas through interactions that increased prompt length without improving answers. Routing guards and watchdogs are orchestration mechanisms — they ensure existing capabilities are reliably activated.

Connection to Academic Research

Our checkpoint protocol implements structured debate with sparse topology: only relevant agents discuss (Li et al. 2024: sparse D=2/5 outperforms fully-connected), discussion activates only for Bucket 2 rules (Eo et al. 2025: adaptive gating reduced calls from 9 to 1.4), and every review follows a structured agenda (MATEval: removing feedback drops correlation from 0.391 to 0.259).

The closest human-systems analogue is the WHO Surgical Safety Checklist (Haynes et al., NEJM 2009): a structured protocol where independent verification by a second party reduced death rates by 47% and complications by 36%. Our checkpoint protocol is the agent-system equivalent: a structured moment where a second agent independently verifies the first’s work before the pipeline advances.


6. Principles for Designing Agent Discussion

Principle 1: Classify every rule into two buckets first

If you can write a programmatic check that verifies compliance with >95% accuracy, it’s Bucket 1. Otherwise, it’s Bucket 2. Misclassification wastes effort both ways: discussion on Bucket 1 rules adds latency without improving compliance; hooks on Bucket 2 rules produce false confidence while substantive violations pass through.

Principle 2: Never use unconditional discussion

Mandating debate on every task cost 26pp of accuracy relative to a single agent (Eo et al. 2025). In our system, discussion activates at two points: after each analysis step (reviewer Quick Check) and during final formatting (writer-reviewer conversation). It does not activate for data discovery, query execution, or schema lookups.

Principle 3: Structure every discussion with trigger, agenda, and verdict

Three components: a trigger (event that activates review), an agenda (fixed list of items), and a verdict (PASS/REVISE/BLOCK). Without all three, discussion degrades. An agent told to “review this work” without an agenda produces generic approval. An agent with an agenda but no verdict flags issues without forcing resolution.

Principle 4: Persistent agents for discussion, disposable agents for parallel work

Persistent agents accumulate context and produce calibrated reviews. Disposable sub-agents are stateless. Discussion happens between persistent agents. Sub-agents handle independent parallel work.

Principle 5: Gate discussion at the routing layer, not the agent layer

Relying on agents to self-enforce discussion produces 50-70% compliance. The routing guard enforces discussion at the harness level. The agent cannot choose to skip because the harness structurally prevents the bypass path.

Principle 6: Monitor discussion liveness

Discussion introduces silent stalls: Agent A messages Agent B, Agent B stalls, no hook fires, the pipeline hangs. A watchdog that polls agent activity is essential. Without this, a single stalled checkpoint blocks the pipeline indefinitely.

Principle 7: Discussion catches what linters can’t

The highest-value items are methodology issues no programmatic check detects: wrong denominators (3-5x silent bias), Simpson’s paradox, inappropriate population filters, superficial cross-source claims. A linter checking “did the analyst mention cross-source validation” shows 100% while the validation is superficial. Only another agent asking “what was the discrepancy, and how did you reconcile it?” verifies substantive compliance.
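The gap between keyword compliance and substantive compliance is easy to demonstrate. The linter below is a deliberately naive, hypothetical check, and the report text and the 2.1% discrepancy are invented for illustration. It passes both reports, so it cannot tell a superficial claim from a reconciled one.

```python
import re

def keyword_linter(report: str) -> bool:
    """Pass if the report mentions cross-source validation at all."""
    return re.search(r"validated against", report, flags=re.IGNORECASE) is not None

superficial = "Validated against Source B."
substantive = (
    "Validated against Source B: counts differ by 2.1%, explained by a "
    "timezone cutoff; reconciled by aligning both sources to UTC."
)

# Both pass the linter, so it reports full compliance for either report.
print(keyword_linter(superficial), keyword_linter(substantive))
```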


7. Conclusion

The Best Practices To Keep AI Agents on Track established that enforcement mechanism matters more than rule content. This post adds a second finding: some rules can’t be enforced by any programmatic mechanism — they require judgment.

The complete compliance stack:

  1. Hooks and gates for programmatic rules: 99-100%. The foundation.
  2. Structured discussion at checkpoints for judgment rules: 85-100%. The quality layer.
  3. Routing architecture to ensure discussion happens: prevents bypass.
  4. Proactive monitoring to catch liveness failures: prevents silent stalls.

Each layer addresses a failure mode the others cannot. Hooks can’t evaluate methodology quality. Discussion can’t catch missing files. Routing can’t detect stalled agents. Monitoring can’t verify analytical reasoning. The layers compose.

flowchart LR
    A["Hooks<br/>catch missing files"] ~~~ B["Discussion<br/>catches wrong methodology"]
    B ~~~ C["Routing<br/>ensures discussion happens"]
    C ~~~ D["Monitoring<br/>catches stalled agents"]

    style A fill:#e3f2fd,stroke:#1565C0
    style B fill:#fff3e0,stroke:#E65100
    style C fill:#f3e5f5,stroke:#7B1FA2
    style D fill:#e8f5e9,stroke:#2E7D32

The remaining gap (~90% to 100%) is dominated by format rules where the fix is straightforward harness tuning. The judgment rules that motivated this investigation (challenge calibration, cross-source validation, methodology review) already measure 87-100% with discussion in place (V4) and are projected at ~92-100% with the full stack.

But compliance is only the measurable proxy. The deeper value of agent discussion is output correctness — catching wrong denominators, Simpson’s paradox, stale data windows, and superficial cross-validations that silently corrupt conclusions. Mercier and Sperber’s asymmetry — biased in production, objective in evaluation — is not just a theoretical parallel. It is the mechanism by which the system produces reliable results.


Appendix

A. The Checkpoint Discussion Protocol

Quick Check Phase (BLOCKING)

TRIGGER:  Analysis step completes, working agent signals ready-for-review
PARTICIPANTS:  Working agent (defender), Reviewing agent (challenger)
TIME BUDGET:  3-5 minutes
PROTOCOL:
  1. Reviewer receives step output + accumulated pipeline context
  2. Reviewer evaluates against 5-item agenda:
     [1] Methodology: Is the analytical approach appropriate?
     [2] Population: Is the denominator correct?
     [3] Data Quality: Partitions fresh? JOIN risks? Missing dimensions?
     [4] Evidence Labels: Claims labeled PROVEN/SUPPORTED/SPECULATIVE?
     [5] Prescribed Method: All required steps followed?
  3. Reviewer issues findings with specific references
  4. Working agent responds: accept + fix, or defend with evidence
  5. Reviewer issues verdict: PASS / REVISE / BLOCK

Deep Review Phase (PARALLEL)

TRIGGER:  Quick Check issues PASS verdict
RUNS:     In background, parallel with next step
COVERS:   Cross-step consistency, cross-source validation depth,
          methodology alternatives
OUTPUT:   Feeds into cross-step synthesis at pipeline end

B. Full Per-Rule Compliance Table

Table A1: Compliance Across All Measured Versions

| Rule | Code | Category | V2 (Harness) | V4 (+ Discussion) | V5 (+ Routing + Watchdog, est.) | Enforcement |
|---|---|---|---|---|---|---|
| Step 0 Premise | A1 | Methodology | 100% | 100% | 100% | Hook/gate |
| Evidence Classification | A3 | Methodology | 100% | 96% | 96% | Warm-rule |
| Structural-Unblock | A6 | Methodology | 83% | 100% | 100% | Warm-rule + Discussion |
| Gap Classification | A9 | Methodology | 4% | | | Standalone (debt) |
| Knowledge Reuse | A10 | Methodology | 13% | | | Standalone (debt) |
| Metric Definitions | A11 | Methodology | 100% | 100% | 100% | Gate |
| Framework Coverage | A12 | Methodology | 100% | 100% | 100% | Gate |
| Challenge Calibration | B2 | Review | 21% | 87% | ~92% | Discussion |
| Cross-Source Validation | B4 | Review | 33% | 96% | ~98% | Discussion |
| Retention Check | D2 | Data | 75% | 100% | 100% | Discussion |
| Data Sensitivity Flagging | D6 | Data | 100% | 100% | 100% | Hook |
| No Data Fabrication | E6 | Data | 100% | 100% | 100% | Hook |
| Summary Narrative First | W2 | Format | 83% | 100% | 100% | Linter + Discussion |
| Bullet Hierarchy <=7 L0 | W3 | Format | 79% | 43% | ~65% | Linter (needs BLOCK) |
| No Bold Table Data | W5 | Format | 71% | 78% | ~80% | Linter (needs BLOCK) |
| Header Spacing | W6 | Format | 58% | 70% | ~75% | Linter (needs BLOCK) |
| Reviewer verdict (DISC-1) | DISC-1 | Discussion | 0% | 100% | 100% | Discussion |
| 5-item agenda (DISC-2) | DISC-2 | Discussion | 0% | 35% | ~70% | Discussion + Routing |
| Recovery log | NEW | Execution | | 87% | ~90% | Discussion |
| Pipeline completion | | Liveness | 100% | 96% | ~98% | Gate + Watchdog |
| Document deliverable | | Output | N/M | N/M | ~90% | Routing |

V5 estimates are based on root cause analysis of V4 failure modes, not measured production results.

C. Testing Methodology

Table A2: Evaluation Setup

| Parameter | Value |
|---|---|
| Total analysis runs evaluated | V2: 24 runs, V4: 23 runs |
| Rule categories | 4 (Methodology, Review, Data, Format) |
| Rules tracked | 16 base + 3 discussion-specific + 2 system metrics |
| Evaluation method | Manual review of pipeline logs, agent outputs, and final deliverables |
| V5 estimation method | Root cause analysis of V4 compliance gaps, projected improvement rates |
| Model | Frontier LLM (model version varied across evaluation period) |
| Pipeline type | Multi-step analytical investigation (3-7 steps per run) |
| Discussion protocol | V2: None. V4: 2-phase checkpoint with 5-item agenda |
| Routing | V2-V4: Agent-level routing. V5: Harness-level routing guard |
| Monitoring | V2-V4: None. V5: Watchdog (5-min self-poll) |

D. References

  1. Du, Y., et al. (2023). Improving Factuality and Reasoning in Language Models through Multiagent Debate. arXiv:2305.14325.
  2. Li, Y., et al. (2024). Improving Multi-Agent Debate with Sparse Communication Topology. arXiv:2406.11776.
  3. Eo, S., et al. (2025). Debate ON Demand: Adaptive Activation of Multi-Agent Debate. arXiv:2504.05047.
  4. Li, Y., et al. (2024). MATEval: A Multi-Agent Discussion Framework for Open-Ended Text Evaluation. arXiv:2403.19305.
  5. Rasheed, B., et al. (2026). Evaluating Multi-Agent Frameworks: A Comprehensive Study of 22 Agentic Systems. arXiv:2604.16646.
  6. Anthropic. (2024). Building Effective Agents. anthropic.com/engineering/building-effective-agents.
  7. Mercier, H. & Sperber, D. (2011). Why Do Humans Reason? Behavioral and Brain Sciences, 34(2), 57-74.
  8. Bacchelli, A. & Bird, C. (2013). Expectations, Outcomes, and Challenges of Modern Code Review. ICSE 2013.
  9. Haynes, A.B., et al. (2009). A Surgical Safety Checklist to Reduce Morbidity and Mortality. NEJM, 360(5), 491-499.
  10. Shinn, N., et al. (2023). Reflexion: Language Agents with Verbal Reinforcement Learning. NeurIPS 2023.

This is the third post in a series. Future posts will cover lessons learned and performance metrics.