RZ AI Learning

Best Practices To Keep AI Agents on Track: Rule Compliance Through Harness Engineering

This is the second post in a series about building a multi-agent system for complex analytics. The first post covers the architecture and agent lifecycle. This one covers how to make agents actually follow their rules.


Summary

When you write a new rule for an AI agent:

  1. Can it be checked programmatically? Make it an automated hook (a script that blocks violations). Hooks are ~99% reliable; every other mechanism is less reliable.
  2. Is it for a persistent agent (one that lives for hours)? Add a 1-line reminder to the agent’s warm-rules file AND the Warm Refresh section at the top of the spec. The system injects the reminder on every new task automatically.
  3. Where does it go in the spec? At the exact workflow step where the agent needs it. Never in a standalone “Rules” section at the bottom.
  4. Does it have a trigger, action, and verification? “Before X, do Y, verify Z.” If it’s an abstract principle (“be thorough”), rewrite it as a concrete checkpoint.

When a rule keeps getting violated:

  • 1st violation → add rule text to the spec at the decision point
  • 2nd violation → add to warm-rules file (injected on every task) + Warm Refresh section
  • 3rd violation → make it a hook (automated enforcement, 99% reliable)
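The escalation path above can be written as a small lookup. This is a minimal sketch rather than the system's actual code; the function name `next_enforcement` and the returned labels are hypothetical.

```python
def next_enforcement(violation_count: int) -> str:
    """Return the next enforcement step for a rule, given its violation count.

    Hypothetical helper that mirrors the escalation list: spec text first,
    then the warm-rules file, then a hook.
    """
    if violation_count < 1:
        raise ValueError("violation_count must be at least 1")
    if violation_count == 1:
        return "add rule text to the spec at the decision point"
    if violation_count == 2:
        return "add to warm-rules file and Warm Refresh section"
    # Third and later violations all escalate to automated enforcement.
    return "make it a hook"


for n in (1, 2, 3):
    print(n, "->", next_enforcement(n))
```

Tracking the count per rule (for example in a small log file) is what makes the escalation mechanical instead of a judgment call.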

When structuring an agent’s instruction file:

  • Keep all procedures inline (don’t point to external files — agents skip them ~15–20% of the time)
  • Put the 10 most-violated rules in a “Warm Refresh” section at the top (first 50 lines)
  • For files over 800 lines: ensure critical rules are at natural workflow positions, not buried
  • Only split into separate files for multi-phase orchestrators (not single-purpose agents)

Core principles:

  • Automate enforcement for every critical rule
  • Protect persistent agents with warm-start mechanisms
  • Every rule needs trigger + action + verification
  • Embed rules at decision points, not in “Rules” sections
  • Respect the attention budget
  • Only split files for multi-phase orchestrators
  • Keep procedures inline — never use pointers

1. Introduction

  • This post presents measured compliance data from operating a multi-agent LLM analytics system (9 core agents) across ~90 analysis projects
  • The key finding: how you STRUCTURE rules matters more than what the rules SAY — the same rule achieves 50% or 99% compliance depending on enforcement mechanism
  • Why this matters: existing multi-agent frameworks write agent instructions once at spawn and assume they stay effective; no framework manages instruction decay over time [Masterman et al., 2024]
  • Instruction-following benchmarks confirm even frontier models fail 17–29% of the time [Zhou et al., 2023]; AGENTIF [Tsinghua, 2025] found the best model perfectly follows fewer than 30% of complex agentic instructions

2. The Problems and Observations

Problem: Agents skip rules even when reinforced

We wrote detailed instruction files (433–1,939 lines per agent). When agents violated rules, we added the same rule in multiple places. It didn’t help.

| What went wrong | Where the rule was written | Compliance |
|---|---|---|
| Agents skipped the quality review pipeline | Line 866 in a 1,939-line file | 28% |
| Agents created files with banned content patterns | Line 1,864 (near bottom) | 60% before hooks, 99% after |
| Agents didn’t trigger self-improvement | Line 1,634 (middle) | 80% |
| Agents skipped formatting checks | Line 1,258 (middle) | 50% |
| Agents followed “state your name” rule | Line 36 (top) + automated reminder | 99% |

Table 1. Compliance data by rule and enforcement. The problem is not agent capability — it’s how instruction files are structured.

Observation 1: Long instruction files have a dead zone

Compliance by position in a ~2,000-line instruction file:

```mermaid
flowchart LR
    A["First 200 lines<br/>~95%"] --> B["Lines 200–500<br/>~85%"]
    B --> C["Lines 500–1,500<br/>~60%"]
    C --> D["Last 400 lines<br/>~75%"]

    style A fill:#c8e6c9,stroke:#2E7D32
    style B fill:#fff9c4,stroke:#F9A825
    style C fill:#ffcdd2,stroke:#C62828
    style D fill:#fff9c4,stroke:#F9A825
```

  • First 200 lines: ~95% compliance
  • Lines 200–500: ~85%
  • Lines 500–1,500: ~60% (the dead zone; most operational rules sit here)
  • Last 400 lines: ~75%

This U-shaped curve matches “Lost in the Middle” research [Liu et al., 2024]: accuracy dropped from 75.8% at position 0 to 53.8% at mid-positions. Yin et al. [2024] identified the architectural root cause: RoPE introduces a decay effect favoring recent and initial positions; causal masking means earlier tokens accumulate more attention — the bias is structural, not a training artifact.
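One way to act on the dead-zone observation is to lint instruction files for critical rules that sit in the middle band. The sketch below assumes critical rules carry a `[CRITICAL]` tag, which is a hypothetical convention and not something the spec files described here are known to use; the 500–1,500 band comes from the figures above.

```python
from typing import List, Tuple

CRITICAL_MARKER = "[CRITICAL]"  # hypothetical tag; not defined in the original system


def find_dead_zone_rules(spec_text: str, start: int = 500, end: int = 1500) -> List[Tuple[int, str]]:
    """Return (line_number, text) for tagged rules on lines start..end (1-indexed)."""
    hits = []
    for number, line in enumerate(spec_text.splitlines(), start=1):
        if start <= number <= end and CRITICAL_MARKER in line:
            hits.append((number, line.strip()))
    return hits


# Toy 2,000-line spec: one tagged rule near the top, one in the dead zone.
lines = ["filler"] * 2000
lines[35] = "[CRITICAL] State your name at the start of every message"  # line 36
lines[865] = "[CRITICAL] Run the quality review pipeline"               # line 866
print(find_dead_zone_rules("\n".join(lines)))
```

Only the rule on line 866 is reported; the rule on line 36 falls in the high-compliance zone and is left alone.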

Observation 2: Time degrades compliance for persistent agents

| Lifetime Stage | Dead-Zone Rule Compliance | What’s Happening |
|---|---|---|
| At spawn (task 1) | ~100% | Agent just read full spec; all rules active |
| After 1–2 tasks | ~85% | Instructions pushed back by task results |
| After 3–5 tasks | ~60% | Spec thousands of tokens back; dead-zone rules forgotten |
| After context compression | ~40% | Spec reduced to summary; specific procedures GONE |

Table 2. Compliance degradation over agent lifetime. Single-shot agents don’t have this problem.

The “attention sinks” phenomenon [Xiao et al., 2024] explains this: transformers allocate disproportionate attention to initial and recent tokens, with middle content receiving the least. As agents process more tasks, original instructions move into the “middle” of accumulated context — outside both the attention sink and the recent-token window.

```mermaid
flowchart TD
    T1["Task 1: ~100%"] --> T2["After 1-2 tasks: ~85%"]
    T2 --> T3["After 3-5 tasks: ~60%"]
    T3 --> T4["After compression: ~40%"]

    style T1 fill:#c8e6c9,stroke:#2E7D32
    style T2 fill:#fff9c4,stroke:#F9A825
    style T3 fill:#ffcdd2,stroke:#C62828
    style T4 fill:#b71c1c,stroke:#b71c1c,color:#fff
```

Compliance decay curve for persistent agents without intervention.

Observation 3: Copy-pasting rules doesn’t help — unless contextual

We had ~125 lines of identical boilerplate copied into each of our 9 agent files (1,125 lines total). When automated scripts already enforced those rules, the duplicated text added no compliance benefit.

However: stating a rule at the exact moment the agent needs it works much better than a separate “Rules” section.

Observation 4: Agents don’t follow pointers to external files

When we split long instruction files and used pointers (“read procedure X from file Y”), agents skipped the Read call ~15–20% of the time. Even purpose-trained tool-using models skip external calls when confident [Schick et al., 2023].

Observation 5: Context compression is the silent killer

After hours of work, the system compresses older context; the agent’s instruction file gets reduced to a brief summary. Specific procedures, field names, exact templates are GONE — the agent improvises where it used to follow exact procedures.

The only mechanisms that survive compression: hooks (fire fresh every time) and explicit re-reading of the spec after compression.


3. Root Cause: Why Some Rules Stick and Others Don’t

The four-condition framework

For a rule to be reliably followed, all four conditions must be true:

  1. In context — the agent still has the rule in working memory
  2. Triggered — the agent recognizes “this rule applies right now”
  3. Actionable — the agent knows exactly what to do (a specific command, not a vague principle)
  4. Verified — the agent or the system can confirm the rule was followed

| Rule | Compliance | Conditions Met |
|---|---|---|
| “Start every message with your name” + automated reminder | ~99% | All 4 |
| “Never include banned content” + automated blocker | ~99% | All 4 |
| “Check for updates before starting work” + automated reminder | ~95% | 3 of 4 |
| “Run a 5-point quality checklist after every task” | ~85% | 2 of 4 |
| “When fixing an issue, change only what was flagged” | ~60% | 1 of 4 |
| “Persist until high confidence on every finding” | ~50% | 0 of 4 |

Table 3. Compliance by number of conditions met. Trigger + specific action + verification = compliance. Abstract principle + no trigger = forgotten.
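The four conditions lend themselves to a simple audit record per rule. The following is an illustrative sketch; the `RuleAudit` class and its field names are hypothetical, and only the two unambiguous rows of Table 3 (all four conditions met, none met) are used as examples.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RuleAudit:
    """Hypothetical record of which of the four conditions a rule satisfies."""
    text: str
    in_context: bool
    triggered: bool
    actionable: bool
    verified: bool

    def conditions_met(self) -> int:
        # Booleans sum as 0/1, giving the count of satisfied conditions.
        return sum([self.in_context, self.triggered, self.actionable, self.verified])

    def missing(self) -> List[str]:
        names = ["in_context", "triggered", "actionable", "verified"]
        return [name for name in names if not getattr(self, name)]


named = RuleAudit("Start every message with your name", True, True, True, True)
vague = RuleAudit("Persist until high confidence on every finding", False, False, False, False)
for rule in (named, vague):
    print(f"{rule.conditions_met()} of 4: {rule.text} (missing: {rule.missing()})")
```

Listing the missing conditions points directly at the rewrite a rule needs, such as adding a trigger or a verification step.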

Connection to established research

  • Implementation intentions [Gollwitzer and Sheeran, 2006]: meta-analysis of 94 studies found that “when X arises, I will do Y” plans outperform abstract goals (effect size d=0.65). Same mechanism applies to LLM agents: specific triggers create reliable responses; abstract principles degrade under load.
  • AgentSpec [Wang et al., ICSE 2026]: independently arrived at the same structure, with rules as (trigger, predicate, enforcement) 3-tuples achieving over 90% and up to 100% compliance. Strongest external validation of our framework.
  • WHO Surgical Safety Checklist [Haynes et al., 2009]: reduced surgical mortality by 47%. Key principle: checklists must be short (5–9 items), tied to workflow pause points, and revised based on observed failures.

Enforcement tier data

| Enforcement Tier | Mechanism | Compliance |
|---|---|---|
| Automated hook (system blocks violations) | Script fires before/after agent action | 99%+ |
| Script gate (orchestrator checks at transitions) | Check script at pipeline boundaries | 95%+ |
| Warm-start injection (rules injected every task) | Hook delivers 10–15 rules with each task | ~90% |
| Workflow-embedded text (rule at decision point) | Rule stated where agent makes the decision | 85–90% |
| Standalone text (rule in separate section) | “General Principles” at line 1,600 | 50–70% |

Table 4. Compliance by enforcement tier. The gradient from 50% to 99% demonstrates that enforcement mechanism is the primary determinant.

```mermaid
flowchart LR
    T1["Standalone text<br/>50–70%"] --> T2["Workflow-embedded<br/>85–90%"]
    T2 --> T3["Warm-start injection<br/>~90%"]
    T3 --> T4["Script gate<br/>95%+"]
    T4 --> T5["Automated hook<br/>99%+"]

    style T1 fill:#ffcdd2,stroke:#C62828
    style T2 fill:#fff9c4,stroke:#F9A825
    style T3 fill:#fff9c4,stroke:#F9A825
    style T4 fill:#c8e6c9,stroke:#2E7D32
    style T5 fill:#c8e6c9,stroke:#2E7D32
```

The enforcement ladder: same rule, different compliance depending on mechanism.


4. Solution: Three-Layer Defense

Layer 1: Automated enforcement (hooks) — 99%

Scripts that fire automatically before/after every agent action at the system level. The agent doesn’t need to remember anything — the system blocks violations before they reach the user.

In our system, 14 hooks operate at three points:

  • PreToolUse — blocks actions before they happen
  • PostToolUse — verifies output after actions complete
  • Stop — blocks task completion if required steps are missing

Conceptually identical to NeMo Guardrails [Rebedea et al., 2023] and poka-yoke (“mistake-proofing”) from lean manufacturing. OpenAI’s Agents SDK [2025] independently converged on the same pattern: guardrails with a “tripwire” mechanism that halts on violation.
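To make the PreToolUse idea concrete, here is a minimal sketch of a pre-action check for banned content. The payload shape (`tool`, `content`), the tool name `write_file`, and the patterns themselves are assumptions for illustration; the real hooks and their inputs are not described in this post.

```python
import re
from typing import Dict, List

# Hypothetical banned patterns; the real list is not given in the post.
BANNED_PATTERNS: List[str] = [r"\bTODO\b", r"lorem ipsum"]


def pre_tool_use(action: Dict[str, str]) -> Dict[str, object]:
    """Decide whether a pending action may proceed.

    `action` uses an assumed shape: {"tool": ..., "content": ...}.
    Returns {"allow": bool, "reasons": [matched patterns]}.
    """
    if action.get("tool") != "write_file":
        return {"allow": True, "reasons": []}
    content = action.get("content", "")
    reasons = [p for p in BANNED_PATTERNS if re.search(p, content, flags=re.IGNORECASE)]
    # Block the write if any banned pattern matched.
    return {"allow": not reasons, "reasons": reasons}


print(pre_tool_use({"tool": "write_file", "content": "Summary: Lorem ipsum dolor"}))
print(pre_tool_use({"tool": "write_file", "content": "Revenue grew 4% quarter over quarter"}))
```

The check runs outside the agent, so it does not depend on the agent remembering the rule; the agent only sees the block and the reason.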

Layer 2: Warm-start rule injection — ~90%

Two complementary mechanisms:

  • Hook-injected reminders: A script automatically injects each agent’s 10–15 most critical rules alongside every new task. Rules appear in the freshest context position. Fires automatically; survives context compression.
  • Warm Refresh section: Top 10 most-violated rules placed in the first 50 lines of the spec, positioning them in the attention sink zone [Xiao et al., 2024].

The “context engineering” paradigm [Karpathy, 2025; Anthropic, 2026] validates this: a focused 300-token context often outperforms an unfocused 113K-token context.
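Hook-injected reminders amount to wrapping each task message with a short rule list. The sketch below shows one possible shape; the function name, header text, and message layout are hypothetical, and only the 15-rule cap reflects the 10–15 range mentioned above.

```python
from typing import List

MAX_WARM_RULES = 15  # upper end of the 10-15 rules injected per task


def inject_warm_rules(task: str, warm_rules: List[str]) -> str:
    """Prepend one-line rule reminders to a task message (assumed format)."""
    rules = [r.strip() for r in warm_rules if r.strip()][:MAX_WARM_RULES]
    if not rules:
        return task
    header = "WARM RULES (apply to this task):"
    body = "\n".join(f"{i}. {rule}" for i, rule in enumerate(rules, start=1))
    return f"{header}\n{body}\n\nTASK:\n{task}"


message = inject_warm_rules(
    "Analyze Q3 churn by region",
    ["Start every message with your name", "Check for updates before starting work"],
)
print(message)
```

Because the reminder is rebuilt for every task, it arrives in fresh context regardless of how much earlier history has been compressed.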

Layer 3: Periodic full refresh — 100% reset

After every 8 task exchanges, the orchestrator respawns the agent fresh — all rules return to 100%.

Critical addition: post-compression re-read — when the system detects context compression, the agent re-reads its full spec immediately. Without post-compression re-read, agents ran for 5+ hours on compressed summaries at ~40% compliance.

File-based coordination via a shared vault enables this: all project state is in durable files, so the new instance recovers context without message history.
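The refresh logic can be sketched as a small counter owned by the orchestrator. This is an assumption-laden illustration: `RefreshPolicy`, its action labels, and the way compression is signaled are hypothetical; only the 8-exchange cycle and the post-compression re-read come from the text.

```python
from dataclasses import dataclass

REFRESH_EVERY = 8  # task exchanges between full respawns


@dataclass
class RefreshPolicy:
    """Tracks task exchanges for one persistent agent (hypothetical helper)."""
    exchanges: int = 0

    def after_exchange(self, compression_detected: bool = False) -> str:
        """Return the maintenance action to take after a task exchange."""
        self.exchanges += 1
        if self.exchanges >= REFRESH_EVERY:
            self.exchanges = 0
            return "respawn"       # fresh instance recovers state from vault files
        if compression_detected:
            return "reread_spec"   # re-read the full spec immediately
        return "continue"


policy = RefreshPolicy()
# Simulate eight exchanges with context compression detected on the third.
actions = [policy.after_exchange(compression_detected=(i == 3)) for i in range(1, 9)]
print(actions)
```

A respawn takes priority over a re-read in this sketch, since a fresh instance reads the full spec anyway.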

Composition: How the layers work together

| Agent State | Without Defense | With All 3 Layers |
|---|---|---|
| Cold start (task 1) | 100% | 100% |
| After 3 tasks | ~75% | ~90% |
| After 5 tasks | ~55% | ~85% |
| After context compression | ~40% | ~85% |
| After 8 tasks (full refresh) | ~40% | 100% |

Table 5. Compliance over agent lifetime. The layers have diverse failure modes (Swiss Cheese Model [Reason, 1990]): hooks can’t catch non-programmatic violations, warm-start injection degrades over many tasks, periodic refresh has cold-start cost. Together, they cover each other’s gaps.

```mermaid
flowchart TD
    subgraph Without["Without Defense"]
        W1["Task 1: 100%"] --> W2["Task 3: ~75%"]
        W2 --> W3["Task 5: ~55%"]
        W3 --> W4["Compression: ~40%"]
        W4 --> W5["Task 8: ~40%"]
    end

    subgraph With["With Three-Layer Defense"]
        D1["Task 1: 100%"] --> D2["Task 3: ~90%"]
        D2 --> D3["Task 5: ~85%"]
        D3 --> D4["Compression: ~85%"]
        D4 --> D5["Task 8: 100% ↻"]
    end

    style W4 fill:#ffcdd2,stroke:#C62828
    style W5 fill:#ffcdd2,stroke:#C62828
    style D1 fill:#c8e6c9,stroke:#2E7D32
    style D5 fill:#c8e6c9,stroke:#2E7D32
```

Side-by-side compliance over agent lifetime.

Reflect-Before-Fix protocol

When an agent’s output is flagged, the agent writes a 1–2 sentence reflection before fixing. Adopted from Reflexion [Shinn et al., 2023], where agents with verbal self-reflection reached 97% success on ALFWorld (130 of 134 tasks), an absolute improvement of 22% over the baseline. Provides agent-level within-project learning, complementing system-level self-improvement.
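A reflect-before-fix gate can be as simple as refusing to apply a fix until a short reflection is recorded. The sketch below is hypothetical throughout (`FlaggedIssue`, `can_apply_fix`), and the sentence count uses a crude punctuation split that is good enough only for illustration.

```python
import re
from dataclasses import dataclass


@dataclass
class FlaggedIssue:
    """A flagged output plus the agent's reflection (hypothetical structure)."""
    description: str
    reflection: str = ""


def can_apply_fix(issue: FlaggedIssue) -> bool:
    """Allow the fix only once a 1-2 sentence reflection has been written."""
    # Crude sentence split on ., !, ? (an approximation, not real sentence parsing).
    sentences = [s for s in re.split(r"[.!?]+", issue.reflection) if s.strip()]
    return 1 <= len(sentences) <= 2


issue = FlaggedIssue("Chart axis mislabeled")
print(can_apply_fix(issue))  # no reflection yet
issue.reflection = "I reused a template without checking units. I will verify axis units before export."
print(can_apply_fix(issue))
```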


5. Current Industry Landscape

| Category | Key Systems | Gap |
|---|---|---|
| Multi-agent frameworks | CrewAI, AutoGen/AG2, MetaGPT [ICLR 2024], LangGraph, OpenAI Agents SDK, Pydantic AI, Google ADK, Claude Agent SDK, Mastra | None manage instruction decay over time |
| Agent guardrails | NeMo Guardrails (Colang 2.0), Guardrails AI, LlamaGuard, AgentSpec [ICSE 2026] | Most are binary pass/fail for safety, not procedural compliance |
| Self-improvement | Reflexion, MAR, CLIN, Self-Refine, Voyager | Improve agent behavior, not enforcement architecture |
| Memory management | Letta/MemGPT (virtual context, git-backed memory) | Focus on data paging, not instruction freshness |
| Protocols | MCP (tools), A2A (agents) | Standardize communication, not compliance |
| Security | OWASP Top 10 for Agentic Applications [Dec 2025] | Risk taxonomy, not enforcement framework |

Table 6. Industry landscape summary. No existing system combines instruction decay management, graduated enforcement, and structural self-improvement.

| Capability | Industry State | Our Approach |
|---|---|---|
| Instruction decay | Not addressed; CrewAI closest (role re-injection) | Three-layer defense: hooks + warm-start + periodic refresh |
| Enforcement | Binary pass/fail (NeMo, LlamaGuard) | 5-tier graduated ladder with escalation (text → warm-rules → hook) |
| Self-improvement | Agent learns (Reflexion, CLIN) | Agent learns AND system promotes text rules to hooks |
| Validation | Symmetric debate or single reviewer | Specialized roles: challenger + arbiter + independent cross-validator |
| Coordination | Message passing or shared memory | File-based vault: inspectable, diffable, survives agent death |
| Rule format | Ad-hoc natural language | Trigger + action + verification (validated by AgentSpec [ICSE 2026]) |

Table 7. Comparison with industry. The “context engineering” paradigm [Karpathy, 2025] independently validates the core techniques.

Key comparisons:

  • AgentSpec [ICSE 2026] validates the trigger-action-verification framework with structurally identical (trigger, predicate, enforcement) 3-tuples. Their DSL is more maintainable at scale; our natural language + bash hooks are more flexible.
  • “Found in the Middle” [Yin et al., 2024] proposes model-level fixes (RoPE calibration). Our approach is application-level. Both are valid at different layers and complementary.
  • MCP + A2A protocols formalize communication over HTTP/JSON-RPC. Our vault-file approach is lower-tech but files are inspectable, diffable, and survive agent death without coordination infrastructure.

6. Principles for Writing Agent Rules

Ordered by impact (most important first).

Principle 1: Automate enforcement for every critical rule

If a rule has been violated twice and can be checked programmatically, make it a hook. The enforcement ladder:

hooks (99%+) > script gates (95%+) > warm-start injection (~90%) > workflow-embedded text (85–90%) > standalone text (50–70%)

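Mirrors the NIOSH Hierarchy of Controls, where engineering controls rank above administrative controls, which rank above personal protective equipment; here the lowest tier corresponds to relying on the agent’s own vigilance.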

Principle 2: Protect persistent agents with warm-start mechanisms

Warm Refresh section at the top (~50 lines, 10 most-violated rules) + hook injection on every task + full refresh every N tasks + post-compression re-read.

This raised compliance from ~60% to ~90% — the single most impactful structural change.

Principle 3: Every rule needs trigger + action + verification

  • Trigger: “before sending results to user”
  • Action: “run the content-check script on file X”
  • Verification: “script exit code = 0”
  • Consequence: “block the send, fix, retry”

If you can’t define a trigger, the rule is too abstract to enforce.
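The trigger, action, verification, and consequence of the example above can be sketched as a gate around a check command. The function name `gate_before_send` is hypothetical, and the two inline Python commands are stand-ins for a real content-check script, which this post does not show.

```python
import subprocess
import sys
from typing import List


def gate_before_send(check_cmd: List[str]) -> bool:
    """Trigger: call this before sending results to the user.
    Action: run the check command.
    Verification: exit code must be 0.
    Consequence: the caller blocks the send when this returns False.
    """
    result = subprocess.run(check_cmd, capture_output=True, text=True)
    return result.returncode == 0


# Stand-in check scripts (hypothetical): one exits 0, one exits 1.
passing = [sys.executable, "-c", "import sys; sys.exit(0)"]
failing = [sys.executable, "-c", "import sys; sys.exit(1)"]

print("send allowed:", gate_before_send(passing))
print("send allowed:", gate_before_send(failing))
```

Using the exit code as the only verification signal keeps the rule unambiguous: there is nothing for the agent to interpret.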

Principle 4: Embed rules at decision points, not in “Rules” sections

A standalone “Rules” section at the bottom of an instruction file is where rules go to die. State each rule at the exact workflow step where the agent needs it.

Principle 5: Respect the attention budget

| File Length | Recommendation |
|---|---|
| Under 400 lines | Agents reliably follow the full file |
| 400–800 lines | Critical rules at natural workflow positions |
| 800–1,200 lines | Add Warm Refresh section + hook injection |
| Over 1,200 lines | Split into core + modules (only for multi-phase orchestrators) |
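The length bands in the table above translate into a short helper that could run in a spec linter. The function name is hypothetical, and the handling of the exact boundaries (which band 400, 800, and 1,200 fall into) is an assumption, since the table leaves them ambiguous.

```python
def recommend_for_length(line_count: int) -> str:
    """Map an instruction file's length to the recommendation for its band.

    Boundary values are assigned to the lower band by assumption.
    """
    if line_count < 0:
        raise ValueError("line_count must be non-negative")
    if line_count < 400:
        return "Agents reliably follow the full file"
    if line_count <= 800:
        return "Critical rules at natural workflow positions"
    if line_count <= 1200:
        return "Add Warm Refresh section + hook injection"
    return "Split into core + modules (only for multi-phase orchestrators)"


for length in (433, 1198, 1939):
    print(length, "->", recommend_for_length(length))
```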

Principle 6: Only split files for multi-phase orchestrators

We split our 1,939-line orchestrator into a 433-line core + 5 phase files. We did NOT split single-purpose agents (1,274 and 1,198 lines) — single file + Warm Refresh + hooks is more reliable. Splitting adds ~15–20% Read-call skip risk.

Principle 7: Keep procedures inline — never use pointers

When we extracted procedures to shared files with pointers, agents skipped reading them ~15–20% of the time. Keep full procedures inline; shared files exist as human reference only.
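One way to keep a single human-maintained copy of a shared procedure while still giving agents inline text is to expand pointers when the spec is built, before any agent reads it. This is a sketch of that idea only; the `{{include: name}}` syntax and the `inline_procedures` function are hypothetical and not part of the system described here.

```python
import re
from typing import Dict

# Hypothetical pointer syntax; the post does not specify one.
POINTER = re.compile(r"\{\{include:\s*([\w-]+)\s*\}\}")


def inline_procedures(spec: str, shared: Dict[str, str]) -> str:
    """Replace each {{include: name}} pointer with the full procedure text."""
    def expand(match):
        name = match.group(1)
        if name not in shared:
            # Fail loudly at build time instead of shipping a dangling pointer.
            raise KeyError(f"no shared procedure named {name!r}")
        return shared[name]
    return POINTER.sub(expand, spec)


spec = "Step 3:\n{{include: qa-checklist}}\nStep 4: publish"
shared = {"qa-checklist": "Run checks A, B, and C; stop on the first failure."}
print(inline_procedures(spec, shared))
```

The agent-facing file then contains no pointers at all, so there is no Read call for the agent to skip.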


7. Conclusion

  • Enforcement mechanism matters more than rule content — the same rule achieves 50% or 99% compliance depending on whether it’s text or a hook
  • Persistent agents need active instruction maintenance: without intervention, compliance degrades from 100% to ~40%; the three-layer defense holds it at ~85–90% between refreshes
  • Rule structure follows implementation intention principles — trigger + action + verification = 95–99% compliance; abstract principles = 50–60%
  • The instruction file teaches agents WHAT to do. Hooks and injection ensure they DO.

Appendix

A. The Enforcement Ladder

| Tier | Mechanism | Compliance | How It Works | Example |
|---|---|---|---|---|
| 5 | Automated hook | 99%+ | Script fires before/after agent action; blocks violations at system level | Content restriction: banned patterns blocked automatically |
| 4 | Script gate | 95%+ | Orchestrator runs check script at pipeline transitions | Pipeline completion verified before publishing |
| 3 | Warm-start injection | ~90% | Hook delivers 10–15 critical rules with every new task | Most-violated rules appear in freshest context each task |
| 2 | Workflow-embedded text | 85–90% | Rule stated at the exact step where it applies | Decision routing procedures at the routing step |
| 1 | Standalone text | 50–70% | Rule stated once in a separate section, no enforcement | “General Principles” section; promote everything out of this tier |

Table A1. Full enforcement ladder with tiers, compliance rates, mechanisms, and examples.

B. Testing Results

| Scenario | Before Redesign | After Redesign |
|---|---|---|
| Agent at cold start | 100% (rules fresh) | 100% (same) |
| Agent at task 5 needs dead-zone rule | ~60% (forgotten) | ~90% (hook injects rule fresh) |
| Context compresses after 3 hours | ~40% (rules in summary) | ~85% (hook fires fresh + post-compression re-read) |
| Cross-phase rule needed in wrong phase | Often missed | Works (cross-phase rules in always-loaded core) |
| Agent needs edge-case procedure | Often improvised | Works (full procedure in spec; warm-rules remind) |
| Quality review pipeline before publishing | 28% completion | ~95% (hook blocks publishing until reviews done) |
| Orchestrator spec too thin (272 lines) | N/A | Failed: restored to 433 lines; core too thin |
| Pointer to shared protocol file | N/A | Failed: agents skip Read calls ~15–20%; reverted to inline |
| Agent receives 10+ tasks without refresh | ~40% (severe decay) | ~85% (warm-start + 8-task refresh cycle) |
| Adversarial validation after synthesis | Occasionally skipped | ~95% (hook blocks publishing without validation) |

Table B1. Testing results across 10 critical scenarios, comparing compliance before and after the three-layer defense.

C. References

Positional Bias & Context:

  • [Liu et al., 2024] “Lost in the Middle: How Language Models Use Long Contexts.” TACL.
  • [Yin et al., 2024] “Found in the Middle: Calibrating Positional Attention Bias Improves Long Context Utilization.” ACL 2024.
  • [Hsieh et al., 2024] “RULER: What’s the Real Context Size of Your Long-Context Language Models?”
  • [Levy et al., 2024] “Same Task, More Tokens: The Impact of Input Length on the Reasoning Performance of LLMs.” ACL 2024.
  • [Xiao et al., 2024] “Efficient Streaming Language Models with Attention Sinks.” ICLR 2024.
  • [Paulsen, 2025] “Maximum Effective Context Window (MECW).”
  • [Jiang et al., 2023] “LLMLingua: Compressing Prompts for Accelerated Inference.” EMNLP 2023.

Instruction Following:

  • [Zhou et al., 2023] “Instruction-Following Evaluation (IFEval).”
  • [Mu et al., 2024] “Rule-Based Rewards for LLM Instruction Following (RuLES).”
  • [IFEval++, 2025] “Instruction Following Evaluation with Rephrasing.”
  • [AGENTIF, 2025] “Agentic Instruction Following Benchmark.” Tsinghua University.
  • [MultiChallenge, ACL 2025] “Instruction Retention in Multi-Turn Conversations.”

Cognitive Science & Human Factors:

  • [Gollwitzer & Sheeran, 2006] “Implementation Intentions and Goal Achievement: A Meta-Analysis.” Advances in Experimental Social Psychology.
  • [Haynes et al., 2009] “A Surgical Safety Checklist to Reduce Morbidity and Mortality in a Global Population.” NEJM 360(5):491-499.
  • [Gawande, 2009] “The Checklist Manifesto.” Metropolitan Books.
  • [Degani & Wiener, 1993] “Cockpit Checklists: Concepts, Design, and Use.” Human Factors.
  • [Sweller, 1988] “Cognitive Load During Problem Solving.” Cognitive Science.
  • [Reason, 1990] “Human Error.” Cambridge University Press.

Multi-Agent Systems:

  • [Hong et al., 2023] “MetaGPT: Meta Programming for A Multi-Agent Collaborative Framework.” ICLR 2024 Oral.
  • [Du et al., 2023] “Improving Factuality and Reasoning in Language Models through Multiagent Debate.”
  • [Shinn et al., 2023] “Reflexion: Language Agents with Verbal Reinforcement Learning.” NeurIPS 2023.
  • [Wang et al., 2023] “Voyager: An Open-Ended Embodied Agent with Large Language Models.”
  • [Packer et al., 2023] “MemGPT: Towards LLMs as Operating Systems.” (Now Letta.)
  • [Madaan et al., 2023] “Self-Refine: Iterative Refinement with Self-Feedback.” NeurIPS 2023.
  • [MAR, 2025] “Multi-Agent Reflexion.” Dec 2025.
  • [CLIN, 2025] “Continual Learning for Language-Based Agents.”
  • [Masterman et al., 2024] “The Landscape of Emerging AI Agent Architectures.” Survey.

Agent Safety & Guardrails:

  • [Rebedea et al., 2023] “NeMo Guardrails: A Toolkit for Controllable and Safe LLM Applications.” (Colang 2.0 Beta, 2025.)
  • [Inan et al., 2023] “LlamaGuard: LLM-Based Input-Output Safeguard for Human-AI Conversations.”
  • [Wang et al., 2026] “AgentSpec: Runtime Enforcement for LLM Agent Systems.” ICSE 2026.
  • [OWASP, 2025] “Top 10 for Agentic Applications.” Dec 2025.
  • [Shamsujjoha et al., 2025] “Swiss Cheese Model for AI Agent Safety.” IEEE ICSA 2025.

Context Engineering & Frameworks:

  • [Karpathy, 2025] “Context Engineering.” Blog post.
  • [Anthropic, 2024] “Building Effective Agents.” Dec 2024.
  • [Anthropic, 2026] “Context Engineering Guide.”
  • [OpenAI, 2025] “Agents SDK.” Mar 2025.
  • [Yao et al., 2023] “ReAct: Synergizing Reasoning and Acting in Language Models.” ICLR 2023.
  • [Schick et al., 2023] “Toolformer: Language Models Can Teach Themselves to Use Tools.” NeurIPS 2023.

This is the second post in a series. The first post covers the 17-agent architecture and analysis lifecycle. Future posts will cover lessons learned and performance metrics.