Best Practices To Keep AI Agents on Track: Rule Compliance Through Harness Engineering
This is the second post in a series about building a multi-agent system for complex analytics. The first post covers the architecture and agent lifecycle. This one covers how to make agents actually follow their rules.
Summary
When you write a new rule for an AI agent:
- Can it be checked programmatically? Make it an automated hook (a script that blocks violations). Hooks are ~99% reliable; every other mechanism is less reliable.
- Is it for a persistent agent (one that lives for hours)? Add a 1-line reminder to the agent’s warm-rules file AND the Warm Refresh section at the top of the spec. The system injects the reminder on every new task automatically.
- Where does it go in the spec? At the exact workflow step where the agent needs it. Never in a standalone “Rules” section at the bottom.
- Does it have a trigger, action, and verification? “Before X, do Y, verify Z.” If it’s an abstract principle (“be thorough”), rewrite it as a concrete checkpoint.
When a rule keeps getting violated:
- 1st violation → add rule text to the spec at the decision point
- 2nd violation → add to warm-rules file (injected on every task) + Warm Refresh section
- 3rd violation → make it a hook (automated enforcement, 99% reliable)
When structuring an agent’s instruction file:
- Keep all procedures inline (don’t point to external files — agents skip them ~15–20% of the time)
- Put the 10 most-violated rules in a “Warm Refresh” section at the top (first 50 lines)
- For files over 800 lines: ensure critical rules are at natural workflow positions, not buried
- Only split into separate files for multi-phase orchestrators (not single-purpose agents)
Core principles:
- Automate enforcement for every critical rule
- Protect persistent agents with warm-start mechanisms
- Every rule needs trigger + action + verification
- Embed rules at decision points, not in “Rules” sections
- Respect the attention budget
- Keep procedures inline — never use pointers
1. Introduction
- This post presents measured compliance data from operating a multi-agent LLM analytics system (9 core agents) across ~90 analysis projects
- The key finding: how you STRUCTURE rules matters more than what the rules SAY — the same rule achieves 50% or 99% compliance depending on enforcement mechanism
- Why this matters: existing multi-agent frameworks write agent instructions once at spawn and assume they stay effective; no framework manages instruction decay over time [Masterman et al., 2024]
- Instruction-following benchmarks confirm even frontier models fail 17–29% of the time [Zhou et al., 2023]; the AGENTIF benchmark from Tsinghua University [AGENTIF, 2025] found the best model perfectly follows fewer than 30% of complex agentic instructions
2. The Problems and Observations
Problem: Agents skip rules even when reinforced
We wrote detailed instruction files (433–1,939 lines per agent). When agents violated rules, we added the same rule in multiple places. It didn’t help.
| What went wrong | Where the rule was written | Compliance |
|---|---|---|
| Agents skipped the quality review pipeline | Line 866 in a 1,939-line file | 28% |
| Agents created files with banned content patterns | Line 1,864 (near bottom) | 60% before hooks, 99% after |
| Agents didn’t trigger self-improvement | Line 1,634 (middle) | 80% |
| Agents skipped formatting checks | Line 1,258 (middle) | 50% |
| Agents followed “state your name” rule | Line 36 (top) + automated reminder | 99% |
Table 1. Compliance data by rule and enforcement. The problem is not agent capability — it’s how instruction files are structured.
Observation 1: Long instruction files have a dead zone
Compliance by position in a ~2,000-line instruction file:
```mermaid
flowchart LR
    A["First 200 lines<br/>~95%"] --> B["Lines 200–500<br/>~85%"]
    B --> C["Lines 500–1,500<br/>~60%"]
    C --> D["Last 400 lines<br/>~75%"]
    style A fill:#c8e6c9,stroke:#2E7D32
    style B fill:#fff9c4,stroke:#F9A825
    style C fill:#ffcdd2,stroke:#C62828
    style D fill:#fff9c4,stroke:#F9A825
```
- First 200 lines: ~95% compliance
- Lines 200–500: ~85%
- Lines 500–1,500: ~60% (the dead zone — most operational rules sit here)
- Last 400 lines: ~75%
This U-shaped curve matches “Lost in the Middle” research [Liu et al., 2024]: in their multi-document question-answering experiments, accuracy dropped from 75.8% when the relevant document came first to 53.8% when it sat mid-context. Yin et al. [2024] identified an architectural root cause. RoPE introduces a decay effect favoring recent and initial positions, and causal masking means earlier tokens accumulate more attention, so the bias is structural, not a training artifact.
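One way to act on the position data above is to lint a spec for critical rules that land in the low-compliance band. The following is a minimal sketch, not our actual tooling: the band boundaries are the ones measured for our ~2,000-line file and may not transfer to other files, and the `MUST`/`NEVER` marker convention and the function name `rules_in_dead_zone` are hypothetical.

```python
# Band boundaries (1-indexed lines) taken from the measurements above. They
# describe one ~2,000-line spec and are an assumption for any other file.
DEAD_ZONE = range(500, 1501)

# Hypothetical convention: a critical rule contains one of these markers.
CRITICAL_MARKERS = ("MUST", "NEVER")


def rules_in_dead_zone(spec_text: str) -> list[tuple[int, str]]:
    """Return (line_number, text) for critical rules inside the dead zone."""
    hits = []
    for lineno, line in enumerate(spec_text.splitlines(), start=1):
        if lineno in DEAD_ZONE and any(m in line for m in CRITICAL_MARKERS):
            hits.append((lineno, line.strip()))
    return hits
```

Each hit is a candidate for moving to its workflow step, adding to the Warm Refresh section, or promoting to a hook.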
Observation 2: Time degrades compliance for persistent agents
| Lifetime Stage | Dead-Zone Rule Compliance | What’s Happening |
|---|---|---|
| At spawn (task 1) | ~100% | Agent just read full spec; all rules active |
| After 1–2 tasks | ~85% | Instructions pushed back by task results |
| After 3–5 tasks | ~60% | Spec thousands of tokens back; dead-zone rules forgotten |
| After context compression | ~40% | Spec reduced to summary; specific procedures GONE |
Table 2. Compliance degradation over agent lifetime. Single-shot agents don’t have this problem.
The “attention sinks” phenomenon [Xiao et al., 2024] explains this: transformers allocate disproportionate attention to initial and recent tokens, with middle content receiving the least. As agents process more tasks, original instructions move into the “middle” of accumulated context — outside both the attention sink and the recent-token window.
```mermaid
flowchart TD
    T1["Task 1: ~100%"] --> T2["After 1-2 tasks: ~85%"]
    T2 --> T3["After 3-5 tasks: ~60%"]
    T3 --> T4["After compression: ~40%"]
    style T1 fill:#c8e6c9,stroke:#2E7D32
    style T2 fill:#fff9c4,stroke:#F9A825
    style T3 fill:#ffcdd2,stroke:#C62828
    style T4 fill:#b71c1c,stroke:#b71c1c,color:#fff
```
Compliance decay curve for persistent agents without intervention.
Observation 3: Copy-pasting rules doesn’t help — unless contextual
We had ~125 lines of identical boilerplate copied into each of our 9 agent files (1,125 lines total). When automated scripts already enforced those rules, the duplicated text added no compliance benefit.
However: stating a rule at the exact moment the agent needs it works much better than a separate “Rules” section.
Observation 4: Agents don’t follow pointers to external files
When we split long instruction files and used pointers (“read procedure X from file Y”), agents skipped the Read call ~15–20% of the time. Even purpose-trained tool-using models skip external calls when confident [Schick et al., 2023].
Observation 5: Context compression is the silent killer
After hours of work, the system compresses older context; the agent’s instruction file gets reduced to a brief summary. Specific procedures, field names, exact templates are GONE — the agent improvises where it used to follow exact procedures.
The only mechanisms that survive compression: hooks (fire fresh every time) and explicit re-reading of the spec after compression.
3. Root Cause: Why Some Rules Stick and Others Don’t
The four-condition framework
For a rule to be reliably followed, all four conditions must be true:
- In context — the agent still has the rule in working memory
- Triggered — the agent recognizes “this rule applies right now”
- Actionable — the agent knows exactly what to do (a specific command, not a vague principle)
- Verified — the agent or the system can confirm the rule was followed
| Rule | Compliance | Conditions Met |
|---|---|---|
| “Start every message with your name” + automated reminder | ~99% | All 4 |
| “Never include banned content” + automated blocker | ~99% | All 4 |
| “Check for updates before starting work” + automated reminder | ~95% | 3 of 4 |
| “Run a 5-point quality checklist after every task” | ~85% | 2 of 4 |
| “When fixing an issue, change only what was flagged” | ~60% | 1 of 4 |
| “Persist until high confidence on every finding” | ~50% | 0 of 4 |
Table 3. Compliance by number of conditions met. Trigger + specific action + verification = compliance. Abstract principle + no trigger = forgotten.
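The four conditions can be treated as a checklist when auditing a rule. The sketch below is illustrative only: the `Rule` class and its field names are assumptions introduced here, and the two example rules use only the rows of Table 3 where the met conditions are unambiguous (all four, or none).

```python
from dataclasses import dataclass

CONDITIONS = ("in_context", "triggered", "actionable", "verified")


@dataclass
class Rule:
    text: str
    in_context: bool   # rule is still in the agent's working context
    triggered: bool    # rule names the moment at which it applies
    actionable: bool   # rule names a specific command or step
    verified: bool     # agent or system can confirm it was followed

    def conditions_met(self) -> int:
        """Count how many of the four conditions hold."""
        return sum(getattr(self, name) for name in CONDITIONS)

    def missing(self) -> list[str]:
        """Name the conditions that still need work."""
        return [name for name in CONDITIONS if not getattr(self, name)]


name_rule = Rule("Start every message with your name", True, True, True, True)
vague_rule = Rule("Persist until high confidence on every finding",
                  False, False, False, False)
```

Here `vague_rule.missing()` lists all four conditions, which is the signal to rewrite the rule as a concrete checkpoint rather than repeat it in more places.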
Connection to established research
- Implementation intentions [Gollwitzer and Sheeran, 2006]: meta-analysis of 94 studies found that “when X arises, I will do Y” plans outperform abstract goals (effect size d=0.65). Same mechanism applies to LLM agents: specific triggers create reliable responses; abstract principles degrade under load.
- AgentSpec [Wang et al., 2026]: independently arrived at the same structure, expressing rules as (trigger, predicate, enforcement) 3-tuples and achieving over 90% (up to 100%) compliance. This is the strongest external validation of our framework.
- WHO Surgical Safety Checklist [Haynes et al., 2009]: reduced surgical mortality by 47%. Key principle: checklists must be short (5–9 items), tied to workflow pause points, and revised based on observed failures.
Enforcement tier data
| Enforcement Tier | Mechanism | Compliance |
|---|---|---|
| Automated hook (system blocks violations) | Script fires before/after agent action | 99%+ |
| Script gate (orchestrator checks at transitions) | Check script at pipeline boundaries | 95%+ |
| Warm-start injection (rules injected every task) | Hook delivers 10–15 rules with each task | ~90% |
| Workflow-embedded text (rule at decision point) | Rule stated where agent makes the decision | 85–90% |
| Standalone text (rule in separate section) | “General Principles” at line 1,600 | 50–70% |
Table 4. Compliance by enforcement tier. The gradient from 50% to 99% demonstrates that enforcement mechanism is the primary determinant.
```mermaid
flowchart LR
    T1["Standalone text<br/>50–70%"] --> T2["Workflow-embedded<br/>85–90%"]
    T2 --> T3["Warm-start injection<br/>~90%"]
    T3 --> T4["Script gate<br/>95%+"]
    T4 --> T5["Automated hook<br/>99%+"]
    style T1 fill:#ffcdd2,stroke:#C62828
    style T2 fill:#fff9c4,stroke:#F9A825
    style T3 fill:#fff9c4,stroke:#F9A825
    style T4 fill:#c8e6c9,stroke:#2E7D32
    style T5 fill:#c8e6c9,stroke:#2E7D32
```
The enforcement ladder: same rule, different compliance depending on mechanism.
4. Solution: Three-Layer Defense
Layer 1: Automated enforcement (hooks) — 99%
Scripts that fire automatically before/after every agent action at the system level. The agent doesn’t need to remember anything — the system blocks violations before they reach the user.
In our system, 14 hooks operate at three points:
- PreToolUse — blocks actions before they happen
- PostToolUse — verifies output after actions complete
- Stop — blocks task completion if required steps are missing
This pattern is conceptually the same as NeMo Guardrails [Rebedea et al., 2023] and poka-yoke (“mistake-proofing”) from lean manufacturing. OpenAI’s Agents SDK [2025] independently converged on it: guardrails with a “tripwire” mechanism that halts on violation.
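To make the PreToolUse idea concrete, here is a minimal sketch of a blocking check for banned content. It is not our production hook and not the API of any particular framework: the event shape (`{"tool": ..., "input": {"content": ...}}`), the `HookDecision` type, the `"Write"` tool name, and the two banned patterns are all assumptions chosen for illustration.

```python
import re
from dataclasses import dataclass

# Hypothetical banned patterns; a real list is project-specific.
BANNED_PATTERNS = [re.compile(r"(?i)\bTODO\b"), re.compile(r"(?i)lorem ipsum")]


@dataclass
class HookDecision:
    allow: bool
    reason: str = ""


def pre_tool_use(event: dict) -> HookDecision:
    """Block file writes whose content matches a banned pattern.

    `event` is an assumed shape: {"tool": str, "input": {"content": str}}.
    """
    if event.get("tool") != "Write":
        return HookDecision(allow=True)  # only file writes are inspected
    content = event.get("input", {}).get("content", "")
    for pattern in BANNED_PATTERNS:
        match = pattern.search(content)
        if match:
            return HookDecision(False, f"banned pattern: {match.group(0)!r}")
    return HookDecision(allow=True)
```

The check runs outside the model, so it does not depend on the agent remembering the rule; the agent only sees the rejection reason and retries.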
Layer 2: Warm-start rule injection — ~90%
Two complementary mechanisms:
- Hook-injected reminders: A script automatically injects each agent’s 10–15 most critical rules alongside every new task. Rules appear in the freshest context position. Fires automatically; survives context compression.
- Warm Refresh section: Top 10 most-violated rules placed in the first 50 lines of the spec, positioning them in the attention sink zone [Xiao et al., 2024].
The “context engineering” paradigm [Karpathy, 2025; Anthropic, 2026] validates this: a focused 300-token context often outperforms an unfocused 113K-token context.
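The hook-injected reminder can be sketched as a small wrapper around each task message. This is an illustration under stated assumptions: the warm-rules file format (one rule per line, `#` for comments), the 15-rule cap taken from the upper end of the range above, and the function names are all hypothetical.

```python
def load_warm_rules(raw: str, limit: int = 15) -> list[str]:
    """Parse warm-rules text: one rule per line, blanks and '#' comments skipped."""
    rules = [line.strip() for line in raw.splitlines()]
    rules = [r for r in rules if r and not r.startswith("#")]
    return rules[:limit]  # keep the reminder block short


def inject_warm_rules(task: str, rules: list[str]) -> str:
    """Prepend the reminders to the task so they sit in the newest context."""
    if not rules:
        return task
    reminder = "\n".join(f"- {r}" for r in rules)
    return f"REMINDERS (apply to this task):\n{reminder}\n\nTASK:\n{task}"
```

Because the reminders travel with the task message rather than the original spec, they are re-delivered after context compression as well.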
Layer 3: Periodic full refresh — 100% reset
After every 8 task exchanges, the orchestrator respawns the agent fresh — all rules return to 100%.
Critical addition: post-compression re-read — when the system detects context compression, the agent re-reads its full spec immediately. Without post-compression re-read, agents ran for 5+ hours on compressed summaries at ~40% compliance.
File-based coordination via a shared vault enables this: all project state is in durable files, so the new instance recovers context without message history.
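The refresh policy reduces to a small decision the orchestrator makes before handing over each task. A minimal sketch, assuming the 8-exchange cycle stated above; the `AgentLifecycle` class, the action names, and the `compression_detected` flag are hypothetical, and how compression is actually detected is left out.

```python
from dataclasses import dataclass

REFRESH_EVERY = 8  # task exchanges between full respawns (value from the text)


@dataclass
class AgentLifecycle:
    tasks_since_spawn: int = 0

    def next_action(self, compression_detected: bool = False) -> str:
        """Decide what happens before the agent receives its next task."""
        if self.tasks_since_spawn >= REFRESH_EVERY:
            self.tasks_since_spawn = 0
            return "respawn"      # fresh instance recovers state from vault files
        if compression_detected:
            return "reread_spec"  # spec was summarized; reload it in full
        return "continue"

    def record_task(self) -> None:
        """Call once after each completed task exchange."""
        self.tasks_since_spawn += 1
```

A respawn takes priority over a re-read because a fresh instance reads the full spec anyway.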
Composition: How the layers work together
| Agent State | Without Defense | With All 3 Layers |
|---|---|---|
| Cold start (task 1) | 100% | 100% |
| After 3 tasks | ~75% | ~90% |
| After 5 tasks | ~55% | ~85% |
| After context compression | ~40% | ~85% |
| After 8 tasks (full refresh) | ~40% | 100% |
Table 5. Compliance over agent lifetime. The layers have diverse failure modes (Swiss Cheese Model [Reason, 1990]): hooks can’t catch non-programmatic violations, warm-start injection degrades over many tasks, periodic refresh has cold-start cost. Together, they cover each other’s gaps.
```mermaid
flowchart TD
    subgraph Without["Without Defense"]
        W1["Task 1: 100%"] --> W2["Task 3: ~75%"]
        W2 --> W3["Task 5: ~55%"]
        W3 --> W4["Compression: ~40%"]
        W4 --> W5["Task 8: ~40%"]
    end
    subgraph With["With Three-Layer Defense"]
        D1["Task 1: 100%"] --> D2["Task 3: ~90%"]
        D2 --> D3["Task 5: ~85%"]
        D3 --> D4["Compression: ~85%"]
        D4 --> D5["Task 8: 100% ↻"]
    end
    style W4 fill:#ffcdd2,stroke:#C62828
    style W5 fill:#ffcdd2,stroke:#C62828
    style D1 fill:#c8e6c9,stroke:#2E7D32
    style D5 fill:#c8e6c9,stroke:#2E7D32
```
Side-by-side compliance over agent lifetime.
Reflect-Before-Fix protocol
When an agent’s output is flagged, the agent writes a 1–2 sentence reflection before fixing. The protocol is adopted from Reflexion [Shinn et al., 2023], where agents with verbal self-reflections improved ALFWorld success from 80% to 97%. It provides agent-level, within-project learning that complements system-level self-improvement.
5. Current Industry Landscape
| Category | Key Systems | Gap |
|---|---|---|
| Multi-agent frameworks | CrewAI, AutoGen/AG2, MetaGPT [ICLR 2024], LangGraph, OpenAI Agents SDK, Pydantic AI, Google ADK, Claude Agent SDK, Mastra | None manage instruction decay over time |
| Agent guardrails | NeMo Guardrails (Colang 2.0), Guardrails AI, LlamaGuard, AgentSpec [ICSE 2026] | Most are binary pass/fail for safety — not procedural compliance |
| Self-improvement | Reflexion, MAR, CLIN, Self-Refine, Voyager | Improve agent behavior, not enforcement architecture |
| Memory management | Letta/MemGPT (virtual context, git-backed memory) | Focus on data paging, not instruction freshness |
| Protocols | MCP (tools), A2A (agents) | Standardize communication, not compliance |
| Security | OWASP Top 10 for Agentic Applications [Dec 2025] | Risk taxonomy, not enforcement framework |
Table 6. Industry landscape summary. No existing system combines instruction decay management, graduated enforcement, and structural self-improvement.
| Capability | Industry State | Our Approach |
|---|---|---|
| Instruction decay | Not addressed; CrewAI closest (role re-injection) | Three-layer defense: hooks + warm-start + periodic refresh |
| Enforcement | Binary pass/fail (NeMo, LlamaGuard) | 5-tier graduated ladder with escalation (text → warm-rules → hook) |
| Self-improvement | Agent learns (Reflexion, CLIN) | Agent learns AND system promotes text rules to hooks |
| Validation | Symmetric debate or single reviewer | Specialized roles: challenger + arbiter + independent cross-validator |
| Coordination | Message passing or shared memory | File-based vault: inspectable, diffable, survives agent death |
| Rule format | Ad-hoc natural language | Trigger + action + verification (validated by AgentSpec [ICSE 2026]) |
Table 7. Comparison with industry. The “context engineering” paradigm [Karpathy, 2025] independently validates the core techniques.
Key comparisons:
- AgentSpec [ICSE 2026] validates the trigger-action-verification framework with structurally identical (trigger, predicate, enforcement) 3-tuples. Their DSL is more maintainable at scale; our natural language + bash hooks are more flexible.
- “Found in the Middle” [Yin et al., 2024] proposes model-level fixes (RoPE calibration). Our approach is application-level. Both are valid at different layers and complementary.
- MCP + A2A protocols formalize communication over HTTP/JSON-RPC. Our vault-file approach is lower-tech but files are inspectable, diffable, and survive agent death without coordination infrastructure.
6. Principles for Writing Agent Rules
Ordered by impact (most important first).
Principle 1: Automate enforcement for every critical rule
If a rule has been violated twice and can be checked programmatically, make it a hook. The enforcement ladder:
hooks (99%+) > script gates (95%+) > warm-start injection (~90%) > workflow-embedded text (85–90%) > standalone section (50–70%)
This mirrors the NIOSH Hierarchy of Controls, where engineering controls outrank administrative controls, which in turn outrank reliance on personal vigilance.
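The escalation path from the Summary (1st violation to workflow text, 2nd to warm rules, 3rd to a hook) can be written down as a promotion function. This is a sketch rather than our tooling: tier numbers follow the appendix ladder, the `escalate` name and signature are hypothetical, and leaving a non-checkable rule at the warm-start tier is an assumption the text does not spell out.

```python
from enum import IntEnum


class Tier(IntEnum):
    STANDALONE_TEXT = 1
    WORKFLOW_TEXT = 2
    WARM_START = 3
    SCRIPT_GATE = 4
    HOOK = 5


def escalate(current: Tier, violations: int, checkable: bool) -> Tier:
    """Return the tier a rule should sit at after `violations` violations."""
    if violations >= 3 and checkable:
        target = Tier.HOOK
    elif violations >= 2:
        target = Tier.WARM_START  # assumed ceiling for non-checkable rules
    elif violations >= 1:
        target = Tier.WORKFLOW_TEXT
    else:
        target = current
    return max(current, target)  # never demote a rule
```

For example, `escalate(Tier.WARM_START, 3, checkable=True)` returns `Tier.HOOK`, while the same call with `checkable=False` leaves the rule at `Tier.WARM_START`.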
Principle 2: Protect persistent agents with warm-start mechanisms
Warm Refresh section at the top (~50 lines, 10 most-violated rules) + hook injection on every task + full refresh every N tasks + post-compression re-read.
This raised compliance from ~60% to ~90% — the single most impactful structural change.
Principle 3: Every rule needs trigger + action + verification
- Trigger: “before sending results to user”
- Action: “run the content-check script on file X”
- Verification: “script exit code = 0”
- Consequence: “block the send, fix, retry”
If you can’t define a trigger, the rule is too abstract to enforce.
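The trigger, action, verification, and consequence from the example above map directly onto a small data structure. A minimal sketch under stated assumptions: the `Checkpoint` class and `before_send` function are hypothetical, and a child Python process that exits 0 stands in for the content-check script, whose real path and arguments are not given here.

```python
import subprocess
import sys
from dataclasses import dataclass


@dataclass
class Checkpoint:
    trigger: str        # when the rule applies
    command: list[str]  # action: the check to run
    consequence: str    # what happens when verification fails

    def run(self) -> bool:
        """Run the action; verification is an exit code of 0."""
        result = subprocess.run(self.command, capture_output=True, text=True)
        return result.returncode == 0


def before_send(checkpoint: Checkpoint) -> str:
    """Gate the send step on the checkpoint's verification."""
    if checkpoint.run():
        return "send"
    return f"blocked: {checkpoint.consequence}"


# Stand-in for the content-check script: a child process that exits 0.
demo = Checkpoint(
    trigger="before sending results to user",
    command=[sys.executable, "-c", "raise SystemExit(0)"],
    consequence="block the send, fix, retry",
)
```

A rule that cannot be filled into these four fields is a sign that it is still an abstract principle.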
Principle 4: Embed rules at decision points, not in “Rules” sections
A standalone “Rules” section at the bottom of an instruction file is where rules go to die. State each rule at the exact workflow step where the agent needs it.
Principle 5: Respect the attention budget
| File Length | Recommendation |
|---|---|
| Under 400 lines | Agents reliably follow the full file |
| 400–800 lines | Critical rules at natural workflow positions |
| 800–1,200 lines | Add Warm Refresh section + hook injection |
| Over 1,200 lines | Split into core + modules (only for multi-phase orchestrators) |
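The table can be encoded as a lookup for use in a spec linter. This is a sketch only: the function name and return strings are illustrative, the handling of exact boundary values (400, 800, 1,200) is an assumption because the table's ranges touch, and the fallback for long single-purpose specs follows the guidance elsewhere in this post to keep them as one file with Warm Refresh and hooks.

```python
def recommend(line_count: int, multi_phase_orchestrator: bool = False) -> str:
    """Map an instruction file's length to the recommendation in the table."""
    if line_count < 400:
        return "no restructuring needed"
    if line_count <= 800:
        return "place critical rules at natural workflow positions"
    if line_count <= 1200 or not multi_phase_orchestrator:
        # Long single-purpose specs stay in one file.
        return "add Warm Refresh section and hook injection"
    return "split into core + phase modules"
```

For a 1,274-line single-purpose agent this returns the Warm Refresh recommendation, and for a 1,939-line multi-phase orchestrator it returns the split recommendation.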
Principle 6: Only split files for multi-phase orchestrators
We split our 1,939-line orchestrator into a 433-line core + 5 phase files. We did NOT split single-purpose agents (1,274 and 1,198 lines) — single file + Warm Refresh + hooks is more reliable. Splitting adds ~15–20% Read-call skip risk.
Principle 7: Keep procedures inline — never use pointers
When we extracted procedures to shared files with pointers, agents skipped reading them ~15–20% of the time. Keep full procedures inline; shared files exist as human reference only.
7. Conclusion
- Enforcement mechanism matters more than rule content — the same rule achieves 50% or 99% compliance depending on whether it’s text or a hook
- Persistent agents need active instruction maintenance: without intervention, compliance degrades from 100% to ~40%; with the three-layer defense it stays at ~85–90% between refreshes
- Rule structure follows implementation intention principles — trigger + action + verification = 95–99% compliance; abstract principles = 50–60%
- The instruction file teaches agents WHAT to do. Hooks and injection ensure they DO.
Appendix
A. The Enforcement Ladder
| Tier | Mechanism | Compliance | How It Works | Example |
|---|---|---|---|---|
| 5 | Automated hook | 99%+ | Script fires before/after agent action; blocks violations at system level | Content restriction: banned patterns blocked automatically |
| 4 | Script gate | 95%+ | Orchestrator runs check script at pipeline transitions | Pipeline completion verified before publishing |
| 3 | Warm-start injection | ~90% | Hook delivers 10–15 critical rules with every new task | Most-violated rules appear in freshest context each task |
| 2 | Workflow-embedded text | 85–90% | Rule stated at the exact step where it applies | Decision routing procedures at the routing step |
| 1 | Standalone text | 50–70% | Rule stated once in a separate section, no enforcement | “General Principles” section — promote everything out of this tier |
Table A1. Full enforcement ladder with tiers, compliance rates, mechanisms, and examples.
B. Testing Results
| Scenario | Before Redesign | After Redesign |
|---|---|---|
| Agent at cold start | 100% (rules fresh) | 100% (same) |
| Agent at task 5 needs dead-zone rule | ~60% (forgotten) | ~90% (hook injects rule fresh) |
| Context compresses after 3 hours | ~40% (rules in summary) | ~85% (hook fires fresh + post-compression re-read) |
| Cross-phase rule needed in wrong phase | Often missed | Works (cross-phase rules in always-loaded core) |
| Agent needs edge-case procedure | Often improvised | Works (full procedure in spec; warm-rules remind) |
| Quality review pipeline before publishing | 28% completion | ~95% (hook blocks publishing until reviews done) |
| Orchestrator spec too thin (272 lines) | N/A | Failed — restored to 433 lines; core too thin |
| Pointer to shared protocol file | N/A | Failed — agents skip Read calls ~15–20%; reverted to inline |
| Agent receives 10+ tasks without refresh | ~40% (severe decay) | ~85% (warm-start + 8-task refresh cycle) |
| Adversarial validation after synthesis | Occasionally skipped | ~95% (hook blocks publishing without validation) |
Table B1. Testing results across 10 critical scenarios, comparing compliance before and after the three-layer defense.
C. References
Positional Bias & Context:
- [Liu et al., 2024] “Lost in the Middle: How Language Models Use Long Contexts.” TACL.
- [Yin et al., 2024] “Found in the Middle: Calibrating Positional Attention Bias Improves Long Context Utilization.” ACL 2024.
- [Hsieh et al., 2024] “RULER: What’s the Real Context Size of Your Long-Context Language Models?”
- [Levy et al., 2024] “Same Task, More Tokens: The Impact of Input Length on the Reasoning Performance of LLMs.” ACL 2024.
- [Xiao et al., 2024] “Efficient Streaming Language Models with Attention Sinks.” ICLR 2024.
- [Paulsen, 2025] “Maximum Effective Context Window (MECW).”
- [Jiang et al., 2023] “LLMLingua: Compressing Prompts for Accelerated Inference.” EMNLP 2023.
Instruction Following:
- [Zhou et al., 2023] “Instruction-Following Evaluation (IFEval).”
- [Mu et al., 2024] “Rule-Based Rewards for LLM Instruction Following (RuLES).”
- [IFEval++, 2025] “Instruction Following Evaluation with Rephrasing.”
- [AGENTIF, 2025] “Agentic Instruction Following Benchmark.” Tsinghua University.
- [MultiChallenge, ACL 2025] “Instruction Retention in Multi-Turn Conversations.”
Cognitive Science & Human Factors:
- [Gollwitzer & Sheeran, 2006] “Implementation Intentions and Goal Achievement: A Meta-Analysis.” Advances in Experimental Social Psychology.
- [Haynes et al., 2009] “A Surgical Safety Checklist to Reduce Morbidity and Mortality in a Global Population.” NEJM 360(5):491-499.
- [Gawande, 2009] “The Checklist Manifesto.” Metropolitan Books.
- [Degani & Wiener, 1993] “Cockpit Checklists: Concepts, Design, and Use.” Human Factors.
- [Sweller, 1988] “Cognitive Load During Problem Solving.” Cognitive Science.
- [Reason, 1990] “Human Error.” Cambridge University Press.
Multi-Agent Systems:
- [Hong et al., 2023] “MetaGPT: Meta Programming for A Multi-Agent Collaborative Framework.” ICLR 2024 Oral.
- [Du et al., 2023] “Improving Factuality and Reasoning in Language Models through Multiagent Debate.”
- [Shinn et al., 2023] “Reflexion: Language Agents with Verbal Reinforcement Learning.” NeurIPS 2023.
- [Wang et al., 2023] “Voyager: An Open-Ended Embodied Agent with Large Language Models.”
- [Packer et al., 2023] “MemGPT: Towards LLMs as Operating Systems.” (Now Letta.)
- [Madaan et al., 2023] “Self-Refine: Iterative Refinement with Self-Feedback.” NeurIPS 2023.
- [MAR, 2025] “Multi-Agent Reflexion.” Dec 2025.
- [CLIN, 2025] “Continual Learning for Language-Based Agents.”
- [Masterman et al., 2024] “The Landscape of Emerging AI Agent Architectures.” Survey.
Agent Safety & Guardrails:
- [Rebedea et al., 2023] “NeMo Guardrails: A Toolkit for Controllable and Safe LLM Applications.” (Colang 2.0 Beta, 2025.)
- [Inan et al., 2023] “LlamaGuard: LLM-Based Input-Output Safeguard for Human-AI Conversations.”
- [Wang et al., 2026] “AgentSpec: Runtime Enforcement for LLM Agent Systems.” ICSE 2026.
- [OWASP, 2025] “Top 10 for Agentic Applications.” Dec 2025.
- [Shamsujjoha et al., 2025] “Swiss Cheese Model for AI Agent Safety.” IEEE ICSA 2025.
Context Engineering & Frameworks:
- [Karpathy, 2025] “Context Engineering.” Blog post.
- [Anthropic, 2024] “Building Effective Agents.” Dec 2024.
- [Anthropic, 2026] “Context Engineering Guide.”
- [OpenAI, 2025] “Agents SDK.” Mar 2025.
- [Yao et al., 2023] “ReAct: Synergizing Reasoning and Acting in Language Models.” ICLR 2023.
- [Schick et al., 2023] “Toolformer: Language Models Can Teach Themselves to Use Tools.” NeurIPS 2023.
This is the second post in a series. The first post covers the 17-agent architecture and analysis lifecycle. Future posts will cover lessons learned and performance metrics.