RZ AI Learning

Keeping Multi-Agent Systems Alive: How We Built a Watchdog Layer That Prevents Silent Failures

Sixth in a series. Previous posts: (1) AI Agent Teams for Analytics – the 17-agent architecture and lifecycle. (2) Best Practices To Keep AI Agents on Track – how enforcement mechanism matters more than rule content. (3) Agent Discussion: The Quality Layer That Harness Engineering Can’t Replace – how structured agent-to-agent discussion catches judgment-dependent quality issues. (4) The Dispatcher – how repeatable evaluation turns spec changes into measurements. (5) The Captain-Dispatcher Design – how independent-process agents solve per-agent context starvation.

Summary

  1. Multi-agent systems fail silently. Agents stall, exhaust context, enter dead loops, or block the pipeline – and nothing alerts anyone. Before the watchdog layer, we observed captains that appeared dead but were actually blocked on sub-agents running long cross-validation queries. The pipeline looked hung; the root cause was invisible without proactive monitoring. The watchdog system now detects these stalls within 10-15 minutes (two to three 5-minute poll cycles) and routes recovery to Captain automatically.
  2. We built a 4-layer monitoring system to catch these failures. Layer 1: a dedicated Watchdog agent that self-polls every 5 minutes and alerts the orchestrator on stalls. Layer 2: 14 self-checks inside Captain that verify agent liveness, file output, and memory. Layer 3: harness hooks that enforce kill hierarchies, version propagation, and sub-agent timeout tracking at the system level. Layer 4: a multi-sensor memory tracker in the Dispatcher that enables safe parallel pipeline operation — throttling launches, killing idle workers, and preventing OOM cascades that would take down every running pipeline.
  3. The kill hierarchy prevents agents from destroying each other. A spawn registry tracks which agent created which session. Kills flow downward only: Dispatcher can kill Captains it spawned; Captains can kill agents they spawned. No self-kills. No upward kills. tmux kill-server is unconditionally blocked.
  4. Monitoring the monitor is not optional. Captain runs a WATCHDOG_ALIVE check on every self-check cycle. If Watchdog is dead, Captain respawns it. If Watchdog is alive but silent for >10 minutes, Captain nudges it. A dead Watchdog means stalls go undetected and the pipeline deadlocks.
  5. Latest evaluation: 21 stall detections, 193 memory interventions, 12/12 pipelines delivered. Across 12 overnight runs, Watchdog flagged 21 stall conditions and the memory monitor made 193 interventions (pausing launches, throttling concurrency, reclaiming idle workers). Without this stack, our first attempt at 4-parallel runs resulted in 0 deliverables — all 4 killed by memory pressure within 46 minutes. With it: 12/12 completed, 445 files, 117 charts, ~86% system-wide compliance (+6pp over the prior iteration).
  6. Next frontiers: faster zombie detection, sub-agent lifecycle management, mid-pipeline version propagation, and message delivery verification. The monitoring stack catches the failure modes that previously caused pipeline hangs. The next generation will tighten detection latency, automate sub-agent cleanup, coordinate multi-agent version refreshes, and add delivery acknowledgment for cross-agent messages.

1. Introduction

The previous posts in this series described how to make a multi-agent pipeline correct: enforcement mechanisms for rules (post 2), structured discussion for judgment quality (post 3), automated evaluation for measurement (post 4), and full per-agent context windows for analytical depth (post 5). Each of those layers assumed that the agents were alive and running. None of them addressed what happens when they aren’t.

This post is about what happens when agents die, stall, or silently stop producing useful work – and the monitoring infrastructure that detects these failures before the pipeline hangs indefinitely.

Multi-agent systems have a failure mode that single-agent systems don’t: silent pipeline stalls. When a single agent hangs, the user sees a spinner and eventually gives up. When one agent in an eight-agent pipeline hangs, the other seven agents are unaffected – but the pipeline is blocked. The orchestrator is waiting for output that will never arrive. No error fires. No hook triggers. The system looks alive but is functionally dead.

This failure mode is especially dangerous in unattended operation. When the Dispatcher (post 4) runs 12 pipelines overnight, a silent stall at the 30-minute mark wastes 60+ minutes of wall-clock time before the Dispatcher’s own stall detection catches it. That is 60 minutes of compute burned, and the run must restart from scratch. Worse, if the stall is in the orchestrator itself, the Dispatcher cannot distinguish “Captain is thinking about a complex routing decision” from “Captain is deadlocked on a sub-agent that crashed 20 minutes ago.”

We built a 4-layer monitoring system to catch these failures. This post covers the problem, the four layers, the kill hierarchy that prevents agents from accidentally destroying each other, and the results from our latest evaluation.


2. The Agent Architecture: Why Monitoring Is Hard

To understand the monitoring problem, you need to understand the three types of agents in the system and why each exists.

Three Types of Agents

Type 1: Core agents (tmux sessions, 1M context, persistent)

Eight agents – Analyst, Data, Execution, Auditor, Writer, Judge, Improve, Watchdog – each run as an independent top-level process in their own tmux session for the full duration of the pipeline. Each gets the full 1M-token context window (post 5). They communicate through file-based SendMessage mailboxes. They accumulate context across the entire project and maintain state for discussion protocols.

Type 2: Sub-agents (ephemeral, spawned via Agent(), parallel work)

Short-lived agents spawned by core agents for independent, parallel tasks: fetching schemas, running independent queries, checking partition availability. They complete their task and terminate. They inherit context from the core agent that spawned them but do not participate in pipeline-wide checkpoint discussions.

Type 3: Captain (orchestrator, thin router)

Captain runs in its own tmux session for the full pipeline. It routes tasks, manages gates, writes the event log, and delegates all substantive work to the other core agents. Captain’s context stays small because it handles routing, not analysis.

Why So Many?

Each role is distinct. An Auditor that also writes queries would conflate generation and review – the same agent grading its own exam. A Data agent that also ran ML models would exhaust its context on two fundamentally different tasks. Specialization enables the discussion architecture (post 3): the Auditor can challenge the Analyst’s methodology because it is a separate process with a separate perspective.

The Cost

8-12 concurrent processes per pipeline, each consuming substantial memory. On our development server, a single pipeline at steady state uses ~60-80GB of RAM across all agent processes. Two concurrent pipelines (the Dispatcher’s standard configuration) use ~100-130GB. Our dispatcher logs show peak memory at 135GB with 38 processes during overlap periods.

This resource intensity is the reason monitoring cannot be an afterthought. A single OOM event or process crash ripples through the pipeline. The agent that dies stops producing output. The agent waiting for that output stalls. The orchestrator waiting for both of them stalls. One agent is dead and two more are blocked behind it, yet from the outside the system looks like it is “still running.”


3. The Problems: How Multi-Agent Systems Fail Silently

These failure modes emerged from our evaluation runs – over 60 automated pipeline runs across five iterations of the system. Each pattern was discovered through the Dispatcher’s systematic logging (post 4), not through targeted debugging.

Problem 1: Silent Stalls at Handoff Points

Agent A completes its work and sends a message to Agent B via SendMessage. Agent B has crashed, or has entered an infinite loop, or is blocked on a tool call that will never return. No hook fires because no rule was violated. No error message appears. The pipeline hangs.

In early iterations, the Dispatcher’s per-run activity logging revealed that these handoff stalls were the primary cause of pipeline timeouts – more common than logic errors or wrong outputs. The exact handoff point and the specific agent responsible were only identifiable because the Dispatcher tracked file modification timestamps across all agents in each run.

Problem 2: Context Exhaustion (Pre-Independent-Process Architecture)

Under the earlier sub-process model (before the independent-process architecture described in post 5), every agent exhausted its context within 95 minutes on complex tasks. When an agent’s context compacted, it lost specific procedures, evidence chains, and checkpoint state. Post-compaction, the agent would produce work that looked normal but silently dropped quality. A Writer that compacted mid-draft would cite figures it could no longer verify.

The Captain-Dispatcher architecture (post 5) largely solved this – our latest evaluation shows no context exhaustion events across all 12 runs. But the monitoring layer still tracks context utilization as a leading indicator.

Problem 3: Zombie Sessions

An agent’s process crashes but its tmux session persists. The session appears alive – tmux list-sessions shows it, tmux has-session returns success. But the Claude process inside the session has exited. The agent is a zombie: visibly alive, functionally dead. Without process-level monitoring (checking whether a live Claude process exists inside the session, not just whether the session exists), zombies go undetected until the Dispatcher’s 120-minute stall timeout fires.
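
The process-level check described above could look roughly like the sketch below. It is an illustration in Python, not the pipeline's own code. It assumes the process table has been captured with `ps -eo pid=,ppid=,comm=` and that the pane's root PID comes from `tmux list-panes -t <session> -F '#{pane_pid}'`. The process name `claude` and both function names are assumptions.

```python
from collections import defaultdict

def parse_ps(ps_output: str) -> list[tuple[int, int, str]]:
    """Parse `ps -eo pid=,ppid=,comm=` output into (pid, ppid, command) rows."""
    rows = []
    for line in ps_output.splitlines():
        parts = line.split(None, 2)
        if len(parts) == 3:
            rows.append((int(parts[0]), int(parts[1]), parts[2].strip()))
    return rows

def has_live_agent(pane_pid: int, rows, agent_comm: str = "claude") -> bool:
    """True if any descendant of the pane's root process is named agent_comm.

    agent_comm="claude" is an assumed process name; adjust to the real binary.
    """
    children = defaultdict(list)
    for pid, ppid, comm in rows:
        children[ppid].append((pid, comm))
    stack, seen = [pane_pid], set()
    while stack:
        parent = stack.pop()
        if parent in seen:
            continue
        seen.add(parent)
        for pid, comm in children[parent]:
            if comm == agent_comm:
                return True
            stack.append(pid)
    return False
```

A session whose pane contains only a shell fails this check even though `tmux has-session` still succeeds, which is exactly the zombie case.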

Problem 4: Sub-Agent Memory Accumulation

Each sub-agent spawned by a core agent consumes additional memory. A core agent that spawns multiple sub-agents during a complex step can accumulate 10-20GB of additional RSS. When multiple core agents do this simultaneously – Data running parallel schema lookups while Execution runs parallel queries – the aggregate memory spike can push the system from “healthy headroom” to “critical pressure” inside a single 2-minute monitoring cycle.

The dispatcher log shows memory usage climbing from 57GB (pipeline start) to 135GB (peak, with 38 processes) – a 78GB increase during the active phase, with a single 2-minute spike from 118GB to 135GB.

Problem 5: Captain-Level Deadlocks

The most insidious failure: Captain appears to be alive and is technically working, but it is blocked waiting for a sub-agent whose output will never arrive. Before the watchdog layer, we observed captains stuck in this state for 30+ minutes – blocked on background agents running long cross-validation queries. The pipeline was alive at every layer except the one that mattered. We initially diagnosed these as “captain deaths” before discovering (via pstree) that the claude processes were still running inside tmux panes showing sub-agent contexts. The watchdog system now catches these within 10 minutes via the checkpoint deadlock signal.


4. The Watchdog System: Four Layers

Each layer catches a failure mode the others cannot.

flowchart TD
    L1["Layer 1: Watchdog Agent<br/>5-min polls, stall classification, alerts"] --> L2["Layer 2: Captain Self-Checks<br/>14 checks per cycle, WATCHDOG_ALIVE, phase gates"]
    L2 --> L3["Layer 3: Harness Hooks<br/>tmux kill guard, version propagation BLOCK, AA timeout watcher"]
    L3 --> L4["Layer 4: Dispatcher Memory Tracker<br/>Multi-sensor zones, launch throttling, idle worker reclamation"]

    style L1 fill:#e8f5e9,stroke:#2E7D32
    style L2 fill:#fff3e0,stroke:#E65100
    style L3 fill:#e3f2fd,stroke:#1565C0
    style L4 fill:#fce4ec,stroke:#C62828

Layer 1: The Watchdog Agent

Watchdog is a dedicated core agent spawned at Phase 0 alongside the other seven. Unlike other core agents that idle between tasks, Watchdog runs a continuous self-polling loop. Its sole purpose is liveness monitoring – it never makes analytical decisions, never judges output quality, never routes tasks. It observes and reports.

The Polling Loop

Every 5 minutes, Watchdog runs a poll script that checks six signals across all core agents:

| Signal | Method | Stall Indicator |
|---|---|---|
| Memory | free -m, agent RSS | Swap < 25% = MEMORY_LOW, < 15% = MEMORY_CRITICAL |
| Session alive | tmux has-session | Session not found = SESSION_DEAD |
| Agent file output | stat -c %Y on handoff/output files | No new file past dispatch marker threshold |
| Dispatch markers | Parse expected_max + dispatched count | Past expected_max x 1.5 with no output |
| Checkpoint deadlock | Auditor APPROVE in checkpoint file but no framework/handoff written >10 min later | APPROVE received but agent didn’t act |
| Pipeline progress | Count files, charts, verify framework + final-analysis + deliverable doc | Overall health snapshot |
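
The file-output and dispatch-marker signals combine into one stall test: no output newer than the dispatch, and more than expected_max x 1.5 elapsed. A minimal sketch of that test follows. It assumes timestamps are epoch seconds and that expected_max is expressed in seconds, which the post does not state. The function names are hypothetical.

```python
import os

def newest_mtime(paths):
    """Latest modification time (epoch seconds) among existing paths, else None."""
    mtimes = [os.stat(p).st_mtime for p in paths if os.path.exists(p)]
    return max(mtimes) if mtimes else None

def is_past_threshold(dispatched_at, expected_max_s, latest_output_mtime, now,
                      factor=1.5):
    """Stall indicator for one dispatched task.

    Returns False as soon as any output newer than the dispatch exists.
    Otherwise returns True once elapsed time exceeds expected_max_s * factor.
    The unit (seconds) is an assumption made for this sketch.
    """
    if latest_output_mtime is not None and latest_output_mtime > dispatched_at:
        return False
    return (now - dispatched_at) > expected_max_s * factor
```

With a 10-minute expected_max, a task is flagged once 15 minutes pass without a new output file.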

Tiered Alerting (Alert-Once-Per-Stall)

Watchdog does not cry wolf on every slow agent. It uses a three-tier escalation model:

  • TIER_1 (first detection): Logged internally, no alert. The agent might just be running a large query.
  • TIER_2 (second poll cycle, 10+ minutes): STALL_ALERT sent to Captain with diagnosis and recommended recovery action.
  • TIER_3 (third poll cycle, 15+ minutes): STALL_CRITICAL with ESCALATE recommendation.
  • Reset: When the poll shows new file output for that agent, the alert state clears.

This prevents alert fatigue while catching genuine stalls. An agent running a 7-minute query triggers TIER_1 and clears on the next cycle. An agent stuck in a dead loop triggers a STALL_ALERT to Captain at TIER_2 (10+ minutes) and escalates to STALL_CRITICAL at TIER_3 (15+ minutes).
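
The escalation model is a small per-agent state machine. The sketch below counts consecutive stalled polls and emits each tier once. The class and method names are hypothetical, and staying silent after TIER_3 until a reset is an assumption drawn from the "alert once per stall" description.

```python
from dataclasses import dataclass, field

@dataclass
class StallTracker:
    """Counts consecutive stalled polls per agent and emits each tier once."""
    stalled_polls: dict = field(default_factory=dict)

    def observe(self, agent: str, has_new_output: bool):
        """Record one poll result; return the alert to send, or None."""
        if has_new_output:
            self.stalled_polls.pop(agent, None)   # Reset: alert state clears
            return None
        count = self.stalled_polls.get(agent, 0) + 1
        self.stalled_polls[agent] = count
        if count == 2:
            return "STALL_ALERT"       # TIER_2: second stalled poll
        if count == 3:
            return "STALL_CRITICAL"    # TIER_3: third stalled poll
        return None                    # TIER_1, or already escalated (assumed)
```

Calling `observe` once per agent per 5-minute poll reproduces the 10-minute and 15-minute alert points.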

Diagnose Before Alerting

Before sending any alert, Watchdog runs a quick diagnosis:

  1. Captures the stalled agent’s last 10 lines of tmux output – looking for an idle prompt (agent’s turn ended), an activity indicator (still working), error messages, or a shell prompt (process died).
  2. Checks if output was written to an unexpected path.
  3. Checks the process tree to determine whether the Claude process is alive inside the tmux session.

The diagnosis enriches the alert so Captain can act immediately instead of re-investigating:

WATCHDOG STALL_ALERT:
- Agent: execution
- Idle: 12m (threshold: 10m)
- Pane state: idle prompt (agent's turn ended without acting)
- Partial output: execution-step3.md exists (230 lines)
- Diagnosis: silent turn — agent completed tool call but didn't continue
- Recommendation: NUDGE

What Watchdog Does Not Do

Watchdog does not make analytical decisions (which table, which methodology), does not judge output quality (that is Auditor’s role), does not route tasks (Captain’s role), does not fix agent specs (Improve’s role), and critically, does not send messages to stalled agents directly. Watchdog observes; Captain acts. This separation prevents Watchdog from accidentally disrupting an agent that is working but slow.

Layer 2: Captain Self-Checks

Captain runs 14 self-checks on its own monitoring cycle. These catch problems that Watchdog cannot see (because they are inside Captain’s own context) and problems that Watchdog can see but might miss (defense in depth).

The critical check for this post:

Check 14 – WATCHDOG_ALIVE: On every self-check cycle, Captain verifies Watchdog is alive by checking tmux has-session -t {prefix}-watchdog. If dead, Captain respawns Watchdog via bash tools/spawn-core-agent.sh watchdog {slug} {output_dir} {prefix}. If alive but the last Watchdog entry in the hook log is >10 minutes old, Captain sends a nudge message.

This is the “monitoring the monitor” check. A dead Watchdog means stalls go undetected and the pipeline deadlocks. Without this check, the Watchdog agent itself becomes a single point of failure.
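
The decision inside WATCHDOG_ALIVE reduces to three outcomes. The sketch below shows one way to express it. The function names and return labels are hypothetical. Treating a missing log entry as "silent" is an assumption.

```python
import subprocess

def session_exists(name: str) -> bool:
    """True if `tmux has-session -t name` exits 0.

    A zombie session (process gone, session still present) also passes this test.
    """
    try:
        result = subprocess.run(["tmux", "has-session", "-t", name],
                                capture_output=True)
    except FileNotFoundError:          # tmux is not installed
        return False
    return result.returncode == 0

def watchdog_action(session_alive: bool, last_log_age_s,
                    silence_limit_s: float = 600.0) -> str:
    """Return RESPAWN, NUDGE, or OK for one self-check cycle."""
    if not session_alive:
        return "RESPAWN"               # dead Watchdog: spawn a new one
    if last_log_age_s is None or last_log_age_s > silence_limit_s:
        return "NUDGE"                 # alive but silent for more than 10 minutes
    return "OK"
```

The log-age branch matters because a session check alone cannot tell a working Watchdog from a silent one.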

Other relevant self-checks include:

| Check | What It Catches |
|---|---|
| WATCHDOG_TIER3_OUTPUT_PEEK | When Watchdog fires TIER_3, Captain peeks at the expected output file mtime. If the file exists with mtime > dispatch time, the agent completed but lost its message. Captain reads the file and advances the pipeline instead of respawning. |
| Partition liveness (C3) | Probes SELECT COUNT(*) for every (table, ds) pair. Catches stale partitions before they cause silent data errors. |
| Dispatch marker tracking (C11) | Compares dispatched task count against expected count. Catches dropped tasks. |
| Phase gate enforcement | Ensures the pipeline doesn’t advance past a phase boundary without all required artifacts. |
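
The WATCHDOG_TIER3_OUTPUT_PEEK check is a single mtime comparison. A minimal sketch follows, with a hypothetical function name and hypothetical return labels, assuming the dispatch time is available as epoch seconds.

```python
import os

def tier3_peek(output_path: str, dispatched_at: float) -> str:
    """Decide what to do with an agent that reached TIER_3.

    ADVANCE: the output file exists and was written after the dispatch, so the
             work finished and only the completion message was lost.
    RESPAWN: no output, or only stale output from before the dispatch.
    """
    try:
        mtime = os.stat(output_path).st_mtime
    except FileNotFoundError:
        return "RESPAWN"
    return "ADVANCE" if mtime > dispatched_at else "RESPAWN"
```

Checking the file before respawning avoids redoing work that already completed.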

Layer 3: Harness Hooks

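Four mechanisms operate at the system level – below both Watchdog and Captain: three hooks and one mandatory spawn script. The hooks fire automatically on every matching agent action, regardless of whether the agent remembers its rules. Together they address three specific failure modes (unauthorized kills, stale agent specs, and sub-agent timeouts), and the spawn script enforces the architectural constraints the hooks rely on.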

Hook 1: tmux Kill Guard (tmux-kill-guard.sh)

A PreToolUse:Bash hook that intercepts every bash command containing tmux kill-session, tmux kill-window, tmux kill-pane, or tmux kill-server. Before the kill executes, the hook checks:

  1. Is tmux kill-server the command? Always blocked, unconditionally.
  2. Is the caller Captain or Dispatcher? If neither, blocked. Only Captain and Dispatcher are ever allowed to kill sessions.
  3. Is the target session in the caller’s spawn registry? If not, blocked. You can only kill sessions you created.
  4. Is the target the caller’s own session? Blocked. No self-kills.

The spawn registry is a JSONL file (/tmp/tmux-spawn-registry-${USER}.jsonl) written at spawn time by spawn-core-agent.sh. Each line records who spawned whom:

{"spawner_session":"ct-v6-t1-r1-captain","spawned_session":"ct-v6-t1-r1-analyst","role":"analyst","slug":"my-analysis","ts":"2026-06-02T16:48:01+00:00"}

Session names follow a <run-prefix>-<role> pattern: the prefix (e.g., ct-v6-t1-r1 for “compliance test, v6, task 1, run 1”) groups all sessions belonging to one pipeline run; the suffix names the agent role (captain, analyst, data, improve, etc.). This makes provenance traceable from session name alone.
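
The four checks and the registry lookup can be sketched as below. This is a Python illustration of the logic, not the contents of tmux-kill-guard.sh. How the real hook identifies the caller is not described in this post, so `caller_role` is passed in as an assumed input, and both function names are hypothetical.

```python
import json

def load_spawned(registry_lines, spawner_session):
    """Sessions the JSONL registry records as spawned by spawner_session."""
    spawned = set()
    for line in registry_lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        if record.get("spawner_session") == spawner_session:
            spawned.add(record["spawned_session"])
    return spawned

def kill_allowed(command, caller_session, caller_role, target_session,
                 registry_lines):
    """Apply the four checks in order; return (allowed, reason)."""
    if "tmux kill-server" in command:
        return False, "kill-server is always blocked"
    if caller_role not in ("captain", "dispatcher"):
        return False, "only Captain and Dispatcher may kill sessions"
    if target_session not in load_spawned(registry_lines, caller_session):
        return False, "target is not in the caller's spawn registry"
    if target_session == caller_session:
        return False, "self-kill"
    return True, "ok"
```

Given the registry line shown above, the captain session may kill `ct-v6-t1-r1-analyst`, while the analyst session is refused for any target.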

Hook 2: Version Bump BLOCK (version-and-identifier-hook.sh)

When agent specifications are updated mid-pipeline, a UserPromptSubmit hook injects a VERSION_REFRESH_REQUIRED warning and writes a lock file. A companion PreToolUse:SendMessage hook then blocks the agent’s outgoing messages until it reads the updated vault/agent-versions.md. The lock auto-expires after 10 minutes to prevent permanent blocks.

This hook was upgraded from advisory (text-only warning, ~50% compliance) to blocking (Tier 5, exit 2) based on the enforcement ladder principle from post 2: the same rule achieves 50% or 99% compliance depending on whether it is text or a hook.
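
The blocking behavior with auto-expiry can be sketched as a pure function of the lock's age. This assumes the lock's creation time is available as epoch seconds and that a removed lock (the agent has read the updated versions file) is represented as `None`. The function names are hypothetical.

```python
def send_blocked(lock_created_at, now, ttl_s: float = 600.0) -> bool:
    """True while the version lock exists and is younger than ttl_s seconds."""
    if lock_created_at is None:        # no lock file: nothing to enforce
        return False
    return (now - lock_created_at) < ttl_s

def hook_exit_code(lock_created_at, now) -> int:
    """Exit 2 blocks the outgoing message; exit 0 lets it through."""
    return 2 if send_blocked(lock_created_at, now) else 0
```

The 10-minute expiry means a lost or unreadable versions file degrades to a delay rather than a permanent block.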

Hook 3: AA Output Watcher (aa-output-watcher.sh)

A dual-trigger hook that monitors sub-agents spawned for cross-validation (AA agents). It records spawn times on PostToolUse:Agent events and polls for timeouts on UserPromptSubmit events. Per-AA-type timeouts are calibrated to the expected work:

| AA Type | Timeout |
|---|---|
| aa-data-step* | 10 min |
| aa-execution-step* | 15 min |
| aa-framework, aa-data, aa-final | 30 min |

When an AA agent exceeds its timeout with no output file, the hook emits a systemMessage alerting Captain to re-spawn or document the skip. It also detects late-arriving DISAGREE verdicts (AA finished after the step was already approved) and triggers reconciliation.
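
The per-type timeout lookup can be sketched with glob-style patterns taken from the table above. The function names, the epoch-second timestamps, and the choice to ignore unknown AA names are assumptions made for this sketch.

```python
from fnmatch import fnmatch

# Patterns and limits copied from the timeout table; order does not matter
# because no AA name matches more than one pattern.
AA_TIMEOUTS_MIN = [
    ("aa-data-step*", 10),
    ("aa-execution-step*", 15),
    ("aa-framework", 30),
    ("aa-data", 30),
    ("aa-final", 30),
]

def timeout_minutes(aa_name: str):
    """Timeout in minutes for an AA agent name, or None if it is not tracked."""
    for pattern, minutes in AA_TIMEOUTS_MIN:
        if fnmatch(aa_name, pattern):
            return minutes
    return None

def timed_out(aa_name: str, spawned_at: float, now: float,
              output_exists: bool) -> bool:
    """True if the agent is past its limit and has produced no output file."""
    limit = timeout_minutes(aa_name)
    if limit is None or output_exists:
        return False
    return (now - spawned_at) > limit * 60
```

Recording `spawned_at` on the spawn event and calling `timed_out` on each later prompt mirrors the dual-trigger design.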

Hook 4: Spawn Script Enforcement (spawn-core-agent.sh)

Not a hook in the technical sense, but a mandatory spawn pathway that enforces architectural constraints:

  • Validates the agent role against a whitelist (analyst, data, execution, auditor, writer, judge, improve, watchdog, aa-framework, aa-data, aa-final).
  • Spawns via tmux new-session – never Agent(). This is what gives each agent its full 1M context (post 5).
  • Sets required environment variables (PROJECT_SLUG, PROJECT_OUTPUT_DIR, SESSION_PREFIX, sensitive mode flags).
  • Registers the spawn in the kill guard’s registry.
  • Checks for existing sessions to prevent duplicate spawns.

Layer 4: Dispatcher Memory Tracker

The first three layers monitor agent behavior. Layer 4 monitors the machine itself — the shared resource that all agents depend on.

When the Dispatcher runs multiple pipelines in parallel, each pipeline spawns 8-12 agent processes at ~2GB each. Two concurrent pipelines can consume 100-130GB on a 223GB machine. Without active memory management, a single pipeline’s review fan-out (spawning 10+ sub-agents simultaneously) can push the system past the OOM threshold and kill every running pipeline — not just the one that caused the spike.

Multi-Sensor Zone Model

The Dispatcher checks memory every 2 minutes using four sensors:

| Sensor | YELLOW Threshold | RED Threshold |
|---|---|---|
| Available RAM % | < 20% | < 10% |
| Claude process count | > 36 | > 45 |
| Swap free % | < 40% | < 25% |
| Max single-process RSS | > 8GB | > 12GB |

Any single sensor crossing its threshold triggers the zone. The zone determines the action:

  • GREEN: Normal operation. Launch new pipelines when slots open.
  • YELLOW: Pause all new launches. Running pipelines continue. Monitor for recovery.
  • RED: Pause launches AND SIGKILL one idle non-exempt worker per cycle. Captain sessions are immune (killing the orchestrator would orphan all its agents). Idle threshold: agent must have been inactive for 10+ minutes to be eligible.
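
The zone model and the RED-zone reclaim rule can be sketched as follows, using the thresholds from the sensor table. The names are hypothetical. Picking the longest-idle worker first and identifying Captain sessions by a `-captain` suffix are assumptions, since the post does not state the selection order or the full exemption list.

```python
from dataclasses import dataclass

@dataclass
class Sensors:
    avail_ram_pct: float
    claude_procs: int
    swap_free_pct: float
    max_rss_gb: float

def classify_zone(s: Sensors) -> str:
    """Any single sensor past its threshold sets the zone; RED wins over YELLOW."""
    if (s.avail_ram_pct < 10 or s.claude_procs > 45
            or s.swap_free_pct < 25 or s.max_rss_gb > 12):
        return "RED"
    if (s.avail_ram_pct < 20 or s.claude_procs > 36
            or s.swap_free_pct < 40 or s.max_rss_gb > 8):
        return "YELLOW"
    return "GREEN"

def pick_reclaim_candidate(workers, now: float, idle_s: float = 600.0):
    """Pick one worker to reclaim in RED, or None.

    workers: iterable of (session_name, last_active_epoch).
    Captain sessions are exempt; a worker must be idle for at least idle_s.
    """
    eligible = [(last_active, name) for name, last_active in workers
                if not name.endswith("-captain") and now - last_active >= idle_s]
    return min(eligible)[1] if eligible else None   # longest idle first (assumed)
```

Because any one sensor can set the zone, 41 Claude processes is enough for YELLOW even with half of RAM still available.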

Evaluation Evidence

Across the 12-run evaluation (463 monitoring cycles over 15.5 hours):

| Zone | Cycles | Action Taken |
|---|---|---|
| GREEN | 281 (61%) | Normal operation |
| YELLOW | 171 (37%) | Paused launches 171 times |
| RED | 11 (2%) | SIGKILLed 11 idle workers |

pie title Memory zone distribution (463 monitoring cycles)
    "GREEN — normal" : 281
    "YELLOW — paused new launches" : 171
    "RED — killed an idle worker" : 11

The YELLOW zone is doing the heavy lifting. Pausing launches early kept the system from reaching the much-more-disruptive RED zone for any sustained period. Available RAM ranged from 29% (lowest, during review fan-out) to 74% (after task completion cleanup). The memory tracker kept the system stable across 12 pipeline completions — never triggering a full OOM, never killing an active agent, never losing a pipeline to memory pressure.

The contrast — before vs after:

Our first attempt ran 4 pipelines in parallel without launch throttling. The memory progression tells the story:

| Time | Event | Avail RAM | Procs | Zone |
|---|---|---|---|---|
| 14:29 | Launch 4 pipelines | 74% | 10 | GREEN |
| 14:31 | Agents spawning | 49% | 41 | YELLOW |
| 14:33 | All 4 captains + agents running | 36% | 50 | RED |
| 14:35 | Still growing | 33% | 54 | RED |
| 14:43 | SIGKILL ct-v6-t1-r1-improve | 30% | 55 | RED |
| 14:45 | SIGKILL ct-v6-t1-r1-judge | 31% | 56 | RED |
| 14:47 | SIGKILL ct-v6-t1-r2-judge | 32% | 54 | RED |
|  | Killed 11 workers over 46 min | 33-38% | 47-55 | RED |
| 15:15 | Last log entry | 38% | 47 | RED |

flowchart LR
    A["14:29<br/>Launch<br/>74% RAM<br/>10 procs"] --> B["14:31<br/>Agents spawning<br/>49% RAM<br/>41 procs"]
    B --> C["14:33<br/>RED zone<br/>36% RAM<br/>50 procs"]
    C --> D["14:43<br/>1st SIGKILL<br/>30% RAM<br/>55 procs"]
    D --> E["14:47<br/>3rd SIGKILL<br/>32% RAM<br/>54 procs"]
    E --> F["15:15<br/>Abandoned<br/>38% RAM<br/>0 deliverables"]

    style A fill:#c8e6c9,stroke:#2E7D32
    style B fill:#fff3e0,stroke:#E65100
    style C fill:#ffcdd2,stroke:#C62828
    style D fill:#ffcdd2,stroke:#C62828
    style E fill:#ffcdd2,stroke:#C62828
    style F fill:#cfd8dc,stroke:#37474F

Within 4 minutes of launch, RAM headroom fell from 74% to 36% and never returned to the GREEN zone for the rest of the run. SIGKILLs began at 14:43 and continued for 32 minutes without recovering headroom.

Result: 4 pipelines launched, 0 deliverables. The dispatcher killed idle workers but couldn’t keep up — active captains kept spawning more agents faster than idle ones could be reclaimed. All sessions eventually died or stalled.

The successful run used the same dispatcher at 2-parallel instead of 4, with the YELLOW zone pausing new launches before RED was reached. Same 12 tasks, same questions, same specs: 12/12 delivered.

The memory tracker’s key insight: pause launches before the system reaches critical pressure, not after. Reactive killing (RED zone) is a last resort; proactive throttling (YELLOW zone) is the primary defense.


5. The Kill Hierarchy

With 8-12 concurrent agent processes per pipeline, and potentially multiple pipelines running simultaneously under the Dispatcher, the risk of one agent accidentally killing another is real. The kill hierarchy prevents this.

flowchart TD
    D["Dispatcher"] -->|can kill| C["Captain"]
    C -->|can kill| A["Core Agents<br/>(analyst, data, execution,<br/>auditor, writer, judge,<br/>improve, watchdog)"]

    D -.-x|BLOCKED| A
    A -.-x|BLOCKED| C
    A -.-x|BLOCKED| A
    C -.-x|BLOCKED| D

    style D fill:#e3f2fd,stroke:#1565C0
    style C fill:#fff3e0,stroke:#E65100
    style A fill:#c8e6c9,stroke:#2E7D32

Three invariants:

  1. No agent kills its own session. Self-kills are always blocked. An agent that tries to kill its own tmux session is confused; blocking it prevents accidental suicide.

  2. Kills flow downward only. Dispatcher can kill Captain sessions (because the Dispatcher spawned them). Captain can kill core agent sessions (because Captain spawned them). Core agents cannot kill Captain, cannot kill Dispatcher, and cannot kill each other.

  3. You can only kill what you created. The spawn registry is the source of truth. If a session is not in your spawn list, you cannot kill it. This means user-created sessions, other pipeline sessions, and any session from a different Dispatcher run are completely safe.

tmux kill-server is unconditionally blocked regardless of caller. One agent killing the tmux server would destroy every session on the machine – including other pipelines, the Dispatcher itself, and any user sessions.


6. Results

The latest evaluation ran 12 automated pipeline runs (4 task types x 3 rounds each) overnight. The Dispatcher launched at 16:47 on June 2 and the last run completed at 08:23 on June 3 – approximately 15.5 hours of wall-clock time for 12 complete pipelines.

What the monitoring stack caught

| Metric | Count | Impact |
|---|---|---|
| Watchdog stall detections | 21 | Flagged slow agents before they blocked the pipeline |
| Memory management interventions | 193 | Paused launches, throttled concurrency, prevented OOM kills |
| Checkpoint interactions (agent-to-agent) | 273 | Quality gates that caught methodology and data issues |

These three numbers measure three different intervention types. Watchdog stalls flagged unresponsive agents before they blocked downstream work. Memory interventions kept the system from running out of headroom. Checkpoint interactions enforced cross-agent quality gates between every dispatch. None of these signals existed before this iteration; every one represents a failure mode that previously went undetected. The ratio is also informative: checkpoint interactions (273) outnumbered memory interventions (193) by about 1.4x, and memory interventions outnumbered stall detections (21) by about 9x. Quality discussion fires constantly during normal operation, memory pressure fires periodically, and full agent stalls are the rare exception rather than the rule.

For context: our first attempt ran 4 pipelines in parallel without launch throttling. All 4 were lost to memory pressure within 46 minutes, with 0 deliverables produced. With the full monitoring stack at 2-parallel, every pipeline completed.

Pipeline Completion

| Metric | Prior Iteration | With Watchdog Stack | Delta |
|---|---|---|---|
| Pipeline completion | 100% (12/12) | 100% (12/12) | 0 |
| Captain deaths | 0 | 0 | 0 |
| Context exhaustion events | 0 | 0 | 0 |
| Failed runs | 0 | 0 | 0 |

Per-run completion times (from the Dispatcher log):

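| Run | Duration | Files | Charts |
|---|---|---|---|
| t1-r1 | 129 min | 39 | 9 |
| t1-r2 | 214 min | 29 | 10 |
| t1-r3 | 155 min | 42 | 9 |
| t2-r1 | 139 min | 32 | 8 |
| t2-r2 | 161 min | 25 | 7 |
| t2-r3 | 117 min | 30 | 8 |
| t3-r1 | 137 min | 36 | 13 |
| t3-r2 | 167 min | 38 | 13 |
| t3-r3 | 125 min | 28 | 13 |
| t4-r1 | 153 min | 35 | 10 |
| t4-r2 | 125 min | 37 | 10 |
| t4-r3 | 143 min | 22 | 9 |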

Average: 147 min per pipeline (range: 117-214 min). Total output: 445 files, 117 charts across 12 runs. The slowest run (t1-r2 at 214 min) was 1.83x the fastest (t2-r3 at 117 min), and 8 of the 12 runs fell within 20 minutes of the mean. This consistency itself is monitoring evidence: when a single pipeline ran significantly longer, it was because a real query was slow, not because an agent had stalled. Before the watchdog stack, an outlier-duration run was ambiguous: slow query or dead agent? With it, the dispatcher log shows zero stall escalations during the longest runs, confirming that the extra wall-clock time was productive work.
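
The duration figures can be reproduced directly from the per-run table. The snippet below recomputes the mean, the slowest-to-fastest ratio, and the number of runs within 20 minutes of the mean. The variable names are illustrative.

```python
from statistics import mean

# Durations in minutes, copied from the per-run table.
durations = {
    "t1-r1": 129, "t1-r2": 214, "t1-r3": 155,
    "t2-r1": 139, "t2-r2": 161, "t2-r3": 117,
    "t3-r1": 137, "t3-r2": 167, "t3-r3": 125,
    "t4-r1": 153, "t4-r2": 125, "t4-r3": 143,
}

values = list(durations.values())
avg = mean(values)                                  # about 147.1 minutes
ratio = max(values) / min(values)                   # slowest / fastest
near_mean = sum(1 for v in values if abs(v - avg) <= 20)

print(f"mean={avg:.0f} min, ratio={ratio:.2f}x, within 20 min of mean={near_mean}/12")
```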

Watchdog Activity

Watchdog was active across all 12 runs, logging 21 stall detections total. The tiered alert model worked as designed: most detections were TIER_1 (observed, no escalation needed — the agent was running a legitimately long query). TIER_2 alerts triggered Captain recovery actions. Three runs completed with zero stall signals, representing clean pipelines where every agent operated within expected time bounds — the monitoring overhead was present but invisible.

System-Wide Compliance

| Iteration | System-Wide Compliance | Architecture Added |
|---|---|---|
| Iteration 1 | ~65% | Harness baseline |
| Iteration 2 | ~71% | + Rule restructuring |
| Iteration 3 | ~77% | + Discussion architecture (post 3) |
| Iteration 4 | ~80% | + Captain-Dispatcher + routing guard (posts 4-5) |
| Iteration 5 | ~86% | + Watchdog + version BLOCK + AA watcher + memory tracker |

Each iteration added a substrate that the next iteration depended on:

flowchart LR
    I1["Iter 1: Harness baseline<br/>~65%"] --> I2["Iter 2: Rule restructuring<br/>(makes harness enforceable)<br/>~71%"]
    I2 --> I3["Iter 3: Discussion architecture<br/>(needs agents to survive long enough)<br/>~77%"]
    I3 --> I4["Iter 4: Captain-Dispatcher<br/>(needs RAM for independent processes)<br/>~80%"]
    I4 --> I5["Iter 5: Watchdog stack<br/>(keeps everything alive)<br/>~86%"]

    style I1 fill:#eceff1,stroke:#546E7A
    style I2 fill:#e3f2fd,stroke:#1565C0
    style I3 fill:#fff3e0,stroke:#E65100
    style I4 fill:#fce4ec,stroke:#C62828
    style I5 fill:#e8f5e9,stroke:#2E7D32

Compliance climbed by ~5pp per iteration on average. The latest +6pp gain came from a layer that does not write SQL, does not render charts, and does not produce any deliverable directly. It just keeps everything else alive long enough to do its work.

The +6pp gain in the latest iteration is distributed across multiple rules. The top improvements:

| Rule | Prior | With Watchdog Stack | Delta |
|---|---|---|---|
| Challenge Calibration | 67% | 100% | +33pp |
| Cross-Source Validation | 67% | 100% | +33pp |
| AA Cross-Validation | 50% | 83% | +33pp |
| Dispatch Markers | 25% | 100% | +75pp |
| Identifier in checkpoints | 50% | 100% | +50pp |
| Multi-auditor fanout | 58% | 92% | +34pp |

Every one of these six rules improved, and four of them reached 100%. The largest single gain (Dispatch Markers, +75pp) was on the rule most directly enforced by a new hook this iteration: agents must write a dispatch marker file before sending a task downstream, and the post-dispatch counter hook prompts Captain to verify it. Three rules tied for the smallest gain at +33pp; of those, only AA Cross-Validation stopped short of 100% (at 83%), because the underlying behavior is harder to enforce programmatically and still depends on agent judgment. The pattern is consistent with the enforcement-ladder principle from earlier posts: rules with a structural enforcement hook climb to 100% quickly; rules that rely on text-level reminders move more slowly even when the supporting infrastructure is in place.

These gains are not all attributable to the monitoring layer alone – spec improvements, version propagation enforcement, and AA watcher hooks all contributed. But the monitoring layer is the substrate that kept the pipeline alive long enough for every other layer to do its work. 100% pipeline completion is the precondition for every compliance number above it.

Memory Profile (Layer 4 in action)

The dispatcher log shows the memory tracker operating across 463 monitoring cycles:

  • Start: 57GB used, 166GB available (74%), 10 Claude processes
  • Steady state (2 concurrent runs): ~95-100GB used, 28-29 processes
  • Peak (agent spawn overlaps): 135GB used, 38 processes — Layer 4 YELLOW zone triggered, paused new launches
  • Post-completion cleanup: available memory recovered to 60%+ within 2 minutes of task completion

The zone distribution tells the operational story: 61% GREEN (normal), 37% YELLOW (launches paused — 171 times), 2% RED (11 idle workers killed). The YELLOW zone was the workhorse — by pausing launches proactively, it kept the system from entering RED during the review fan-out phases where sub-agent spawns spike memory usage.
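
The zone behavior described above can be pictured as a small classifier over the percentage of memory still available. This is a minimal sketch, not the Dispatcher's actual code: the two thresholds and all function names are hypothetical (the post does not state the real cut-offs), and the real tracker is multi-sensor rather than driven by a single number.

```python
from collections import Counter

# Hypothetical thresholds on percent of memory still available. The real
# cut-offs are not given in the post, and the real tracker uses several sensors.
YELLOW_BELOW_PCT = 45.0
RED_BELOW_PCT = 25.0

# Action the post associates with each zone.
ZONE_ACTIONS = {
    "GREEN": "launch_allowed",
    "YELLOW": "pause_launches",
    "RED": "kill_idle_workers",
}


def classify_zone(available_pct: float) -> str:
    """Map available-memory percentage to a GREEN/YELLOW/RED zone."""
    if available_pct < RED_BELOW_PCT:
        return "RED"
    if available_pct < YELLOW_BELOW_PCT:
        return "YELLOW"
    return "GREEN"


def zone_distribution(samples: list[float]) -> dict[str, float]:
    """Percentage of monitoring cycles spent in each zone."""
    counts = Counter(classify_zone(s) for s in samples)
    total = len(samples)
    return {z: 100.0 * counts.get(z, 0) / total for z in ZONE_ACTIONS}
```

With these assumed thresholds, the 74% available at start classifies as GREEN, and a cycle with roughly 40% available would classify as YELLOW and pause new launches.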


7. Next Frontiers

The monitoring stack catches the failure modes that previously caused pipeline hangs. Four areas are targets for the next iteration.

Frontier 1: Faster Zombie Detection

The current detection cadence is 5-minute polls with tiered escalation, giving a 10-15 minute detection window. A process-level liveness check on a faster cadence (every 60 seconds) could directly verify the Claude process inside each tmux session, reducing detection to under 2 minutes.
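
A process-level check of this kind might look like the sketch below. It is illustrative only: `PROCESS_DEAD` is a hypothetical alert name, the `process_name` default is an assumption about how the agent shows up in tmux (it may appear under a different command name in practice), and the command runner is injected so the logic can be exercised without a live tmux server.

```python
import subprocess
from typing import Callable, Sequence, Tuple

# A runner takes an argv list and returns (exit_code, stdout).
Runner = Callable[[Sequence[str]], Tuple[int, str]]


def run_command(argv: Sequence[str]) -> Tuple[int, str]:
    """Default runner: execute argv and capture stdout."""
    try:
        proc = subprocess.run(argv, capture_output=True, text=True, timeout=10)
    except (OSError, subprocess.TimeoutExpired):
        return 1, ""
    return proc.returncode, proc.stdout


def session_liveness(session: str, run: Runner = run_command,
                     process_name: str = "claude") -> str:
    """Return 'SESSION_DEAD', 'PROCESS_DEAD', or 'ALIVE' for one tmux session."""
    code, _ = run(["tmux", "has-session", "-t", session])
    if code != 0:
        return "SESSION_DEAD"
    # pane_current_command is the foreground command running in each pane.
    code, out = run(["tmux", "list-panes", "-t", session,
                     "-F", "#{pane_current_command}"])
    if code != 0:
        return "SESSION_DEAD"
    commands = [line.strip().lower() for line in out.splitlines() if line.strip()]
    if any(process_name in cmd for cmd in commands):
        return "ALIVE"
    # Session exists but the agent process is gone: a zombie session.
    return "PROCESS_DEAD"
```

Calling `session_liveness` for each session every 60 seconds would distinguish a dead session from a live session whose agent process has exited, which is the case the file-based stall signal only catches after escalation.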

Frontier 2: Sub-Agent Lifecycle Management

The AA output watcher tracks AA-type sub-agents. Extending this to all sub-agent types – with per-type timeouts and automated cleanup of completed sub-agents – would reduce memory pressure during the review fan-out phases where 10+ sub-agents may be active simultaneously.

Frontier 3: Mid-Pipeline Version Propagation

The version BLOCK mechanism forces agents to re-read specs when versions change. Scaling this to coordinated multi-agent version bumps during active pipeline phases – where 8 agents need to refresh simultaneously without disrupting checkpoint discussions – is the next challenge for harness-level enforcement.

Frontier 4: Message Delivery Acknowledgment

A delivery-acknowledgment protocol – where the recipient writes a receipt and the sender checks for it – would close the gap between “message sent” and “message processed,” catching delivery failures before the Watchdog’s poll-based detection.
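
One way to picture such a protocol is with receipt files, sketched below. Everything here is hypothetical: the `.ack` naming convention, the inbox directory, and the timeout values are assumptions chosen for illustration, not a description of an existing mechanism in the pipeline.

```python
import time
from pathlib import Path


def receipt_path(inbox: Path, message_id: str) -> Path:
    """Hypothetical convention: one receipt file per message id."""
    return inbox / f"{message_id}.ack"


def write_receipt(inbox: Path, message_id: str) -> None:
    """Recipient side: record that the message was processed."""
    inbox.mkdir(parents=True, exist_ok=True)
    receipt_path(inbox, message_id).write_text(str(time.time()))


def wait_for_receipt(inbox: Path, message_id: str,
                     timeout_s: float = 30.0, poll_s: float = 1.0) -> bool:
    """Sender side: poll for the receipt; False means delivery is unconfirmed."""
    deadline = time.monotonic() + timeout_s
    while True:
        if receipt_path(inbox, message_id).exists():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_s)
```

A sender that gets `False` back can resend or escalate immediately instead of waiting for the next Watchdog poll to notice the missing output.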


8. Conclusion

The first five posts in this series described how to make a multi-agent pipeline correct – accurate rules, substantive discussion, repeatable evaluation, full context windows. This post addresses the prerequisite: keeping the pipeline alive.

The monitoring stack has four layers, each catching what the others miss:

  1. Watchdog Agent – continuous 5-minute polls, tiered alerting, enriched diagnosis. Catches stalls, deadlocks, and dead sessions.
  2. Captain Self-Checks – 14 checks including WATCHDOG_ALIVE (monitoring the monitor), output-peeking to rescue lost messages, phase gates to prevent premature advancement.
  3. Harness Hooks – kill guard (prevents fratricide), version propagation BLOCK (ensures spec freshness), AA timeout watcher (catches sub-agent silent failures), spawn script (enforces architectural constraints).
  4. Dispatcher Memory Tracker – multi-sensor zone model (GREEN/YELLOW/RED) that throttles launches and reclaims idle workers. Enables safe parallel pipeline operation without OOM cascades.

The kill hierarchy – downward-only, spawn-registry-gated, no self-kills, kill-server unconditionally blocked – is the safety net that prevents the monitoring layer itself from causing damage.

The latest evaluation is the proof point: 12/12 runs completed, 0 failures, no captain deaths, no context exhaustion, system-wide compliance at ~86%. The monitoring layer did not make the pipeline correct – the other layers did that. The monitoring layer kept the pipeline alive long enough for those layers to work.

The general lesson for multi-agent systems: monitoring agents need monitoring too. A single dedicated monitor is itself a single point of failure. The defense is the same principle that runs through this entire series: layers with diverse failure modes. The Watchdog catches agent stalls. Captain catches a dead Watchdog. Hooks catch what both miss. No single layer is sufficient. The composition is.


Appendix

A. The Watchdog Polling Script Signals

Signal              | How Checked                                     | Alert Type
────────────────────|─────────────────────────────────────────────────|──────────────────────
Memory              | free -m, agent RSS                              | MEMORY_LOW / CRITICAL
Session alive       | tmux has-session -t {prefix}-{agent}            | SESSION_DEAD
Agent file output   | stat -c %Y on handoff/output files              | STALL / STALL_CRITICAL
Dispatch markers    | expected_max vs dispatched count                | STALL (overdue)
Checkpoint deadlock | APPROVE in checkpoint, no output >10m           | CHECKPOINT_DEADLOCK
Pipeline progress   | File/chart/framework/final/deliverable-doc scan | Health snapshot
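
The "Agent file output" row can be sketched in a few lines: read the file's modification time (the same value `stat -c %Y` reports) and compare its age against tiered thresholds. The threshold values and function names below are hypothetical, since the table names the alert types but not the cut-offs, and treating a missing file as "no alert" is a choice made for this sketch.

```python
import os
import time
from typing import Optional

# Hypothetical thresholds in seconds; not taken from the actual polling script.
STALL_AFTER_S = 10 * 60
STALL_CRITICAL_AFTER_S = 20 * 60


def classify_staleness(age_s: float) -> Optional[str]:
    """Map seconds since last write to an alert type, or None if fresh."""
    if age_s >= STALL_CRITICAL_AFTER_S:
        return "STALL_CRITICAL"
    if age_s >= STALL_AFTER_S:
        return "STALL"
    return None


def check_output_file(path: str, now: Optional[float] = None) -> Optional[str]:
    """Alert for one handoff/output file based on its modification time."""
    now = time.time() if now is None else now
    try:
        mtime = os.path.getmtime(path)  # same value as `stat -c %Y`
    except OSError:
        # File missing or unreadable: left to the other signals in this sketch.
        return None
    return classify_staleness(now - mtime)
```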

B. Captain’s 14 Self-Checks (Monitoring-Relevant Subset)

#  | Check                      | What It Catches
───|────────────────────────────|──────────────────────────────────────────────────────────────────
8  | Partition liveness (C3)    | Stale partitions that would silently corrupt data
13 | WATCHDOG_TIER3_OUTPUT_PEEK | Agent completed but lost its message (rescue instead of respawn)
14 | WATCHDOG_ALIVE             | Dead or silent Watchdog (monitoring the monitor)

C. Kill Hierarchy Rules

Rule 1: tmux kill-server → ALWAYS BLOCKED
Rule 2: Non-captain, non-dispatcher agent → ALL kills BLOCKED
Rule 3: Target = caller's own session → BLOCKED (no self-kills)
Rule 4: Target NOT in caller's spawn registry → BLOCKED
Rule 5: All rules pass → ALLOWED (kills flow downward through spawn tree)
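
The five rules above can be read as an ordered sequence of checks. The sketch below is one possible encoding, not the actual hook: the `Caller` type, role names, and function signature are hypothetical, and the session names follow the `{prefix}-{agent}` pattern from Appendix A only for illustration.

```python
from dataclasses import dataclass, field
from typing import Set, Tuple

# Roles allowed to issue kills at all (Rule 2); role names are illustrative.
KILLER_ROLES = {"captain", "dispatcher"}


@dataclass
class Caller:
    session: str                                    # caller's own tmux session
    role: str                                       # e.g. "captain", "analyst"
    spawned: Set[str] = field(default_factory=set)  # caller's spawn registry


def check_kill(caller: Caller, subcommand: str, target: str = "") -> Tuple[bool, str]:
    """Apply the five rules in order; return (allowed, reason)."""
    if subcommand == "kill-server":
        return False, "Rule 1: kill-server is always blocked"
    if caller.role not in KILLER_ROLES:
        return False, "Rule 2: only captain/dispatcher may kill"
    if target == caller.session:
        return False, "Rule 3: no self-kills"
    if target not in caller.spawned:
        return False, "Rule 4: target not in caller's spawn registry"
    return True, "Rule 5: downward kill allowed"
```

Because the registry only ever contains sessions the caller spawned, Rule 4 is what makes kills flow strictly downward: a Captain cannot reach a sibling run's agents or its own Dispatcher.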

D. Compliance Progression (Full Series)

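Iteration | Architecture                                             | System-Wide Compliance | Pipeline Completion
──────────|──────────────────────────────────────────────────────────|────────────────────────|────────────────────
1         | Harness only                                             | ~65%                   | 100%
2         | + Rule restructuring                                     | ~71%                   | 100%
3         | + Discussion architecture                                | ~77%                   | 96%
4         | + Captain-Dispatcher + routing guard                     | ~80%                   | 100%
5         | + Watchdog + memory tracker + version BLOCK + AA watcher | ~86%                   | 100%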

This is the sixth post in a series. The monitoring stack described here runs alongside the rule enforcement (post 2), discussion architecture (post 3), evaluation infrastructure (post 4), and full-context agent architecture (post 5). Each layer depends on the others. The monitoring layer keeps the pipeline alive; the other layers make it correct.