The current authoritative reconciliation · 30 May 2026

Documented vs. planned vs. live.
And what's actually broken.

This page supersedes Open Issues and Plan vs. Reality (now marked as historic / archived). It does two jobs in one place: (1) a side-by-side three-axis comparison of how the workflow is documented, how the build plan planned it, and how it is actually wired today; and (2) the hot-spot findings from three Opus 4.7 Extra-High reasoning sessions that traced the riskiest areas line by line.

5 blockers ✓ resolved · 9 majors ✓ resolved · 12 minors ✓ resolved · 3 deep traces by claude-opus-4.7-xhigh · Fix campaign 30 May 08:43 BST · Batch 6: 4 divergences + ISS-22 residual + 5 host entries + filename ✓ all resolved · Batch 7: 4 scouts built + K1/K2 substrate ✓ 1M validator round-2 PASS 13:35 BST

Orientation

How to read this

Documented

What the docs on this site (briefs, workflow.html, agents.html, logging.html) tell a reader the system does. The story we'd hand to a successor.

Planned

What Desktop\Delivery Workflow\plan.md & Historic-plan.md committed to building — the locked decisions D1–D8, batches 1–4, NEEDS-CHARLIE list.

Live

What the on-disk skills, SQL bundles, DAB config, and MCP topology actually implement right now — verified by the three hot-spot investigators against source files and live SP definitions.

Why this replaces the old two tabs. Open Issues was a 22-ticket worklist closed by the Batch 1–4 campaign; Plan vs. Reality was the original scope reconciliation. Both are still useful history, but neither is the authoritative current view — they don't reflect the divergences the deep traces below uncovered, especially where the live SPs and the role-brief contracts disagree.

The matrix

Three-axis comparison

One row per major subsystem. Each cell is what that source-of-truth says today. match = the three agree; drift = at least one diverges from the others; defect = the live build contradicts the documented or planned contract.

Subsystem | Documented (this site) | Planned (plan.md) | Live (on disk / in DB) | Status
Peer ring (3 long-lived) | CoS, SA, Steward spawned and live; cognitive separation enforced by briefs. | Build-queue step 5: Operator action; D-decisions assume the ring is up. | Three sessions live (CoS 644a98e7, SA ae40a761, Steward ecba0e5d) per plan.md. | match
Role briefs on disk | config.html names CHIEFOFSTAFF.md, SOLUTIONARCHITECT.md, STEWARD.md. | plan.md L91 wrote COS.md, SA.md, STEWARD.md. | All three briefs at consistent spelled-out filenames: CHIEFOFSTAFF.md, SOLUTIONARCHITECT.md, STEWARD.md. Workflow config + docs aligned (Unit I). Historic plan + hotspot reports preserve the original misspelling as evidence. | resolved
Rotation ceremony | workflow.html § lifecycle: 8 steps, archive-never-delete, rotator-spawns-successor. | plan.md D-decisions reference rotation_log/agent_snapshots wiring (D4 / ISS-08/09). | New generic rotate-role/SKILL.md (Unit A), workflow-agnostic, with Step 9b authored (detail-row-first ordering). Old rotate-clawpilot-role is now a deprecation shim pointing at it with workflow-name='loom'. | resolved
Steward brief-path resolution | workflow.html implies seamless handover. | plan.md doesn't specify file names. | Resolved via the haleon workflow.json role→file map (Unit A); STEWARD.md ceremony shrunk to call rotate-role (Unit B). No more file-not-found fall-through to ZOMBIE. | resolved
ADO write-back drainer | logging.html: lease-guarded drainer; CAS-guarded transitions via sp_update_ado_writeback; failed→queued retry capped at 5. | D3 + ISS-03 (resolved): CAS state machine, MAX_ATTEMPTS=5, gate-by-default at pending_approval. | Unit C rewrote sp_update_ado_writeback: CAS predicate added, failed→queued legal, MAX_ATTEMPTS=5 enforced in the SP, idempotent same-state, NULL-approved_by RAISE. New sp_reset_ado_writeback dead-letter rescue. Patch fix-c-ado-drainer.sql (sha256 a019d5e9…), Steward-applied. | resolved
Cost circuit-breaker (ISS-10 "breaker live" green) | logging.html: cost telemetry & ring-tick wired and verified. | plan.md NEEDS-CHARLIE #6: baseline $50/day ±20% → $60 ceiling; gate for live-enable. | Unit D: sp_check_cost_breaker now runs sp_rollup_cost_day(p_day_start, 2) first (fixes the cold-read), plus a new sp_check_cost_breaker_rolling; ring-tick Job B wires both breakers (OR'd trip flags). Verified arming. Patch fix-d-cost-breaker.sql (sha256 a9bd1a0f…), Steward-applied. | resolved
agent_runs audit spine | logging.html + briefs: multi-writer, two-phase, gap-scan on (session_id, session_turn_seq), corrections as appends. | D4 / ISS-06/15: required write per turn; Steward audits. | Unit E: partial UNIQUE on (session_id, session_turn_seq), class CHECK, created_by_agent NOT NULL, sp_create_agent_run RAISEs 23502 on NULL keys, new v_open_agent_runs watchdog view + Steward open-run scan. Patch fix-e-agent-runs.sql (sha256 0fde1851…), Steward-applied. | resolved
cost_telemetry rollup | logging.html: idempotent whole-UTC-day grid, derived from agent_runs. | ISS-07: sp_rollup_cost_telemetry aggregates by spawned_at; list-price rates accepted as permanent (NEEDS-CHARLIE #2 closed). | Unit E HS-10: sp_rollup_cost_telemetry now excludes class IN ('audit','correction') and any row a correction points at, with a pg_advisory_xact_lock; cost_telemetry rebaselined (8 cells). Late-arriving completions covered by the HS-5 rolling-2-day rollup in ring-tick (with Unit D's late-arriving cost fold-in). | resolved
DAB anonymous role | config.html: every base table read-only, writes via *In shims (ISS-22, Option A). | plan.md ISS-22: Option A applied to dab-config.json 30 May. | All 6 base tables now read-only over anonymous (Unit H: Brief, MeetingPrep, PdSyncPrep, QuietWatchState, Signal, SteercoPrep flipped + 6 REVOKEs); writes flow solely through the validating *In shims; authenticated=* remains the dormant production posture. | resolved
Softeria M365 MCP | config.html: removed; native m365_* in use. | plan.md ISS-21 (resolved): removed 30 May. | m365-softeria block removed from m-mcp-servers.json; takes effect on next Clawpilot reload. | match
ADO target (org / project) | logging.html: env-config via azure_devops MCP; pointer = smccormick0886/Haleon-AIAQ. | plan.md D8 / ISS-05: de-hardcoded; document as env-config. | Skill no longer hardcodes; MCP currently points at shakedown repo. | match
Scout agents (5 planned) | plan-vs-reality.html: Signal Scout proven; other four briefs documented (this site) per Batch 7. | plan.md Batch 7: Signal Scout proven Unit J; Backlog Sentinel / Pipeline Watcher / Reuse Scout / Intake Triage briefs built K3–K6. | Signal Scout proven end-to-end (Unit J, 30 May 2026; see prior row). Four K-scout briefs LIVE on disk: BACKLOG-SENTINEL.md (sha256 3f95b6ad…b8f2), PIPELINE-WATCHER.md (sha256 80d0ca69…b598), REUSE-SCOUT.md (sha256 95504fe2…9ec0), INTAKE-TRIAGE.md (sha256 4d9f939b…0db9bb). All four pending per-scout manual prove + Charlie's posture-flag flip (per-scout proven-flag DDL on scout_enable_flags is a deferred K-extension; today the flags are prose-reserved slots only). 1M validator round-2 PASS 30 May 13:35 BST. | partial
sp_create_signal idempotency (K1) | logging.html / ring-tick: scouts enqueue via SP, never NULL, never 23505. | plan.md Batch 7 Unit K1: ON CONFLICT wrap; signature byte-identical. | LIVE: fix-k1-signal-onconflict.sql (sha256 f9f310d8…dc261c3c) Steward-applied 30 May 12:24 BST. SP now wraps ON CONFLICT (source, source_external_id) WHERE source_external_id IS NOT NULL DO NOTHING internally via INSERT-RETURNING-UNION-ALL-existing CTE: returns existing id silently on conflict, never NULL, never bubbles 23505. NULL-extid signals still insert unconditionally (partial-index predicate excludes them). Audit row e88a7ad8-23f2-41b7-8514-a6b6dad75ba1. | resolved
agent_runs_class_check canonical enum (K2-verify) | logging.html + STEWARD.md § Audit model: 7-member class enum. | plan.md Batch 7 Unit K2-verify: 'scout_sweep' member confirmed live; HS-10 rollup-exclusion unchanged. | LIVE: members exactly {turn, rotation, snapshot, ado_drain, audit, correction, scout_sweep}. PATCH 2 of the prior wedged Fix-K substrate had already landed cleanly at 11:51 BST; K2-verify (`fix-k2-verify-scout-sweep-class.sql`, sha256 rev2 2280d7d3…1caf0) is read-only and confirms. STEWARD.md L51 canonical paragraph patched. Audit row c19459c6-46f3-445b-bcaf-b16aec9dd635. | resolved
Recurring ring-tick schedule | logging.html: staged, intentionally off pending three-greens posture call. | plan.md NEEDS-CHARLIE #5: live-enable gated on (a) breaker live (b) drain proven (c) Scout proven. | Schedule OFF. All three greens earned in DB-substrate terms: (a) cost breaker arms structurally (Unit D: both per-UTC-day and rolling-24h read $0/tripped=false live, 30 May); (b) drain proven manually (Unit C, Batch 5); (c) Signal Scout proven manually end-to-end (Unit J, 30 May: lease + watermark + breaker + idempotent enqueue + audit). Post K1: scouts call sp_create_signal idempotently with no pre-check / catch-23505 boilerplate. The 3-greens DB gate (sp_check_scout_enabled, Unit C) still reads enabled=false because Charlie has not flipped the three ScoutEnableFlagsIn booleans yet; that flip + enabling the recurring timer remains Charlie's posture call. | partial
Database write-shim | config.html: 12 *In views, 14 sp_* functions, 15 enums, INSTEAD OF triggers. | D5/D7 + ISS-18: shims for Adr + ComplianceState built; jsonb-as-text contract preserved. | 12 write-shaped views with INSERT triggers; INSTEAD OF UPDATE coverage on v_comms_draft_in / v_agent_run_in / v_brief_in (Unit H) and v_adr_in CAS-UPDATE (Unit G Div 1); v_signal_in carries explicit-deny UPDATE (signals append-only). | resolved
MCP topology after schema change | config.html: "one DAB shim + one dab.exe per session"; no global hot-reload. | plan.md NEEDS-CHARLIE #1: restart haleon-aiac-db MCP after Batch 7 substrate. | haleon-aiac-db respawned 30 May 10:42 BST. dab-config.json = 52 entities, including the 5 Batch-5 surfaces (AdoWritebackResetIn / ScoutEnableFlagsIn / ScoutEnabledCheckIn / CostBreakerRollingIn / OpenAgentRun) added by Unit F. Batch 7 added no new entities (K1 preserved sp_create_signal's signature byte-identically). MCP fully aligned. | resolved
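The K1 row in the matrix describes an idempotent enqueue: a conflict on (source, source_external_id) returns the existing id, while signals with a NULL external id always insert. A minimal in-memory sketch of that contract follows. It is not the live SP; `SignalStore` and `create_signal` are hypothetical names, and the dictionary stands in for the partial unique index.

```python
import itertools


class SignalStore:
    """In-memory stand-in for the signal table plus its partial unique index."""

    def __init__(self):
        self._ids = itertools.count(1)
        self.rows = {}      # id -> (source, source_external_id)
        self._by_key = {}   # models the partial index: non-NULL external ids only

    def create_signal(self, source, source_external_id):
        """Insert a signal, or return the existing id on a key conflict."""
        if source_external_id is not None:
            existing = self._by_key.get((source, source_external_id))
            if existing is not None:
                # Conflict: hand back the existing id, raise nothing.
                return existing
        new_id = next(self._ids)
        self.rows[new_id] = (source, source_external_id)
        if source_external_id is not None:
            self._by_key[(source, source_external_id)] = new_id
        return new_id
```

Under this reading a scout can call the function blindly on every sweep: a repeat of a keyed signal costs nothing, and unkeyed signals are never deduplicated.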

K-extensions backlog (deferred by design, not bugs). The Batch 7 build deliberately scoped out: per-handler watermark sub-keys for Reuse Scout (multi-source) and Intake Triage (multi-channel); per-scout scout_enable_flags.*_proven columns (today the four K-scouts have prose-reserved slot names only — Charlie may instead elect a single shared scout_proven flip covering all five scouts); Backlog Sentinel / Pipeline Watcher Stalled-N-day sweeps (not-since-watermark scan shapes don't fit the current watermark contract); an optional 'tick' class member on agent_runs_class_check; and the deferred autonomous-tier roles (Library Curator, Pod Intel aggregator, Customer Health aggregator). All documented in plan.md L233; each is a stand-alone K-extension patch when Charlie wants it.

Steward rejected-DDL register. The Steward maintains a tamper-evident register of DDL/SP proposals it has refused (STEWARD.md § Rejected DDL register). Current entries: #1: the sp_update_agent_run(p_detail_ref) extension (Fix-B Option A), rejected by Charlie 30 May 10:42 BST because it would re-open the spine-first / patch-detail_ref-later race that HS-4 closes (REJECTED-fix-b-rotate-sp.sql held do-not-apply). #2: the wedged prior Fix-K substrate PATCH 1 (4-defect drift: inserted into non-existent columns), superseded by K1's clean rewrite. Override protocol requires Charlie to acknowledge the trade-off in writing and authorise a one-off scope.

Three deep traces by Opus 4.7 Extra-High

Hot-spot findings

Each of the three areas below was traced end-to-end by an isolated claude-opus-4.7-xhigh session running on the actual files (and, where possible, the live SP source recovered from session-state event logs). The findings are not speculative: every defect cites a file + line / contract section, and every fix is the smallest concrete change that resolves it.

Headline — all closed. The five blockers, nine majors and twelve minors below were cleared by a five-unit fix campaign on 30 May (07:20–08:43 BST). Unit A built a generic, workflow-agnostic ceremony library (bootstrap-roles + rotate-role) and authored the missing Step 9b; Unit B made the rotation ceremony correct; Unit C rewrote the ADO drainer SP + skills; Unit D fixed the structurally-cold cost circuit-breaker; Unit E hardened the agent_runs audit spine and cost rollup. All DDL was applied by the Steward (Units C/D/E carry sha-pinned patch files), and every fix was walked end-to-end on a sentinel before clean-up. Fresh-peer validation is in progress (Validation Director session running). The one DDL deliberately not applied — Unit B's sp_update_agent_run p_detail_ref extension — was rejected by A+E+Steward consensus (it would re-open the audit race) and is held pending Charlie's ratification.

Post-fix divergences (Validation Director surface) — all closed in Batch 6. Session 7a652654 (opus-4.7-xhigh) surfaced 4 divergences; Unit G patched them (ADR CAS state-machine, session-turn-seq auto-allocator, Steward audit-model HS-4 tag, rotate-role Step 1 freeze). Unit H closed the ISS-22 write-shim residual (sp_update_brief + signal deny + 6 REVOKEs). Unit I normalized the brief filename. Unit F added 5 dab-config entries. All resolved.

Batch 7 build-out — closed. The four K-scout briefs (Backlog Sentinel, Pipeline Watcher, Reuse Scout, Intake Triage) were authored, K1 made sp_create_signal idempotent without pre-check boilerplate, K2-verify confirmed the canonical 7-member class enum was live. Round-1 1M validator surfaced 11 findings (6 MAJORs); all 6 closed in-session by ~13:35 BST. Round-2 diff-only validator (session a0ada821): PASS. Documentation reconciliation pass (this site) on 30 May ~14:00 BST.

Full raw reports: ~/.copilot/session-state/79bd2288-…/files/hotspot-rotation.md, hotspot-ado-drainer.md, hotspot-agent-runs.md. Spawned sessions remain in the sidebar for follow-up.

Hot spot 1

Rotation ceremony & brief paths · Steward

Traces the full rotation chain — STEWARD.md ceremony + rotate-clawpilot-role/SKILL.md + bootstrap-clawpilot-roles/SKILL.md + the ZOMBIE branch + the audit-trail writes. Investigator session db7dea9d-9e3b-4cbc-89cd-b4cf1e4e1c04.

✓ Resolved blocker D1 Brief-path mismatch — Haleon rotation is un-runnable today

STEWARD.md step 5 · m-role-briefs/ filenames · bootstrap/rotate skills

STEWARD.md step 5 (L124) resolves "CoS" → m-role-briefs\COS.md, "SA" → m-role-briefs\SA.md. Disk files are CHIEFOFSTAFF.md (misspelt) and SOLUTIONARCHITECT.md — the brief bodies still self-identify as COS.md / SA.md at line 1. Bootstrap + rotate skills target the loom-illustrator stack only (paths under repos\loom\illustrator\agents\clawpilot\<TARGET>.md, roles Architect/Operator/Steward) — they don't apply to Haleon at all. Worst-case execution path: file-not-found falls through to the ZOMBIE branch (STEWARD.md L131), which calls m_close_session({delete:true}) on a healthy peer.

Fix. Edit STEWARD.md step 5 to resolve "CoS" → CHIEFOFSTAFF.md, "SA" → SOLUTIONARCHITECT.md; add a "if brief read fails, HALT — do NOT fall to ZOMBIE" guard.

Shipped by Units A + B: the haleon workflow.json resolves CoS/SA to their real brief files and STEWARD.md calls the generic rotate-role skill with a no-fall-to-ZOMBIE guard. (The brief-filename misspelling was left as a cosmetic residual at this point; Unit I normalized the filename in Batch 6.)

✓ Resolved blocker D2 rotate skill is truncated — Step 9b is missing

~/.copilot/m-skills/rotate-clawpilot-role/SKILL.md

The skill's preamble (L60) promises "Step 2b (snapshot) and Step 9b (rotation_log + agent_runs spine)". The file is 253 lines; Step 9 ends at L253 with }) mid-Operator-update message — there is no Step 9b. The two DB writes that turn rotation_log + agent_runs into the system-of-record (ISS-08/15) have no executable spec — only the prose "writes a DB handover record" promise on the Open Issues page. Anyone running the skill literally will skip the writes entirely; rotations leave no DB trace.

Fix. Append Step 9b with two ordered MCP inserts: (1) RotationLogIn carrying snapshot_id from Step 2b → capture rotation_log_id; (2) AgentRunIn with class='rotation', detail_ref=rotation_log_id; plus a class='snapshot' audit row with detail_ref=snapshot_id.

Shipped by Unit A — Step 9b authored in the new rotate-role/SKILL.md (RotationLog/snapshot written first, then the agent_runs spine carrying detail_ref at INSERT).

✓ Resolved major D3 STEWARD.md ceremony violates the "No boot-turn race" rule

STEWARD.md steps 6 → 7 → 8 vs. config.html § Rotation governance L262–263

Step 6 spawns with initialMessage (boot turn begins). Step 7 immediately calls m_release_session({keepPendingTurn:true}) — the handle is dropped while the boot turn is still pending, and no m_await_session_response is ever inserted for the boot ack. Step 8 then sends "## Replacement readiness". The smoke-test message arrives queued behind the boot turn; the Steward awaits the first response, which is the boot ack (a ## Context health block), reads it as a failed readiness response, and step 10 fires — m_close_session on a healthy successor + Teams-pings Charlie "rotation aborted".

Fix. Between step 6 and step 7 insert m_await_session_response(<new>, timeoutMs: 600000), validate a green ## Context health ack, then release with keepPendingTurn:false.

Shipped by Unit B — STEWARD.md ceremony rewritten to await a green ## Context health ack before release; the boot-turn-race invariant is preserved.
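The race in D3 comes down to response ordering: the first response a freshly spawned session produces is its boot ack, not the answer to the smoke test. The toy model below illustrates that under stated assumptions. `FakeSession` is a hypothetical stand-in that answers messages strictly in queue order; it does not call the real m_* tools.

```python
from collections import deque


class FakeSession:
    """Toy peer: answers messages strictly in the order they were queued."""

    def __init__(self, boot_message):
        self._pending = deque([boot_message])

    def send(self, message):
        self._pending.append(message)

    def await_response(self):
        message = self._pending.popleft()
        if message == "boot":
            return "## Context health\ngreen"
        return "## Replacement readiness\nready"


def racy_ceremony(session):
    """Old order: send the smoke test without consuming the boot ack first."""
    session.send("smoke-test")
    first = session.await_response()  # this is actually the boot ack
    return first.startswith("## Replacement readiness")


def fixed_ceremony(session):
    """New order: await and validate the boot ack, then run the smoke test."""
    boot_ack = session.await_response()
    if not (boot_ack.startswith("## Context health") and "green" in boot_ack):
        return False  # halt: the successor never reported healthy
    session.send("smoke-test")
    return session.await_response().startswith("## Replacement readiness")
```

In this model `racy_ceremony` reports a failed readiness check on a perfectly healthy successor, which is the false abort the card describes; `fixed_ceremony` passes on the same session.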

✓ Resolved major D4 Orphan-turn race in rotate skill step 4 → 5

rotate-clawpilot-role/SKILL.md steps 4–5 vs. plan.md L129–130 ORPHAN-TURN LESSON

Step 4 does m_release_session({keepPendingTurn:false}); step 5 immediately m_send_to_session with the heavy readiness probe — exactly the anti-pattern plan.md L129–130 warned about ("Re-messaging a just-RELEASED session handle caused an adoption race… awaits returned status:timeout forever"). The lesson was learned by the Operator on the orchestrator but never folded back into the rotate skill or STEWARD.md. Same downstream as D3: false smoke-test timeout → healthy-successor kill.

Fix. Insert a 1-line status-probe send-and-await between step 4 release and step 5 readiness probe (e.g. m_send_to_session({sessionName, message:"ping — confirm adopted"}) + m_await_session_response) to cleanly re-adopt the handle before the heavy probe.

Shipped by Units A + B — the status-probe re-adopt is folded into the new rotate-role skill between the release and the readiness probe.

✓ Resolved major D5 ZOMBIE path violates "metadata-only" + lacks the Charlie-ack guard for delete:true

STEWARD.md L131 vs. STEWARD.md L19/L42/L186 + config.html § Rotation governance L258–259

The ZOMBIE branch says: "synthesize state from last 20 turns via m_get_session_transcript" and on success "m_close_session({delete:true}) (delete the zombie, Charlie accepted the risk)". The brief's own hard rules contradict both halves — reading 20 turns of a peer's transcript IS reading content ("metadata only", "privacy of peer thought process is a feature"), and delete:true is documented as requiring an explicit Charlie ack quoted in the ledger. "Charlie accepted the risk" is asserted, not requested.

Fix. Require an inline m_ask_user Charlie-ack (quoting the zombie's sessionId + lastActivityAt) before the transcript read and the delete:true close; or downgrade to delete:false (archive) and surface post-facto.

Shipped by Unit B — rotate-role now defaults to cold-spawn + archive (no transcript read, no delete); the escalated transcript-read / delete:true path requires an explicit Charlie m_ask_user ack quoting the zombie's sessionId + lastActivityAt.

✓ Resolved major D6 Steward self-rotation has no defined procedure; "rotator spawns successor" invariant is unreconciled

STEWARD.md L110 · rotate-clawpilot-role L16 / L212 · config.html L240–242

The only mention of self-rotation is a one-line do-not ("Never rotate yourself without explicit Charlie ack"). No procedure for who spawns the successor — the rotator-spawns-successor invariant is circular when rotator equals rotatee. The rotate skill (L212) references "Steward (self-rotation case): broadcast confirmation, ask Operator to update the ledger", implying either the outgoing Steward spawned the successor (invariant break) or the Operator did (undocumented). config.html L240–242 confirms two Steward self-rotations have actually happened this engagement — this is a live gap, not hypothetical.

Fix. Add a one-paragraph "Steward self-rotation" branch to STEWARD.md naming Operator as the rotator-of-record (so the invariant holds), with the exact Charlie-ack message format required before the Operator-side spawn.

Shipped by Unit B — STEWARD.md gains a Steward-self-rotation branch naming the Operator as rotator-of-record; haleon workflow.json adds selfRotationRotator:"operator" + a documented Operator role (spawnAtBootstrap:false).

✓ Resolved minor D7 Three-write audit chain is non-atomic; partial failure leaves silently-dangling refs

Rotate Step 2b inserts agent_snapshots → Step 9b inserts rotation_log → agent_runs with detail_ref (no FK by design). Three independent MCP calls, no transactional wrapper. If 9b's agent_runs insert fails after rotation_log succeeded, the rotation_log row is invisible to the gap-scan; only the Steward's post-hoc cross-ref audit catches it.

Fix. Write agent_runs first (append-for-identity per the two-phase contract), then rotation_log referencing the snapshot, then sp_update_agent_run filling detail_ref. Any failure leaves a visible gap rather than a silent dangler.

Shipped by Unit A — but with the opposite encoding: Step 9b is detail-row-first (rotation_log/snapshot written first, then the agent_runs spine carrying detail_ref at INSERT; completion-fill never touches detail_ref). This matches the audit-spine report's HS-4 recommendation and was chosen over this card's spine-first suggestion — see the reconciliation note below.

✓ Resolved minor D8 Session-name reuse window during rotation

STEWARD.md step 6 spawns the successor with the same sessionName while the predecessor is still alive (closed only at step 9). Steps 7–8 address by sessionName. Two sessions briefly share a name; if the host resolves by spawn-order, smoke-test or broadcast can land on the wrong session.

Fix. From step 6 onward, address every peer by sessionId, never sessionName, until step 9 closes the predecessor.

Shipped by Unit B — from spawn (step 6) onward all peer addressing is by sessionId, never sessionName.

Reconciliation note — audit-chain ordering (D7 vs HS-4). The two hot-spot reports disagreed on how to make the rotation audit-chain crash-safe. This card (hotspot-rotation.md D7) recommended a spine-first encoding — write agent_runs first, then back-fill detail_ref via sp_update_agent_run. The audit-spine report (hotspot-agent-runs.md HS-4) recommended the detail-row-first encoding — write rotation_log/snapshot first, then the agent_runs spine carrying detail_ref at INSERT. Unit A shipped detail-row-first (HS-4). Consequently the sp_update_agent_run p_detail_ref back-fill extension that the spine-first path would have needed was rejected by A + E + Steward consensus (it would re-open the very write-ordering race) and held DO-NOT-APPLY at fix-b-rotate-sp.sql — pending Charlie's ratification.
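The difference between the two encodings can be made concrete with a small failure-injection sketch. It assumes a two-table toy store and hypothetical function names (`detail_row_first`, `spine_first`, `attempt`); the real writes are MCP calls, not dictionary appends.

```python
class WriteFailed(Exception):
    """Raised by the sketch to simulate an MCP insert failing."""


def detail_row_first(db, fail_at=None):
    """HS-4 order: detail row first, then the spine row carrying detail_ref."""
    if fail_at == 1:
        raise WriteFailed("rotation_log insert failed")
    db["rotation_log"].append({"id": "rl-1"})
    if fail_at == 2:
        raise WriteFailed("agent_runs insert failed")
    db["agent_runs"].append({"class": "rotation", "detail_ref": "rl-1"})


def spine_first(db, fail_at=None):
    """D7 order: spine row first, detail_ref back-filled by a later update."""
    if fail_at == 1:
        raise WriteFailed("agent_runs insert failed")
    db["agent_runs"].append({"class": "rotation", "detail_ref": None})
    if fail_at == 2:
        raise WriteFailed("rotation_log insert failed")
    db["rotation_log"].append({"id": "rl-1"})
    if fail_at == 3:
        raise WriteFailed("detail_ref back-fill failed")
    db["agent_runs"][-1]["detail_ref"] = "rl-1"


def attempt(order, fail_at=None):
    """Run one ordering against a fresh store, swallowing the injected failure."""
    db = {"rotation_log": [], "agent_runs": []}
    try:
        order(db, fail_at)
    except WriteFailed:
        pass
    return db


def spine_rows_missing_ref(db):
    return [row for row in db["agent_runs"] if row["detail_ref"] is None]
```

In this model a mid-chain failure under spine-first leaves a spine row whose detail_ref is NULL until some later update lands, whereas detail-row-first never exposes such a row; its failure mode is an orphan rotation_log row with no spine entry, which is the case the Steward's cross-ref audit is described as catching.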

Hot spot 2

ADO drainer & lease races · CoS

Traces the live sp_update_ado_writeback CAS state machine (recovered from the session-state event log) against the SA-brief contract, plus ado-scribe, ado-bulk-triage, and ring-tick orchestration. Investigator session bed203c0-a08d-4576-afd9-c0e731ec6cd5.

✓ Resolved blocker B1 sp_update_ado_writeback is NOT CAS-guarded (contract violation)

SOLUTIONARCHITECT.md L51 promises CAS · live SP source from session-state events.jsonl

The SA brief promises every transition is guarded: UPDATE … WHERE id=p_id AND status=<expected_current_status> → "no row" on lost race. The live SP does read-then-check-then-write: SELECT status INTO v_current_status … → IF-ladder of legality → UPDATE … WHERE id=p_id with no status predicate. Two drainers racing both pass the ladder, both UPDATE, both write to ADO. The Batch-2 9-step harness was single-threaded; it never exercised concurrency.

Fix. Add AND status = v_current_status to the UPDATE's WHERE. Drop the IF-ladder's RAISEs in favour of returning no-row; trigger swallows no-row as "claim lost". (Alternative: SELECT … FOR UPDATE on the read.)

Shipped by Unit C — CAS predicate added to the UPDATE … WHERE; concurrent drainers can no longer double-write. Patch fix-c-ado-drainer.sql (sha256 a019d5e9…), Steward-applied.
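The compare-and-swap guard is easiest to see as an UPDATE whose WHERE clause repeats the status the caller believes is current, with the affected-row count deciding who won. The sketch below uses SQLite from the Python standard library purely for illustration; the live SP is PostgreSQL, and the one-column table and the `cas_transition` helper are assumptions, not the real schema.

```python
import sqlite3


def make_db():
    """One-row toy table standing in for the ADO write-back queue."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE ado_writeback (id INTEGER PRIMARY KEY, status TEXT NOT NULL)"
    )
    conn.execute("INSERT INTO ado_writeback (id, status) VALUES (1, 'queued')")
    return conn


def cas_transition(conn, row_id, expected, target):
    """Move a row to `target` only if it is still in `expected`.

    Returns True when this caller won the claim, False when the row had
    already moved on (the "no row" outcome the SA brief promises).
    """
    cur = conn.execute(
        "UPDATE ado_writeback SET status = ? WHERE id = ? AND status = ?",
        (target, row_id, expected),
    )
    return cur.rowcount == 1
```

Two drainers that both read 'queued' can both call `cas_transition(conn, 1, "queued", "in_flight")`; only the first call changes a row, so only one of them goes on to write to ADO.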

✓ Resolved blocker B2 Drainer arc failed → queued is forbidden by the live SP

ado-scribe/SKILL.md step 2 + SA brief L53 + ISS-03 issue text vs. live SP

The SKILL says "Transient AND attempts < 5 → sp_update failed → queued (re-queue; does not increment attempts)." Live SP: IF v_current_status = 'failed' AND v_target_status NOT IN ('in_flight') THEN RAISE 'Illegal transition: failed can only retry to in_flight …'. Every transient-retry call the drainer is told to make throws.

Fix. Replace NOT IN ('in_flight') with NOT IN ('queued') in that guard. The existing CASE WHEN v_target_status='in_flight' THEN v_new_attempts ELSE attempts END already correctly avoids incrementing attempts on failed → queued.

Shipped by Unit C — failed → queued is now a legal retry transition (attempts not incremented).

✓ Resolved blocker B3 Cost circuit-breaker is structurally cold (never trips during the live day)

ring-tick/SKILL.md Job B step 1 + core-tick step 5 · sp_check_cost_breaker

The breaker reads today's cost_telemetry row. But core-tick step 5 only rolls "the most recent UTC day older than the 2h settling window" — today's row is never produced. So the breaker always reads $0, always green, always lets scout-sweep spawn. The ISS-10 live-enable "breaker live" green is structurally unattainable.

Fix. Either have sp_check_cost_breaker call sp_rollup_cost_day(p_day_start, 2) first (idempotent, leaves today un-finalised), or add Job B step 0 = sp_rollup_cost_day(today) before the breaker check.

Shipped by Unit D — sp_check_cost_breaker now runs sp_rollup_cost_day(p_day_start, 2) first so today's cell exists; the breaker arms structurally (verified green at $0, engagement well under $50/day). Patch fix-d-cost-breaker.sql (sha256 a9bd1a0f…), Steward-applied.
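The cold-read problem can be reproduced in a few lines: a breaker that only reads a pre-computed cell sees $0 whenever that cell has not been rolled up yet. The sketch below is a toy model with hypothetical names (`check_breaker`, `rollup_day`, `check_breaker_rollup_first`); it assumes the breaker trips when the day's cost exceeds the $60 ceiling quoted in the matrix.

```python
CEILING_USD = 60.0  # ceiling quoted in the text; the trip rule (>) is an assumption


def check_breaker(cost_telemetry, day, ceiling=CEILING_USD):
    """Read the day's rolled-up cell; a missing cell silently reads as $0."""
    return cost_telemetry.get(day, 0.0) > ceiling


def rollup_day(agent_runs, cost_telemetry, day):
    """Idempotently recompute one day's cell from the run rows."""
    cost_telemetry[day] = sum(cost for run_day, cost in agent_runs if run_day == day)


def check_breaker_rollup_first(agent_runs, cost_telemetry, day, ceiling=CEILING_USD):
    """Fixed shape: roll the day up first so the cell exists, then check."""
    rollup_day(agent_runs, cost_telemetry, day)
    return check_breaker(cost_telemetry, day, ceiling)
```

With $75 of same-day spend and an empty telemetry grid, `check_breaker` stays green while `check_breaker_rollup_first` trips, which is the behavioural difference the fix is meant to produce.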

✓ Resolved major M1 MAX_ATTEMPTS=5 cap lives only in the SKILL prose, not the SP

The SP increments attempts on every *→in_flight with no cap. A drainer with a counting bug — or any future scheduler that bypasses the SKILL — would retry indefinitely. Fix: in the SP, IF v_target_status='in_flight' AND v_new_attempts >= 5 THEN RAISE 'max attempts exceeded'.

Shipped by Unit C — MAX_ATTEMPTS=5 is now enforced inside the SP.
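The retry arc from B2 and the cap from M1 can be sketched together as a small transition function. Only pending_approval → queued, the attempts increment on entering in_flight, and failed → queued come from the text; the rest of the `LEGAL` table is illustrative. The cap is read here as "at most five in_flight attempts", which is an assumption chosen to match M3's description of a row that reaches failed with attempts=5.

```python
MAX_ATTEMPTS = 5

# Assumed transition table; see the lead-in for which arcs the text states.
LEGAL = {
    ("pending_approval", "queued"),
    ("queued", "in_flight"),
    ("in_flight", "committed"),
    ("in_flight", "failed"),
    ("failed", "queued"),  # transient retry: legal, does not count as an attempt
}


def next_state(status, attempts, target):
    """Return (new_status, new_attempts) or raise on an illegal move."""
    if status == target:
        return status, attempts  # idempotent same-state no-op
    if (status, target) not in LEGAL:
        raise ValueError(f"illegal transition: {status} -> {target}")
    if target == "in_flight":
        attempts += 1  # only entering in_flight consumes an attempt
        if attempts > MAX_ATTEMPTS:
            raise ValueError("max attempts exceeded")
    return target, attempts
```

Enforcing the cap inside the transition function, rather than in the caller, is the point of M1: a drainer with a counting bug still cannot start a sixth attempt.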

✓ Resolved major M2 ado-scribe SKILL is silent on abort-on-lost-renewal

ring-tick step 2 is strict ("if a renewal re-acquire returns NO ROW, ABORT IMMEDIATELY"). ado-scribe step 1 only says "renew mid-long-drain by re-acquiring" — no abort clause, no cadence. With TTL=120s and up to 10 ADO calls per tick, holder A's lease can silently expire, B can steal, and A keeps writing. Combined with B1, this is the realistic double-write path. Fix: port the ring-tick clause verbatim into ado-scribe step 1 and specify renewal interval (TTL/2 ≈ 60s).

Shipped by Unit C — ado-scribe step 1 gains the lease abort-on-renewal clause.
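A rough simulation of the abort-on-lost-renewal rule, assuming a single-row lease with the 120 s TTL and a TTL/2 renewal cadence mentioned above. `LeaseTable`, `drain`, and the `on_tick` hook are hypothetical; the hook exists only so a competing drainer can be simulated.

```python
class LeaseTable:
    """Single-row lease: one holder at a time, stealable once expired."""

    def __init__(self):
        self.holder = None
        self.expires_at = 0.0

    def acquire(self, holder, now, ttl=120.0):
        if not holder:
            raise ValueError("lease holder must not be NULL/empty")
        if self.holder is None or self.holder == holder or now >= self.expires_at:
            self.holder = holder
            self.expires_at = now + ttl
            return True
        return False  # the "no row" outcome: someone else holds a live lease


def drain(lease, holder, op_durations, on_tick=None, ttl=120.0):
    """Run ops in order, renewing every ttl/2; abort as soon as a renewal fails."""
    now = 0.0
    if not lease.acquire(holder, now, ttl):
        return []
    last_renew = now
    done = []
    for index, duration in enumerate(op_durations):
        if now - last_renew >= ttl / 2:
            if not lease.acquire(holder, now, ttl):
                return done  # lease lost: stop immediately, write nothing more
            last_renew = now
        now += duration  # the ADO call itself
        if on_tick is not None:
            on_tick(now)  # lets a caller simulate a competing drainer
        done.append(index)
    return done
```

Note that the renewal check runs before each op, so a single call that outlasts the TTL can still complete after the lease has expired; that residual window is why the card pairs this clause with the CAS guard from B1.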

✓ Resolved major M3 Dead-lettered rows are permanently stranded

sp_create_ado_writeback uses ON CONFLICT (idempotency_key) DO NOTHING. Once a row hits failed with attempts=5, next Friday's triage re-emits the same recommendation → same key → silently dropped → the change is lost. No reset path. Fix: add an admin SP sp_reset_ado_writeback(p_id, p_actor) doing failed → pending_approval + attempts:=0; or change triage to detect existing failed rows and surface to Charlie.

Shipped by Unit C — new sp_reset_ado_writeback(p_id, p_actor) dead-letter rescue + v_ado_writeback_reset_in shim.
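The stranding is a direct consequence of do-nothing-on-conflict semantics, as the toy queue below illustrates. `WritebackQueue` and its methods are hypothetical stand-ins for sp_create_ado_writeback and sp_reset_ado_writeback; the reset behaviour (failed → pending_approval, attempts back to 0) follows the fix text above.

```python
class WritebackQueue:
    """Toy queue keyed by idempotency_key."""

    def __init__(self):
        self.rows = {}

    def create(self, key, payload):
        """Mimics ON CONFLICT (idempotency_key) DO NOTHING: repeats are dropped."""
        if key in self.rows:
            return False
        self.rows[key] = {"status": "pending_approval", "attempts": 0, "payload": payload}
        return True

    def reset(self, key, actor):
        """Dead-letter rescue: failed -> pending_approval with attempts zeroed."""
        if not actor:
            raise ValueError("actor is required")
        row = self.rows[key]
        if row["status"] != "failed":
            raise ValueError("only failed rows can be reset")
        row.update(status="pending_approval", attempts=0, reset_by=actor)
```

Once a row is dead-lettered, re-emitting the same recommendation returns False and changes nothing; only the explicit reset brings the row back to the approval gate.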

✓ Resolved major M4 Reconciler false-commit on field-set ops

SKILL step 3 says field-set ops are committed if ADO already holds the intended value. But any external setter (Charlie editing by hand, another bot, an earlier triage) puts the field at the intended value → reconciler marks committed though we never wrote it. The audit trail lies. Fix: require an HTML <!-- idk:KEY --> marker in a System.History entry for every op, not only add-comment ops; or require the reconciler to check work-item revision history for our identity having authored the change.

Shipped by Unit C — ado-scribe step 3 now requires the <!-- idk:KEY --> marker for every op, not only add-comment ops.
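A sketch of the marker-based reconciliation rule, assuming the work item's history is available as a list of HTML strings. `marker_for`, `we_wrote_it`, and `reconcile` are hypothetical helpers; returning "orphaned" for the unproven case follows the wording of M5 and is not confirmed skill behaviour.

```python
def marker_for(key):
    """HTML comment marker carrying the idempotency key."""
    return f"<!-- idk:{key} -->"


def we_wrote_it(history_entries, key):
    """True only if some history entry carries our marker for this key."""
    marker = marker_for(key)
    return any(marker in entry for entry in history_entries)


def reconcile(field_value, intended, history_entries, key):
    """Mark committed only when the value matches AND our marker proves authorship."""
    if field_value == intended and we_wrote_it(history_entries, key):
        return "committed"
    return "orphaned"
```

Under this rule a field that already holds the intended value because someone edited it by hand is no longer recorded as our commit.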

✓ Resolved major M5 Reconciler stamps last_error='orphaned' — un-classifiable downstream

Step 2's classifier keys on 4xx/5xx/timeout substrings. "orphaned" matches no class → drainer behaviour is undefined (likely treated as permanent → silent dead-letter without a retry). Fix: add "orphaned" to the transient set in step 2 — a crashed-mid-flight deserves one re-queue.

Shipped by Unit C — ado-scribe step 2 classifies reconciler-orphans as transient (one re-queue) with a reset path; failure-modes now distinguish lease-lost / entity-not-found / CAS-lost / max-attempts.
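A minimal classifier sketch showing where "orphaned" slots in. The text only says the classifier keys on 4xx/5xx/timeout substrings; treating 5xx and timeout as transient and 4xx as permanent is an assumption made for illustration, as is the explicit "unclassified" fallback.

```python
# Assumed split between transient and permanent markers; see the lead-in.
TRANSIENT_MARKERS = ("5xx", "timeout", "orphaned")
PERMANENT_MARKERS = ("4xx",)


def classify(last_error):
    """Map a last_error string to a retry class."""
    text = (last_error or "").lower()
    if any(marker in text for marker in TRANSIENT_MARKERS):
        return "transient"
    if any(marker in text for marker in PERMANENT_MARKERS):
        return "permanent"
    return "unclassified"  # surfaced explicitly instead of guessing
```

Returning a distinct "unclassified" value keeps an unknown error string from being silently treated as permanent, which was the failure mode the card describes.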

Minor findings (7)
  • m1: Double-approval raises instead of being an idempotent no-op. Fix: at SP start, IF v_current = v_target THEN RETURN p_id.
  • m2: approved_by silently defaults to 'unknown' via COALESCE. Fix: RAISE if NULL on pending_approval → queued.
  • m3: drainer_lease (and tick_lease) accept NULL holder. Fix: RAISE if NULL/empty.
  • m4: last_error not cleared on failed → in_flight retry. Hygiene only.
  • m5: Pre-MCP-restart silent-failure: lease acquire fails at MCP boundary → SKILL maps to "another drainer holds it, abort cleanly" → misdiagnosed.
  • m6: Scout-sweep "3-greens gate" is convention-only prose; no programmatic check. Caged by other caps so contained.
  • m7: UTC-day boundary on cost breaker: UK evenings split $40+$40 across two UTC days; neither trips the $60 ceiling. Fix (post-B3): rolling 24h check.

✓ Shipped — Unit C: m1 (idempotent same-state RETURN), m2 (RAISE on NULL approved_by), m3 (RAISE on NULL/empty lease holder), m4 (clear last_error on retry), m5 (failure-modes now distinguish lease-lost / entity-not-found / CAS-lost / max-attempts), m6 (new scout_enable_flags + sp_check_scout_enabled() programmatic 3-greens gate). Unit D: m7 (new sp_check_cost_breaker_rolling rolling-window check).
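The m7 boundary case works out as simple arithmetic: $40 late on one UTC day plus $40 early on the next never exceeds $60 in either day's cell, but sums to $80 inside a 24-hour window. The sketch below shows both checks side by side; the timestamps, helper names, and the "trips when strictly above the ceiling" rule are assumptions.

```python
from datetime import datetime, timedelta, timezone

CEILING_USD = 60.0  # ceiling quoted in the text


def utc_day_total(spend, day):
    """Sum spend whose timestamp falls on the given UTC calendar day."""
    return sum(cost for ts, cost in spend if ts.astimezone(timezone.utc).date() == day)


def rolling_24h_total(spend, now):
    """Sum spend in the 24 hours up to and including `now`."""
    start = now - timedelta(hours=24)
    return sum(cost for ts, cost in spend if start < ts <= now)


# Illustrative evening session straddling midnight UTC.
SPEND = [
    (datetime(2026, 5, 29, 22, 30, tzinfo=timezone.utc), 40.0),
    (datetime(2026, 5, 30, 0, 30, tzinfo=timezone.utc), 40.0),
]
NOW = datetime(2026, 5, 30, 1, 0, tzinfo=timezone.utc)
```

Here each UTC-day total is $40 (no trip), while the rolling total is $80 (trip), which is why the two breakers are OR'd rather than one replacing the other.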

Hot spot 3

agent_runs audit spine · Steward

Traces the multi-writer two-phase audit spine and its derived cost_telemetry rollup against the contracts in STEWARD.md and SOLUTIONARCHITECT.md, with reference to haleon-batch3-d4.sql, haleon-sp-wrappers.sql, and the v_agent_run_in triggers. Investigator session b99cc05d-8a37-4c95-9edb-661db25dc451.

✓ Resolved blocker HS-10 Correction rows don't suppress the bad row in the cost rollup

STEWARD.md L67 append-only contract · haleon-batch3-d4.sql L354–392

The contract is: bad row stays for tamper-evidence; a class='correction' row points at its id via detail_ref. But sp_rollup_cost_telemetry aggregates WHERE spawned_at ∈ window GROUP BY agent_name with no class filter. The bad row's bogus tokens_in/out still count; the correction itself adds spawn_count and possibly more tokens. Cost is over-counted by at least the bad row's tokens.

Fix. Add to the rollup WHERE: AND ar.class NOT IN ('audit','correction') AND NOT EXISTS (SELECT 1 FROM haleon.agent_runs c WHERE c.class='correction' AND c.detail_ref = ar.id). Re-run affected days; ON CONFLICT DO UPDATE makes it idempotent.

Shipped by Unit E — the class-aware exclusion was added to sp_rollup_cost_telemetry (with a pg_advisory_xact_lock, HS-6) and cost_telemetry was rebaselined (8 cells). Patch fix-e-agent-runs.sql (sha256 0fde1851…), Steward-applied.
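The exclusion rule can be expressed over plain rows as follows. This is a sketch of the filter's logic only, with rows as dictionaries and a single `tokens` field standing in for tokens_in/out; the live implementation is the SQL predicate quoted in the fix.

```python
EXCLUDED_CLASSES = ("audit", "correction")


def billable_rows(rows):
    """Drop audit/correction rows and any row a correction points at."""
    corrected_ids = {r["detail_ref"] for r in rows if r["class"] == "correction"}
    return [
        r for r in rows
        if r["class"] not in EXCLUDED_CLASSES and r["id"] not in corrected_ids
    ]


def rollup(rows):
    """Total tokens per agent over the billable rows only."""
    totals = {}
    for r in billable_rows(rows):
        totals[r["agent_name"]] = totals.get(r["agent_name"], 0) + r["tokens"]
    return totals
```

The bad row stays in the table for tamper-evidence; it simply stops contributing to the derived cost.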

✓ Resolved major HS-1 session_turn_seq is client-supplied with no UNIQUE — silent duplicate-claim defeats gap-scan

sp_create_agent_run takes p_session_turn_seq integer DEFAULT NULL — no allocator, no UNIQUE index, no trigger. Two concurrent skill invocations on the same session both read MAX(seq)+1 = N and both INSERT N. The sequence 1,2,3,N,N,5 looks contiguous to gap-scan; one writer's row is hidden. NULL is also accepted, silently defeating the whole audit story without any error.

Fix. CREATE UNIQUE INDEX agent_runs_session_seq_uniq ON haleon.agent_runs(session_id, session_turn_seq) WHERE session_id IS NOT NULL AND session_turn_seq IS NOT NULL; + RAISE if either is NULL in sp_create_agent_run. Concurrent duplicates now error loudly; loser retries with MAX+1.

Shipped by Unit E — partial UNIQUE on (session_id, session_turn_seq) added; sp_create_agent_run now RAISEs 23502 on NULL session_id / session_turn_seq / created_by_agent (also closes HS-3 with created_by_agent NOT NULL).
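
The "error loudly, loser retries with MAX+1" behaviour can be exercised locally. The sketch below uses Python's built-in sqlite3 module as a stand-in for PostgreSQL, which is an assumption made purely so the example runs anywhere; the table is reduced to three columns and `claim_next_seq` is a hypothetical helper, not part of sp_create_agent_run.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE agent_runs ("
    "id INTEGER PRIMARY KEY, session_id TEXT, session_turn_seq INTEGER)"
)
# Same shape as the partial UNIQUE index described in the fix.
conn.execute(
    "CREATE UNIQUE INDEX agent_runs_session_seq_uniq "
    "ON agent_runs(session_id, session_turn_seq) "
    "WHERE session_id IS NOT NULL AND session_turn_seq IS NOT NULL"
)


def claim_next_seq(conn, session_id, max_attempts=5):
    """Read MAX+1 and insert; on a duplicate claim, re-read and retry."""
    for _ in range(max_attempts):
        (next_seq,) = conn.execute(
            "SELECT COALESCE(MAX(session_turn_seq), 0) + 1 "
            "FROM agent_runs WHERE session_id = ?",
            (session_id,),
        ).fetchone()
        try:
            conn.execute(
                "INSERT INTO agent_runs (session_id, session_turn_seq) "
                "VALUES (?, ?)",
                (session_id, next_seq),
            )
            return next_seq
        except sqlite3.IntegrityError:
            continue  # another writer took this number; try again
    raise RuntimeError("could not claim a sequence number")


first = claim_next_seq(conn, "s1")
second = claim_next_seq(conn, "s1")

# A writer that still believes the next number is 2 is now rejected.
duplicate_rejected = False
try:
    conn.execute(
        "INSERT INTO agent_runs (session_id, session_turn_seq) VALUES ('s1', 2)"
    )
except sqlite3.IntegrityError:
    duplicate_rejected = True
```

The point of the design is that the duplicate surfaces as a constraint error at write time, instead of as a hidden row that gap-scan cannot see.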

✓ Resolved major HS-2 Unclosed runs (phase-2 update never lands) bypass gap-scan entirely

sp_update_agent_run uses COALESCE — a missing completion call leaves completed_at / tokens / success all NULL. The INSERT-at-spawn row IS present, so (session_id, session_turn_seq) is contiguous → gap-scan green. Row is "open forever" with no TTL watchdog. Cost rollup treats NULL tokens as 0 so the cost number doesn't betray the leak either.

Fix. Add Steward-audit view v_open_agent_runs (completed_at IS NULL AND spawned_at < now() - interval '15 minutes') + a partial index. Add open-run scan to the Steward audit duties.

Shipped by Unit E — v_open_agent_runs view + index added and the open-run scan folded into the STEWARD.md audit-model. (OpenAgentRun still needs its dab-config entry + a 2nd MCP restart — see host actions.)
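
The watchdog predicate is small enough to model directly. This sketch mirrors the condition quoted in the fix (completed_at IS NULL and spawned more than 15 minutes ago) over hypothetical in-memory rows; `open_runs` is an illustrative helper, not the view definition.

```python
from datetime import datetime, timedelta, timezone

OPEN_RUN_TTL = timedelta(minutes=15)  # threshold quoted in the fix


def open_runs(rows, now):
    """Return runs with no completion that were spawned before the cutoff."""
    cutoff = now - OPEN_RUN_TTL
    return [
        r for r in rows
        if r["completed_at"] is None and r["spawned_at"] < cutoff
    ]


now = datetime(2026, 5, 30, 12, 0, tzinfo=timezone.utc)
rows = [
    # Closed normally: never flagged.
    {"id": "r1", "spawned_at": now - timedelta(hours=1),
     "completed_at": now - timedelta(minutes=50)},
    # Still open but recent: inside the grace period.
    {"id": "r2", "spawned_at": now - timedelta(minutes=5),
     "completed_at": None},
    # Open and stale: the phase-2 update never landed.
    {"id": "r3", "spawned_at": now - timedelta(hours=2),
     "completed_at": None},
]

stale_ids = [r["id"] for r in open_runs(rows, now)]
```

Only r3 is reported, which is the row that gap-scan alone would have passed as green.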

✓ Resolved major HS-5 Late-arriving completion silently dropped from finalised UTC-day cells

The rollup aggregates by spawned_at (correct — that's the day-of-spend). But ring-tick only rolls "today". A turn spawned yesterday at 23:55 and completed today at 00:05 fills its tokens after yesterday's cell was computed — yesterday is under-counted forever unless someone re-runs sp_rollup_cost_telemetry(yesterday, today).

Fix. Ring-tick's rollup call uses a rolling 2-day window: (today-1, today) AND (today, today+1). ON CONFLICT DO UPDATE re-aggregates yesterday cleanly.

Shipped by Unit E — ring-tick core-tick step 5 now does a rolling 2-day rollup (today-1 AND today); covers Unit D's late-arriving cost fold-in too.
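
A small model shows how the second window repairs yesterday's cell. This is a sketch under assumptions, not the ring-tick implementation: windows are treated as half-open [start, end) day ranges, the single `tokens` field stands in for tokens_in/out, and `rollup_window` and `ring_tick_rollup` are hypothetical names. Overwriting a cell plays the role of ON CONFLICT DO UPDATE.

```python
from datetime import date, datetime, timedelta


def rollup_window(rows, start, end, cells):
    """Recompute cells for runs spawned in [start, end) and upsert them."""
    fresh = {}
    for r in rows:
        day = r["spawned_at"].date()
        if start <= day < end:
            key = (day, r["agent_name"])
            fresh[key] = fresh.get(key, 0) + (r["tokens"] or 0)
    cells.update(fresh)  # overwrite existing cells, like an upsert


def ring_tick_rollup(rows, today, cells):
    """Roll yesterday and today on every tick."""
    one_day = timedelta(days=1)
    rollup_window(rows, today - one_day, today, cells)
    rollup_window(rows, today, today + one_day, cells)


# A turn spawned at 23:55 whose completion has not landed yet.
run = {"agent_name": "scout",
       "spawned_at": datetime(2026, 5, 29, 23, 55), "tokens": None}
rows, cells = [run], {}

ring_tick_rollup(rows, date(2026, 5, 29), cells)
before = cells[(date(2026, 5, 29), "scout")]

run["tokens"] = 500  # completion arrives at 00:05 the next day
ring_tick_rollup(rows, date(2026, 5, 30), cells)
after = cells[(date(2026, 5, 29), "scout")]
```

The cell for 29 May starts at 0 and is corrected to 500 by the next day's tick, with no manual re-run.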

✓ Resolved major HS-9 Audit row for a gap has no first-class slot for the gap key

A gap means the row that would have had id=X was never written — there is no UUID to put in detail_ref. The class='audit' finding therefore has detail_ref=NULL, leaving only free-text columns to encode "session=Y seq=4 missing." No documented convention exists; the query "show audits for the gap in session Y turn 4" is undefined.

Fix (convention only, no DDL). Steward writes gap audits with triggered_by = 'gap:'||session_id||':'||session_turn_seq (parseable, indexable via LIKE 'gap:%'). Document in STEWARD.md § Audit-model (a). Stronger alternative: ALTER TABLE agent_runs ADD COLUMN finding_key text.

Shipped by Unit E — the convention is adopted: Steward gap-audits use triggered_by = 'gap:'||session_id||':'||session_turn_seq, documented in the STEWARD.md audit-model.
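
Because the convention is a plain string format, a pair of helpers keeps writers and readers in agreement. The sketch below is illustrative: `format_gap_key` and `parse_gap_key` are hypothetical names, and the session id used is made up. Splitting on the last colon is a defensive choice in case a session id ever contains one.

```python
GAP_PREFIX = "gap:"


def format_gap_key(session_id, session_turn_seq):
    """Build the triggered_by value for a gap audit."""
    return f"{GAP_PREFIX}{session_id}:{session_turn_seq}"


def parse_gap_key(triggered_by):
    """Return (session_id, seq) for a gap key, or None for anything else."""
    if not triggered_by.startswith(GAP_PREFIX):
        return None
    session_id, _, seq = triggered_by[len(GAP_PREFIX):].rpartition(":")
    if not session_id or not seq.isdigit():
        return None
    return session_id, int(seq)


key = format_gap_key("sess-Y", 4)
parsed = parse_gap_key(key)
not_a_gap = parse_gap_key("manual")
```

A lookup such as "audits for the gap in session Y turn 4" then becomes an equality match on the formatted key.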

Minor findings (4)
  • HS-4 — detail_ref write-ordering: half-audited trail on partial failure of the rotation_log → agent_runs chain.
  • HS-3 — No DB-level enforcement of "no proxy-authorship" (acknowledged tradeoff post-ISS-22 single-principal MCP). Fix: add created_by_agent text NOT NULL; Steward audits created_by_agent <> agent_name.
  • HS-6 — Concurrent sp_rollup for the same window can let the staler snapshot win. Fix: pg_advisory_xact_lock at SP start.
  • HS-7 — class is plain text; typos pollute silently. Fix: add CHECK (class IN (…)).

✓ Shipped — Unit E: HS-4 codified as detail-row-first ordering in Step 9b (rotation_log/snapshot first, agent_runs spine carrying detail_ref at INSERT); HS-3 created_by_agent NOT NULL DEFAULT 'unknown' + Steward proxy-authorship scan; HS-6 pg_advisory_xact_lock in the rollup; HS-7 agent_runs_class_check CHECK constraint.

What shipped, in priority order

Shipped

The campaign worked the original triage order top-down. Every item below is done — struck through with its evidence (Steward ack, patch sha, or skill/brief edit). Items that were live DDL were applied by the Steward (it owns DDL creds); the rest were brief/skill edits.

  1. D1 — fix STEWARD.md step-5 brief paths + add the no-fallthrough-to-ZOMBIE guard. done Units A + B — workflow.json resolves CoS/SA to real files; STEWARD.md calls rotate-role. Misspelt filename left cosmetic.
  2. D2 — append Step 9b so rotations write rotation_log + agent_runs. done Unit A — Step 9b authored in the new rotate-role/SKILL.md (detail-row-first).
  3. B1 + B2 — add the CAS predicate, relax the failed → queued guard on sp_update_ado_writeback. done Unit C — fix-c-ado-drainer.sql (sha256 a019d5e9…), Steward-applied.
  4. B3 — make sp_check_cost_breaker roll today first. done Unit D — fix-d-cost-breaker.sql (sha256 a9bd1a0f…), Steward-applied; breaker arms (verified $0).
  5. HS-10 — add the class-aware exclusion to sp_rollup_cost_telemetry + re-run affected days. done Unit E — fix-e-agent-runs.sql (sha256 0fde1851…), Steward-applied; 8 cells rebaselined.
  6. HS-1 — add the partial UNIQUE index on (session_id, session_turn_seq). done Unit E — index added; sp_create_agent_run RAISEs 23502 on NULL keys.
  7. D3 + D4 — insert await + status-probe in the rotation flow. done Units A + B — boot-turn await in STEWARD.md ceremony; status-probe re-adopt in rotate-role.
  8. D5 + D6 — Charlie-ack on the ZOMBIE path; document Steward self-rotation with the Operator as rotator-of-record. done Unit B — cold-spawn+archive default, m_ask_user ack gate; Operator-as-rotator branch + selfRotationRotator:"operator".
  9. HS-2 + HS-5 — open-run watchdog view + rolling-2-day rollup. done Unit E — v_open_agent_runs + Steward open-run scan; ring-tick rolling 2-day rollup.
  10. The remaining majors and all minors — quality-of-life, fold into the next batch. done All folded into Units A–E (see the minor-findings expanders above).

Remaining host action. The haleon-aiac-db MCP restart pending from Batch 5/6 has been completed (10:42 BST); Batch 7 added no new entities so no further restart is needed. NEEDS-CHARLIE #5 (live-enable the recurring ring-tick) is purely Charlie's three-greens posture call now — flip the three ScoutEnableFlagsIn booleans (one at a time, for per-green audit attribution), then a separate posture call to enable the recurring timer.


Not done by an agent — needs the host

Outstanding host actions

The fix campaign is complete in code and DDL, but a handful of items need Charlie / the host to land. None blocks the fixes already applied to the database; they gate when the new MCP surfaces become reachable.

✓ Batch 6 file edits committed: dab-config.json now 52 entities (Unit F) with the 6 base-table flips to read-only (Unit H). Restart landed: haleon-aiac-db respawned 30 May 10:42 BST — the 5 new entities are live; tightened permissions in force.

✓ Batch 7 ratified: K1 + K2-verify Steward-applied (audit rows e88a7ad8… / c19459c6…); K3–K6 briefs + ring-tick / CHIEFOFSTAFF.md / STEWARD.md doc edits Operator-applied with sha256 verification; 1M validator round-2 PASS.

Ratification register. The rejection ruling on REJECTED-fix-b-rotate-sp.sql (10:42 BST) is documented in Batch 5: the sp_update_agent_run p_detail_ref extension is retired by A + E + Steward consensus, because it would re-open the write-ordering race that the detail-row-first Step 9b closes. No further action.