cbusillo · shiny-code-bot · Jun 5, 2026 · Jun 5, 2026
diff --git a/code-rs/core/docs/auto-review.md b/code-rs/core/docs/auto-review.md
@@ -1,50 +1,202 @@
-# Auto-review / Auto-resolve Workflow (Exec, TUI, Auto-drive)
-
-This note summarizes how automated reviews flow across the CLI layers and what invariants keep snapshots fresh and non-overlapping.
-
-## Overview
-- **Exec (/review + auto-resolve):** Captures a ghost snapshot, acquires the shared review lock, runs review, and may loop fixes → follow-up review until clean or limits hit.
-- **TUI (interactive + background):** Triggers reviews, renders progress, and now shares the same lock + epoch rules so background reviews cannot overlap or reuse stale snapshots.
-- **Auto-drive:** Orchestrates turns; when it triggers reviews/auto-resolve it also takes the global lock and relies on the same snapshot epoch to reject stale follow-ups.
-
-## Core states
-- **Review lock (file-based):** Only one review pipeline (exec/TUI/auto-drive) may run at a time. If lock acquisition fails, the review is skipped/deferred.
-- **Snapshot epoch (monotonic counter):** Bumped on every git mutation we control (ghost snapshots, git helpers, worktree/remote ops). Follow-up reviews compare epochs and abort if the world moved.
-- **Auto-resolve phases:**
-  - `PendingFix` → apply patch
-  - `AwaitingFix` → wait for model output
-  - `WaitingForReview` → capture fresh snapshot, re-review
-  - Abort when identical snapshot, stale base, or epoch advanced
-
-## Ghost snapshots
-- Ghost commits are taken via `CreateGhostCommitOptions` **with** `post_commit_hook(&|| bump_snapshot_epoch())` so every snapshot invalidates prior review targets.
-- Stored metadata (commit hash + epoch) is compared before follow-up reviews; mismatch stops the loop.
-
-## Lock + epoch rules
-- Acquire the shared lock before sending a review or follow-up. Release on completion/stop/reset.
-- If lock busy: skip/defer rather than overlap.
-- If `current_snapshot_epoch` differs from the recorded epoch for the review context, abort and recapture.
-
-## Git mutations: use the helpers
-- Use `chatwidget::run_git_command` (TUI) or the auto-drive git helper; both bump epoch on mutating verbs (`pull`, `checkout`, `merge`, `apply`).
-- Worktree/remote management goes through `core::git_worktree`, which bumps epoch on remove/add/remote updates.
-- Ghost snapshots already bump epoch via the post-commit hook.
-- New git touchpoints must either call these helpers or call `bump_snapshot_epoch()` immediately after a successful mutation. The `git_mutation_guard` test enforces this for TUI/auto-drive.
-
-## Cleanup
-- Session worktree cleanup acquires the shared lock with retries; successful removal bumps the epoch so stale reviews are invalidated. If lock stays busy, cleanup defers to a later run.
-
-## Failure handling
-- Follow-up review skipped when:
-  - lock unavailable
-  - snapshot identical to last reviewed commit
-  - base snapshot no longer ancestor of HEAD
-  - epoch advanced (stale snapshot)
-- Apply/patch failure + resume must recapture; epoch drift detects staleness.
-
-## What to remember when extending
-- Add new git mutations via the shared helpers or bump the epoch manually.
-- If you add a new review entrypoint, acquire the shared lock.
-- When capturing snapshots, record the epoch and base commit; compare before re-reviewing.
-- If you must bypass the guardrails, document the reason and keep the code path read-only.
+# Auto Review Lifecycle
 
+This note describes the durable Auto Review flow shared by exec, TUI, and
+auto-drive. The important rule is that Auto Review is a harness-owned quality
+pipeline: the assistant may read compact review state and ask for bounded
+details, but run identity, freshness, duplicate avoidance, cancellation, and
+restart recovery belong to Every Code runtime code.
+
+## Surfaces
+
+- **Exec (`/review` and auto-resolve):** captures a ghost snapshot, acquires the
+  shared review lock, runs review, and may loop fixes into follow-up reviews
+  until clean or limits hit.
+- **TUI background review:** triggers and renders background review progress,
+  while writing the same durable run records as exec.
+- **Auto-drive:** orchestrates turns and uses the same lock, snapshot epoch, and
+  durable run store when it launches or follows up on reviews.
+- **Assistant context:** receives only a compact Auto Review ledger when there is
+  active, recent, actionable, or diagnostic review state. Full review output is
+  lazy detail, not normal turn context.
+
+## Durable Store
+
+Auto Review state is repo scoped under the shared review state directory:
+
+```text
+CODE_HOME/state/review/repo-<repo-key>/auto-review/
+  runs.json
+  outputs/<run_id>.json
+```
+
+`runs.json` stores bounded `AutoReviewRun` records. Output sidecars preserve full
+review results for lazy detail lookup. The store keeps enough proof to explain a
+run without injecting bulky review bodies into every prompt:
+
+- identity and ownership: `run_id`, `source`, `owner_session_id`, `agent_id`,
+  `batch_id`, `worktree_path`
+- target: `base_commit`, `snapshot_commit`, `snapshot_epoch`, `head_at_launch`,
+  `scope_hash`, `diff_fingerprint`, `prompt_policy_version`, changed paths
+- lifecycle: `status`, `freshness`, `created_at`, `started_at`, `updated_at`,
+  `completed_at`, `last_activity_at`
+- execution shape: `model`, `reasoning_effort`, prompt token estimate, actual
+  token count when available, saved token estimate when a duplicate is skipped
+- result summary: finding digests, summary digest, supersession, cancel reason,
+  error class, error summary, and output sidecar path
+
+## Status And Freshness
+
+`status` is the run lifecycle. `freshness` is the usefulness or liveness of the
+review target. They are related but not interchangeable.
+
+Run statuses:
+
+- `Pending`: recorded but not yet doing review work.
+- `Snapshotting`: capturing or preparing the reviewed snapshot.
+- `Reviewing`: the review agent is running against a snapshot.
+- `Resolving`: auto-resolve is applying or preparing follow-up work.
+- `Completed`: terminal review output was produced.
+- `Failed`: the run ended with an execution error.
+- `Cancelled`: the harness or user stopped the run for an explicit reason.
+- `Superseded`: a newer run replaces this run's useful scope.
+- `Skipped`: the harness intentionally did not launch work, commonly because an
+  equivalent active or recent review already exists.
+- `Lost`: durable state survived, but the owning process or agent could not be
+  reconciled after restart.
+
+Freshness values:
+
+- `Current`: the run still matches the active review target.
+- `LongRunning`: the run has taken a while but still has evidence of being live
+  and relevant.
+- `Inactive`: the run has stopped showing activity without enough evidence to
+  call it terminal.
+- `Superseded`: a newer matching scope has replaced the run.
+- `Obsolete`: the reviewed snapshot is no longer useful for the active head or
+  epoch.
+- `Lost`: restart reconciliation could not find the owning agent/process.
+- `Unknown`: the store cannot yet prove a more specific freshness value.
+
+Elapsed runtime alone must not make a run stale. A 30 minute review can be
+healthy if its snapshot is still relevant and the owner is active. Staleness
+comes from snapshot/head/epoch mismatch, supersession, obsolete scope, lost
+ownership, or inactivity evidence.
+
+## Lock, Epoch, And Snapshot Rules
+
+- Acquire the shared review lock before sending a review or follow-up. If the
+  lock is busy, adopt, skip, or defer instead of launching overlapping work.
+- Ghost commits are taken with a snapshot epoch bump, so every controlled
+  snapshot invalidates older review targets.
+- Follow-up reviews compare the recorded snapshot epoch and base commit before
+  continuing. If the epoch advanced, the base is no longer an ancestor, or the
+  snapshot is identical to the last reviewed commit, the loop stops and
+  recaptures only through harness policy.
+- Git mutations must use shared helpers that bump the snapshot epoch. New
+  mutation touchpoints must either call those helpers or bump the epoch
+  immediately after success.
+- Session worktree cleanup also coordinates through the shared lock. Successful
+  cleanup bumps the epoch so stale review targets are invalidated.
+
+## Duplicate, Supersede, And Cancellation Policy
+
+The harness owns duplicate and cancellation decisions. The assistant may reason
+from the ledger, but it should not directly cancel or relaunch reviews as a way
+to enforce policy.
+
+- Prefer adopting or reconnecting to an equivalent active run before launching a
+  new one.
+- Skip duplicate work when the diff fingerprint, scope hash, prompt policy, and
+  target are equivalent enough that another run already covers the useful scope.
+  Duplicate skips use the `duplicate_auto_review_scope` reason and may record a
+  saved token estimate.
+- Mark older useful scopes as `Superseded` when a newer run replaces them.
+- Use `Obsolete` or stale target applicability for terminal findings that no
+  longer match the active checkout; preserve the evidence, but do not surface it
+  as current work.
+- Cancel only for explicit user stop, superseded or obsolete scope, dead/lost
+  process evidence, hard budget exhaustion, or proven duplicate policy. Do not
+  cancel solely because elapsed runtime is high.
+- Preserve terminal findings as evidence. They should be current, superseded,
+  obsolete, dismissed, or archived; they should not vanish silently.
+
+## Compact Ledger
+
+The compact ledger is the assistant-facing summary. It is emitted only when
+there is useful active, recent, actionable, or diagnostic state, and it is capped
+by byte budget. Idle clean runs produce no ledger.
+
+Ledger entries include active or applicable runs by default, using stable run ids
+and compact fields such as status, freshness, target applicability, source,
+branch, snapshot, age, last activity, model, reasoning effort, prompt estimate,
+actual tokens when available, saved token estimate, elapsed bucket, finding
+count, summary, and short finding digests.
+
+The ledger intentionally separates:
+
+- `freshness`: whether the run is current, long-running, inactive, superseded,
+  obsolete, lost, or unknown.
+- `target`: whether the run matches the active head, is an older snapshot, comes
+  from a detached review worktree, or is unknown.
+
+Known stale or detached terminal findings are normally suppressed from actionable
+run listings. The ledger may still include diagnostics such as `suppressed_stale`
+so the assistant knows evidence exists without treating it as work to apply.
+
+Example:
+
+```xml
+<auto_review_ledger schema_version="1" max_bytes="2400">
+Auto Review state for this repo. Listed runs are active or target-applicable by default; diagnostics may include stale/detached history that is not an instruction to re-review or fix. Treat run ids as stable references for future detail lookup with the auto_review_detail tool. Freshness describes run recency; target describes whether findings match the active checkout.
+diagnostics recent_runs=2 in_flight=1 terminal=1 tokens=25915t token_runs=1 prompt_estimate=54000t prompt_runs=2 high_burn=2 longest_elapsed=lt1m
+run id=... status=Reviewing freshness=Current target=matching_head source=Tui branch=feature snapshot=abcdef1 age=lt1m last_activity=lt1m model=gpt-5.4-mini reasoning=medium prompt_estimate=42000t elapsed=lt1m
+</auto_review_ledger>
+```
+
+## Lazy Detail Lookup
+
+Full review bodies live in output sidecars and are retrieved through bounded
+detail lookup by `run_id` and optional `finding_id`. Detail lookup is read-only.
+It returns either run detail or one finding's detail, includes metadata about
+truncation and omitted findings, and enforces a hard byte cap.
+
+Use the compact ledger first. Fetch details only when a listed finding or run is
+relevant to the current task. Do not paste raw sidecar JSON or full review output
+into ordinary context.
+
+## Proof Metrics And Dogfood Diagnostics
+
+Diagnostics are designed to prove whether Auto Review is saving work and
+surfacing useful findings. The compact ledger can report recent counters without
+listing bulky run details:
+
+- lifecycle totals: `recent_runs`, `in_flight`, `terminal`, `failed`,
+  `cancelled`, `lost`
+- duplicate and supersession proof: `skipped`, `duplicate_skipped`,
+  `superseded`, `saved_estimate`, `saved_runs`
+- token and prompt cost: `tokens`, `token_runs`, `prompt_estimate`,
+  `prompt_runs`, `high_burn`
+- usefulness and freshness proof: `suppressed_stale`, plus per-run finding
+  counts, elapsed buckets, summary digests, and stable finding ids
+
+Dogfood sessions should use these signals to answer concrete questions:
+
+- Did duplicate review launches decrease?
+- How many tokens were actually spent or plausibly avoided?
+- Did terminal findings surface while still current?
+- Are stale, detached, superseded, cancelled, failed, and lost runs classified
+  without polluting normal assistant context?
+- Was latency caused by the first review pass, follow-up loops, lock/worktree
+  contention, retries, prompt size, model choice, or reasoning effort?
+
+## Extension Checklist
+
+- Add new review entrypoints through the durable run store.
+- Record lifecycle transitions with enough target, timing, model, token, and
+  terminal-reason data to explain the run later.
+- Use lock and epoch helpers for every review or git mutation path.
+- Keep runtime separate from staleness; long-running can still be healthy.
+- Preserve terminal evidence, but surface only bounded current/actionable state
+  by default.
+- Keep assistant-visible context compact. Add diagnostics or lazy detail instead
+  of injecting raw review bodies.