cbusillo · Jun 6, 2026
diff --git a/.github/github.json b/.github/github.json
@@ -51,7 +51,8 @@
       "verifyGitHubRelease": "gh release view v<version> --repo cbusillo/code",
       "upstreamCursors": "just local-upstream-cursors",
       "upstreamImport": "just local-upstream-import",
-      "localReleaseNotes": "just local-release-notes"
+      "localReleaseNotes": "just local-release-notes",
+      "execHarnessSmoke": "just harness-smoke"
     },
     "docsRequiredWhen": [
       "CLI behavior changes",

diff --git a/docs/codex-fork-parity-ledger.md b/docs/codex-fork-parity-ledger.md
@@ -73,6 +73,7 @@ preserving the old module boundary.
 | Agents, multi-agent workflows, and selectors | `code-rs/core/src/agent_defaults.rs`, `code-rs/core/src/agent_tool.rs`, `code-rs/core/tests/agent_completion_wake.rs`, `code-rs/core/tests/antigravity_agent_spec.rs`, `code-rs/tui/src/chatwidget/agent*.rs`, `code-rs/tui/src/history_cell/agent.rs`, `vt100_chatwidget_snapshot__agent_*.snap`, `settings_overlay_agents.rs` | Docs still describe built-in multi-agent selectors. Current `main` may contain newer agent infrastructure, but old TUI/status/test parity needs confirmation. | Port | Create focused fixtures for selector resolution, agent run grouping, agent status errors, and settings UI before changing agent orchestration. |
 | Config, identity, skills, and prompts | Deleted old `code-rs/core/src/config*`, `config_loader/*`, `custom_prompts.rs`, `slash_commands.rs`, `skills/*`, `external_agent_config.rs`, and tests such as `custom_prompts_discovery.rs`, `skill_command_policy.rs`, `prompt_context_dedup.rs`, `external_agent_config` snapshots | PR #391 and #394 restored important config compatibility, including `CODE_HOME` precedence and legacy `[tui] alternate_screen`. Current Codex-base has its own config, prompt, and skills systems. | Rewrite | Keep compatibility tests for `CODE_HOME`, legacy config shapes, prompts, skills, and external agent import. Mark individual probes covered only after linking current tests. |
 | Exec, sandbox, patch, and validation harness | Deleted old `patch_harness.rs`, `workflow_validation.rs`, `dry_run_guard.rs`, `command_safety/*`, `git_worktree.rs`, `exec_command/*`, `tool_hooks.rs`, `git_mutation_guard.rs`, `stuck_exec.rs`, `exec_completion_test.rs`, `wayland_clipboard_feature_regression.rs`, `windows_altgr.rs` | Codex-base has richer exec/sandbox/apply-patch crates. Every Code-specific validator discovery and git safety behavior must be compared rather than restored wholesale. | Rewrite | Add fixtures for changed-file validation, git mutation safety, shell completion, and platform shortcut regressions only where Codex-base lacks equivalent tests. |
+| Dogfood exec harness gate | `tools/code-exec-harness/` was preserved across the pivot, but pre-pivot dogfood exec coverage has not yet been recast as the first fork-baseline gate | #404 defines Dogfood Parity 1. #405 owns the deterministic no-live-token `code exec --json` smoke gate using the dev-fast `code` binary from this checkout. | Rewrite | Make `just harness-smoke` pass after `./build-fast.sh`, then extend the suite with old high-signal exec, resume, sandbox, and validation probes before claiming daily-driver parity. |
 | Session catalog, resume, rollout, memory, and history | Deleted `active_sessions.rs`, `session_catalog.rs`, `rollout/*`, `conversation_history.rs`, `message_history.rs`, `context_ledger.rs`, `context_timeline/*`, `memories/*`, `retention.rs`, tests including `active_session_warnings.rs`, `session_catalog_resume.rs`, `image_history_replay.rs`, `resume_catalog_integration.rs`, `resume_replay.rs`, `history_cutoff_probe.rs` | Codex-base has thread store/history/resume/rollback primitives and recent resume CLI tests. External-session continuity remains unproven. | Rewrite | Prioritize external-session and image/history replay fixtures. Treat local resume picker parity as covered only after matching current tests are identified. |
 | UI polish and layout regression snapshots | `bottom_spacer_clip_regression.rs`, `ui_smoke.rs`, `mid_turn_assistant_styling.rs`, `mid_turn_queueing.rs`, `mcp_session_cleanup.rs`, `non_windows_shortcuts.rs`, bottom spacer snapshots, settings/help snapshots, multiline/tool activity snapshots | Current Codex TUI has its own snapshot suite. Old snapshots may assert implementation-specific styling, but many captured useful regressions around wrapping, spacing, queueing, and keyboard behavior. | Defer | Recreate high-value regressions against current TUI only when they guard active overlay behavior or known user-facing polish regressions. |
 | Build, release, dogfood, and cleanup workflows | Pre-#390 workflow/docs/scripts should be compared separately from code/test parity, including release guardrails and local cleanup behavior | #396 owns build-cache strategy follow-up. Release/dogfood docs now need a provisional-baseline warning until parity status is explicit. | Defer | Keep #396 for cache behavior. Add release guardrails before release-bound work; do not treat this first ledger as product parity. |
@@ -102,9 +103,11 @@ Created or existing focused issues before implementation work in these areas:
 3. #400: Auto Review proof metrics and background review evidence.
 4. #388: External-session continuity fixture for GitHub/LaunchPlane/remote-origin sessions.
 5. #401: Token/rate-limit/prompt-cache diagnostics parity.
-6. Needed: Agent selector and multi-agent status parity.
-7. Needed: Exec/patch/validation harness parity.
-8. Needed: High-signal UI/platform probes not owned by #398/#399/#401.
+6. #404: Dogfood Parity 1 daily-driver baseline.
+7. #405: Make `tools/code-exec-harness` the deterministic P0 exec gate.
+8. Needed: Agent selector and multi-agent status parity.
+9. Needed: Exec/patch/validation harness parity beyond the first smoke gate.
+10. Needed: High-signal UI/platform probes not owned by #398/#399/#401.
 
 ## Update Protocol
 

diff --git a/justfile b/justfile
@@ -91,6 +91,10 @@ local-product-health:
 local-fork-health:
     ./scripts/local/fork-health.sh
 
+[no-cd]
+harness-smoke:
+    ./tools/code-exec-harness/run-deterministic.sh
+
 [no-cd]
 local-cleanup-space *args:
     ./scripts/local/cleanup-space.sh "$@"

diff --git a/tools/code-exec-harness/AGENTS.md b/tools/code-exec-harness/AGENTS.md
@@ -4,6 +4,13 @@ Use this harness as the default proving ground for realistic `code exec`
 behavior when working on token efficiency, prompt/context composition, skills,
 memory, compaction, resume, model routing, tool choice, and GitHub automation.
 
+For Dogfood Parity 1, the deterministic no-token smoke suite is the P0 gate for
+the built `code` binary. Run `just harness-smoke` after `./build-fast.sh`; it
+uses `code-rs/target/dev-fast/code` by default and fails clearly if the binary is
+missing or the covered `code exec --json` contract regresses. Use
+`CODE_EXEC_HARNESS_BIN=/path/to/code just harness-smoke` only when intentionally
+checking another binary.
+
 The goal is effectiveness-adjusted token efficiency: reduce wasted tokens only
 when task success, reliability, instruction following, and useful tool behavior
 are preserved or improved. Saving tokens by making the agent less capable is a

diff --git a/tools/code-exec-harness/README.md b/tools/code-exec-harness/README.md
@@ -28,6 +28,36 @@ points at the copied GitHub CLI config inside that redirected home.
 
 ## Run
 
+Run the deterministic no-token smoke suite after `./build-fast.sh`:
+
+```sh
+just harness-smoke
+```
+
+`just harness-smoke` targets `code-rs/target/dev-fast/code` by default so the
+suite validates the binary built by this checkout, not whichever `code` happens
+to be on `PATH`. To test another binary, set `CODE_EXEC_HARNESS_BIN`:
+
+```sh
+CODE_EXEC_HARNESS_BIN=/path/to/code just harness-smoke
+```
+
+The deterministic suite uses fake `/v1/responses` fixtures and does not spend
+live model tokens. It is the Dogfood Parity 1 P0 gate for `code exec --json`:
+if the built binary is missing, cannot start, stops emitting expected JSONL, or
+regresses the covered request-shape contracts, this suite should fail.
+
+Run a single scenario directly when you need narrower evidence:
+
+```sh
+python3 tools/code-exec-harness/harness.py \
+  tools/code-exec-harness/scenarios/exec-basic-smoke.json \
+  --code-bin code-rs/target/dev-fast/code
+```
+
+Run the live GitHub planning smoke only when model behavior is the thing being
+tested:
+
 ```sh
 python3 tools/code-exec-harness/harness.py \
   tools/code-exec-harness/scenarios/github-plan-smoke.json \

diff --git a/tools/code-exec-harness/harness.py b/tools/code-exec-harness/harness.py
@@ -7,6 +7,7 @@
 import os
 import re
+import shlex
 import shutil
 import socketserver
 import subprocess
 import sys
@@ -312,6 +313,12 @@ def materialize_workspace(scenario: dict[str, Any], paths: RunPaths) -> None:
 def materialize_skills(scenario: dict[str, Any], paths: RunPaths, scenario_dir: Path, extra_roots: list[Path]) -> None:
     skills_dir = paths.code_home / "skills"
     skills_dir.mkdir(parents=True, exist_ok=True)
+    workspace_skills = paths.workspace / ".code" / "skills"
+    if workspace_skills.is_dir():
+        for child in sorted(workspace_skills.iterdir()):
+            if child.is_dir() and (child / "SKILL.md").is_file():
+                copy_or_link(child, skills_dir / child.name, symlink=True)
+
     roots: list[Path] = []
     for value in scenario.get("skill_roots", []):
         roots.append(resolve_path(str(value), scenario_dir))
@@ -556,10 +563,26 @@ def write_fake_gh(scenario: dict[str, Any], paths: RunPaths) -> dict[str, Path]
     shim = paths.bin_dir / "gh"
     put_text(shim, FAKE_GH)
     shim.chmod(0o755)
-    put_text(paths.shell_home / ".zshenv", f"gh() {{ {shim} \"$@\"; }}\n")
+    quoted_shim = shlex.quote(str(shim))
+    put_text(paths.shell_home / ".zshenv", f"gh() {{ {quoted_shim} \"$@\"; }}\n")
     return {"fixture": fixture_path, "log": log_path, "state": state_path}
 
 
+def code_exec_supports_option(code_bin: Path, option: str) -> bool:
+    try:
+        result = subprocess.run(
+            [str(code_bin), "exec", "--help"],
+            check=False,
+            stdout=subprocess.PIPE,
+            stderr=subprocess.PIPE,
+            text=True,
+            timeout=10,
+        )
+    except (OSError, subprocess.TimeoutExpired):
+        return False
+    return option in f"{result.stdout}\n{result.stderr}"
+
+
 def build_command(scenario: dict[str, Any], args: argparse.Namespace, paths: RunPaths) -> list[str]:
     return build_command_for_prompt(scenario, args, paths, str(scenario.get("prompt", "")), None)
 
@@ -570,14 +593,15 @@ def build_command_for_prompt(
     paths: RunPaths,
     prompt: str,
     resume_session_id: str | None,
+    fake_responses_base_url: str | None = None,
 ) -> list[str]:
     code_bin_value = args.code_bin or shutil.which("code")
     if not code_bin_value:
         raise HarnessError("could not find `code`; pass --code-bin")
     code_bin = Path(code_bin_value)
     command = [str(code_bin), "exec", "--json", "--skip-git-repo-check"]
     max_seconds = scenario.get("max_seconds", args.max_seconds)
-    if max_seconds:
+    if max_seconds and code_exec_supports_option(code_bin, "--max-seconds"):
         command.extend(["--max-seconds", str(max_seconds)])
     command.extend(["-C", str(paths.workspace)])
     if scenario.get("include_plan_tool", False):
@@ -592,6 +616,8 @@ def build_command_for_prompt(
     sandbox = scenario.get("sandbox") or args.sandbox
     if sandbox:
         command.extend(["--sandbox", str(sandbox)])
+    if fake_responses_base_url:
+        command.extend(["-c", f"openai_base_url={json.dumps(fake_responses_base_url)}"])
     for override in scenario.get("config_overrides", []):
         command.extend(["-c", str(override)])
     if resume_session_id:
@@ -805,6 +831,31 @@ def assert_expectations(summary: dict[str, Any], scenario: dict[str, Any]) -> li
         expected = int(expect["responses_request_count"])
         if actual != expected:
             failures.append(f"responses request count expected {expected}, got {actual}")
+    event_counts = expect.get("event_count")
+    if isinstance(event_counts, dict):
+        actual_counts: dict[str, int] = {}
+        for event in summary.get("events", []):
+            if not isinstance(event, dict):
+                continue
+            event_type = event.get("type")
+            if isinstance(event_type, str):
+                actual_counts[event_type] = actual_counts.get(event_type, 0) + 1
+            msg = event.get("msg")
+            if isinstance(msg, dict):
+                msg_type = msg.get("type")
+                if isinstance(msg_type, str):
+                    key = f"msg.{msg_type}"
+                    actual_counts[key] = actual_counts.get(key, 0) + 1
+            item = event.get("item")
+            if isinstance(item, dict):
+                item_type = item.get("type")
+                if isinstance(item_type, str):
+                    key = f"item.{item_type}"
+                    actual_counts[key] = actual_counts.get(key, 0) + 1
+        for event_type, expected in event_counts.items():
+            actual = actual_counts.get(str(event_type), 0)
+            if actual != int(expected):
+                failures.append(f"event count for {event_type!r} expected {expected}, got {actual}")
     for assertion in expect.get("responses", []):
         if not isinstance(assertion, dict):
             failures.append("responses expectation entries must be objects")
@@ -911,7 +962,8 @@ def run_with_env(
             last_returncode = 0
             for index, turn in enumerate(turn_prompts, start=1):
                 prompt = str(turn.get("prompt", "") if isinstance(turn, dict) else turn)
-                command = build_command_for_prompt(scenario, args, paths, prompt, session_id)
+                fake_base_url = fake_server.base_url if fake_server is not None else None
+                command = build_command_for_prompt(scenario, args, paths, prompt, session_id, fake_base_url)
                 commands.append(command)
                 returncode, events = run_exec_capture(command, scenario, paths, run_env, f"turn-{index}")
                 all_events.extend(events)
@@ -924,7 +976,17 @@ def run_with_env(
                     session_id = session_id_from_summary_or_catalog(turn_summary, paths)
             return last_returncode, all_events, commands
 
-        command = build_command(scenario, args, paths)
+        if fake_server is not None:
+            command = build_command_for_prompt(
+                scenario,
+                args,
+                paths,
+                str(scenario.get("prompt", "")),
+                None,
+                fake_server.base_url,
+            )
+        else:
+            command = build_command(scenario, args, paths)
         returncode, events = run_exec_capture(command, scenario, paths, run_env, "turn-1")
         return returncode, events, [command]
 
@@ -956,6 +1018,7 @@ def run_with_env(
     summary = summarize(events, paths, returncode, summary_command)
     summary["commands"] = summary.get("commands", [])
     summary["scenario_commands"] = [" ".join(command) for command in commands]
+    summary["events"] = events
     summary["responses_requests"] = responses_requests
     failures = assert_expectations(summary, scenario)
     summary["expectation_failures"] = failures
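
The `.zshenv` change in `write_fake_gh` matters when the run directory contains spaces or shell metacharacters. A minimal sketch of the difference, assuming a hypothetical shim path; the helper name `zshenv_line` is illustrative only, since the harness builds this line inline:

```python
import shlex
from pathlib import Path


def zshenv_line(shim: Path, quote: bool = True) -> str:
    """Build the zsh function that forwards `gh` to the fake shim.

    Hypothetical helper for illustration; not part of harness.py.
    """
    target = shlex.quote(str(shim)) if quote else str(shim)
    return f'gh() {{ {target} "$@"; }}\n'


# Hypothetical path with a space, as a temp dir under a user home might have.
shim = Path("/tmp/run dir/bin/gh")

unquoted = zshenv_line(shim, quote=False)
quoted = zshenv_line(shim)

# shlex.split follows POSIX word splitting, so it approximates how the shell
# would tokenize the function body.
print(shlex.split(unquoted)[2:4])  # the path is split into two words
print(shlex.split(quoted)[2])      # the path survives as one word
```

Without quoting, the shell would try to run the first word of the path as the command, so the fake `gh` would never be reached.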

diff --git a/tools/code-exec-harness/run-deterministic.sh b/tools/code-exec-harness/run-deterministic.sh
@@ -0,0 +1,28 @@
+#!/usr/bin/env bash
+set -euo pipefail
+
+ROOT_DIR=$(cd "$(dirname "${BASH_SOURCE[0]}")/../.." && pwd)
+CODE_BIN=${CODE_EXEC_HARNESS_BIN:-"$ROOT_DIR/code-rs/target/dev-fast/code"}
+
+if [ ! -f "$CODE_BIN" ] || [ ! -x "$CODE_BIN" ]; then
+  cat >&2 <<EOF
+error: code exec harness binary is missing or not executable:
+  $CODE_BIN
+
+Run ./build-fast.sh first, or set CODE_EXEC_HARNESS_BIN=/path/to/code.
+EOF
+  exit 2
+fi
+
+scenarios=(
+  "$ROOT_DIR/tools/code-exec-harness/scenarios/context-ledger-request-summary.json"
+  "$ROOT_DIR/tools/code-exec-harness/scenarios/exec-basic-smoke.json"
+  "$ROOT_DIR/tools/code-exec-harness/scenarios/image-history-replay.json"
+  "$ROOT_DIR/tools/code-exec-harness/scenarios/manual-skill-explicit-invocation.json"
+  "$ROOT_DIR/tools/code-exec-harness/scenarios/manual-skill-not-implicit.json"
+  "$ROOT_DIR/tools/code-exec-harness/scenarios/project-doc-skill-dedup.json"
+)
+
+python3 "$ROOT_DIR/tools/code-exec-harness/harness.py" \
+  "${scenarios[@]}" \
+  --code-bin "$CODE_BIN"
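
The binary lookup in this wrapper can be sketched in Python as follows. This is an illustration under stated assumptions, not code from the harness: `resolve_code_bin` and `is_runnable` are hypothetical names, and `/repo` stands in for the checkout root.

```python
import os
from pathlib import Path


def resolve_code_bin(root_dir: Path, env: dict[str, str]) -> Path:
    """Mirror the wrapper's lookup: env override first, then the dev-fast build."""
    override = env.get("CODE_EXEC_HARNESS_BIN")
    # Like the shell's ${VAR:-default}, fall back when the variable is unset or empty.
    if override:
        return Path(override)
    return root_dir / "code-rs" / "target" / "dev-fast" / "code"


def is_runnable(path: Path) -> bool:
    """Require a regular file with the execute bit, which is stricter than a bare `-x` test."""
    return path.is_file() and os.access(path, os.X_OK)


root = Path("/repo")  # hypothetical checkout root
print(resolve_code_bin(root, {}))
print(resolve_code_bin(root, {"CODE_EXEC_HARNESS_BIN": "/opt/code"}))
```

Resolving the default from the checkout root rather than from `PATH` is what keeps the gate tied to the binary built by `./build-fast.sh`.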
diff --git a/tools/code-exec-harness/scenarios/context-ledger-request-summary.json b/tools/code-exec-harness/scenarios/context-ledger-request-summary.json
@@ -21,22 +21,16 @@
   "expect": {
     "returncode": 0,
     "responses_request_count": 1,
-    "stderr_contains": [
-      "assembled context ledger for responses request",
-      "SkillsManifest",
-      "ExplicitSkill",
-      "ToolSchema"
-    ],
     "responses": [
       {
         "request": 0,
         "scope": "input",
         "count": {
           "project guidance": 1,
           "- implicit-skill: Implicit skill": 1,
-          "- manual-skill: Manual skill": 0,
+          "- manual-skill: Manual skill": 1,
           "<name>manual-skill</name>": 1,
           "MANUAL_SKILL_BODY_MARKER": 1
         }
       }
     ]

diff --git a/tools/code-exec-harness/scenarios/exec-basic-smoke.json b/tools/code-exec-harness/scenarios/exec-basic-smoke.json
@@ -0,0 +1,50 @@
+{
+    "name": "exec-basic-smoke",
+    "model": "gpt-5.1-codex",
+    "files": {
+        "README.md": "# Exec basic smoke fixture\n"
+    },
+    "prompt": "Reply with exactly: harness-ok",
+    "responses_api": {
+        "responses": [
+            {
+                "response_id": "resp-exec-basic-smoke",
+                "events": [
+                    {
+                        "item": {
+                            "type": "message",
+                            "role": "assistant",
+                            "id": "msg-exec-basic-smoke",
+                            "content": [
+                                {
+                                    "type": "output_text",
+                                    "text": "harness-ok"
+                                }
+                            ]
+                        }
+                    }
+                ]
+            }
+        ]
+    },
+    "expect": {
+        "returncode": 0,
+        "assistant_contains": [
+            "harness-ok"
+        ],
+        "responses_request_count": 1,
+        "event_count": {
+            "item.agent_message": 1,
+            "turn.completed": 1
+        },
+        "responses": [
+            {
+                "request": 0,
+                "scope": "input",
+                "contains": "Reply with exactly: harness-ok"
+            }
+        ]
+    },
+    "max_seconds": 30,
+    "timeout_seconds": 90
+}
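
The `event_count` keys in this scenario follow the naming that `assert_expectations` derives from each JSONL event: the top-level `type`, plus `msg.<type>` and `item.<type>` for nested objects. A minimal sketch of that tallying, using made-up sample events; the `thread.started` and `item.completed` event names are assumptions for illustration, and the real `code exec --json` stream may carry other types and fields:

```python
import json
from collections import Counter
from typing import Any


def tally_event_keys(events: list[Any]) -> Counter:
    """Count top-level event types plus nested msg/item types under prefixed keys."""
    counts: Counter = Counter()
    for event in events:
        if not isinstance(event, dict):
            continue
        if isinstance(event.get("type"), str):
            counts[event["type"]] += 1
        for prefix in ("msg", "item"):
            nested = event.get(prefix)
            if isinstance(nested, dict) and isinstance(nested.get("type"), str):
                counts[f"{prefix}.{nested['type']}"] += 1
    return counts


# Sample stream is hypothetical; only the key-derivation rule comes from the harness.
sample_jsonl = """\
{"type": "thread.started"}
{"type": "item.completed", "item": {"type": "agent_message", "text": "harness-ok"}}
{"type": "turn.completed"}
"""
events = [json.loads(line) for line in sample_jsonl.splitlines() if line.strip()]
counts = tally_event_keys(events)

expected = {"item.agent_message": 1, "turn.completed": 1}
failures = [
    f"event count for {key!r} expected {want}, got {counts.get(key, 0)}"
    for key, want in expected.items()
    if counts.get(key, 0) != want
]
print(failures)
```

Because the comparison is exact, a binary that emits a duplicate `turn.completed` fails the scenario just as one that emits none.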
diff --git a/tools/code-exec-harness/scenarios/image-history-replay.json b/tools/code-exec-harness/scenarios/image-history-replay.json
@@ -48,9 +48,9 @@
                 "request": 1,
                 "scope": "input",
                 "image_payload_bytes": 0,
-                "contains": "image generation result omitted",
                 "contains_all": [
-                    "65536 bytes"
+                    "Generated images are saved to",
+                    "_image_id_.png"
                 ],
                 "not_contains": "data:image/"
             }

diff --git a/tools/code-exec-harness/scenarios/manual-skill-explicit-invocation.json b/tools/code-exec-harness/scenarios/manual-skill-explicit-invocation.json
@@ -22,11 +22,11 @@
         "request": 0,
         "scope": "input",
         "count": {
           "### Available skills": 1,
           "- implicit-skill: Implicit skill": 1,
-          "- manual-skill: Manual skill": 0,
+          "- manual-skill: Manual skill": 1,
           "<name>manual-skill</name>": 1,
           "MANUAL_SKILL_BODY_MARKER": 1
         }
       }
     ]

diff --git a/tools/code-exec-harness/scenarios/manual-skill-not-implicit.json b/tools/code-exec-harness/scenarios/manual-skill-not-implicit.json
@@ -22,10 +22,11 @@
         "request": 0,
         "scope": "input",
         "count": {
           "### Available skills": 1,
           "- implicit-skill: Implicit skill": 1,
-          "- manual-skill: Manual skill": 0
+          "- manual-skill: Manual skill": 1,
+          "Use the manual skill only when explicitly requested.": 0
         }
       }
     ]
   },