diff --git a/.github/github.json b/.github/github.json
index 638ff490096..fb4e06eb2b1 100644
--- a/.github/github.json
+++ b/.github/github.json
@@ -51,7 +51,8 @@
     "verifyGitHubRelease": "gh release view v --repo cbusillo/code",
     "upstreamCursors": "just local-upstream-cursors",
     "upstreamImport": "just local-upstream-import",
-    "localReleaseNotes": "just local-release-notes"
+    "localReleaseNotes": "just local-release-notes",
+    "execHarnessSmoke": "just harness-smoke"
   },
   "docsRequiredWhen": [
     "CLI behavior changes",
diff --git a/docs/codex-fork-parity-ledger.md b/docs/codex-fork-parity-ledger.md
index 5de6a83d933..dc234e9bdec 100644
--- a/docs/codex-fork-parity-ledger.md
+++ b/docs/codex-fork-parity-ledger.md
@@ -73,6 +73,7 @@ preserving the old module boundary.
 | Agents, multi-agent workflows, and selectors | `code-rs/core/src/agent_defaults.rs`, `code-rs/core/src/agent_tool.rs`, `code-rs/core/tests/agent_completion_wake.rs`, `code-rs/core/tests/antigravity_agent_spec.rs`, `code-rs/tui/src/chatwidget/agent*.rs`, `code-rs/tui/src/history_cell/agent.rs`, `vt100_chatwidget_snapshot__agent_*.snap`, `settings_overlay_agents.rs` | Docs still describe built-in multi-agent selectors. Current `main` may contain newer agent infrastructure, but old TUI/status/test parity needs confirmation. | Port | Create focused fixtures for selector resolution, agent run grouping, agent status errors, and settings UI before changing agent orchestration. |
 | Config, identity, skills, and prompts | Deleted old `code-rs/core/src/config*`, `config_loader/*`, `custom_prompts.rs`, `slash_commands.rs`, `skills/*`, `external_agent_config.rs`, and tests such as `custom_prompts_discovery.rs`, `skill_command_policy.rs`, `prompt_context_dedup.rs`, `external_agent_config` snapshots | PR #391 and #394 restored important config compatibility, including `CODE_HOME` precedence and legacy `[tui] alternate_screen`. Current Codex-base has its own config, prompt, and skills systems. | Rewrite | Keep compatibility tests for `CODE_HOME`, legacy config shapes, prompts, skills, and external agent import. Mark individual probes covered only after linking current tests. |
 | Exec, sandbox, patch, and validation harness | Deleted old `patch_harness.rs`, `workflow_validation.rs`, `dry_run_guard.rs`, `command_safety/*`, `git_worktree.rs`, `exec_command/*`, `tool_hooks.rs`, `git_mutation_guard.rs`, `stuck_exec.rs`, `exec_completion_test.rs`, `wayland_clipboard_feature_regression.rs`, `windows_altgr.rs` | Codex-base has richer exec/sandbox/apply-patch crates. Every Code-specific validator discovery and git safety behavior must be compared rather than restored wholesale. | Rewrite | Add fixtures for changed-file validation, git mutation safety, shell completion, and platform shortcut regressions only where Codex-base lacks equivalent tests. |
+| Dogfood exec harness gate | `tools/code-exec-harness/` was preserved across the pivot, but pre-pivot dogfood exec coverage has not yet been recast as the first fork-baseline gate | #404 defines Dogfood Parity 1. #405 owns the deterministic no-live-token `code exec --json` smoke gate using the dev-fast `code` binary from this checkout. | Rewrite | Make `just harness-smoke` pass after `./build-fast.sh`, then extend the suite with old high-signal exec, resume, sandbox, and validation probes before claiming daily-driver parity. |
 | Session catalog, resume, rollout, memory, and history | Deleted `active_sessions.rs`, `session_catalog.rs`, `rollout/*`, `conversation_history.rs`, `message_history.rs`, `context_ledger.rs`, `context_timeline/*`, `memories/*`, `retention.rs`, tests including `active_session_warnings.rs`, `session_catalog_resume.rs`, `image_history_replay.rs`, `resume_catalog_integration.rs`, `resume_replay.rs`, `history_cutoff_probe.rs` | Codex-base has thread store/history/resume/rollback primitives and recent resume CLI tests. External-session continuity remains unproven. | Rewrite | Prioritize external-session and image/history replay fixtures. Treat local resume picker parity as covered only after matching current tests are identified. |
 | UI polish and layout regression snapshots | `bottom_spacer_clip_regression.rs`, `ui_smoke.rs`, `mid_turn_assistant_styling.rs`, `mid_turn_queueing.rs`, `mcp_session_cleanup.rs`, `non_windows_shortcuts.rs`, bottom spacer snapshots, settings/help snapshots, multiline/tool activity snapshots | Current Codex TUI has its own snapshot suite. Old snapshots may assert implementation-specific styling, but many captured useful regressions around wrapping, spacing, queueing, and keyboard behavior. | Defer | Recreate high-value regressions against current TUI only when they guard active overlay behavior or known user-facing polish regressions. |
 | Build, release, dogfood, and cleanup workflows | Pre-#390 workflow/docs/scripts should be compared separately from code/test parity, including release guardrails and local cleanup behavior | #396 owns build-cache strategy follow-up. Release/dogfood docs now need a provisional-baseline warning until parity status is explicit. | Defer | Keep #396 for cache behavior. Add release guardrails before release-bound work; do not treat this first ledger as product parity. |
@@ -102,9 +103,11 @@ Created or existing focused issues before implementation work in these areas:
 3. #400: Auto Review proof metrics and background review evidence.
 4. #388: External-session continuity fixture for GitHub/LaunchPlane/remote-origin sessions.
 5. #401: Token/rate-limit/prompt-cache diagnostics parity.
-6. Needed: Agent selector and multi-agent status parity.
-7. Needed: Exec/patch/validation harness parity.
-8. Needed: High-signal UI/platform probes not owned by #398/#399/#401.
+6. #404: Dogfood Parity 1 daily-driver baseline.
+7. #405: Make `tools/code-exec-harness` the deterministic P0 exec gate.
+8. Needed: Agent selector and multi-agent status parity.
+9. Needed: Exec/patch/validation harness parity beyond the first smoke gate.
+10. Needed: High-signal UI/platform probes not owned by #398/#399/#401.
 
 ## Update Protocol
 
diff --git a/justfile b/justfile
index fc3e0420af4..4ace36616da 100644
--- a/justfile
+++ b/justfile
@@ -91,6 +91,10 @@ local-product-health:
 local-fork-health:
     ./scripts/local/fork-health.sh
 
+[no-cd]
+harness-smoke:
+    ./tools/code-exec-harness/run-deterministic.sh
+
 [no-cd]
 local-cleanup-space *args:
     ./scripts/local/cleanup-space.sh "$@"
diff --git a/tools/code-exec-harness/AGENTS.md b/tools/code-exec-harness/AGENTS.md
index 3467f35477f..fbe63aca9ae 100644
--- a/tools/code-exec-harness/AGENTS.md
+++ b/tools/code-exec-harness/AGENTS.md
@@ -4,6 +4,13 @@ Use this harness as the default proving ground for realistic `code exec`
 behavior when working on token efficiency, prompt/context composition, skills,
 memory, compaction, resume, model routing, tool choice, and GitHub automation.
 
+For Dogfood Parity 1, the deterministic no-token smoke suite is the P0 gate for
+the built `code` binary. Run `just harness-smoke` after `./build-fast.sh`; it
+uses `code-rs/target/dev-fast/code` by default and fails clearly if the binary is
+missing or the covered `code exec --json` contract regresses. Use
+`CODE_EXEC_HARNESS_BIN=/path/to/code just harness-smoke` only when intentionally
+checking another binary.
+
 The goal is effectiveness-adjusted token efficiency: reduce wasted tokens only
 when task success, reliability, instruction following, and useful tool behavior
 are preserved or improved. Saving tokens by making the agent less capable is a
diff --git a/tools/code-exec-harness/README.md b/tools/code-exec-harness/README.md
index 4d6276ddcd1..1ae53977ef8 100644
--- a/tools/code-exec-harness/README.md
+++ b/tools/code-exec-harness/README.md
@@ -28,6 +28,36 @@ points at the copied GitHub CLI config inside that redirected home.
 
 ## Run
 
+Run the deterministic no-token smoke suite after `./build-fast.sh`:
+
+```sh
+just harness-smoke
+```
+
+`just harness-smoke` targets `code-rs/target/dev-fast/code` by default so the
+suite validates the binary built by this checkout, not whichever `code` happens
+to be on `PATH`. To test another binary, set `CODE_EXEC_HARNESS_BIN`:
+
+```sh
+CODE_EXEC_HARNESS_BIN=/path/to/code just harness-smoke
+```
+
+The deterministic suite uses fake `/v1/responses` fixtures and does not spend
+live model tokens. It is the Dogfood Parity 1 P0 gate for `code exec --json`:
+if the built binary is missing, cannot start, stops emitting expected JSONL, or
+regresses the covered request-shape contracts, this suite should fail.
+
+Run a single scenario directly when you need narrower evidence:
+
+```sh
+python3 tools/code-exec-harness/harness.py \
+  tools/code-exec-harness/scenarios/exec-basic-smoke.json \
+  --code-bin code-rs/target/dev-fast/code
+```
+
+Run the live GitHub planning smoke only when model behavior is the thing being
+tested:
+
 ```sh
 python3 tools/code-exec-harness/harness.py \
   tools/code-exec-harness/scenarios/github-plan-smoke.json \
diff --git a/tools/code-exec-harness/harness.py b/tools/code-exec-harness/harness.py
index 26b4159bd10..b1f30117565 100644
--- a/tools/code-exec-harness/harness.py
+++ b/tools/code-exec-harness/harness.py
@@ -7,6 +7,7 @@
 import os
 import re
 import shutil
+import shlex
 import socketserver
 import subprocess
 import sys
@@ -312,6 +313,12 @@ def materialize_workspace(scenario: dict[str, Any], paths: RunPaths) -> None:
 def materialize_skills(scenario: dict[str, Any], paths: RunPaths, scenario_dir: Path, extra_roots: list[Path]) -> None:
     skills_dir = paths.code_home / "skills"
     skills_dir.mkdir(parents=True, exist_ok=True)
+    workspace_skills = paths.workspace / ".code" / "skills"
+    if workspace_skills.is_dir():
+        for child in sorted(workspace_skills.iterdir()):
+            if child.is_dir() and (child / "SKILL.md").is_file():
+                copy_or_link(child, skills_dir / child.name, symlink=True)
+
     roots: list[Path] = []
     for value in scenario.get("skill_roots", []):
         roots.append(resolve_path(str(value), scenario_dir))
@@ -556,10 +563,26 @@ def write_fake_gh(scenario: dict[str, Any], paths: RunPaths) -> dict[str, Path]
     shim = paths.bin_dir / "gh"
     put_text(shim, FAKE_GH)
     shim.chmod(0o755)
-    put_text(paths.shell_home / ".zshenv", f"gh() {{ {shim} \"$@\"; }}\n")
+    quoted_shim = shlex.quote(str(shim))
+    put_text(paths.shell_home / ".zshenv", f"gh() {{ {quoted_shim} \"$@\"; }}\n")
     return {"fixture": fixture_path, "log": log_path, "state": state_path}
 
 
+def code_exec_supports_option(code_bin: Path, option: str) -> bool:
+    try:
+        result = subprocess.run(
+            [str(code_bin), "exec", "--help"],
+            check=False,
+            stdout=subprocess.PIPE,
+            stderr=subprocess.PIPE,
+            text=True,
+            timeout=10,
+        )
+    except (OSError, subprocess.TimeoutExpired):
+        return False
+    return option in f"{result.stdout}\n{result.stderr}"
+
+
 def build_command(scenario: dict[str, Any], args: argparse.Namespace, paths: RunPaths) -> list[str]:
     return build_command_for_prompt(scenario, args, paths, str(scenario.get("prompt", "")), None)
 
@@ -570,6 +593,7 @@ def build_command_for_prompt(
     paths: RunPaths,
     prompt: str,
     resume_session_id: str | None,
+    fake_responses_base_url: str | None = None,
 ) -> list[str]:
     code_bin_value = args.code_bin or shutil.which("code")
     if not code_bin_value:
@@ -577,7 +601,7 @@ def build_command_for_prompt(
     code_bin = Path(code_bin_value)
     command = [str(code_bin), "exec", "--json", "--skip-git-repo-check"]
     max_seconds = scenario.get("max_seconds", args.max_seconds)
-    if max_seconds:
+    if max_seconds and code_exec_supports_option(code_bin, "--max-seconds"):
         command.extend(["--max-seconds", str(max_seconds)])
     command.extend(["-C", str(paths.workspace)])
     if scenario.get("include_plan_tool", False):
@@ -592,6 +616,8 @@ def build_command_for_prompt(
     sandbox = scenario.get("sandbox") or args.sandbox
     if sandbox:
         command.extend(["--sandbox", str(sandbox)])
+    if fake_responses_base_url:
+        command.extend(["-c", f"openai_base_url={json.dumps(fake_responses_base_url)}"])
     for override in scenario.get("config_overrides", []):
         command.extend(["-c", str(override)])
     if resume_session_id:
@@ -805,6 +831,31 @@ def assert_expectations(summary: dict[str, Any], scenario: dict[str, Any]) -> li
         expected = int(expect["responses_request_count"])
         if actual != expected:
             failures.append(f"responses request count expected {expected}, got {actual}")
+    event_counts = expect.get("event_count")
+    if isinstance(event_counts, dict):
+        actual_counts: dict[str, int] = {}
+        for event in summary.get("events", []):
+            if not isinstance(event, dict):
+                continue
+            event_type = event.get("type")
+            if isinstance(event_type, str):
+                actual_counts[event_type] = actual_counts.get(event_type, 0) + 1
+            msg = event.get("msg")
+            if isinstance(msg, dict):
+                msg_type = msg.get("type")
+                if isinstance(msg_type, str):
+                    key = f"msg.{msg_type}"
+                    actual_counts[key] = actual_counts.get(key, 0) + 1
+            item = event.get("item")
+            if isinstance(item, dict):
+                item_type = item.get("type")
+                if isinstance(item_type, str):
+                    key = f"item.{item_type}"
+                    actual_counts[key] = actual_counts.get(key, 0) + 1
+        for event_type, expected in event_counts.items():
+            actual = actual_counts.get(str(event_type), 0)
+            if actual != int(expected):
+                failures.append(f"event count for {event_type!r} expected {expected}, got {actual}")
     for assertion in expect.get("responses", []):
         if not isinstance(assertion, dict):
             failures.append("responses expectation entries must be objects")
@@ -911,7 +962,8 @@ def run_with_env(
             last_returncode = 0
             for index, turn in enumerate(turn_prompts, start=1):
                 prompt = str(turn.get("prompt", "") if isinstance(turn, dict) else turn)
-                command = build_command_for_prompt(scenario, args, paths, prompt, session_id)
+                fake_base_url = fake_server.base_url if fake_server is not None else None
+                command = build_command_for_prompt(scenario, args, paths, prompt, session_id, fake_base_url)
                 commands.append(command)
                 returncode, events = run_exec_capture(command, scenario, paths, run_env, f"turn-{index}")
                 all_events.extend(events)
@@ -924,7 +976,17 @@ def run_with_env(
                     session_id = session_id_from_summary_or_catalog(turn_summary, paths)
             return last_returncode, all_events, commands
 
-        command = build_command(scenario, args, paths)
+        if fake_server is not None:
+            command = build_command_for_prompt(
+                scenario,
+                args,
+                paths,
+                str(scenario.get("prompt", "")),
+                None,
+                fake_server.base_url,
+            )
+        else:
+            command = build_command(scenario, args, paths)
         returncode, events = run_exec_capture(command, scenario, paths, run_env, "turn-1")
         return returncode, events, [command]
 
@@ -956,6 +1018,7 @@ def run_with_env(
     summary = summarize(events, paths, returncode, summary_command)
     summary["commands"] = summary.get("commands", [])
     summary["scenario_commands"] = [" ".join(command) for command in commands]
+    summary["events"] = events
     summary["responses_requests"] = responses_requests
     failures = assert_expectations(summary, scenario)
     summary["expectation_failures"] = failures
diff --git a/tools/code-exec-harness/run-deterministic.sh b/tools/code-exec-harness/run-deterministic.sh
new file mode 100755
index 00000000000..378641f2484
--- /dev/null
+++ b/tools/code-exec-harness/run-deterministic.sh
@@ -0,0 +1,28 @@
+#!/usr/bin/env bash
+set -euo pipefail
+
+ROOT_DIR=$(cd "$(dirname "${BASH_SOURCE[0]}")/../.." && pwd)
+CODE_BIN=${CODE_EXEC_HARNESS_BIN:-"$ROOT_DIR/code-rs/target/dev-fast/code"}
+
+if [ ! -x "$CODE_BIN" ]; then
+  cat >&2 <manual-skill": 1,
-          "MANUAL_SKILL_BODY_MARKER": 1
+          "project guidance": 1,
+          "- implicit-skill: Implicit skill": 1,
+          "- manual-skill: Manual skill": 1,
+          "manual-skill": 1,
+          "MANUAL_SKILL_BODY_MARKER": 1
         }
       }
     ]
diff --git a/tools/code-exec-harness/scenarios/exec-basic-smoke.json b/tools/code-exec-harness/scenarios/exec-basic-smoke.json
new file mode 100644
index 00000000000..372832f1a9a
--- /dev/null
+++ b/tools/code-exec-harness/scenarios/exec-basic-smoke.json
@@ -0,0 +1,50 @@
+{
+  "name": "exec-basic-smoke",
+  "model": "gpt-5.1-codex",
+  "files": {
+    "README.md": "# Exec basic smoke fixture\n"
+  },
+  "prompt": "Reply with exactly: harness-ok",
+  "responses_api": {
+    "responses": [
+      {
+        "response_id": "resp-exec-basic-smoke",
+        "events": [
+          {
+            "item": {
+              "type": "message",
+              "role": "assistant",
+              "id": "msg-exec-basic-smoke",
+              "content": [
+                {
+                  "type": "output_text",
+                  "text": "harness-ok"
+                }
+              ]
+            }
+          }
+        ]
+      }
+    ]
+  },
+  "expect": {
+    "returncode": 0,
+    "assistant_contains": [
+      "harness-ok"
+    ],
+    "responses_request_count": 1,
+    "event_count": {
+      "item.agent_message": 1,
+      "turn.completed": 1
+    },
+    "responses": [
+      {
+        "request": 0,
+        "scope": "input",
+        "contains": "Reply with exactly: harness-ok"
+      }
+    ]
+  },
+  "max_seconds": 30,
+  "timeout_seconds": 90
+}
diff --git a/tools/code-exec-harness/scenarios/image-history-replay.json b/tools/code-exec-harness/scenarios/image-history-replay.json
index d6fc49e9bb7..adfdeeffff5 100644
--- a/tools/code-exec-harness/scenarios/image-history-replay.json
+++ b/tools/code-exec-harness/scenarios/image-history-replay.json
@@ -48,9 +48,9 @@
         "request": 1,
         "scope": "input",
         "image_payload_bytes": 0,
-        "contains": "image generation result omitted",
         "contains_all": [
-          "65536 bytes"
+          "Generated images are saved to",
+          "_image_id_.png"
         ],
         "not_contains": "data:image/"
       }
diff --git a/tools/code-exec-harness/scenarios/manual-skill-explicit-invocation.json b/tools/code-exec-harness/scenarios/manual-skill-explicit-invocation.json
index b93bab1f1be..dd1f5f976f1 100644
--- a/tools/code-exec-harness/scenarios/manual-skill-explicit-invocation.json
+++ b/tools/code-exec-harness/scenarios/manual-skill-explicit-invocation.json
@@ -22,11 +22,11 @@
         "request": 0,
         "scope": "input",
         "count": {
-          "### Available skills": 1,
-          "- implicit-skill: Implicit skill": 1,
-          "- manual-skill: Manual skill": 0,
-          "manual-skill": 1,
-          "MANUAL_SKILL_BODY_MARKER": 1
+          "### Available skills": 1,
+          "- implicit-skill: Implicit skill": 1,
+          "- manual-skill: Manual skill": 1,
+          "manual-skill": 1,
+          "MANUAL_SKILL_BODY_MARKER": 1
         }
       }
     ]
diff --git a/tools/code-exec-harness/scenarios/manual-skill-not-implicit.json b/tools/code-exec-harness/scenarios/manual-skill-not-implicit.json
index ce242709241..4d3203b0f3b 100644
--- a/tools/code-exec-harness/scenarios/manual-skill-not-implicit.json
+++ b/tools/code-exec-harness/scenarios/manual-skill-not-implicit.json
@@ -22,10 +22,11 @@
         "request": 0,
         "scope": "input",
         "count": {
-          "### Available skills": 1,
-          "- implicit-skill: Implicit skill": 1,
-          "- manual-skill: Manual skill": 0
-        }
+          "### Available skills": 1,
+          "- implicit-skill: Implicit skill": 1,
+          "- manual-skill: Manual skill": 1,
+          "Use the manual skill only when explicitly requested.": 0
+        }
       }
     ]
   },