3 changes: 2 additions & 1 deletion .github/github.json
@@ -51,7 +51,8 @@
 "verifyGitHubRelease": "gh release view v<version> --repo cbusillo/code",
 "upstreamCursors": "just local-upstream-cursors",
 "upstreamImport": "just local-upstream-import",
-"localReleaseNotes": "just local-release-notes"
+"localReleaseNotes": "just local-release-notes",
+"execHarnessSmoke": "just harness-smoke"
 },
 "docsRequiredWhen": [
 "CLI behavior changes",
9 changes: 6 additions & 3 deletions docs/codex-fork-parity-ledger.md
@@ -73,6 +73,7 @@ preserving the old module boundary.
| Agents, multi-agent workflows, and selectors | `code-rs/core/src/agent_defaults.rs`, `code-rs/core/src/agent_tool.rs`, `code-rs/core/tests/agent_completion_wake.rs`, `code-rs/core/tests/antigravity_agent_spec.rs`, `code-rs/tui/src/chatwidget/agent*.rs`, `code-rs/tui/src/history_cell/agent.rs`, `vt100_chatwidget_snapshot__agent_*.snap`, `settings_overlay_agents.rs` | Docs still describe built-in multi-agent selectors. Current `main` may contain newer agent infrastructure, but old TUI/status/test parity needs confirmation. | Port | Create focused fixtures for selector resolution, agent run grouping, agent status errors, and settings UI before changing agent orchestration. |
| Config, identity, skills, and prompts | Deleted old `code-rs/core/src/config*`, `config_loader/*`, `custom_prompts.rs`, `slash_commands.rs`, `skills/*`, `external_agent_config.rs`, and tests such as `custom_prompts_discovery.rs`, `skill_command_policy.rs`, `prompt_context_dedup.rs`, `external_agent_config` snapshots | PR #391 and #394 restored important config compatibility, including `CODE_HOME` precedence and legacy `[tui] alternate_screen`. Current Codex-base has its own config, prompt, and skills systems. | Rewrite | Keep compatibility tests for `CODE_HOME`, legacy config shapes, prompts, skills, and external agent import. Mark individual probes covered only after linking current tests. |
| Exec, sandbox, patch, and validation harness | Deleted old `patch_harness.rs`, `workflow_validation.rs`, `dry_run_guard.rs`, `command_safety/*`, `git_worktree.rs`, `exec_command/*`, `tool_hooks.rs`, `git_mutation_guard.rs`, `stuck_exec.rs`, `exec_completion_test.rs`, `wayland_clipboard_feature_regression.rs`, `windows_altgr.rs` | Codex-base has richer exec/sandbox/apply-patch crates. Every Code-specific validator discovery and git safety behavior must be compared rather than restored wholesale. | Rewrite | Add fixtures for changed-file validation, git mutation safety, shell completion, and platform shortcut regressions only where Codex-base lacks equivalent tests. |
| Dogfood exec harness gate | `tools/code-exec-harness/` was preserved across the pivot, but pre-pivot dogfood exec coverage has not yet been recast as the first fork-baseline gate | #404 defines Dogfood Parity 1. #405 owns the deterministic no-live-token `code exec --json` smoke gate using the dev-fast `code` binary from this checkout. | Rewrite | Make `just harness-smoke` pass after `./build-fast.sh`, then extend the suite with old high-signal exec, resume, sandbox, and validation probes before claiming daily-driver parity. |
| Session catalog, resume, rollout, memory, and history | Deleted `active_sessions.rs`, `session_catalog.rs`, `rollout/*`, `conversation_history.rs`, `message_history.rs`, `context_ledger.rs`, `context_timeline/*`, `memories/*`, `retention.rs`, tests including `active_session_warnings.rs`, `session_catalog_resume.rs`, `image_history_replay.rs`, `resume_catalog_integration.rs`, `resume_replay.rs`, `history_cutoff_probe.rs` | Codex-base has thread store/history/resume/rollback primitives and recent resume CLI tests. External-session continuity remains unproven. | Rewrite | Prioritize external-session and image/history replay fixtures. Treat local resume picker parity as covered only after matching current tests are identified. |
| UI polish and layout regression snapshots | `bottom_spacer_clip_regression.rs`, `ui_smoke.rs`, `mid_turn_assistant_styling.rs`, `mid_turn_queueing.rs`, `mcp_session_cleanup.rs`, `non_windows_shortcuts.rs`, bottom spacer snapshots, settings/help snapshots, multiline/tool activity snapshots | Current Codex TUI has its own snapshot suite. Old snapshots may assert implementation-specific styling, but many captured useful regressions around wrapping, spacing, queueing, and keyboard behavior. | Defer | Recreate high-value regressions against current TUI only when they guard active overlay behavior or known user-facing polish regressions. |
| Build, release, dogfood, and cleanup workflows | Pre-#390 workflow/docs/scripts should be compared separately from code/test parity, including release guardrails and local cleanup behavior | #396 owns build-cache strategy follow-up. Release/dogfood docs now need a provisional-baseline warning until parity status is explicit. | Defer | Keep #396 for cache behavior. Add release guardrails before release-bound work; do not treat this first ledger as product parity. |
@@ -102,9 +103,11 @@ Created or existing focused issues before implementation work in these areas:
 3. #400: Auto Review proof metrics and background review evidence.
 4. #388: External-session continuity fixture for GitHub/LaunchPlane/remote-origin sessions.
 5. #401: Token/rate-limit/prompt-cache diagnostics parity.
-6. Needed: Agent selector and multi-agent status parity.
-7. Needed: Exec/patch/validation harness parity.
-8. Needed: High-signal UI/platform probes not owned by #398/#399/#401.
+6. #404: Dogfood Parity 1 daily-driver baseline.
+7. #405: Make `tools/code-exec-harness` the deterministic P0 exec gate.
+8. Needed: Agent selector and multi-agent status parity.
+9. Needed: Exec/patch/validation harness parity beyond the first smoke gate.
+10. Needed: High-signal UI/platform probes not owned by #398/#399/#401.

 ## Update Protocol

4 changes: 4 additions & 0 deletions justfile
@@ -91,6 +91,10 @@ local-product-health:
local-fork-health:
    ./scripts/local/fork-health.sh

[no-cd]
harness-smoke:
    ./tools/code-exec-harness/run-deterministic.sh

[no-cd]
local-cleanup-space *args:
    ./scripts/local/cleanup-space.sh "$@"
7 changes: 7 additions & 0 deletions tools/code-exec-harness/AGENTS.md
@@ -4,6 +4,13 @@ Use this harness as the default proving ground for realistic `code exec`
behavior when working on token efficiency, prompt/context composition, skills,
memory, compaction, resume, model routing, tool choice, and GitHub automation.

For Dogfood Parity 1, the deterministic no-token smoke suite is the P0 gate for
the built `code` binary. Run `just harness-smoke` after `./build-fast.sh`; it
uses `code-rs/target/dev-fast/code` by default and fails clearly if the binary is
missing or the covered `code exec --json` contract regresses. Use
`CODE_EXEC_HARNESS_BIN=/path/to/code just harness-smoke` only when intentionally
checking another binary.

The goal is effectiveness-adjusted token efficiency: reduce wasted tokens only
when task success, reliability, instruction following, and useful tool behavior
are preserved or improved. Saving tokens by making the agent less capable is a
30 changes: 30 additions & 0 deletions tools/code-exec-harness/README.md
@@ -28,6 +28,36 @@ points at the copied GitHub CLI config inside that redirected home.

## Run

Run the deterministic no-token smoke suite after `./build-fast.sh`:

```sh
just harness-smoke
```

`just harness-smoke` targets `code-rs/target/dev-fast/code` by default so the
suite validates the binary built by this checkout, not whichever `code` happens
to be on `PATH`. To test another binary, set `CODE_EXEC_HARNESS_BIN`:

```sh
CODE_EXEC_HARNESS_BIN=/path/to/code just harness-smoke
```
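
The binary selection described above can be sketched in Python. This is a minimal illustration that assumes shell-style `${VAR:-default}` semantics (an unset or empty override falls back to the default); `resolve_code_bin` and `is_runnable` are hypothetical helper names, not functions from the harness.

```python
import os
from pathlib import Path


def resolve_code_bin(root_dir: Path, env: dict[str, str]) -> Path:
    """Pick the binary under test (hypothetical helper, not part of the harness)."""
    override = env.get("CODE_EXEC_HARNESS_BIN", "")
    # An unset or empty override falls back to the dev-fast build of this checkout.
    if override:
        return Path(override)
    return root_dir / "code-rs" / "target" / "dev-fast" / "code"


def is_runnable(path: Path) -> bool:
    """True only for an existing regular file with the executable bit set."""
    return path.is_file() and os.access(path, os.X_OK)


if __name__ == "__main__":
    code_bin = resolve_code_bin(Path.cwd(), dict(os.environ))
    status = "ok" if is_runnable(code_bin) else "missing or not executable"
    print(f"{code_bin}: {status}")
```

Resolving the path from the checkout root rather than from `PATH` is what keeps the gate tied to the binary that was just built.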

The deterministic suite uses fake `/v1/responses` fixtures and does not spend
live model tokens. It is the Dogfood Parity 1 P0 gate for `code exec --json`:
if the built binary is missing, cannot start, stops emitting expected JSONL, or
regresses the covered request-shape contracts, this suite should fail.

Run a single scenario directly when you need narrower evidence:

```sh
python3 tools/code-exec-harness/harness.py \
tools/code-exec-harness/scenarios/exec-basic-smoke.json \
--code-bin code-rs/target/dev-fast/code
```

Run the live GitHub planning smoke only when model behavior is the thing being
tested:

```sh
python3 tools/code-exec-harness/harness.py \
tools/code-exec-harness/scenarios/github-plan-smoke.json \
71 changes: 67 additions & 4 deletions tools/code-exec-harness/harness.py
@@ -7,6 +7,7 @@
import os
import re
import shutil
import shlex
import socketserver
import subprocess
import sys
@@ -312,6 +313,12 @@ def materialize_workspace(scenario: dict[str, Any], paths: RunPaths) -> None:
def materialize_skills(scenario: dict[str, Any], paths: RunPaths, scenario_dir: Path, extra_roots: list[Path]) -> None:
    skills_dir = paths.code_home / "skills"
    skills_dir.mkdir(parents=True, exist_ok=True)
    workspace_skills = paths.workspace / ".code" / "skills"
    if workspace_skills.is_dir():
        for child in sorted(workspace_skills.iterdir()):
            if child.is_dir() and (child / "SKILL.md").is_file():
                copy_or_link(child, skills_dir / child.name, symlink=True)

    roots: list[Path] = []
    for value in scenario.get("skill_roots", []):
        roots.append(resolve_path(str(value), scenario_dir))
@@ -556,10 +563,26 @@ def write_fake_gh(scenario: dict[str, Any], paths: RunPaths) -> dict[str, Path]
     shim = paths.bin_dir / "gh"
     put_text(shim, FAKE_GH)
     shim.chmod(0o755)
-    put_text(paths.shell_home / ".zshenv", f"gh() {{ {shim} \"$@\"; }}\n")
+    quoted_shim = shlex.quote(str(shim))
+    put_text(paths.shell_home / ".zshenv", f"gh() {{ {quoted_shim} \"$@\"; }}\n")
     return {"fixture": fixture_path, "log": log_path, "state": state_path}


+def code_exec_supports_option(code_bin: Path, option: str) -> bool:
+    try:
+        result = subprocess.run(
+            [str(code_bin), "exec", "--help"],
+            check=False,
+            stdout=subprocess.PIPE,
+            stderr=subprocess.PIPE,
+            text=True,
+            timeout=10,
+        )
+    except (OSError, subprocess.TimeoutExpired):
+        return False
+    return option in f"{result.stdout}\n{result.stderr}"
+
+
 def build_command(scenario: dict[str, Any], args: argparse.Namespace, paths: RunPaths) -> list[str]:
     return build_command_for_prompt(scenario, args, paths, str(scenario.get("prompt", "")), None)
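
The `shlex.quote` change in `write_fake_gh` matters when the run directory contains spaces or shell metacharacters. A minimal standalone sketch of the same idea follows; `zsh_wrapper_line` and the example paths are hypothetical and only illustrate the quoting behavior.

```python
import shlex


def zsh_wrapper_line(name: str, target: str) -> str:
    """Build a one-line shell function that forwards all arguments to `target`."""
    quoted = shlex.quote(target)  # adds single quotes only when the path needs them
    return f'{name}() {{ {quoted} "$@"; }}\n'


plain = zsh_wrapper_line("gh", "/tmp/run/bin/gh")
spaced = zsh_wrapper_line("gh", "/tmp/run dir/bin/gh")
print(plain, end="")   # gh() { /tmp/run/bin/gh "$@"; }
print(spaced, end="")  # gh() { '/tmp/run dir/bin/gh' "$@"; }
```

Without quoting, the second path would split into two words and the shell would try to run `/tmp/run` with `dir/bin/gh` as its first argument.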

@@ -570,14 +593,15 @@ def build_command_for_prompt(
     paths: RunPaths,
     prompt: str,
     resume_session_id: str | None,
+    fake_responses_base_url: str | None = None,
 ) -> list[str]:
     code_bin_value = args.code_bin or shutil.which("code")
     if not code_bin_value:
         raise HarnessError("could not find `code`; pass --code-bin")
     code_bin = Path(code_bin_value)
     command = [str(code_bin), "exec", "--json", "--skip-git-repo-check"]
     max_seconds = scenario.get("max_seconds", args.max_seconds)
-    if max_seconds:
+    if max_seconds and code_exec_supports_option(code_bin, "--max-seconds"):
         command.extend(["--max-seconds", str(max_seconds)])
     command.extend(["-C", str(paths.workspace)])
     if scenario.get("include_plan_tool", False):
@@ -592,6 +616,8 @@ def build_command_for_prompt(
     sandbox = scenario.get("sandbox") or args.sandbox
     if sandbox:
         command.extend(["--sandbox", str(sandbox)])
+    if fake_responses_base_url:
+        command.extend(["-c", f"openai_base_url={json.dumps(fake_responses_base_url)}"])
     for override in scenario.get("config_overrides", []):
         command.extend(["-c", str(override)])
     if resume_session_id:
@@ -805,6 +831,31 @@ def assert_expectations(summary: dict[str, Any], scenario: dict[str, Any]) -> li
        expected = int(expect["responses_request_count"])
        if actual != expected:
            failures.append(f"responses request count expected {expected}, got {actual}")
    event_counts = expect.get("event_count")
    if isinstance(event_counts, dict):
        actual_counts: dict[str, int] = {}
        for event in summary.get("events", []):
            if not isinstance(event, dict):
                continue
            event_type = event.get("type")
            if isinstance(event_type, str):
                actual_counts[event_type] = actual_counts.get(event_type, 0) + 1
            msg = event.get("msg")
            if isinstance(msg, dict):
                msg_type = msg.get("type")
                if isinstance(msg_type, str):
                    key = f"msg.{msg_type}"
                    actual_counts[key] = actual_counts.get(key, 0) + 1
            item = event.get("item")
            if isinstance(item, dict):
                item_type = item.get("type")
                if isinstance(item_type, str):
                    key = f"item.{item_type}"
                    actual_counts[key] = actual_counts.get(key, 0) + 1
        for event_type, expected in event_counts.items():
            actual = actual_counts.get(str(event_type), 0)
            if actual != int(expected):
                failures.append(f"event count for {event_type!r} expected {expected}, got {actual}")
    for assertion in expect.get("responses", []):
        if not isinstance(assertion, dict):
            failures.append("responses expectation entries must be objects")
@@ -911,7 +962,8 @@ def run_with_env(
         last_returncode = 0
         for index, turn in enumerate(turn_prompts, start=1):
             prompt = str(turn.get("prompt", "") if isinstance(turn, dict) else turn)
-            command = build_command_for_prompt(scenario, args, paths, prompt, session_id)
+            fake_base_url = fake_server.base_url if fake_server is not None else None
+            command = build_command_for_prompt(scenario, args, paths, prompt, session_id, fake_base_url)
             commands.append(command)
             returncode, events = run_exec_capture(command, scenario, paths, run_env, f"turn-{index}")
             all_events.extend(events)
@@ -924,7 +976,17 @@ def run_with_env(
             session_id = session_id_from_summary_or_catalog(turn_summary, paths)
         return last_returncode, all_events, commands

-    command = build_command(scenario, args, paths)
+    if fake_server is not None:
+        command = build_command_for_prompt(
+            scenario,
+            args,
+            paths,
+            str(scenario.get("prompt", "")),
+            None,
+            fake_server.base_url,
+        )
+    else:
+        command = build_command(scenario, args, paths)
     returncode, events = run_exec_capture(command, scenario, paths, run_env, "turn-1")
     return returncode, events, [command]

@@ -956,6 +1018,7 @@ def run_with_env(
summary = summarize(events, paths, returncode, summary_command)
summary["commands"] = summary.get("commands", [])
summary["scenario_commands"] = [" ".join(command) for command in commands]
summary["events"] = events
summary["responses_requests"] = responses_requests
failures = assert_expectations(summary, scenario)
summary["expectation_failures"] = failures
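
The capability probe added in `code_exec_supports_option` can be exercised in isolation. The sketch below generalizes it to any command line so it runs without a built `code` binary; `help_mentions` is a hypothetical name, and the "help text" is printed by a stand-in Python process rather than a real CLI.

```python
import subprocess
import sys


def help_mentions(argv: list[str], option: str, timeout: float = 10.0) -> bool:
    """Return True if running `argv` prints `option` on stdout or stderr."""
    try:
        result = subprocess.run(
            argv,
            check=False,
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE,
            text=True,
            timeout=timeout,
        )
    except (OSError, subprocess.TimeoutExpired):
        # A missing binary or a hung process counts as "not supported".
        return False
    return option in f"{result.stdout}\n{result.stderr}"


# Stand-in for a CLI's help output; the printed text is made up for this sketch.
fake_help = [sys.executable, "-c", "print('  --max-seconds <N>  stop after N seconds')"]
print(help_mentions(fake_help, "--max-seconds"))   # True
print(help_mentions(fake_help, "--no-such-flag"))  # False
print(help_mentions(["/nonexistent/code", "--help"], "--max-seconds"))  # False
```

Because this is a plain substring check, a shorter string such as `--max` would also match. That looks acceptable for a smoke gate, but it is worth keeping in mind if options with shared prefixes are ever probed.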
28 changes: 28 additions & 0 deletions tools/code-exec-harness/run-deterministic.sh
@@ -0,0 +1,28 @@
#!/usr/bin/env bash
set -euo pipefail

ROOT_DIR=$(cd "$(dirname "${BASH_SOURCE[0]}")/../.." && pwd)
CODE_BIN=${CODE_EXEC_HARNESS_BIN:-"$ROOT_DIR/code-rs/target/dev-fast/code"}

if [ ! -x "$CODE_BIN" ]; then
  cat >&2 <<EOF
error: code exec harness binary is missing or not executable:
  $CODE_BIN

Run ./build-fast.sh first, or set CODE_EXEC_HARNESS_BIN=/path/to/code.
EOF
  exit 2
fi

scenarios=(
  "$ROOT_DIR/tools/code-exec-harness/scenarios/context-ledger-request-summary.json"
  "$ROOT_DIR/tools/code-exec-harness/scenarios/exec-basic-smoke.json"
  "$ROOT_DIR/tools/code-exec-harness/scenarios/image-history-replay.json"
  "$ROOT_DIR/tools/code-exec-harness/scenarios/manual-skill-explicit-invocation.json"
  "$ROOT_DIR/tools/code-exec-harness/scenarios/manual-skill-not-implicit.json"
  "$ROOT_DIR/tools/code-exec-harness/scenarios/project-doc-skill-dedup.json"
)

python3 "$ROOT_DIR/tools/code-exec-harness/harness.py" \
  "${scenarios[@]}" \
  --code-bin "$CODE_BIN"
Original file line number Diff line number Diff line change
@@ -21,22 +21,16 @@
 "expect": {
 "returncode": 0,
 "responses_request_count": 1,
-"stderr_contains": [
-"assembled context ledger for responses request",
-"SkillsManifest",
-"ExplicitSkill",
-"ToolSchema"
-],
-"responses": [
+"responses": [
 {
 "request": 0,
 "scope": "input",
 "count": {
-"project guidance": 1,
-"- implicit-skill: Implicit skill": 1,
-"- manual-skill: Manual skill": 0,
-"<name>manual-skill</name>": 1,
-"MANUAL_SKILL_BODY_MARKER": 1
+"project guidance": 1,
+"- implicit-skill: Implicit skill": 1,
+"- manual-skill: Manual skill": 1,
+"<name>manual-skill</name>": 1,
+"MANUAL_SKILL_BODY_MARKER": 1
 }
 }
 ]
50 changes: 50 additions & 0 deletions tools/code-exec-harness/scenarios/exec-basic-smoke.json
Original file line number Diff line number Diff line change
@@ -0,0 +1,50 @@
{
"name": "exec-basic-smoke",
"model": "gpt-5.1-codex",
"files": {
"README.md": "# Exec basic smoke fixture\n"
},
"prompt": "Reply with exactly: harness-ok",
"responses_api": {
"responses": [
{
"response_id": "resp-exec-basic-smoke",
"events": [
{
"item": {
"type": "message",
"role": "assistant",
"id": "msg-exec-basic-smoke",
"content": [
{
"type": "output_text",
"text": "harness-ok"
}
]
}
}
]
}
]
},
"expect": {
"returncode": 0,
"assistant_contains": [
"harness-ok"
],
"responses_request_count": 1,
"event_count": {
"item.agent_message": 1,
"turn.completed": 1
},
"responses": [
{
"request": 0,
"scope": "input",
"contains": "Reply with exactly: harness-ok"
}
]
},
"max_seconds": 30,
"timeout_seconds": 90
}
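
The `event_count` keys in this scenario follow the naming used by the new check in `assert_expectations`: a bare key counts top-level event `type` values, while `msg.<type>` and `item.<type>` count the nested payload types. A minimal standalone sketch of that tally follows; `tally_event_types` is a hypothetical name, and the sample events are made-up shapes for illustration, not captured `code exec --json` output.

```python
from typing import Any


def tally_event_types(events: list[Any]) -> dict[str, int]:
    """Count top-level event types plus `msg.<type>` and `item.<type>` payload types."""
    counts: dict[str, int] = {}

    def bump(key: str) -> None:
        counts[key] = counts.get(key, 0) + 1

    for event in events:
        if not isinstance(event, dict):
            continue  # tolerate non-object JSONL lines
        if isinstance(event.get("type"), str):
            bump(event["type"])
        for field in ("msg", "item"):
            nested = event.get(field)
            if isinstance(nested, dict) and isinstance(nested.get("type"), str):
                bump(f"{field}.{nested['type']}")
    return counts


# Hypothetical events, shaped only to exercise the counting rules.
sample_events = [
    {"type": "item.completed", "item": {"type": "agent_message", "text": "harness-ok"}},
    {"type": "turn.completed"},
    "not-a-dict",
]
expected = {"item.agent_message": 1, "turn.completed": 1}
counts = tally_event_types(sample_events)
mismatches = {k: (v, counts.get(k, 0)) for k, v in expected.items() if counts.get(k, 0) != v}
print(counts)
print(mismatches)  # {}
```

One property of this key scheme: a top-level type such as `item.completed` shares a namespace with nested item types, so a nested item whose type is `completed` would increment the same key.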
4 changes: 2 additions & 2 deletions tools/code-exec-harness/scenarios/image-history-replay.json
@@ -48,9 +48,9 @@
 "request": 1,
 "scope": "input",
 "image_payload_bytes": 0,
-"contains": "image generation result omitted",
 "contains_all": [
-"65536 bytes"
+"Generated images are saved to",
+"_image_id_.png"
 ],
 "not_contains": "data:image/"
 }
Original file line number Diff line number Diff line change
@@ -22,11 +22,11 @@
 "request": 0,
 "scope": "input",
 "count": {
-"### Available skills": 1,
-"- implicit-skill: Implicit skill": 1,
-"- manual-skill: Manual skill": 0,
-"<name>manual-skill</name>": 1,
-"MANUAL_SKILL_BODY_MARKER": 1
+"### Available skills": 1,
+"- implicit-skill: Implicit skill": 1,
+"- manual-skill: Manual skill": 1,
+"<name>manual-skill</name>": 1,
+"MANUAL_SKILL_BODY_MARKER": 1
 }
 }
 ]
Original file line number Diff line number Diff line change
@@ -22,10 +22,11 @@
 "request": 0,
 "scope": "input",
 "count": {
-"### Available skills": 1,
-"- implicit-skill: Implicit skill": 1,
-"- manual-skill: Manual skill": 0
-}
+"### Available skills": 1,
+"- implicit-skill: Implicit skill": 1,
+"- manual-skill: Manual skill": 1,
+"Use the manual skill only when explicitly requested.": 0
+}
 }
 ]
 },