bench(parity): cg HTTP and cg-mcp share the same 8-verb surface #696

Draft
DvirDukhan wants to merge 3 commits into dvirdukhan/api-v2-mcp-parity from dvirdukhan/bench-mcp-parity

@DvirDukhan
Contributor

Summary

Pairs with #api-v2 (the /api/v2/* MCP-parity endpoints). With those endpoints in place, the SWE-bench harness can now run the HTTP-transport sibling (cg) on the same verb surface as the stdio-MCP sibling (cg-mcp), so a head-to-head benchmark measures transport overhead rather than API-surface differences.

Changes

  • bench/agents/code_graph_adapter.py: add v2 client methods on CodeGraphClient that POST to the new /api/v2/* endpoints (search_code, get_callers, get_callees, get_dependencies, impact_analysis, find_path_v2, ask_v2). Existing UI-shaped methods kept for back-compat with tests/test_cli.py.
  • bench/cli/cg.py: rewrite to expose the 8 MCP-style verbs (index_repo, search_code, get_callers, get_callees, get_dependencies, impact_analysis, find_path, ask) alongside the legacy UI verbs. Mirrors cg_mcp.py's _compact_list / _strip_worktree_prefix helpers so token compaction is byte-identical between transports.
  • bench/runners/mini_runner.py: INSTANCE_TEMPLATE_CODE_GRAPH now documents the new verb surface. The cg track exports PROJECT_NAME + BRANCH like the MCP track, and indexes via /api/analyze_folder with explicit branch=_default so both tracks share the code:<project>:<branch> graph namespace.
  • bench/tools/code_graph/system_preamble.md: rewritten to mirror bench/tools/code_graph_mcp/system_preamble.md verb-for-verb.
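
To make the shared surface concrete, here is a minimal sketch of how a v2 call and the shared graph namespace might be built. The verb names and the code:<project>:<branch> key come from the description above. The one-route-per-verb URL layout, the JSON payload fields, the server address, and the helper names are assumptions for illustration only and are not the actual CodeGraphClient API.

```python
import json
import urllib.request

# Verb names as listed in this PR. index_repo is left out because the
# description says the cg track indexes via /api/analyze_folder.
V2_VERBS = (
    "search_code", "get_callers", "get_callees", "get_dependencies",
    "impact_analysis", "find_path", "ask",
)


def graph_key(project: str, branch: str = "_default") -> str:
    """Namespace shared by both tracks: code:<project>:<branch>."""
    return f"code:{project}:{branch}"


def build_v2_request(base_url: str, verb: str, payload: dict) -> urllib.request.Request:
    """Build (but do not send) a POST request for one verb.

    ASSUMPTION: each verb maps to /api/v2/<verb>; the real routes may differ.
    """
    if verb not in V2_VERBS:
        raise ValueError(f"unknown verb: {verb}")
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/api/v2/{verb}",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = build_v2_request(
    "http://localhost:5000",  # hypothetical server address
    "get_callers",
    # Hypothetical payload fields, not taken from the PR.
    {"project": "pytest", "branch": "_default", "symbol": "getmodpath"},
)
```

Building the request separately from sending it keeps the sketch runnable without a server; a real client would pass `req` to `urllib.request.urlopen` or use its own HTTP library.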

Validation

Parity verified byte-for-byte on a pre-indexed pytest-6202 graph: the cg search_code, get_callers, get_callees, and impact_analysis verbs return output identical to their cg-mcp equivalents (a 1 KB payload was diffed). All 27 existing bench + CLI tests still pass.
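
A byte-for-byte check of this kind could be scripted roughly as follows. This is a sketch only: the function name and the placeholder payloads are hypothetical, and the PR does not say how the diff was actually produced.

```python
import difflib
import hashlib


def parity_report(http_out: bytes, mcp_out: bytes) -> dict:
    """Compare two tool outputs byte-for-byte; include a diff on mismatch."""
    identical = http_out == mcp_out
    diff = []
    if not identical:
        diff = list(difflib.unified_diff(
            http_out.decode("utf-8", errors="replace").splitlines(),
            mcp_out.decode("utf-8", errors="replace").splitlines(),
            fromfile="cg", tofile="cg-mcp", lineterm="",
        ))
    return {
        "identical": identical,
        "sha256_cg": hashlib.sha256(http_out).hexdigest(),
        "sha256_cg_mcp": hashlib.sha256(mcp_out).hexdigest(),
        "diff": diff,
    }


# Placeholder payloads; in practice these would be the captured output of
# a cg verb and of the corresponding cg-mcp tool call.
same = parity_report(b"a.py:10 foo\n", b"a.py:10 foo\n")
different = parity_report(b"a.py:10 foo\n", b"/work/a.py:10 foo\n")
```

Comparing raw bytes rather than parsed JSON is deliberate here: it also catches whitespace or path-prefix differences that would change token counts.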

Stacked

  • Base: dvirdukhan/api-v2-mcp-parity (needs the v2 endpoints).

Pairs with #api-v2 (api/v2/* MCP-parity endpoints). With those
endpoints in place, the bench harness can now run the HTTP-transport
sibling (cg) on the same verb surface as the stdio-MCP sibling
(cg-mcp), so a head-to-head benchmark measures *transport overhead*
rather than API-surface differences.

Changes:

* bench/agents/code_graph_adapter.py — add v2 client methods on
  CodeGraphClient that POST to the new /api/v2/* endpoints
  (search_code, get_callers, get_callees, get_dependencies,
  impact_analysis, find_path_v2, ask_v2). Existing UI-shaped
  methods (graph_entities, get_neighbors, find_paths, ...) kept
  for back-compat with tests/test_cli.py.

* bench/cli/cg.py — rewrite to expose the 8 MCP-style verbs
  (index_repo, search_code, get_callers, get_callees,
  get_dependencies, impact_analysis, find_path, ask) alongside the
  legacy UI verbs. Mirrors cg_mcp.py's _compact_list /
  _strip_worktree_prefix helpers so token compaction is
  byte-identical between transports.

* bench/runners/mini_runner.py — INSTANCE_TEMPLATE_CODE_GRAPH now
  documents the new verb surface. The cg track exports
  PROJECT_NAME + BRANCH like the MCP track, and indexes via
  /api/analyze_folder with explicit branch=_default so both tracks
  share the code:<project>:<branch> graph namespace.

* bench/tools/code_graph/system_preamble.md — rewritten to mirror
  bench/tools/code_graph_mcp/system_preamble.md verb-for-verb.

Parity verified byte-for-byte on a pre-indexed pytest-6202 graph:
cg search_code/get_callers/get_callees/impact_analysis returns
identical output to the cg-mcp equivalents (1 KB payload diff'd).
All 27 existing bench + CLI tests still pass.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@coderabbitai
Contributor

coderabbitai Bot commented May 29, 2026

Important

Review skipped

Draft detected.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.


DvirDukhan and others added 2 commits May 30, 2026 07:58
Iter3 root-cause: with the verb surfaces and tool outputs now
byte-identical between the HTTP (cg) and MCP (cg-mcp) tracks, the
remaining token gap traced entirely to reading strategy. On 2/10
instances the agent fell into a 19x full-file `cat` loop instead of
reading the bounded span the graph already pointed at, inflating
input tokens 3-4x on those instances.

Both preambles now explicitly forbid `cat`-ing a whole source file
and require `sed -n 'START,ENDp'` anchored on the graph's line
number. This attacks the actual token driver and applies equally to
both transports so a head-to-head stays apples-to-apples.
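
As an illustration of the bounded read the preambles now require, the sketch below turns a graph-reported line number into a `sed -n 'START,ENDp'` command. The helper name, the window sizes, and the example path and line number are assumptions, not values taken from the preambles.

```python
import shlex


def bounded_read_command(path: str, line: int, before: int = 10, after: int = 30) -> str:
    """Return a `sed -n 'START,ENDp'` command for a span around `line`.

    ASSUMPTION: the window sizes are illustrative only.
    """
    if line < 1:
        raise ValueError("line numbers are 1-based")
    start = max(1, line - before)  # clamp so the span never starts before line 1
    end = line + after
    return f"sed -n '{start},{end}p' {shlex.quote(path)}"


# Hypothetical path and line number, as a graph lookup might report them.
print(bounded_read_command("src/_pytest/python.py", 285))
# prints: sed -n '275,315p' src/_pytest/python.py
```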

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

sample_instances() was called with only `stage` (size from
STAGE_SIZES), then the result was sliced `[:limit]`. That let
--limit shrink the sample below the stage size but never grow it,
so `--stage calibration --limit 40` silently ran just 10 instances.

Pass n=args.limit straight into sample_instances so the limit sets
the exact sample size (falling back to the stage size when unset).
Because random.sample is prefix-stable for our seed, the n=10
calibration set stays a subset of the n=40 set, so existing
trajectories/indexed graphs still resume-skip cleanly.
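
The fixed behavior can be sketched as below. The signature, the seeding, and the placeholder instance ids are assumptions; only the calibration size of 10 and the `--limit 40` case come from the message above.

```python
import random

STAGE_SIZES = {"calibration": 10}  # only the size stated in the commit message


def sample_instances(instances, stage, n=None, seed=0):
    """Draw n instances, falling back to the stage size when n is unset.

    ASSUMPTION: signature and seeding are illustrative; the harness may differ.
    """
    size = STAGE_SIZES[stage] if n is None else n
    return random.Random(seed).sample(instances, size)


instances = [f"instance-{i}" for i in range(500)]         # placeholder ids
small = sample_instances(instances, "calibration")        # no --limit
large = sample_instances(instances, "calibration", n=40)  # --limit 40

# Check the prefix property rather than assume it: it is a CPython
# implementation detail, not a documented guarantee of random.sample.
prefix_stable = large[:len(small)] == small
```

CPython's `random.sample` switches between two selection strategies depending on population and sample size, so a check like `prefix_stable` is worth keeping in the harness if resume-skipping depends on the smaller set being contained in the larger one.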

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>