bench(parity): cg HTTP and cg-mcp share the same 8-verb surface #696
Draft
DvirDukhan wants to merge 3 commits into
Pairs with #api-v2 (api/v2/* MCP-parity endpoints). With those endpoints in place, the bench harness can now run the HTTP-transport sibling (cg) on the same verb surface as the stdio-MCP sibling (cg-mcp), so a head-to-head benchmark measures *transport overhead* rather than API-surface differences.

Changes:

* bench/agents/code_graph_adapter.py: add v2 client methods on CodeGraphClient that POST to the new /api/v2/* endpoints (search_code, get_callers, get_callees, get_dependencies, impact_analysis, find_path_v2, ask_v2). Existing UI-shaped methods (graph_entities, get_neighbors, find_paths, ...) kept for back-compat with tests/test_cli.py.
* bench/cli/cg.py: rewrite to expose the 8 MCP-style verbs (index_repo, search_code, get_callers, get_callees, get_dependencies, impact_analysis, find_path, ask) alongside the legacy UI verbs. Mirrors cg_mcp.py's _compact_list / _strip_worktree_prefix helpers so token compaction is byte-identical between transports.
* bench/runners/mini_runner.py: INSTANCE_TEMPLATE_CODE_GRAPH now documents the new verb surface. The cg track exports PROJECT_NAME + BRANCH like the MCP track, and indexes via /api/analyze_folder with explicit branch=_default so both tracks share the code:<project>:<branch> graph namespace.
* bench/tools/code_graph/system_preamble.md: rewritten to mirror bench/tools/code_graph_mcp/system_preamble.md verb-for-verb.

Parity verified byte-for-byte on a pre-indexed pytest-6202 graph: cg search_code/get_callers/get_callees/impact_analysis returns identical output to the cg-mcp equivalents (1 KB payload diff'd). All 27 existing bench + CLI tests still pass.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
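To make the shared namespace and the POST-per-verb shape concrete, here is a minimal sketch. It is not the code in `CodeGraphClient`: the helper names (`graph_namespace`, `build_v2_request`), the one-path-per-verb layout under `/api/v2/`, and the payload fields are all assumptions for illustration. The commit only states that the client POSTs to `/api/v2/*` and that both tracks share `code:<project>:<branch>` with `branch=_default`.

```python
import json
from urllib.request import Request

# Hypothetical: assumes each v2 verb lives at /api/v2/<verb>. The commit only
# says the client POSTs to "/api/v2/*", so the exact paths are a guess.
V2_VERBS = (
    "search_code", "get_callers", "get_callees",
    "get_dependencies", "impact_analysis", "find_path", "ask",
)


def graph_namespace(project, branch="_default"):
    """Build the code:<project>:<branch> graph key shared by both tracks."""
    return f"code:{project}:{branch}"


def build_v2_request(base_url, verb, payload):
    """Prepare (but do not send) a JSON POST for one v2 verb."""
    if verb not in V2_VERBS:
        raise ValueError(f"unknown v2 verb: {verb}")
    # sort_keys keeps the serialized body deterministic across runs.
    body = json.dumps(payload, sort_keys=True).encode("utf-8")
    return Request(
        f"{base_url.rstrip('/')}/api/v2/{verb}",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Payload field names below are illustrative, not the real request schema.
req = build_v2_request(
    "http://localhost:5000/",
    "get_callers",
    {"project": "pytest", "branch": "_default", "symbol": "getmodpath"},
)
```

Passing `req` to `urllib.request.urlopen` would send it; the sketch stops short of that so it runs without a server.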
Iter3 root-cause: with the verb surfaces and tool outputs now byte-identical between the HTTP (cg) and MCP (cg-mcp) tracks, the remaining token gap traced entirely to reading strategy. On 2/10 instances the agent fell into a 19x full-file `cat` loop instead of reading the bounded span the graph already pointed at, inflating input tokens 3-4x on those instances.

Both preambles now explicitly forbid `cat`-ing a whole source file and require `sed -n 'START,ENDp'` anchored on the graph's line number. This attacks the actual token driver and applies equally to both transports so a head-to-head stays apples-to-apples.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
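As an illustration of the bounded-read rule, the sketch below turns a graph-reported line number into a `sed -n 'START,ENDp'` command. The helper name `bounded_read_cmd` and the default 20-line context window are assumptions; the preambles only require that the read be anchored on the graph's line number.

```python
import shlex


def bounded_read_cmd(path, line, context=20):
    """Return a sed command that prints only the lines around `line`.

    `context` (lines shown on each side) is an illustrative default,
    not a value taken from the preambles.
    """
    start = max(1, line - context)  # sed line numbers start at 1
    end = line + context
    return f"sed -n '{start},{end}p' {shlex.quote(path)}"


cmd = bounded_read_cmd("src/_pytest/python.py", 285)
```

Clamping `start` at 1 keeps the command valid when the graph points near the top of a file.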
sample_instances() was called with only `stage` (size from STAGE_SIZES), then the result was sliced `[:limit]`. That let --limit shrink the sample below the stage size but never grow it, so `--stage calibration --limit 40` silently ran just 10 instances.

Pass n=args.limit straight into sample_instances so the limit sets the exact sample size (falling back to the stage size when unset). Because random.sample is prefix-stable for our seed, the n=10 calibration set stays a subset of the n=40 set, so existing trajectories/indexed graphs still resume-skip cleanly.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
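The difference between slicing after sampling and passing the limit into the sampler can be sketched as follows. This is a simplified stand-in, not the runner's code: the `sample_instances` signature, the seed, and the `STAGE_SIZES` contents are assumptions (the commit only implies that the calibration stage holds 10 instances).

```python
import random

# Assumed value, inferred from "--stage calibration ... ran just 10 instances".
STAGE_SIZES = {"calibration": 10}


def sample_instances(instances, stage, n=None, seed=0):
    """Hypothetical sampler: draw n instances, or the stage size when n is None."""
    size = STAGE_SIZES[stage] if n is None else n
    return random.Random(seed).sample(instances, size)


def select_old(instances, stage, limit):
    # Old behaviour: sample the stage size, then slice. The slice can only
    # shrink the sample, so a limit above the stage size has no effect.
    return sample_instances(instances, stage)[:limit]


def select_new(instances, stage, limit):
    # New behaviour: the limit is the sample size; None falls back to the stage size.
    return sample_instances(instances, stage, n=limit)


instances = [f"inst-{i}" for i in range(500)]
old = select_old(instances, "calibration", 40)
new = select_new(instances, "calibration", 40)
default = select_new(instances, "calibration", None)
print(len(old), len(new), len(default))
```

The prefix property the commit relies on is not a documented guarantee of `random.sample`. It follows from how CPython draws items one at a time, and it can fail when the population is small enough that the two sample sizes take different internal code paths, so it is worth asserting in a test rather than assuming.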
Summary

Pairs with #api-v2 (the `/api/v2/*` MCP-parity endpoints). With those endpoints in place, the SWE-bench harness can now run the HTTP-transport sibling (`cg`) on the same verb surface as the stdio-MCP sibling (`cg-mcp`), so a head-to-head benchmark measures transport overhead rather than API-surface differences.

Changes

* `bench/agents/code_graph_adapter.py`: add v2 client methods on `CodeGraphClient` that POST to the new `/api/v2/*` endpoints (`search_code`, `get_callers`, `get_callees`, `get_dependencies`, `impact_analysis`, `find_path_v2`, `ask_v2`). Existing UI-shaped methods kept for back-compat with `tests/test_cli.py`.
* `bench/cli/cg.py`: rewrite to expose the 8 MCP-style verbs (`index_repo`, `search_code`, `get_callers`, `get_callees`, `get_dependencies`, `impact_analysis`, `find_path`, `ask`) alongside the legacy UI verbs. Mirrors `cg_mcp.py`'s `_compact_list` / `_strip_worktree_prefix` helpers so token compaction is byte-identical between transports.
* `bench/runners/mini_runner.py`: `INSTANCE_TEMPLATE_CODE_GRAPH` now documents the new verb surface. The `cg` track exports `PROJECT_NAME` + `BRANCH` like the MCP track, and indexes via `/api/analyze_folder` with explicit `branch=_default` so both tracks share the `code:<project>:<branch>` graph namespace.
* `bench/tools/code_graph/system_preamble.md`: rewritten to mirror `bench/tools/code_graph_mcp/system_preamble.md` verb-for-verb.

Validation

Parity verified byte-for-byte on a pre-indexed pytest-6202 graph: `cg search_code/get_callers/get_callees/impact_analysis` returns identical output to the cg-mcp equivalents (1 KB payload diff'd). All 27 existing bench + CLI tests still pass.

Stacked

`dvirdukhan/api-v2-mcp-parity` (needs the v2 endpoints).
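A byte-for-byte parity check like the one described under Validation could be sketched as below. The helper `first_divergence` is hypothetical and is not the diff the author ran. It assumes the output of each transport has already been captured as bytes, and the sample payloads are made up.

```python
from typing import Optional


def first_divergence(a: bytes, b: bytes) -> Optional[int]:
    """Return the offset of the first differing byte, or None if identical."""
    for i, (x, y) in enumerate(zip(a, b)):
        if x != y:
            return i
    if len(a) != len(b):
        # One output is a strict prefix of the other.
        return min(len(a), len(b))
    return None


# Illustrative payloads only; real ones would come from cg and cg-mcp runs.
cg_out = b'{"callers": ["a", "b"]}'
mcp_out = b'{"callers": ["a", "b"]}'
parity_ok = first_divergence(cg_out, mcp_out) is None
```

Reporting the offset rather than a bare boolean makes it easier to see whether a mismatch comes from compaction, path prefixes, or trailing whitespace.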