diff --git a/skills/cldk-sdk-frontend/references/schema-contract.md b/skills/cldk-sdk-frontend/references/schema-contract.md index 07d2f9d..acb5511 100644 --- a/skills/cldk-sdk-frontend/references/schema-contract.md +++ b/skills/cldk-sdk-frontend/references/schema-contract.md @@ -1,7 +1,16 @@ # The analysis.json contract the SDK models must satisfy +> **Schema v2 — this file predates it.** The analyzer contract is now the backend skill's v2 +> keystone (`codeanalyzer-backend/references/canonical-schema.md`): one additive node-tree + typed +> edges (a CPG), `can://` ids, `application → symbol_table{module} → types/functions → callables → +> body`, split edge lists, `source` per module, `max_level`. Mapping the SDK's Pydantic models to +> v2 while **keeping the same public API** (`CLDK.(...)`, the same accessors) is a **major SDK +> release** (the backend hand-off's `§ c`) and is the next rebuild of *this* skill. Until that lands, +> read the sections below as the *old* (v1) contract; the authoritative shape is the v2 keystone and +> the analyzer's real sample `analysis.json`. + The analyzer (built by the **codeanalyzer-backend** skill) emits a single `analysis.json`. Your job -in this skill is to encode SDK-side `` models that **load and validate that JSON**, plus a facade +in this skill is to encode SDK-side models that **load and validate that JSON**, plus a facade that queries it. This file states the **invariant contract** the models must satisfy. 
It is *not* the exhaustive field catalog — **the authoritative, complete field list is whatever the analyzer's sample `analysis.json` actually contains** (plus the node kinds recorded in the backend's diff --git a/skills/codeanalyzer-backend/SKILL.md b/skills/codeanalyzer-backend/SKILL.md index 00bd09b..6456aa7 100644 --- a/skills/codeanalyzer-backend/SKILL.md +++ b/skills/codeanalyzer-backend/SKILL.md @@ -1,516 +1,238 @@ --- name: codeanalyzer-backend description: >- - Build the BACKEND language analyzer for CodeLLM-DevKit (CLDK): a - `codeanalyzer-` that parses a NEW programming language and emits the canonical - `analysis.json` (symbol table + resolver-based call graph), then packages and releases it as a - thin `codeanalyzer-` PyPI distribution. Use this whenever a CLDK maintainer wants to "add a - language", "build a codeanalyzer for ", "write a CLDK backend/analyzer for ", or - "support in CLDK" at the analyzer level — even if they don't say the word "skill". The core - move is a guided, informed decision about the analyzer's backend tooling (parser, resolver, - enrichment, packaging) for the target language, then scaffolding a MODULAR analyzer to a working, - validated level-1 analysis and shipping it via tag-triggered release automation. This skill stops - at the analyzer; wiring the analyzer into a CLDK SDK (Python/TS/…) is the companion - **cldk-sdk-frontend** skill. Do NOT use this for adding an extension/contribution point to an - EXISTING analyzer (that's codeanalyzer-extension-builder), or for merely *using* CLDK to analyze - code. + Build or migrate the BACKEND language analyzer for CodeLLM-DevKit (CLDK): a `codeanalyzer-` + that parses a programming language and emits the **canonical schema v2** — one additive + node-tree-plus-typed-edges (a CPG) — in BOTH `analysis.json` and Neo4j, then packages and + releases it. 
Use this whenever a CLDK maintainer wants to "add a language", "build a + codeanalyzer for ", "migrate 's analyzer to the new schema", "emit CFG/PDG/SDG/dataflow + for ", or "support in CLDK" at the analyzer level — even if they don't say "skill". Two + entry paths: a NEW language (scaffold the analyzer from scratch) or an EXISTING analyzer (a + major release that adapts it to emit schema v2). The core move is designing/confirming the + canonical schema for the language, then building it up **level by level** (L1 symbol table → L2 + call graph → L3 intraprocedural dataflow → L4 interprocedural SDG), each an additive layer, + shipped via tag-triggered release automation with a CLAUDE.md agent guide. This skill stops at + the analyzer; wiring it into a CLDK SDK is the companion **cldk-sdk-frontend** skill. Do NOT use + this for adding a contribution point to an existing analyzer (codeanalyzer-extension-builder), + or for merely *using* CLDK to analyze code. --- # CLDK analyzer backend -Build a new language's **backend analyzer** `codeanalyzer-`: it parses the language and emits -the canonical `analysis.json` (symbol table + call graph), then ships as a thin -`codeanalyzer-` PyPI distribution. This skill owns **one surface** — the analyzer and its -distribution. Wiring that analyzer into a CLDK **frontend SDK** (`CLDK(language="") -.analysis(...)` in the Python SDK, and later the TS/Rust/Go/Java SDKs) is the separate -**cldk-sdk-frontend** skill, which consumes this skill's output. Keep that boundary: here you -produce a validated, released analyzer + its `analysis.json` contract; the frontend skill binds it. +Build (new language) or migrate (existing analyzer) a `codeanalyzer-` that emits the +**canonical schema v2** (`references/canonical-schema.md` — read it first, it is the keystone). +The schema is **one additive structure** — a tree of code nodes with typed edge overlays, a CPG — +emitted in **two projections**: `analysis.json` and a Neo4j graph. 
Both are first-class +deliverables. This skill owns that analyzer and its distribution; wiring it into a CLDK **frontend +SDK** is the separate **cldk-sdk-frontend** skill, which consumes this skill's output. -The skill's defining move is **not** picking a template — it's running a guided, informed decision -about *how to build the backend* for this specific language, then scaffolding from that decision. A -new language's analyzer must live in that language's own ecosystem to reach its best tooling, so the -tooling choices genuinely differ per language and the user owns them. +The organizing principle is the schema's own: -## Before you start: orient - -- Confirm the **target language** and locate the CLDK reference repos — you anchor the schema - and construction on the **already-implemented** analyzers. They normally sit as siblings: - `codeanalyzer-java/`, `codeanalyzer-python/` (analyzer templates), `codeanalyzer-ts/` (a - **cautionary** reference — see below), and `python-sdk/` (which also contains the **C** analyzer - under `cldk/analysis/c/` — the procedural, non-class anchor — and is the model SDK Pydantic - schema the analyzer's output must validate against). **If any of these is not present locally, - clone it into `/tmp` and anchor on that copy** (read-only — never push to these): - ``` - for r in codeanalyzer-java codeanalyzer-python codeanalyzer-ts python-sdk; do - [ -d "/tmp/$r" ] || git clone --depth 1 https://github.com/codellm-devkit/$r.git "/tmp/$r" - done - ``` - Prefer a local sibling checkout if one exists (it may be ahead of `main`); fall back to the - `/tmp` clone. Don't invent locations, and don't proceed to schema design without at least the - Java and Python analyzers plus `python-sdk` available to read. -- Skim the analyzer references to ground yourself: **`codeanalyzer-python` is the model to - replicate** — the modern, pluggable, cleanly-separated template (tree-sitter + Jedi); - `codeanalyzer-java` is the heavyweight WALA one. 
Most new languages follow the *structure* of - the Python one but in their own ecosystem. **`codeanalyzer-ts` is a cautionary reference: it - runs and validates, but it was generated as a flat monolith** (a 968-line grab-bag of free - functions, a `core` that inlines everything and hardcodes `entrypoints: {}`, and **no - pluggable pass/registry/finder layer at all**). Read it to learn the anti-patterns to avoid — - not the structure to copy. **Producing a modular package, not a working monolith, is a - first-class success criterion of this skill** (see `references/analyzer-architecture.md`). -- Read these reference files now — they are the spec the scaffolding must satisfy: - - `references/analyzer-architecture.md` — **the modular package skeleton the analyzer must - have** (anchored on `codeanalyzer-python`, with `codeanalyzer-ts` as the anti-example). - Read it before scaffolding: the seams are laid up front, not retrofitted. - - `references/canonical-schema.md` — the `analysis.json` contract and its invariants. **Read first.** - - `references/schema-reference.md` — the exhaustive, field-by-field schema derived from the - SDK Pydantic models. This is what the analyzer must mirror **comprehensively** (every - field, not a subset), and the basis for the validation success criterion. - - `references/schema-design-loop.md` — **the method** for *Schema Design*: design the schema node by - node by anchoring on Java + Python and **bringing every divergence to the user as a - decision**. - - `references/project-materialization.md` — *Project Materialization*: the build/dependency phase that must run - **before parsing** (Java downloads deps for the SymbolSolver classpath; Python builds a - venv for Jedi) so the resolver can populate types. - - `references/symbol-table-construction.md` — *Symbol Table Construction*: how to walk files and populate the - table, modeled on how Java (`SymbolTable.extractAll`) and Python (`core.py` rglob loop) - actually do it. 
- - `references/backend-recipe.md` — the 9-step methodology for building the analyzer. - - `references/tooling-menu.md` — the per-language decision you'll walk the user through. - - `references/cli-contract.md` — the CLI flags the analyzer must expose (the contract the - frontend SDKs depend on; owned here). - - `references/neo4j-projection.md` — the **optional second output surface**: projecting the - same IR into a Neo4j graph via `--emit neo4j` (Cypher snapshot + live Bolt push). Every - mature analyzer ships it; add it once level-1 JSON is solid. - - `references/dataflow-graphs.md` — the **levels 3–4 contract**: native intraprocedural - CFG/DFG/PDG (L3) and interprocedural SDG + clients (L4), the CPG projection, node identity, - `program_graphs` emission, parity clause, and verification gates. Read before any dataflow work. - - `references/dataflow-construction.md` — the **method**: the stage-by-stage algorithm ladder, - split at the L3/L4 seam (Stages 1–4 intraprocedural / AST-only; Stages 5–8 interprocedural / - oracle-backed), per-language lowering checklists, gates, and fixture minimums. - - `references/dataflow-substrate-menu.md` — the dataflow counterpart of the tooling menu: - per-language CFG / def-use / points-to substrate decisions (the points-to slot is the L4 gate). - - `references/dataflow-issue-template.md` — the planning template a language instantiates - (goals, locked substrate decisions, staged PRs, caveats) before building levels 3–4. - - `references/testing-and-validation.md` — **all analyzer-side verification criteria, fixture - design rules, and definitions of done.** Read before writing any tests. (SDK-side testing - is the frontend skill's `references/sdk-testing.md`.) 
- - `references/packaging-and-release.md` — **the distribution layer**: cross-compile the binary, - ship it as a thin `codeanalyzer-` PyPI package (+ raw binaries as GitHub Release assets + - a `brew install codeanalyzer-` formula pushed to the shared `codellm-devkit/homebrew-tap`), - and cut tag-triggered releases. Standing up `packaging/python/` + `packaging/homebrew/` + - `release.yml` is a first-class deliverable. - -## Workflow - -Work the steps below in order, and **don't design the whole thing up front**. Design the schema, -**scaffold the modular package skeleton**, materialize the project's dependencies, construct the -symbol table file by file, then build the cheap resolver-based call graph. *Symbol Table Construction* + *Call Graph Construction* together -are **level 1 — the cheap, resolver-based analysis** (symbol table *and* call graph, both from -the same Tier-1 resolver). The heavy **level 2 — framework-based** analysis (WALA/Joern/ -SVF) is optional and comes later. Each step models itself on what the mature reference analyzers -(Java + Python) do. - -### Orient & choose the backend tooling -The developer's real first move: *what backend am I using?* Walk the user through the tooling -menu (`references/tooling-menu.md`). **Pre-fill a recommendation for each slot** (runtime, -structural parser, resolver, optional enrichment, build/dep materialization, packaging) and ask -for confirmation — don't silently choose, don't ask an open-ended "what do you want?". Use -`AskUserQuestion` for the load-bearing slots, especially *is the structural tool also the -resolver, or are they separate?* — that reshapes everything downstream. Note what the chosen -resolver needs materialized (Jedi→venv, TS checker→`tsconfig`+`node_modules`, `go/types`→`go mod -download`). - -Also ask the **analysis depth** they want (`AskUserQuestion`): -- **Rapid — level 1 (default):** symbol table + the cheap resolver-based call graph. The - framework backend is left stubbed. 
-- **Deep — level 2:** also stand up the framework-based backend (Joern/SVF/WALA), - flipping the *Level 2: framework-based analysis* step from stubbed to implemented. - -Default to **rapid (level 1)** — level 1 is always built (it's the floor; level 2 builds on it), -and deep is opt-in. Record the agreed choices — including the depth **and the packaging build -strategy** (single-host cross-compile vs native-runner matrix; `packaging-and-release.md`) — under -an **"Architecture & Tooling"** heading in the analyzer's own `codeanalyzer-/README.md`. This -is deliberately a public, top-level doc: it documents for human readers *which backend tooling -was chosen and why*, and it doubles as the guide any later session (you included, or the -**cldk-sdk-frontend** skill) reads to recover the locked decisions without re-litigating them. -Capture each load-bearing slot (runtime, structural parser, resolver, optional enrichment, -build/dep materialization, packaging, depth, extra node kinds) and a one-line rationale per -non-default choice. Keep the *Schema Design* `SCHEMA_DECISIONS.md` under the analyzer's `.claude/` -folder (create it if needed); only these tooling decisions are promoted into the README. +> **Codeanalyzer is an additive analysis paradigm: each analysis level is the same tree grown one +> layer deeper, plus one edge family over the new layer.** -**Then check the toolchain is installed, before building anything.** The chosen tooling has hard -prerequisites (Node + the analyzer's deps for ts-morph; the Go toolchain for `go/types`; the -Rust toolchain + rust-analyzer; clang/libclang for C++; plus any framework backend like -Joern if *deep*) **and the packaging/release toolchain that cross-compiles and publishes the -`codeanalyzer-` package** (e.g. 
Bun for `bun build --compile --target=...`, GraalVM -`native-image` for JVM, the cross-compile target for Go/Rust; plus Python `build`/`wheel`/`twine` -+ `auditwheel` for the platform wheels — see `references/packaging-and-release.md`). Probe for -them (e.g. `node --version`, `go version`, `rustc --version`, `clang --version`, `bun --version`). -**If anything required is missing, stop and instruct the user to install it** -— give the exact install commands for their platform and what each is for — and **wait** until -they confirm it's available. Do **not** proceed to scaffold-and-leave-unverified: an analyzer you -can't run is an analyzer you can't validate against the schema, which is the whole success -criterion. Only continue once the toolchain is present. +So you don't "build an analyzer" and then bolt on features — you **grow one structure, level by +level**, and each level is independently shippable. -### Schema Design (interactive, node by node) -Design the canonical schema once — it is the **contract** the analyzer's `analysis.json` emits, and -the contract the **cldk-sdk-frontend** skill later encodes as SDK models. Here you produce **two -things in lockstep: the analyzer-side types AND the contract** (`canonical-schema.md` / -`schema-reference.md`); the per-SDK `cldk/models//` Pydantic models (and TS types) are built -later by the frontend skill against this same approved contract. Run the loop in -`references/schema-design-loop.md` per node (spine first: `Module` → -`Class` → `Callable` → `Callsite` → `CallEdge`, then language-native kinds): +## Two paths -1. **Anchor** — read the node in **Java** (`cldk/models/java/models.py`) and **Python** - (`py_schema.py`) side by side. Catalog the shared spine and **every place they diverge**. -2. **Differentiate** — ask *"how is the `` language structurally different here?"* - (language semantics, not domain) and note each genuinely new concept. -3. 
**Decide each open point WITH the user.** This is the rule: for every divergence and every - new concept, **don't choose silently — ask** (`AskUserQuestion`). Present it as *"Java did X, - Python did Y; for ``, concept Z, how do you want to model it?"* with explained options - and a recommended default. (E.g. *Java annotations are flat strings, Python uses structured - `PyDecorator`; for TS decorators that carry args, option 1: structured `TSDecorator` - (recommended) …*.) Record each answer in `.claude/SCHEMA_DECISIONS.md`. -4. **Define** — encode the decisions into the analyzer-side type and update the contract; - snake_case, optional-with-defaults, spine untouched, identity-only edges. (These same decisions - drive the SDK models the frontend skill builds — `SCHEMA_DECISIONS.md` is its input.) +Decide which up front (`AskUserQuestion` if unclear); the rest of the workflow branches lightly on +it: -No files are walked yet. Output: a complete, user-approved schema contract + the analyzer types + -`SCHEMA_DECISIONS.md`. +- **(A) New language.** No analyzer exists. Choose the backend tooling, scaffold a modular + analyzer, and build the schema up level by level. Most of this file. +- **(B) Existing analyzer → schema v2.** A `codeanalyzer-` exists on the **old** schema + (flat `symbol_table` + rich or identity edges). This is a **major release**: keep the + analyzer's parsing/resolution guts, and adapt its *emission* to schema v2 (both JSON and Neo4j), + level by level. Follow `references/schema-migration.md`; the level structure below still governs + the order you migrate in. -### Scaffold the modular skeleton (seams first, before filling phases) -Before writing any analysis logic, lay out the analyzer as a **modular package that mirrors -`codeanalyzer-python`'s structure** — one subpackage per phase plus the pluggable pass layer — -following `references/analyzer-architecture.md`. 
Create the boxes empty-but-wired: a thin CLI -entry; a `core` **orchestrator that only delegates** (no inlined parsing, and never a hardcoded -`entrypoints: {}`); `syntactic_analysis/`, `semantic_analysis/` (with the framework backend in -its **own subpackage**, seams scaffolded even when stubbed); and the extensibility layer — -`analysis/` (the `AnalysisPass` base + a registry that discovers, topo-orders by -`requires`/`provides`, and runs a `run_pipeline`) and `frameworks/` (the entrypoint-finder base). -The built-in pass list and concrete finders may start empty, but **the seams and entry-point -discovery must exist now** — that is exactly the layer the generated TS analyzer was missing, and -where `codeanalyzer-extension-builder` later plugs in. Retrofitting modularity into a monolith is -the failure this step prevents. +Either way the target is identical: an analyzer whose output validates against +`canonical-schema.md` at its implemented `max_level`, in both projections. -### Project Materialization (build & dependency resolution) -Before parsing, materialize the target project's dependencies so the resolver can populate -types — this is a real phase with its own failure modes. Follow -`references/project-materialization.md`, modeled on Java -(`BuildProject.downloadLibraryDependencies` runs *before* the symbol table, for the -SymbolSolver classpath) and Python (`core.py` builds a **venv** + `pip install` and passes it -to the symbol-table builder, because Jedi needs it). For the new language: detect the manifest -(`tsconfig.json`+`package.json`, `go.mod`), run the ecosystem installer (`npm ci` → -`node_modules`; `go mod download`), **cache** it under `cache_dir`, **degrade gracefully** to -partial types on failure (never crash), and honor `--no-build`/`--eager`. 
Source-level -resolvers (TS checker, `go/types`, Jedi) need deps **present**, not a full compile; defer any -heavier compile to just before *Call Graph Construction* if your call-graph backend needs build artifacts. - -### Symbol Table Construction (file by file) -Now populate the schema. Follow `references/symbol-table-construction.md`, which is built by -**studying how Java (`SymbolTable.extractAll` → `symbolTable.put(path, ...)`) and Python -(`core.py`'s `rglob` loop → `build_pymodule_from_file` → `symbol_table[file_key] = module`) -iterate over files** — then doing the same for the new language: discover source files (skip -vendored/test trees), compute stable relative `file_key`s, per-file cache-check then build the -`Module` (filling classes/functions/native kinds + **unresolved** call sites with -`callee_signature` null + cache metadata), and assemble `symbol_table: Dict[path, Module]`. -Support whole-project, `-t` target-files, and (optional) single-source modes. This stage -records call sites but doesn't resolve them into edges yet — the cheap resolution is the very -next stage (still level 1). - -**Path predicate pitfall — apply filters to the relative path, never the absolute path.** -Every file-skip predicate (`IsVendored`, `IsTestFile`, and any custom equivalent) must be -evaluated against the path *relative to the project root* — not the absolute path. Absolute -paths carry segments from the analyzer's own directory layout (`testdata`, `vendor`, `.git`, -etc.) that falsely trigger the filter and silently empty the symbol table. Resolve the project -root to an absolute path at the top of the analysis entry point, then derive all relative keys -as `rel(projectRoot, absFilePath)`. Using the process's working directory as the base -(e.g. `rel(".", absPath)`) is a separate trap: it produces the right answer only when the -process happens to run from the project root, which is never the case in tests. 
- -**Cross-file type/method attachment — check whether your language requires a two-pass build.** -In some languages a type and its method bodies can be spread across multiple files of the same -unit (Go packages, C# partial classes and extension methods, Kotlin extension functions, Ruby -open classes). A single-pass, file-by-file builder that resolves receiver types only within -the current file silently drops every method defined in a sibling file. Identify whether the -target language has this property before writing the builder. If it does, use a two-pass -approach: pass 1 collects all type declarations from every file and builds a -`(unit, typeName) → ownerFile` index; pass 2 attaches methods using that index. Retrofitting -this after the fact is costly — the fix lives in the core iteration loop. - -**Symbol-table gate (verify):** Run the analyzer on the fixture and confirm the criteria in -`references/testing-and-validation.md § 2` (symbol-table gate). Don't proceed until this passes. - -### Call Graph Construction (resolver-based, cheap — completes level 1) -This is **cheap and part of level 1**, not a heavy pass: the same Tier-1 resolver that typed the -symbol table (Jedi/tsc/rust-analyzer/clang) is already loaded, so resolving call sites into edges -is inexpensive. For each recorded call site: resolve the callee → **backfill `callee_signature` -in place** → emit an identity-only edge `source_sig → target_sig` with `provenance` = your -resolver. Handle constructors/`new`, receiver-type dispatch, and an explicit unresolved fallback -(record the site, skip the edge, never crash). Don't mutate the symbol table beyond filling -`callee_signature`. +## Before you start: orient -**Its precision is a decision the references disagree on — so ask.** Don't frame the tiers as -"whole-program vs not" — once deps are materialized the resolver resolves across the -whole program too; the axis is the *engine* (`tooling-menu.md` § "Call-graph tiers"). 
Python's -cheap `jedi` call graph lives here at level 1 and **drops** unresolved sites; **Java is the -outlier** — it has no cheap resolver call graph, so its call graph *is* the heavy Tier-2 WALA -pass (`makeRTABuilder` → **RTA**), which for a new resolver-capable language belongs in the -*Level 2: framework-based analysis* step. For the chosen resolver, surface the dispatch choice -(declared-type only ≈ CHA, + instantiated subtypes ≈ RTA-style); heavier framework-based -precision (WALA/Joern/SVF) belongs to that level-2 step, not here. +- Confirm the **target language** and locate the CLDK reference repos (read-only; prefer a local + sibling checkout, else clone into `/tmp` from `github.com/codellm-devkit/`): + `codeanalyzer-java` (WALA — already ships L3/L4 via its slicer, the worked example of the full + ladder), `codeanalyzer-python`, `codeanalyzer-typescript`, and `python-sdk` (the SDK your output + must validate against). For an existing-analyzer migration, its own repo is the primary anchor. +- **Read the keystone first**, then the rest: + - `references/canonical-schema.md` — **the v2 model.** The tree, the id grammar, the additive + levels, the two projections. Everything else serves this. + - `references/schema-reference.md` — the per-kind field/edge appendix. + - `references/schema-design-loop.md` — **the method** for confirming the language's schema node + by node (which kinds/fields it adds), anchored on the keystone + Java/Python. + - `references/schema-migration.md` — path (B): old schema → v2, field-by-field, as a major + release. + - `references/analyzer-architecture.md` — the **modular package skeleton** (delegating `core`, + per-phase subpackages, pluggable pass layer). Producing a *modular* analyzer is a success + criterion, not a nicety. + - `references/tooling-menu.md` — the L1/L2 backend-tooling decision (parser, resolver). + - `references/dataflow-substrate-menu.md` — the L3/L4 substrate decision (CFG source, def-use, + points-to oracle). 
The points-to slot is the L4 gate. + - `references/dataflow-graphs.md` + `references/dataflow-construction.md` — the L3/L4 contract + and construction method (CFG → dominance → def-use → PDG → summaries → SDG). + - `references/cli-contract.md` — the CLI flags (`-a 1|2|3|4`, `--emit`, `--graphs`). + - `references/neo4j-projection.md` — the co-primary graph projection (always full-depth). + - `references/project-materialization.md`, `references/testing-and-validation.md`, + `references/packaging-and-release.md` — build/deps, gates, distribution. + +## Workflow — grow the tree, level by level + +Work in order. Design the schema, scaffold the modular skeleton, materialize dependencies, then +**build the structure one level at a time**, each additive and gated. Every level emits **both** +projections (JSON + Neo4j). Levels 1–2 are the floor (always built); levels 3–4 are the dataflow +tier (opt-in, added when asked and when the substrate is chosen). -**Verify:** confirm the criteria in `references/testing-and-validation.md § 2` (call-graph -gate) — every edge endpoint matches a real signature (no dangling nodes) and output still -validates. (`backend-recipe.md` step 6.) +### Orient & choose the backend tooling +Walk the user through `references/tooling-menu.md` (runtime, structural parser, resolver, +build/dep materialization, packaging) and — **if L3/L4 are in scope** — +`references/dataflow-substrate-menu.md` (CFG source, def-use source, points-to oracle). Pre-fill a +recommendation per slot and confirm (`AskUserQuestion` for load-bearing ones). Ask the **target +depth** (`max_level`): L1–2 (symbol table + call graph, the default floor), L3 (intraprocedural +dataflow), or L4 (interprocedural SDG + taint). Record the locked decisions under an **Architecture +& Tooling** heading in the analyzer's `README.md`, and keep schema decisions in `.claude/ +SCHEMA_DECISIONS.md`. 
**Then verify the toolchain is installed** (parser, resolver, the points-to +oracle if L4, plus the packaging/release toolchain) — if anything required is missing, stop and +give exact install commands, and wait. An analyzer you can't run is one you can't validate. + +### Schema design (confirm the language's shape against the keystone) +The schema is already designed — it's `canonical-schema.md`. Here you **confirm the +language-specific expansion**: which type kinds, callable kinds, body-node kinds, CFG-edge kinds, +and typed fields this language adds to the shared spine (`references/schema-design-loop.md`). Run +it node by node, anchoring on the keystone and on how Java/Python model the same concept, and +**bring every genuine divergence to the user** (`AskUserQuestion`) — *"the spine has `type` with a +`kind`; Go needs `struct` + a receiver on methods; model receiver as X?"*. Record each answer in +`.claude/SCHEMA_DECISIONS.md`. Output: the confirmed per-language kind/field set, still the same +tree. (Path B: this is where you map old fields → v2 kinds; see `schema-migration.md`.) + +### Scaffold the modular skeleton (seams first) +Lay out the analyzer as a **modular package** mirroring `codeanalyzer-python` +(`references/analyzer-architecture.md`): a thin CLI; a `core` **orchestrator that only delegates**; +`syntactic_analysis/` (the tree builder), `semantic_analysis/` (call graph + the dataflow passes, +framework backend isolated in its own subpackage), a `neo4j/` projection subpackage, and the +pluggable `analysis/` pass layer + `frameworks/` finder layer. Create the boxes empty-but-wired. +Retrofitting modularity into a monolith is the failure this prevents (`codeanalyzer-ts`'s original +flat build is the anti-example). 
+ +### Project materialization (build & dependency resolution) +Before parsing, materialize the target project's dependencies so the resolver can populate types +(`references/project-materialization.md`) — Java downloads deps for the classpath, Python builds a +venv for Jedi, Go runs `go mod download` for `go/packages`. Cache under `cache_dir`, degrade +gracefully to partial types on failure, honor `--no-build`/`--eager`. + +### L1 — build the tree (symbol table) +Grow the containment tree to **callable depth**: `application → symbol_table{module} → +types{}/functions{} → callables{}`, each node with its `can://` id, `kind`, `span` (with byte +offsets), and the module's `source` stored once (`references/symbol-table-construction.md`). +Populate the language-native kinds/fields confirmed in schema design. This is the floor; +everything hangs off it. **Emit both projections** (JSON tree + Neo4j nodes/`HAS_*` edges). +**Gate:** output validates against the SDK `Application` model; `symbol_table` keys are relative +paths (no absolute, no `..`); `get_method_body` slices `module.source` correctly; re-run reuses +cache. (`references/testing-and-validation.md` § symbol-table gate.) + +### L2 — call graph +Add the **`call_graph`** edge list at the application scope: resolve each call into a +`callable → callable` edge with `prov` and `weight`, using the Tier-1 resolver +(`references/dataflow-graphs.md` § levels). Backfill the `callee` refinement slot on call nodes +(`null → id`) — the one sanctioned mutation. Call edges are **immutable once written** (never +re-anchored to a statement at L3). Framework enrichment (Joern/WALA) merges *into this same list* +with added provenance — it's the orthogonal precision axis, not a level. **Gate:** every edge +endpoint is a real callable id (no dangling); output still validates. 
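Reviewer's sketch (not part of the committed change): the L1 and L2 gates above can be checked mechanically against an emitted `analysis.json`. The Python below is a minimal illustration under stated assumptions: `callables` is taken to be a mapping keyed by `can://` id, and the `call_graph` edge fields `src` and `dst` are hypothetical names. The keystone `canonical-schema.md` is the authority on the real field names.

```python
from pathlib import PurePosixPath, PureWindowsPath


def collect_callable_ids(node, found=None):
    """Gather the keys of every `callables` mapping anywhere in the tree.

    Assumption: callables are stored as {id: node} mappings.
    """
    found = set() if found is None else found
    if isinstance(node, dict):
        for key, value in node.items():
            if key == "callables" and isinstance(value, dict):
                found.update(value.keys())
            collect_callable_ids(value, found)
    elif isinstance(node, list):
        for item in node:
            collect_callable_ids(item, found)
    return found


def bad_symbol_table_keys(application):
    """Return symbol_table keys that are absolute or contain a `..` segment."""
    bad = []
    for key in application.get("symbol_table", {}):
        absolute = (PurePosixPath(key).is_absolute()
                    or PureWindowsPath(key).is_absolute())
        if absolute or ".." in PurePosixPath(key).parts:
            bad.append(key)
    return bad


def dangling_call_edges(application):
    """Return call_graph edges whose endpoints are not known callable ids.

    `src` and `dst` are hypothetical field names used for illustration.
    """
    ids = collect_callable_ids(application.get("symbol_table", {}))
    return [edge for edge in application.get("call_graph", [])
            if edge.get("src") not in ids or edge.get("dst") not in ids]
```

Both helpers return the offending items rather than a boolean, so a failing gate can report exactly which key or edge broke it.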
+ +### L3 — intraprocedural dataflow (optional; the first dataflow level) +Grow the tree **below the callable**: populate each callable's `body` with statement nodes, and +add the intra-callable edge lists `cfg`, `cdg`, `ddg` (**syntactic** — name-equality, no points-to +oracle needed). Build stage by stage per `references/dataflow-construction.md` (CFG → dominance → +def-use → PDG). AST-only and **per-callable parallel** (`-j`). This is a complete, shippable +capability (`-a 3`). **Gate:** the intraprocedural backward-slice on the fixture equals the +hand-computed node set, exactly. + +### L4 — interprocedural dataflow (optional; needs the points-to oracle) +Add the **synthetic parameter vertices** (`formal_in/out`, `actual_in/out`) to `body`, the +cross-function `param_in`/`param_out` edge lists, the intra-caller `summary` edges, and the +**semantic** (alias-aware) `ddg` edges (`prov:["points-to"]`) — the whole-program SDG. Needs the +points-to oracle from the substrate menu + the summary fixpoint (stages 5–8 of +`dataflow-construction.md`). `-a 4`. **Gate:** no dangling SDG endpoints; a known source→sink taint +flow is found and its sanitized variant reported sanitized. + +### Neo4j projection (co-primary, always full-depth) +The Neo4j graph is not an afterthought — it's the **second required projection** +(`references/neo4j-projection.md`). Build it as the modular `neo4j/` subpackage (pure +`project() → GraphRows → cypher/bolt writers` + a declarative schema catalog). Containment renders +as typed `HAS_*`/`DECLARES` edges; every overlay edge renders as a typed relationship; nodes carry +their `can://` id. `--emit neo4j` always runs at **maximum implemented depth** — analysis levels +gate the JSON path only; combining `-a`/`--graphs` with `--emit neo4j` is an explicit error. Keep +the graph schema versioned and in lockstep with the JSON schema (same kinds → labels). 
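Reviewer's sketch (not part of the committed change): the rule above, that `--emit neo4j` always runs at maximum depth and rejects `-a`/`--graphs`, could be enforced at argument-parsing time. The Python `argparse` example below is a hedged illustration only. The flag names come from the text, but the program name, the `json` emit value, and the free-form `--graphs` value are assumptions, and a real analyzer would implement this in its own language.

```python
import argparse


def build_parser():
    # Hypothetical program name; the flags mirror those named in the text.
    parser = argparse.ArgumentParser(prog="codeanalyzer-example")
    parser.add_argument("-a", dest="level", type=int, choices=[1, 2, 3, 4],
                        default=None, help="analysis level for the JSON path")
    # "json" as the default emit target is an assumption.
    parser.add_argument("--emit", choices=["json", "neo4j"], default="json")
    # The accepted --graphs values are not specified here; keep it free-form.
    parser.add_argument("--graphs", default=None)
    return parser


def parse_cli(argv):
    parser = build_parser()
    args = parser.parse_args(argv)
    # Levels gate the JSON path only, so reject the combination explicitly
    # instead of silently ignoring -a/--graphs.
    if args.emit == "neo4j" and (args.level is not None or args.graphs is not None):
        parser.error("--emit neo4j always runs at maximum implemented depth; "
                     "do not combine it with -a or --graphs")
    return args
```

`parser.error` prints the message to stderr and exits with status 2, which gives the non-zero exit and clear message an explicit error needs.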
### CLI, caching/incremental, packaging & release -Add the CLI family surface (`cli-contract.md`) with `analysis.json` as the only facade-visible -output. **Validate all flag values** — unrecognized or unimplemented values (e.g. `--format -msgpack` before msgpack is implemented) must return a non-zero exit with a clear message, never -silently fall back (`cli-contract.md § Flag validation requirements`). - -**Caching has three independent layers — implement and test each explicitly:** - -1. **Materialization cache** — memoizes the dependency-fetch step (`go mod download`, `npm ci`, - venv build) by hashing the manifest (`go.sum`, `package-lock.json`, `requirements.txt`). - Stored in `cache_dir`. Bypassed by `--eager`. -2. **Per-run output cache** (`analysis_cache.json`) — written to `cache_dir` after every - successful `Analyze()` call. Always rewritten; gives the SDK something to read without - re-invoking the binary. `--eager` rewrites it; non-eager runs still write it (it's not - a skip guard at the binary level). -3. **SDK-level skip** — the Python facade reads the *output dir*'s `analysis.json`, validates - it, and **skips invoking the binary entirely** if valid. This is where the real "don't - re-run the binary" logic lives (frontend skill). The binary itself always runs fresh - analysis when invoked. - -The behavioral tests for caching are in `references/testing-and-validation.md § 2`. - -**For packaging, be opinionated and follow `references/packaging-and-release.md`: -build a self-contained binary for every platform, then ship it as a thin -`codeanalyzer-` PyPI package** — one platform-tagged wheel per OS/arch, carrying the binary -and exposing `bin_path()` — **plus raw binaries as GitHub Release assets, plus a Homebrew formula -`Formula/codeanalyzer-.rb` pushed to the shared `codellm-devkit/homebrew-tap`** (so end users -get `brew install codeanalyzer-`), all cut by a **tag-triggered `release.yml`**. 
The brew -formula reuses the same Release-asset binaries (compiled case) or the same PyPI package (Python -case) — never a rebuild. The frontend SDKs *depend on* that published package; they never -bundle or build the binary. Build it by **single-host cross-compile where the toolchain allows** (TS -via `bun build --compile --target=`; Go via `GOOS`/`GOARCH`; Rust via target triples) **or a -native-runner build matrix where it doesn't** (JVM via GraalVM `native-image`, which can't -cross-compile; C++/clang with per-target sysroots). A Python analyzer is the same PyPI package but -its wheel carries code, imported in-process. **Release automation is standard practice, not optional:** stand up -`packaging/python/` (the `build_wheels.sh` + `pyproject.toml` + `bin_path()` package) and -`.github/workflows/release.yml`, tag releases `vX.Y.Z` with real notes modeled on -`codeanalyzer-python`'s GitHub Releases (Keep-a-Changelog *Added/Changed/Fixed* + auto-generated -*Detailed Changes*), publish to PyPI under `codeanalyzer-` (prefer OIDC Trusted Publishing), -and **record the published name + version** so the frontend skill can pin it. (`backend-recipe.md` -steps 3, 8, 9; full spec in `references/packaging-and-release.md`; rationale in `tooling-menu.md` -§ "Packaging".) - -### (Optional) Neo4j graph projection — a second output surface -Once the level-1 `analysis.json` path is solid, add the **optional Neo4j projection** every -mature analyzer now ships (`references/neo4j-projection.md`). It is not an ingestion of the -JSON — it's an **alternative projection of the same in-memory IR**, selected by `--emit neo4j`, -producing either a self-contained `graph.cypher` snapshot or a live Bolt push, plus `--emit -schema` for the static `schema.neo4j.json` contract. 
Build it as a modular `neo4j/` subpackage -(`project` → `GraphRows` → `cypher`/`bolt` writers + a declarative `schema`), keep the driver an -**optional/lazy** dependency, and hold the graph schema in lockstep with the JSON schema (same -`SCHEMA_DECISIONS.md` node kinds → node labels; identity-only call edges → `CALLS`). The SDK's -Neo4j backend (frontend skill) reconstructs the canonical model from this graph, so the node -families and `--app-name` anchor must match. **The graph is always full-depth:** analysis levels -gate the JSON path only — `--emit neo4j` runs at maximum implemented depth (once levels 3–4 exist, -the complete SDG/CPG, unconditionally), and combining `-a`/`--graphs` with it is an explicit -error (`neo4j-projection.md § Depth rule`). Leave the projection out only if the user explicitly -scopes to JSON-only; otherwise it's a standard deliverable of the CLI/packaging stage. - -### (Optional) Level 2: framework-based analysis -Gated on the depth choice from *Orient & choose the backend tooling*. The heavy tier — a dedicated analysis engine -(Joern/SVF, or WALA-style; `backend-recipe.md` step 7) for points-to/dataflow edges the -cheap resolver can't reach. If the user picked **rapid (default)**, leave it a wired, flag-gated -extension point with a clear TODO. If they picked **deep**, implement it now and merge its edges -into the resolver graph by `(source, target)` with provenance union. (For a language whose call -graph is *only* available this way — e.g. Java/WALA — this stage is where that call graph lives, -regardless of the depth choice.) - -### (Optional) Levels 3–4: native dataflow graphs -A separate, later body of work — never part of the initial language bring-up, and itself **two -shippable levels**. 
When the user asks for dataflow, slicing, or taint ("CFG/PDG/SDG", -"reachability", "what does this value affect"), plan it with -`references/dataflow-issue-template.md` (one epic issue, staged PRs), decide the substrate slots -from `references/dataflow-substrate-menu.md` (confirmed with the user, recorded in the README's -*Architecture & Tooling*), and build stage by stage per `references/dataflow-construction.md` -against the contract in `references/dataflow-graphs.md`. - -The **L3/L4 split is the key planning decision**: **level 3** (`-a 3`, intraprocedural CFG/DFG/PDG -per function) is AST-only, per-callable parallel, and shippable with *no points-to oracle* — ship -it first. **Level 4** (`-a 4`, the interprocedural SDG + taint/slicing clients) is the heavier -tier that needs the oracle from the substrate menu and the whole-program summary fixpoint — add it -once the oracle lands. The rules that bind both: everything is **native and in-process**; graphs -are keyed by `(signature, node_id)` on the same `signatureOf()`; each stage's gate passes before -the next; `-a 1`/`-a 2` stay untouched and `-a 3` must not pay L4's cost; the **SDG (L4) is the -core artifact** (clients query it), and the CPG is only its Neo4j projection — skip the CPG if the -Neo4j surface isn't in scope. +Add the CLI family (`references/cli-contract.md`): `-a 1|2|3|4`, `--emit json|neo4j|schema`, +`--graphs`, `-j/--jobs`, `--eager`, `-c/--cache-dir`. **Validate all flag values** (unimplemented +→ non-zero error, never silent fallback). Cache by hash/mtime with vendored/test trees skipped. +**For packaging, be opinionated and follow `references/packaging-and-release.md`:** a +self-contained binary per platform, shipped as a thin `codeanalyzer-` PyPI wheel (+ GitHub +Release binaries + a `codellm-devkit/homebrew-tap` formula), cut by a tag-triggered `release.yml`. +The SDKs depend on the published package; they never build the binary. 
For an existing analyzer +migrating to v2, this is a **major version bump** — the schema change is breaking. ### Write the analyzer README (last build step) -The analyzer's `codeanalyzer-/README.md` already holds the **Architecture & Tooling** -decisions recorded back in *Orient & choose the backend tooling*. As the **final build step**, -grow that file into a complete, user-facing README modeled on the reference analyzers' -**`main`-branch** READMEs — `codeanalyzer-python/README.md` (the model to replicate) and -`codeanalyzer-java/README.md`. Don't invent a layout; mirror theirs, in this order: -- **Logo + title + one-line what-it-is** — open with the shared CLDK logo, reusing the Python - repo's hosted URL (the analyzers share branding) rather than committing a per-language copy: - ```md - ![logo](https://github.com/codellm-devkit/codeanalyzer-python/blob/main/docs/assets/logo.png?raw=true) - ``` - Then name the language and the chosen backend tooling (e.g. "Static analysis for `` - using `` + ``"), echoing the reference openers. -- **Prerequisites / installation** — the toolchain confirmed installed up front (runtime, - parser, resolver, plus any framework backend if *deep*), with exact per-platform install - commands as Python does for `venv`/build tools. Read the minimum version from the **build - manifest** (`go.mod`'s `go` directive, `Cargo.toml`'s `rust` field, `pyproject.toml`'s - `requires-python`, etc.) — not from what happens to be installed. Record both the minimum - and the version the analyzer was actually tested on. -- **Building, packaging & releasing** — how to build the self-contained binary and ship it - as the `codeanalyzer-` PyPI package + GitHub Release assets, and how releases are cut - (`packaging/python/` + `packaging/homebrew/` + tag-triggered `release.yml`), per *CLI, - caching/incremental, packaging & release* and `references/packaging-and-release.md`. 
For an SDK - user it's just `pip install codeanalyzer-`; for an end user, `brew tap codellm-devkit/tap && - brew install codeanalyzer-`. -- **Usage + CLI options** — paste the real `--help` output (from `cli-contract.md`), then a few - worked **examples** like the Python README (basic symbol table, `--output`, level-2/framework - flag). -- **Analysis levels** — what level 1 (symbol table + resolver call graph) emits today and what - level 2 (framework backend) adds — flagged stubbed-vs-implemented per the depth choice. -- **Output schema** — point at the canonical `analysis.json` / `Application` contract. -- **SDK integration** — note that the CLDK SDKs bind this analyzer (Python: - `CLDK(language="").analysis(...)`; others later), wired by the **cldk-sdk-frontend** skill. -- Keep the **Architecture & Tooling** section (the locked decisions) intact as its own heading. - -Write only what actually runs — don't document level-2 as working if it's a stubbed extension -point. The README is the human-readable counterpart to the validated `analysis.json`: like every -other stage, it describes the analyzer as it really is. +Grow the `README.md` (which already holds the Architecture & Tooling decisions) into a complete, +user-facing README modeled on `codeanalyzer-python`'s: logo + one-liner; prerequisites (read the +minimum toolchain version from the build manifest, not what's installed); building/packaging/ +releasing; usage + real `--help`; **the analysis levels** (what L1–L4 emit today, flagged +implemented-vs-stubbed by `max_level`); the schema contract (point at `canonical-schema.md`); and +SDK integration (bound by **cldk-sdk-frontend**). Write only what actually runs. 
### Write the agent guide (CLAUDE.md + AGENTS.md symlink) — a default artifact -Every analyzer repo ships an **agent onboarding guide as a standing deliverable**, not an -afterthought: a root `CLAUDE.md`, with `AGENTS.md` as a **symlink pointing at it**, so Claude Code -and the generic-agent convention read one source of truth. Always produce these — even for a -minimal analyzer. - -**The template is `codeanalyzer-typescript/CLAUDE.md` — mirror it.** It is the canonical form; do -not invent a layout. Read it and reproduce its structure, regenerating the analyzer-specific -sections for `` and carrying the standard sections over near-verbatim (adjusted for the new -repo). `CLAUDE.md` is the *contributor/maintainer* counterpart to the user-facing README — it tells -a coding agent how this repo is built, not how to use the CLI. Keep it concise and **specific to -the analyzer as actually built** (no boilerplate), in the template's order: - -- **Title + one-liner** — `Agent guidance for codellm-devkit/codeanalyzer- ()`. -- **What this project is** — the language, the chosen backend tooling, that it emits the canonical - `analysis.json` (symbol table + resolver call graph) **and** (if built) the optional Neo4j - projection, and that it **mirrors the Java/Python/TS sibling analyzers so output-shape parity is - a first-class concern**. One line, pointing at the README's *Architecture & Tooling* section for - the locked decisions. -- **Architecture — follow the pipeline** — name the single `analyze()`/`core` orchestrator and - list its ordered stages (materialize → symbol table → call graph → cache → output/neo4j), the way - the template walks `src/core.ts`. State the **modularity rules as invariants** a change must - preserve (no inlined analysis in `core`, no hardcoded `entrypoints: {}`, builder split by node - kind — from `references/analyzer-architecture.md`), and that `Application` in the schema is - the output contract. 
-- **Directory map** — a path → responsibility table for the actual package layout. -- **Commands** — the real build/test/run/typecheck/schema-gen commands (e.g. `bun run build`, - `bun test`, `bun run gen:schema`; or the Go/Rust/Python equivalents), and the fixture used to - validate `analysis.json`. -- **Schema + packaging contract** — output must validate against the SDK `Application` model - (point at `.claude/SCHEMA_DECISIONS.md`); the Neo4j schema is versioned and enforced by a - conformance test — treat it as a contract; and the version-lockstep rule across the manifest, - `packaging/python/`, the SDK pins, and the brew formula (`references/packaging-and-release.md`). -- **The standard working-style + rules + auxiliary sections** — carry the template's *"I implement - features myself — you assist"*, the numbered **Rules** (think before coding; simplicity; - issue → branch → PR; guard the contract), the teaching-loop / spaced-repetition section (which - defers to `~/.claude/CLAUDE.md`), and the *Auxiliary support tasks* (e.g. tidy up the release - announcement) over near-verbatim, adjusting repo name, short-name, and the upgrade one-liners - (`pip install -U codeanalyzer-`, the brew tap) for this analyzer. -- **Repo rules** — carry over any unbreakable conventions the repo already states (never add - AI-authorship trailers / `🤖` signoffs to PRs); preserve an existing `CLAUDE.md`'s rules rather - than dropping them. - -Create the symlink as a **relative** link at the repo root so it survives clone/checkout: -```bash -ln -sf CLAUDE.md AGENTS.md -``` -**Watch the global-gitignore trap:** many setups exclude `AGENTS.md` in a global -`~/.gitignore_global`, so `git add AGENTS.md` silently no-ops and the symlink never gets -committed. 
Un-ignore it in the repo's local `.gitignore` (a repo rule overrides the global one), -then commit: -```gitignore -# Un-ignore the agent guide past a global gitignore that excludes AGENTS.md -!CLAUDE.md -!AGENTS.md -``` -Verify with `git check-ignore AGENTS.md` (should print nothing) and confirm `git ls-files` shows -both. If the negation isn't enough in your setup, `git add -f AGENTS.md`. Commit both files (git -stores the symlink). If a `CLAUDE.md` already exists (as a one-line rule file), **fold its content -into the new guide** before adding the symlink — never silently discard it. +Every analyzer repo ships a root **`CLAUDE.md`, with `AGENTS.md` as a relative symlink** to it, so +Claude Code and the generic-agent convention read one source of truth. **Mirror +`codeanalyzer-typescript/CLAUDE.md`** as the template, and it must **describe the schema v2 model +in detail** (for maintainability): the additive paradigm, the node tree + edge overlays, the +`can://` ids, the level structure, and the two projections — so a future agent understands *what +this analyzer emits and why* without re-deriving it. Cover: what the repo is + chosen tooling; the +modular architecture and its invariants; how to build/test/run + the validation fixture; the schema +contract (link `canonical-schema.md` + `.claude/SCHEMA_DECISIONS.md`); packaging/release + version +lockstep; and repo rules (never add AI-authorship trailers). Watch the **global-gitignore trap** — +many setups exclude `AGENTS.md`, so un-ignore it in the repo's local `.gitignore` (`!AGENTS.md`) +and verify `git ls-files AGENTS.md` (or `git add -f`). Fold any existing `CLAUDE.md` in rather than +discarding it. 
### Summarize & hand off to the frontend skill -Report: the build plan, the schema decisions the user made (`SCHEMA_DECISIONS.md`), what runs today -(the cheap level-1 analysis — symbol table + resolver call graph — on the fixture), what's stubbed -(the level-2 framework backend), the **distribution artifacts** (the `codeanalyzer-` PyPI -package under `packaging/python/`, the `packaging/homebrew/` formula generator + the -`codellm-devkit/homebrew-tap` push, the tag-triggered `release.yml`, and the **published package -name + version**), the analyzer `README.md` and the **`CLAUDE.md` agent guide with its `AGENTS.md` -symlink** (mirroring `codeanalyzer-typescript/CLAUDE.md`), and the diff summary. Confirm -the **modularity** checks from `references/analyzer-architecture.md` actually hold (delegating -`core`, node-kind-split builder, isolated framework subpackage, present-and-wired `analysis/` + -`frameworks/` layer) — report it as a checklist, not an aspiration. - -**Hand-off to cldk-sdk-frontend.** This skill ends at a working, released analyzer. To make the -language usable from a CLDK SDK, run the **cldk-sdk-frontend** skill next; it consumes exactly what -you produced here: a sample `analysis.json`, the approved schema contract + `SCHEMA_DECISIONS.md`, -the CLI contract (`--help`), and the published `codeanalyzer-` package name + version to pin. -State these explicitly in the summary so the frontend skill (or a later session) has its inputs. - -> **Never fake verification.** Every stage's verify step must actually run. If a required tool -> is found missing mid-build, stop and instruct the user to install it (exact commands + what -> it's for) and wait — don't scaffold-and-leave-unverified, and don't claim a stage passed -> without running it. Full criteria, fixture design rules, and definitions of done: -> `references/testing-and-validation.md`. 
+Report: the two-path choice, the schema decisions (`SCHEMA_DECISIONS.md`), which `max_level` runs +today and what each level emits (on the fixture, both projections), the distribution artifacts +(PyPI package + version, Release binaries, brew formula, `release.yml`), the `README.md` and the +`CLAUDE.md`/`AGENTS.md` guide, and the diff summary. Confirm the **modularity** checks from +`analyzer-architecture.md` and the **schema gates** from `testing-and-validation.md` actually hold. +**Hand-off to cldk-sdk-frontend:** the SDK binding is a *separate* major release (`§ c`) — it +revises the Pydantic models to the v2 schema while keeping the same public API. Hand over a sample +`analysis.json` (each level), the schema contract + `SCHEMA_DECISIONS.md`, the CLI `--help`, and +the published package name + version to pin. + +> **Never fake verification.** Every level's gate must actually run. If a required tool is found +> missing mid-build, stop and instruct the user to install it and wait. Full criteria, fixture +> design, and definitions of done: `references/testing-and-validation.md`. ## Guardrails -- **Modularity is a success criterion, not a nicety.** A monolithic analyzer that emits valid - `analysis.json` has met the schema bar and *failed* the maintainability bar — both are - required. Mirror `codeanalyzer-python`'s package structure (`references/analyzer-architecture.md`): - a delegating `core` (never inlined analysis, never a hardcoded `entrypoints: {}`), a cohesive - symbol-table builder split by node kind (not a flat pile of free functions), the framework - backend isolated in its own subpackage, and a real pluggable layer — `analysis/` (pass + - registry) and `frameworks/` (finder base), scaffolded even when the built-in pass list is - empty. `codeanalyzer-ts` is the anti-example of every one of these; do not reproduce it. 
-- **The schema contract is the success criterion.** An analyzer that runs but emits - non-conformant JSON has failed the real job — the SDK can't load it. Mirror the schema - **comprehensively** (`schema-reference.md`) and prove it by validating output against the - SDK `Application` Pydantic model at every level. A thin schema that "looks right" but - drops fields is a silent failure. -- **Expand the schema for the language — that's a feature, not a deviation.** Keep the - invariant spine (root keys, Module→Class/Callable nesting, identity-only edges, - `signatureOf()`), then add the target language's own node kinds and fields as first-class - data rather than forcing it into the Java/Python mold. The contract you design here is what the - frontend skill encodes as SDK models, so record every new kind/field in `SCHEMA_DECISIONS.md`. - See the expansion rubric in `schema-reference.md`. -- **Don't fake the call graph.** Identity-only edges must reference signatures that actually - exist in the symbol table, produced by the same `signatureOf()`. Dangling edges are worse - than no edges. -- **Scope discipline.** This skill builds the *analyzer* and its distribution — nothing in a CLDK - SDK repo. Wiring the analyzer into the Python/TS/… SDKs is **cldk-sdk-frontend**. Enriching an - *existing* analyzer with a new contribution point is `codeanalyzer-extension-builder`. -- **No invented tooling.** If a recommended parser/resolver doesn't exist for the language, - say so and fall back per the menu's reasoning (compiler API → tree-sitter + external - resolver → Joern), rather than inventing a package name. -- **Path predicates must operate on relative paths.** Any skip predicate (`IsVendored`, - `IsTestFile`, or a custom equivalent) applied to an absolute file path will silently match - directory segments from the analyzer's own source tree and discard all files under them. 
- Apply every such predicate to the path relative to the project root — never to the absolute - path. This is an invisible failure: the analyzer compiles cleanly, all tests pass on the - project, and the symbol table is empty. -- **Every language-specific schema field needs a test that asserts its value.** Pydantic - validation confirms the JSON is structurally well-formed; it does not confirm that - language-specific fields are correctly populated. For every field added beyond the - Java/Python spine, write at least one test asserting a known concrete value. A field with - no value test is guaranteed to break silently when the builder logic changes. +- **The schema is the success criterion.** An analyzer that runs but emits non-v2 JSON has failed + the real job — the SDK can't load it, and the Neo4j graph won't match. Validate output against + the SDK `Application` model at every level, in both projections. Mirror the schema + **comprehensively** (`schema-reference.md`); a thin schema that "looks right" but drops fields is + a silent failure. +- **Additive, never rewriting.** Each level only *adds* nodes/edges (plus the one `callee` + refinement). `L1 ⊆ L2 ⊆ L3 ⊆ L4` is a CI-checkable superset gate. If a "higher" level would + rewrite a lower level's fact, the model is wrong — fix the model. +- **Hold the parity line.** The shared vocabulary (node kinds, edge lists, `can://` grammar) is + identical across analyzers; language extras are **additive** and recorded in `SCHEMA_DECISIONS.md`. + This is what lets the SDK model the schema once and the Neo4j schema be one contract. +- **Modularity is a success criterion.** Mirror `codeanalyzer-python`'s structure — delegating + `core`, a builder split by node kind, the framework backend and the `neo4j/` projection isolated + in their own subpackages, a real pluggable pass layer. `codeanalyzer-ts`'s original monolith is + the anti-example. +- **Two projections, always.** JSON and Neo4j are co-primary. 
Neo4j is always full-depth; levels + gate the JSON path only. +- **No invented tooling.** If a recommended parser/resolver/oracle doesn't exist for the language, + say so and fall back per the menu's reasoning, rather than inventing a package name. +- **Scope discipline.** This skill builds the *analyzer* and its distribution. Wiring it into the + Python/TS/… SDKs is **cldk-sdk-frontend**; enriching an existing analyzer with a contribution + point is `codeanalyzer-extension-builder`. diff --git a/skills/codeanalyzer-backend/references/analyzer-architecture.md b/skills/codeanalyzer-backend/references/analyzer-architecture.md index 376ab35..ba8cc0f 100644 --- a/skills/codeanalyzer-backend/references/analyzer-architecture.md +++ b/skills/codeanalyzer-backend/references/analyzer-architecture.md @@ -28,16 +28,21 @@ codeanalyzer/ core.py # ORCHESTRATOR ONLY. Delegates every phase; inlines no analysis logic. options/ # CLI option / AnalysisOptions model config/ # static / environment config, distinct from CLI options - schema/ # the Pydantic (or native) models — the data contract - syntactic_analysis/ # symbol-table construction (the per-file builder) - semantic_analysis/ # call-graph construction - call_graph.py # the resolver-based graph + graph<->schema adaptation + schema/ # the node/edge models — the v2 data contract (canonical-schema.md) + syntactic_analysis/ # L1: the tree builder (per-file, to callable depth) + call nodes + semantic_analysis/ # L2: call graph; L3/L4: the dataflow passes (cfg/pdg/sdg) + call_graph.py # the resolver-based call_graph + graph<->schema adaptation / # the heavy framework backend (joern/wala/svf), ISOLATED in its own subpackage + neo4j/ # the CO-PRIMARY projection: project() -> GraphRows -> cypher/bolt + schema catalog analysis/ # the PLUGGABLE pass layer (registry + AnalysisPass base) frameworks/ # entrypoint-finder base + concrete finders, built ON the pass layer utils/ # logging, progress, fs helpers — no analysis logic ``` +The 
`neo4j/` subpackage is **not optional** — the Neo4j graph is a co-primary output +(`neo4j-projection.md`), so its seam exists in the skeleton like any other, isolated behind +`project()`. + Not every language needs every box on day one (the framework backend and pass finders may ship empty), but the **skeleton and the seams must exist from the start**, because that is what makes the analyzer extensible without a rewrite. diff --git a/skills/codeanalyzer-backend/references/backend-recipe.md b/skills/codeanalyzer-backend/references/backend-recipe.md index d01d676..a4ae95f 100644 --- a/skills/codeanalyzer-backend/references/backend-recipe.md +++ b/skills/codeanalyzer-backend/references/backend-recipe.md @@ -27,10 +27,12 @@ note when they're the *same* tool. TS's checker does both; tree-sitter languages need a separate resolver (an LSP or a type checker). This single fact drives steps 5 and 6. -## 2. Mirror the canonical schema, then extend at the leaves -Reproduce `Application { symbol_table: Map, call_graph: Edge[] }`, the -Module → Class/Callable hierarchy, identity-only edges that reference signature strings with -a provenance tag, and a Callsite that holds the rich per-call metadata. That spine is the +## 2. Mirror the canonical schema (v2), then extend at the leaves +Reproduce the **additive tree** of `canonical-schema.md`: `application → symbol_table{module} → +types{}/functions{} → callables{} → body{}`, with `can://` ids, spans (byte offsets), module-level +`source`, and the split identity-only edge lists (`call_graph` at the application; `cfg`/`cdg`/ +`ddg`/`summary` on the callable). Edges reference node ids with a `prov` tag; call sites are `call` +nodes in `body`. That spine is the invariant. 
**Then expand the schema to capture what's idiomatic in the target language as first-class data** — add node kinds (interfaces/type-aliases/enums for TS; structs/interfaces for Go; traits/impls for Rust), typed fields (receiver types, async/unsafe flags, generics), @@ -58,25 +60,23 @@ JS/TS reads `tsconfig.json` and ensures `node_modules`. Make it **idempotent**, and **degrade to partial types rather than crashing**. Full detail and timing (source-level vs bytecode resolvers): `project-materialization.md`. -## 5. Build the structural symbol table (level 1, part 1) -Walk the parse tree per file and populate Module → {imports, comments, classes, -interfaces/types/enums, functions, module vars}; each class → methods/properties; each -callable → params, return type, decorators, locals, spans, raw code, and the **unresolved -call sites** (callee name + receiver expr + arg exprs + position, with `callee_signature` -left null). Stamp per-file caching metadata (content hash, mtime, size). This step records -call sites but doesn't resolve them into edges — that's the cheap next step (still level 1; -type fields may still be filled here if your resolver is a same-tool checker). Do this -file-by-file, modeled on how Java's `SymbolTable.extractAll` and Python's `core.py` iterate -the project — see `symbol-table-construction.md`. - -## 6. Build the resolver-based call graph (level 1, part 2 — cheap, strictly additive) -This is **cheap and part of the level-1 analysis**: the same Tier-1 resolver already loaded for -the symbol table resolves call sites into edges. For each recorded call site, map the callee to -a declaration, write its signature into `callee_signature` (**backfilling the site in place**), -and emit an identity-only edge `source_sig → target_sig` with `provenance` set to your resolver -(e.g. `"tsc"`). Handle constructors/`new`, method dispatch via receiver type, and an explicit -unresolved-fallback path (record the site, skip the edge — never crash). 
Never mutate the symbol -table beyond filling `callee_signature`. +## 5. L1 — build the tree (symbol table) +Walk the parse tree per file and populate the module → `{imports, types, functions}`, each type → +`callables`/`fields`, each callable → params, return type, decorators, spans (byte offsets), and +its **`call` nodes in `body`** (callee name + receiver expr + arg exprs + span, `callee: null`). +Store the file's text once as the module's **`source`** (all node text slices off it). Stamp +per-file caching metadata (content hash, mtime, size). This step records call sites but doesn't +resolve them into edges — that's L2. Do it file-by-file, modeled on Java's `SymbolTable.extractAll` +and Python's `core.py` — see `symbol-table-construction.md`. + +## 6. L2 — the resolver-based call graph (cheap, strictly additive) +The same Tier-1 resolver already loaded for the tree resolves the `call` nodes into edges. For +each `call` node, map the callee to a declaration, backfill its **`callee`** id in place +(`null → id`, the one sanctioned mutation), and emit a `call_graph` edge `{src, dst}` (both +callable ids) with `prov` set to your resolver (e.g. `["tsc"]`). Handle constructors/`new`, method +dispatch via receiver type, and an explicit unresolved-fallback (record the `call` node, skip the +edge — never crash). Never mutate the tree beyond filling `callee`. `call_graph` edges are +**immutable once written** — never re-anchored to a statement at L3. This base graph comes from the **Tier-1 resolver** and is deliberately lightweight — no points-to, dataflow, or k-CFA. 
*Don't call the tiers "whole-program vs not": once deps are diff --git a/skills/codeanalyzer-backend/references/canonical-schema.md b/skills/codeanalyzer-backend/references/canonical-schema.md index 5498fe0..7d3d8aa 100644 --- a/skills/codeanalyzer-backend/references/canonical-schema.md +++ b/skills/codeanalyzer-backend/references/canonical-schema.md @@ -1,118 +1,264 @@ -# The canonical CLDK analysis contract - -Every CLDK analyzer — Java, Python, and any new language — emits the **same shape** of -JSON so the SDK facades can parse them interchangeably. A new analyzer that drifts from -this contract is the single most common way a language pack fails: the analyzer "works" -in isolation but the Python SDK can't load it. Treat this file as the spec the generated -analyzer's output must satisfy, and as the source of truth for the Pydantic models you -add to the SDK. - -This file states the *rules*. For the exhaustive, field-by-field spec derived from the SDK -Pydantic models — the thing the generated analyzer must mirror comprehensively — see -`schema-reference.md`. The authoritative model code is -`codeanalyzer-python/codeanalyzer/schema/py_schema.py` (identity-only, recommended) and -`python-sdk/cldk/models/java/models.py` (legacy, rich-edge). - -## The three invariants - -1. **One root object, two required keys.** Output is - `Application { symbol_table: Map, call_graph: Edge[] }` - plus optional `entrypoints`. `symbol_table` is keyed by **file path** (relative to the - project root, stable across runs). `call_graph` is a flat list of edges. - -2. **Identity-only edges** (for new analyzers). A call-graph edge carries only `source` and - `target` — both **signature strings** that must exactly equal a `Callable.signature` - already in the symbol table. The rich per-call metadata (receiver expression, argument - types, line/column, resolved callee) lives on a `Callsite` inside the **caller's** - `call_sites`. 
This separation is what lets the SDK build a NetworkX graph whose nodes are - the symbol-table callables. If `source`/`target` don't byte-match a real signature, the - graph has dangling nodes. - *Caveat:* the **Java** analyzer is a legacy exception — its `JGraphEdges` embed rich - `JMethodDetail` objects instead of bare strings. Do **not** copy that for a new language; - follow the Python identity-only model (your recipe's step 2 mandates it). See - `schema-reference.md` § "The one design choice". - -3. **`signatureOf()` is the linchpin.** Define exactly **one** canonicalizer in the - analyzer that turns a declaration into its signature string, and use it everywhere a - signature is produced — when naming a `Callable`, when writing `callee_signature` on a - `Callsite`, and when emitting edge `source`/`target`. Caller-side and callee-side ids - must be produced by the same function so they are identical. Constructors normalize to a - single convention (Python uses `ClassName.__init__`; pick the target language's - equivalent and apply it consistently). When in doubt, prefer a fully-qualified, - human-readable string like `module.Class.method` over an opaque hash — downstream LLM - consumers read these. - -## JSON conventions (non-negotiable for SDK compatibility) - -- **snake_case keys.** Java emits via Gson with - `LOWER_CASE_WITH_UNDERSCORES`; Python via Pydantic's snake_case defaults. A new analyzer - in any host language must serialize keys in snake_case so the shared SDK models parse it. -- **`analysis.json` is the only facade-visible artifact.** Whatever the analyzer does - internally (caches, intermediate DBs), the contract the SDK depends on is a single - `analysis.json` (or compact JSON on stdout when no output dir is given). -- **Round-trip safety.** Open-vocabulary fields (`provenance`, `tags`, `detection_source`) - are plain strings/string-maps so a persisted `analysis.json` loads even if the producing - extensions aren't installed. 
Don't model them as closed enums. - -## Core node types - -These are the canonical Python field names. For a new language, replicate the **same field -names and nesting**; add language-specific node kinds rather than renaming the shared ones. - -### Module (a compilation unit / file) -`file_path`, `module_name`, `imports[]`, `comments[]`, `classes{sig→Class}`, -`functions{sig→Callable}`, `variables[]`, plus caching metadata `content_hash`, -`last_modified`, `file_size`. - -### Class -`name`, `signature` (e.g. `module.ClassName`), `comments[]`, `code`, `decorators[]`, -`base_classes[]` (signature strings), `methods{sig→Callable}`, `attributes{name→Attr}`, -`inner_classes{sig→Class}`, `start_line`, `end_line`. - -### Callable (function / method / constructor) -`name`, `path`, `signature` (e.g. `module.Class.method`), `comments[]`, `decorators[]`, -`parameters[]`, `return_type`, `code`, `start_line`/`end_line`/`code_start_line`, -`accessed_symbols[]`, **`call_sites[]`** (the unresolved-then-backfilled call records), -`inner_callables{}`, `inner_classes{}`, `local_variables[]`, `cyclomatic_complexity`, -`is_entrypoint`, `entrypoint_framework`. - -### Callsite (rich per-call metadata; lives on the caller) -`method_name`, `receiver_expr`, `receiver_type`, `argument_types[]`, `return_type`, -**`callee_signature`** (null when the site is first recorded; backfilled in place when the -resolver call graph is built), -`is_constructor_call`, and `start_line`/`start_column`/`end_line`/`end_column`. - -### CallEdge (identity-only) -`source` (caller signature), `target` (callee signature), `type` (`"CALL_DEP"`), -`weight` (int, accumulated when merging backends), `provenance[]` (e.g. `["tsc"]`, -`["jedi","joern"]`), `tags{}` (free-form, extension-namespaced). - -### Entrypoint (optional) -`signature` (references a Callable), `framework`, `detection_source` -(`decorator|base_class|url_resolver|...|extension`), plus flat optional route/method -fields and a free-form `tags{}`. 
- -## Mapping the contract onto a new language - -Keep the spine identical; extend at the leaves: - -| Canonical concept | TypeScript adds | Go adds | -| --- | --- | --- | -| Class | `interface`, `type`-alias, `enum` as sibling node kinds | `struct`, `interface` | -| Callable | arrow functions, methods, getters/setters | functions, methods (receiver type), closures | -| `base_classes` | `extends` + `implements` chains | embedded structs / satisfied interfaces | -| decorators | TS decorators (`@Injectable`) | struct tags (in `tags`) | - -When you introduce a new node kind, give it its own `signature` produced by the same -`signatureOf()`, so edges can point at it. - -## How the SDK consumes this - -The SDK defines a parallel set of Pydantic models per language under -`python-sdk/cldk/models//models.py` (e.g. `TSApplication`, `TSModule`, `TSCallable`, -`TSCallEdge`). They must mirror these field names so `Application(**json.load(...))` -validates. The Java models (`cldk/models/java/models.py`) and the re-exported Python models -(`cldk/models/python/__init__.py`) are the two worked examples to copy from — copy the one -whose invocation pattern (subprocess vs in-process) matches your analyzer. The SDK side of -this — the per-language models and facade — is built by the **cldk-sdk-frontend** skill (its -`python-sdk-wiring.md`), not here. +# The canonical CLDK analysis schema (v2) — the keystone + +This is the contract every CLDK analyzer emits and every SDK consumes. It is the **single +source of truth** for this skill: the analyzer you build (or migrate) exists to produce this +shape, in **two projections** — `analysis.json` and a Neo4j graph — and the SDK models mirror +it. `schema-reference.md` is the field-by-field appendix; this file states the model. 
+ +## The one idea: an additive analysis paradigm + +> **Codeanalyzer is an additive analysis paradigm: each analysis level is the same tree grown +> one layer deeper, plus one edge family over the new layer.** + +There is exactly **one structure** — a tree of nodes with typed edges laid over it (a Code +Property Graph). Every "section" anyone has ever named — symbol table, call graph, CFG, PDG, +SDG, taint — is a **projection of that one structure**, not a separate thing. Analysis +**levels** are how deeply the structure is populated; they only ever *add*, never rewrite. + +### The atom + +One **scale-free node** — a region of code — is the whole vocabulary. A `file`, a `struct`, a +`method`, a `statement` are not different kinds of thing; they are the same node at different +granularity. Every node has: + +- an **`id`** (see § Identity), +- a **`kind`** (the node-kind ladder below), +- a **`span`** (the one universal attribute — where in source it lives), +- **children** (containment), and +- **edges** (typed overlays). + +### Two relations, and only two + +1. **Containment** — the **single-parent** relation. Every node has exactly one parent (the + root `application` excepted). *Exactly one* is what makes it a **tree**, and what + distinguishes it from the overlays. +2. **Typed edges** — the multi-valued overlays: `call_graph`, `cfg`, `cdg`, `ddg`, `param_in`, + `param_out`, `summary`. A node has one parent but many edges. + +A node + containment tree + typed edge overlays **is a CPG.** Hold this and the rest follows. + +## The hierarchy (named-map containment) + +Containment above the callable is expressed as **named maps** — the classic symbol-table +shape, keyed for lookup. 
The tree grows *downward* as the level rises: + +``` +application ← the root; carries an id + symbol_table: { : module } ← L1 + module: { types{}, functions{} } ← L1 (per-file / compilation-unit container) + type: { callables{}, fields{} } ← L1 + callable: { body{}, cfg[], cdg[], ddg[], summary[] } + body: { : node } ← L3+ (statements, then synthetic vertices) + call_graph[], param_in[], param_out[] ← cross-function edges, at the application scope +``` + +- **Above the callable**: name-keyed maps (`types`, `functions`, `callables`) — each node + carries its full `id`. +- **At the callable**: `body` is the container that grows at L3+. It is a map keyed by the + node's **local id** (a source position `line:col`, or an `@tag` for synthetic vertices). +- **Edges live at the lowest common ancestor of their endpoints**: intra-callable edges + (`cfg`/`cdg`/`ddg`/`summary`) hang on the callable; cross-callable edges + (`call_graph`/`param_in`/`param_out`) hang on the application. + +### Node-kind ladder + +``` +application → file/module → type (class|struct|interface|enum|…) → callable (function|method|constructor) + → statement (statement|call|return|branch|loop|…) → [expression, opt] +``` + +plus the **synthetic** vertices introduced at L4: `entry`, `exit`, `formal_in`, `formal_out`, +`actual_in`, `actual_out`. A node is therefore *either an AST region or a synthetic analysis +vertex*; both fit the tree (synthetic vertices are children of the callable or of a call-site +statement). + +## Identity + +Two tiers, and the boundary is the **callable leaf line** — the same line where L1 stops. + +- **Durable ids (≥ callable)** — files, modules, types, callables get stable + [`cldk://`-style](../../cldk-sdk-frontend/references/schema-contract.md) URIs that survive + re-analysis and are what external tools (SCIP export, cross-repo joins) address. 
The grammar + is a **containment path** with an application segment so multiple apps in one language don't + collide: + + ``` + can:////// + can://go/myapp/src/util.go/Hasher/Hash(string)uint64 + ``` + +- **Ordinal ids (< callable)** — statements and synthetic vertices are addressed *within* their + callable by a **source position** (real nodes) or an **`@tag`** (synthetic): + + ``` + @: e.g. …/Hash(string)uint64@15:2 (a statement) + @ e.g. …/Hash(string)uint64@entry (synthetic) + …@formal_in:0, …@16:2/actual_in:0 + ``` + + Positions are addressable (the SDK can expose `flows_to_statement("util.go:42")` as a + line-level query) and unique within a single analysis when they carry `line:col` — a bare + line is **not** unique (`if err != nil { return err }`), so always keep the column. These are + content-stable within one run; they are **not** promised across edits (analysis is recomputed + wholesale, so cross-edit durability is a non-goal below the callable line). + +The delimiters `/`, `@`, `:` are chosen to not collide with the durable `#`/symbol grammar; the +`can://` scheme and the app segment are extensions to be kept in lockstep with the SDK's +`schema-contract.md` and the upstream `cldk://` RFC. + +## The levels (what each one grows) + +The levels are progressive population of the one tree, each additive over the last — +`L1 ⊆ L2 ⊆ L3 ⊆ L4`, superset **modulo null-refinement** (see § Monotonicity). Depth grows only +at **L1 and L3**; L2 adds only edges; L4 adds synthetic vertices + edges. 
+ +| Level | Grows (nodes) | Adds (edges) | Cost / substrate | Flag | +| --- | --- | --- | --- | --- | +| **1** | the tree to **callable** depth, + `call` nodes in `body` (call sites, `callee` unresolved) | — | cheap, parser + resolver | `-a 1` | +| **2** | `callee` on each `call` node backfilled (`null → id`) | `call_graph` (callable → callable) | cheap | `-a 2` | +| **3** | the **rest of `body`** (non-call statements) under each callable | `cfg`, `cdg`, `ddg` (**syntactic**) — all intra-callable | heavy, **AST-only, per-callable parallel** | `-a 3` | +| **4** | **synthetic** param vertices (formal/actual in-out) | `param_in`, `param_out`, `summary`, + `ddg` (**semantic**, alias-aware) | heaviest: **needs the points-to oracle** + summary fixpoint | `-a 4` | + +`body` therefore begins at **L1** holding just the `call` nodes (so `get_call_sites` is an L1 +accessor, preserving the old SDK surface) and completes at **L3** with the remaining statements. +Depth grows at L1 (tree + call sites) and L3 (full body); L2 is a pure refinement + edge add; L4 +adds synthetic vertices + cross edges. + +`-a 3` implies `-a 2`; `-a 4` implies `-a 3`. Framework enrichment (Joern/WALA) and points-to +precision are an **orthogonal axis** — provenance-merged evidence into an existing edge family, +**not** a level. `max_level` in the payload declares which level was populated; consumers +**read it** rather than sniffing for keys. 
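
A minimal Python sketch of the "read `max_level`, don't sniff for keys" rule. The helper name and the `EDGE_FAMILY_LEVEL` table are illustrative assumptions, not an SDK API; the level assigned to each edge list is taken from the levels table above.

```python
# Hypothetical helper: names are illustrative, not part of any CLDK SDK.
# Level at which each edge list is first populated (per the levels table).
EDGE_FAMILY_LEVEL = {
    "call_graph": 2,
    "cfg": 3,
    "cdg": 3,
    "ddg": 3,
    "param_in": 4,
    "param_out": 4,
    "summary": 4,
}


def has_edge_family(payload: dict, family: str) -> bool:
    """Return True if the payload's declared max_level covers the edge list.

    The decision is made from `max_level` alone; the presence or absence of
    the key in the payload is deliberately not inspected.
    """
    return payload["max_level"] >= EDGE_FAMILY_LEVEL[family]
```

One reading of why this matters: since absence is the "no fact" encoding, a missing `cfg` list could mean either "no CFG edges" or "not computed", and only `max_level` distinguishes the two.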
+ +### Edge families and their placement + +| Edge list | Level | Endpoints | Lives on | Notes | +| --- | --- | --- | --- | --- | +| `call_graph` | 2 | callable → callable | application | `prov[]`, `weight`; immutable once written (never re-anchored to a statement) | +| `cfg` | 3 | statement → statement | callable | `kind`: `fallthrough`\|`true`\|`false`\|`switch_case`\|`loop_back`\|`exception`\|`return`\|… | +| `cdg` | 3 | statement → statement | callable | control dependence (from post-dominance) | +| `ddg` | 3→4 | statement → statement | callable | `var` (k-limited access path), `prov`: `["ssa"]` = **syntactic** (L3), `["points-to"]` = **semantic** (L4) | +| `summary` | 4 | actual_in → actual_out (same call) | callable | transitive intra-caller shortcut | +| `param_in` | 4 | actual_in → formal_in | application | argument into callee | +| `param_out` | 4 | formal_out → actual_out | application | result back to caller | + +Each list is keyed **by its type** (the list name *is* the type; no `type` field). Every edge +record is `{ src, dst, …attrs }` referencing node ids. **No dangling endpoints** — every `src` +and `dst` must resolve to a node in the tree (the same invariant at every level). + +## Source and text: one blob per module, everything slices off it + +The tree carries structure; source **text** is stored **once per file, on the module node**, as +`source`, and every node's text is a **slice** of it: + +- `get_method_body(sig)` → `module.source[callable.span.bytes]` +- a statement's text, a call's receiver expression → the same slice by its node span. + +To make slicing O(1), **spans carry byte/char offsets** alongside `line:col` (`line:col` to +address and display, offsets to slice). This is the minimum-size, self-contained choice — one +copy of each source file, zero per-node duplication, and it subsumes any per-callable `code` +field. + +## Monotonicity (the invariant that makes "additive" true) + +Levels **add** facts; they never contradict or delete. 
Exactly two sanctioned changes: + +1. **Additive** — new nodes deeper in the tree, new entries in an edge list. +2. **Refinement** — an unresolved fact becoming resolved: `callee` on a call node `null → id`. + Null-to-value only, never value-to-different-value. + +So `analysis.json(-a 1) ⊆ … ⊆ analysis.json(-a 4)`, a **CI-checkable superset gate**. The one +subtlety is the DDG: L3 emits the **syntactic** (name-equality, no-alias) def-use — a strict +subset — and L4 **adds** the alias-derived edges via points-to. This holds *because* the +precision posture is weak-update / over-approximate (no strong updates through aliases); a +strong update would remove an edge and break the chain. The `prov` tag (`ssa` vs `points-to`) +makes the syntactic/semantic split visible in the data. + +## Conventions + +- **snake_case keys**, everywhere, in every host language (Gson `LOWER_CASE_WITH_UNDERSCORES`, + Pydantic defaults) so one set of SDK models parses every analyzer. +- **A fact is present or absent — there is no `null`** (except the sanctioned `callee: null` + refinement slot). Absence *is* the "no fact" encoding; do not emit empty-vs-null noise. +- **`analysis.json` is one facade-visible artifact** (or compact JSON on stdout); the Neo4j + graph is the co-primary projection (below). Caches/DBs are internal. +- Open-vocabulary fields (`prov`, `tags`) are plain strings so a persisted payload loads even + without the producing extension installed. + +## Two projections of the one structure + +The same tree + overlays is emitted two ways; they must agree. + +- **`analysis.json`** — this document: named-map tree, `body` maps, split edge lists, + `source` per module. The facade contract. +- **Neo4j** — a near-identity projection (`neo4j-projection.md`): every node → a node row, + **containment → typed `HAS_*`/`DECLARES` edges** (the tree rendered as edges, since a graph DB + has no nesting), every overlay edge → a typed relationship. 
Node families and the `--app-name` + anchor must match this schema. The Neo4j graph is **always full-depth** — analysis levels gate + the JSON path only. + +Building **both** is a first-class deliverable for every analyzer (`§ a` of the skill), not an +afterthought. + +## Worked example (L1 → L4, additive) + +```jsonc +{ + "schema_version": "2.0.0", "language": "go", "max_level": 4, "k_limit": 3, + "application": { + "id": "can://go/myapp", "kind": "application", + "symbol_table": { + "src/util.go": { // L1 + "id": "can://go/myapp/src/util.go", "kind": "module", "package": "util", + "source": "package util\n\nimport \"hash/fnv\"\n\nfunc (h Hasher) Hash(s string) uint64 {\n\th := fnv.New64()\n\th.Write([]byte(s))\n\treturn h.Sum64()\n}\n", + "types": { + "Hasher": { + "id": "can://go/myapp/src/util.go/Hasher", "kind": "struct", + "span": { "start":[10,1], "end":[40,1], "bytes":[0,400] }, + "callables": { + "Hash(string)uint64": { + "id": "can://go/myapp/src/util.go/Hasher/Hash(string)uint64", "kind": "method", + "span": { "start":[14,1], "end":[22,1], "bytes":[42,180] }, + "body": { // L3+ + "@entry": { "kind":"entry" }, + "15:2": { "kind":"statement", "span":{ "start":[15,2],"end":[15,18],"bytes":[84,100] } }, + "16:2": { "kind":"call", "span":{...}, "callee":"can://go/myapp/src/fnv.go/New64()" }, + "17:2": { "kind":"return", "span":{...} }, + "@exit": { "kind":"exit" }, + "@formal_in:0": { "kind":"formal_in", "of":"s" }, // L4 + "@formal_out": { "kind":"formal_out", "of":"$ret" }, // L4 + "16:2/actual_in:0": { "kind":"actual_in", "of":"arg0", "parent":"16:2" }, // L4 + "16:2/actual_out": { "kind":"actual_out", "of":"$ret", "parent":"16:2" } + }, + "cfg": [ {"src":"@entry","dst":"15:2","kind":"fallthrough"}, // L3 + {"src":"15:2","dst":"16:2","kind":"fallthrough"} ], + "cdg": [ {"src":"@entry","dst":"15:2"} ], // L3 + "ddg": [ {"src":"15:2","dst":"17:2","var":"h","prov":["ssa"]}, // L3 syntactic + {"src":"16:2","dst":"17:2","var":"h","prov":["points-to"]} ], // L4 
semantic + "summary": [ {"src":"16:2/actual_in:0","dst":"16:2/actual_out"} ] // L4 + } + } + } + }, + "functions": {} + } + }, + "call_graph": [ {"src":"can://go/myapp/src/util.go/Hasher/Hash(string)uint64", // L2 + "dst":"can://go/myapp/src/fnv.go/New64()","prov":["go/types"],"weight":1} ], + "param_in": [ {"src":"…/Hash(string)uint64@16:2/actual_in:0","dst":"…/New64()@formal_in:0"} ], // L4 + "param_out": [ {"src":"…/New64()@formal_out","dst":"…/Hash(string)uint64@16:2/actual_out"} ] // L4 + } +} +``` + +Every level only *added* — a key, `body` nodes, or edge entries. Nothing was rewritten except +the `callee: null → id` backfill. That is the additive paradigm made literal. + +## Cross-language parity clause + +The **vocabulary is shared; language extras are additive.** Node `kind`s, edge list names, edge +`kind`/`prov` values, and the shapes above are identical across analyzers. A language **adds** +kinds (Go `defer_resume` CFG edges, Rust `unsafe` flags, TS `interface`/`enum` types) — recorded +in its `.claude/SCHEMA_DECISIONS.md` — but must **never rename or repurpose** a shared name. This +is what lets the SDK model the schema **once** (one `Node`, one `Edge`, one `Application`), and +what lets the Neo4j schema be a single versioned contract. Hold the parity line, or the whole +one-model premise collapses. diff --git a/skills/codeanalyzer-backend/references/dataflow-graphs.md b/skills/codeanalyzer-backend/references/dataflow-graphs.md index e039059..b73b181 100644 --- a/skills/codeanalyzer-backend/references/dataflow-graphs.md +++ b/skills/codeanalyzer-backend/references/dataflow-graphs.md @@ -78,11 +78,19 @@ Cross-function edges (`CALL`, `PARAM_IN/OUT`, `SUMMARY`) reference both endpoint referenced signature exists in the symbol table, every referenced node_id exists in that function's emitted graph. 
-## Emission — `analysis.json` sections and flags - -Graphs are emitted as an **optional top-level section**, present from level 3, preserving the -facade invariant that `analysis.json` is the single facade-visible output. The `functions` map -(CFG + PDG) is the **level-3** payload; `sdg_edges` is added at **level 4**: +## Emission — where the graphs live in the tree + +> **Schema v2 supersedes the standalone `program_graphs` section below.** In the canonical schema +> (`canonical-schema.md`), dataflow is **not** a separate top-level object — it grows *inside the +> tree*: each callable gains a `body{}` map of statement/vertex nodes plus the intra-callable edge +> lists `cfg`/`cdg`/`ddg`/`summary`, and the application gains the cross-callable `param_in`/ +> `param_out` lists. Node endpoints are `can://…@line:col` ids, not `(signature, node)` pairs. +> Read `canonical-schema.md` for the authoritative shape; the ladder, gates, and construction +> stages in this file are shape-agnostic and still govern. The block below is retained only as the +> conceptual node/edge inventory (kinds, `cfg`/`pdg`/`sdg` families) — map it onto the v2 tree. + +Historically graphs were a top-level `program_graphs` object; the families (CFG, PDG = CDG+DDG, +SDG) and their level assignment are unchanged, only their placement: ```jsonc { diff --git a/skills/codeanalyzer-backend/references/neo4j-projection.md b/skills/codeanalyzer-backend/references/neo4j-projection.md index af3889f..9e47ab6 100644 --- a/skills/codeanalyzer-backend/references/neo4j-projection.md +++ b/skills/codeanalyzer-backend/references/neo4j-projection.md @@ -1,17 +1,23 @@ -# Neo4j projection (optional second output surface) - -Every mature CLDK analyzer now emits **two** projections of the same analysis: the canonical -`analysis.json` (the facade contract, always built) and an **optional Neo4j graph**. 
The graph -is not an ingestion of `analysis.json` — it is an **alternative projection of the same in-memory -IR** (the symbol table + call graph objects), selected by `--emit neo4j`. `analysis.json` -remains the SDK's default contract; the graph is a queryable, incrementally-updatable second -surface. Java, Python, and TypeScript analyzers all ship this; a new analyzer should mirror it -**once level-1 JSON is solid** — treat it as part of the *CLI, caching/incremental, packaging* -stage, not a prerequisite. - -Neo4j is **optional at every layer**: the driver is a lazy/optional dependency (Python/TS import -it on demand; Java loads it reflectively so GraalVM `native-image` can prune it), and nothing in -the JSON path depends on it. +# Neo4j projection (the co-primary output surface) + +Every CLDK analyzer emits **two projections of the one structure** (`canonical-schema.md`): the +`analysis.json` tree and a **Neo4j graph**. They are **co-primary** — building both is a +first-class deliverable, not an afterthought. The graph is not an ingestion of `analysis.json` — +it is a projection of the **same node tree + edge overlays**, selected by `--emit neo4j`. +`analysis.json` is the SDK's default contract; the graph is the queryable, incrementally-updatable +surface. Java, Python, and TypeScript analyzers all ship it. + +**Containment renders as edges.** A graph DB has no nesting, so the schema's containment tree +becomes typed `HAS_*`/`DECLARES` relationships (the `HAS_MODULE`/`DECLARES`/`HAS_CALLABLE`/ +`HAS_CFG_NODE` families below), while the overlay edges (`call_graph`, `cfg`, `cdg`, `ddg`, +`param_in`/`param_out`, `summary`) become their own typed relationships. Node labels are the v2 +node **kinds**; the `can://` id is the merge key. This is a **near-identity** projection of the +JSON tree — the same nodes and edges, rendered as a property graph. 
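
The containment-as-edges rule can be sketched in Python as a walk that turns the nested JSON tree into `(parent, relationship, child)` rows. Which relationship type applies at which level is an assumption here (this file only names the `HAS_MODULE`/`DECLARES`/`HAS_CALLABLE`/`HAS_CFG_NODE` families), and `containment_rows` and `body_node_id` are hypothetical names. No Neo4j driver is involved; the rows are plain tuples.

```python
from typing import Iterator, Tuple

# (parent id, relationship type, child id)
Row = Tuple[str, str, str]


def body_node_id(callable_id: str, local_id: str) -> str:
    """Build a full id for a body node from its callable id and local id.

    Follows the `@` convention in canonical-schema.md: positional local ids
    ("15:2") are joined with "@", synthetic ones ("@entry") already carry it.
    """
    if local_id.startswith("@"):
        return callable_id + local_id
    return f"{callable_id}@{local_id}"


def _callable_rows(parent_id: str, rel: str, callables: dict) -> Iterator[Row]:
    """Yield one row per callable, then one row per node in its body map."""
    for node in callables.values():
        yield (parent_id, rel, node["id"])
        for local_id in node.get("body", {}):
            # Assumption: every body node hangs off its callable via HAS_CFG_NODE.
            yield (node["id"], "HAS_CFG_NODE", body_node_id(node["id"], local_id))


def containment_rows(application: dict) -> Iterator[Row]:
    """Render the named-map containment tree as typed parent -> child rows."""
    app_id = application["id"]
    for module in application.get("symbol_table", {}).values():
        yield (app_id, "HAS_MODULE", module["id"])
        for type_node in module.get("types", {}).values():
            # Assumption: module -> type uses DECLARES, type -> callable HAS_CALLABLE.
            yield (module["id"], "DECLARES", type_node["id"])
            yield from _callable_rows(
                type_node["id"], "HAS_CALLABLE", type_node.get("callables", {})
            )
        # Assumption: module-level functions are also DECLARES children.
        yield from _callable_rows(
            module["id"], "DECLARES", module.get("functions", {})
        )
```

Because every row is keyed by `can://` ids, a real emitter could merge on those ids, so re-running the walk would not duplicate nodes; the overlay edge lists would be emitted by a separate pass.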
+ +Neo4j stays **optional at run time** (you don't need a running DB to emit `analysis.json`): the +driver is a lazy/optional dependency (Python/TS import it on demand; Java loads it reflectively so +GraalVM `native-image` can prune it). "Co-primary" means *the analyzer must be able to produce it*, +not that every run does. ## CLI surface (add to `cli-contract.md`) diff --git a/skills/codeanalyzer-backend/references/schema-design-loop.md b/skills/codeanalyzer-backend/references/schema-design-loop.md index 87a6af1..8170e74 100644 --- a/skills/codeanalyzer-backend/references/schema-design-loop.md +++ b/skills/codeanalyzer-backend/references/schema-design-loop.md @@ -1,10 +1,14 @@ # Schema design as a comparison-and-differentiation loop -Designing the analyzer's schema is **not** "copy Java, bolt on a few fields." It is an -iterative, reflective process: you anchor on the **mature reference analyzers** (currently -**Java** and **Python**; more languages will join as they mature), interrogate how the target -language genuinely differs, and — crucially — **bring every divergence to the user as a -decision** rather than choosing silently. You do this **node by node**, not all at once. +The **shared spine is already designed** — it's the v2 keystone (`canonical-schema.md`): the node +tree, the `can://` ids, the additive levels, the edge families. This loop is **not** re-designing +that; it is confirming the **language-specific expansion** — which `type`/`callable`/`body` kinds, +which `cfg`-edge kinds, and which typed fields this language adds to the spine (the parity clause: +add at the leaves, never rename the shared vocabulary). You anchor on the **keystone plus the +mature reference analyzers** (**Java** and **Python**), interrogate how the target language +genuinely differs, and — crucially — **bring every divergence to the user as a decision** rather +than choosing silently. 
You do this **node by node**, not all at once, recording each answer in +`.claude/SCHEMA_DECISIONS.md`. This loop only *designs the schema* (the analyzer-side types + the SDK `` Pydantic models). Actually walking files to fill the table is a separate stage — see diff --git a/skills/codeanalyzer-backend/references/schema-migration.md b/skills/codeanalyzer-backend/references/schema-migration.md new file mode 100644 index 0000000..e838b2c --- /dev/null +++ b/skills/codeanalyzer-backend/references/schema-migration.md @@ -0,0 +1,87 @@ +# Migrating an existing analyzer to schema v2 (path B) + +For a `codeanalyzer-` that already exists on the **old** schema (flat +`symbol_table: {path → CompilationUnit}` + a `call_graph` of rich or identity edges, per-callable +`code`, `is_*` boolean flags). Moving it to the v2 keystone (`canonical-schema.md`) is a +**major release**: the parsing/resolution guts stay; the *emission layer* is rewritten to produce +the additive tree + typed edges, in both projections. This is a breaking output change — bump the +major version and coordinate the SDK release (`§ c`). + +**Golden rule:** keep everything that *computes* facts (the parser, the resolver, WALA/Jelly/ +go-ssa, the call-graph builder); replace only what *serializes* them. The analyzer already knows +the facts — v2 is a different shape for the same facts, plus deeper ones at L3/L4. + +## Do it level by level, lowest first + +Migrate in the same additive order you'd build a new analyzer, so each step is independently +validatable against the v2 SDK models: + +1. **L1 emission** — the tree + `source` + ids. The biggest structural change; do it first and + get the symbol-table gate green before touching edges. +2. **L2 emission** — the `call_graph` list at application scope. +3. **Neo4j projection** — re-point (or add) the graph emitter at the v2 node/edge families. +4. **L3/L4** — if the analyzer already computes dataflow (e.g. 
Java via WALA's slicer, which + already emits `program_graphs`), remap it into `body` + the split edge lists; otherwise it's new + construction per `dataflow-construction.md`. + +## Field-by-field: old → v2 + +### Root envelope +| Old | v2 | +| --- | --- | +| `{ symbol_table, call_graph }` (two top-level keys) | `{ schema_version, language, max_level, application: { id, symbol_table, call_graph, param_in, param_out } }` | +| — (no version) | `schema_version: "2.0.0"`, `max_level` (authoritative) | +| — (no app identity) | `application.id = can:///` — **new**, disambiguates apps | + +### Container / symbol nodes +| Old | v2 | +| --- | --- | +| `symbol_table[path]` = `CompilationUnit`/`Module` | `symbol_table[path]` = `module` node with `id`, `kind:"module"`, **`source`** (whole file, once) | +| `Type` with `is_interface`/`is_enum`/`is_record`/… booleans | one `type` node with a single **`kind`** (`class`\|`interface`\|`enum`\|`struct`\|…) | +| `CallSite.is_public/is_private/is_protected` booleans | one `access` field (or on the node) | +| flat-string `annotations[]` | structured `decorators[]` (`{name,args,span}`) | +| `thrown_exceptions[]` (Java) | generalized `error_channel[]` | +| per-callable `code` string | **dropped** — `get_method_body` slices `module.source[callable.span.bytes]` | +| `start_line`/`end_line`/`start_column`/`end_column` (flat ints) | `span: { start:[l,c], end:[l,c], bytes:[from,to] }` — **add byte offsets** for O(1) slicing | + +### Edges (the biggest semantic change) +| Old | v2 | +| --- | --- | +| Java **rich edges** (`JGraphEdges` embedding `JMethodDetail`) | **identity-only**: `call_graph: [{ src, dst, prov, weight }]` — ids only; join detail via id | +| identity edges `{ source, target, type:"CALL_DEP", provenance }` | rename keys → `{ src, dst, prov }`; `call_graph` list at application scope | +| call graph mixed granularity | `call_graph` is **callable→callable** and immutable; call-site-level linking is L4 `param_*` | +| 
`program_graphs.functions[sig].cfg.nodes` + `sdg_edges` (Java today) | move nodes into the callable's **`body{}`**; split edges into `cfg`/`cdg`/`ddg`/`summary` (intra) + `param_in`/`param_out` (cross); endpoints become `can://…@line:col` ids | +| `data_dependence: "no-heap"\|"full"` (Java) | this **is** the syntactic/semantic DDG split — emit as `ddg` edges tagged `prov:["ssa"]` (no-heap, L3) vs `prov:["points-to"]` (full, L4) | + +### Identity +| Old | v2 | +| --- | --- | +| `signature` string as the id | keep `signature` as the callable's human-readable field, but the **`id`** is the full `can://////` path | +| `(signature, node)` pair for graph endpoints (Java) | single string id `…@:` (or `@tag` for synthetic) | +| bare `signatureOf()` | unchanged — still the one canonicalizer; it now produces the *last path segment* of the id | + +## Practical mechanics + +- **Wrap, don't rewrite, the model layer.** If the analyzer builds in-memory model objects then + serializes, add a **v2 emitter** that walks those same objects and produces the new shape — the + cleanest diff, and it lets you keep the old emitter behind a flag during transition if useful. +- **Byte offsets:** the parser already has token positions; thread the byte/char offset through to + `span.bytes`. This is the one genuinely new datum L1 needs. +- **`source`:** you're already reading each file — retain its text on the module node instead of + slicing per-callable `code`. +- **Neo4j:** if the analyzer already has a graph projection (Java/Python/TS do), it's largely a + **relabel** to the v2 node/edge families and id scheme; if not, add the `neo4j/` subpackage per + `neo4j-projection.md`. +- **Validate against the SDK v2 models at each level** — the same gates as a new analyzer + (`testing-and-validation.md`), plus a **superset check** if you keep the old emitter: v2 output + must contain every fact the old output did (modulo the deliberate drops above). 
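
The byte-offset and `source` bullets above can be sketched in Python as follows. The conventions are assumptions that this document does not confirm: 1-based lines and columns, columns counted in characters, and `span.bytes` as a half-open UTF-8 byte range. All function names are hypothetical.

```python
from typing import List, Tuple


def line_starts(data: bytes) -> List[int]:
    """Byte offset at which each line begins; index 0 is line 1."""
    starts = [0]
    for i, byte in enumerate(data):
        if byte == 0x0A:  # "\n"
            starts.append(i + 1)
    return starts


def byte_offset(data: bytes, starts: List[int], line: int, col: int) -> int:
    """Convert a 1-based (line, col) position to a UTF-8 byte offset."""
    begin = starts[line - 1]
    end = starts[line] if line < len(starts) else len(data)
    # Columns are assumed to count characters, so measure the prefix in bytes.
    prefix = data[begin:end].decode("utf-8")[: col - 1]
    return begin + len(prefix.encode("utf-8"))


def make_span(
    data: bytes, starts: List[int], start: Tuple[int, int], end: Tuple[int, int]
) -> dict:
    """Build a v2-style span: line:col for display, byte offsets for slicing."""
    return {
        "start": list(start),
        "end": list(end),
        "bytes": [byte_offset(data, starts, *start), byte_offset(data, starts, *end)],
    }


def slice_text(data: bytes, span: dict) -> str:
    """Recover a node's text from the module's source bytes and its span."""
    lo, hi = span["bytes"]
    return data[lo:hi].decode("utf-8")
```

With the file's bytes retained once on the module, a node's text becomes a single slice, which is what makes a per-callable `code` string redundant.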
+ +## Release & coordination + +- **Major version bump** on the analyzer; note the breaking output change in the release notes + (Keep-a-Changelog *Changed/Breaking*). +- **Coordinate with the SDK release (`§ c`):** the frontend skill revises the Pydantic models to v2 + in lockstep, keeping the public API stable. Pin the analyzer version in the SDK only once both + are cut. Until then, the SDK's old models won't parse v2 output — don't publish the analyzer's + new major as the SDK's pinned version prematurely. +- Update the repo's **`CLAUDE.md`** to describe the v2 model (it's now what the analyzer emits). diff --git a/skills/codeanalyzer-backend/references/schema-reference.md b/skills/codeanalyzer-backend/references/schema-reference.md index 83a59cf..9bd595e 100644 --- a/skills/codeanalyzer-backend/references/schema-reference.md +++ b/skills/codeanalyzer-backend/references/schema-reference.md @@ -1,204 +1,118 @@ -# Comprehensive schema reference (derived from the SDK Pydantic models) +# Schema reference (v2) — per-kind fields and edges -This is the **field-by-field** spec the generated analyzer's `analysis.json` must satisfy. It -is derived from the CLDK Python SDK's own Pydantic models — the code that will actually parse -your analyzer's output — so it is the authoritative source, not a paraphrase: +The field-by-field appendix to `canonical-schema.md`. That file states the model (the additive +tree + overlays); this one enumerates every node kind's fields and every edge list's shape, so +the analyzer emits them comprehensively and the SDK models them exactly. Every node shares the +**common node fields**; each kind adds its own. Absent = no fact (no `null`, except the one +sanctioned `callee` slot). -- **Identity-only / recommended** model: `codeanalyzer-python/codeanalyzer/schema/py_schema.py`, - re-exported by `python-sdk/cldk/models/python/__init__.py`. -- **Legacy / rich-edge** model: `python-sdk/cldk/models/java/models.py`. 
+## Common node fields (every node) -> **Mirror it comprehensively.** Reproduce **every** field below for the shared nodes — not a -> convenient subset. Fields you can't populate yet should still exist with sensible defaults -> (empty list, `-1` line numbers, `None`) so the SDK model validates and later passes can fill -> them. Then add the target language's own node kinds. - -## The one design choice: edge model - -The two reference analyzers diverge on call-graph edges. **New analyzers must use the -identity-only (Python) model** — your recipe's step 2 mandates it, and it's what keeps edges -cheap and the graph's nodes equal to the symbol-table callables. - -- **Identity-only (use this):** `call_graph: List[CallEdge]`, where an edge's `source`/`target` - are bare **signature strings** that exactly match a `Callable.signature` in the symbol table. - Rich per-call data lives on `Callsite.callee_signature` inside the caller. -- **Rich-edge (Java legacy — do NOT copy for new languages):** `JGraphEdges.source`/`target` - are `JMethodDetail` objects embedding `klass` + a full `JCallable`. This is heavier and - duplicates symbol-table data. Documented here only so you recognize and avoid it. - -## Root object - -**Recommended (identity-only):** -| field | type | notes | -| --- | --- | --- | -| `symbol_table` | `Dict[str, Module]` | keyed by file path (stable, relative to project root) | -| `call_graph` | `List[CallEdge]` | identity-only edges; empty `[]` for a symbol-table-only run (`-a 1`) | -| `entrypoints` | `Dict[str, List[Entrypoint]]` | optional; default `{}` | - -*Java additionally carries `version: str` and `system_dependency_graph: List[JGraphEdges]`, and -its `call_graph` is `None` (absent) for a symbol-table-only run. 
New languages: prefer `call_graph: []` over -`None`, and only add a `version`/SDG field if you actually produce them.* - -## Module (compilation unit / file) -| field | type | default | -| --- | --- | --- | -| `file_path` | `str` | — | -| `module_name` | `str` | — (Java uses `package_name`) | -| `imports` | `List[Import]` | `[]` | -| `comments` | `List[Comment]` | `[]` | -| `classes` | `Dict[str, Class]` | `{}` (Java: `type_declarations`) | -| `functions` | `Dict[str, Callable]` | `{}` (top-level/module functions) | -| `variables` | `List[VariableDeclaration]` | `[]` | -| `content_hash` | `Optional[str]` | `None` — caching metadata (step 8) | -| `last_modified` | `Optional[float]` | `None` | -| `file_size` | `Optional[int]` | `None` | - -## Class / Type -| field | type | default | -| --- | --- | --- | -| `name` | `str` | — | -| `signature` | `str` | e.g. `module.ClassName` (from `signatureOf()`) | -| `comments` | `List[Comment]` | `[]` | -| `code` | `str \| None` | `None` | -| `decorators` | `List[Decorator]` | `[]` (Java: `annotations: List[str]`) | -| `base_classes` | `List[str]` | `[]` (Java splits `extends_list` + `implements_list`) | -| `methods` | `Dict[str, Callable]` | `{}` (Java: `callable_declarations`) | -| `attributes` | `Dict[str, ClassAttribute]` | `{}` (Java: `field_declarations: List[JField]`) | -| `inner_classes` | `Dict[str, Class]` | `{}` | -| `start_line` / `end_line` | `int` | `-1` | - -*Java type-kind flags worth carrying as language node-kind info: `is_interface`, -`is_enum_declaration`, `is_record_declaration`, `is_annotation_declaration`, `is_inner_class`, -`is_nested_type`, `is_entrypoint_class`, plus `enum_constants`, `record_components`, -`initialization_blocks`.* - -## Callable (function / method / constructor) -| field | type | default | -| --- | --- | --- | -| `name` | `str` | — | -| `path` | `str` | file path of the declaration | -| `signature` | `str` | e.g. 
`module.Class.method` — **the edge id** | -| `comments` | `List[Comment]` | `[]` | -| `decorators` | `List[Decorator]` | `[]` (Java: `annotations`, `modifiers`) | -| `parameters` | `List[CallableParameter]` | `[]` | -| `return_type` | `Optional[str]` | `None` | -| `code` | `str \| None` | `None` | -| `start_line` / `end_line` / `code_start_line` | `int` | `-1` | -| `accessed_symbols` | `List[Symbol]` | `[]` (Java: `accessed_fields`, `referenced_types`) | -| `call_sites` | `List[Callsite]` | `[]` — **recorded during symbol-table build, callees backfilled when the resolver call graph runs** | -| `inner_callables` | `Dict[str, Callable]` | `{}` | -| `inner_classes` | `Dict[str, Class]` | `{}` | -| `local_variables` | `List[VariableDeclaration]` | `[]` (Java: `variable_declarations`) | -| `cyclomatic_complexity` | `int` | `0` | -| `is_entrypoint` | `bool` | `False` | -| `entrypoint_framework` | `Optional[str]` | `None` | - -*Java extras: `is_constructor`, `is_implicit`, `thrown_exceptions`, `declaration`, -`crud_operations`, `crud_queries`. 
Carry constructor-ness for any language (you need it for -the `new`/`__init__` normalization).* - -## Callsite (rich per-call metadata, on the caller) -| field | type | default | +| Field | Type | Notes | | --- | --- | --- | -| `method_name` | `str` | — | -| `receiver_expr` | `Optional[str]` | `None`/`""` | -| `receiver_type` | `Optional[str]` | `None` | -| `argument_types` | `List[str]` | `[]` | -| `return_type` | `Optional[str]` | `None` | -| `callee_signature` | `Optional[str]` | **`None` when recorded; filled in place when the resolver call graph is built** | -| `is_constructor_call` | `bool` | `False` | -| `start_line`/`start_column`/`end_line`/`end_column` | `int` | `-1` | - -*Java adds `argument_expr`, `is_static_call`/`is_private`/`is_public`/`is_protected`/ -`is_unspecified`, `crud_operation`, `crud_query`, and a `comment`.* - -## CallEdge (identity-only — the model to use) -| field | type | default | +| `id` | string | Durable `can://…` path (≥ callable) or `…@line:col` / `…@tag` (< callable). See `canonical-schema.md` § Identity. | +| `kind` | string | The node-kind (below). Closed vocabulary + additive language kinds. | +| `span` | `{ start:[line,col], end:[line,col], bytes:[from,to] }` | `line:col` addresses/displays; `bytes` slices `module.source`. Absent on some synthetic nodes. | +| `parent` | string | **Only** when the container ≠ the enclosing node (synthetic actuals → their call site; materialized blocks/exprs). Implicit otherwise. | + +## Root and container kinds + +### `application` +| Field | Type | Level | Notes | +| --- | --- | --- | --- | +| `id` | string | 1 | `can:///` — the app segment disambiguates apps in one language. | +| `symbol_table` | `{ file → module }` | 1 | Named map, keyed by relative file path (no absolute, no `..`). | +| `call_graph` | `edge[]` | 2 | Cross-function; see § Edges. | +| `param_in` / `param_out` | `edge[]` | 4 | Cross-function; see § Edges. 
| + +Top-level siblings of `application` carry the manifest: `schema_version`, `language`, +`max_level` (authoritative level marker), `k_limit` (present at L4), `analyzer{name,version}`. + +### `module` (per-file compilation unit) +| Field | Type | Level | Notes | +| --- | --- | --- | --- | +| `id`, `kind`, `span` | — | 1 | `kind:"module"`. | +| `package` / `namespace` | string | 1 | Language-native grouping the file belongs to. | +| `source` | string | 1 | **The whole file's text, once.** All node text slices from this. | +| `imports` | `import[]` | 1 | `{ name, path, alias?, span }`. | +| `types` | `{ name → type }` | 1 | Named map. | +| `functions` | `{ sig → callable }` | 1 | Module-level callables (Go/Python/C); empty for class-only languages. | +| `content_hash` | string | 1 | For incremental caching; not identity. | + +### `type` (`class` \| `struct` \| `interface` \| `enum` \| `trait` \| `type_alias` \| …) +| Field | Type | Level | Notes | +| --- | --- | --- | --- | +| `id`, `kind`, `span` | — | 1 | `kind` is the specific type kind, **not** a pile of `is_*` booleans. | +| `base_types` | `id[]` | 1 | `extends`/embeds — durable ids of supertypes. | +| `interfaces` | `id[]` | 1 | `implements`/satisfies. | +| `modifiers` | `string[]` | 1 | `public`/`abstract`/`sealed`/… (language set). | +| `decorators` | `decorator[]` | 1 | Structured `{ name, args[], span }` — **not** flat strings. | +| `callables` | `{ sig → callable }` | 1 | Methods/constructors. | +| `fields` | `{ name → field }` | 1 | `{ id, kind:"field", type, modifiers[], decorators[], span }`. | +| `nesting` | `{ parent?, is_local? }` | 1 | Nested/inner/local flags as data, not booleans on the node. | + +### `callable` (`function` \| `method` \| `constructor` \| `initializer` \| `lambda`) +| Field | Type | Level | Notes | +| --- | --- | --- | --- | +| `id`, `kind`, `span` | — | 1 | `span.bytes` → `get_method_body` = `module.source[bytes]`. 
| +| `signature` | string | 1 | Human-readable; the *last* path segment of the `id`, from the one `signatureOf()`. | +| `parameters` | `param[]` | 1 | `{ name, type, span, is_variadic? }`, ordered. | +| `return_type` | string | 1 | | +| `error_channel` | `string[]` | 1 | Generalized: Go `(T, error)`, Rust `Result`, Java `throws` — one field, not `thrown_exceptions`. | +| `modifiers`, `decorators` | — | 1 | As on `type`. | +| `metrics` | `{ cyclomatic }` | 1 | Extensible metrics map. | +| `refs` | `{ types:[id], fields:[id] }` | 1 | Cheap cross-refs the symbol pass already knows. | +| `body` | `{ local-id → node }` | 3 | The statement/vertex map (below). Absent until L3. | +| `cfg` / `cdg` / `ddg` / `summary` | `edge[]` | 3–4 | Intra-callable edges (below). | + +## Body node kinds (L3+, keyed by local id inside `body`) + +| kind | Level | Extra fields | Notes | +| --- | --- | --- | --- | +| `call` | **1** | `callee` (id, **`null` until L2/resolver backfill**), `arguments:[local-id]` | Call sites — recorded at L1 so `get_call_sites` is an L1 accessor; `callee` is the one refinement slot. | +| `entry` / `exit` | 3 | — | Synthetic CFG endpoints; one each per callable; no span. | +| `statement` | 3 | — | Ordinary statement; text = `module.source[span.bytes]`. | +| `return` | 3 | — | Edge to `exit`. | +| `branch` / `loop` / `switch` | 3 | — | Control constructs; sources of `cfg` branch edges + `cdg`. | +| `formal_in` | 4 | `of` (param name) | Synthetic; child of the callable; one per formal. | +| `formal_out` | 4 | `of` (`$ret` or a by-ref param) | Synthetic; callable exit. | +| `actual_in` | 4 | `of` (`argN`), `parent` (call-site local-id) | Synthetic; child of a call node. | +| `actual_out` | 4 | `of` (`$ret`), `parent` | Synthetic; child of a call node. | +| `expression` | opt | `expr_kind` | Only with `--materialize-expressions`; carries a `parent`. | +| `block` | opt | — | Only with `--materialize-basic-blocks`; a container between callable and statement. 
| + +## Edges + +Every edge is `{ src, dst, …attrs }` where `src`/`dst` are node **ids** (local within a +callable's edge lists; full `can://` ids in the application-scope lists). The **list name is the +type** — there is no `type` field. No dangling endpoints. + +| List | Scope (lives on) | Level | Endpoint kinds | Attrs | +| --- | --- | --- | --- | --- | +| `call_graph` | application | 2 | callable → callable | `prov:string[]`, `weight:int` (accumulated on merge) | +| `cfg` | callable | 3 | statement → statement | `kind` (`fallthrough`\|`true`\|`false`\|`switch_case`\|`loop_back`\|`exception`\|`return`\|`break`\|`continue`\|lang-adds) | +| `cdg` | callable | 3 | statement → statement | — (control dependence) | +| `ddg` | callable | 3→4 | statement → statement | `var` (k-limited access path), `prov` (`["ssa"]`=syntactic/L3, `["points-to"]`=semantic/L4) | +| `summary` | callable | 4 | actual_in → actual_out | — (transitive intra-caller) | +| `param_in` | application | 4 | actual_in → formal_in | — | +| `param_out` | application | 4 | formal_out → actual_out | — | + +## Language expansion — the rubric + +Keep the invariant spine (`application → module → type/callable → body`, identity-only edges, +one `signatureOf()`), then **add** at the leaves. Record every addition in the analyzer's +`.claude/SCHEMA_DECISIONS.md`. + +| Add a new… | How | Example | | --- | --- | --- | -| `source` | `str` | caller `Callable.signature` | -| `target` | `str` | callee `Callable.signature` | -| `type` | `Literal["CALL_DEP"]` | `"CALL_DEP"` | -| `weight` | `int` | `1` (accumulate when merging backends) | -| `provenance` | `List[str]` | `[]` — e.g. `["tsc"]`, `["jedi","joern"]` | -| `tags` | `Dict[str, str]` | `{}` — free-form, extension-namespaced | - -## Supporting leaf models -- **Import**: `module`, `name`, `alias?`, line/column span. (Java: `path`, `is_static`, - `is_wildcard`.) -- **Comment**: `content`, line/column span, `is_docstring` (Java: `is_javadoc`). 
-- **CallableParameter**: `name`, `type?`, `default_value?`, line/column span. (Java adds - `annotations`, `modifiers`.) -- **Decorator**: `name`, `qualified_name?`, `positional_arguments[]`, `keyword_arguments{}`, - span. (The Java equivalent is flat `annotations: List[str]`.) -- **Symbol**: `name`, `scope`, `kind`, `type?`, `qualified_name?`, `is_builtin`, `lineno`, - `col_offset`. -- **VariableDeclaration**: `name`, `type?`, `initializer?`, `value?`, `scope`, span. -- **ClassAttribute**: `name`, `type?`, `comments[]`, span. -- **Entrypoint** (optional): `signature`, `framework`, `detection_source`, route/method - fields, `tags{}`. - -## Expanding the schema for the target language (encouraged) - -Mirroring the shared fields is the floor, not the ceiling. A good language pack **captures -what is idiomatic and analytically important in the target language as first-class schema** — -it does not cram the language into the Java/Python mold and discard the rest. You are -explicitly free to add node kinds and fields. The only thing you may not change is the spine. - -**The invariant spine (never drift):** the root keys `symbol_table` (a `Dict[str, Module]`) -and `call_graph` (identity-only `List[CallEdge]`); the Module → Class/Callable nesting; one -`signatureOf()` producing every id; and edges whose `source`/`target` byte-match real -`Callable.signature`s. The shared SDK facade methods depend on exactly this and nothing more. - -**Everything else is yours to extend**, because the new language gets its **own** -`cldk/models//` Pydantic models. Add a field to the analyzer output *and* the -corresponding `` model in the same change, and validation still passes — you own both -sides. You are not limited to the fields in this reference. - -### Decision rubric — where does a new concept go? -1. 
**New top-level node kind** (sibling of Class/Callable in `Module`, or a new collection) — - when the concept is a *declaration* you'll want to look up by signature or point edges at - (TS `interface`/`type`-alias/`enum`; Go `struct`/`interface`; Rust `trait`/`impl`). Give it - its own `signature` from `signatureOf()` so edges and `base_classes` can reference it. -2. **New typed field on an existing node** — when the concept is an *attribute* of a callable/ - class/callsite that consumers will query directly and want validated (Go method - `receiver_type`; Rust `is_async`/`is_unsafe`; TS `type_parameters` for generics; visibility/ - mutability). Add it to both the output and the `` model with a sensible default. -3. **Open-vocabulary `tags` / `provenance`** — when the metadata is low-stakes, sparse, or - framework/extension-specific and not worth a typed field (Go struct tags, build constraints; - TS JSX flags; experimental attributes). These are `Dict[str,str]`/`List[str]`, so they - round-trip without schema churn and without every consumer needing to know about them. - -Prefer a typed field (1 or 2) when a consumer will branch on the value; prefer `tags` (3) when -it's descriptive metadata. When unsure, start with `tags` and promote to a field later. - -### Worked expansions -- **TypeScript**: `interface`, `type`-alias, and `enum` as Class-siblings; `type_parameters` - for generics; union/intersection types captured in `type` strings; `extends`/`implements` - chains → `base_classes`; TS decorators → `decorators`; ambient/`declare` and JSX flags → - `tags`. -- **Go**: `struct` and `interface` node kinds; method `receiver_type` on the callable; - embedded structs and satisfied interfaces → `base_classes`; goroutine launches and channel - ops are good `Callsite`/`tags` candidates; struct tags and build constraints → `tags`. 
-- **Rust**: `trait`, `impl` block, and `enum` (with variants) node kinds; `is_async`/ - `is_unsafe`/`is_const` and lifetime/generic params as fields; trait bounds → `base_classes`; - macro invocations as `Callsite`s tagged with provenance `"macro"`. - -Whatever you add, keep snake_case keys and make new fields optional-with-default so a partially -populated `analysis.json` (e.g. symbol-table-only, or a degraded resolve) still validates. - -## The validation contract (success criterion) -The generated analyzer's output is correct iff the SDK model loads it without error: - -```python -import json -from cldk.models. import Application # the models you add (subprocess backend) -app = Application(**json.load(open("analysis.json"))) # must not raise -assert app.symbol_table # non-empty -sigs = { ... all Callable.signature in app.symbol_table ... } -assert all(e.source in sigs and e.target in sigs for e in app.call_graph) # no dangling edges -``` - -Because the SDK `Application` model is itself a faithful mirror of this reference, "passes -Pydantic validation + no dangling edges" is the comprehensive, mechanical check that the schema -was mirrored fully and correctly. Build the SDK models first (from this reference), then make -the analyzer's output validate against them. 
+| **type kind** | new `kind` value on a `type` node | Go `struct`; TS `interface`, `enum`, `type_alias`; Rust `trait` | +| **callable kind** | new `kind` value on a `callable` | closures, getters/setters, `init` blocks | +| **body node kind** | new `kind` value in `body` | Go `select`, Python comprehension scope | +| **CFG edge kind** | new `kind` value on a `cfg` edge | Go `defer_resume`, JS `await_resume`, Python `yield_resume` | +| **typed field** | new field on a node | receiver type, `is_async`/`is_unsafe`, struct tags | +| **open-vocab attr** | a string in `tags{}` | anything not worth a first-class field | + +Never: rename a shared field, repurpose a shared `kind`, add a rich-edge variant, or introduce a +node that isn't reachable from `application` by containment. Those break the single-model +premise the whole SDK depends on. The **parity clause** in `canonical-schema.md` is the rule; +this table is how you satisfy it. diff --git a/skills/codeanalyzer-backend/references/symbol-table-construction.md b/skills/codeanalyzer-backend/references/symbol-table-construction.md index aeadfa2..bf5087b 100644 --- a/skills/codeanalyzer-backend/references/symbol-table-construction.md +++ b/skills/codeanalyzer-backend/references/symbol-table-construction.md @@ -1,10 +1,17 @@ -# Symbol Table Construction (file by file) +# L1 — build the tree (symbol table, file by file) -Once the schema is designed (`schema-design-loop.md`), this stage **populates** it: walk the -project file by file and build `symbol_table: Dict[file_path, Module]`. Like the schema, you -build this by **studying how the mature reference analyzers do it** and replicating the pattern -for the new language — they have already solved file discovery, per-file building, caching, and -the whole-project / target-files / single-source modes. 
+The first level of the additive schema (`canonical-schema.md`): grow the **containment tree to +callable depth** — `application → symbol_table{module} → types{}/functions{} → callables{}` — +file by file. This is the floor everything else hangs off. You build it by **studying how the +mature reference analyzers do it** and replicating the pattern for the new language — they have +already solved file discovery, per-file building, caching, and the whole-project / target-files / +single-source modes. + +**v2 shape (vs the old flat symbol table):** every node carries an `id` (the `can://` path), a +`kind`, and a `span` **with byte offsets**; the **module stores the whole file's `source` once** +(all node text slices off it — no per-callable `code`); and call sites are recorded as `call` +nodes in each callable's `body` with `callee: null` (so `get_call_sites` works at L1; L2 backfills +`callee`). The tree is otherwise the named-map hierarchy of `canonical-schema.md`. ## Anchor: how Java and Python construct the symbol table @@ -39,12 +46,13 @@ map keyed by path**, with three entry modes (all / target-files / single-source) `symbol_table` key and must be identical across runs (so caching and the SDK's file lookups work). 3. **Per file, cache-check then build.** If a prior `analysis_cache.json` has this file and its - `content_hash`/`last_modified`/`file_size` are unchanged, reuse the cached `Module`. + `content_hash`/`last_modified`/`file_size` are unchanged, reuse the cached `module`. Otherwise call your **per-file builder** (the analog of `build_pymodule_from_file` / - `processCompilationUnit`): parse with the structural tool, walk the tree, and fill the - `Module` with classes / functions / language-native kinds / callables — and on each callable - the **unresolved call sites** (callee name + receiver expr + arg exprs + position, - `callee_signature` left null). Stamp the cache metadata on the `Module`. 
+ `processCompilationUnit`): parse with the structural tool, retain the file's text as the + module's **`source`**, walk the tree, and fill the `module` with `types` / `functions` / + language-native kinds / `callables` — each node with its `id`, `kind`, and `span` (with byte + offsets). On each callable, record the **call sites as `call` nodes in `body`** (callee name + + receiver expr + arg exprs + span, `callee: null`). Stamp the cache metadata on the `module`. 4. **Assemble** `symbol_table[file_key] = module` for every file. 5. **Support the three CLI modes** (`cli-contract.md`): whole-project (extractAll-style), `-t/--target-files` incremental (extract-style), and optionally single-source. @@ -65,18 +73,20 @@ split per node kind into sibling modules under `syntactic_analysis/`. **Do not** threading state through arguments, with `buildClass`/`buildInterface`/`buildEnum` scattered across the file. See `analyzer-architecture.md` rule 2. -Keep this stage to the symbol table — record call sites but **don't resolve them into edges** -yet. That resolution is the *cheap next step* (still level 1; `backend-recipe.md` step 6), where -the same resolver maps each site to its callee. Type fields may be populated here if your -resolver is a same-tool checker; only the edge resolution is deferred to the next stage. +Keep this stage to L1 — record the `call` nodes but **don't resolve them into edges** yet. That +resolution is L2 (`backend-recipe.md`), where the same resolver maps each `call` node to its +callee (backfilling `callee`) and emits the `call_graph`. Type fields may be populated here if +your resolver is a same-tool checker; only the edge resolution is deferred. 
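The L1/L2 hand-off in the paragraph above (record `call` nodes with `callee: null`, then let L2 backfill them and emit `call_graph`) might look roughly like this. It is a sketch under assumptions: `backfill_callees`, the `resolved` lookup, and the placeholder ids are hypothetical; the edge keys `src`/`dst`/`prov`/`weight` follow the schema reference, while collapsing repeated calls into a single edge of weight 1 is a guess rather than a stated rule.

```python
def backfill_callees(caller_id, body, resolved, resolver_name):
    """Fill `callee` on resolved call nodes in place; return new call_graph edges."""
    edges = {}
    for local_id, node in body.items():
        if node.get("kind") != "call":
            continue
        callee_id = resolved.get(local_id)
        if callee_id is None:
            continue  # unresolved: `callee` stays null
        node["callee"] = callee_id  # the single null -> id refinement
        # One edge per (caller, callee) pair; repeated calls do not add edges.
        edges.setdefault(
            (caller_id, callee_id),
            {"src": caller_id, "dst": callee_id, "prov": [resolver_name], "weight": 1},
        )
    return list(edges.values())


# Placeholder ids only; real ids follow the can:// scheme.
body = {
    "@3:4": {"kind": "call", "callee": None, "arguments": []},
    "@4:4": {"kind": "call", "callee": None, "arguments": []},
}
resolved = {"@3:4": "CALLEE_ID"}  # hypothetical resolver output; "@4:4" is unresolved

call_graph = backfill_callees("CALLER_ID", body, resolved, "example-resolver")
```

Nothing recorded at L1 is rewritten apart from the `callee` slot, so the L1 tree stays a subset of the L2 output.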
-## Verify (the level-1 gate) +## Verify (the L1 gate) Run the analyzer on a tiny fixture project and confirm: -- the output **validates** against the SDK `Application` Pydantic model - (`Application(**json.load(...))` does not raise); -- `symbol_table` is non-empty and keyed by stable relative paths; -- spot-check one known file: its `Module` has the expected classes/functions, and callables - carry unresolved call sites with `callee_signature == null`; +- the output **validates** against the SDK `Application` model (`Application(**json.load(...))` + does not raise); +- `symbol_table` is non-empty and keyed by stable relative paths (no absolute, no `..`); +- spot-check one known file: its `module` has the expected `types`/`functions`, a `source` blob, + and callables carrying `call` nodes with `callee: null`; a callable's text = `module.source` + sliced by `span.bytes` (`get_method_body`); - re-running reuses cache for unchanged files (no rebuild). -Only when this passes do you move to call-graph construction. +Only when this passes do you move to L2 (call-graph construction). Full criteria: +`testing-and-validation.md`. diff --git a/skills/codeanalyzer-backend/references/testing-and-validation.md b/skills/codeanalyzer-backend/references/testing-and-validation.md index 83784ab..a0fa6c4 100644 --- a/skills/codeanalyzer-backend/references/testing-and-validation.md +++ b/skills/codeanalyzer-backend/references/testing-and-validation.md @@ -53,13 +53,13 @@ analyzer repo — see the frontend skill's `sdk-testing.md`. Run the analyzer on the fixture and confirm all of the following: -1. **Output validates** against the SDK `Application` Pydantic model — - `Application(**json.load(open("analysis.json")))` must not raise. +1. **Output validates** against the SDK `Application` Pydantic model — + `Application(**json.load(open("analysis.json")))` must not raise. 2. 
**`symbol_table` is non-empty** and keyed by **stable relative paths** — no key starts with `/` (absolute) or `..` (CWD-relative). Both are common bugs; assert them explicitly. 3. A known file's `Module` contains the expected types, functions, and call sites with - `callee_signature == null`. (Call sites are recorded but not resolved at this stage.) + `callee == null`. (Call sites are recorded but not resolved at this stage.) 4. **Re-running reuses the cache** — mtime of `analysis.json` (or `analysis_cache.json`) is unchanged on a second non-eager run. @@ -70,11 +70,11 @@ Do not proceed to Call Graph Construction until this passes. 1. Every edge endpoint matches a real signature in the symbol table — no dangling nodes. Check: `for e in app.call_graph: assert e.source in all_sigs and e.target in all_sigs`. 2. Every edge has a non-empty `provenance` list naming the resolver. -3. `callee_signature` is backfilled on successfully resolved call sites (non-null, non-empty +3. `callee` is backfilled on successfully resolved call sites (non-null, non-empty string). 4. A named expected edge is present — assert the exact `(source, target)` pair. 5. At least one cross-package/cross-module edge is present. -6. Output still validates against `Application`. +6. Output still validates against `Application`. ### Caching tests (add after implementing caching/incremental — `backend-recipe.md` step 8) @@ -85,7 +85,7 @@ Four behaviors to assert on the binary: | Test | What to assert | |------|----------------| | `CacheFileWritten` | After `Analyze()` with `CacheDir` set, `analysis_cache.json` exists in that dir. | -| `CacheContentsRoundTrip` | `analysis_cache.json` deserializes to a valid `Application` with the same symbol table key count as the in-memory result. | +| `CacheContentsRoundTrip` | `analysis_cache.json` deserializes to a valid `Application` with the same symbol table key count as the in-memory result. 
| | `SecondRunReuses` | Second run with same non-eager opts returns the same symbol table key count; `analysis.json` (or cache file) mtime is unchanged. | | `EagerForcesRebuild` | After seeding the cache, a run with `Eager=true` rewrites `analysis_cache.json` (mtime advances). Use `time.Sleep` / `time.sleep` before the eager run to ensure the filesystem timestamp differs. | @@ -95,6 +95,24 @@ Four behaviors to assert on the binary: clear message, never silently fall back to JSON. Assert the non-zero exit and the message. See `cli-contract.md § Flag validation requirements`. +### Monotonicity gate (the additive-paradigm invariant) + +The schema is additive (`canonical-schema.md` § Monotonicity), so the level outputs must nest: +run the analyzer at `-a 1`, `-a 2`, `-a 3`, `-a 4` on the fixture and assert +**`json(-a 1) ⊆ json(-a 2) ⊆ json(-a 3) ⊆ json(-a 4)`** — every node and edge present at a lower +level is present, unchanged, at every higher level. The **only** sanctioned differences are +additions (new `body` nodes, new edge-list entries) and the single `callee: null → id` +refinement. A diff that *changes* an existing fact (a rewritten span, a re-anchored `call_graph` +edge, a removed syntactic `ddg` edge) fails the gate — it means a level rewrote instead of added. +Also assert the two projections agree: the Neo4j node/edge counts at full depth match the JSON at +`max_level` (modulo the containment `HAS_*` edges Neo4j makes explicit). + +### Two-tier identity gate + +`can://` ids (≥ callable) are stable across two runs on unchanged source; `…@line:col` ids carry a +column (assert no id is a bare line); every edge endpoint resolves to a real node (no dangling, at +every level and in both projections). + --- ## 3. Definition of done (analyzer surface) @@ -103,7 +121,7 @@ Both this surface and the SDK surface (frontend skill) must pass before the lang considered complete. 
 - [ ] `go test ./...` (or equivalent) passes — all symbol table, call graph, and caching tests.
-- [ ] Output on the fixture validates against `Application` without error.
+- [ ] Output on the fixture validates against `Application` without error.
 - [ ] `symbol_table` keys are relative paths; no key is absolute or `..`-prefixed.
 - [ ] Every language-specific field has at least one test asserting a concrete value.
 - [ ] Named expected call-graph edge is asserted (not just "non-empty").