You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
⚠️ Scope amendment (2026-07-02): analyzer is a pure graph provider — slicing & taint move to the SDK
This analyzer emits the graph substrate only; client analyses are frontend SDK queries. Established across the family in cldk-forge PR #7 (dataflow-graphs.md § provider/client boundary); reference instantiation codellm-devkit/codeanalyzer-java#171.
Superseded here → filed as codellm-devkit/python-sdk#229 (the shared client-analysis engine): GOAL 3 (slicing + taint as SDG queries, taint_flows output), PART 3 item 12 (backward slicing + taint), PR D's "MVP taint over the call graph", PR E (models-as-data + taint_flows), and the "client gates" / "sanitized + unsanitized taint pair" in the DoD (now frontend gates). TaintFlow drops from the SDK models co-evolved here.
Stays in this analyzer: the full graph substrate — GOAL 1 (program_graphs: CFG/PDG/SDG), the transitive SUMMARY edges (PART 2 item 9 SDG assembly; PR D "SDG assembly … SUMMARY edges"), and the CPG projection (GOAL 2 / PR G). The DoD fixture keeps a source→sink data-flow pair so the SDK can assert taint over it. Retitled: dropped "and taint analysis".
Depends on #1 (native C++/LibTooling rewrite). Level-3 is native, in-process dataflow —
the natural C/C++ substrate is clang::CFG + LLVM/SVF, which is a first-class C++ API in the
native backend and awkward-to-absent through the Python libclang bindings. Build this on top of
the C++ backend, not the current Python one. Not part of the initial bring-up (level 1 is done).
PROBLEM
codeanalyzer-clang today emits the level-1 symbol table and resolver call graph (-a 2), plus a
stubbed level-2 SVF enrichment toggle. It has no dataflow: no CFG, no dependence graphs, no way
to answer "what does this value affect" or "does user input reach this sink". This issue adds level 3 — native, whole-program dependence graphs built from Clang's own CFG, per the skillset's dataflow-graphs.md contract — and exposes slicing and taint as queries over them.
Native is the constraint: everything runs in the analyzer's own ecosystem (clang/LLVM), in-process.
GOALS (the contract)
Emit CFG, PDG (CDG+DDG), and SDG as first-class sections of analysis.json
(program_graphs, schema-versioned, keyed by canonical (signature, node_id)), gated by -a 3 / --graphs cfg,dfg,pdg,sdg. (Today the CLI caps -a at 2 and has no --graphs.)
Project the CPG (AST+CFG+PDG overlay) through the existing Neo4j emitter as new labels /
edge types; additive schema.neo4j.json bump + conformance test.
Expose backward slicing and taint as SDG queries; sources/sinks/sanitizers/library
models supplied as data (JSON spec + JSON Schema); emit a taint_flows section.
Hold the cross-language parity clause: shared node kinds / edge types / JSON shapes;
C/C++-specific additions are additive and recorded in .claude/SCHEMA_DECISIONS.md.
Keep -a 1 / -a 2 timings unaffected; content-hash + cache summaries with recorded
dependency edges (incrementality later, but its hooks now).
Deterministic parallelism (-j/--jobs): --jobs N output byte-identical to --jobs 1.
CFG source:clang::CFG (built per function from the Clang AST in the native backend).
Def-use source: hand-built reaching definitions over the clang CFG (no free SSA at this level;
LLVM SSA is an option if lowering to IR, but the clang-CFG path keeps node identity at
source-statement granularity).
Points-to oracle:type-based may-alias MVP, upgraded to a native Steensgaard (cheap)
and then SVF Andersen (the already-scaffolded semantic_analysis/svf/ level-2 backend, over
LLVM bitcode) as the strong oracle. Frozen oracle: read its solved state, never fork its solver.
Identity mapping: clang CFG / SVF node identities (source-location- / IR-based) mapped onto
our canonical signature + node_id keys — the same keys as symbol_table / call_graph. On
the critical path for every later part.
Precision posture: sound-leaning, over-approximate; flow-sensitive on locals, heap precision
capped by the oracle; k-limited access paths (--graph-field-depth, default 3).
PART 1 — INTRAPROCEDURAL (stages 1–4)
Exceptional CFG per callable: statement-level, synthetic ENTRY/EXIT, multi-exit normalized.
Explicit lowering for every C/C++ construct: if/switch+fallthrough, for/while/do-while, break/continue/goto+labels, &&/||/ternary short-circuit, C++ try/catch/throw, destructor/RAII cleanup edges at scope exit, return in the presence of cleanups.
Dominators + post-dominators (CHK iterative); synthetic edge for infinite loops; control
dependence via the post-dominance frontier.
Access-path variable model (k-limited); reaching definitions → DDG edges.
PDG assembly; intraprocedural backward-slice gate on the fixture (hand-computed expected node
set, exact match).
PART 2 — INTERPROCEDURAL (stages 5–9)
Oracle integration (type-based MVP first; SVF behind the flag); one whole-program solve;
identity-mapping layer; merge oracle call-graph edges into call_graph with provenance.
Record global/extern/static state and their read/write sites as summary inputs/outputs.
SCC condensation (Tarjan) of the call graph; hammock-region decomposition; bottom-up relational
region summaries; function-summary composition over the condensation DAG (ready-queue wavefront);
monotone fixpoint within SCCs; k-limiting mandatory for termination.
External/library behavior as model summaries (libc, STL) in the same relational format;
unmodeled externals default to conservative pass-through.
SDG assembly: actual/formal in/out nodes, CALL / PARAM_IN / PARAM_OUT / SUMMARY edges; globals
ride as extra formals.
PART 3 — EMISSION AND CLIENTS (stage 8)
program_graphs section + --graphs selector with strict flag validation; co-evolve the shared
SDK Pydantic models (ProgramGraphs / GraphNode / GraphEdge / SDGEdge / TaintFlow).
CPG projection: CFGNode label + CFG_NEXT/CDG/DDG/PARAM_IN/PARAM_OUT/SUMMARY/ HAS_CFG_NODE in neo4j/; schema.neo4j.json bump; conformance test extended.
Backward slicing (two-phase context-sensitive traversal) + taint (labeled reachability,
sanitizer blocking, lazy witness reconstruction) as SDG queries; sources/sinks/sanitizers as data
(built-in pack < config file < inline flags); taint_flows output with model ids.
CAVEATS AND KNOWN RISKS
Inherited unsoundness in C/C++ (documented in README, not silently absorbed): setjmp/longjmp, pointer arithmetic, union type-punning, reinterpret_cast, inline asm,
function pointers (partial), and C++ exceptions / throw across TU boundaries if unmodeled.
Identity-model mismatch: clang CFG is source-location-based; SVF is LLVM-IR-based. Reconciling
both onto one (signature, node_id) space is the hard part — pin the mapping in tests.
Precision: intentionally over-approximate; do not trade soundness for a lower FP rate —
ranking/pruning is downstream's job.
Cost: whole-program SVF solve is the dominant cost; -a 1/-a 2 must stay unaffected (CI-checked).
Incrementality: aspirational; record summary dependency edges + content-hashes from the start.
Parallel determinism: never assign ids / emit during parallel execution — collect, then sort by (signature, node_id). Sequential first, pass every gate at --jobs 1, parallelize after using --jobs 1 as the differential oracle.
PR F SVF/Steensgaard-backed alias-aware propagation replaces the type-based MVP (reuses semantic_analysis/svf/).
PR G (optional) CPG Neo4j projection + conformance test + schema bump.
PR H (later) Incremental re-analysis over the recorded dependency edges.
VERIFICATION / DEFINITION OF DONE
Every gate in dataflow-construction.md passes on the fixture (CFG, dominance, DFG, PDG-slice,
summary, SDG, client gates) — exact expected sets, not "non-empty".
Fixture covers the full C/C++ lowering checklist (goto, switch-fallthrough, short-circuit, RAII
cleanup, try/catch) plus shared minimums (aliasing, SCC recursion, multi-file flow, sanitized +
unsanitized taint pair).
-a 3 output validates against the shared SDK ProgramGraphs models; parity clause holds.
Cypher snapshot with graphs loads clean into empty Neo4j; CFGNode count matches JSON; no
dangling edges.
--jobs N output byte-identical to --jobs 1 (determinism gate).
PROBLEM
codeanalyzer-clangtoday emits the level-1 symbol table and resolver call graph (-a 2), plus astubbed level-2 SVF enrichment toggle. It has no dataflow: no CFG, no dependence graphs, no way
to answer "what does this value affect" or "does user input reach this sink". This issue adds
level 3 — native, whole-program dependence graphs built from Clang's own CFG, per the skillset's
dataflow-graphs.mdcontract — and exposes slicing and taint as queries over them.Native is the constraint: everything runs in the analyzer's own ecosystem (clang/LLVM), in-process.
GOALS (the contract)
analysis.json(
program_graphs, schema-versioned, keyed by canonical(signature, node_id)), gated by-a 3/--graphs cfg,dfg,pdg,sdg. (Today the CLI caps-aat 2 and has no--graphs.)edge types; additive
schema.neo4j.jsonbump + conformance test.models supplied as data (JSON spec + JSON Schema); emit a
taint_flowssection.C/C++-specific additions are additive and recorded in
.claude/SCHEMA_DECISIONS.md.-a 1/-a 2timings unaffected; content-hash + cache summaries with recordeddependency edges (incrementality later, but its hooks now).
-j/--jobs):--jobs Noutput byte-identical to--jobs 1.SUBSTRATE DECISIONS (from dataflow-substrate-menu.md — C/C++ row)
clang::CFG(built per function from the Clang AST in the native backend).LLVM SSA is an option if lowering to IR, but the clang-CFG path keeps node identity at
source-statement granularity).
and then SVF Andersen (the already-scaffolded
semantic_analysis/svf/level-2 backend, overLLVM bitcode) as the strong oracle. Frozen oracle: read its solved state, never fork its solver.
our canonical
signature+node_idkeys — the same keys assymbol_table/call_graph. Onthe critical path for every later part.
capped by the oracle; k-limited access paths (
--graph-field-depth, default 3).PART 1 — INTRAPROCEDURAL (stages 1–4)
Explicit lowering for every C/C++ construct:
if/switch+fallthrough,for/while/do-while,break/continue/goto+labels,&&/||/ternary short-circuit, C++try/catch/throw,destructor/RAII cleanup edges at scope exit,
returnin the presence of cleanups.dependence via the post-dominance frontier.
set, exact match).
PART 2 — INTERPROCEDURAL (stages 5–9)
identity-mapping layer; merge oracle call-graph edges into
call_graphwithprovenance.extern/staticstate and their read/write sites as summary inputs/outputs.region summaries; function-summary composition over the condensation DAG (ready-queue wavefront);
monotone fixpoint within SCCs; k-limiting mandatory for termination.
unmodeled externals default to conservative pass-through.
ride as extra formals.
PART 3 — EMISSION AND CLIENTS (stage 8)
program_graphssection +--graphsselector with strict flag validation; co-evolve the sharedSDK Pydantic models (
ProgramGraphs/GraphNode/GraphEdge/SDGEdge/TaintFlow).CFGNodelabel +CFG_NEXT/CDG/DDG/PARAM_IN/PARAM_OUT/SUMMARY/HAS_CFG_NODEinneo4j/;schema.neo4j.jsonbump; conformance test extended.sanitizer blocking, lazy witness reconstruction) as SDG queries; sources/sinks/sanitizers as data
(built-in pack < config file < inline flags);
taint_flowsoutput with model ids.CAVEATS AND KNOWN RISKS
setjmp/longjmp, pointer arithmetic,uniontype-punning,reinterpret_cast, inline asm,function pointers (partial), and C++ exceptions /
throwacross TU boundaries if unmodeled.both onto one
(signature, node_id)space is the hard part — pin the mapping in tests.ranking/pruning is downstream's job.
-a 1/-a 2must stay unaffected (CI-checked).(signature, node_id). Sequential first, pass every gate at--jobs 1, parallelize after using--jobs 1as the differential oracle.STAGED PRs
clang::CFG.program_graphsemission for cfg/pdg, slice gate green; then per-callable parallel fan-out, differential-tested vs--jobs 1.sdg_edges; MVP taint over the call graph; then the SCC wavefront.taint_flows+ witness paths; SDK models co-evolved.semantic_analysis/svf/).VERIFICATION / DEFINITION OF DONE
dataflow-construction.mdpasses on the fixture (CFG, dominance, DFG, PDG-slice,summary, SDG, client gates) — exact expected sets, not "non-empty".
cleanup, try/catch) plus shared minimums (aliasing, SCC recursion, multi-file flow, sanitized +
unsanitized taint pair).
-a 3output validates against the shared SDKProgramGraphsmodels; parity clause holds.CFGNodecount matches JSON; nodangling edges.
--jobs Noutput byte-identical to--jobs 1(determinism gate).-a 1/-a 2wall-clock unchanged within noise.Reference
Skill:
cldk-forge:codeanalyzer-backend—dataflow-graphs.md(contract),dataflow-construction.md(stage-by-stage method + gates),
dataflow-substrate-menu.md(C/C++ row),dataflow-issue-template.md.Prior art:
codeanalyzer-typescript#2(the TS level-3 instantiation).