Skip to content

Level-3: native dataflow graphs (CFG/DFG/PDG/SDG/CPG) for C/C++ #2

Description

@rahlk

⚠️ Scope amendment (2026-07-02): analyzer is a pure graph provider — slicing & taint move to the SDK

This analyzer emits the graph substrate only; client analyses are frontend SDK queries. Established across the family in cldk-forge PR #7 (dataflow-graphs.md § provider/client boundary); reference instantiation codellm-devkit/codeanalyzer-java#171.

Superseded here → filed as codellm-devkit/python-sdk#229 (the shared client-analysis engine): GOAL 3 (slicing + taint as SDG queries, taint_flows output), PART 3 item 12 (backward slicing + taint), PR D's "MVP taint over the call graph", PR E (models-as-data + taint_flows), and the "client gates" / "sanitized + unsanitized taint pair" in the DoD (now frontend gates). TaintFlow drops from the SDK models co-evolved here.

Stays in this analyzer: the full graph substrate — GOAL 1 (program_graphs: CFG/PDG/SDG), the transitive SUMMARY edges (PART 2 item 9 SDG assembly; PR D "SDG assembly … SUMMARY edges"), and the CPG projection (GOAL 2 / PR G). The DoD fixture keeps a source→sink data-flow pair so the SDK can assert taint over it. Retitled: dropped "and taint analysis".


Depends on #1 (native C++/LibTooling rewrite). Level-3 is native, in-process dataflow —
the natural C/C++ substrate is clang::CFG + LLVM/SVF, which is a first-class C++ API in the
native backend and awkward-to-absent through the Python libclang bindings. Build this on top of
the C++ backend, not the current Python one. Not part of the initial bring-up (level 1 is done).

PROBLEM

codeanalyzer-clang today emits the level-1 symbol table and resolver call graph (-a 2), plus a
stubbed level-2 SVF enrichment toggle. It has no dataflow: no CFG, no dependence graphs, no way
to answer "what does this value affect" or "does user input reach this sink". This issue adds
level 3 — native, whole-program dependence graphs built from Clang's own CFG, per the skillset's
dataflow-graphs.md contract — and exposes slicing and taint as queries over them.

Native is the constraint: everything runs in the analyzer's own ecosystem (clang/LLVM), in-process.

GOALS (the contract)

  1. Emit CFG, PDG (CDG+DDG), and SDG as first-class sections of analysis.json
    (program_graphs, schema-versioned, keyed by canonical (signature, node_id)), gated by
    -a 3 / --graphs cfg,dfg,pdg,sdg. (Today the CLI caps -a at 2 and has no --graphs.)
  2. Project the CPG (AST+CFG+PDG overlay) through the existing Neo4j emitter as new labels /
    edge types; additive schema.neo4j.json bump + conformance test.
  3. Expose backward slicing and taint as SDG queries; sources/sinks/sanitizers/library
    models supplied as data (JSON spec + JSON Schema); emit a taint_flows section.
  4. Hold the cross-language parity clause: shared node kinds / edge types / JSON shapes;
    C/C++-specific additions are additive and recorded in .claude/SCHEMA_DECISIONS.md.
  5. Keep -a 1 / -a 2 timings unaffected; content-hash + cache summaries with recorded
    dependency edges (incrementality later, but its hooks now).
  6. Deterministic parallelism (-j/--jobs): --jobs N output byte-identical to --jobs 1.

SUBSTRATE DECISIONS (from dataflow-substrate-menu.md — C/C++ row)

  • CFG source: clang::CFG (built per function from the Clang AST in the native backend).
  • Def-use source: hand-built reaching definitions over the clang CFG (no free SSA at this level;
    LLVM SSA is an option if lowering to IR, but the clang-CFG path keeps node identity at
    source-statement granularity).
  • Points-to oracle: type-based may-alias MVP, upgraded to a native Steensgaard (cheap)
    and then SVF Andersen (the already-scaffolded semantic_analysis/svf/ level-2 backend, over
    LLVM bitcode) as the strong oracle. Frozen oracle: read its solved state, never fork its solver.
  • Identity mapping: clang CFG / SVF node identities (source-location- / IR-based) mapped onto
    our canonical signature + node_id keys — the same keys as symbol_table / call_graph. On
    the critical path for every later part.
  • Precision posture: sound-leaning, over-approximate; flow-sensitive on locals, heap precision
    capped by the oracle; k-limited access paths (--graph-field-depth, default 3).

PART 1 — INTRAPROCEDURAL (stages 1–4)

  1. Exceptional CFG per callable: statement-level, synthetic ENTRY/EXIT, multi-exit normalized.
    Explicit lowering for every C/C++ construct: if/switch+fallthrough, for/while/do-while,
    break/continue/goto+labels, &&/||/ternary short-circuit, C++ try/catch/throw,
    destructor/RAII cleanup edges at scope exit, return in the presence of cleanups.
  2. Dominators + post-dominators (CHK iterative); synthetic edge for infinite loops; control
    dependence via the post-dominance frontier.
  3. Access-path variable model (k-limited); reaching definitions → DDG edges.
  4. PDG assembly; intraprocedural backward-slice gate on the fixture (hand-computed expected node
    set, exact match).

PART 2 — INTERPROCEDURAL (stages 5–9)

  1. Oracle integration (type-based MVP first; SVF behind the flag); one whole-program solve;
    identity-mapping layer; merge oracle call-graph edges into call_graph with provenance.
  2. Record global/extern/static state and their read/write sites as summary inputs/outputs.
  3. SCC condensation (Tarjan) of the call graph; hammock-region decomposition; bottom-up relational
    region summaries; function-summary composition over the condensation DAG (ready-queue wavefront);
    monotone fixpoint within SCCs; k-limiting mandatory for termination.
  4. External/library behavior as model summaries (libc, STL) in the same relational format;
    unmodeled externals default to conservative pass-through.
  5. SDG assembly: actual/formal in/out nodes, CALL / PARAM_IN / PARAM_OUT / SUMMARY edges; globals
    ride as extra formals.

PART 3 — EMISSION AND CLIENTS (stage 8)

  1. program_graphs section + --graphs selector with strict flag validation; co-evolve the shared
    SDK Pydantic models (ProgramGraphs / GraphNode / GraphEdge / SDGEdge / TaintFlow).
  2. CPG projection: CFGNode label + CFG_NEXT/CDG/DDG/PARAM_IN/PARAM_OUT/SUMMARY/
    HAS_CFG_NODE in neo4j/; schema.neo4j.json bump; conformance test extended.
  3. Backward slicing (two-phase context-sensitive traversal) + taint (labeled reachability,
    sanitizer blocking, lazy witness reconstruction) as SDG queries; sources/sinks/sanitizers as data
    (built-in pack < config file < inline flags); taint_flows output with model ids.

CAVEATS AND KNOWN RISKS

  • Inherited unsoundness in C/C++ (documented in README, not silently absorbed):
    setjmp/longjmp, pointer arithmetic, union type-punning, reinterpret_cast, inline asm,
    function pointers (partial), and C++ exceptions / throw across TU boundaries if unmodeled.
  • Identity-model mismatch: clang CFG is source-location-based; SVF is LLVM-IR-based. Reconciling
    both onto one (signature, node_id) space is the hard part — pin the mapping in tests.
  • Termination: interprocedural fixpoint requires k-limiting (mandatory knob); label sets bounded.
  • Precision: intentionally over-approximate; do not trade soundness for a lower FP rate —
    ranking/pruning is downstream's job.
  • Cost: whole-program SVF solve is the dominant cost; -a 1/-a 2 must stay unaffected (CI-checked).
  • Incrementality: aspirational; record summary dependency edges + content-hashes from the start.
  • Parallel determinism: never assign ids / emit during parallel execution — collect, then sort by
    (signature, node_id). Sequential first, pass every gate at --jobs 1, parallelize after using
    --jobs 1 as the differential oracle.

STAGED PRs

  • PR A Prep: land the native C++ backend (Rewrite the backend in native C++ (Clang LibTooling), replacing the Python/libclang implementation #1) far enough to expose per-function clang::CFG.
  • PR B Points-to oracle integration (type-based MVP) + identity mapping + call-graph merge with provenance.
  • PR C Intraprocedural: CFG + dominance + PDG, program_graphs emission for cfg/pdg, slice gate green; then per-callable parallel fan-out, differential-tested vs --jobs 1.
  • PR D Summaries: hammock regions, SCC fixpoint + k-limiting; SDG assembly; sdg_edges; MVP taint over the call graph; then the SCC wavefront.
  • PR E Models-as-data: JSON spec + Schema, default pack, precedence; taint_flows + witness paths; SDK models co-evolved.
  • PR F SVF/Steensgaard-backed alias-aware propagation replaces the type-based MVP (reuses semantic_analysis/svf/).
  • PR G (optional) CPG Neo4j projection + conformance test + schema bump.
  • PR H (later) Incremental re-analysis over the recorded dependency edges.

VERIFICATION / DEFINITION OF DONE

  • Every gate in dataflow-construction.md passes on the fixture (CFG, dominance, DFG, PDG-slice,
    summary, SDG, client gates) — exact expected sets, not "non-empty".
  • Fixture covers the full C/C++ lowering checklist (goto, switch-fallthrough, short-circuit, RAII
    cleanup, try/catch) plus shared minimums (aliasing, SCC recursion, multi-file flow, sanitized +
    unsanitized taint pair).
  • -a 3 output validates against the shared SDK ProgramGraphs models; parity clause holds.
  • Cypher snapshot with graphs loads clean into empty Neo4j; CFGNode count matches JSON; no
    dangling edges.
  • --jobs N output byte-identical to --jobs 1 (determinism gate).
  • -a 1 / -a 2 wall-clock unchanged within noise.

Reference

Skill: cldk-forge:codeanalyzer-backenddataflow-graphs.md (contract), dataflow-construction.md
(stage-by-stage method + gates), dataflow-substrate-menu.md (C/C++ row), dataflow-issue-template.md.
Prior art: codeanalyzer-typescript#2 (the TS level-3 instantiation).

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Fields

    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions