Drop semanticdb step when indexing Java code #887

Draft
jupblb wants to merge 8 commits into main from michal/drop-semanticdb

jupblb (Member) commented May 27, 2026

No description provided.

jupblb added 8 commits May 27, 2026 13:48
…ission

First two milestones of dropping the intermediate SemanticDB step in favour
of direct SCIP shard output from the Java compiler plugin.

Adds, with no behaviour change in the default config:

  semanticdb-javac:
    - ScipSymbols: helper that maps SemanticDB symbol strings to SCIP
      symbol strings. Globals get the '. . . . ' placeholder prefix that
      the aggregator later rewrites into 'scip-java maven g a v ...'.
      Locals are normalised to the canonical 'local N' form.
    - ScipShardWriter: write-or-merge helper for *.scip shards that
      deduplicates documents/symbols/occurrences across compiler rounds.
    - ScipShardFromSemanticdb: intermediate translator that converts the
      in-memory Semanticdb.TextDocument into a single-document Scip.Index
      shard. To be replaced by a direct-from-AST ScipVisitor in Milestone 3.
    - SemanticdbJavacOptions: new -emit-scip:on|off flag (default off).
    - SemanticdbTaskListener: when -emit-scip:on is set, also writes a
      *.scip shard under META-INF/scip/ alongside the existing *.semanticdb
      file, reusing the already-built TextDocument.

  scip-semanticdb:
    - ScipShardWalker: recursively collects *.scip shards under the
      configured targetroots, mirroring SemanticdbWalker.
    - SymbolRewriter: rewrites placeholder global symbols into the final
      'scip-java maven ...' form using PackageTable. Locals and already
      rewritten symbols pass through unchanged.

  build.sbt:
    - javacPlugin now depends on scipProto so the plugin can emit Scip.*
      protobuf messages directly.
    - Discard top-level Bazel BUILD files from fat-jar merge so the new
      scipProto resources don't collide with semanticdb-java.

  tests/unit:
    - ScipSymbolsSuite: unit tests for ScipSymbols and SymbolRewriter,
      including the local/global discrimination and Package.EMPTY fallback.
    - ScipShardEmissionSuite: end-to-end test that drives javac with the
      semanticdb plugin and -emit-scip:on, then parses the produced
      Scip.Index shard to assert the document layout and that every
      emitted symbol either uses the placeholder prefix or is a 'local N'.

All 29 unit tests pass.
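
The two-stage symbol handling described above (placeholder prefix in the compiler plugin, final prefix in the aggregator) can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the class name PlaceholderSymbolSketch, the single packageCoordinates string, and the assumptions that SemanticDB locals look like "local0" and that the SemanticDB symbol is reused unchanged as the descriptor are all hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch; not the actual ScipSymbols / SymbolRewriter classes.
final class PlaceholderSymbolSketch {
  // Four '.' tokens stand in for the parts the compiler plugin cannot know yet.
  static final String PLACEHOLDER_PREFIX = ". . . . ";

  // Assumption: SemanticDB local symbols have the shape "local0", "local12", ...
  private static final Pattern SEMANTICDB_LOCAL = Pattern.compile("local(\\d+)");

  /** Compiler side: map a SemanticDB symbol to a placeholder SCIP symbol. */
  static String fromSemanticdbSymbol(String semanticdbSymbol) {
    Matcher m = SEMANTICDB_LOCAL.matcher(semanticdbSymbol);
    if (m.matches()) {
      return "local " + m.group(1); // canonical 'local N' form
    }
    // Assumption: the SemanticDB symbol is reused as-is as the descriptor.
    return PLACEHOLDER_PREFIX + semanticdbSymbol;
  }

  /** Aggregator side: swap the placeholder prefix for the final one. */
  static String rewrite(String scipSymbol, String packageCoordinates) {
    if (!scipSymbol.startsWith(PLACEHOLDER_PREFIX)) {
      return scipSymbol; // locals and already rewritten symbols pass through
    }
    String descriptor = scipSymbol.substring(PLACEHOLDER_PREFIX.length());
    return "scip-java maven " + packageCoordinates + " " + descriptor;
  }
}
```

How the real SymbolRewriter derives the 'g a v' part from PackageTable is not shown in the commit message, so the sketch takes it as an opaque string.
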
Milestone 3 of the SemanticDB->SCIP migration: replace the bridge that
went through ScipShardFromSemanticdb with a direct AST walk that
produces Scip.Document values.

  - ScipVisitor: fork of SemanticdbVisitor with identical traversal
    semantics. Emits Scip.Occurrence, Scip.SymbolInformation, and
    Scip.Relationship directly. Symbols still come from the existing
    GlobalSymbolsCache/LocalSymbolsCache and are translated to the
    placeholder SCIP form via ScipSymbols.fromSemanticdbSymbol at the
    emission boundary. Skips signatures and annotations for now -
    ScipSignatureFormatter in Milestone 4 will add signature_documentation.

  - SemanticdbTaskListener: when -emit-scip:on is set, runs ScipVisitor
    directly instead of converting from Semanticdb.TextDocument. This is a
    second AST walk during the transition; SemanticdbVisitor remains the
    sole producer of legacy *.semanticdb files until Milestone 8.

  - ScipShardFromSemanticdb: deleted; no longer needed now that ScipVisitor
    produces the same shard format natively.

All 29 unit tests pass, including the end-to-end ScipShardEmissionSuite
that exercises the new ScipVisitor through real javac invocations.

Milestone 4: emit SCIP signature_documentation directly from the compiler
plugin, eliminating the need to format signatures from a SemanticDB
intermediate representation.

  - ScipSignatureFormatter: walks javac Element/TypeMirror and produces
    a readable Java declaration string. Supports classes, interfaces,
    annotations, enums, methods, constructors, fields, parameters,
    locals, enum constants, and type parameters with bounds. The internal
    TypePrinter handles declared types, type arguments, arrays,
    primitives, type variables, wildcards, intersections, and void.
    Suppresses implicit 'extends Object' and 'java.lang.Object' supertypes.

  - ScipVisitor: when a definition is emitted, the formatter is invoked
    and (when the result is non-empty) the signature is attached to
    SymbolInformation.signature_documentation with language 'Java' and
    the current source's relative path.

  - ScipShardEmissionSuite: extended end-to-end checks. Verifies the
    shard contains at least one signature_documentation block, that the
    Foo class symbol's signature contains 'class Foo', and that the bar()
    method symbol's signature contains 'int bar('.

All 29 unit tests pass.

Milestone 5: parallel aggregator that walks *.scip shards produced by
ScipVisitor and emits a final scip-java-scheme index.scip. The existing
SemanticDB-based ScipSemanticdb.run() is untouched.

  - ScipShardAggregator:
      * walks for *.scip shards (and *.jar files containing them) via
        ScipShardWalker
      * parses each shard into a Scip.Index
      * rewrites placeholder global symbols ('. . . . ' prefix) into the
        final 'scip-java maven g a v ...' form via SymbolRewriter
      * deduplicates documents by relative_path, merging occurrences and
        symbol-info entries from annotation-processor rounds
      * computes inverse 'is_implementation && is_reference' relationships
        across the whole project, gated on options.emitInverseRelationships
      * emits one Metadata Index plus one Index per merged Document via
        ScipWriter

  - ScipAggregationSuite: end-to-end test that compiles a Java source with
    -emit-scip:on, runs ScipShardAggregator over the produced shards, and
    asserts the final index has metadata with the scip-java tool name and
    that every emitted symbol/occurrence is either local or starts with
    'scip-java maven '.

All 30 unit tests pass.
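
The first aggregator step, collecting *.scip shards under the configured targetroots, could look roughly like the sketch below. It is a simplified illustration with a hypothetical class name (ShardWalkSketch); it only handles loose files on disk and skips the *.jar case mentioned above. Whether the real ScipShardWalker sorts its results is not stated, so the sorting here is an assumption made for reproducible output.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical sketch; not the actual ScipShardWalker.
final class ShardWalkSketch {
  /** Recursively collects every regular file ending in ".scip" under each root. */
  static List<Path> findShards(List<Path> targetroots) throws IOException {
    List<Path> shards = new ArrayList<>();
    for (Path root : targetroots) {
      if (!Files.isDirectory(root)) {
        continue; // a missing or non-directory targetroot contributes nothing
      }
      // Files.walk returns a lazy stream that must be closed.
      try (Stream<Path> paths = Files.walk(root)) {
        shards.addAll(
            paths
                .filter(Files::isRegularFile)
                .filter(p -> p.getFileName().toString().endsWith(".scip"))
                .sorted() // assumption: stable order within each root
                .collect(Collectors.toList()));
      }
    }
    return shards;
  }
}
```
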
Milestone 6: surface the direct-SCIP path through the existing
index-semanticdb command and through the Maven / ScipBuildTool paths so
end-to-end indexing can use ScipShardAggregator. Default behaviour is
unchanged.

  - IndexSemanticdbCommand: new --use-scip-shards flag. When set, the
    command runs ScipShardAggregator (walking META-INF/scip/*.scip) instead
    of ScipSemanticdb (walking META-INF/semanticdb/*.semanticdb).

  - SemanticdbOptionBuilder: reads -Dsemanticdb.emit-scip and appends
    '-emit-scip:on' to the injected -Xplugin:semanticdb argument so the
    custom javac wrapper emits SCIP shards.

  - Embedded.customJavac: new optional emitScip parameter; when true,
    propagates -Dsemanticdb.emit-scip=true into the launched javac
    wrapper.

  - MavenBuildTool: forwards index.indexSemanticdb.useScipShards to the
    customJavac wrapper.

  - ScipBuildTool: when useScipShards is on, appends '-emit-scip:on'
    to the directly-constructed -Xplugin:semanticdb arguments used by
    the in-process javac compilation.

Not yet wired (deferred):
  - SemanticdbGradlePlugin propagation
  - BazelBuildTool / scip_java.bzl
  - Kotlin guard for projects that mix Java+Kotlin sources

All 30 unit tests pass.

…hots

Drives the minimized snapshot suite through the new SCIP-direct path
(via --use-scip-shards) and reconciles the resulting output so it can be
locked in as the canonical scheme.

  semanticdb-javac:
    - ScipVisitor: lowercase Document.language to 'java' (matching the
      historical ScipSemanticdb output) and add (range, symbol, roles)
      dedup of occurrences, preferring the variant that carries an
      enclosing_range. Without this, multiple ANALYZE rounds emit a second
      definition occurrence without enclosing_range that survives the
      structural-equality dedup in ScipShardWriter.
    - ScipVisitor: treat ENUM the same as CLASS/INTERFACE/ANNOTATION_TYPE
      in supportsReferenceRelationship so parent relationships don't get
      a spurious is_reference flag.
    - ScipShardWriter: switch occurrence merge to the looser
      (range, symbol, roles) key, preferring entries with enclosing_range.
    - SemanticdbTaskListener: delete the stale .scip shard alongside the
      .semanticdb file on ENTER so re-runs don't accumulate occurrences
      across builds.

  scip-semanticdb:
    - ScipShardAggregator: mergeInto now uses the same (range, symbol,
      roles) dedup with enclosing_range preference, and merges duplicate
      symbol relationships across shards.

  build.sbt:
    - Add -emit-scip:on to the minimized javac plugin invocation so the
      tests/minimized targetroot always contains shard files.

  tests/snapshots:
    - MinimizedSnapshotScipGenerator now passes --use-scip-shards to
      drive the snapshot suite through ScipShardAggregator.
    - Regenerate all 23 minimized snapshots under the new 'scip-java'
      symbol scheme.

  tests/unit:
    - ScipShardEmissionSuite: update assertions to expect the lowercase
      'java' language string.

Full snapshot suite passes (102 tests). Unit suite passes (30 tests).
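
The looser occurrence dedup described above can be sketched as below. The Occ class is a simplified stand-in for Scip.Occurrence (the real type is a protobuf message), and OccurrenceDedupSketch is a hypothetical name; only the (range, symbol, roles) key and the "prefer the entry with enclosing_range" rule are taken from the commit message.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch; not the actual ScipVisitor / ScipShardWriter code.
final class OccurrenceDedupSketch {
  /** Simplified stand-in for Scip.Occurrence; enclosingRange is empty when absent. */
  static final class Occ {
    final List<Integer> range;
    final String symbol;
    final int roles;
    final List<Integer> enclosingRange;

    Occ(List<Integer> range, String symbol, int roles, List<Integer> enclosingRange) {
      this.range = range;
      this.symbol = symbol;
      this.roles = roles;
      this.enclosingRange = enclosingRange;
    }
  }

  /** Keeps one occurrence per (range, symbol, roles) key, in first-seen order. */
  static List<Occ> dedup(List<Occ> occurrences) {
    Map<List<Object>, Occ> byKey = new LinkedHashMap<>();
    for (Occ occ : occurrences) {
      // enclosingRange is deliberately left out of the key.
      List<Object> key = Arrays.<Object>asList(occ.range, occ.symbol, occ.roles);
      Occ existing = byKey.get(key);
      boolean upgrade =
          existing != null
              && existing.enclosingRange.isEmpty()
              && !occ.enclosingRange.isEmpty();
      if (existing == null || upgrade) {
        byKey.put(key, occ); // replacing a value keeps the key's original position
      }
    }
    return new ArrayList<>(byKey.values());
  }
}
```

With full structural equality, the two definition occurrences would count as different because only one carries enclosing_range; leaving that field out of the key is what collapses them.
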
After M3-M7 the per-source SCIP shard format is stable and the
ScipShardAggregator produces output equivalent to the legacy
SemanticDB->SCIP path. This commit turns on the cheap compiler-side
half of the dual emission by default so that:

  - any javac plugin invocation (sbt, Maven, Bazel, ad-hoc) writes a
    *.scip shard under META-INF/scip/ alongside the *.semanticdb file
    without needing an explicit -emit-scip:on flag;
  - users (or build tools) that want to consume the new path only need
    to pass the CLI switch (--use-scip-shards) when the indexer runs;
  - legacy callers that only read *.semanticdb files are unaffected.

The CLI default for index-semanticdb's --use-scip-shards remains false
because the broader ecosystem (notably the Kotlin compiler and the
existing snapshot/build tool integrations) still produces only
*.semanticdb. That flip is deferred to a follow-up PR.

  semanticdb-javac:
    - SemanticdbJavacOptions.emitScip defaults to true. -emit-scip:off
      is now the explicit opt-out and is documented as the legacy path.

  scip-java:
    - SnapshotCommand: skip per-source shards (those without a
      metadata.project_root) so 'scip-java snapshot' continues to render
      only the top-level aggregator output. Per-source shards have no
      project_root and would otherwise crash with 'missing scheme'
      when their relative paths are resolved into a URI.

  build.sbt:
    - Drop the now-redundant -emit-scip:on flag from the minimized
      project; the plugin default already emits shards.

  tests/unit:
    - ScipShardEmissionSuite: invert the off-path test so it explicitly
      passes -emit-scip:off; the previous test relied on the old
      default of false.

Full snapshot suite (102 tests) and unit suite (30 tests) green.
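
The default flip can be illustrated with a small option-parsing sketch. The class name EmitScipOptionSketch is hypothetical and the real SemanticdbJavacOptions parses many more options; the behaviour when both flags are passed (last one wins) is an assumption, since the commit message does not cover it.

```java
// Hypothetical sketch; not the actual SemanticdbJavacOptions.
final class EmitScipOptionSketch {
  /** Returns whether a *.scip shard should be written for these plugin arguments. */
  static boolean parseEmitScip(String[] pluginArgs) {
    boolean emitScip = true; // default is on as of this commit
    for (String arg : pluginArgs) {
      if (arg.equals("-emit-scip:on")) {
        emitScip = true;
      } else if (arg.equals("-emit-scip:off")) {
        emitScip = false; // explicit opt-out, i.e. the legacy path
      }
      // Any other argument belongs to a different option and is ignored here.
    }
    return emitScip;
  }
}
```
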
Post-PR1 cleanup of dead code, redundant flag plumbing, and duplication.
No behavioral changes; snapshot suite (102 passed) and unit suite (28 passed)
remain green.

Dead code removed:
- ScipShardAggregator: drop unused documentsFromShards{,Collected}
  and their Stream/Collectors imports.
- ScipSymbols: drop unused isPlaceholderGlobal/descriptorPath; only
  fromSemanticdbSymbol + PLACEHOLDER_PREFIX are needed in production.
- ScipSymbolsSuite: drop the tests for the removed helpers.

Redundant -emit-scip:on plumbing removed:
With compiler-side default emitScip=true (M8), the CLI/build-tool
machinery that conditionally toggled the flag is purely cosmetic.
- Embedded.customJavac: drop emitScip param + emitScipProp system
  property prefix.
- MavenBuildTool: stop passing emitScip = useScipShards.
- ScipBuildTool: stop appending -emit-scip:on to the -Xplugin string.
- SemanticdbOptionBuilder: drop EMIT_SCIP system-property handling and
  the corresponding xpluginOption branch.
- SemanticdbJavacOptions still parses -emit-scip:on / -emit-scip:off;
  -emit-scip:off remains the compiler-side opt-out.
- IndexSemanticdbCommand help text no longer implies the shards require
  an extra compiler flag.

Internal duplication removed:
- New ScipOccurrences package-private helper centralizes the
  (range, symbol, roles) dedup key and the 'prefer enclosing_range'
  merge rule that ScipVisitor and ScipShardWriter both used.
- ScipShardWriter.mergeSymbol now uses LinkedHashMap for relationships
  so output ordering is deterministic.

Small ScipVisitor cleanups:
- Drop dependency on Semanticdb Property bitmask; compute isStatic /
  isAbstract directly from Modifier set.
- Make 'source' final and initialized via a static sourceText helper.
- Merge identical switch arms for ENUM/CLASS/INTERFACE/ANNOTATION_TYPE
  in emitSymbolInformation.
- Refresh stale class-level javadoc; signature docs are now produced
  via ScipSignatureFormatter.
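
The deterministic relationship ordering mentioned for ScipShardWriter.mergeSymbol relies on LinkedHashMap iterating in insertion order. A rough sketch, using a simplified Rel class in place of Scip.Relationship and a hypothetical class name; combining the flags of duplicates with a logical OR is an assumption, not something the commit message states:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch; not the actual ScipShardWriter.mergeSymbol.
final class RelationshipMergeSketch {
  /** Simplified stand-in for Scip.Relationship. */
  static final class Rel {
    final String symbol;
    final boolean isReference;
    final boolean isImplementation;

    Rel(String symbol, boolean isReference, boolean isImplementation) {
      this.symbol = symbol;
      this.isReference = isReference;
      this.isImplementation = isImplementation;
    }

    /** Assumption: duplicates are combined by OR-ing their flags. */
    Rel or(Rel other) {
      return new Rel(
          symbol,
          isReference || other.isReference,
          isImplementation || other.isImplementation);
    }
  }

  /** Merges two relationship lists by target symbol, keeping first-seen order. */
  static List<Rel> merge(List<Rel> existing, List<Rel> incoming) {
    // LinkedHashMap keeps insertion order, so the output does not depend on hashing.
    Map<String, Rel> bySymbol = new LinkedHashMap<>();
    for (List<Rel> rels : Arrays.asList(existing, incoming)) {
      for (Rel rel : rels) {
        bySymbol.merge(rel.symbol, rel, Rel::or);
      }
    }
    return new ArrayList<>(bySymbol.values());
  }
}
```

A plain HashMap would yield the same set of relationships, but its iteration order is unspecified, which is what makes output ordering non-deterministic.
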