Drop semanticdb step when indexing Java code #887
Draft
jupblb wants to merge 8 commits into
…ission
First two milestones of dropping the intermediate SemanticDB step in favour
of direct SCIP shard output from the Java compiler plugin.
Adds, with no behaviour change in the default config:
semanticdb-javac:
- ScipSymbols: helper that maps SemanticDB symbol strings to SCIP
symbol strings. Globals get the '. . . . ' placeholder prefix that
the aggregator later rewrites into 'scip-java maven g a v ...'.
Locals are normalised to the canonical 'local N' form.
- ScipShardWriter: write-or-merge helper for *.scip shards that
deduplicates documents/symbols/occurrences across compiler rounds.
- ScipShardFromSemanticdb: intermediate translator that converts the
in-memory Semanticdb.TextDocument into a single-document Scip.Index
shard. To be replaced by a direct-from-AST ScipVisitor in Milestone 3.
- SemanticdbJavacOptions: new -emit-scip:on|off flag (default off).
- SemanticdbTaskListener: when -emit-scip:on is set, also writes a
*.scip shard under META-INF/scip/ alongside the existing *.semanticdb
file, reusing the already-built TextDocument.
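The ScipSymbols mapping described above can be sketched roughly as follows. This is an illustration only, not the code in this PR: the class name ScipSymbolsSketch and the method toScip are hypothetical, and the sketch assumes that SemanticDB locals have the form `local3` and that the descriptor part of a global symbol can be carried over unchanged.

```java
/** Hypothetical sketch of a SemanticDB-to-SCIP symbol mapping; not the PR's ScipSymbols. */
final class ScipSymbolsSketch {
  /** Placeholder prefix from the description above; the aggregator rewrites it later. */
  static final String PLACEHOLDER_PREFIX = ". . . . ";

  private static final String LOCAL = "local";

  private ScipSymbolsSketch() {}

  /** Maps a SemanticDB symbol string to a placeholder-prefixed or local SCIP symbol. */
  static String toScip(String semanticdbSymbol) {
    if (semanticdbSymbol == null || semanticdbSymbol.isEmpty()) {
      return "";
    }
    if (isLocal(semanticdbSymbol)) {
      // "local3" becomes the canonical "local 3" form.
      return "local " + semanticdbSymbol.substring(LOCAL.length());
    }
    // Assumption: the descriptor text needs no further translation.
    return PLACEHOLDER_PREFIX + semanticdbSymbol;
  }

  /** True only for "local" followed by one or more digits. */
  static boolean isLocal(String symbol) {
    if (!symbol.startsWith(LOCAL) || symbol.length() == LOCAL.length()) {
      return false;
    }
    for (int i = LOCAL.length(); i < symbol.length(); i++) {
      if (!Character.isDigit(symbol.charAt(i))) {
        return false;
      }
    }
    return true;
  }
}
```

Checking for digits after `local` keeps a global such as `localhost/Foo#` from being misclassified as a local.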
scip-semanticdb:
- ScipShardWalker: recursively collects *.scip shards under the
configured targetroots, mirroring SemanticdbWalker.
- SymbolRewriter: rewrites placeholder global symbols into the final
'scip-java maven ...' form using PackageTable. Locals and already
rewritten symbols pass through unchanged.
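A minimal sketch of the rewrite rule, for illustration. The class SymbolRewriterSketch and its `coordinates` parameter are hypothetical: the real SymbolRewriter looks the package up in PackageTable, whereas this sketch assumes the caller already holds the formatted coordinates string.

```java
/** Hypothetical sketch of placeholder rewriting; the PR's SymbolRewriter uses PackageTable. */
final class SymbolRewriterSketch {
  static final String PLACEHOLDER_PREFIX = ". . . . ";
  static final String FINAL_PREFIX = "scip-java maven ";

  private SymbolRewriterSketch() {}

  /**
   * Rewrites a placeholder global into the final scheme.
   *
   * @param symbol a SCIP symbol as written into a shard
   * @param coordinates pre-formatted package coordinates (an assumption of this sketch)
   */
  static String rewrite(String symbol, String coordinates) {
    if (!symbol.startsWith(PLACEHOLDER_PREFIX)) {
      // Locals and already-rewritten symbols pass through unchanged.
      return symbol;
    }
    String descriptors = symbol.substring(PLACEHOLDER_PREFIX.length());
    return FINAL_PREFIX + coordinates + " " + descriptors;
  }
}
```

Because anything without the placeholder prefix is returned as-is, applying the rewrite twice gives the same result as applying it once.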
build.sbt:
- javacPlugin now depends on scipProto so the plugin can emit Scip.*
protobuf messages directly.
- Discard top-level Bazel BUILD files from fat-jar merge so the new
scipProto resources don't collide with semanticdb-java.
tests/unit:
- ScipSymbolsSuite: unit tests for ScipSymbols and SymbolRewriter,
including the local/global discrimination and Package.EMPTY fallback.
- ScipShardEmissionSuite: end-to-end test that drives javac with the
semanticdb plugin and -emit-scip:on, then parses the produced
Scip.Index shard to assert the document layout and that every
emitted symbol either uses the placeholder prefix or is a 'local N'.
All 29 unit tests pass.
Milestone 3 of the SemanticDB->SCIP migration: replace the bridge that
went through ScipShardFromSemanticdb with a direct AST walk that
produces Scip.Document values.
- ScipVisitor: fork of SemanticdbVisitor with identical traversal
semantics. Emits Scip.Occurrence, Scip.SymbolInformation, and
Scip.Relationship directly. Symbols still come from the existing
GlobalSymbolsCache/LocalSymbolsCache and are translated to the
placeholder SCIP form via ScipSymbols.fromSemanticdbSymbol at the
emission boundary. Skips signatures and annotations for now -
ScipSignatureFormatter in Milestone 4 will add signature_documentation.
- SemanticdbTaskListener: when -emit-scip:on is set, runs ScipVisitor
directly instead of converting from Semanticdb.TextDocument. This is a
second AST walk during the transition; SemanticdbVisitor remains the
sole producer of legacy *.semanticdb files until Milestone 8.
- ScipShardFromSemanticdb: deleted; no longer needed now that ScipVisitor
produces the same shard format natively.
All 29 unit tests pass, including the end-to-end ScipShardEmissionSuite
that exercises the new ScipVisitor through real javac invocations.
Milestone 4: emit SCIP signature_documentation directly from the compiler
plugin, eliminating the need to format signatures from a SemanticDB
intermediate representation.
- ScipSignatureFormatter: walks javac Element/TypeMirror and produces
a readable Java declaration string. Supports classes, interfaces,
annotations, enums, methods, constructors, fields, parameters,
locals, enum constants, and type parameters with bounds. The internal
TypePrinter handles declared types, type arguments, arrays,
primitives, type variables, wildcards, intersections, and void.
Suppresses implicit 'extends Object' and 'java.lang.Object' supertypes.
- ScipVisitor: when a definition is emitted, the formatter is invoked
and (when the result is non-empty) the signature is attached to
SymbolInformation.signature_documentation with language 'Java' and
the current source's relative path.
- ScipShardEmissionSuite: extended end-to-end checks. Verifies the
shard contains at least one signature_documentation block, that the
Foo class symbol's signature contains 'class Foo', and that the bar()
method symbol's signature contains 'int bar('.
All 29 unit tests pass.
Milestone 5: parallel aggregator that walks *.scip shards produced by
ScipVisitor and emits a final scip-java-scheme index.scip. The existing
SemanticDB-based ScipSemanticdb.run() is untouched.
- ScipShardAggregator:
* walks for *.scip shards (and *.jar files containing them) via
ScipShardWalker
* parses each shard into a Scip.Index
* rewrites placeholder global symbols ('. . . . ' prefix) into the
final 'scip-java maven g a v ...' form via SymbolRewriter
* deduplicates documents by relative_path, merging occurrences and
symbol-info entries from annotation-processor rounds
* computes inverse 'is_implementation && is_reference' relationships
across the whole project, gated on options.emitInverseRelationships
* emits one Metadata Index plus one Index per merged Document via
ScipWriter
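The deduplication-by-relative_path step can be illustrated with the sketch below. It is not the aggregator's code: Doc is a simplified, hypothetical stand-in for Scip.Document, and occurrences are reduced to opaque strings so the merge logic stays visible.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical sketch of merging per-round shard documents by relative path. */
final class DocumentMergeSketch {
  /** Simplified stand-in for Scip.Document. */
  static final class Doc {
    final String relativePath;
    // Insertion-ordered set: duplicates collapse, output order stays stable.
    final Set<String> occurrences = new LinkedHashSet<>();

    Doc(String relativePath) {
      this.relativePath = relativePath;
    }
  }

  private DocumentMergeSketch() {}

  /** Returns one Doc per distinct relative path, in first-seen order. */
  static List<Doc> mergeByPath(List<Doc> shardDocuments) {
    Map<String, Doc> merged = new LinkedHashMap<>();
    for (Doc shard : shardDocuments) {
      Doc target = merged.computeIfAbsent(shard.relativePath, Doc::new);
      target.occurrences.addAll(shard.occurrences);
    }
    return new ArrayList<>(merged.values());
  }
}
```

Two shards for the same source file, for example from separate annotation-processor rounds, collapse into a single document whose occurrences are the union of both.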
- ScipAggregationSuite: end-to-end test that compiles a Java source with
-emit-scip:on, runs ScipShardAggregator over the produced shards, and
asserts the final index has metadata with the scip-java tool name and
that every emitted symbol/occurrence is either local or starts with
'scip-java maven '.
All 30 unit tests pass.
Milestone 6: surface the direct-SCIP path through the existing
index-semanticdb command and through the Maven / ScipBuildTool paths so
end-to-end indexing can use ScipShardAggregator. Default behaviour is
unchanged.
- IndexSemanticdbCommand: new --use-scip-shards flag. When set, the
command runs ScipShardAggregator (walking META-INF/scip/*.scip) instead
of ScipSemanticdb (walking META-INF/semanticdb/*.semanticdb).
- SemanticdbOptionBuilder: reads -Dsemanticdb.emit-scip and appends
'-emit-scip:on' to the injected -Xplugin:semanticdb argument so the
custom javac wrapper emits SCIP shards.
- Embedded.customJavac: new optional emitScip parameter; when true,
propagates -Dsemanticdb.emit-scip=true into the launched javac
wrapper.
- MavenBuildTool: forwards index.indexSemanticdb.useScipShards to the
customJavac wrapper.
- ScipBuildTool: when useScipShards is on, appends '-emit-scip:on'
to the directly-constructed -Xplugin:semanticdb arguments used by
the in-process javac compilation.
Not yet wired (deferred):
- SemanticdbGradlePlugin propagation
- BazelBuildTool / scip_java.bzl
- Kotlin guard for projects that mix Java+Kotlin sources
All 30 unit tests pass.
…hots
Drives the minimized snapshot suite through the new SCIP-direct path
(via --use-scip-shards) and reconciles the resulting output so it can be
locked in as the canonical scheme.
semanticdb-javac:
- ScipVisitor: lowercase Document.language to 'java' (matching the
historical ScipSemanticdb output) and add (range, symbol, roles)
dedup of occurrences, preferring the variant that carries an
enclosing_range. Without this, multiple ANALYZE rounds emit a second
definition occurrence without enclosing_range that survives the
structural-equality dedup in ScipShardWriter.
- ScipVisitor: treat ENUM the same as CLASS/INTERFACE/ANNOTATION_TYPE
in supportsReferenceRelationship so parent relationships don't get
a spurious is_reference flag.
- ScipShardWriter: switch occurrence merge to the looser
(range, symbol, roles) key, preferring entries with enclosing_range.
- SemanticdbTaskListener: delete the stale .scip shard alongside the
.semanticdb file on ENTER so re-runs don't accumulate occurrences
across builds.
scip-semanticdb:
- ScipShardAggregator: mergeInto now uses the same (range, symbol,
roles) dedup with enclosing_range preference, and merges duplicate
symbol relationships across shards.
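The (range, symbol, roles) dedup with enclosing_range preference can be sketched as follows. The types are hypothetical simplifications of Scip.Occurrence, and the string key is an assumption of this sketch rather than the representation used in the PR.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch of occurrence dedup that prefers entries with enclosing_range. */
final class OccurrenceDedupSketch {
  /** Simplified stand-in for Scip.Occurrence. */
  static final class Occ {
    final int[] range;
    final String symbol;
    final int roles;
    final int[] enclosingRange; // null when absent

    Occ(int[] range, String symbol, int roles, int[] enclosingRange) {
      this.range = range;
      this.symbol = symbol;
      this.roles = roles;
      this.enclosingRange = enclosingRange;
    }

    boolean hasEnclosingRange() {
      return enclosingRange != null && enclosingRange.length > 0;
    }

    /** Looser key: ignores enclosing_range, unlike structural equality. */
    String key() {
      return Arrays.toString(range) + "|" + symbol + "|" + roles;
    }
  }

  private OccurrenceDedupSketch() {}

  static List<Occ> dedup(List<Occ> input) {
    Map<String, Occ> byKey = new LinkedHashMap<>();
    for (Occ occ : input) {
      Occ existing = byKey.get(occ.key());
      // Keep the first entry unless a later one adds an enclosing_range.
      if (existing == null || (!existing.hasEnclosingRange() && occ.hasEnclosingRange())) {
        byKey.put(occ.key(), occ);
      }
    }
    return new ArrayList<>(byKey.values());
  }
}
```

Under structural equality the two variants of a definition occurrence differ and both are kept; with the looser key they collide and only the richer one remains.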
build.sbt:
- Add -emit-scip:on to the minimized javac plugin invocation so the
tests/minimized targetroot always contains shard files.
tests/snapshots:
- MinimizedSnapshotScipGenerator now passes --use-scip-shards to
drive the snapshot suite through ScipShardAggregator.
- Regenerate all 23 minimized snapshots under the new 'scip-java'
symbol scheme.
tests/unit:
- ScipShardEmissionSuite: update assertions to expect the lowercase
'java' language string.
Full snapshot suite passes (102 tests). Unit suite passes (30 tests).
After M3-M7 the per-source SCIP shard format is stable and the
ScipShardAggregator produces equivalent output to the legacy
SemanticDB->SCIP path. This commit promotes the cheap compiler-side
half of the dual-emission to be on by default so that:
- any javac plugin invocation (sbt, Maven, Bazel, ad-hoc) writes a
*.scip shard under META-INF/scip/ alongside the *.semanticdb file
without needing an explicit -emit-scip:on flag;
- users (or build tools) that want to consume the new path only need
to flip the CLI switch (--use-scip-shards) when the indexer runs;
- legacy callers that only read *.semanticdb files are unaffected.
The CLI default for index-semanticdb's --use-scip-shards remains false
because the broader ecosystem (notably the Kotlin compiler and the
existing snapshot/build tool integrations) still produces only
*.semanticdb. That flip is deferred to a follow-up PR.
semanticdb-javac:
- SemanticdbJavacOptions.emitScip defaults to true. -emit-scip:off
is now the explicit opt-out and is documented as the legacy path.
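For illustration, a default-on flag with an explicit opt-out could be parsed along these lines. The class EmitScipOptionSketch is hypothetical and is not SemanticdbJavacOptions; in particular, letting the last occurrence of the flag win is an assumption of this sketch.

```java
/** Hypothetical sketch of parsing -emit-scip:on|off with a default of on. */
final class EmitScipOptionSketch {
  private EmitScipOptionSketch() {}

  static boolean parseEmitScip(String[] pluginArgs) {
    boolean emitScip = true; // default once shard emission is on by default
    for (String arg : pluginArgs) {
      if (arg.equals("-emit-scip:on")) {
        emitScip = true;
      } else if (arg.equals("-emit-scip:off")) {
        emitScip = false; // explicit opt-out, i.e. the legacy path
      }
      // Unrelated plugin arguments are ignored here.
    }
    return emitScip;
  }
}
```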
scip-java:
- SnapshotCommand: skip per-source shards (those without a
metadata.project_root) so 'scip-java snapshot' continues to render
only the top-level aggregator output. Per-source shards have no
project_root and would otherwise crash with 'missing scheme'
when their relative paths are resolved into a URI.
build.sbt:
- Drop the now-redundant -emit-scip:on flag from the minimized
project; the plugin default already emits shards.
tests/unit:
- ScipShardEmissionSuite: invert the off-path test so it explicitly
passes -emit-scip:off; the previous test relied on the old
default of false.
Full snapshot suite (102 tests) and unit suite (30 tests) green.
Post-PR1 cleanup of dead code, redundant flag plumbing, and duplication.
No behavioural changes; snapshot suite (102 passed) and unit suite (28 passed)
remain green.
Dead code removed:
- ScipShardAggregator: drop unused documentsFromShards{,Collected}
and their Stream/Collectors imports.
- ScipSymbols: drop unused isPlaceholderGlobal/descriptorPath; only
fromSemanticdbSymbol + PLACEHOLDER_PREFIX are needed in production.
- ScipSymbolsSuite: drop the tests for the removed helpers.
Redundant -emit-scip:on plumbing removed:
With compiler-side default emitScip=true (M8), the CLI/build-tool
machinery that conditionally toggled the flag is purely cosmetic.
- Embedded.customJavac: drop emitScip param + emitScipProp system
property prefix.
- MavenBuildTool: stop passing emitScip = useScipShards.
- ScipBuildTool: stop appending -emit-scip:on to the -Xplugin string.
- SemanticdbOptionBuilder: drop EMIT_SCIP system-property handling and
the corresponding xpluginOption branch.
- SemanticdbJavacOptions still parses -emit-scip:on / -emit-scip:off;
-emit-scip:off remains the compiler-side opt-out.
- IndexSemanticdbCommand help text no longer implies the shards require
an extra compiler flag.
Internal duplication removed:
- New ScipOccurrences package-private helper centralizes the
(range, symbol, roles) dedup key and the 'prefer enclosing_range'
merge rule that ScipVisitor and ScipShardWriter both used.
- ScipShardWriter.mergeSymbol now uses LinkedHashMap for relationships
so output ordering is deterministic.
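The ordering point can be illustrated with a small sketch. It is not ScipShardWriter.mergeSymbol: the flag constants and the choice to OR flags for a repeated symbol are assumptions made for the example. A LinkedHashMap iterates in first-insertion order, so the merged relationships come out in the same order on every run.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical sketch of merging relationships with a deterministic output order. */
final class RelationshipMergeSketch {
  static final int IS_REFERENCE = 1;
  static final int IS_IMPLEMENTATION = 2;

  private RelationshipMergeSketch() {}

  /** Merges (symbol, flags) pairs, combining flags for repeated symbols. */
  static Map<String, Integer> merge(String[] symbols, int[] flags) {
    // LinkedHashMap preserves first-insertion order, unlike HashMap.
    Map<String, Integer> merged = new LinkedHashMap<>();
    for (int i = 0; i < symbols.length; i++) {
      merged.merge(symbols[i], flags[i], (a, b) -> a | b);
    }
    return merged;
  }
}
```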
Small ScipVisitor cleanups:
- Drop dependency on Semanticdb Property bitmask; compute isStatic /
isAbstract directly from Modifier set.
- Make 'source' final and initialized via a static sourceText helper.
- Merge identical switch arms for ENUM/CLASS/INTERFACE/ANNOTATION_TYPE
in emitSymbolInformation.
- Refresh stale class-level javadoc; signature docs are now produced
via ScipSignatureFormatter.