Reduce regular CI test suite runtime #4495

Open
janhoy wants to merge 18 commits into apache:main from janhoy:feature/improve-test-speed

Contributor

@janhoy janhoy commented Jun 2, 2026

This PR reduces regular CI test suite runtime through three strategies:

  1. Fix one severely under-optimized test (10x speedup)
  2. Calibrate @Repeat iteration counts for regular vs nightly CI
  3. Move 10 slow integration/stress tests to @Nightly

Tests annotated @Nightly continue to run in the dedicated nightly CI job (-Ptests.nightly=true) with no loss of coverage. Regular PR/branch CI becomes significantly faster.

Strategy 1 — Fix inefficient test structure

DistributedCombinedQueryComponentTest (~80s → ~10s, 10x speedup)

Root cause: The test had 6 separate @Test methods, all operating on an identical document set. BaseDistributedSearchTestCase uses a method-level @Rule (ShardsRepeatRule) — not a @ClassRule — so it creates and destroys the full distributed cluster for each test method. With 6 + 2 extra methods, the setup/teardown overhead (≈9s each) dominated the 80s runtime.

Fix: Merged the 6 same-dataset methods into a single testCombinedQueries() method, reducing cluster lifecycles from 8 to 3. All assertions for single-lexical matching, multi-lexical matching, sorting, pagination, faceting, and facet+highlighting are preserved — just executed within one cluster lifecycle.
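
To make the lifecycle arithmetic concrete, here is a minimal sketch that only counts cluster startups. `FakeCluster` and `LifecycleSketch` are hypothetical stand-ins, not classes from the Solr test framework; the sketch assumes nothing beyond "one cluster per test method", as described above.

```java
import java.util.List;

/** Hypothetical stand-in for a distributed test cluster; counts how often it is started. */
final class FakeCluster implements AutoCloseable {
    static int lifecycles = 0;

    FakeCluster() {
        lifecycles++; // in the real test this is the expensive part (roughly 9s per the PR)
    }

    @Override
    public void close() {}
}

final class LifecycleSketch {
    /** Mimics a method-level rule: every test method gets its own cluster. */
    static int runEachInOwnCluster(List<Runnable> testMethods) {
        FakeCluster.lifecycles = 0;
        for (Runnable test : testMethods) {
            try (FakeCluster cluster = new FakeCluster()) {
                test.run();
            }
        }
        return FakeCluster.lifecycles;
    }

    public static void main(String[] args) {
        Runnable check = () -> {}; // placeholder for a group of assertions

        // Before: 6 same-dataset methods + 2 others = 8 cluster lifecycles.
        List<Runnable> before = List.of(check, check, check, check, check, check, check, check);

        // After: the 6 same-dataset checks run back to back inside one method.
        Runnable merged = () -> { for (int i = 0; i < 6; i++) check.run(); };
        List<Runnable> after = List.of(merged, check, check);

        System.out.println(runEachInOwnCluster(before)); // 8
        System.out.println(runEachInOwnCluster(after));  // 3
    }
}
```

The trade-off of merging is that a failure in an early assertion hides the later ones within the same method, which is presumably acceptable here because all six groups share one dataset.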

Strategy 2 — Calibrate @Repeat counts (regular CI vs nightly)

@Repeat requires a compile-time constant so TEST_NIGHTLY ? N : M cannot be used directly in the annotation. The solution uses the subclass pattern: reduce the count in the base class for regular CI, then create a one-liner *NightlyTest subclass annotated @Nightly @Repeat(originalCount) that inherits all tests and runs with the full count nightly.

This preserves all framework semantics: each iteration gets a distinct random seed, independent setup/teardown, separate failure reporting, and unique test naming — benefits that would be lost by converting to a plain loop.
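
A minimal sketch of the subclass pattern follows. The `Repeat` and `Nightly` annotations here are simplified stand-ins declared locally so the example compiles on its own; they are not the real test-framework annotations, and `iterationsFor` is a hypothetical helper that only illustrates how a runner could pick up the count from the class being run.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Inherited;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;

// Simplified stand-ins for the real test-framework annotations, declared here
// only so the sketch compiles on its own.
@Inherited
@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.TYPE)
@interface Repeat {
    int iterations();
}

@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.TYPE)
@interface Nightly {}

// Regular CI: the base class carries the reduced count.
// @Repeat(iterations = NIGHTLY ? 10 : 1) would not compile if NIGHTLY is read
// from a system property, because annotation values must be constant expressions.
@Repeat(iterations = 1)
class RandomizedTaggerSketch {
    // test methods would live here and are inherited by the subclass
}

// Nightly CI: a one-line subclass restores the original count.
@Nightly
@Repeat(iterations = 10)
class RandomizedTaggerNightlySketch extends RandomizedTaggerSketch {}

final class RepeatSketch {
    /** Reads the count from the class being run; an annotation on the subclass wins. */
    static int iterationsFor(Class<?> testClass) {
        Repeat repeat = testClass.getAnnotation(Repeat.class);
        return repeat == null ? 1 : repeat.iterations();
    }

    public static void main(String[] args) {
        System.out.println(iterationsFor(RandomizedTaggerSketch.class));        // 1
        System.out.println(iterationsFor(RandomizedTaggerNightlySketch.class)); // 10
    }
}
```

The sketch covers only the class-level case. Where the annotation sits on a single method, the real subclass may need additional handling that is not shown here.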

| Test class | Regular CI | Nightly | Nightly subclass |
|---|---|---|---|
| RandomizedTaggerTest | 1 iteration | 10 iterations | RandomizedTaggerNightlyTest |
| TestSolr4Spatial2 (testLLPDecodeIsStableAndPrecise) | 1 iteration | 10 iterations | TestSolr4Spatial2Nightly |
| SpatialHeatmapFacetsTest (testPng) | 1 iteration | 3 iterations | SpatialHeatmapFacetsNightlyTest |
| CloudExitableDirectoryReaderTest (testCreepThenBite) | 1 iteration | 5 iterations | CloudExitableDirectoryReaderNightlyTest |

Strategy 3 — Move 10 tests to @Nightly

These tests are slow not because of a fixable design issue, but because they are inherently integration/stress tests that exercise complex distributed behavior, external infrastructure, or require many repetitions to catch race conditions. They belong in nightly CI.

| Test | Module | Why it's slow |
|---|---|---|
| RollingRestartTest | solr:core | Repeatedly stops/starts Jetty nodes and waits for overseer leader election across up to 16 nodes. Even at the minimum 2 restarts, the ZooKeeper coordination overhead makes this a stress test, not a unit test. |
| SyncSliceTest | solr:core | Exercises leader election and peer-sync after deliberate shard inconsistency. Uses 4–7 shard nodes; deliberately indexes to skip servers and waits for recovery. |
| RecoveryZkTest | solr:core | Indexes up to 3000 docs across two concurrent threads, stops/restarts a replica mid-index, then waits for full replication. The if (!TEST_NIGHTLY) branch also reveals it was written with nightly in mind. |
| UnloadDistributedZkTest | solr:core | Exercises core unloading, ZK state transitions, and replica removal across a distributed cluster. Heavy ZooKeeper interaction throughout. |
| SolrAndKafkaIntegrationTest | solr:cross-dc-manager | Requires starting an embedded Kafka cluster (EmbeddedKafkaCluster) alongside a full SolrCloud cluster. The external broker startup/shutdown alone makes this integration-only. |
| GCSIncrementalBackupTest | solr:modules:gcs-repository | Full GCS backup-and-restore integration test: creates a collection, indexes docs, backs up to GCS, restores, verifies. Inherently I/O and cluster-heavy. |
| S3IncrementalBackupTest | solr:modules:s3-repository | Same as above for S3, using an embedded S3MockRule. Full backup lifecycle per test method. |
| BadClusterTest | solr:solrj-streaming | Progressively degrades a live cluster across ordered test scenarios (stopping replicas, killing leaders) to verify streaming behavior under failure. The cluster worsens through the test by design. |
| PerReplicaStatesIntegrationTest | solr:solrj | Creates multiple full MiniSolrCloudClusters within a single test class. Even the class Javadoc notes: "This test would be faster if we simulated the ZK state instead." |
| TestPullReplica | solr:core | Has @Repeat(30) on testCreateDelete: 30 full collection create/delete cycles. Multiple other methods exercise pull replica replication, which requires waiting for index replication to complete. One of the heaviest cloud tests in the suite. |

janhoy added 16 commits June 2, 2026 11:22
…add nightly subclass with full 5 iterations

This is an integration test with heavy ZooKeeper interaction and multiple
cluster restarts. Move to nightly-only to reduce regular CI time.

Integration test requiring embedded Kafka cluster, inherently slow and
resource-intensive. Move to nightly-only to reduce regular CI overhead.

Full GCS backup/restore integration test, slow by nature. Move to nightly
to reduce regular CI run time.

Full S3 backup/restore integration test using an embedded S3Mock, slow by
nature. Move to nightly to reduce regular CI run time.

Integration test that progressively degrades a live cluster. Slow and
resource-intensive by design. Move to nightly-only CI.

Integration test for per-replica ZK state management; creates multiple
clusters per test method. Even the file's Javadoc notes it would be faster
with ZK simulation. Move to nightly-only CI.

Has @Repeat(30) on testCreateDelete and exercises pull replica replication
across multiple test methods, one of the heaviest cloud tests. Move to
nightly-only CI where the high repeat count is appropriate.

Contributor

Copilot AI left a comment

Pull request overview

This PR aims to reduce regular CI runtime by (1) restructuring an expensive distributed test to reuse cluster setup, (2) lowering @Repeat iteration counts in regular CI while preserving full coverage via @Nightly subclasses, and (3) moving inherently slow integration/stress tests to nightly CI only.

Changes:

  • Consolidate DistributedCombinedQueryComponentTest coverage into a single test method to reduce repeated cluster lifecycle overhead.
  • Reduce @Repeat iterations for regular CI and add @Nightly subclasses that run the original iteration counts in nightly CI.
  • Annotate multiple slow integration/stress tests with @Nightly so they run only in nightly CI.

Reviewed changes

Copilot reviewed 19 out of 19 changed files in this pull request and generated 2 comments.

| File | Description |
|---|---|
| solr/solrj/src/test/org/apache/solr/common/cloud/PerReplicaStatesIntegrationTest.java | Mark per-replica-states integration test as @Nightly to keep regular CI faster. |
| solr/solrj-streaming/src/test/org/apache/solr/client/solrj/io/stream/BadClusterTest.java | Move slow/cluster-degrading streaming test to nightly via @Nightly. |
| solr/modules/s3-repository/src/test/org/apache/solr/s3/S3IncrementalBackupTest.java | Run S3 incremental backup integration test only in nightly. |
| solr/modules/gcs-repository/src/test/org/apache/solr/gcs/GCSIncrementalBackupTest.java | Run GCS incremental backup integration test only in nightly. |
| solr/cross-dc-manager/src/test/org/apache/solr/crossdc/manager/SolrAndKafkaIntegrationTest.java | Move embedded-Kafka integration test to nightly via @Nightly. |
| solr/core/src/test/org/apache/solr/search/TestSolr4Spatial2.java | Reduce regular CI repeat count for testLLPDecodeIsStableAndPrecise. |
| solr/core/src/test/org/apache/solr/search/TestSolr4Spatial2Nightly.java | Add nightly subclass restoring the original repeat count. |
| solr/core/src/test/org/apache/solr/search/facet/SpatialHeatmapFacetsTest.java | Reduce regular CI repeat count for testPng. |
| solr/core/src/test/org/apache/solr/search/facet/SpatialHeatmapFacetsNightlyTest.java | Add nightly subclass restoring the original repeat count. |
| solr/core/src/test/org/apache/solr/handler/tagger/RandomizedTaggerTest.java | Reduce regular CI repeat count at class level. |
| solr/core/src/test/org/apache/solr/handler/tagger/RandomizedTaggerNightlyTest.java | Add nightly subclass restoring the original repeat count. |
| solr/core/src/test/org/apache/solr/handler/component/DistributedCombinedQueryComponentTest.java | Merge multiple test methods into one to reduce repeated distributed cluster setup. |
| solr/core/src/test/org/apache/solr/cloud/UnloadDistributedZkTest.java | Move distributed unload test to nightly via @Nightly. |
| solr/core/src/test/org/apache/solr/cloud/TestPullReplica.java | Move heavy pull-replica test class to nightly via @Nightly. |
| solr/core/src/test/org/apache/solr/cloud/SyncSliceTest.java | Move leader-election/peer-sync style test to nightly via @Nightly. |
| solr/core/src/test/org/apache/solr/cloud/RollingRestartTest.java | Move rolling restart stress test to nightly via @Nightly. |
| solr/core/src/test/org/apache/solr/cloud/RecoveryZkTest.java | Move recovery stress test to nightly via @Nightly. |
| solr/core/src/test/org/apache/solr/cloud/CloudExitableDirectoryReaderTest.java | Reduce regular CI repeat count for testCreepThenBite. |
| solr/core/src/test/org/apache/solr/cloud/CloudExitableDirectoryReaderNightlyTest.java | Add nightly subclass restoring the original repeat count. |

…ponentTest

Replaces escaped string concatenations with Java text blocks, making the
JSON structure readable. Also fixes a latent bug in facetQuery where a
missing comma between "offset":1 and "fields" caused noggit to silently
ignore the fields parameter.
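
As a rough illustration of the change this commit message describes, the sketch below writes the same JSON once as escaped concatenation and once as a Java text block (Java 15+). The JSON payload and the class name `TextBlockSketch` are made up for the example and are not copied from the test; the point is only that a missing comma between `"offset"` and `"fields"` is much easier to spot in the second form.

```java
final class TextBlockSketch {
    /** Old style: escaped quotes and concatenation hide the JSON structure. */
    static String escaped() {
        return "{\"queries\":{\"lexical1\":{\"lucene\":{\"query\":\"id:1\"}}},"
            + "\"offset\":1,"
            + "\"fields\":[\"id\",\"score\"]}";
    }

    /** Text block (Java 15+): same JSON, with one member per line. */
    static String textBlock() {
        return """
            {
              "queries": {"lexical1": {"lucene": {"query": "id:1"}}},
              "offset": 1,
              "fields": ["id", "score"]
            }
            """;
    }

    public static void main(String[] args) {
        // The two forms differ only in whitespace between tokens
        // (none of the sample values contain spaces).
        String compact = textBlock().replaceAll("\\s+", "");
        System.out.println(compact.equals(escaped())); // true
    }
}
```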
Contributor

epugh commented Jun 2, 2026

More of a comment, but I wonder if we need to change from a @Nightly concept to a @Slow or @Intense or @System annotation. Nightly makes it seem like someone else's responsibility to run, whereas @Slow or @Intense may make it more normal to run these on individual laptops, so we run them more often.

+ "\"lexical2\":{\"lucene\":{\"query\":\"id:(4^1 OR 5^2 OR 7^3 OR 10^2)\"}}},"
+ "\"fields\":[\"id\",\"score\",\"title\"],"
+ "\"params\":{\"combiner\":true,\"combiner.query\":[\"lexical1\",\"lexical2\"]}}",
"""
Contributor

I really like the use of """ as well... Kind of wish we had one sweep through our code base for all this!

Contributor Author

I like the idea of a sweeping modernization for large complex strings.
And we could also do a sweeping change to identify candidates for class->record


/** Nightly variant of {@link RandomizedTaggerTest} that runs the full 10 iterations. */
@Nightly
@Repeat(iterations = 10)
Contributor

I thought I had seen a pattern where you would change the number of iterations based on some system property? Versus having separate Java files?

Contributor Author

Yea, that works some places, but this annotation requires a constant, so we need the subclass override hack. Tried to find a better way...
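
For contrast, the runtime pattern that "works some places" might look like the sketch below: the count is chosen inside the method body, where any expression is allowed. The class name and the `tests.nightly` system property lookup are assumptions made for illustration. As the PR description notes, a plain loop like this gives up the per-iteration seed, setup/teardown, and separate failure reporting that the annotation provides.

```java
final class IterationScalingSketch {
    /** Picks a loop count at runtime; fine in a method body, not in an annotation value. */
    static int iterations(boolean nightly, int regularCount, int nightlyCount) {
        return nightly ? nightlyCount : regularCount;
    }

    public static void main(String[] args) {
        // Assumption: the nightly flag is exposed as the system property "tests.nightly".
        boolean nightly = Boolean.getBoolean("tests.nightly");
        int n = iterations(nightly, 1, 10);
        for (int i = 0; i < n; i++) {
            // one round of the randomized check would go here
        }
        System.out.println("ran " + n + " iteration(s)");
    }
}
```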

import software.amazon.awssdk.regions.Region;

// Backups do checksum validation against a footer value not present in 'SimpleText'
@LuceneTestCase.Nightly
Contributor

So we can't do anything to make this faster instead?

3 participants