Enable mixed approx_count_distinct aggregation #4823
Draft
michaelmitchell-bit wants to merge 2 commits into
Add native support for Spark's approx_count_distinct, a faithful port of Spark's HyperLogLogPlusPlus / HyperLogLogPlusPlusHelper. Each non-null input is hashed with Comet's Spark-compatible XxHash64 (seed 42, floats normalized first), and the HyperLogLog++ registers are stored in Spark's exact packed-Long buffer layout (10 six-bit registers per word). The cardinality is estimated with the same linear-counting and bias-correction tables Spark uses, so results are bit-identical to Spark and the partial-aggregation state matches Spark's aggBufferSchema. Includes a vectorized GroupsAccumulator, SQL file tests comparing against Spark across a range of cardinalities, native unit tests, benchmark coverage, and documentation updates.
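The packed layout described above (10 six-bit registers per 64-bit word) can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: the function names `num_words`, `get_register`, and `update_register` are hypothetical, and the word count is computed as a simple ceiling division, which is an assumption about how the buffer is sized.

```rust
// Sketch of a packed-register buffer: ten 6-bit registers per 64-bit word.
// All names here are illustrative assumptions, not Comet's or Spark's API.
const REGISTER_SIZE: usize = 6;
const REGISTERS_PER_WORD: usize = 10;
const REGISTER_MASK: i64 = (1 << REGISTER_SIZE) - 1;

/// Number of 64-bit words needed to hold `num_registers` registers
/// (ceiling division; assumed sizing rule for this sketch).
fn num_words(num_registers: usize) -> usize {
    (num_registers + REGISTERS_PER_WORD - 1) / REGISTERS_PER_WORD
}

/// Read the 6-bit register at logical index `idx`.
fn get_register(words: &[i64], idx: usize) -> i64 {
    let shift = REGISTER_SIZE * (idx % REGISTERS_PER_WORD);
    (words[idx / REGISTERS_PER_WORD] >> shift) & REGISTER_MASK
}

/// Store `value` in register `idx` only if it exceeds the current value,
/// which is the usual HyperLogLog "keep the maximum" update.
fn update_register(words: &mut [i64], idx: usize, value: i64) {
    debug_assert!((0..=REGISTER_MASK).contains(&value));
    let word_idx = idx / REGISTERS_PER_WORD;
    let shift = REGISTER_SIZE * (idx % REGISTERS_PER_WORD);
    let current = (words[word_idx] >> shift) & REGISTER_MASK;
    if value > current {
        // Clear the old six bits, then write the new value in their place.
        words[word_idx] = (words[word_idx] & !(REGISTER_MASK << shift)) | (value << shift);
    }
}
```

Because each word uses only 60 of its 64 bits, the top four bits stay zero and the signed shift in `get_register` never sees a negative word.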
Important
Draft/stacked on #4819. Please do not review this independently until #4819 lands; after that I will rebase/update this PR so the reviewer diff is only the mixed partial/final follow-up.
Which issue does this PR close?
Closes #4820.
Depends on #4819.
Rationale for this change
After #4819, `approx_count_distinct` stores HyperLogLogPlusPlus buffers in the same packed Long layout as Spark. That makes the partial/final aggregate state compatible across Spark and Comet, so Comet can safely enable mixed partial/final execution for this aggregate.
What changes are included in this PR?
- `CometApproxCountDistinct`.
- `count(DISTINCT ...)` + `approx_count_distinct(...)` with Comet partial/merge stages feeding a Spark final stage.
- `approx_count_distinct` through Spark's PartialMerge aggregate path.
How are these changes tested?
- `make core`
- `./mvnw test -Dtest=none -Dsuites="org.apache.comet.exec.CometAggregateSuite approx_count_distinct partial merge" -Dscalastyle.skip=true`
- `./mvnw test -Dtest=none -Dsuites="org.apache.comet.rules.CometExecRuleSuite approx_count_distinct" -Dscalastyle.skip=true`
- `./mvnw test -Dtest=none -Dsuites="org.apache.comet.CometSqlFileTestSuite partial_merge" -Dscalastyle.skip=true`
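To illustrate why a shared buffer layout matters for the mixed partial/final case in the rationale, here is a rough sketch of merging two packed HyperLogLog++ buffers by taking the register-wise maximum. It assumes both sides use ten 6-bit registers per word; `merge_words` and `merge_buffers` are hypothetical names for this example, not functions from Comet or Spark.

```rust
// Sketch: merge two packed register buffers with a register-wise maximum.
// This only works if both producers agree on the exact bit layout.
const REGISTER_SIZE: usize = 6;
const REGISTERS_PER_WORD: usize = 10;
const REGISTER_MASK: i64 = (1 << REGISTER_SIZE) - 1;

/// Merge two words by keeping the larger value in each 6-bit slot.
fn merge_words(left: i64, right: i64) -> i64 {
    let mut merged = 0i64;
    for slot in 0..REGISTERS_PER_WORD {
        let shift = REGISTER_SIZE * slot;
        let l = (left >> shift) & REGISTER_MASK;
        let r = (right >> shift) & REGISTER_MASK;
        merged |= l.max(r) << shift;
    }
    merged
}

/// Merge `other` into `target` in place, word by word.
fn merge_buffers(target: &mut [i64], other: &[i64]) {
    assert_eq!(target.len(), other.len(), "buffers must have the same word count");
    for (t, o) in target.iter_mut().zip(other) {
        *t = merge_words(*t, *o);
    }
}
```

If one engine packed registers differently, the slot boundaries would not line up and the maximum would be taken over unrelated bits, which is why layout compatibility is a precondition for letting one engine's partial stage feed another engine's final stage.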