Enable mixed approx_count_distinct aggregation #4823

Draft
michaelmitchell-bit wants to merge 2 commits into apache:main from michaelmitchell-bit:codex/approx-count-distinct-mixed
michaelmitchell-bit (Contributor) commented Jul 4, 2026

Important

Draft/stacked on #4819. Please do not review this independently until #4819 lands; after that I will rebase/update this PR so the reviewer diff is only the mixed partial/final follow-up.

Which issue does this PR close?

Closes #4820.

Depends on #4819.

Rationale for this change

After #4819, approx_count_distinct stores HyperLogLogPlusPlus buffers in the same packed Long layout as Spark. That makes the partial/final aggregate state compatible across Spark and Comet, so Comet can safely enable mixed partial/final execution for this aggregate.
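To illustrate why a shared buffer layout is what makes mixed execution safe, here is a minimal sketch of merging two packed partial states by taking the per-register maximum, which is the standard HyperLogLog merge. The function name `merge_packed` and the constant names are hypothetical and are not taken from the Comet or Spark sources; the six-bit, ten-per-word layout is assumed from the description of #4819.

```rust
// Assumed layout: 6-bit registers, 10 per 64-bit word (top 4 bits unused).
const REGISTER_BITS: u32 = 6;
const REGISTERS_PER_WORD: usize = 10;
const REGISTER_MASK: u64 = (1 << REGISTER_BITS) - 1;

/// Merge `other` into `acc` by keeping the larger value of every register.
/// Both buffers must use the same precision, hence the same word count.
pub fn merge_packed(acc: &mut [i64], other: &[i64]) {
    assert_eq!(acc.len(), other.len(), "buffers must have the same word count");
    for (a, b) in acc.iter_mut().zip(other.iter()) {
        // Work on the raw bit pattern so shifts are logical, not arithmetic.
        let (left, right) = (*a as u64, *b as u64);
        let mut merged: u64 = 0;
        for slot in 0..REGISTERS_PER_WORD {
            let shift = REGISTER_BITS * slot as u32;
            let l = (left >> shift) & REGISTER_MASK;
            let r = (right >> shift) & REGISTER_MASK;
            merged |= l.max(r) << shift;
        }
        *a = merged as i64;
    }
}
```

Because the merge only reads register values at fixed bit positions, it gives the same result no matter which engine produced each partial buffer, provided both sides agree on the layout and precision.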

What changes are included in this PR?

  • Enable mixed partial/final execution for CometApproxCountDistinct.
  • Add planner coverage for Comet partial with Spark final execution.
  • Add planner coverage for Spark partial with Comet final execution.
  • Add runtime PartialMerge coverage for count(DISTINCT ...) + approx_count_distinct(...) with Comet partial/merge stages feeding a Spark final stage.
  • Add SQL-file coverage for approx_count_distinct through Spark's PartialMerge aggregate path.

How are these changes tested?

  • make core
  • ./mvnw test -Dtest=none -Dsuites="org.apache.comet.exec.CometAggregateSuite approx_count_distinct partial merge" -Dscalastyle.skip=true
  • ./mvnw test -Dtest=none -Dsuites="org.apache.comet.rules.CometExecRuleSuite approx_count_distinct" -Dscalastyle.skip=true
  • ./mvnw test -Dtest=none -Dsuites="org.apache.comet.CometSqlFileTestSuite partial_merge" -Dscalastyle.skip=true

Add native support for Spark's approx_count_distinct, a faithful port of
Spark's HyperLogLogPlusPlus / HyperLogLogPlusPlusHelper.

Each non-null input is hashed with Comet's Spark-compatible XxHash64
(seed 42, floats normalized first), and the HyperLogLog++ registers are
stored in Spark's exact packed-Long buffer layout (10 six-bit registers
per word). The cardinality is estimated with the same linear-counting and
bias-correction tables Spark uses, so results are bit-identical to Spark
and the partial-aggregation state matches Spark's aggBufferSchema.
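The packed layout described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in this PR: the names `num_words`, `get_register`, and `update_with_hash` are hypothetical, the hash is passed in already computed (the XxHash64 step is omitted), and the index/rank derivation follows the usual HyperLogLog++ scheme with precision `p`, which is assumed here to match Spark's helper.

```rust
// Assumed layout: 6-bit registers, 10 per 64-bit word (top 4 bits unused).
const REGISTER_BITS: u32 = 6;
const REGISTERS_PER_WORD: usize = 10;
const REGISTER_MASK: u64 = (1 << REGISTER_BITS) - 1;

/// Words needed for 2^p registers (assumption: one extra word covers the remainder).
pub fn num_words(p: u32) -> usize {
    (1usize << p) / REGISTERS_PER_WORD + 1
}

/// Read register `idx` out of the packed buffer.
pub fn get_register(words: &[i64], idx: usize) -> u64 {
    let word = words[idx / REGISTERS_PER_WORD] as u64;
    let shift = REGISTER_BITS * (idx % REGISTERS_PER_WORD) as u32;
    (word >> shift) & REGISTER_MASK
}

/// Fold one 64-bit hash into the buffer: the top `p` bits pick the register,
/// and the rank is the position of the first set bit in the remaining bits.
pub fn update_with_hash(words: &mut [i64], p: u32, hash: u64) {
    let idx = (hash >> (64 - p)) as usize;
    // Padding bit bounds the rank when all remaining hash bits are zero.
    let padding = 1u64 << (p - 1);
    let rank = (((hash << p) | padding).leading_zeros() + 1) as u64;

    let word_offset = idx / REGISTERS_PER_WORD;
    let shift = REGISTER_BITS * (idx % REGISTERS_PER_WORD) as u32;
    let word = words[word_offset] as u64;
    let current = (word >> shift) & REGISTER_MASK;
    if rank > current {
        // Clear the old 6-bit value, then write the new rank in place.
        let cleared = word & !(REGISTER_MASK << shift);
        words[word_offset] = (cleared | (rank << shift)) as i64;
    }
}
```

With `p = 9`, for example, this sketch allocates 52 words for 512 registers, and the largest possible rank is 56, which fits in six bits.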

Includes a vectorized GroupsAccumulator, SQL file tests comparing against
Spark across a range of cardinalities, native unit tests, benchmark
coverage, and documentation updates.

michaelmitchell-bit force-pushed the codex/approx-count-distinct-mixed branch from ac57fad to 2d75c61 on July 4, 2026 06:06
Development

Successfully merging this pull request may close these issues.

[Enhancement] Enable mixed partial/final execution for approx_count_distinct (HyperLogLogPlusPlus)