Enable mixed approx_count_distinct aggregation by michaelmitchell-bit · Pull Request #4823 · apache/datafusion-comet

michaelmitchell-bit · 2026-07-04T05:45:56Z

Important

Draft/stacked on #4819. Please do not review this independently until #4819 lands; after that I will rebase/update this PR so the reviewer diff is only the mixed partial/final follow-up.

Which issue does this PR close?

Closes #4820.

Depends on #4819.

Rationale for this change

After #4819, approx_count_distinct stores HyperLogLogPlusPlus buffers in the same packed Long layout as Spark. That makes the partial/final aggregate state compatible across Spark and Comet, so Comet can safely enable mixed partial/final execution for this aggregate.
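Why a shared layout is enough for mixed execution can be sketched as follows. HyperLogLog-style states merge by taking the register-wise maximum, so a final stage only needs to be able to read the registers the partial stage wrote. The Rust below is a minimal illustration, not the code in this PR: the 6-bit / 10-per-word constants come from the layout described for #4819, the buffer is modeled as `u64` words (Spark stores signed Longs with the same bit pattern), and `merge_word` and `merge_buffers` are hypothetical helper names.

```rust
// Assumed layout constants: 6-bit registers, 10 registers per 64-bit word
// (the top 4 bits of each word are unused).
const REGISTER_SIZE: u32 = 6;
const REGISTERS_PER_WORD: u32 = 10;
const REGISTER_MASK: u64 = (1 << REGISTER_SIZE) - 1;

/// Merge one packed word from `other` into `acc`, keeping the larger
/// value of each of the 10 registers.
fn merge_word(acc: u64, other: u64) -> u64 {
    let mut merged = 0u64;
    for i in 0..REGISTERS_PER_WORD {
        let shift = REGISTER_SIZE * i;
        let a = (acc >> shift) & REGISTER_MASK;
        let b = (other >> shift) & REGISTER_MASK;
        merged |= a.max(b) << shift;
    }
    merged
}

/// Merge a whole partial state into an accumulator (hypothetical helper).
/// Both buffers must have been built with the same precision.
fn merge_buffers(acc: &mut [u64], other: &[u64]) {
    assert_eq!(acc.len(), other.len(), "buffers must have the same length");
    for (a, b) in acc.iter_mut().zip(other) {
        *a = merge_word(*a, *b);
    }
}
```

Because the merge depends only on the bit layout, it does not matter which engine produced either buffer, provided both use the same precision and hash function.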

What changes are included in this PR?

Enable mixed partial/final execution for CometApproxCountDistinct.
Add planner coverage for Comet partial with Spark final execution.
Add planner coverage for Spark partial with Comet final execution.
Add runtime PartialMerge coverage for count(DISTINCT ...) + approx_count_distinct(...) with Comet partial/merge stages feeding a Spark final stage.
Add SQL-file coverage for approx_count_distinct through Spark's PartialMerge aggregate path.

How are these changes tested?

make core
./mvnw test -Dtest=none -Dsuites="org.apache.comet.exec.CometAggregateSuite approx_count_distinct partial merge" -Dscalastyle.skip=true
./mvnw test -Dtest=none -Dsuites="org.apache.comet.rules.CometExecRuleSuite approx_count_distinct" -Dscalastyle.skip=true
./mvnw test -Dtest=none -Dsuites="org.apache.comet.CometSqlFileTestSuite partial_merge" -Dscalastyle.skip=true

Add native support for Spark's approx_count_distinct, a faithful port of Spark's HyperLogLogPlusPlus / HyperLogLogPlusPlusHelper. Each non-null input is hashed with Comet's Spark-compatible XxHash64 (seed 42, floats normalized first), and the HyperLogLog++ registers are stored in Spark's exact packed-Long buffer layout (10 six-bit registers per word). The cardinality is estimated with the same linear-counting and bias-correction tables Spark uses, so results are bit-identical to Spark and the partial-aggregation state matches Spark's aggBufferSchema. Includes a vectorized GroupsAccumulator, SQL file tests comparing against Spark across a range of cardinalities, native unit tests, benchmark coverage, and documentation updates.
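The packed register update described above can be sketched roughly as below. This is an illustrative reconstruction under stated assumptions, not the implementation from #4819: the function names are hypothetical, the hash is taken as an already-computed `u64`, the index/rank split follows the usual HyperLogLog++ scheme (top `p` bits select the register, the rank is the leading-zero count of the remaining bits plus one), and the buffer is modeled as `u64` words rather than Spark's signed Longs.

```rust
// Assumed layout constants: 6-bit registers, 10 registers per 64-bit word.
const REGISTER_SIZE: u32 = 6;
const REGISTERS_PER_WORD: usize = 10;
const REGISTER_MASK: u64 = (1 << REGISTER_SIZE) - 1;

/// Split a 64-bit hash into (register index, rank) for precision `p`.
/// The padding bit guarantees the rank never exceeds 64 - p + 1.
fn index_and_rank(hash: u64, p: u32) -> (usize, u64) {
    let idx = (hash >> (64 - p)) as usize;
    let w = (hash << p) | (1u64 << (p - 1));
    (idx, w.leading_zeros() as u64 + 1)
}

/// Read the 6-bit register at `idx` from the packed buffer.
fn get_register(buffer: &[u64], idx: usize) -> u64 {
    let shift = REGISTER_SIZE * (idx % REGISTERS_PER_WORD) as u32;
    (buffer[idx / REGISTERS_PER_WORD] >> shift) & REGISTER_MASK
}

/// Store `rank` in register `idx` only if it exceeds the current value.
fn update_register(buffer: &mut [u64], idx: usize, rank: u64) {
    let word = idx / REGISTERS_PER_WORD;
    let shift = REGISTER_SIZE * (idx % REGISTERS_PER_WORD) as u32;
    let mask = REGISTER_MASK << shift;
    if rank > (buffer[word] & mask) >> shift {
        buffer[word] = (buffer[word] & !mask) | (rank << shift);
    }
}
```

With `p = 9` there are 512 registers; the sketch's test sizes the buffer at 52 words (512 / 10 + 1), which is an assumption about how the word count is rounded.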

michaelmitchell-bit mentioned this pull request Jul 4, 2026

[Enhancement] Enable mixed partial/final execution for approx_count_distinct (HyperLogLogPlusPlus) #4820

Open

Enable mixed approx_count_distinct aggregation

2d75c61

michaelmitchell-bit force-pushed the codex/approx-count-distinct-mixed branch from ac57fad to 2d75c61 Compare July 4, 2026 06:06

michaelmitchell-bit wants to merge 2 commits into apache:main from michaelmitchell-bit:codex/approx-count-distinct-mixed

2 participants
