Background
The initial approx_count_distinct support (PR #4819) stores its HyperLogLog++ registers in Spark's exact packed-Long buffer layout (numWords Long columns, 10 six-bit registers per word), so Comet's partial-aggregation state is byte-identical to Spark's HyperLogLogPlusPlus.aggBufferSchema.
CometApproxCountDistinct currently leaves supportsMixedPartialFinal at the default false, which means that whenever a plan has an approx_count_distinct at a Comet/Spark boundary, allAggsSupportMixedExecution forces both the partial and final aggregate onto the same engine.
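To make the buffer layout concrete, here is a minimal Rust sketch of the packed representation described above and of the per-register maximum that a merge has to compute. It is illustrative only: the function names (`num_words`, `get_register`, `set_register`, `merge`) are hypothetical and are not taken from Comet or Spark, and the sketch assumes registers are packed from the low bits of each word upward and that the word count is the ceiling of `m / 10`.

```rust
// Illustrative constants for the layout described above:
// six-bit registers, ten per 64-bit word (the top four bits stay unused).
const REGISTER_SIZE: u32 = 6;
const REGISTERS_PER_WORD: usize = 10;
const REGISTER_MASK: u64 = (1 << REGISTER_SIZE) - 1;

/// Number of 64-bit words needed to hold `m` registers.
/// Assumption: ceiling division; the real helper may compute this differently.
fn num_words(m: usize) -> usize {
    (m + REGISTERS_PER_WORD - 1) / REGISTERS_PER_WORD
}

/// Read register `idx` from the packed words.
/// Assumption: register 0 occupies the lowest six bits of word 0.
fn get_register(words: &[u64], idx: usize) -> u8 {
    let shift = REGISTER_SIZE * (idx % REGISTERS_PER_WORD) as u32;
    ((words[idx / REGISTERS_PER_WORD] >> shift) & REGISTER_MASK) as u8
}

/// Overwrite register `idx` with `value` (truncated to six bits).
fn set_register(words: &mut [u64], idx: usize, value: u8) {
    let shift = REGISTER_SIZE * (idx % REGISTERS_PER_WORD) as u32;
    let mask = REGISTER_MASK << shift;
    let word = &mut words[idx / REGISTERS_PER_WORD];
    *word = (*word & !mask) | ((value as u64 & REGISTER_MASK) << shift);
}

/// Merge `src` into `dst` by taking the maximum of each register pair,
/// which is how two HyperLogLog sketches are combined.
fn merge(dst: &mut [u64], src: &[u64]) {
    assert_eq!(dst.len(), src.len(), "buffers must have the same word count");
    for (d, s) in dst.iter_mut().zip(src) {
        let mut merged = 0u64;
        for slot in 0..REGISTERS_PER_WORD {
            let shift = REGISTER_SIZE * slot as u32;
            let x = (*d >> shift) & REGISTER_MASK;
            let y = (*s >> shift) & REGISTER_MASK;
            merged |= x.max(y) << shift;
        }
        *d = merged;
    }
}
```

The sketch uses `u64` for simplicity. Spark's columns are signed Long values, but since the top four bits of each word are unused in this layout, the bit patterns are the same either way. Mixed execution only works if both engines agree on every detail of this packing, which is why the interop test matters.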
Proposal
Because the intermediate buffer format now matches Spark exactly, supportsMixedPartialFinal should be safe to set to true (as CometMin, CometMax, and the bitwise aggregates already do). This would let Comet accelerate the partial aggregate even when the final falls back to Spark (and vice versa), broadening native coverage.
Work required
- Set supportsMixedPartialFinal = true in CometApproxCountDistinct.
- Add a partial-merge test (see spark/src/test/resources/sql-tests/expressions/aggregate/partial_merge.sql) that exercises Comet-partial + Spark-final and Spark-partial + Comet-final, confirming the result stays bit-identical to Spark.
This was deferred from PR #4819 because it needs interop verification rather than being a trivial one-line change.