Skip to content

feat: support kurtosis aggregate#4818

Open
andygrove wants to merge 1 commit into
apache:mainfrom
andygrove:feat/kurtosis-native-support
Open

feat: support kurtosis aggregate#4818
andygrove wants to merge 1 commit into
apache:mainfrom
andygrove:feat/kurtosis-native-support

Conversation

@andygrove

Copy link
Copy Markdown
Member

Which issue does this PR close?

Closes #.

Rationale for this change

kurtosis is a standard SQL statistical aggregate and one of the last remaining CentralMomentAgg siblings that Comet didn't run natively (variance and stddev are already supported). Adding it lets queries using kurtosis stay in Comet's native path instead of falling back to Spark for the whole aggregate.

What changes are included in this PR?

  • Adds a Comet-owned Kurtosis UDAF (native/spark-expr/src/agg_funcs/kurtosis.rs) with a per-row KurtosisAccumulator. Intermediate state is [n, avg, m2, m3, m4] Float64, mirroring Spark's CentralMomentAgg (momentOrder = 4) wire format so a Spark-produced Partial and a Comet-produced Final can share bytes without conversion. Update and merge kernels are direct ports of Spark's updateExpressionsDef and mergeExpressions (Meng 2015 recurrence).
  • Threads nullOnDivideByZero through the proto so spark.sql.legacy.statisticalAggregate behaves the same on both engines: the default returns NULL when m2 == 0, legacy mode returns NaN.
  • Adds a Kurtosis protobuf message and wires it through the native planner.
  • Adds CometKurtosis in spark/src/main/scala/org/apache/comet/serde/aggregates.scala and registers it in QueryPlanSerde.aggrSerdeMap. supportsMixedPartialFinal is left at false to match the conservative policy already used by Variance and Stddev in the same file.
  • Documents the audit in docs/source/contributor-guide/expression-audits/agg_funcs.md; flips the support status in docs/source/user-guide/latest/expressions.md from planned to supported.
  • Window use (kurtosis(x) OVER (...)) still falls back because the Comet window path doesn't wire kurtosis today. Captured as an expect_fallback case rather than left as an implicit gap.

Scaffolding produced by the implement-comet-expression skill; the audit-comet-expression skill drove the audit and produced the extra fallback and coverage-gap tests.

How are these changes tested?

  • Rust unit tests for KurtosisAccumulator covering empty group, single-value divide-by-zero in both nullOnDivideByZero modes, both of Spark's own ExpressionDescription examples (-0.7014368047529627 and 0.19432323191699075), and a two-partition state merge that reproduces the single-batch result.
  • New Comet SQL test at spark/src/test/resources/sql-tests/expressions/aggregate/kurtosis.sql running under ConfigMatrix: parquet.enable.dictionary=false,true. Covers Spark's documented examples, GROUP BY, global aggregate, empty table, integer/decimal/float/bigint inputs, literal argument, FILTER (WHERE ...), mixed with other aggregates, and expect_fallback cases for skewness and window use of kurtosis. Numerical-stress cases (NaN, Infinity, -Infinity, 1e15 magnitudes) use spark_answer_only mode.
  • Separate SQL file kurtosis_legacy.sql gated with Config: spark.sql.legacy.statisticalAggregate=true exercises the NaN path for single-value and all-equal groups.
  • ./mvnw test -Dsuites="org.apache.comet.CometSqlFileTestSuite kurtosis" -Dtest=none passes under both the default (Spark 3.5) and -Pspark-4.0 profiles.
  • cd native && cargo clippy --all-targets --workspace -- -D warnings passes.

Adds a Comet-owned `Kurtosis` UDAF and `CometKurtosis` serde so that
Spark's excess-kurtosis aggregate runs natively. The native accumulator
stores `[n, avg, m2, m3, m4]` `Float64` state to mirror Spark's
`CentralMomentAgg` (momentOrder = 4) buffer, and the update/merge
kernels are a direct port of Spark's `updateExpressionsDef` /
`mergeExpressions` expressions. `nullOnDivideByZero` is threaded
through so `spark.sql.legacy.statisticalAggregate` both branches
behave the same as Spark (NULL vs NaN when `m2 == 0`).

Window use falls back to Spark today: the window path doesn't wire the
Comet aggregate for kurtosis. Captured as an `expect_fallback` case in
the SQL test.

Scaffolding produced by the `implement-comet-expression` skill; audit
performed by the `audit-comet-expression` skill.
@andygrove andygrove added this to the 1.0.0 milestone Jul 3, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant