Skip to content

feat: support listagg / string_agg aggregate (Spark 4.0+)#4816

Open
andygrove wants to merge 2 commits into
apache:mainfrom
andygrove:feat/listagg-native-support
Open

feat: support listagg / string_agg aggregate (Spark 4.0+)#4816
andygrove wants to merge 2 commits into
apache:mainfrom
andygrove:feat/listagg-native-support

Conversation

@andygrove

Copy link
Copy Markdown
Member

Which issue does this PR close?

Closes #.

Rationale for this change

Spark 4.0 added the LISTAGG / STRING_AGG aggregate to concatenate string values in a group. Comet previously fell back to Spark for the entire aggregate. This PR adds native support for the common shape so groups that use listagg on a StringType column with a literal delimiter can stay in Comet's native execution path.

What changes are included in this PR?

  • Adds a Comet-owned SparkListAgg UDAF (native/spark-expr/src/agg_funcs/list_agg.rs) that skips nulls, returns Utf8, and — critically — keeps its intermediate state as Binary to match Spark's TypedImperativeAggregate buffer schema. Emitting Utf8 state would force the Comet shuffle layer to insert a Utf8Binary cast that the merge side then cannot read back.
  • Adds the CometListAgg serde in spark/src/main/spark-4.x/ and wires it into QueryPlanSerde.aggrSerdeMap through a new sparkVersionSpecificAggregates shim slot (empty for Spark 3.x, provides ListAgg on Spark 4.x). This keeps the base map compiling against Spark 3.x where ListAgg does not exist.
  • Adds a new ListAgg protobuf message and wires it through the native planner.
  • Restricts native execution to the simple form. WITHIN GROUP (ORDER BY ...), BinaryType inputs, non-literal delimiters, and non-default collations fall back to Spark. DISTINCT falls back naturally because Comet's aggregate dispatcher rejects multi-column distinct aggregates.
  • Documents the audit findings in docs/source/contributor-guide/expression-audits/agg_funcs.md and updates the support status in docs/source/user-guide/latest/expressions.md.

Scaffolding produced by the implement-comet-expression skill; the audit-comet-expression skill drove the fallback and coverage-gap tests.

How are these changes tested?

  • New Rust unit tests for ListAggAccumulator covering delimiter handling, null-input skipping, empty groups, and multi-partition state merge (native/spark-expr/src/agg_funcs/list_agg.rs).
  • New Comet SQL test at spark/src/test/resources/sql-tests/expressions/aggregate/listagg.sql gated to Spark 4.0+ (MinSparkVersion: 4.0), running under ConfigMatrix: parquet.enable.dictionary=false,true to exercise both dictionary-encoded and plain-encoded reads. It covers:
    • Native cases: literal delimiter, string_agg alias, default (empty) delimiter, all-NULL group, empty table, global aggregate, multi-char and empty-string delimiters, empty-string values inside a group, multibyte UTF-8 values (composed accents, CJK, emoji).
    • expect_fallback cases: DISTINCT, WITHIN GROUP (ORDER BY ...) ascending and descending, BinaryType child with default and literal binary delimiter.
  • ./mvnw test -Pspark-4.0 -Dsuites="org.apache.comet.CometSqlFileTestSuite listagg" -Dtest=none passes for both dictionary-encoding modes.
  • cd native && cargo clippy --all-targets --workspace -- -D warnings passes.

Adds a `SparkListAgg` UDAF and a `CometListAgg` serde so that Comet can
natively execute the simple form of Spark 4.0's `LISTAGG(child, delimiter)`
/ `string_agg` on `StringType` inputs with a literal delimiter. `WITHIN
GROUP (ORDER BY ...)`, `BinaryType` inputs, non-literal delimiters, and
non-default collations fall back to Spark. `DISTINCT` falls back because
Comet already rejects multi-column distinct aggregates.

The native accumulator returns `Utf8` but keeps its intermediate state as
`Binary` to match Spark's `TypedImperativeAggregate` buffer schema so the
Comet shuffle layer does not insert a `Utf8` → `Binary` cast the merge
side cannot read back.

Scaffolding produced by the `implement-comet-expression` skill.
@andygrove andygrove added this to the 1.0.0 milestone Jul 3, 2026
Consolidate the reason strings into `private val`s so `getSupportLevel`
and `getUnsupportedReasons` reference the same source of truth, and
replace the manual `case _: Literal` delimiter check with the standard
`.foldable` gate used elsewhere (e.g. `CometPercentile`).
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant