feat: support listagg / string_agg aggregate (Spark 4.0+) by andygrove · Pull Request #4816 · apache/datafusion-comet

andygrove · 2026-07-03T15:20:14Z

Which issue does this PR close?

Closes #.

Rationale for this change

Spark 4.0 added the LISTAGG / STRING_AGG aggregate to concatenate string values in a group. Comet previously fell back to Spark for the entire aggregate. This PR adds native support for the common shape so groups that use listagg on a StringType column with a literal delimiter can stay in Comet's native execution path.

What changes are included in this PR?

Adds a Comet-owned SparkListAgg UDAF (native/spark-expr/src/agg_funcs/list_agg.rs) that skips nulls, returns Utf8, and — critically — keeps its intermediate state as Binary to match Spark's TypedImperativeAggregate buffer schema. Emitting Utf8 state would force the Comet shuffle layer to insert a Utf8 → Binary cast that the merge side then cannot read back.
Adds the CometListAgg serde in spark/src/main/spark-4.x/ and wires it into QueryPlanSerde.aggrSerdeMap through a new sparkVersionSpecificAggregates shim slot (empty for Spark 3.x, provides ListAgg on Spark 4.x). This keeps the base map compiling against Spark 3.x where ListAgg does not exist.
Adds a new ListAgg protobuf message and wires it through the native planner.
Restricts native execution to the simple form. WITHIN GROUP (ORDER BY ...), BinaryType inputs, non-literal delimiters, and non-default collations fall back to Spark. DISTINCT falls back naturally because Comet's aggregate dispatcher rejects multi-column distinct aggregates.
Documents the audit findings in docs/source/contributor-guide/expression-audits/agg_funcs.md and updates the support status in docs/source/user-guide/latest/expressions.md.

Scaffolding produced by the implement-comet-expression skill; the audit-comet-expression skill drove the fallback and coverage-gap tests.

How are these changes tested?

New Rust unit tests for ListAggAccumulator covering delimiter handling, null-input skipping, empty groups, and multi-partition state merge (native/spark-expr/src/agg_funcs/list_agg.rs).
New Comet SQL test at spark/src/test/resources/sql-tests/expressions/aggregate/listagg.sql gated to Spark 4.0+ (MinSparkVersion: 4.0), running under ConfigMatrix: parquet.enable.dictionary=false,true to exercise both dictionary-encoded and plain-encoded reads. It covers:
- Native cases: literal delimiter, string_agg alias, default (empty) delimiter, all-NULL group, empty table, global aggregate, multi-char and empty-string delimiters, empty-string values inside a group, multibyte UTF-8 values (composed accents, CJK, emoji).
- expect_fallback cases: DISTINCT, WITHIN GROUP (ORDER BY ...) ascending and descending, BinaryType child with default and literal binary delimiter.
./mvnw test -Pspark-4.0 -Dsuites="org.apache.comet.CometSqlFileTestSuite listagg" -Dtest=none passes for both dictionary-encoding modes.
cd native && cargo clippy --all-targets --workspace -- -D warnings passes.

Adds a `SparkListAgg` UDAF and a `CometListAgg` serde so that Comet can natively execute the simple form of Spark 4.0's `LISTAGG(child, delimiter)` / `string_agg` on `StringType` inputs with a literal delimiter. `WITHIN GROUP (ORDER BY ...)`, `BinaryType` inputs, non-literal delimiters, and non-default collations fall back to Spark. `DISTINCT` falls back because Comet already rejects multi-column distinct aggregates. The native accumulator returns `Utf8` but keeps its intermediate state as `Binary` to match Spark's `TypedImperativeAggregate` buffer schema so the Comet shuffle layer does not insert a `Utf8` → `Binary` cast the merge side cannot read back. Scaffolding produced by the `implement-comet-expression` skill.

Consolidate the reason strings into `private val`s so `getSupportLevel` and `getUnsupportedReasons` reference the same source of truth, and replace the manual `case _: Literal` delimiter check with the standard `.foldable` gate used elsewhere (e.g. `CometPercentile`).

andygrove added this to the 1.0.0 milestone Jul 3, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

feat: support listagg / string_agg aggregate (Spark 4.0+)#4816

feat: support listagg / string_agg aggregate (Spark 4.0+)#4816
andygrove wants to merge 2 commits into
apache:mainfrom
andygrove:feat/listagg-native-support

andygrove commented Jul 3, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Uh oh!

Conversation

andygrove commented Jul 3, 2026

Which issue does this PR close?

Rationale for this change

What changes are included in this PR?

How are these changes tested?

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant