feat: implement regr_slope, regr_intercept, regr_r2, regr_sxx, regr_syy, regr_sxy aggregates by andygrove · Pull Request #4775 · apache/datafusion-comet

andygrove · 2026-06-30T18:13:30Z

Which issue does this PR close?

Closes #4552.

Rationale for this change

Comet already accelerates regr_avgx, regr_avgy, and regr_count (Spark rewrites those to Average/Count), but the remaining six SQL-standard linear regression aggregates fell back to Spark. This PR implements native support for regr_slope, regr_intercept, regr_r2, regr_sxx, regr_syy, and regr_sxy, so a query using any of them can run fully on Comet instead of falling back.

What changes are included in this PR?

A new native Regr aggregate UDF in native/spark-expr/src/agg_funcs/regr.rs. Rather than re-implementing the statistics, each function is composed from Comet's existing Spark-compatible CovarianceAccumulator and VarianceAccumulator. This keeps the partial aggregation state byte-compatible with the buffer layout Spark's planner declares for the partial to final shuffle:
- regr_sxx / regr_syy reach Comet as RegrReplacement (a CentralMomentAgg, 3-field buffer) and reuse the variance accumulator, evaluating to m2.
- regr_sxy (a population Covariance, 4-field buffer) reuses the covariance accumulator, evaluating to the co-moment ck.
- regr_r2 (a PearsonCorrelation, 6-field buffer) composes covariance + two variances.
- regr_slope / regr_intercept (a declarative composite, 7-field buffer) compose population covariance + variance.

regr_r2 matches Spark's behavior of returning 1.0 when the dependent variable is constant but the independent variable varies (a perfect horizontal fit). This is the one case where Spark and DataFusion's regr_r2 diverge (DataFusion returns null), so a Comet-specific accumulator is warranted.

- A new Regr protobuf message (with a RegrType enum) wired through QueryPlanSerde and the native PhysicalPlanner.
- Scala serde for RegrSlope, RegrIntercept, RegrR2, RegrSXY, and RegrReplacement (the rewrite target for regr_sxx/regr_syy).
- Updated the expression support status in the user guide.
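
To make the composition described above concrete, here is a minimal sketch of how the six results can be read off a single running co-moment state. It is not the code in regr.rs: the `PairMoments` struct, its field names, and its methods are hypothetical, and it keeps all moments in one struct instead of wrapping Comet's CovarianceAccumulator and VarianceAccumulator. The update rule is the standard one-pass (Welford-style) recurrence, and slope is taken as population covariance over population variance of x, as the description states.

```rust
/// Hypothetical running state for (y, x) pairs. Illustration only; this is
/// not the struct used in native/spark-expr/src/agg_funcs/regr.rs.
#[derive(Debug, Clone, Copy, Default)]
struct PairMoments {
    n: f64,      // number of pairs where both values are non-null
    mean_x: f64, // running mean of x
    mean_y: f64, // running mean of y
    ck: f64,     // co-moment: sum of (x - mean_x) * (y - mean_y)
    m2_x: f64,   // sum of squared deviations of x
    m2_y: f64,   // sum of squared deviations of y
}

impl PairMoments {
    /// Fold in one row. Rows with a NULL on either side are skipped.
    fn update(&mut self, y: Option<f64>, x: Option<f64>) {
        let (Some(y), Some(x)) = (y, x) else { return };
        self.n += 1.0;
        let dx = x - self.mean_x; // deviation from the old mean
        let dy = y - self.mean_y;
        self.mean_x += dx / self.n;
        self.mean_y += dy / self.n;
        self.ck += dx * (y - self.mean_y); // old x deviation times new y deviation
        self.m2_x += dx * (x - self.mean_x);
        self.m2_y += dy * (y - self.mean_y);
    }

    /// regr_sxx and regr_sxy evaluate to the raw moments.
    fn sxx(&self) -> Option<f64> {
        (self.n > 0.0).then_some(self.m2_x)
    }

    fn sxy(&self) -> Option<f64> {
        (self.n > 0.0).then_some(self.ck)
    }

    /// regr_slope = covar_pop(y, x) / var_pop(x); NULL when x is constant.
    fn slope(&self) -> Option<f64> {
        if self.n == 0.0 || self.m2_x == 0.0 {
            return None;
        }
        Some((self.ck / self.n) / (self.m2_x / self.n))
    }

    /// regr_intercept = avg(y) - slope * avg(x).
    fn intercept(&self) -> Option<f64> {
        self.slope().map(|b| self.mean_y - b * self.mean_x)
    }
}
```

Feeding the pairs (y, x) = (3, 1), (5, 2), (7, 3), which lie on y = 2x + 1, gives a slope of 2 and an intercept of 1, while a batch in which every x is identical yields NULL for both.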

This work was scaffolded using the implement-comet-expression project skill.

How are these changes tested?

Expanded the existing spark/src/test/resources/sql-tests/expressions/aggregate/regr.sql so all six functions are verified to run natively and match Spark (previously spark_answer_only). New cases cover NULL pairs, single-pair input, constant independent variable (slope/intercept/r2 to NULL), constant dependent variable (r2 to 1.0), grouped aggregation, and literal/column argument mixes.
Added Rust unit tests in regr.rs covering perfect-fit slope/intercept/r2, the constant-y and constant-x edge cases, single-pair and empty input, NULL-pair skipping, the raw moments, and partial-state merge across batches.
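
The partial-state merge that these tests exercise can be sketched as follows. This is an illustration rather than the test code in regr.rs: `VarState`, `merge`, and `state_of` are hypothetical names, and the three fields (count, mean, m2) are an assumed reading of the 3-field buffer that the description attributes to regr_sxx/regr_syy. The merge uses the standard pairwise combination of two partial moment states.

```rust
/// Hypothetical 3-field partial state (count, mean, m2). Illustration only.
#[derive(Debug, Clone, Copy, Default)]
struct VarState {
    n: f64,
    mean: f64,
    m2: f64,
}

impl VarState {
    /// One-pass update for a single non-null value.
    fn update(&mut self, v: f64) {
        self.n += 1.0;
        let d = v - self.mean;
        self.mean += d / self.n;
        self.m2 += d * (v - self.mean);
    }

    /// Combine another partial state into this one, as the final stage
    /// does after the shuffle.
    fn merge(&mut self, other: &VarState) {
        if other.n == 0.0 {
            return; // nothing to add
        }
        if self.n == 0.0 {
            *self = *other; // adopt the other side unchanged
            return;
        }
        let n = self.n + other.n;
        let d = other.mean - self.mean;
        self.m2 += other.m2 + d * d * self.n * other.n / n;
        self.mean += d * other.n / n;
        self.n = n;
    }
}

/// Build a state from one batch of values.
fn state_of(values: &[f64]) -> VarState {
    let mut s = VarState::default();
    for &v in values {
        s.update(v);
    }
    s
}
```

Merging the state of [1, 2] with the state of [3, 4, 5] should agree, up to floating-point rounding, with a single pass over [1, 2, 3, 4, 5], whose m2 (the regr_sxx value) is 10.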

…yy, regr_sxy aggregates

Add native support for the six simple linear regression aggregates that previously fell back to Spark. regr_avgx, regr_avgy and regr_count were already accelerated because Spark rewrites them to Average/Count.

The native accumulators are composed from Comet's existing Spark-compatible covariance and variance accumulators so the partial aggregation state matches the buffer layout Spark's planner expects between partial and final stages: RegrReplacement (regr_sxx/regr_syy) -> 3 fields, Covariance (regr_sxy) -> 4, PearsonCorrelation (regr_r2) -> 6, and the slope/intercept composite -> 7.

regr_r2 matches Spark's behavior of returning 1.0 when the dependent variable is constant but the independent variable varies (a perfect horizontal fit), which differs from DataFusion's regr_r2.

# Conflicts:
#	native/core/src/execution/planner.rs
#	native/proto/src/proto/expr.proto
#	spark/src/main/scala/org/apache/comet/serde/QueryPlanSerde.scala
#	spark/src/main/scala/org/apache/comet/serde/aggregates.scala

The regr aggregates diverged from Spark in three ways that surfaced as CI failures across Spark versions:
- regr_r2 degenerate cases were inverted. Spark 3.4/3.5/4.0 return null when the dependent variable is constant and 1.0 when the independent variable is constant; Comet had these swapped.
- Spark 4.1 swapped that degenerate handling again (constant dependent -> 1.0, constant independent -> null). Route the behaviour through a new r2_constant_dependent_is_perfect_fit proto flag set from isSpark41Plus.
- regr_slope/regr_intercept compute VariancePop(x) over both-non-null pairs on Spark 3.5+, but over every x-non-null row on Spark 3.4. Route this through a new filter_var_by_pair_nulls proto flag set from isSpark35Plus.

Also evaluate regr_r2 as corr = ck / sqrt(m2_y * m2_x); corr * corr to mirror Spark's exact float rounding, so the golden-file postgres aggregates tests match bit-for-bit.
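
A rough sketch of the two version switches described in this commit message follows. The flag names come from the message, but the function names and signatures, the (y, x) tuple order, and the order of the zero checks (in particular what happens when both variables are constant) are assumptions made for illustration and are not taken from regr.rs.

```rust
/// Hypothetical evaluation of regr_r2 from the final moments.
/// `constant_dependent_is_perfect_fit` stands in for the
/// r2_constant_dependent_is_perfect_fit proto flag (true on Spark 4.1+).
fn regr_r2(
    n: f64,
    ck: f64,
    m2_x: f64,
    m2_y: f64,
    constant_dependent_is_perfect_fit: bool,
) -> Option<f64> {
    if n == 0.0 {
        return None;
    }
    // Pick which zero moment yields NULL and which yields 1.0.
    // Assumption: the NULL check runs first, so both-constant gives NULL.
    let (null_when_zero, one_when_zero) = if constant_dependent_is_perfect_fit {
        (m2_x, m2_y) // Spark 4.1: constant x -> NULL, constant y -> 1.0
    } else {
        (m2_y, m2_x) // Spark 3.4/3.5/4.0: constant y -> NULL, constant x -> 1.0
    };
    if null_when_zero == 0.0 {
        return None;
    }
    if one_when_zero == 0.0 {
        return Some(1.0);
    }
    // Square the correlation rather than computing ck^2 / (m2_x * m2_y),
    // so the rounding follows the order given in the commit message.
    let corr = ck / (m2_y * m2_x).sqrt();
    Some(corr * corr)
}

/// Hypothetical population variance of x over rows of (y, x).
/// `filter_var_by_pair_nulls` stands in for the proto flag of the same name:
/// true keeps only rows where both sides are non-null (Spark 3.5+),
/// false keeps every row with a non-null x (Spark 3.4).
fn var_pop_x(rows: &[(Option<f64>, Option<f64>)], filter_var_by_pair_nulls: bool) -> Option<f64> {
    let xs: Vec<f64> = rows
        .iter()
        .filter_map(|&(y, x)| match (y, x) {
            (Some(_), Some(x)) => Some(x),
            (None, Some(x)) if !filter_var_by_pair_nulls => Some(x),
            _ => None,
        })
        .collect();
    if xs.is_empty() {
        return None;
    }
    let n = xs.len() as f64;
    let mean = xs.iter().sum::<f64>() / n;
    Some(xs.iter().map(|x| (x - mean).powi(2)).sum::<f64>() / n)
}
```

With a constant dependent variable (m2_y equal to zero, m2_x non-zero), `regr_r2` returns 1.0 when the flag is set and NULL otherwise, which is the version split the message describes.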

andygrove added 2 commits June 30, 2026 12:12

Merge remote-tracking branch 'apache/main' into feat-regr-aggregates

8b4cfd2


andygrove marked this pull request as draft July 1, 2026 20:51

andygrove added this to the 1.0.0 milestone Jul 3, 2026

andygrove marked this pull request as ready for review July 3, 2026 22:15

andygrove wants to merge 3 commits into apache:main from andygrove:feat-regr-aggregates
