
feat: implement regr_slope, regr_intercept, regr_r2, regr_sxx, regr_syy, regr_sxy aggregates #4775

Open
andygrove wants to merge 3 commits into apache:main from andygrove:feat-regr-aggregates

Conversation

@andygrove

Member

Which issue does this PR close?

Closes #4552.

Rationale for this change

Comet already accelerates regr_avgx, regr_avgy, and regr_count (Spark rewrites those to Average/Count), but the remaining six SQL-standard linear regression aggregates fell back to Spark. This PR implements native support for regr_slope, regr_intercept, regr_r2, regr_sxx, regr_syy, and regr_sxy, so a query using any of them can run fully on Comet instead of falling back.

What changes are included in this PR?

  • A new native Regr aggregate UDF in native/spark-expr/src/agg_funcs/regr.rs. Rather than re-implementing the statistics, each function is composed from Comet's existing Spark-compatible CovarianceAccumulator and VarianceAccumulator. This keeps the partial aggregation state byte-compatible with the buffer layout Spark's planner declares for the partial to final shuffle:
    • regr_sxx / regr_syy reach Comet as RegrReplacement (a CentralMomentAgg, 3-field buffer) and reuse the variance accumulator, evaluating to m2.
    • regr_sxy (a population Covariance, 4-field buffer) reuses the covariance accumulator, evaluating to the co-moment ck.
    • regr_r2 (a PearsonCorrelation, 6-field buffer) composes covariance + two variances.
    • regr_slope / regr_intercept (a declarative composite, 7-field buffer) compose population covariance + variance.
  • regr_r2 follows Spark's handling of degenerate inputs rather than DataFusion's. When the dependent variable is constant but the independent variable varies (a perfect horizontal fit), Spark 4.1 returns 1.0, whereas DataFusion returns null. Spark 3.4/3.5/4.0 treat the two degenerate cases the other way round (null for a constant dependent variable, 1.0 for a constant independent variable), so a Comet-specific accumulator is warranted.
  • A new Regr protobuf message (with a RegrType enum) wired through QueryPlanSerde and the native PhysicalPlanner.
  • Scala serde for RegrSlope, RegrIntercept, RegrR2, RegrSXY, and RegrReplacement (the rewrite target for regr_sxx/regr_syy).
  • Updated the expression support status in the user guide.
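
As a rough illustration of the composition described in the first bullet, here is a standalone sketch. It is not Comet's CovarianceAccumulator or VarianceAccumulator: `PairMoments` and its field names are hypothetical, and the update is the textbook streaming (Welford-style) recurrence. It only shows how regr_sxx, regr_syy, regr_sxy, regr_slope, and regr_intercept can all be read off the same running state (count, means, m2 for each variable, and the co-moment ck), with any pair containing a NULL skipped.

```rust
/// Hypothetical running state over (y, x) pairs; not Comet's actual types.
#[derive(Debug, Default, Clone, Copy)]
pub struct PairMoments {
    n: f64,
    mean_x: f64,
    mean_y: f64,
    m2_x: f64, // sum of (x - mean_x)^2
    m2_y: f64, // sum of (y - mean_y)^2
    ck: f64,   // sum of (x - mean_x) * (y - mean_y)
}

impl PairMoments {
    /// Fold in one row; rows where either side is NULL are ignored.
    pub fn update(&mut self, y: Option<f64>, x: Option<f64>) {
        let (y, x) = match (y, x) {
            (Some(y), Some(x)) => (y, x),
            _ => return,
        };
        self.n += 1.0;
        let dx = x - self.mean_x; // deviation from the old mean
        let dy = y - self.mean_y;
        self.mean_x += dx / self.n;
        self.mean_y += dy / self.n;
        // Old-mean deviation times new-mean deviation.
        self.m2_x += dx * (x - self.mean_x);
        self.m2_y += dy * (y - self.mean_y);
        self.ck += dx * (y - self.mean_y);
    }

    fn non_empty(&self, v: f64) -> Option<f64> {
        if self.n > 0.0 { Some(v) } else { None }
    }

    pub fn sxx(&self) -> Option<f64> { self.non_empty(self.m2_x) }
    pub fn syy(&self) -> Option<f64> { self.non_empty(self.m2_y) }
    pub fn sxy(&self) -> Option<f64> { self.non_empty(self.ck) }

    /// covar_pop(y, x) / var_pop(x); the 1/n factors cancel, leaving ck / m2_x.
    pub fn slope(&self) -> Option<f64> {
        if self.n == 0.0 || self.m2_x == 0.0 {
            None // empty input or constant x
        } else {
            Some(self.ck / self.m2_x)
        }
    }

    pub fn intercept(&self) -> Option<f64> {
        self.slope().map(|b| self.mean_y - b * self.mean_x)
    }
}
```

Feeding the pairs (3, 1), (5, 2), (7, 3) (that is, y = 2x + 1) gives sxx = 2, syy = 8, sxy = 4, slope = 2, and intercept = 1; a constant x yields None for both slope and intercept.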

This work was scaffolded using the implement-comet-expression project skill.

How are these changes tested?

  • Expanded the existing spark/src/test/resources/sql-tests/expressions/aggregate/regr.sql so all six functions are verified to run natively and match Spark (previously spark_answer_only). New cases cover NULL pairs, single-pair input, constant independent variable (slope/intercept/r2 to NULL), constant dependent variable (r2 to 1.0), grouped aggregation, and literal/column argument mixes.
  • Added Rust unit tests in regr.rs covering perfect-fit slope/intercept/r2, the constant-y and constant-x edge cases, single-pair and empty input, NULL-pair skipping, the raw moments, and partial-state merge across batches.
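
For readers unfamiliar with what "partial-state merge across batches" has to guarantee, the following sketch may help. It is an assumption-laden illustration, not the code in regr.rs: `PartialState`, `from_pairs`, and `merge` are hypothetical names, and the merge uses the standard pairwise combination formula for central moments. The property being exercised is that merging the states of two batches gives the same moments as aggregating all rows at once.

```rust
/// Hypothetical partial aggregation state for (y, x) pairs.
#[derive(Debug, Default, Clone, Copy)]
pub struct PartialState {
    pub n: f64,
    pub mean_x: f64,
    pub mean_y: f64,
    pub m2_x: f64,
    pub m2_y: f64,
    pub ck: f64,
}

impl PartialState {
    /// Two-pass construction from one batch of non-NULL (y, x) pairs.
    pub fn from_pairs(pairs: &[(f64, f64)]) -> Self {
        if pairs.is_empty() {
            return Self::default();
        }
        let n = pairs.len() as f64;
        let mean_y = pairs.iter().map(|p| p.0).sum::<f64>() / n;
        let mean_x = pairs.iter().map(|p| p.1).sum::<f64>() / n;
        let mut s = Self { n, mean_x, mean_y, ..Self::default() };
        for &(y, x) in pairs {
            s.m2_x += (x - mean_x) * (x - mean_x);
            s.m2_y += (y - mean_y) * (y - mean_y);
            s.ck += (x - mean_x) * (y - mean_y);
        }
        s
    }

    /// Combine another partial state into this one.
    pub fn merge(&mut self, other: &Self) {
        if other.n == 0.0 {
            return;
        }
        if self.n == 0.0 {
            *self = *other;
            return;
        }
        let n = self.n + other.n;
        let dx = other.mean_x - self.mean_x;
        let dy = other.mean_y - self.mean_y;
        let w = self.n * other.n / n; // weight of the between-batch term
        self.m2_x += other.m2_x + dx * dx * w;
        self.m2_y += other.m2_y + dy * dy * w;
        self.ck += other.ck + dx * dy * w;
        self.mean_x += dx * other.n / n;
        self.mean_y += dy * other.n / n;
        self.n = n;
    }
}
```

A test in this style would build one state per batch, merge them, and compare each field against the state built from the concatenated batches within a small floating-point tolerance.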

…yy, regr_sxy aggregates

Add native support for the six simple linear regression aggregates that
previously fell back to Spark. regr_avgx, regr_avgy and regr_count were
already accelerated because Spark rewrites them to Average/Count.

The native accumulators are composed from Comet's existing Spark-compatible
covariance and variance accumulators so the partial aggregation state matches
the buffer layout Spark's planner expects between partial and final stages:
RegrReplacement (regr_sxx/regr_syy) -> 3 fields, Covariance (regr_sxy) -> 4,
PearsonCorrelation (regr_r2) -> 6, and the slope/intercept composite -> 7.

regr_r2 matches Spark's behavior of returning 1.0 when the dependent variable
is constant but the independent variable varies (a perfect horizontal fit),
which differs from DataFusion's regr_r2.
# Conflicts:
#	native/core/src/execution/planner.rs
#	native/proto/src/proto/expr.proto
#	spark/src/main/scala/org/apache/comet/serde/QueryPlanSerde.scala
#	spark/src/main/scala/org/apache/comet/serde/aggregates.scala
@andygrove andygrove marked this pull request as draft July 1, 2026 20:51
The regr aggregates diverged from Spark in three ways that surfaced as CI
failures across Spark versions:

- regr_r2 degenerate cases were inverted. Spark 3.4/3.5/4.0 return null when
  the dependent variable is constant and 1.0 when the independent variable is
  constant; Comet had these swapped.
- Spark 4.1 swapped that degenerate handling again (constant dependent -> 1.0,
  constant independent -> null). Route the behaviour through a new
  r2_constant_dependent_is_perfect_fit proto flag set from isSpark41Plus.
- regr_slope/regr_intercept compute VariancePop(x) over both-non-null pairs on
  Spark 3.5+, but over every x-non-null row on Spark 3.4. Route this through a
  new filter_var_by_pair_nulls proto flag set from isSpark35Plus.

Also evaluate regr_r2 as corr * corr, where corr = ck / sqrt(m2_y * m2_x), to mirror
Spark's exact float rounding, so the golden-file postgres aggregates tests match
bit-for-bit.
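
The version-dependent behaviour described in this commit message can be sketched roughly as follows. These are hypothetical free functions, not the code in this PR; only the two flag names are taken from the message above. One assumption is not stated anywhere in the PR: when both m2 values are zero (for example, a single pair), the sketch lets the NULL check win.

```rust
/// Sketch of regr_r2 evaluation from the raw moments.
/// `m2_y` belongs to the dependent variable, `m2_x` to the independent one.
pub fn regr_r2(
    n: u64,
    m2_x: f64,
    m2_y: f64,
    ck: f64,
    r2_constant_dependent_is_perfect_fit: bool, // true on Spark 4.1+, per the message above
) -> Option<f64> {
    if n == 0 {
        return None;
    }
    // (moment whose zero yields NULL, moment whose zero yields 1.0)
    let (null_side, perfect_side) = if r2_constant_dependent_is_perfect_fit {
        (m2_x, m2_y)
    } else {
        (m2_y, m2_x)
    };
    if null_side == 0.0 {
        return None; // assumption: NULL takes precedence when both are zero
    }
    if perfect_side == 0.0 {
        return Some(1.0);
    }
    // Square the correlation rather than computing ck^2 / (m2_x * m2_y),
    // so the rounding follows the order described in the commit message.
    let corr = ck / (m2_y * m2_x).sqrt();
    Some(corr * corr)
}

/// Sketch of which x values feed VariancePop(x) for regr_slope/regr_intercept.
/// Rows are (y, x); with the flag set, only both-non-null pairs contribute.
pub fn x_values_for_variance(
    rows: &[(Option<f64>, Option<f64>)],
    filter_var_by_pair_nulls: bool, // true on Spark 3.5+, per the message above
) -> Vec<f64> {
    rows.iter()
        .filter_map(|&(y, x)| match (y, x) {
            (Some(_), Some(x)) => Some(x),
            (None, Some(x)) if !filter_var_by_pair_nulls => Some(x),
            _ => None,
        })
        .collect()
}
```

With the flag set, a constant dependent variable (m2_y = 0, m2_x > 0) evaluates to 1.0 and a constant independent variable to None; with the flag cleared, the two outcomes swap.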
@andygrove andygrove added this to the 1.0.0 milestone Jul 3, 2026
@andygrove andygrove marked this pull request as ready for review July 3, 2026 22:15