feat: implement regr_slope, regr_intercept, regr_r2, regr_sxx, regr_syy, regr_sxy aggregates#4775
Open
andygrove wants to merge 3 commits into
Open
feat: implement regr_slope, regr_intercept, regr_r2, regr_sxx, regr_syy, regr_sxy aggregates#4775andygrove wants to merge 3 commits into
andygrove wants to merge 3 commits into
Conversation
…yy, regr_sxy aggregates Add native support for the six simple linear regression aggregates that previously fell back to Spark. regr_avgx, regr_avgy and regr_count were already accelerated because Spark rewrites them to Average/Count. The native accumulators are composed from Comet's existing Spark-compatible covariance and variance accumulators so the partial aggregation state matches the buffer layout Spark's planner expects between partial and final stages: RegrReplacement (regr_sxx/regr_syy) -> 3 fields, Covariance (regr_sxy) -> 4, PearsonCorrelation (regr_r2) -> 6, and the slope/intercept composite -> 7. regr_r2 matches Spark's behavior of returning 1.0 when the dependent variable is constant but the independent variable varies (a perfect horizontal fit), which differs from DataFusion's regr_r2.
# Conflicts: # native/core/src/execution/planner.rs # native/proto/src/proto/expr.proto # spark/src/main/scala/org/apache/comet/serde/QueryPlanSerde.scala # spark/src/main/scala/org/apache/comet/serde/aggregates.scala
The regr aggregates diverged from Spark in three ways that surfaced as CI failures across Spark versions: - regr_r2 degenerate cases were inverted. Spark 3.4/3.5/4.0 return null when the dependent variable is constant and 1.0 when the independent variable is constant; Comet had these swapped. - Spark 4.1 swapped that degenerate handling again (constant dependent -> 1.0, constant independent -> null). Route the behaviour through a new r2_constant_dependent_is_perfect_fit proto flag set from isSpark41Plus. - regr_slope/regr_intercept compute VariancePop(x) over both-non-null pairs on Spark 3.5+, but over every x-non-null row on Spark 3.4. Route this through a new filter_var_by_pair_nulls proto flag set from isSpark35Plus. Also evaluate regr_r2 as corr = ck / sqrt(m2_y * m2_x); corr * corr to mirror Spark's exact float rounding, so the golden-file postgres aggregates tests match bit-for-bit.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Which issue does this PR close?
Closes #4552.
Rationale for this change
Comet already accelerates
regr_avgx,regr_avgy, andregr_count(Spark rewrites those toAverage/Count), but the remaining six SQL-standard linear regression aggregates fell back to Spark. This PR implements native support forregr_slope,regr_intercept,regr_r2,regr_sxx,regr_syy, andregr_sxy, so a query using any of them can run fully on Comet instead of falling back.What changes are included in this PR?
Regraggregate UDF innative/spark-expr/src/agg_funcs/regr.rs. Rather than re-implementing the statistics, each function is composed from Comet's existing Spark-compatibleCovarianceAccumulatorandVarianceAccumulator. This keeps the partial aggregation state byte-compatible with the buffer layout Spark's planner declares for the partial to final shuffle:regr_sxx/regr_syyreach Comet asRegrReplacement(aCentralMomentAgg, 3-field buffer) and reuse the variance accumulator, evaluating tom2.regr_sxy(a populationCovariance, 4-field buffer) reuses the covariance accumulator, evaluating to the co-momentck.regr_r2(aPearsonCorrelation, 6-field buffer) composes covariance + two variances.regr_slope/regr_intercept(a declarative composite, 7-field buffer) compose population covariance + variance.regr_r2matches Spark's behavior of returning1.0when the dependent variable is constant but the independent variable varies (a perfect horizontal fit). This is the one case where Spark and DataFusion'sregr_r2diverge (DataFusion returnsnull), so a Comet-specific accumulator is warranted.Regrprotobuf message (with aRegrTypeenum) wired throughQueryPlanSerdeand the nativePhysicalPlanner.RegrSlope,RegrIntercept,RegrR2,RegrSXY, andRegrReplacement(the rewrite target forregr_sxx/regr_syy).This work was scaffolded using the
implement-comet-expressionproject skill.How are these changes tested?
spark/src/test/resources/sql-tests/expressions/aggregate/regr.sqlso all six functions are verified to run natively and match Spark (previouslyspark_answer_only). New cases cover NULL pairs, single-pair input, constant independent variable (slope/intercept/r2 to NULL), constant dependent variable (r2 to 1.0), grouped aggregation, and literal/column argument mixes.regr.rscovering perfect-fit slope/intercept/r2, the constant-y and constant-x edge cases, single-pair and empty input, NULL-pair skipping, the raw moments, and partial-state merge across batches.