Skip to content

Partial project fallback: keep a projection/filter native by evaluating an unsupported subexpression in the JVM #4825

Description

@schenksj

Disclosure: this issue was drafted with the help of an AI assistant.

Is your feature request related to a problem?

Today, when a single expression in a ProjectExec or FilterExec has no native translation, Comet abandons the whole operator. CometProjectExec.convert requires every projectList entry to convert, and CometFilterExec.convert requires the predicate to convert; one None falls the operator back to Spark. That also breaks the native island below it: a ColumnarToRow transition materializes the scan output and the supported sibling expressions run row-wise in Spark WSCG. One unsupported expression sitting above a native (Parquet/Iceberg/Delta) scan discards the native-scan investment for the whole pipeline. This is one of the most common real-world reasons native coverage collapses.

Describe the solution you'd like

Evaluate only the unsupported subexpression in the JVM and keep the operator — and the pipeline — native. This is the pattern Gluten ships as ColumnarPartialProjectExec and Blaze/Auron ship over Arrow FFI.

Comet is well positioned because the JVM-callback machinery already exists on main:

  • JvmScalarUdfExpr — a DataFusion PhysicalExpr that exports its argument arrays over the Arrow C Data Interface and calls back into the JVM.
  • CometUdfBridge + CometScalaUDFCodegen / CometBatchKernelCodegen, which Janino-compile Spark's own doGenCode into an Arrow-direct batch kernel.

What is missing is a splitter: a last-resort hook that routes an otherwise-unsupported projection/filter subexpression to that existing detour. Proposed shape for a first version:

  • A last-resort hook in expression serde that retries any unsupported node via the existing codegen dispatcher, at expression granularity (fires at the outermost unsupported node; supported ancestors stay native and reference its output).
  • Scoped to ProjectExec and FilterExec only in v1. Join keys, sort keys, aggregate/group expressions, and partitioning keep today's all-or-nothing behavior.
  • Plan-time eligibility via the dispatcher's existing canHandle (rejects aggregates, generators, Unevaluable, unsupported types, and oversized trees); subquery-bearing trees fall back cleanly rather than failing in the kernel.
  • Behind a new config, default off, while it stabilizes.
  • No proto or native changes — reuses JvmScalarUdfExpr.

Results run through Spark's own doGenCode / eval inside the kernel, so they match Spark by construction. Iceberg/Delta inherit this for free since the hook lives in projection/filter conversion, above the scan.

Describe alternatives you've considered

A dedicated CometPartialProjectExec operator (Gluten's shape) — rejected for v1 because Comet already has expression-granularity callback, so an operator adds plan surgery and per-operator batch recomposition for strictly less composability than a PhysicalExpr that works in any expression slot.

Additional context

A prototype implementation with serde/parity/island tests, hook fuzzing, and a microbenchmark is in progress. Early numbers on a projection/filter over a native Parquet scan show the detour keeping the island native and running ~1.2–1.3× vs. Spark where the whole operator would otherwise fall back.

Metadata

Metadata

Assignees

No one assigned

    Type

    No type

    Fields

    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions