Disclosure: this issue was drafted with the help of an AI assistant.
Is your feature request related to a problem?
Today, when a single expression in a ProjectExec or FilterExec has no native translation, Comet abandons the whole operator. CometProjectExec.convert requires every projectList entry to convert, and CometFilterExec.convert requires the predicate to convert; one None falls the operator back to Spark. That also breaks the native island below it: a ColumnarToRow transition materializes the scan output and the supported sibling expressions run row-wise in Spark WSCG. One unsupported expression sitting above a native (Parquet/Iceberg/Delta) scan discards the native-scan investment for the whole pipeline. This is one of the most common real-world reasons native coverage collapses.
Describe the solution you'd like
Evaluate only the unsupported subexpression in the JVM and keep the operator — and the pipeline — native. This is the pattern Gluten ships as ColumnarPartialProjectExec and Blaze/Auron ship over Arrow FFI.
Comet is well positioned because the JVM-callback machinery already exists on main:
JvmScalarUdfExpr — a DataFusion PhysicalExpr that exports its argument arrays over the Arrow C Data Interface and calls back into the JVM.
CometUdfBridge + CometScalaUDFCodegen / CometBatchKernelCodegen, which Janino-compile Spark's own doGenCode into an Arrow-direct batch kernel.
What is missing is a splitter: a last-resort hook that routes an otherwise-unsupported projection/filter subexpression to that existing detour. Proposed shape for a first version:
- A last-resort hook in expression serde that retries any unsupported node via the existing codegen dispatcher, at expression granularity (fires at the outermost unsupported node; supported ancestors stay native and reference its output).
- Scoped to
ProjectExec and FilterExec only in v1. Join keys, sort keys, aggregate/group expressions, and partitioning keep today's all-or-nothing behavior.
- Plan-time eligibility via the dispatcher's existing
canHandle (rejects aggregates, generators, Unevaluable, unsupported types, and oversized trees); subquery-bearing trees fall back cleanly rather than failing in the kernel.
- Behind a new config, default off, while it stabilizes.
- No proto or native changes — reuses
JvmScalarUdfExpr.
Results run through Spark's own doGenCode / eval inside the kernel, so they match Spark by construction. Iceberg/Delta inherit this for free since the hook lives in projection/filter conversion, above the scan.
Describe alternatives you've considered
A dedicated CometPartialProjectExec operator (Gluten's shape) — rejected for v1 because Comet already has expression-granularity callback, so an operator adds plan surgery and per-operator batch recomposition for strictly less composability than a PhysicalExpr that works in any expression slot.
Additional context
A prototype implementation with serde/parity/island tests, hook fuzzing, and a microbenchmark is in progress. Early numbers on a projection/filter over a native Parquet scan show the detour keeping the island native and running ~1.2–1.3× vs. Spark where the whole operator would otherwise fall back.
Disclosure: this issue was drafted with the help of an AI assistant.
Is your feature request related to a problem?
Today, when a single expression in a
ProjectExecorFilterExechas no native translation, Comet abandons the whole operator.CometProjectExec.convertrequires everyprojectListentry to convert, andCometFilterExec.convertrequires the predicate to convert; oneNonefalls the operator back to Spark. That also breaks the native island below it: aColumnarToRowtransition materializes the scan output and the supported sibling expressions run row-wise in Spark WSCG. One unsupported expression sitting above a native (Parquet/Iceberg/Delta) scan discards the native-scan investment for the whole pipeline. This is one of the most common real-world reasons native coverage collapses.Describe the solution you'd like
Evaluate only the unsupported subexpression in the JVM and keep the operator — and the pipeline — native. This is the pattern Gluten ships as
ColumnarPartialProjectExecand Blaze/Auron ship over Arrow FFI.Comet is well positioned because the JVM-callback machinery already exists on
main:JvmScalarUdfExpr— a DataFusionPhysicalExprthat exports its argument arrays over the Arrow C Data Interface and calls back into the JVM.CometUdfBridge+CometScalaUDFCodegen/CometBatchKernelCodegen, which Janino-compile Spark's owndoGenCodeinto an Arrow-direct batch kernel.What is missing is a splitter: a last-resort hook that routes an otherwise-unsupported projection/filter subexpression to that existing detour. Proposed shape for a first version:
ProjectExecandFilterExeconly in v1. Join keys, sort keys, aggregate/group expressions, and partitioning keep today's all-or-nothing behavior.canHandle(rejects aggregates, generators,Unevaluable, unsupported types, and oversized trees); subquery-bearing trees fall back cleanly rather than failing in the kernel.JvmScalarUdfExpr.Results run through Spark's own
doGenCode/evalinside the kernel, so they match Spark by construction. Iceberg/Delta inherit this for free since the hook lives in projection/filter conversion, above the scan.Describe alternatives you've considered
A dedicated
CometPartialProjectExecoperator (Gluten's shape) — rejected for v1 because Comet already has expression-granularity callback, so an operator adds plan surgery and per-operator batch recomposition for strictly less composability than aPhysicalExprthat works in any expression slot.Additional context
A prototype implementation with serde/parity/island tests, hook fuzzing, and a microbenchmark is in progress. Early numbers on a projection/filter over a native Parquet scan show the detour keeping the island native and running ~1.2–1.3× vs. Spark where the whole operator would otherwise fall back.