feat: ergonomic native DataFrame/Expr API (substrait.api) #204
nielspardon wants to merge 4 commits into
…ter_functions Extend the builder literal() to construct a literal for every Substrait type (decimal, uuid, precision time/timestamp[_tz], all interval kinds, struct, list, map with empty-list/empty-map handling, and typed nulls via value=None) through a new recursive _make_literal helper. Existing kinds remain byte-identical. Add the missing precision_time case to type_inference.infer_literal_type so every kind round-trips, and add ExtensionRegistry.iter_functions() to enumerate every registered (urn, name, function_type).
…arwhals The module is the Narwhals integration layer, not a general DataFrame; rename it to reflect that role and free the "DataFrame" name for the native frame. BREAKING CHANGE: import substrait.narwhals instead of substrait.dataframe. The module was a minimal, experimental Narwhals wrapper, so impact is expected low.
… type coverage Add substrait.api, a shallow front door over the existing builders:
- expr.Expr: operator overloading (comparison/arithmetic/boolean), literal auto-wrap with peer-type coercion, and .cast()/.alias()/.is_null()
- frame.DataFrame: chainable filter/select/with_columns/sort/limit/join/group_by().agg(), carrying an ExtensionRegistry so it is not threaded through every call
- functions.f: every scalar/aggregate/window function, generated lazily from the registry; multi-extension names resolved by argument type; functions_for() and DataFrame.f expose custom-registry functions
- dtypes: nullability-aware type shortcuts (sub.i64 / sub.i64.non_null) covering every concrete Substrait type
The facade is faithful: it emits byte-identical protobuf to the equivalent builder calls. Adds tests/api covering expressions, frame verbs, function coverage, type coverage and literal construction.
Add examples/api_example.py demonstrating the native substrait.api. Update narwhals_example.py to the renamed substrait.narwhals and label it as the Narwhals integration example. Remove dataframe_example.py, whose direct wrapper usage is superseded by api_example.py (native) and narwhals_example.py.
I haven't looked through everything yet, but my +1 on the overall approach. There are some cases (lambda functions most notably) where narwhals/polars api and substrait representation genuinely diverge from one another. Having a native api and a narwhals wrapper is probably a better long-term decision.
Yeah, definitely a lot of Substrait features to be covered. Wasn't sure yet how to stage it and thought I'd stop after the first couple of iterations to see what the community thinks.
Summary
substrait-python can already build any plan, but the day-to-day building experience lags behind sibling libraries, notably substrait-java, whose core module ships a fluent builder, a nullability-aware type factory, a preloaded extension collection, and named function helpers.
This PR adds a thin, additive facade over the existing substrait.builders layer that gives Python users the idioms they expect from pandas / Polars / PySpark / Ibis.
Nothing in builders, proto, or extension_registry behaviour changes. The facade is faithful: for equivalent inputs it emits byte-identical protobuf to the raw builder path (asserted in tests).
What's here
Native ergonomic API (import substrait.api as sub, a submodule; substrait is a PEP 420 namespace package so the root can't hold an __init__.py):
- expr.Expr: operator overloading (>, +, &, ~, …) mapping to the standard function extensions and resolving lazily; literal auto-wrap with peer-type coercion (col_fp64 * 2, col_i32 > 25 resolve); .cast(), .alias(), .is_null().
- frame.DataFrame: chainable filter/select/with_columns/sort/limit/join/group_by().agg(), carrying an ExtensionRegistry so it isn't threaded through every call. Verb semantics follow Polars.
- functions.f: every scalar/aggregate/window function from the default extensions, generated lazily from the registry (dir(sub.f)-discoverable). Multi-extension names (add, count, …) resolve to the right extension by argument type; keyword names exposed as and_/or_/not_. functions_for(registry) and DataFrame.f surface custom-extension functions.
- dtypes: nullability-aware type shortcuts (sub.i64 nullable, sub.i64.non_null required) covering every concrete Substrait type.
Lower-layer enablers:
- builders.literal() now builds a literal for every Substrait type (decimal, uuid, precision time/timestamp, all intervals, struct/list/map incl. empty variants, typed nulls). Existing kinds stay byte-identical.
- type_inference.infer_literal_type gains the missing precision_time case (every kind round-trips).
- ExtensionRegistry.iter_functions() enumerates registered functions.
Narwhals rename (breaking):
substrait.dataframe → substrait.narwhals, clarifying that it is the Narwhals integration layer (a compliant wrapper that drives plan construction via nw.from_native), distinct from the new native substrait.frame. The two layers compose: the native frame does the plan-building; the Narwhals layer adapts onto it.
Breaking changes
import substrait.dataframe → import substrait.narwhals. The module was a minimal, experimental Narwhals wrapper, so impact is expected to be low.
Testing
New tests/api/ covering expressions, frame verbs, full function coverage, full type coverage, and literal construction (with round-trip assertions).
Follow-ups (not in this PR)
- … substrait.narwhals out into a full Narwhals compliant backend (CompliantLazyFrame / Expr / Namespace) on top of substrait.frame. Note: the Narwhals compliant protocol is experimental, and collect() implies execution (a non-goal), so it would raise or delegate to a consumer.
- … f.*; column disambiguation after joins.
🤖 Generated with AI
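For readers unfamiliar with the pattern, the operator overloading and literal auto-wrap with peer-type coercion described for expr.Expr can be illustrated with a small standalone toy. This is only a sketch under stated assumptions, not the code in this PR: the Expr fields, the col() helper, and the dtype strings ("i32", "fp64", "bool") are hypothetical, the toy builds its tree eagerly rather than resolving lazily, and it does not touch substrait.builders or emit protobuf.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


# Hypothetical toy model of an expression node; NOT the substrait.api classes.
@dataclass(frozen=True)
class Expr:
    kind: str                        # "col", "lit", or "call"
    name: str                        # column name, literal repr, or function name
    dtype: Optional[str] = None      # illustrative type tag, e.g. "i32"
    args: Tuple["Expr", ...] = ()    # operands for "call" nodes

    def _wrap(self, other):
        """Auto-wrap a plain Python value as a literal typed like its peer."""
        if isinstance(other, Expr):
            return other
        return Expr("lit", repr(other), self.dtype)

    def _call(self, fn, other, dtype=None):
        """Build a binary function call; result type defaults to self's type."""
        return Expr("call", fn, dtype or self.dtype, (self, self._wrap(other)))

    # Each operator maps to a named function, as the description outlines.
    def __gt__(self, other):
        return self._call("gt", other, "bool")

    def __add__(self, other):
        return self._call("add", other)

    def __mul__(self, other):
        return self._call("multiply", other)

    def __and__(self, other):
        return self._call("and", other, "bool")

    def __invert__(self):
        return Expr("call", "not", "bool", (self,))


def col(name, dtype):
    """Hypothetical helper: reference a column with a known type tag."""
    return Expr("col", name, dtype)


age = col("age", "i32")
price = col("price", "fp64")

# The int 25 is wrapped as an i32 literal; 2 and 100.0 as fp64 literals.
pred = (age > 25) & ~(price * 2 > 100.0)
```

In this toy, coercion simply copies the peer's type tag onto the wrapped literal; a real implementation would also need to validate that the Python value fits the target type.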