feat: ergonomic native DataFrame/Expr API (substrait.api) #204

Draft
nielspardon wants to merge 4 commits into substrait-io:main from nielspardon:ergonomic-api

@nielspardon nielspardon commented Jul 2, 2026

Member

Summary

substrait-python can already build any plan, but the day-to-day building experience lags behind sibling libraries — notably substrait-java, whose core module ships a fluent builder, a nullability-aware type factory, a preloaded extension collection, and named function helpers.

This PR adds a thin, additive facade over the existing substrait.builders layer that gives Python users the idioms they expect from pandas / Polars / PySpark / Ibis:

import substrait.api as sub

plan = (
    sub.read_named_table("people", {"id": sub.i64, "age": sub.i64, "name": sub.string})
    .filter(sub.col("age") > 25)
    .with_columns(bonus=(sub.col("age") + 1) * 2)
    .select("id", "name", "bonus")
    .to_plan()
)

Nothing in builders, proto, or extension_registry behaviour changes. The facade is faithful: for equivalent inputs it emits byte-identical protobuf to the raw builder path (asserted in tests).

Draft: opening for design feedback on the API shape and the Narwhals direction before polishing.

What's here

Native ergonomic API (import substrait.api as sub, a submodule — substrait is a PEP 420 namespace package so the root can't hold an __init__.py):

  • expr.Expr — operator overloading (>, +, &, ~, …) mapping to the standard function extensions and resolving lazily; literal auto-wrap with peer-type coercion (col_fp64 * 2, col_i32 > 25 resolve); .cast(), .alias(), .is_null().
  • frame.DataFrame — chainable filter / select / with_columns / sort / limit / join / group_by().agg(), carrying an ExtensionRegistry so it isn't threaded through every call. Verb semantics follow Polars.
  • functions.f: every scalar/aggregate/window function from the default extensions, generated lazily from the registry (dir(sub.f)-discoverable). Multi-extension names (add, count, …) resolve to the right extension by argument type; keyword names exposed as and_/or_/not_. functions_for(registry) and DataFrame.f surface custom-extension functions.
  • dtypes — nullability-aware type shortcuts (sub.i64 nullable, sub.i64.non_null required) covering every concrete Substrait type.
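
The operator-overloading and literal auto-wrap behaviour described for expr.Expr can be illustrated with a toy, standalone sketch. This is not the substrait.api implementation: ToyExpr, Resolved, col, and the type-name strings are hypothetical, and resolution here only produces a printable string rather than Substrait protobuf. The point is that operators build a deferred computation and that a bare Python value adopts the type of its peer when the expression is finally resolved against a schema.

```python
from dataclasses import dataclass
from typing import Callable, Dict

Schema = Dict[str, str]  # column name -> type name, e.g. {"age": "i64"}


@dataclass(frozen=True)
class Resolved:
    """Result of resolving an expression against a schema."""
    text: str
    dtype: str


class ToyExpr:
    """Hypothetical lazy expression: wraps a function that resolves later."""

    def __init__(self, resolve: Callable[[Schema], Resolved]):
        self._resolve = resolve

    def resolve(self, schema: Schema) -> Resolved:
        return self._resolve(schema)

    def _binary(self, other, name: str, result_type=None) -> "ToyExpr":
        def resolve(schema: Schema) -> Resolved:
            left = self.resolve(schema)
            if isinstance(other, ToyExpr):
                right = other.resolve(schema)
            else:
                # Literal auto-wrap: a bare Python value takes the peer's type.
                right = Resolved(f"{other!r}:{left.dtype}", left.dtype)
            dtype = result_type or left.dtype
            return Resolved(f"{name}({left.text}, {right.text})", dtype)

        # Nothing is computed here; the closure runs only when resolve() is called.
        return ToyExpr(resolve)

    def __add__(self, other):
        return self._binary(other, "add")

    def __mul__(self, other):
        return self._binary(other, "multiply")

    def __gt__(self, other):
        return self._binary(other, "gt", result_type="boolean")

    def __and__(self, other):
        return self._binary(other, "and", result_type="boolean")


def col(name: str) -> ToyExpr:
    """Reference a column; its type is looked up only at resolve time."""
    return ToyExpr(lambda schema: Resolved(name, schema[name]))


bonus = (col("age") + 1) * 2
print(bonus.resolve({"age": "i64"}).text)
```

Because the schema is supplied only at resolve time, the same expression object can be reused against different inputs, which is one plausible reason for resolving lazily.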

Lower-layer enablers:

  • builders.literal() now builds a literal for every Substrait type (decimal, uuid, precision time/timestamp, all intervals, struct/list/map incl. empty variants, typed nulls). Existing kinds stay byte-identical.
  • type_inference.infer_literal_type gains the missing precision_time case (every kind round-trips).
  • ExtensionRegistry.iter_functions() enumerates registered functions.
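
To make the link between iter_functions() and the lazily generated, dir()-discoverable function namespace concrete, here is a minimal sketch of one way such a namespace could be built. ToyRegistry and ToyFunctions are hypothetical stand-ins, the URNs are placeholders, and the returned dict is only a marker for "a function call was built"; only the (urn, name, function_type) triple shape is taken from the commit message.

```python
import keyword
from typing import Dict, Iterator, List, Optional, Tuple

FunctionEntry = Tuple[str, str, str]  # (urn, name, function_type)


class ToyRegistry:
    """Hypothetical stand-in for a registry that can enumerate its functions."""

    def __init__(self, functions: List[FunctionEntry]):
        self._functions = functions

    def iter_functions(self) -> Iterator[FunctionEntry]:
        yield from self._functions


class ToyFunctions:
    """Namespace whose attributes are derived from the registry on first use."""

    def __init__(self, registry: ToyRegistry):
        self._registry = registry
        self._index: Optional[Dict[str, List[FunctionEntry]]] = None

    def _load(self) -> Dict[str, List[FunctionEntry]]:
        if self._index is None:
            index: Dict[str, List[FunctionEntry]] = {}
            for urn, name, kind in self._registry.iter_functions():
                # Python keywords cannot be attribute names, so "and" -> "and_".
                attr = name + "_" if keyword.iskeyword(name) else name
                index.setdefault(attr, []).append((urn, name, kind))
            self._index = index
        return self._index

    def __getattr__(self, attr: str):
        # Called only when normal attribute lookup fails.
        if attr.startswith("_"):
            raise AttributeError(attr)
        try:
            candidates = self._load()[attr]
        except KeyError:
            raise AttributeError(attr) from None

        def call(*args):
            # A real implementation would pick one candidate by argument type.
            return {
                "name": candidates[0][1],
                "candidates": [urn for urn, _, _ in candidates],
                "args": args,
            }

        return call

    def __dir__(self):
        return sorted(self._load())


f = ToyFunctions(
    ToyRegistry(
        [
            ("urn:example:arithmetic", "add", "scalar"),
            ("urn:example:decimal", "add", "scalar"),
            ("urn:example:boolean", "and", "scalar"),
        ]
    )
)
print(dir(f))
```

Names registered by more than one extension end up with several candidates under a single attribute, which is where resolution by argument type would have to happen.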

Narwhals rename (breaking): substrait.dataframe → substrait.narwhals, clarifying that it is the Narwhals integration layer (a compliant wrapper that drives plan construction via nw.from_native), distinct from the new native substrait.frame. The two layers compose: the native frame does the plan-building; the Narwhals layer adapts onto it.

Breaking changes

  • import substrait.dataframe → import substrait.narwhals. The module was a minimal, experimental Narwhals wrapper, so impact is expected to be low.
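
For downstream code that has to run against releases both before and after the rename, one possible approach is a fallback import. The helper below is a hypothetical sketch and not part of substrait-python; only the two module names come from this PR.

```python
import importlib
from types import ModuleType


def import_first(*module_names: str) -> ModuleType:
    """Return the first module in module_names that can be imported."""
    last_error = None
    for name in module_names:
        try:
            return importlib.import_module(name)
        except ImportError as exc:  # includes ModuleNotFoundError
            last_error = exc
    raise ImportError(f"none of {module_names} could be imported") from last_error


# Intended usage (prefers the new name, falls back to the old one):
#   sub_nw = import_first("substrait.narwhals", "substrait.dataframe")
```

Note that catching ImportError this broadly also hides import failures raised inside the new module itself, so this is best kept as a short-lived migration aid.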

Testing

New tests/api/ covering expressions, frame verbs, full function coverage, full type coverage, and literal construction (with round-trip assertions).

Follow-ups (not in this PR)

  • Build substrait.narwhals out into a full Narwhals compliant backend (CompliantLazyFrame / Expr / Namespace) on top of substrait.frame. Note: the Narwhals compliant protocol is experimental, and collect() implies execution (a non-goal), so it would raise or delegate to a consumer.
  • Function options / configurable behaviours on f.*; column disambiguation after joins.

🤖 Generated with AI

…ter_functions

Extend the builder literal() to construct a literal for every Substrait type
(decimal, uuid, precision time/timestamp[_tz], all interval kinds, struct, list,
map with empty-list/empty-map handling, and typed nulls via value=None) through a
new recursive _make_literal helper. Existing kinds remain byte-identical.
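
As a rough illustration of the recursive shape such a helper might take (typed nulls via None, a distinct empty-list case, nested containers), here is a toy sketch. It is not the PR's _make_literal: make_literal, the dict-based type descriptions, and the dict-based literal output are all hypothetical and only loosely mirror the Substrait literal kinds named above.

```python
from typing import Any, Dict

TypeDesc = Dict[str, Any]  # e.g. {"kind": "list", "element": {"kind": "i64"}}


def make_literal(value: Any, dtype: TypeDesc) -> Dict[str, Any]:
    """Build a toy literal for value, guided by the declared type."""
    kind = dtype["kind"]

    # A typed null carries its type instead of a value.
    if value is None:
        return {"null": dtype}

    if kind == "list":
        # An empty list has no elements to infer from, so it keeps the type.
        if len(value) == 0:
            return {"empty_list": dtype}
        return {"list": [make_literal(v, dtype["element"]) for v in value]}

    if kind == "struct":
        fields = dtype["fields"]
        if len(value) != len(fields):
            raise ValueError("struct value does not match declared field count")
        return {"struct": [make_literal(v, t) for v, t in zip(value, fields)]}

    # Scalar kinds: store the value under the kind name.
    return {kind: value}


list_of_i64 = {"kind": "list", "element": {"kind": "i64"}}
print(make_literal([1, None, 3], list_of_i64))
```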

Add the missing precision_time case to type_inference.infer_literal_type so every
kind round-trips, and add ExtensionRegistry.iter_functions() to enumerate every
registered (urn, name, function_type).
…arwhals

The module is the Narwhals integration layer, not a general DataFrame; rename it
to reflect that role and free the "DataFrame" name for the native frame.

BREAKING CHANGE: import substrait.narwhals instead of substrait.dataframe. The
module was a minimal, experimental Narwhals wrapper, so impact is expected to be low.
… type coverage

Add substrait.api, a shallow front door over the existing builders:

- expr.Expr: operator overloading (comparison/arithmetic/boolean), literal
  auto-wrap with peer-type coercion, and .cast()/.alias()/.is_null()
- frame.DataFrame: chainable filter/select/with_columns/sort/limit/join/
  group_by().agg(), carrying an ExtensionRegistry so it is not threaded through
  every call
- functions.f: every scalar/aggregate/window function, generated lazily from the
  registry; multi-extension names resolved by argument type; functions_for()
  and DataFrame.f expose custom-registry functions
- dtypes: nullability-aware type shortcuts (sub.i64 / sub.i64.non_null) covering
  every concrete Substrait type
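
The nullable-by-default shortcut with a .non_null variant can be sketched with an immutable type object. ToyType and the "?" suffix used for printing are assumptions made for this illustration, not the dtypes module's actual classes or output.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class ToyType:
    """Hypothetical type shortcut: nullable unless .non_null is requested."""
    name: str
    nullable: bool = True

    @property
    def non_null(self) -> "ToyType":
        # Returns a new object, so the shared module-level shortcut is never mutated.
        return replace(self, nullable=False)

    def __str__(self) -> str:
        return self.name + ("?" if self.nullable else "")


i64 = ToyType("i64")
string = ToyType("string")

schema = {"id": i64.non_null, "name": string}
print({column: str(dtype) for column, dtype in schema.items()})
```

Keeping the shortcuts frozen matters because they are shared module-level objects: a mutable i64 whose nullability flipped in place would silently change every schema that referenced it.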

The facade is faithful: it emits byte-identical protobuf to the equivalent
builder calls. Adds tests/api covering expressions, frame verbs, function
coverage, type coverage and literal construction.
Add examples/api_example.py demonstrating the native substrait.api. Update
narwhals_example.py to the renamed substrait.narwhals and label it as the
Narwhals integration example. Remove dataframe_example.py, whose direct
wrapper usage is superseded by api_example.py (native) and narwhals_example.py.

tokoko commented Jul 2, 2026

Contributor

I haven't looked through everything yet, but my +1 on the overall approach. There are some cases (lambda functions most notably) where narwhals/polars api and substrait representation genuinely diverge from one another. Having a native api and a narwhals wrapper is probably a better long-term decision.

@nielspardon

Member Author

> I haven't looked through everything yet, but my +1 on the overall approach. There are some cases (lambda functions most notably) where narwhals/polars api and substrait representation genuinely diverge from one another. Having a native api and a narwhals wrapper is probably a better long-term decision.

Yeah, definitely a lot of Substrait features to be covered. I wasn't sure yet how to stage it and thought I'd stop after the first couple of iterations to see what the community thinks.
