feat: expose spark-compatible functions by timsaucer · Pull Request #1564 · apache/datafusion-python

timsaucer · 2026-05-30T00:28:02Z

Which issue does this PR close?

Closes #1482

Rationale for this change

This expands the pool of available functions for users. Some replace existing functions and others are new.

What changes are included in this PR?

Expose spark functions
Add feature to enable replacing built in functions with spark when using SQL
Unit tests

Are there any user-facing changes?

No existing functions are impacted. New APIs and functions exposed.

Add `datafusion.functions.spark` module exposing the upstream `datafusion-spark` crate's UDF/UDAF library (~87 functions across string, math, datetime, hash, array, aggregate, bitwise, bitmap, conditional, collection, conversion, json, map, url categories). For DataFrame use, import the typed Python wrappers from `datafusion.functions.spark`. For SQL use, call `SessionContext.enable_spark_functions()` to register the Spark UDFs by name (overriding DataFusion built-ins of the same name with their Spark semantics — NULL-propagating `concat`, 1-indexed `substring`, HALF_UP `round`, etc.). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Seven `#[allow(clippy::borrow_deref_ref)]` attributes on module declarations in `crates/core/src/lib.rs` had become stale — the only remaining lint hit was a redundant `&*x.as_str()` pattern in `parse_file_compression_type`. Rewriting that call to `&x.unwrap_or_default()` lets every allow come off, removing noise that new modules were copying without need. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Switch most spark wrappers from UDF-direct path (which forced `spark_udf_fixed!(name, fn_category::name, args...)` repetition) to a `spark_expr_fn!` macro that mirrors the existing `expr_fn!` macro in `functions.rs`, so calls collapse to `spark_expr_fn!(sha2, arg1 bit_length);`. UDF-direct retained for genuinely variadic functions whose upstream `expr_fn` wrappers were generated with a single-`Expr` arm by `export_functions!` (concat, array, xxhash64, parse_url family, etc.) so that the Python side keeps its `*args` ergonomics. Aggregates collapse the same way via `spark_aggregate!` mirroring `aggregate_function!`. Net 173 lines removed. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The intro wording implied "SQL functions" only; the same wrappers are the primary entry point for the DataFrame API as well. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Replace API-speak ("Import the submodule", "Returned values are Expr instances that compose") with a concrete description of where users can actually drop these calls. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Hand-maintained category list would drift from the actual module as upstream `datafusion-spark` adds/removes functions. Replace with a pointer to the AutoAPI-generated reference, which renders from the module itself. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

38 wrappers carried `# doctest: +SKIP` because outputs weren't verified at authoring time. Run each with concrete inputs, capture actual outputs, and inline the values so the doctests execute and stay correct. Covers datetime (20), URL (5), bitmap (3), map (3), and remaining hash, JSON, math, string, conversion, and format_string cases. Net new doctest coverage: 65 examples now run that were skipped before; total skipped across the suite drops from 53 to 12. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Align positional parameter names in `functions.spark` with pyspark.sql.functions: - aggregate first positional → `col` (avg, try_sum, collect_list, collect_set) - unary `arg` → `col` across math/string/byte/datetime helpers - multi-arg renames: array_contains (col, value), array (*cols), shuffle (col), array_repeat (col, count), slice (x, start, length), shiftleft/right/rightunsigned (col, numBits), add_months (start, months), date_add/sub (start, days), date_diff (end, start), date_trunc (format, timestamp), time_trunc (unit, time), trunc (date, format), next_day (date, dayOfWeek), from/to_utc_timestamp (timestamp, tz), sha2 (col, numBits), xxhash64 (*cols), map_from_arrays (col1, col2), width_bucket (v, min, max, numBucket), substring (str, pos, len), concat (*cols), elt (*inputs), is_valid_utf8/make_valid_utf8 (str) Bodies updated to reference the new names; positional callers unaffected. This finishes Category 1 / Category 4 (spark-side BOTH-bucket) renames from PYSPARK_ALIGNMENT_PLAN.md PR 1. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Match pyspark's optional-parameter surface in the spark namespace: - make_dt_interval, make_interval: all parts default to zero (int32 0 / lit 0.0) - str_to_map: pair_delim defaults to ',', key_value_delim defaults to ':' - round: scale defaults to 0 (HALF_UP rounding to nearest integer) - shuffle: accepts `seed` kwarg for pyspark parity; raises NotImplementedError for non-None values until the Rust binding supports it - like, ilike: accept `escapeChar` for pyspark parity; same NotImplementedError guard; first positional renamed `string` → `str` to match pyspark ceil/floor `scale=` deferred — the underlying Rust expr_fn is single-arg. Added a module-level `_ZERO_I32` literal to avoid rebuilding the pyarrow int32 zero scalar on every call. Tests: positional-compat coverage for aggregates (`spark.avg(col)` etc.), defaults-omitted cases for the optional-arg functions, and NotImplementedError cases for `shuffle(seed=)` and `like/ilike(escapeChar=)`. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Replace generic ``*args`` with explicit pyspark-style signatures: - json_tuple(col, *fields) — first positional is the JSON expr - format_string(format, *cols) — `format` is the printf template; a plain ``str`` is auto-promoted to a literal - parse_url(url, partToExtract, key=None) — `key` is optional and only meaningful with ``partToExtract='QUERY'`` - try_parse_url(url, partToExtract, key=None) — same shape - url_decode(str), try_url_decode(str), url_encode(str) — single-argument forms (multi-arg calls were always semantically wrong) Tests cover the three-arg parse_url path and the plain-str format_string auto-promotion. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

`functions.spark` mirrors `pyspark.sql.functions` and now ships on this branch. Update every skill that references the function surface: - skills/datafusion_python/SKILL.md (user-facing): add an import reference, a Core Abstractions row, and a "Spark-Compatible Functions" subsection listing coverage by category, the SQL-vs-DataFrame usage (`enable_spark_functions`), and the divergent-semantics table (concat NULL, round HALF_UP, trunc) so callers know which namespace to pick. - .ai/skills/check-upstream/SKILL.md: new area for the `datafusion-spark` crate with the coverage policy (parity with pyspark, extras allowed when positional pyspark calls still work). Hygiene check also now spans `functions/spark.py`'s `__all__`. - .ai/skills/audit-skill-md/SKILL.md: add `functions.spark` to the surface table and a `spark-functions` scope so this audit also validates the new subsection and divergent-semantics table. - .ai/skills/make-pythonic/SKILL.md: explicit scope note that the spark namespace is a deliberate pyspark mirror — generic native-type coercion does not apply there. Path references updated to the new `functions/__init__.py` module layout. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

timsaucer and others added 12 commits May 29, 2026 20:25

docs: clarify spark functions cover DataFrame API too

45f0b68

The intro wording implied "SQL functions" only; the same wrappers are the primary entry point for the DataFrame API as well. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

docs: rewrite spark DataFrame intro for users

7a7f6b1

Replace API-speak ("Import the submodule", "Returned values are Expr instances that compose") with a concrete description of where users can actually drop these calls. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Merge branch 'main' into feat/expose-spark-functions

026e7f4

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: expose spark-compatible functions#1564

feat: expose spark-compatible functions#1564
timsaucer wants to merge 12 commits into
apache:mainfrom
timsaucer:feat/expose-spark-functions

timsaucer commented May 30, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

timsaucer commented May 30, 2026

Which issue does this PR close?

Rationale for this change

What changes are included in this PR?

Are there any user-facing changes?

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant