Skip to content

feat: expose spark-compatible functions#1564

Draft
timsaucer wants to merge 12 commits into
apache:mainfrom
timsaucer:feat/expose-spark-functions
Draft

feat: expose spark-compatible functions#1564
timsaucer wants to merge 12 commits into
apache:mainfrom
timsaucer:feat/expose-spark-functions

Conversation

@timsaucer
Copy link
Copy Markdown
Member

Which issue does this PR close?

Closes #1482

Rationale for this change

This expands the pool of available functions for users. Some replace existing functions and others are new.

What changes are included in this PR?

  • Expose spark functions
  • Add feature to enable replacing built in functions with spark when using SQL
  • Unit tests

Are there any user-facing changes?

No existing functions are impacted. New APIs and functions exposed.

timsaucer and others added 12 commits May 29, 2026 20:25
Add `datafusion.functions.spark` module exposing the upstream
`datafusion-spark` crate's UDF/UDAF library (~87 functions across string,
math, datetime, hash, array, aggregate, bitwise, bitmap, conditional,
collection, conversion, json, map, url categories).

For DataFrame use, import the typed Python wrappers from
`datafusion.functions.spark`. For SQL use, call
`SessionContext.enable_spark_functions()` to register the Spark UDFs by
name (overriding DataFusion built-ins of the same name with their Spark
semantics — NULL-propagating `concat`, 1-indexed `substring`, HALF_UP
`round`, etc.).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Seven `#[allow(clippy::borrow_deref_ref)]` attributes on module
declarations in `crates/core/src/lib.rs` had become stale — the only
remaining lint hit was a redundant `&*x.as_str()` pattern in
`parse_file_compression_type`. Rewriting that call to
`&x.unwrap_or_default()` lets every allow come off, removing noise that
new modules were copying without need.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Switch most spark wrappers from UDF-direct path (which forced
`spark_udf_fixed!(name, fn_category::name, args...)` repetition) to a
`spark_expr_fn!` macro that mirrors the existing `expr_fn!` macro in
`functions.rs`, so calls collapse to `spark_expr_fn!(sha2, arg1
bit_length);`.

UDF-direct retained for genuinely variadic functions whose upstream
`expr_fn` wrappers were generated with a single-`Expr` arm by
`export_functions!` (concat, array, xxhash64, parse_url family, etc.) so
that the Python side keeps its `*args` ergonomics.

Aggregates collapse the same way via `spark_aggregate!` mirroring
`aggregate_function!`. Net 173 lines removed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The intro wording implied "SQL functions" only; the same wrappers are the
primary entry point for the DataFrame API as well.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Replace API-speak ("Import the submodule", "Returned values are Expr
instances that compose") with a concrete description of where users can
actually drop these calls.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Hand-maintained category list would drift from the actual module as
upstream `datafusion-spark` adds/removes functions. Replace with a
pointer to the AutoAPI-generated reference, which renders from the
module itself.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
38 wrappers carried `# doctest: +SKIP` because outputs weren't verified at
authoring time. Run each with concrete inputs, capture actual outputs, and
inline the values so the doctests execute and stay correct.

Covers datetime (20), URL (5), bitmap (3), map (3), and remaining hash,
JSON, math, string, conversion, and format_string cases. Net new doctest
coverage: 65 examples now run that were skipped before; total skipped
across the suite drops from 53 to 12.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Align positional parameter names in `functions.spark` with pyspark.sql.functions:
- aggregate first positional → `col` (avg, try_sum, collect_list, collect_set)
- unary `arg` → `col` across math/string/byte/datetime helpers
- multi-arg renames: array_contains (col, value), array (*cols), shuffle (col),
  array_repeat (col, count), slice (x, start, length), shiftleft/right/rightunsigned
  (col, numBits), add_months (start, months), date_add/sub (start, days),
  date_diff (end, start), date_trunc (format, timestamp), time_trunc (unit, time),
  trunc (date, format), next_day (date, dayOfWeek), from/to_utc_timestamp
  (timestamp, tz), sha2 (col, numBits), xxhash64 (*cols), map_from_arrays
  (col1, col2), width_bucket (v, min, max, numBucket), substring (str, pos, len),
  concat (*cols), elt (*inputs), is_valid_utf8/make_valid_utf8 (str)

Bodies updated to reference the new names; positional callers unaffected.
This finishes Category 1 / Category 4 (spark-side BOTH-bucket) renames from
PYSPARK_ALIGNMENT_PLAN.md PR 1.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Match pyspark's optional-parameter surface in the spark namespace:
- make_dt_interval, make_interval: all parts default to zero (int32 0 / lit 0.0)
- str_to_map: pair_delim defaults to ',', key_value_delim defaults to ':'
- round: scale defaults to 0 (HALF_UP rounding to nearest integer)
- shuffle: accepts `seed` kwarg for pyspark parity; raises NotImplementedError
  for non-None values until the Rust binding supports it
- like, ilike: accept `escapeChar` for pyspark parity; same NotImplementedError
  guard; first positional renamed `string` → `str` to match pyspark

ceil/floor `scale=` deferred — the underlying Rust expr_fn is single-arg.

Added a module-level `_ZERO_I32` literal to avoid rebuilding the pyarrow
int32 zero scalar on every call.

Tests: positional-compat coverage for aggregates (`spark.avg(col)` etc.),
defaults-omitted cases for the optional-arg functions, and
NotImplementedError cases for `shuffle(seed=)` and `like/ilike(escapeChar=)`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Replace generic ``*args`` with explicit pyspark-style signatures:
- json_tuple(col, *fields) — first positional is the JSON expr
- format_string(format, *cols) — `format` is the printf template; a plain
  ``str`` is auto-promoted to a literal
- parse_url(url, partToExtract, key=None) — `key` is optional and only
  meaningful with ``partToExtract='QUERY'``
- try_parse_url(url, partToExtract, key=None) — same shape
- url_decode(str), try_url_decode(str), url_encode(str) — single-argument
  forms (multi-arg calls were always semantically wrong)

Tests cover the three-arg parse_url path and the plain-str format_string
auto-promotion.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
`functions.spark` mirrors `pyspark.sql.functions` and now ships on this
branch. Update every skill that references the function surface:

- skills/datafusion_python/SKILL.md (user-facing): add an import
  reference, a Core Abstractions row, and a "Spark-Compatible Functions"
  subsection listing coverage by category, the SQL-vs-DataFrame usage
  (`enable_spark_functions`), and the divergent-semantics table
  (concat NULL, round HALF_UP, trunc) so callers know which namespace
  to pick.
- .ai/skills/check-upstream/SKILL.md: new area for the `datafusion-spark`
  crate with the coverage policy (parity with pyspark, extras allowed
  when positional pyspark calls still work). Hygiene check also now
  spans `functions/spark.py`'s `__all__`.
- .ai/skills/audit-skill-md/SKILL.md: add `functions.spark` to the
  surface table and a `spark-functions` scope so this audit also
  validates the new subsection and divergent-semantics table.
- .ai/skills/make-pythonic/SKILL.md: explicit scope note that the
  spark namespace is a deliberate pyspark mirror — generic native-type
  coercion does not apply there. Path references updated to the new
  `functions/__init__.py` module layout.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Expose Spark functions

1 participant