feat(isthmus): map REGEXP_EXTRACT to Substrait regexp_match_substring #985

Draft
nielspardon wants to merge 1 commit into substrait-io:main from nielspardon:feat/isthmus-regexp-extract

@nielspardon


What

Maps Calcite's BigQuery-library REGEXP_EXTRACT(value, regexp) operator to Substrait's two-argument regexp_match_substring(input, pattern).

Why

REGEXP_EXTRACT had no entry in FunctionMappings, so SQL using it failed to convert to Substrait (Unable to convert call REGEXP_EXTRACT(...)). The two-argument form lines up directly with the two-argument regexp_match_substring impl in functions_string.yaml.
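
To make the failure mode concrete, here is a toy sketch of a name-based mapping lookup. It is not the isthmus API: the class `MappingLookupSketch` and its `register`/`convert` methods are hypothetical stand-ins, and only the operator name, the target function name, and the error text come from this PR.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical, simplified model of an operator-to-function mapping table.
// The real FunctionMappings class in isthmus is structured differently.
class MappingLookupSketch {
  private final Map<String, String> sqlToSubstrait = new HashMap<>();

  // Record that a SQL operator name maps to a Substrait function name.
  void register(String sqlOperator, String substraitFunction) {
    sqlToSubstrait.put(sqlOperator, substraitFunction);
  }

  // Look up the target function; an operator without an entry cannot be converted.
  String convert(String sqlOperator) {
    String target = sqlToSubstrait.get(sqlOperator);
    if (target == null) {
      throw new IllegalArgumentException("Unable to convert call " + sqlOperator + "(...)");
    }
    return target;
  }

  public static void main(String[] args) {
    MappingLookupSketch mappings = new MappingLookupSketch();
    try {
      mappings.convert("REGEXP_EXTRACT"); // no entry yet, so this throws
    } catch (IllegalArgumentException e) {
      System.out.println(e.getMessage());
    }
    mappings.register("REGEXP_EXTRACT", "regexp_match_substring");
    System.out.println(mappings.convert("REGEXP_EXTRACT"));
  }
}
```

In this model the fix amounts to adding one table entry, which mirrors the shape of the change described here.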

Scope / notes

  • Two-argument form only. Calcite's optional position/occurrence args (3- and 4-arg forms) are not handled: Substrait has no 3- or 4-arg impl, so those forms would need argument padding to the 5-arg impl and are left for a follow-up.
  • Options (case_sensitivity, multiline, dotall) are defaulted automatically by the function matcher, the same way substring's negative_start option is.
  • Semantic caveat: BigQuery REGEXP_EXTRACT returns capture group 1 when the pattern contains a capturing group; the 2-arg Substrait impl returns the full match. Identical for group-less patterns, divergent otherwise. Called out in a code comment.
  • AutomaticDynamicFunctionMappingRoundtripTest had used regexp_match_substring as an example of an unmapped function. Since it is now mapped, the test is repointed at the still-unmapped regexp_count_substring.
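
The semantic caveat above can be illustrated with a small standalone sketch. It uses `java.util.regex` purely for illustration; BigQuery and Substrait consumers may use a different regex engine, and the class and method names (`RegexpExtractCaveat`, `fullMatch`, `bigQueryStyle`) are hypothetical, not part of this PR.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative only: contrasts "full match" with "capture group 1 if present".
class RegexpExtractCaveat {

  // Full-match behavior, as described for the 2-arg regexp_match_substring.
  public static String fullMatch(String input, String regex) {
    Matcher m = Pattern.compile(regex).matcher(input);
    return m.find() ? m.group(0) : null;
  }

  // BigQuery-style behavior as described in this PR: return group 1 when the
  // pattern contains a capturing group, otherwise the full match.
  public static String bigQueryStyle(String input, String regex) {
    Matcher m = Pattern.compile(regex).matcher(input);
    if (!m.find()) {
      return null;
    }
    return m.groupCount() >= 1 ? m.group(1) : m.group(0);
  }

  public static void main(String[] args) {
    String input = "user@example.com";

    // Group-less pattern: both behaviors agree.
    System.out.println(fullMatch(input, "@[a-z]+"));       // @example
    System.out.println(bigQueryStyle(input, "@[a-z]+"));   // @example

    // Pattern with a capturing group: the results diverge.
    System.out.println(fullMatch(input, "@([a-z]+)"));     // @example
    System.out.println(bigQueryStyle(input, "@([a-z]+)")); // example
  }
}
```

Queries whose patterns contain a capturing group are the ones to watch when comparing results between engines.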

Testing

  • New round-trip test in StringFunctionTest over c16/vc32/vc.
  • StringFunctionTest, AutomaticDynamicFunctionMappingRoundtripTest, and FunctionConversionTest pass locally.

🤖 Generated with AI

Calcite's BigQuery-library REGEXP_EXTRACT(value, regexp) operator had no Substrait mapping, so queries using it failed to convert. Map the two-argument form to the two-argument regexp_match_substring(input, pattern) impl, which returns the substring matching the full pattern. The function's options (case_sensitivity, multiline, dotall) are defaulted by the function matcher, the same way substring's negative_start option is handled. Patterns containing a capture group are not handled specially: the full match is returned rather than the captured group.

Also repoints AutomaticDynamicFunctionMappingRoundtripTest, which used regexp_match_substring as an example of an unmapped function; since it is now mapped, the test exercises the still-unmapped regexp_count_substring instead.