fix: Do not escape regex in d2:validatePattern [DHIS2-21359] #100
11ea7dc to aafe0d0
aafe0d0 to 379b3f4
jbee
left a comment
I had a quick look and this looks to me as if we are addressing the issue on the wrong level. The semantics of the string should be transparent until the point you e.g. use it as a pattern for a RegEx. But that does not seem to be the case with what the adjustment does. Also it seems strange to have a set of escapes that are kept while others are stripped. To me this says the handling (character decoding?) on a lower level is off and needs correcting so the output behaves correctly when used later as a regex. If we applied a fix like this, we would move this to a place where by definition it has semantics that support some escaping but not others which generally would be allowed in a standard regex. At least that is what I understand from looking at it. I think this needs more discussion.
That solution was a little bit hacky and it was not clearly addressing the issue.

Problem
Two bugs combined to break d2:validatePattern, one affecting all platforms and one affecting Kotlin/JS only.
Bug 1 — pattern was string-decoded before reaching the regex engine (all platforms)
The regex pattern argument was evaluated with evalToString, which applies full expression-string decoding: every backslash escape is resolved as if it were a string literal. This silently corrupted the pattern passed to the regex engine:

- \d became d (matches the letter d only)
- \w became w
- \- inside [...] became -

The fix is to use evalToRawString, which returns the node's raw value without applying string-escape decoding, so the regex engine receives the pattern as the author intended.
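
The effect of Bug 1 can be reproduced in isolation. The sketch below is not the library's code: decodeEscapes is a hypothetical, simplified stand-in for the string-escape decoding step (it only drops the backslash and keeps the next character), used here to show how a decoded pattern stops matching what the author intended.

```kotlin
// Hypothetical, simplified stand-in for expression-string decoding:
// drops the backslash of every escape and keeps the character after it.
fun decodeEscapes(raw: String): String {
    val out = StringBuilder()
    var i = 0
    while (i < raw.length) {
        val c = raw[i]
        if (c == '\\' && i + 1 < raw.length) {
            out.append(raw[i + 1]) // keep only the escaped character
            i += 2
        } else {
            out.append(c)
            i += 1
        }
    }
    return out.toString()
}

fun main() {
    val raw = """\d{3}"""             // pattern as the author wrote it
    val decoded = decodeEscapes(raw)  // the regex engine would see d{3}

    println("123".matches(Regex(raw)))      // raw pattern: three digits
    println("123".matches(Regex(decoded)))  // decoded pattern: no longer digits
    println("ddd".matches(Regex(decoded)))  // decoded pattern: three letters d
}
```

Under these assumptions the raw pattern accepts "123", while the decoded one rejects it and accepts "ddd" instead, which is the kind of silent corruption described above.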
Bug 2 — raw pattern rejected by the JS regex engine (Kotlin/JS only)
After fixing Bug 1, the raw expression-string value reaches the JS regex engine intact. The raw value preserves expression-language escapes that have no equivalent in the regex spec, such as \', \`, and \. The Kotlin stdlib Regex wrapper applies the JavaScript Unicode mode (u flag) unconditionally. Under Unicode mode, ECMAScript withdraws the Annex B leniency rules and permits only a strict subset of backslash escapes, causing any unknown one to throw a SyntaxError at construction time.

The relevant test case illustrates this. The escapes in the raw pattern string reaching the engine are handled as follows:

| Escape | JVM (java.util.regex) | JS without u flag | JS with u flag (Kotlin stdlib) |
|---|---|---|---|
| \' | ' | ' | SyntaxError: not a syntax character |
| \- | - | - | ClassEscape in Unicode mode |
| \` | ` | ` | SyntaxError: not a syntax character |
| \ | | | SyntaxError: not a syntax character |

There is no API knob on the Kotlin stdlib Regex to suppress the u flag on the JS target.

Solution
Bug 1 is fixed by introducing evalToRawString in Calculator and routing the pattern argument of d2:validatePattern through it instead of evalToString.

Bug 2 is fixed by introducing a matchesPattern expect/actual function:

- commonMain: declares the expect
- jvmMain/nativeMain: delegates to input.matches(pattern.toRegex()) unchanged
- jsMain: constructs the RegExp directly via js(...), bypassing the stdlib wrapper and therefore the u flag; wraps the pattern in ^(?:...)$ to replicate Kotlin's full-string matching semantics

d2_validatePattern now calls matchesPattern instead of input.matches(regex.toRegex()).

Trade-off
Running without the u flag means Unicode property escapes (\p{Lu} etc.) are not interpreted by the JS engine: the pattern is accepted without error, but \p{Lu} behaves as the literal string p{Lu}. For DHIS2's validation patterns (digit ranges, character sets, simple anchors) this is not a concern in practice, and the JVM target continues to interpret them correctly via Java's Pattern.
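
As a side note on the ^(?:...)$ wrapping used by the jsMain actual: the non-capturing group matters whenever the pattern contains a top-level alternation. The sketch below runs on the JVM with Kotlin's Regex rather than a JS RegExp, and matchesWhole and matchesNaive are hypothetical helper names, not functions from this PR. It only illustrates why a search-style match needs the group to behave like a full-string match.

```kotlin
// Hypothetical helper: emulates a full-string match with a search-style
// call by anchoring the pattern inside a non-capturing group.
fun matchesWhole(input: String, pattern: String): Boolean =
    Regex("^(?:" + pattern + ")\$").containsMatchIn(input)

// Naive variant without the group: the anchors bind to the first and last
// alternative only, so "^a|b$" means "starts with a" OR "ends with b".
fun matchesNaive(input: String, pattern: String): Boolean =
    Regex("^" + pattern + "\$").containsMatchIn(input)

fun main() {
    val pattern = "a|b"

    println(matchesWhole("a", pattern))     // whole input is "a"
    println(matchesWhole("ax", pattern))    // trailing "x" is rejected
    println(matchesNaive("ax", pattern))    // accepted by mistake: "^a" matches the prefix
    println("ax".matches(Regex(pattern)))   // reference: Kotlin full-string match
}
```

With the group, the result for "ax" agrees with input.matches(pattern.toRegex()); without it, the input is wrongly accepted.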