FIX: Request SQL_CHAR as SQL_C_WCHAR in arrow fetch path#575
Conversation
There was a problem hiding this comment.
Pull request overview
Updates the Arrow fetch path in the C++ pybind layer to always request SQL_CHAR/SQL_VARCHAR data as SQL_C_WCHAR (UTF-16) so Arrow results are correct regardless of server/client codepage, locale, or platform—addressing the VARCHAR non-ASCII decoding issues reported in #553.
Changes:
- Switch Arrow batch binding/fetching for
SQL_CHAR/SQL_VARCHARfromSQL_C_CHARtoSQL_C_WCHARto avoid codepage-dependent decoding. - Remove the narrow-char copy path for
SQL_CHAR/SQL_VARCHARin Arrow batch production and route through the existing wide-char → UTF-8 conversion logic. - Add an Arrow regression test covering Unicode round-tripping through a UTF-8-collated
VARCHARcolumn.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
mssql_python/pybind/ddbc_bindings.cpp |
Changes Arrow batch binding and fetch handling so VARCHAR is requested as SQL_C_WCHAR, ensuring consistent Unicode correctness. |
tests/test_004_cursor_arrow.py |
Adds a regression test to validate Arrow output for Unicode data stored in VARCHAR with UTF-8 collation. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
|
/azp run |
|
Azure Pipelines successfully started running 1 pipeline(s). |
| } | ||
| break; | ||
| } | ||
| case SQL_LONGVARCHAR: |
There was a problem hiding this comment.
Can we add a comment explaining these now fall through to WCHAR handling because SQL_CHAR is bound as SQL_C_WCHAR at line 4784.
| arrowColumnProducer->varVal[idxRowArrow + 1] = start + dataLen; | ||
| break; | ||
| } | ||
| case SQL_LONGVARCHAR: |
| assert tbl.column(0).to_pylist() == expected | ||
| finally: | ||
| cursor.execute(f"drop table if exists {table}") | ||
|
|
There was a problem hiding this comment.
The fix applies to SQL_CHAR, SQL_VARCHAR, and SQL_LONGVARCHAR, but only VARCHAR is tested. Can we add a test for CHAR (fixed-length) type.
| cursor.execute(f"drop table if exists {table}") | ||
|
|
||
|
|
||
| def test_arrow_varchar_utf8_collation_cp1252(cursor: mssql_python.Cursor): |
There was a problem hiding this comment.
The test uses SQL_Latin1_General_CP1_CI_AS (CP1252), NOT a UTF-8 collation. The current name is slightly misleading, would it be better to name it as test_arrow_varchar_cp1252_collation_unicode?
Work Item / Issue Reference
Summary
Due to #495, we can now request SQL_CHAR data as SQL_C_WCHAR, i.e. utf16le strings. Doing this for the arrow path ensures that arrow methods always return correct data no matter the encoding settings / locale / operating system. There does not seem to be any significant negative performance impact.