[CSV-329] Fix byte tracking for supplementary delimiters #613

Merged
garydgregory merged 2 commits into apache:master from OldTruckDriver:fix/CSV-329_trackbytes_supplementary_delimiter
Jun 19, 2026
@OldTruckDriver

Contributor

[CSV-329] Fix byte tracking for supplementary delimiters

CSVParser with trackBytes enabled could throw CharacterCodingException when a multi-character delimiter contained a supplementary Unicode character. The failure happened while delimiter lookahead read a surrogate pair through ExtendedBufferedReader.read(char[]).

This change updates ExtendedBufferedReader byte-length accounting for char-buffer reads so that each surrogate pair is evaluated against the correct preceding character before lastChar is updated. Byte tracking therefore stays metadata-only and no longer affects whether parsing succeeds.

Tests cover trackBytes=true with a multi-character delimiter containing an emoji, including byte-position tracking across records.
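
For reviewers who want to sanity-check expected byte positions by hand, here is a minimal standalone sketch that uses only the JDK. It is not the test added in this PR. The class name `ExpectedBytePositions`, the helper `recordStartOffsets`, and the sample delimiter in the note below are hypothetical, and UTF-8 with `\n` record separators is assumed.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Commons CSV: computes where each record
// starts in bytes, assuming UTF-8 and '\n' as the record separator.
final class ExpectedBytePositions {

    static List<Long> recordStartOffsets(String input) {
        List<Long> offsets = new ArrayList<>();
        long bytePos = 0;
        int lineStart = 0;
        while (lineStart < input.length()) {
            offsets.add(bytePos);
            int newline = input.indexOf('\n', lineStart);
            int lineEnd = newline < 0 ? input.length() : newline + 1;
            // Splitting at '\n' never cuts a surrogate pair in half,
            // so each substring encodes cleanly.
            bytePos += input.substring(lineStart, lineEnd)
                    .getBytes(StandardCharsets.UTF_8).length;
            lineStart = lineEnd;
        }
        return offsets;
    }
}
```

With a hypothetical delimiter of `|`, U+1F600, `|` (4 chars, 6 UTF-8 bytes), the input `a<delim>b\nc<delim>d\n` has its second record at byte 9, while its char offset is 7. That gap is why byte tracking has to treat the surrogate pair as one encoded unit.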

Tests run:

  • mvn -q -Dtest=org.apache.commons.csv.CSVParserTest#testGetBytePositionMultiCharacterDelimiterWithSupplementaryCharacter test
  • mvn -q -Dtest=org.apache.commons.csv.CSVParserTest,org.apache.commons.csv.ExtendedBufferedReaderTest test
  • mvn -q

@garydgregory

Member

Jira ticket is https://issues.apache.org/jira/browse/CSV-329

@garydgregory garydgregory left a comment

Member

@OldTruckDriver
Thank you for the PR.
Please add a test to ExtendedBufferedReaderTest to help future maintenance.
TY!

ExtendedBufferedReader.read(char[], int, int) updated lastChar before computing the encoded byte length, so a surrogate pair in the delimiter lookahead buffer was paired against the post-update lastChar and threw CharacterCodingException.

Count bytes before updating lastChar, and pair each char against the preceding char in the buffer seeded from lastChar so pairs split across reads still count. Add parser and ExtendedBufferedReader regression tests.
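
The ordering described above can be sketched in isolation as follows. This is a simplified illustration under stated assumptions, not the ExtendedBufferedReader code from this PR: the class `SurrogateByteCounter` and its methods `account`, `encodedLength`, and `bytesRead` are hypothetical names, and wrapping CharacterCodingException in UncheckedIOException is a convenience for the sketch only.

```java
import java.io.UncheckedIOException;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;

// Hypothetical stand-in for the byte-tracking state of a reader.
final class SurrogateByteCounter {

    private final CharsetEncoder encoder;
    private int lastChar = -1; // -1 means nothing has been read yet
    private long bytesRead;

    SurrogateByteCounter(Charset charset) {
        this.encoder = charset.newEncoder();
    }

    // Counts bytes for buf[offset, offset + length) and only then updates lastChar.
    void account(char[] buf, int offset, int length) {
        int previous = lastChar; // seed so a pair split across two reads still matches
        for (int i = offset; i < offset + length; i++) {
            bytesRead += encodedLength(previous, buf[i]);
            previous = buf[i];
        }
        if (length > 0) {
            lastChar = buf[offset + length - 1];
        }
    }

    private int encodedLength(int previous, char current) {
        try {
            if (Character.isHighSurrogate(current)) {
                return 0; // counted together with the low surrogate that follows
            }
            if (Character.isLowSurrogate(current)) {
                if (previous < 0 || !Character.isHighSurrogate((char) previous)) {
                    throw new CharacterCodingException(); // unpaired low surrogate
                }
                char[] pair = {(char) previous, current};
                return encoder.encode(CharBuffer.wrap(pair)).remaining();
            }
            return encoder.encode(CharBuffer.wrap(new char[] {current})).remaining();
        } catch (CharacterCodingException e) {
            throw new UncheckedIOException(e); // sketch-only simplification
        }
    }

    long bytesRead() {
        return bytesRead;
    }
}
```

With UTF-8, feeding `a`, U+1F600, `b` through `account` totals 6 bytes (1 + 4 + 1), whether the surrogate pair arrives in one buffer or is split across two calls. Assigning `lastChar` before the loop instead would pair the low surrogate against the wrong char, which is the failure mode the commit message describes.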

Reviewed-by: OpenAI Codex
Reviewed-by: Anthropic Claude Code
@OldTruckDriver OldTruckDriver force-pushed the fix/CSV-329_trackbytes_supplementary_delimiter branch from adf6397 to 1d89cd5 Compare June 19, 2026 07:22
@OldTruckDriver

Contributor Author

Added a direct ExtendedBufferedReaderTest unit test (testReadingSupplementaryCharacterTracksBytes) asserting the byte count for a supplementary character. Thanks!

@garydgregory garydgregory merged commit 871f745 into apache:master Jun 19, 2026
16 checks passed