Preserve OpenAI token text chunks #95

Draft

Mirochill wants to merge 1 commit into freedmand:main from Mirochill:fix-42-openai-token-chunks

Mirochill commented May 27, 2026

Summary

Decode OpenAI tokenizer bytes incrementally when building stored text chunks.
Preserve UTF-8 characters that span multiple BPE tokens instead of decoding each token in isolation.
Keep one text chunk per token so existing offsets and embedding windows continue to align.
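
The approach summarized above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: the function name `decode_token_chunks` is hypothetical, and it assumes the per-token byte strings have already been obtained from the tokenizer (for example via something like tiktoken's `decode_single_token_bytes`, which is an assumption about the setup here). It uses the standard library's incremental UTF-8 decoder, which buffers an incomplete multi-byte sequence until the bytes that finish it arrive.

```python
import codecs
from typing import Iterable, List


def decode_token_chunks(token_bytes: Iterable[bytes]) -> List[str]:
    """Return one text chunk per token, decoding bytes incrementally.

    Hypothetical helper for illustration only. A token that ends in the
    middle of a UTF-8 character yields an empty chunk; the token that
    completes the character yields the whole character.
    """
    # The incremental decoder keeps partial byte sequences between calls.
    decoder = codecs.getincrementaldecoder("utf-8")(errors="replace")
    chunks: List[str] = []
    for piece in token_bytes:
        chunks.append(decoder.decode(piece))

    # Flush any bytes still buffered (a truncated character at the very end)
    # into the last chunk so the chunk count still equals the token count.
    tail = decoder.decode(b"", final=True)
    if tail and chunks:
        chunks[-1] += tail
    return chunks


# The emoji U+1F642 is four UTF-8 bytes, split here across two tokens.
pieces = [b"hi ", b"\xf0\x9f", b"\x99\x82", b"!"]
chunks = decode_token_chunks(pieces)

# Decoding each token in isolation would corrupt the split character.
naive = [p.decode("utf-8", errors="replace") for p in pieces]
```

In this sketch `len(chunks)` always equals the number of tokens, so per-token offsets stay aligned, and joining the chunks reproduces the original text. Attaching the completed character to the token that finishes it is one possible design choice; attaching it to the token that starts it would also keep one chunk per token.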

Fixes #42.

Validation

git diff --check HEAD~1..HEAD

Tests not run locally.


Commit ebb4e7e: Preserve OpenAI token text chunks

