Preserve OpenAI token text chunks #95

Draft
Mirochill wants to merge 1 commit into freedmand:main from Mirochill:fix-42-openai-token-chunks

Conversation

@Mirochill

Summary

  • Decode OpenAI tokenizer bytes incrementally when building stored text chunks.
  • Preserve UTF-8 characters that span multiple BPE tokens instead of decoding each token in isolation.
  • Keep one text chunk per token so existing offsets and embedding windows continue to align.
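
A minimal sketch of the incremental decoding described above, not the code in this PR's diff. It assumes the raw bytes of each token are already available (with tiktoken these can be obtained via `Encoding.decode_single_token_bytes`); the function name `decode_token_chunks` and the sample byte split are hypothetical and chosen only for illustration.

```python
import codecs
from typing import Iterable, List


def decode_token_chunks(token_bytes: Iterable[bytes]) -> List[str]:
    """Return one text chunk per token, decoding the byte stream incrementally.

    Hypothetical helper: a token that ends in the middle of a UTF-8 character
    yields an empty chunk, and the completed character is attached to the
    token that supplies its final byte.
    """
    # The incremental decoder buffers incomplete multi-byte sequences
    # between calls instead of emitting replacement characters.
    decoder = codecs.getincrementaldecoder("utf-8")(errors="replace")
    chunks = [decoder.decode(piece) for piece in token_bytes]

    # Flush bytes still buffered at the end of input (a truncated character)
    # into the last chunk so the chunk count still equals the token count.
    tail = decoder.decode(b"", final=True)
    if tail and chunks:
        chunks[-1] += tail
    return chunks


# Illustrative split: the first "token" ends partway through a character.
raw = "你好".encode("utf-8")
pieces = [raw[:2], raw[2:], b" ok"]

# Decoding each token in isolation garbles the split character.
isolated = [p.decode("utf-8", errors="replace") for p in pieces]

# Incremental decoding keeps one chunk per token and preserves the text.
chunks = decode_token_chunks(pieces)
```

Because the list length always matches the number of tokens, any offsets or windows that index chunks by token position stay valid; the trade-off is that some chunks are empty strings.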

Fixes #42.

Validation

  • git diff --check HEAD~1..HEAD

Tests not run locally.

Development

Successfully merging this pull request may close these issues.

Garbled characters when using OpenAI models to embed Chinese