Index Office Open XML documents #96

Draft
Mirochill wants to merge 1 commit into freedmand:main from Mirochill:fix-23-office-ooxml-text

@Mirochill

Summary

  • Add stdlib-based text extraction for Office Open XML .docx and .pptx files.
  • Feed extracted Office text through the existing tokenization and embedding pipeline.
  • Document the supported Office formats in the README.
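
For readers unfamiliar with the format, here is a minimal sketch of what stdlib-only extraction from .docx and .pptx files can look like. It is not the code from this PR: the helper names `extract_ooxml_text` and `_paragraph_texts` are hypothetical, and the sketch assumes the conventional part names `word/document.xml` and `ppt/slides/slideN.xml` rather than resolving them through the package relationships.

```python
import re
import xml.etree.ElementTree as ET
import zipfile

# Namespaces for WordprocessingML (.docx) and DrawingML (text runs in .pptx).
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
A_NS = "http://schemas.openxmlformats.org/drawingml/2006/main"

# Assumption: slides live at the conventional path ppt/slides/slideN.xml.
SLIDE_RE = re.compile(r"ppt/slides/slide(\d+)\.xml")


def _paragraph_texts(xml_bytes, ns):
    """Return the text of each non-empty <p> element in the given namespace."""
    root = ET.fromstring(xml_bytes)
    paragraphs = []
    for p in root.iter(f"{{{ns}}}p"):
        # Join all <t> runs inside the paragraph into one string.
        text = "".join(t.text or "" for t in p.iter(f"{{{ns}}}t"))
        if text.strip():
            paragraphs.append(text)
    return paragraphs


def extract_ooxml_text(source):
    """Extract plain text from a .docx or .pptx path or file-like object.

    Hypothetical helper for illustration only.
    """
    with zipfile.ZipFile(source) as zf:
        names = zf.namelist()
        if "word/document.xml" in names:
            return "\n".join(_paragraph_texts(zf.read("word/document.xml"), W_NS))

        # Sort slides numerically so slide10 comes after slide2.
        slides = sorted(
            (int(m.group(1)), name)
            for name in names
            if (m := SLIDE_RE.fullmatch(name))
        )
        if slides:
            paragraphs = []
            for _, name in slides:
                paragraphs.extend(_paragraph_texts(zf.read(name), A_NS))
            return "\n".join(paragraphs)

    raise ValueError("not a recognized .docx or .pptx package")
```

The returned string can then be handed to whatever tokenization step already handles plain text. This sketch ignores headers, footers, footnotes, and speaker notes, and it may count text twice where a paragraph is nested inside another (for example, text boxes in a .docx).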

Refs #23.

Validation

  • git diff --check HEAD~1..HEAD

Tests not run locally.
