Feature Request
OpenKB currently treats images in documents as text-only syntax: the LLM sees the ![]() reference but never the actual image content. This significantly reduces knowledge base quality for technical and scientific documents where figures, diagrams, and charts carry essential information.
Current Behavior
Image path rewriting (images.py)
- copy_relative_images() scans for ![]() references
- Copies referenced image files into wiki/sources/images/<doc_name>/
- Rewrites links to sources/images/<doc_name>/<filename>
- Skips images not found on disk, as well as http/https/data: URIs
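To make the behavior listed above concrete, here is a minimal sketch of what such a rewriting step could look like. It is not the actual images.py code: the function name rewrite_image_links, its parameters, and the regular expression are assumptions made for illustration, and only the target directory layout and the skip rules come from the description above.

```python
import re
import shutil
from pathlib import Path

# Matches ![alt](target) and captures the alt text and the target.
# Link titles and angle-bracketed targets are not handled in this sketch.
IMAGE_REF = re.compile(r"!\[([^\]]*)\]\(([^)\s]+)\)")
REMOTE_PREFIXES = ("http://", "https://", "data:")


def rewrite_image_links(markdown: str, doc_dir: Path, wiki_dir: Path, doc_name: str) -> str:
    """Copy local images to wiki_dir/sources/images/<doc_name>/ and rewrite their links."""
    target_dir = wiki_dir / "sources" / "images" / doc_name

    def replace(match):
        alt, target = match.group(1), match.group(2)
        if target.startswith(REMOTE_PREFIXES):
            return match.group(0)  # remote or inline image: leave untouched
        source = doc_dir / target
        if not source.is_file():
            return match.group(0)  # not found on disk: leave untouched
        target_dir.mkdir(parents=True, exist_ok=True)
        shutil.copy2(source, target_dir / source.name)
        return f"![{alt}](sources/images/{doc_name}/{source.name})"

    return IMAGE_REF.sub(replace, markdown)
```

Note that the output is still plain markdown text: nothing in this step reads or interprets the image bytes.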
During LLM compilation
- The markdown with image references is sent to the LLM as plain text
- The LLM sees ![]() but cannot see the actual image
- No image bytes are ever sent to the LLM
- No vision/multimodal capability is used
Why This Matters
For technical and scientific documents — the kind that benefit most from a knowledge base — figures are often irreplaceable:
- Architecture diagrams: Show signal flow, system topology, protocol stacks
- Tables rendered as images: Contain normative reference data that isn't in the text
- Charts and plots: Performance benchmarks, measurement results
- Schematics: Circuit diagrams, filter responses, encoder block diagrams
A knowledge base that ignores all of this produces summaries and concept articles that are missing critical information. For example, a 3GPP spec document on "Immersive Audio Rendering" might have 15+ figures showing rendering pipelines, binaural processing chains, and speaker layouts — none of which would be captured.
Proposed Solution (Optional / Configurable)
Since not all users need image understanding (and it requires a vision-capable model), this should be opt-in:
- Config flag: image_understanding: true (default: false)
- Detection: During compilation, identify ![]() references in the markdown
- Vision pass: For each referenced image file found on disk, send the image to a vision-capable LLM with a prompt like: "Describe this figure from document {doc_name}. Include: caption, what it depicts, key information conveyed, visible text/labels, related concepts."
- Injection: Prepend the vision-generated description as a text block before the image reference in the prompt sent to the summarization LLM
- Wiki output: Include the description in the generated summary/concept pages alongside the image reference
This approach is framework-agnostic — it works with any vision-capable model (GPT-4o, Claude 3.5+, Gemini, LLaVA via local Ollama, etc.) and doesn't require changes to the wiki output format.
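The detection, vision pass, and injection steps could be sketched roughly as follows. This is an illustration under stated assumptions, not a proposed final implementation: inject_descriptions, to_data_uri, the describe callback, the enabled parameter (standing in for the image_understanding flag), and the "[Figure description: ...]" wrapper are all hypothetical names. It also assumes image links have already been rewritten relative to the wiki directory. The model call is left as a caller-supplied function so that no particular provider API is implied.

```python
import base64
import mimetypes
import re
from pathlib import Path
from typing import Callable

# Matches ![alt](target) and captures the target path.
IMAGE_REF = re.compile(r"!\[[^\]]*\]\(([^)\s]+)\)")

PROMPT = (
    "Describe this figure from document {doc_name}. Include: caption, what it "
    "depicts, key information conveyed, visible text/labels, related concepts."
)


def to_data_uri(path: Path) -> str:
    """Encode an image file as a base64 data URI (one common way to pass image bytes)."""
    mime = mimetypes.guess_type(path.name)[0] or "application/octet-stream"
    payload = base64.b64encode(path.read_bytes()).decode("ascii")
    return f"data:{mime};base64,{payload}"


def inject_descriptions(
    markdown: str,
    wiki_dir: Path,
    doc_name: str,
    describe: Callable[[str, str], str],  # (prompt, image data URI) -> description
    enabled: bool = False,  # stands in for image_understanding; off by default
) -> str:
    """Prepend a vision-generated description before each image found on disk."""
    if not enabled:
        return markdown

    def replace(match):
        image_path = wiki_dir / match.group(1)
        if not image_path.is_file():
            return match.group(0)  # only images found on disk are described
        prompt = PROMPT.format(doc_name=doc_name)
        description = describe(prompt, to_data_uri(image_path))
        return f"[Figure description: {description}]\n{match.group(0)}"

    return IMAGE_REF.sub(replace, markdown)
```

Because the description is injected as text ahead of the unchanged image reference, the summarization LLM needs no vision capability of its own, and the wiki output format stays the same.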
Alternative (Minimal)
If full vision integration is too complex, a simpler approach: add an image_caption_step config that lets the user provide pre-generated captions in a sidecar file (e.g., doc_name.images.yaml), which get injected into the LLM prompt. This avoids the vision dependency entirely while still giving the LLM access to image content descriptions.
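A minimal sketch of the caption injection might look like this. The sidecar schema is not specified in this request, so the code assumes a flat mapping of image filename to caption, already parsed into a dict (for example with a YAML parser such as PyYAML's yaml.safe_load, which is a third-party dependency). The function name inject_captions and the "[Figure caption: ...]" wrapper are hypothetical.

```python
import re
from typing import Mapping

# Matches ![alt](target) and captures the target path.
IMAGE_REF = re.compile(r"!\[[^\]]*\]\(([^)\s]+)\)")


def inject_captions(markdown: str, captions: Mapping[str, str]) -> str:
    """Prepend a caption line before each image reference whose filename has a caption."""

    def replace(match):
        filename = match.group(1).rsplit("/", 1)[-1]
        caption = captions.get(filename)
        if caption is None:
            return match.group(0)  # no caption provided: leave the reference as-is
        return f"[Figure caption: {caption}]\n{match.group(0)}"

    return IMAGE_REF.sub(replace, markdown)


# Example: captions as they might look after parsing doc_name.images.yaml
captions = {"fig1.png": "Binaural rendering pipeline, input to output."}
prompt_text = inject_captions("See ![](sources/images/doc/fig1.png).", captions)
```

Keying by filename keeps the sidecar file independent of where the images end up after path rewriting, at the cost of requiring unique filenames within a document.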
Environment
- Document corpus: 3GPP ATIAS technical specifications (converted PDF → markdown with inline image references)
- Many documents contain critical figures (protocol diagrams, test setups, signal flow charts) that are essential for understanding the content