Production-grade Retrieval-Augmented Generation (RAG) system powering the OmniBioAI ecosystem documentation, architecture search, workflow discovery, and developer assistant APIs.
- FAISS-native vector search (IndexFlatIP, 768-dim)
- Incremental per-repo indexing with hash-based dedup
- Ollama local embeddings (nomic-embed-text) + local LLM inference (llama3)
- FastAPI API server
- Real token-level SSE streaming via Ollama stream: true
- Chunk-level document retrieval with source attribution
- Repository-wide multi-project indexing (19 repos)
- Fully local execution — no OpenAI dependency
- Production-safe embedding normalization
- V6 dimension consistency enforcement
Repositories (REPO_BASE/omnibioai-*)
↓
Document Loader (ingestion/doc_loader.py)
↓
Chunker (processing/chunker.py, 500-word windows / 2000-char max)
↓
Ollama Embeddings (nomic-embed-text, 768-d, normalized)
↓
FAISS Vector Index (IndexFlatIP, data/faiss_index/)
↓
RAG Engine (rag/engine.py)
↓
FastAPI API (api/main.py + api/routes/rag.py)
↓
LLM Answer Generation (llama3 via Ollama, blocking or token-streamed)
Previous versions used brute-force cosine scanning across vectors.
V6 uses:
faiss.IndexFlatIP
Benefits:
- 10–50x faster retrieval
- scalable search
- lower latency
- future ANN support
A major issue in previous builds was embedding mismatch.
| Stage | Model | Dimension |
|---|---|---|
| Indexing | all-MiniLM-L6-v2 | 384 |
| Querying | nomic-embed-text | 768 |
This caused FAISS assertion failures:
AssertionError: d == self.d
Both ingestion (scripts/build_index.py) and retrieval (rag/engine.py) now call the same ollama_embed("nomic-embed-text") function, the single source of truth.
embeddings/embedder.py (sentence-transformers, 384-dim) is retained for test coverage only and is not on any live request path.
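A dimension guard makes this kind of mismatch fail early with a readable message instead of a FAISS assertion. The following is a minimal sketch only: `check_embedding_dim` and `EXPECTED_DIM` are hypothetical names, not taken from the repository, and the actual V6 enforcement may be implemented differently.

```python
from typing import Sequence

EXPECTED_DIM = 768  # nomic-embed-text output size, per the table above


def check_embedding_dim(vectors: Sequence[Sequence[float]],
                        expected: int = EXPECTED_DIM) -> None:
    """Raise ValueError if any vector's length differs from the expected dimension."""
    for i, vec in enumerate(vectors):
        if len(vec) != expected:
            raise ValueError(
                f"vector {i} has dimension {len(vec)}, expected {expected}"
            )
```

Calling a check like this both before adding vectors to the index and before searching would surface a 384-vs-768 mismatch at the call site.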
omnibioai-dev-hub/
│
├── api/
│ ├── main.py # FastAPI app, startup, /status, /health
│ └── routes/
│ └── rag.py # /rag/query and /rag/stream endpoints
│
├── rag/
│ ├── engine.py # RAGEngine: retrieve, build_context, answer, stream_llm
│ └── control_plane.py # Singleton lifecycle manager
│
├── index/
│ ├── vector_store.py # FAISS wrapper (add, search, save, load)
│ ├── graph_store.py # In-memory knowledge graph (BFS expansion)
│ └── plugin_index.py # Plugin doc registry
│
├── embeddings/
│ └── embedder.py # SentenceTransformer wrapper (tests only, not live)
│
├── retrieval/
│ └── retriever.py # Retriever class (tests only, not live)
│
├── ingestion/
│ └── doc_loader.py # Markdown document loader with SKIP_DIRS/SKIP_PATH_SEGMENTS
│
├── processing/
│ └── chunker.py # 500-word / 2000-char chunker
│
├── scripts/
│ └── build_index.py # Index builder entry point
│
├── data/
│ └── faiss_index/ # index.faiss + metadata.pkl (gitignored)
│
├── configs/
│ └── repos.yaml # Repo list documentation
│
└── .env.example # Environment variable template
Recommended:
Python 3.11 or 3.12
Note: Python 3.13 removes numpy.distutils, breaking faiss-cpu. Use 3.11 or 3.12. In Docker the provided image uses Python 3.12.
Install:
curl -fsSL https://ollama.com/install.sh | sh
ollama pull nomic-embed-text
The system uses llama3 by default (hardcoded in rag/engine.py):
ollama pull llama3
Optional alternatives (change model= in engine.py if needed):
ollama pull mistral
ollama pull deepseek-coder
conda create -n omnibioai-dev-hub python=3.12 -y
conda activate omnibioai-dev-hub
pip install fastapi uvicorn requests numpy faiss-cpu sentence-transformers
sentence-transformers is required for cross-encoder reranking (rerank=True in retrieve()). It is also used by the test suite. If reranking is not needed you can omit it; the engine degrades gracefully to FAISS order when the model is unavailable.
REPO_BASE is the only required configuration. It must point to the directory that contains the omnibioai-* repos as immediate children.
export REPO_BASE=/home/manish/Desktop/machine
The indexer exits immediately with a clear error if no repos are found under REPO_BASE:
❌ No repos found under REPO_BASE='/some/wrong/path'
Set REPO_BASE to the directory that contains the omnibioai-* repos.
Example: export REPO_BASE=/home/manish/Desktop/machine
Docker: -e REPO_BASE=/repos (with repos volume mounted at /repos)
In Docker the image sets ENV REPO_BASE=/repos automatically — no action needed.
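The REPO_BASE check described above can be pictured roughly as follows. This is a sketch under assumptions: `find_repos` and `require_repos` are hypothetical helpers, and matching on the `omnibioai` name prefix (so that both `omnibioai` and `omnibioai-*` qualify) is a guess at how the real indexer discovers repos.

```python
import sys
from pathlib import Path


def find_repos(repo_base: str) -> list[Path]:
    """Return omnibioai* directories that are immediate children of repo_base."""
    base = Path(repo_base)
    if not base.is_dir():
        return []
    return sorted(
        p for p in base.iterdir()
        if p.is_dir() and p.name.startswith("omnibioai")
    )


def require_repos(repo_base: str) -> list[Path]:
    """Exit with an error message when repo_base contains no matching repos."""
    repos = find_repos(repo_base)
    if not repos:
        sys.exit(f"No repos found under REPO_BASE={repo_base!r}")
    return repos
```

Only immediate children are inspected, which matches the requirement that REPO_BASE point at the parent directory rather than at a single repo.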
Copy .env.example to .env for local development:
cp .env.example .env
# edit REPO_BASE as needed
rm -rf data/faiss_index/*
python scripts/build_index.py
Expected output:
🚀 Incremental V6 Indexing Starting...
⚠️ 2 repos not found, will be skipped: ['omnibioai-security-audit', 'omnibioai-hpc-policy-engine']
📄 Loaded N documents
...
💾 Index saved to .../data/faiss_index (10877 vectors)
✅ V6 Index Complete
{'too_short': 4, 'deduped': 31, 'embed_failed': 4, 'new': 10881, 'chunks_indexed': 10877}
seen_hashes is reset per repo so cross-repo identical chunks each get their own index entry under their canonical source path. Within a single repo, duplicate chunks (e.g. shared boilerplate across plugin READMEs) are deduplicated.
Chunks shorter than 10 characters are discarded (MIN_CHUNK_CHARS = 10) to eliminate overflow tails produced by the hard character-slice in chunker.py.
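The per-repo dedup and short-chunk filter might be sketched like this. The hash function (SHA-256 over the chunk text) and the function name `dedup_chunks` are assumptions for illustration; only `seen_hashes`, `MIN_CHUNK_CHARS`, and the per-repo reset come from the description above.

```python
import hashlib

MIN_CHUNK_CHARS = 10  # threshold quoted in the text above


def dedup_chunks(chunks_by_repo: dict[str, list[str]]) -> tuple[list[dict], dict[str, int]]:
    """Filter short chunks and drop duplicates within each repo (not across repos)."""
    kept = []
    stats = {"too_short": 0, "deduped": 0}
    for repo, chunks in chunks_by_repo.items():
        seen_hashes = set()  # reset per repo, so cross-repo duplicates survive
        for text in chunks:
            if len(text) < MIN_CHUNK_CHARS:
                stats["too_short"] += 1
                continue
            digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
            if digest in seen_hashes:
                stats["deduped"] += 1
                continue
            seen_hashes.add(digest)
            kept.append({"repo": repo, "text": text, "hash": digest})
    return kept, stats
```

Because the set is recreated for every repo, the same boilerplate paragraph appearing in two repos yields two index entries, one per canonical source.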
ingestion/doc_loader.py skips the following during the walk:
By directory name (SKIP_DIRS):
SKIP_DIRS = {".git", "__pycache__", "node_modules", ".venv", "venv", ".pytest_cache", "obsolete"}
Note: both .venv and bare venv are excluded. .pytest_cache is excluded because it contains auto-generated README.md stubs (present in 17 of the 19 repos) that would otherwise pollute the index with boilerplate.
By path segment (SKIP_PATH_SEGMENTS):
SKIP_PATH_SEGMENTS = {"work"}
This excludes omnibioai/work/, which contains UUID-named runtime copies of workflow bundle READMEs. Without this exclusion, those copies would claim index slots before the canonical omnibioai-workflow-bundles/ paths are processed, causing all 50 bundles to appear indexed under omnibioai/work/ instead.
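A walk that honours both skip sets could look like the sketch below. The function name `iter_markdown_files` is hypothetical, and the way ingestion/doc_loader.py actually applies the two sets may differ; this version prunes SKIP_DIRS during the walk and rejects any directory whose relative path contains a SKIP_PATH_SEGMENTS entry.

```python
import os

SKIP_DIRS = {".git", "__pycache__", "node_modules", ".venv", "venv", ".pytest_cache", "obsolete"}
SKIP_PATH_SEGMENTS = {"work"}


def iter_markdown_files(repo_root: str):
    """Yield .md paths under repo_root, honouring both skip sets."""
    for dirpath, dirnames, filenames in os.walk(repo_root):
        # Prune in place so os.walk never descends into skipped directories.
        dirnames[:] = [d for d in dirnames if d not in SKIP_DIRS]
        rel_parts = os.path.relpath(dirpath, repo_root).split(os.sep)
        if SKIP_PATH_SEGMENTS.intersection(rel_parts):
            continue  # anything at or below a skipped segment is ignored
        for name in sorted(filenames):
            if name.lower().endswith(".md"):
                yield os.path.join(dirpath, name)
```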
As of 2026-06-14 (Phase 3 — metadata filtering + cross-encoder reranking):
| Metric | Value |
|---|---|
| Total vectors | 10,877 |
| Unique source files | ~965 |
| Repos indexed | 17 of 19 |
| Workflow bundles covered | 50 / 50 |
| Chunks filtered (too short) | 4 |
| Deduped (within-repo) | 31 |
| Recall@5 (eval set) | 96.8% (30/31) |
| Eval set size | 31 queries |
Chunk metadata fields: Each indexed chunk carries text, source, hash, repo (basename of the containing repository directory), and bundle (first subdirectory under the repo root, or None for root-level files). The repo and bundle fields enable post-filtered scoped queries that restrict FAISS candidates to a specific repository or workflow bundle.
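The repo and bundle fields can be derived from the source path alone. The helper below is a sketch with a hypothetical name (`repo_and_bundle`); it assumes POSIX-style paths and that every source lies under REPO_BASE.

```python
from pathlib import PurePosixPath
from typing import Optional


def repo_and_bundle(source: str, repo_base: str) -> tuple[str, Optional[str]]:
    """Derive (repo, bundle) from a source path located under repo_base."""
    rel = PurePosixPath(source).relative_to(PurePosixPath(repo_base))
    parts = rel.parts
    repo = parts[0]
    # A file directly in the repo root has exactly two parts: repo and filename.
    bundle = parts[1] if len(parts) > 2 else None
    return repo, bundle
```

For example, `/repos/omnibioai-workflow-bundles/metagenomics/README.md` with a base of `/repos` gives `("omnibioai-workflow-bundles", "metagenomics")`, while a root-level README gives a bundle of `None`.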
Chunking strategy: Markdown-structure-aware. The chunker splits on H1/H2/H3 headers as natural section boundaries, never splits inside fenced code blocks, falls back to paragraph boundaries (\n\n) for long sections, and uses word-boundary splitting for oversized paragraphs. Each chunk is prefixed with its ancestor header breadcrumb (e.g. # ATACseq Pipeline > ## Parameters) so retrieval context carries section identity. The previous fixed 500-word window chunker produced 2,067 vectors; the markdown chunker produces 10,877 (5.3× more granular chunks).
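The header-splitting and breadcrumb steps can be illustrated with a simplified sketch. It is not the code in processing/chunker.py: `split_by_headers` is a hypothetical name, and the paragraph and word-boundary fallbacks for long sections are deliberately left out.

```python
import re

HEADER_RE = re.compile(r"^(#{1,3})\s+(.*)$")


def split_by_headers(markdown: str) -> list[str]:
    """Split on H1-H3 headers outside code fences; prefix each chunk with its breadcrumb."""
    chunks = []
    stack = []  # (level, header line) ancestors of the current section
    body = []   # lines collected for the current section
    in_fence = False

    def flush():
        text = "\n".join(body).strip()
        if text:
            breadcrumb = " > ".join(h for _, h in stack)
            chunks.append(f"{breadcrumb}\n{text}" if breadcrumb else text)
        body.clear()

    for line in markdown.splitlines():
        if line.lstrip().startswith("```"):
            in_fence = not in_fence  # toggle on every fence marker
        match = None if in_fence else HEADER_RE.match(line)
        if match:
            flush()
            level = len(match.group(1))
            # Drop headers at the same or deeper level before pushing the new one.
            while stack and stack[-1][0] >= level:
                stack.pop()
            stack.append((level, line.strip()))
        else:
            body.append(line)
    flush()
    return chunks
```

Lines starting with `#` inside a fenced block (shell comments, for instance) stay in the body because header matching is disabled while a fence is open.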
Remaining failure: The one unscoped failure (metagenomics shotgun profiling vs. microbiome/kraken sub-workflows) passes with either bundle="metagenomics" scoped filtering or rerank=True cross-encoder reranking.
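For reference, Recall@5 figures like the one in the table can be computed with a few lines. The sketch assumes a query counts as a hit when its expected source file appears among the top-k retrieved sources; the project's actual eval harness and its hit criterion are not shown here, and `recall_at_k` is a hypothetical name.

```python
def recall_at_k(results: dict[str, list[str]], expected: dict[str, str], k: int = 5) -> float:
    """Fraction of queries whose expected source appears in the top-k retrieved sources."""
    hits = sum(
        1 for query, source in expected.items()
        if source in results.get(query, [])[:k]
    )
    return hits / len(expected)
```

With 30 hits out of 31 queries this returns roughly 0.968, matching the 96.8% reported above.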
Missing repos (not present on disk): omnibioai-security-audit, omnibioai-hpc-policy-engine. The indexer skips them with a warning and continues.
Note: data/faiss_index/ is excluded from git (see .gitignore). Regenerate with python scripts/build_index.py.
uvicorn api.main:app --host 0.0.0.0 --port 8082 --reload
In Docker the server starts automatically as PID 1.
curl -X POST http://localhost:8082/rag/query \
-H "Content-Type: application/json" \
-d '{"query":"What is workflow engine in OmniBioAI?"}'
Example response:
{
"query": "What is workflow engine in OmniBioAI?",
"answer": "According to the provided context...",
"sources": [
"/repos/omnibioai-workflow-bundles/README.md"
],
"context": [
{"score": 0.87, "text": "...", "source": "/repos/omnibioai-workflow-bundles/README.md"}
],
"context_used": 5,
"version": "v6-faiss",
"api_version": "v6"
}
Both /rag/query and /rag/stream accept optional repo and bundle parameters to post-filter FAISS candidates to a specific repository or workflow bundle:
| Parameter | Type | Description |
|---|---|---|
| repo | string (optional) | Restrict results to chunks from this repository (e.g. "omnibioai-model-registry") |
| bundle | string (optional) | Restrict results to chunks from this workflow bundle subdirectory (e.g. "metagenomics") |
When both are provided, bundle takes priority. Omitting both parameters performs a global unscoped search.
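The precedence rule can be expressed as a small post-filter over retrieved candidates. This is a sketch with a hypothetical function name (`filter_candidates`); it assumes each candidate is a dict carrying the repo and bundle metadata fields.

```python
from typing import Optional


def filter_candidates(candidates: list[dict], repo: Optional[str] = None,
                      bundle: Optional[str] = None) -> list[dict]:
    """Post-filter retrieved chunks by metadata; bundle wins when both are given."""
    if bundle is not None:
        return [c for c in candidates if c.get("bundle") == bundle]
    if repo is not None:
        return [c for c in candidates if c.get("repo") == repo]
    return list(candidates)  # unscoped: keep everything
```

Since filtering happens after the FAISS search, a scoped query generally needs to over-fetch candidates so that enough of them survive the filter.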
curl -X POST http://localhost:8082/rag/query \
-H "Content-Type: application/json" \
-d '{"query": "shotgun profiling steps", "bundle": "metagenomics"}'
Endpoint:
POST /rag/stream
Body: {"query": "your question"} — also accepts optional repo and bundle parameters (same semantics as /rag/query).
- The server retrieves the top-5 chunks from FAISS (same path as /rag/query).
- It builds the prompt from the retrieved context.
- It calls Ollama's /api/generate with "stream": true.
- Ollama returns newline-delimited JSON (NDJSON), one object per token:
  {"model":"llama3","response":"Based","done":false}
  {"model":"llama3","response":" on","done":false}
  ...
  {"model":"llama3","response":"","done":true}
- Each token is immediately forwarded as an SSE event:
  data: {"type": "token", "content": "Based"}
  data: {"type": "token", "content": " on"}
  ...
  data: {"type": "done"}
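The NDJSON-to-SSE translation in the last two steps might look like the generator below. It is a sketch, not the code in rag/engine.py: `ndjson_to_sse` is a hypothetical name, and it takes any iterable of byte lines so it can be exercised without a running Ollama server.

```python
import json
from typing import Iterable, Iterator


def ndjson_to_sse(lines: Iterable[bytes]) -> Iterator[str]:
    """Translate Ollama-style NDJSON lines into SSE 'data:' events."""
    for raw in lines:
        if not raw.strip():
            continue  # skip blank keep-alive lines
        obj = json.loads(raw)
        token = obj.get("response", "")
        if token:
            yield "data: " + json.dumps({"type": "token", "content": token}) + "\n\n"
        if obj.get("done"):
            yield "data: " + json.dumps({"type": "done"}) + "\n\n"
            return
```

In a live server the input would plausibly be the line iterator of a streaming HTTP response, and the generator would be returned through a streaming response with the `text/event-stream` media type.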
ragStream(
"your query",
(token) => appendToUI(token), // called per token
() => markComplete(), // called on done
(err) => showError(err) // called on error
);
The UI client (src/api/client.ts) parses the data: envelope and dispatches type: "token" and type: "done" events.
The retrieval pipeline supports optional cross-encoder reranking:
docs = engine.retrieve(query, top_k=5, rerank=True)
Model: cross-encoder/ms-marco-MiniLM-L-6-v2 (via sentence-transformers)
How it works:
- FAISS retrieves top_k × 3 candidates (15 for the default top_k=5).
- The cross-encoder scores each (query, chunk) pair jointly, capturing fine-grained relevance that the bi-encoder embedding alone misses.
- The top top_k by cross-encoder score are returned, each with a ce_score field attached.
Graceful degradation: If sentence-transformers is not installed or the model fails to load, rerank() logs a warning and falls back to the original FAISS order without raising an error. No code change needed.
When to use: Unscoped queries where semantic collision is likely — for example, when a broad term (e.g. "Kraken2") matches many sub-workflow chunks and the pipeline overview you actually want is pushed below rank 5. For narrowly scoped queries (with repo or bundle), the candidate pool is already restricted and reranking adds latency with little benefit.
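The rerank-then-truncate flow, including the fallback, can be sketched with a pluggable scorer. The signature below is an assumption and differs from the engine's real rerank(); `score_pair` stands in for the cross-encoder so the example runs without downloading a model.

```python
from typing import Callable, Optional


def rerank(query: str, candidates: list[dict], top_k: int,
           score_pair: Optional[Callable[[str, str], float]]) -> list[dict]:
    """Re-order candidates by a pairwise scorer; fall back to input order without one."""
    if score_pair is None:
        return candidates[:top_k]  # degraded mode: keep the original FAISS order
    scored = [{**c, "ce_score": score_pair(query, c["text"])} for c in candidates]
    scored.sort(key=lambda c: c["ce_score"], reverse=True)
    return scored[:top_k]
```

In practice the scorer would wrap the cross-encoder model named above, and `candidates` would hold the top_k × 3 FAISS hits.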
The indexer targets 19 repositories. All paths are relative to REPO_BASE:
repos = [
"omnibioai", # main platform repo
"omnibioai-rag",
"omnibioai-toolserver",
"omnibioai-sdk",
"omnibioai-workflow-bundles", # 50 workflow bundle subdirectories
"omnibioai-control-center",
"omnibioai-lims",
"omnibioai-model-registry",
"omnibioai-dev-docker",
"omnibioai-api-gateway",
"omnibioai-docs",
"omnibioai-studio",
"omnibioai-auth",
"omnibioai-tool-runtime",
"omnibioai-iam-client",
"omnibioai-policy-engine",
"omnibioai-security-sdk",
"omnibioai-security-audit", # not present on disk — skipped with warning
"omnibioai-hpc-policy-engine", # not present on disk — skipped with warning
]
The two missing repos are skipped gracefully at build time. When they become available, no code change is needed; just add them to the directory.
Documents are split using a 500-word sliding window, hard-capped at 2000 characters per chunk. Chunks shorter than 10 characters are discarded.
Known limitation: The hard character slice at 2000 chars does not snap to word boundaries, so it can produce 1–2 word overflow fragments (e.g. "onment", "abases"). At 16–18 characters these fragments are above the 10-char filter threshold and therefore stay in the index. A future fix will snap the slice to the nearest word boundary. See Future Work below.
Each chunk is embedded using:
nomic-embed-text (768-dim, L2-normalized)
Vectors are stored in:
faiss.IndexFlatIP
Pre-normalized vectors make inner product equivalent to cosine similarity.
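The equivalence is easy to check numerically: for unit-length vectors the cosine denominator is 1, so only the dot product remains. The snippet is a standalone illustration in plain Python with helper names chosen for this example; it does not use FAISS.

```python
import math


def dot(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def l2_normalize(vec: list[float]) -> list[float]:
    """Scale vec to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(dot(vec, vec))
    return [x / norm for x in vec] if norm else list(vec)


def cosine(a: list[float], b: list[float]) -> float:
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


a, b = [3.0, 4.0], [1.0, 2.0]
inner = dot(l2_normalize(a), l2_normalize(b))  # what an inner-product index would score
assert abs(inner - cosine(a, b)) < 1e-12
```

This is why normalization has to happen consistently at both index time and query time: an unnormalized query would turn the score back into a raw inner product.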
User query is embedded using the same nomic-embed-text model at query time.
FAISS retrieves the top-5 nearest chunks (configurable via top_k).
Retrieved chunks become the context block in a structured prompt.
Prompt sent to local Ollama llama3 model, either blocking (/rag/query) or token-streamed (/rag/stream).
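To make the context-assembly step concrete, here is one way a prompt could be built from retrieved chunks. The template wording and the `build_prompt` name are purely illustrative assumptions; the actual prompt used by rag/engine.py is not reproduced in this document.

```python
def build_prompt(query: str, docs: list[dict]) -> str:
    """Assemble a context block with numbered, source-attributed chunks."""
    blocks = [
        f"[{i}] (source: {d['source']})\n{d['text']}"
        for i, d in enumerate(docs, start=1)
    ]
    context = "\n\n".join(blocks)
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}\nAnswer:"
    )
```

Keeping the source path next to each chunk is what allows the response to list its sources alongside the answer.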
- brute-force cosine scan
- slow retrieval
- embedding dimension mismatch
- unstable indexing
- global dedup silencing canonical paths
- FAISS-native inner product retrieval
- stable 768-dim pipeline end-to-end
- per-repo dedup with canonical path preservation
- real SSE token streaming
- local-only execution
Symptom:
❌ No repos found under REPO_BASE='/repos'
Cause: REPO_BASE defaults to /home/manish/Desktop/machine locally and /repos in Docker. If neither is correct, set it explicitly:
export REPO_BASE=/path/to/parent/of/omnibioai-repos
python scripts/build_index.py
Cause: The index on disk was built with a different embedding model or dimension. The loaded vectors won't match query vectors.
Fix: Delete and rebuild:
rm data/faiss_index/index.faiss data/faiss_index/metadata.pkl
python scripts/build_index.py
Error:
AssertionError: d == self.d
Cause: Different embedding models used during indexing vs querying (the pre-V6 problem). Rebuild the index — V6 enforces 768-dim at both stages.
Error:
Read timed out
Fix: Use a smaller generation model. Change the model= argument in rag/engine.py:
model="mistral" # faster than llama3 on smaller hardware
Verify the index loaded correctly:
python -c "
from index.vector_store import VectorStore
vs = VectorStore()
vs.load('data/faiss_index')
print('ntotal:', vs.index.ntotal if vs.index else 'no index')
"
Expected: ntotal: 10877 (or your current count).
chunker.py slices at a hard 2000-character boundary without snapping to word boundaries. This produces 1–2 word fragments at the tail of some documents (e.g. "onment", "abases" — 16–18 chars, above the MIN_CHUNK_CHARS=10 filter). These fragments are harmless but pollute the index with low-information chunks. The fix is to snap the slice to the nearest preceding space.
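The proposed fix can be sketched as follows. This is an illustration of the snapping idea only, with a hypothetical function name (`split_snapped`); it is not the current chunker.py code, and it falls back to a hard cut when a window contains no space.

```python
def split_snapped(text: str, max_chars: int = 2000) -> list[str]:
    """Split text into pieces of at most max_chars, cutting at the last space when possible."""
    pieces = []
    start = 0
    while start < len(text):
        end = min(start + max_chars, len(text))
        if end < len(text):
            # Look for the last space at or before the cut position.
            space = text.rfind(" ", start, end + 1)
            if space > start:
                end = space  # snap back to the preceding word boundary
        piece = text[start:end].strip()
        if piece:
            pieces.append(piece)
        start = end
    return pieces
```

Each piece then ends on a whole word, so tail fragments such as a trailing "onment" no longer appear as separate chunks.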
- Chunker word-boundary snapping
- IVF or HNSW indexes for million-scale corpora
- Hybrid BM25 + vector search
- Persistent storage and distributed / incremental index updates
- Graph RAG (graph store already seeded)
- Plugin-aware retrieval
Internal OmniBioAI Development License.
RAG V6 powers:
- architecture discovery
- workflow documentation search
- plugin documentation retrieval
- developer assistant APIs
- AI infrastructure exploration
- cross-repository semantic search
- internal engineering copilots