PixelRAG is a lightweight Visual Retrieval-Augmented Generation (RAG) pipeline that does not require a heavy GPU. Traditional RAG extracts plain text and loses tables, formatting, and structural context; PixelRAG instead renders documents into visual page tiles, retrieves the most relevant tile, and answers queries directly from the visual content.
This project is a minimal, production-ready implementation that queries the official live PixelRAG hosted API (which indexes 8.28M Wikipedia articles as screenshot tiles), downloads the matching high-resolution visual tile, and answers queries using a local Ollama LLM.
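The three stages described above (search, tile download, local generation) can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in pixelrag_mvp.py: the names `answer_query`, `search`, `fetch_tile`, and the result keys `tile_url` and `title` are hypothetical, and the PixelRAG API call is left as an injected function because its endpoint is not given here. The Ollama request uses Ollama's standard local `/api/generate` endpoint on port 11434.

```python
import base64
import json
import urllib.request
from typing import Callable, Optional

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint


def build_ollama_payload(model: str, prompt: str, image: Optional[bytes] = None) -> dict:
    """Build the JSON body for a non-streaming /api/generate request."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    if image is not None:
        # Vision-capable models accept base64-encoded images under "images".
        payload["images"] = [base64.b64encode(image).decode("ascii")]
    return payload


def ollama_generate(model: str, prompt: str, image: Optional[bytes] = None) -> str:
    """Send one generation request to a locally running Ollama server."""
    body = json.dumps(build_ollama_payload(model, prompt, image)).encode("utf-8")
    request = urllib.request.Request(
        OLLAMA_URL, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request, timeout=120) as response:
        return json.loads(response.read())["response"]


def answer_query(
    query: str,
    search: Callable[[str], dict],              # hypothetical: returns the best hit
    fetch_tile: Callable[[str], bytes],         # hypothetical: downloads the tile image
    generate: Callable[[str, bytes], str],      # e.g. a wrapper around ollama_generate
) -> str:
    """Run search -> tile download -> generation for a single query."""
    hit = search(query)                         # assumed keys: "tile_url", "title"
    tile = fetch_tile(hit["tile_url"])
    prompt = f"Answer using the retrieved page '{hit['title']}'.\nQuestion: {query}"
    return generate(prompt, tile)
```

With a running Ollama server, `generate` could be `lambda prompt, tile: ollama_generate("gemma4:e2b", prompt, tile)`. Whether the tile image or only accompanying text is passed to the model depends on the model's capabilities, so treat the `image` argument as optional.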
Follow these exact steps to set up, install, run, and test the project.
Ensure you have Python 3.10+ installed and Ollama running. Download and verify the required Ollama models:
ollama pull gemma4:e2b
Navigate to the project directory, then create a virtual environment and install the dependencies:
python -m venv venv
.\venv\Scripts\Activate.ps1
pip install -r requirements.txt
Run the script to perform the search and write the final execution log to ideal_output.md:
python app.py
- 📝 README.md: Project setup instructions, technical specifications, and use cases.
- ⚙️ requirements.txt: Pinned list of minimal, fast-installing dependencies.
- ⚡ pixelrag_mvp.py: Core Visual RAG engine (37 lines). Queries the live Wikipedia PixelRAG API, downloads the tile, and runs terminal retrieval.
- 📊 app.py: Report generator (9 lines). Calls the core engine and generates the matching ideal_output.md report.
- 📄 ideal_output.md: Generated report showing the exact retrieval logs and answer.
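As an illustration of the report-generation step, here is a minimal sketch of how a run could be written to a Markdown file. The function names `render_report` and `write_report` and the report layout are assumptions made for this example; the actual format of ideal_output.md is not reproduced here.

```python
from pathlib import Path


def render_report(query: str, tile_path: str, answer: str) -> str:
    """Render one retrieval run as a small Markdown document (assumed layout)."""
    lines = [
        "# PixelRAG run",
        "",
        f"- Query: {query}",
        f"- Retrieved tile: {tile_path}",
        "",
        "## Answer",
        "",
        answer.strip(),
        "",  # ensures the file ends with a newline
    ]
    return "\n".join(lines)


def write_report(path: str, query: str, tile_path: str, answer: str) -> Path:
    """Write the rendered report to `path` and return it as a Path."""
    target = Path(path)
    target.write_text(render_report(query, tile_path, answer), encoding="utf-8")
    return target
```

Keeping rendering separate from file I/O makes the report text easy to check without touching the disk.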
- 📊 Financial Auditing: Retrieving visual cells/tables from complex financial statements where text parsers merge columns and scramble tabular data.
- 🗂️ Slide Deck Content QA: Slicing presentation slides into quadrant tiles to query structural infographics, bullet lists, and visual diagrams.
- 📐 Engineering Blueprint Queries: Slicing high-resolution schematics into spatial grid tiles and retrieving specific sub-components based on user queries.
- 📸 Web Screenshot RAG: Visualizing complex dashboard layouts by rendering page snapshots, preserving visual context (headers, sidebars, charts) during retrieval.
- 🔬 Research Paper Navigation: Querying multi-column scientific publications (e.g. arXiv PDFs) where text flow is often interrupted by floating figures or footnotes.
- 👁️ Native Visual Embedding Integration: Support for native vision-language embeddings like Qwen3-VL-Embedding to replace the text OCR fallback entirely.
- 🌐 Dynamic Playwright Web Capturing: Direct web page rendering to visual tiles using headless Playwright with custom viewports.
- 🧠 Ollama Multimodal Generation: Direct image-based question answering using Ollama vision models (e.g. minicpm-v4.6 or llava).
- ✂️ Overlapping Sliding-Window Tiling: Implementing overlapping grids to avoid cutting words or images at tile boundaries.
- 🔍 Hierarchical Visual Retrieval: A multi-stage search that retrieves the whole page first, then zooms in on the most relevant sub-tile.
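To make the overlapping sliding-window idea from the list above concrete, here is a minimal sketch that computes tile boxes for a page of a given pixel size. The function names `axis_starts` and `tile_boxes` and the square-tile assumption are illustrative choices, not part of the current project.

```python
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (left, top, right, bottom) in pixels


def axis_starts(length: int, tile: int, overlap: int) -> List[int]:
    """Start offsets along one axis so that neighbouring tiles share `overlap` pixels."""
    if tile <= 0 or not 0 <= overlap < tile:
        raise ValueError("need tile > 0 and 0 <= overlap < tile")
    if length <= tile:
        return [0]  # a single tile already covers the whole axis
    stride = tile - overlap
    starts = list(range(0, length - tile, stride))
    starts.append(length - tile)  # final tile sits flush with the far edge
    return starts


def tile_boxes(width: int, height: int, tile: int, overlap: int) -> List[Box]:
    """Row-major list of overlapping square tile boxes covering the page."""
    return [
        (x, y, min(x + tile, width), min(y + tile, height))
        for y in axis_starts(height, tile, overlap)
        for x in axis_starts(width, tile, overlap)
    ]
```

Because the last tile on each axis is clamped to the page edge, it may overlap its neighbour by more than `overlap` pixels, but no region is left uncovered. Each box uses the (left, upper, right, lower) order that Pillow's `Image.crop` expects, should the tiles be cut with Pillow.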
PixelRAG, Visual RAG, Document Scanner, Wikipedia Search, Ollama, Gemma 4, Multimodal AI, RAG Pipeline, Python RAG, Local LLM, PDF Tiling, Computer Vision, Information Retrieval, AI Search Engine