47thtechcorner/RayCodes_PixelRAG

🎨 PixelRAG: The NEW Free AI That Can Read 8+ Million Docs

PixelRAG is a lightweight Visual Retrieval-Augmented Generation (RAG) pipeline that does not need a heavy GPU. Traditional RAG extracts plain text and loses tables, formatting, and structural context. PixelRAG instead renders documents into visual page tiles, retrieves the most relevant tile, and answers queries directly from the visual content.

This project is a minimal, production-ready implementation that queries the official hosted PixelRAG API (which indexes 8.28M Wikipedia articles as screenshot tiles), downloads the matching high-resolution tile, and answers the query with a local Ollama LLM.
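The retrieve-then-answer flow described above can be sketched roughly as follows. This is an illustrative sketch, not the code in pixelrag_mvp.py: the `retrieve` callable and the `"text"` key in its result are hypothetical stand-ins, because this README does not document the PixelRAG API endpoint or its response shape, nor how the tile content is handed to the model. The only real interface used is Ollama's local `/api/generate` endpoint on its default port.

```python
import json
import urllib.request

# Ollama's default local endpoint for single-shot generation.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_ollama_request(question, context, model="gemma4:e2b"):
    """Build the JSON payload for Ollama's /api/generate endpoint."""
    prompt = (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    )
    # stream=False asks Ollama for one JSON object instead of a chunk stream.
    return {"model": model, "prompt": prompt, "stream": False}


def ask_ollama(payload, url=OLLAMA_URL, timeout=120):
    """POST the payload to a running Ollama server and return the answer text."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))["response"]


def answer_query(question, retrieve, generate=ask_ollama):
    """Retrieve context for the question, then ask the local model.

    `retrieve` is a hypothetical callable returning a dict with a "text" key;
    swap in whatever the real retrieval step returns.
    """
    hit = retrieve(question)
    return generate(build_ollama_request(question, hit["text"]))
```

Passing `retrieve` and `generate` as arguments keeps the network calls replaceable, so the flow can be exercised with stubs before Ollama or the hosted API is involved.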


🚀 Quick Start (Windows PowerShell)

Follow these exact steps to set up, install, run, and test the project.

📋 1. Prerequisites

Ensure you have Python 3.10+ installed and Ollama running. Download the required Ollama model:

ollama pull gemma4:e2b
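To confirm the model was actually pulled, one option is to ask the local Ollama server which models it has. The sketch below is not part of this repository; it assumes Ollama is listening on its default port 11434 and uses the `/api/tags` endpoint, which lists locally available models. The helper names are made up for illustration.

```python
import json
import urllib.request


def fetch_tags(base_url="http://localhost:11434", timeout=5):
    """Return the parsed JSON from Ollama's /api/tags (requires a running server)."""
    with urllib.request.urlopen(f"{base_url}/api/tags", timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


def model_is_available(tags, name):
    """Check whether a model name appears in an /api/tags response."""
    return any(model.get("name") == name for model in tags.get("models", []))
```

With the server running, `model_is_available(fetch_tags(), "gemma4:e2b")` should be true once the pull has finished.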

⚙️ 2. Installation

Navigate to the project directory, then create a virtual environment and install the dependencies:

python -m venv venv
.\venv\Scripts\Activate.ps1
pip install -r requirements.txt

📊 3. Run the Pipeline & Generate Report

Run the script to execute the search and write the final execution log to ideal_output.md:

python app.py

📁 File Structure & Explanations

  1. 📝 README.md: Project setup instructions, technical specifications, and use cases.
  2. ⚙️ requirements.txt: Pinned list of minimal, fast-installing dependencies.
  3. ⚡ pixelrag_mvp.py: Core Visual RAG engine (37 lines). Queries the live Wikipedia PixelRAG API, downloads the tile, and runs terminal retrieval.
  4. 📊 app.py: Report generator (9 lines). Calls the core engine and generates the matching ideal_output.md report.
  5. 📄 ideal_output.md: Generated report showing the exact retrieval logs and answer.

💡 5 Real-World Use Cases

  1. 📊 Financial Auditing: Retrieving visual cells/tables from complex financial statements where text parsers merge columns and scramble tabular data.
  2. 🗂️ Slide Deck Content QA: Slicing presentation slides into quadrant tiles to query structural infographics, bullet lists, and visual diagrams.
  3. 📐 Engineering Blueprint Queries: Slicing high-resolution schematics into spatial grid tiles and retrieving specific sub-components based on user queries.
  4. 📸 Web Screenshot RAG: Visualizing complex dashboard layouts by rendering page snapshots, preserving visual context (headers, sidebars, charts) during retrieval.
  5. 🔬 Research Paper Navigation: Querying multi-column scientific publications (e.g. arXiv PDFs) where text flow is often interrupted by floating figures or footnotes.
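As a rough illustration of the quadrant slicing mentioned in the slide deck use case, the sketch below computes four crop boxes for an image of a given size. It is an assumption about how such slicing could work, not code from this project. The boxes use the `(left, upper, right, lower)` convention, so they could be passed to an image library's crop function.

```python
def quadrant_boxes(width, height):
    """Split a width x height image into four quadrant boxes.

    Returns boxes as (left, upper, right, lower), ordered top-left,
    top-right, bottom-left, bottom-right. For odd sizes the right and
    bottom quadrants take the extra pixel.
    """
    mid_x, mid_y = width // 2, height // 2
    return [
        (0, 0, mid_x, mid_y),
        (mid_x, 0, width, mid_y),
        (0, mid_y, mid_x, height),
        (mid_x, mid_y, width, height),
    ]


boxes = quadrant_boxes(100, 60)
# boxes == [(0, 0, 50, 30), (50, 0, 100, 30), (0, 30, 50, 60), (50, 30, 100, 60)]
```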

🔮 5 Future Features

  1. 👁️ Native Visual Embedding Integration: Support for native vision-language embeddings such as Qwen3-VL-Embedding, replacing the text OCR fallback entirely.
  2. 🌐 Dynamic Playwright Web Capturing: Direct web page rendering to visual tiles using headless Playwright with custom viewports.
  3. 🧠 Ollama Multimodal Generation: Direct image-based question answering using Ollama vision models (e.g. minicpm-v4.6 or llava).
  4. ✂️ Overlapping Sliding-Window Tiling: Implementing overlapping grids to avoid cutting words or images at tile boundaries.
  5. 🔍 Hierarchical Visual Retrieval: A multi-stage search that retrieves the whole page first, then zooms in on the most relevant sub-tile.
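The overlapping sliding-window idea from item 4 could look something like the sketch below. It is one possible design, not an implemented feature: consecutive tiles share `overlap` pixels, and the last tile in each row and column is shifted back so it ends flush with the image edge. The function names and the square tile shape are assumptions made for illustration.

```python
def window_starts(length, tile, overlap):
    """Start offsets of windows of size `tile` covering `length` pixels."""
    if tile <= 0 or not 0 <= overlap < tile:
        raise ValueError("need tile > 0 and 0 <= overlap < tile")
    if length <= tile:
        return [0]  # a single window already covers everything
    stride = tile - overlap
    starts = list(range(0, length - tile, stride))
    starts.append(length - tile)  # final window ends exactly at the edge
    return starts


def tile_boxes(width, height, tile, overlap):
    """Overlapping square tiles as (left, upper, right, lower) boxes, row by row."""
    return [
        (x, y, min(x + tile, width), min(y + tile, height))
        for y in window_starts(height, tile, overlap)
        for x in window_starts(width, tile, overlap)
    ]


row = tile_boxes(1000, 400, tile=400, overlap=100)
# row == [(0, 0, 400, 400), (300, 0, 700, 400), (600, 0, 1000, 400)]
```

Because neighbouring tiles share a 100-pixel band in this example, a word or figure cut by one tile boundary appears whole in the adjacent tile, provided it is narrower than the overlap.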

🏷️ Keywords

PixelRAG, Visual RAG, Document Scanner, Wikipedia Search, Ollama, Gemma 4, Multimodal AI, RAG Pipeline, Python RAG, Local LLM, PDF Tiling, Computer Vision, Information Retrieval, AI Search Engine

About

Visual RAG pipeline searching 8.28M Wikipedia articles as screenshot tiles, matching visual layouts, and querying local Ollama LLM.
