# Private Knowledge Base RAG Platform
A high-level architecture reference POC for Retrieval-Augmented Generation that lets you chat with your private documents.
## What it unlocks

- Multi-model accuracy validation with a single API key (OpenRouter)
- Agent tool configuration and contract enforcement (PydanticAI)
- SQL-native vector search (DuckDB + VSS plugin)
- AI engineering workflows (see RAG Quality Gate)

**Value Proposition:** Full stack & data ownership
## Model Configuration

| Role | Model | Notes |
|---|---|---|
| Chat Model | qwen/qwen3-14b | via OpenRouter |
| Embedding Model | text-embedding-3-small | 1536 dimensions |
| Reranker | ministral-3b-2512 | LLM-based reranking |
## How to Test This POC

### Key Features to Try

- **Index a GitHub Repository:** Use a README.md with content ChatGPT doesn't know about
- **Upload PDF Documents:** Documents created by hand and not available on the internet
- **View Source Citations:** Every response includes source citations with similarity scores
### Run It Yourself

1. **Clone & Configure:** Request repository access at gabriel@impacte.tech, then set up your environment: `cp .env.example .env && vim .env`
2. **Start Services:** Launch all services with Docker Compose: `docker compose up -d`
3. **Index Documents:** Use the GitHub, PDF, or Google Docs tabs to add your knowledge base
4. **Start Chatting:** Ask questions about your indexed documents in the Chat tab
## Architecture Overview
### Agent Framework Back-End with PydanticAI

Provides structured AI agent interactions with automatic contract enforcement, tool validation, and type-safe LLM responses, enabling reliable production AI workflows.
### Private Vector Database with DuckDB + VSS

An embedded SQL database with native vector similarity search (HNSW): zero external dependencies, full data privacy, and lightning-fast semantic retrieval without vendor lock-in.
### Central Model API with OpenRouter

A single API key for 100+ LLMs across providers: central governance of token expenses per team/developer, model benchmarking, and cost optimization without multiple vendor accounts.
### Private Front-End with Next.js App Router

Full ownership of the UI/UX for complete auditing, governance, and rapid iteration: custom proxying, streaming responses, and seamless integration with internal auth/security.
## AI Engineering

### RAG Pipeline
**1. Document Chunking:** Split documents into ~1000-token chunks with a 200-token overlap for context preservation.
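The overlap step can be sketched as a sliding window. This sketch splits on words as a stand-in for the tokenizer the POC actually uses:

```python
def chunk(tokens: list[str], size: int = 1000, overlap: int = 200) -> list[list[str]]:
    """Fixed-size windows that share `overlap` tokens with their neighbour."""
    step = size - overlap
    return [tokens[i:i + size] for i in range(0, max(len(tokens) - overlap, 1), step)]

# Tiny example: 10 tokens, windows of 4 with 1 token of overlap.
words = "the quick brown fox jumps over the lazy sleeping dog".split()
print(chunk(words, size=4, overlap=1))
```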
**2. Embedding Generation:** OpenAI text-embedding-3-small (1536 dimensions) for semantic vector representation.
**3. Vector Search:** Cosine similarity search using the DuckDB VSS extension with HNSW indexing.
**4. LLM Reranking:** A PydanticAI-powered reranker selects the top-3 most relevant chunks from the top-5 candidates.
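The contract side of that step can be sketched as the Pydantic schema a reranking agent would be bound to; the field names and the helper function are illustrative, not the POC's code:

```python
from pydantic import BaseModel, Field

class Ranking(BaseModel):
    """Output schema a PydanticAI reranker agent would be bound to."""
    best: list[int] = Field(
        min_length=3, max_length=3,
        description="Indices into the 5 candidate chunks, most relevant first",
    )

def select(candidates: list[str], ranking: Ranking) -> list[str]:
    """Keep only the top-3 chunks, in ranked order."""
    return [candidates[i] for i in ranking.best]
```

Binding the schema to the agent means a malformed ranking (too few indices, non-integers) is rejected and retried rather than silently passed downstream.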
## RAG Quality Gate (MLOps, MLflow, LLM-as-Judge)

**LLM Judge Accuracy:** 17/20 questions correct* (mean correctness for Qwen3-14b + reranking, scored with an LLM-as-Judge setup and tracked as MLflow metrics)
### Key Findings
- 85% accuracy on the LLM-as-Judge evaluation
- Reranking has a considerable positive impact: +17.6% on Concept Recall
- Main takeaway: Qwen3-14b beats the 30B, 70B, and 120B models on accuracy; smaller models can outperform much larger ones
### Model Comparison
| Model | Size | Accuracy | Recall | Rerank |
|---|---|---|---|---|
| qwen/qwen3-14b | 14B | 85.0% | 83.3% | ✅ |
| qwen/qwen3-14b | 14B | 85.0% | 70.8% | ❌ |
| mistralai/ministral-14b | 14B | 75.0% | 75.0% | ❌ |
| meta-llama/llama-3.3-70b | 70B | 60.0% | 78.3% | ❌ |
| meta-llama/llama-3.1-8b | 8B | 55.0% | 74.2% | ❌ |
| openai/gpt-oss-120b:free | 120B | 72.0% | 79.6% | ❌ |
*Synthetic Document Evaluation: docs generated by google/gemini-2.0-flash-001, ground truth QA pairs by anthropic/claude-3.5-sonnet, scored by deepseek/deepseek-r1
*Metrics shown are illustrative for this POC. Production deployments should define evaluation criteria based on specific use cases — additional metrics like Context Precision, Semantic Similarity, Hallucination Rate, or Cost per Query may be relevant.