# Private Knowledge Base RAG Platform
A high-level architecture reference POC for Retrieval-Augmented Generation that lets you chat with your private documents.
## What it unlocks

- Multi-model accuracy validation with a single API key (OpenRouter)
- Agent tool configuration and contract enforcement (PydanticAI)
- SQL-native vector search (DuckDB + VSS plugin)
- AI engineering workflows (see RAG Quality Gate)

**Value Proposition:** Full stack & data ownership
## Model Configuration

| Role | Model | Notes |
|---|---|---|
| Chat Model | qwen/qwen3-14b | via OpenRouter |
| Embedding Model | text-embedding-3-small | 1536 dimensions |
| Reranker | ministral-3b-2512 | LLM-based reranking |
## How to Test This POC

### Key Features to Try

- **Index a GitHub Repository:** Use a README.md with content ChatGPT doesn't know about
- **Upload PDF Documents:** Documents created by hand and not available on the internet
- **View Source Citations:** Every response includes source citations with similarity scores
### Run It Yourself

1. **Clone & Configure:** Request repository access at gabriel@impacte.tech, then set up your environment: `cp .env.example .env && vim .env`
2. **Start Services:** Launch all services with Docker Compose: `docker compose up -d`
3. **Index Documents:** Use the GitHub, PDF, or Google Docs tabs to add your knowledge base
4. **Start Chatting:** Ask questions about your indexed documents in the Chat tab
## Architecture Overview
### Agent Framework Back-End with PydanticAI

Provides structured AI agent interactions with automatic contract enforcement, tool validation, and type-safe LLM responses, enabling reliable production AI workflows.
### Private Vector Database with DuckDB + VSS

An embedded SQL database with native vector similarity search (HNSW): zero external dependencies, full data privacy, and lightning-fast semantic retrieval without vendor lock-in.
### Central Model API with OpenRouter

A single API key for 100+ LLMs across providers: central governance of token expenses per team/developer, model benchmarking, and cost optimization without multiple vendor accounts.
### Private Front-End with Next.js App Router

Full ownership of the UI/UX for complete auditing, governance, and rapid iteration: custom proxying, streaming responses, and seamless integration with internal auth/security.
## AI Engineering

### RAG Pipeline
**1. Document Chunking:** Split documents into ~1000-token chunks with a 200-token overlap for context preservation.
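The overlap step can be sketched as a sliding window. This sketch splits on words as a stand-in for the tokenizer the POC actually uses:

```python
def chunk(tokens: list[str], size: int = 1000, overlap: int = 200) -> list[list[str]]:
    """Fixed-size windows that share `overlap` tokens with their neighbour."""
    step = size - overlap
    return [tokens[i:i + size] for i in range(0, max(len(tokens) - overlap, 1), step)]

# Tiny example: 10 tokens, windows of 4 with 1 token of overlap.
words = "the quick brown fox jumps over the lazy sleeping dog".split()
print(chunk(words, size=4, overlap=1))
```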
**2. Embedding Generation:** OpenAI text-embedding-3-small (1536 dimensions) for semantic vector representation.
**3. Vector Search:** Cosine similarity search using the DuckDB VSS extension with HNSW indexing.
**4. LLM Reranking:** A PydanticAI-powered reranker selects the top-3 most relevant chunks from the top-5 candidates.
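The contract side of that step can be sketched as the Pydantic schema a reranking agent would be bound to; the field names and the helper function are illustrative, not the POC's code:

```python
from pydantic import BaseModel, Field

class Ranking(BaseModel):
    """Output schema a PydanticAI reranker agent would be bound to."""
    best: list[int] = Field(
        min_length=3, max_length=3,
        description="Indices into the 5 candidate chunks, most relevant first",
    )

def select(candidates: list[str], ranking: Ranking) -> list[str]:
    """Keep only the top-3 chunks, in ranked order."""
    return [candidates[i] for i in ranking.best]
```

Binding the schema to the agent means a malformed ranking (too few indices, non-integers) is rejected and retried rather than silently passed downstream.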
## RAG Quality Gate (MLOps, MLflow, LLM-as-Judge)

**LLM Judge Accuracy:** 17/20 questions correct* (mean correctness for Qwen3-14b + reranking, scored with an LLM-as-Judge setup and tracked as MLflow metrics)
### Key Findings
- 85% accuracy on the LLM-as-Judge evaluation
- Reranking has a considerable positive impact: +17.6% on Concept Recall
- Main takeaway: Qwen3-14b beats the 30B, 70B, and 120B models on accuracy; smaller models can outperform much larger ones
### Model Comparison
| Model | Size | Accuracy | Recall | Rerank |
|---|---|---|---|---|
| qwen/qwen3-14b | 14B | 85.0% | 83.3% | ✅ |
| qwen/qwen3-14b | 14B | 85.0% | 70.8% | ❌ |
| mistralai/ministral-14b | 14B | 75.0% | 75.0% | ❌ |
| meta-llama/llama-3.3-70b | 70B | 60.0% | 78.3% | ❌ |
| meta-llama/llama-3.1-8b | 8B | 55.0% | 74.2% | ❌ |
| openai/gpt-oss-120b:free | 120B | 72.0% | 79.6% | ❌ |
*Synthetic Document Evaluation: docs generated by google/gemini-2.0-flash-001, ground truth QA pairs by anthropic/claude-3.5-sonnet, scored by deepseek/deepseek-r1
*Metrics shown are illustrative for this POC. Production deployments should define evaluation criteria based on specific use cases — additional metrics like Context Precision, Semantic Similarity, Hallucination Rate, or Cost per Query may be relevant.