Impacte Tech
AI Engineering POC

Private Knowledge Base RAG Platform

A high-level architecture reference POC for Retrieval-Augmented Generation that lets you chat with your private documents.

What it unlocks

  • Multi-model accuracy validation with a single API key (OpenRouter)
  • Agent tool configuration and contract enforcement (PydanticAI)
  • SQL-native vector search (DuckDB + VSS plugin)
  • AI Engineering workflows (see RAG Quality Gate)

Value Proposition: Full Stack & Data Ownership

Air-Gapped Security & Full Data Privacy
Complete Auditing of Every Layer
Granular Cost Control per Model
No Vendor Lock-in on Any Component

Model Configuration

Chat Model

qwen/qwen3-14b

Via OpenRouter

Embedding Model

text-embedding-3-small

1536 dimensions

Reranker

ministral-3b-2512

LLM-based reranking

How to Test This POC

Key Features to Try

1. Index a GitHub Repository: use a README.md with content ChatGPT doesn't know about.

2. Upload PDF Documents: documents created by hand and not available on the internet.

3. View Source Citations: every response includes source citations with similarity scores.

Run It Yourself

1. Clone & Configure: request repository access at gabriel@impacte.tech, then set up your environment:

   cp .env.example .env && vim .env

2. Start Services: launch all services with Docker Compose:

   docker compose up -d

3. Index Documents: use the GitHub, PDF, or Google Docs tabs to add your knowledge base.

4. Start Chatting: ask questions about your indexed documents in the Chat tab.

Architecture Overview

Agent Framework Back-End with PydanticAI

Provides structured AI agent interactions with automatic contract enforcement, tool validation, and type-safe LLM responses, enabling reliable production AI workflows.
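The contract-enforcement idea can be sketched with the standard library alone. PydanticAI itself derives this validation from Pydantic models and can re-prompt the model until the output parses; the `Citation` schema below is a hypothetical illustration, not the POC's actual code:

```python
from dataclasses import dataclass


@dataclass
class Citation:
    """The typed contract a model response must satisfy."""
    source: str
    similarity: float


def parse_citation(raw: dict) -> Citation:
    """Reject any model output that does not satisfy the contract."""
    if not isinstance(raw.get("source"), str):
        raise ValueError("source must be a string")
    score = raw.get("similarity")
    if not isinstance(score, (int, float)) or isinstance(score, bool) \
            or not 0.0 <= score <= 1.0:
        raise ValueError("similarity must be a number in [0, 1]")
    return Citation(raw["source"], float(score))


# Valid output passes; malformed output raises instead of silently flowing on.
citation = parse_citation({"source": "README.md", "similarity": 0.87})
```

The point of the pattern is that downstream code only ever sees values that already passed the schema, which is what makes agent tool calls auditable and type-safe.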

Private Vector Database with DuckDB + VSS

Embedded SQL database with native vector similarity search (HNSW): zero external dependencies, full data privacy, and lightning-fast semantic retrieval without vendor lock-in.
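A minimal sketch of what this layer looks like in SQL, assuming the VSS extension's documented HNSW index syntax and `array_cosine_distance` function; the table and column names are illustrative, not the POC's actual schema:

```sql
INSTALL vss;
LOAD vss;

-- 1536 dimensions matches text-embedding-3-small
CREATE TABLE chunks (
    id        INTEGER,
    content   VARCHAR,
    embedding FLOAT[1536]
);

-- HNSW index for fast approximate nearest-neighbour search
CREATE INDEX chunks_hnsw ON chunks USING HNSW (embedding) WITH (metric = 'cosine');

-- Top-5 candidates by cosine distance (reranked down to top-3 later)
SELECT id, content
FROM chunks
ORDER BY array_cosine_distance(embedding, ?::FLOAT[1536])  -- bind the query embedding
LIMIT 5;
```

Note that persisting an HNSW index in a disk-backed database currently sits behind an experimental flag in the VSS extension, so check the extension docs for your DuckDB version.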

Central Model API with OpenRouter

A single API key for 100+ LLMs across providers: central governance of token spend per team and developer, model benchmarking, and cost optimization without multiple vendor accounts.
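Because OpenRouter exposes an OpenAI-compatible chat-completions endpoint, benchmarking a different model is a one-argument change rather than a new integration. A minimal stdlib sketch (the key and prompt are placeholders):

```python
import json
import urllib.request

OPENROUTER_URL = "https://openrouter.ai/api/v1/chat/completions"


def build_chat_request(api_key: str, model: str, prompt: str) -> urllib.request.Request:
    """Build an OpenAI-style chat-completion request; one key serves any listed model."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        OPENROUTER_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


# Swapping "qwen/qwen3-14b" for "meta-llama/llama-3.3-70b" is the whole
# benchmarking change; send with urllib.request.urlopen(req).
req = build_chat_request("sk-or-<your-key>", "qwen/qwen3-14b",
                         "Summarize the indexed README.")
```

Centralizing every call behind one endpoint is also what makes per-team token accounting possible: all spend flows through one key hierarchy instead of scattered vendor accounts.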

Private Front-End with Next.js App Router

Full ownership of UI/UX for complete auditing, governance, and rapid iteration: custom proxying, streaming responses, and seamless integration with internal auth/security.

AI Engineering

RAG Pipeline

1. Document Chunking: split documents into ~1000-token chunks with 200-token overlap for context preservation.

2. Embedding Generation: OpenAI text-embedding-3-small (1536 dimensions) for semantic vector representation.

3. Vector Search: cosine-similarity search using the DuckDB VSS extension with HNSW indexing.

4. LLM Reranking: a PydanticAI-powered reranker selects the top-3 most relevant chunks from the top-5 candidates.
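Step 1 above can be sketched in a few lines; a whitespace split stands in for a real tokenizer, and the function is an illustration of the sliding-window scheme, not the POC's actual chunker:

```python
def chunk_tokens(tokens: list[str], size: int = 1000, overlap: int = 200) -> list[list[str]]:
    """Sliding-window chunking: consecutive chunks share `overlap` tokens."""
    if overlap >= size:
        raise ValueError("overlap must be smaller than chunk size")
    step = size - overlap  # advance 800 tokens per chunk with the defaults
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):  # last window already covers the tail
            break
    return chunks


# Whitespace tokens as a stand-in for tokenizer tokens:
doc = "word " * 2500
chunks = chunk_tokens(doc.split())
```

The 200-token overlap is what preserves context across chunk boundaries: a sentence cut at the end of one chunk reappears intact at the start of the next, so retrieval never loses it entirely.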

RAG Quality Gate (MLOps, MLflow, LLM-as-Judge)

85.0% LLM Judge Accuracy (17/20 questions correct*)

4.40/5 Mean Correctness (Qwen3-14b + Reranking)

LLM-as-Judge

Correctness: 4.40/5
Relevance: 4.15/5
Completeness: 3.80/5
Evaluator: deepseek-r1

MLflow Metrics

Concept Recall: 83.3%
ROUGE-L: 0.181
Faithfulness: 0.044

Configuration

Chunk Size: 1000
Chunk Overlap: 200
Vector Top-K: 5
Rerank Top-N: 3

Key Findings

  • 85% accuracy on the LLM-as-Judge evaluation
  • Reranking impact: reranking lifts Concept Recall by +17.6% (83.3% with vs 70.8% without)
  • Main takeaway: Qwen3-14b beats 30B, 70B, and 120B models in accuracy; smaller models can outperform much larger ones

Model Comparison

Model                      Size   Accuracy   Recall   Rerank
qwen/qwen3-14b             14B    85.0%      83.3%    ✓
qwen/qwen3-14b             14B    85.0%      70.8%    ✗
mistralai/ministral-14b    14B    75.0%      75.0%
meta-llama/llama-3.3-70b   70B    60.0%      78.3%
meta-llama/llama-3.1-8b    8B     55.0%      74.2%
openai/gpt-oss-120b:free   120B   72.0%      79.6%

*Synthetic Document Evaluation: docs generated by google/gemini-2.0-flash-001, ground truth QA pairs by anthropic/claude-3.5-sonnet, scored by deepseek/deepseek-r1

*Metrics shown are illustrative for this POC. Production deployments should define evaluation criteria based on specific use cases — additional metrics like Context Precision, Semantic Similarity, Hallucination Rate, or Cost per Query may be relevant.