RAG (Retrieval Augmented Generation) is one of the most powerful patterns for AI applications that actually work in practice. Instead of limiting an LLM to its training data or cramming every piece of context into the prompt, you build a system that finds the relevant documents and hands them to the LLM. The idea is not new, but combining it with modern LLMs and pgvector in PostgreSQL makes it practical and scalable.

In this article we show you how to build a complete RAG system with Django and Python, from document ingestion through the vector database to intelligent search. Everything is production-ready, everything comes with code examples, and nothing requires paying for an external vector-as-a-service. This is what we build for real customers at e-laborat, and here we show you exactly how.

What is RAG and why do you need it?

Before we write code, you need the conceptual understanding.

The problem without RAG: You have an LLM. It answers well, but only on the basis of its training data, which is typically months old by the time you use the model. If you want to give your company an LLM that knows your products, documentation, and customer policies, you end up with one of three approaches:

Fine-tuning: You train the model on your data. That is expensive, needs GPUs, and is overkill for knowledge integration.
Context in the prompt: You take all relevant documents and dump them into the prompt. That works as long as everything fits into the context window, but it gets expensive and slow as the prompt grows.
RAG: You store your documents intelligently, retrieve only the relevant ones for each question, and add them to the prompt.

RAG is the practical solution.

The RAG pattern, simplified:

Document → Chunking → Embeddings → Vector DB
    ↓
Query → Embedding → Similarity Search → Top-K Chunks
    ↓
Prompt = System Context + Retrieved Chunks + User Query
    ↓
LLM API → Response
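To make the flow concrete before any Django code appears, here is a minimal, dependency-free sketch of the same steps. It is only an illustration: the bag-of-words `embed` function is a toy stand-in for a real embedding model, and all function names and sample texts are made up for this example.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve(query: str, chunks: list[str], top_k: int = 2) -> list[str]:
    """Rank chunks by similarity to the query and keep the top_k."""
    query_vec = embed(query)
    ranked = sorted(
        chunks,
        key=lambda c: cosine_similarity(query_vec, embed(c)),
        reverse=True,
    )
    return ranked[:top_k]


def build_prompt(query: str, retrieved: list[str]) -> str:
    """Combine retrieved chunks and the user query into one prompt."""
    context = "\n\n".join(retrieved)
    return f"Context:\n{context}\n\nQuestion: {query}"


chunks = [
    "Refunds are processed within 14 days of the return.",
    "Our office is closed on public holidays.",
    "Shipping to Austria takes three to five business days.",
]
top = retrieve("How long do refunds take?", chunks, top_k=1)
prompt = build_prompt("How long do refunds take?", top)
```

The rest of the article replaces each toy piece with a real one: `embed` becomes an embedding API call, the sorted list becomes a pgvector query, and the prompt goes to an LLM.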

Why Django + pgvector? There are of course alternatives such as Pinecone, Weaviate, or Qdrant. They are fancy and scale well. But for most startups and mid-sized companies, pgvector is the right choice:

Cost: PostgreSQL costs next to nothing. Pinecone gets expensive with millions of embeddings.
Simplicity: You already have a Django application with PostgreSQL, so this is not yet another service to run.
Control: The vectors live in your own system. No vendor lock-in.
Django integration: pgvector works with the Django ORM like a native model field.
Scaling: pgvector can handle millions of vectors. IVFFlat indexes are fast enough.

It is not a question of pgvector versus the better solution. It is a question of what you actually need. For 90% of applications, that is pgvector.

Setup: Dependencies and PostgreSQL pgvector

First, install the dependencies (the views below use Django REST Framework, and Django's PostgreSQL backend needs a driver):

pip install django djangorestframework psycopg2-binary pgvector langchain openai anthropic pypdf python-dotenv

If pgvector is installed on your PostgreSQL server but not yet enabled in your database, enable the extension:

-- In psql, connected to your database:
CREATE EXTENSION IF NOT EXISTS vector;

That's it! pgvector is now enabled.

In `settings.py`:

# settings.py
INSTALLED_APPS = [
    'django.contrib.admin',
    'django.contrib.auth',
    'django.contrib.contenttypes',
    'django.contrib.sessions',
    'django.contrib.messages',
    'django.contrib.staticfiles',
    'pgvector.django',  # Add this
    'rag_app',  # Your app
]

DATABASES = {
    'default': {
        'ENGINE': 'django.db.backends.postgresql',
        'NAME': 'rag_db',
        'USER': 'postgres',
        'PASSWORD': 'your_password',
        'HOST': 'localhost',
        'PORT': '5432',
    }
}
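The settings above hard-code the database credentials, which is fine for a local test but not for a deployed system. One possible approach is to read them from environment variables. The variable names (`RAG_DB_NAME`, `RAG_DB_PASSWORD`, and so on) and the helper function are assumptions made up for this sketch, not part of Django.

```python
import os


def database_settings_from_env(env=os.environ) -> dict:
    """Build a Django DATABASES dict from environment variables."""
    return {
        "default": {
            "ENGINE": "django.db.backends.postgresql",
            "NAME": env.get("RAG_DB_NAME", "rag_db"),
            "USER": env.get("RAG_DB_USER", "postgres"),
            # No default on purpose: a missing secret raises KeyError at startup
            "PASSWORD": env["RAG_DB_PASSWORD"],
            "HOST": env.get("RAG_DB_HOST", "localhost"),
            "PORT": env.get("RAG_DB_PORT", "5432"),
        }
    }


# In settings.py this would read: DATABASES = database_settings_from_env()
example = database_settings_from_env(
    {"RAG_DB_PASSWORD": "s3cret", "RAG_DB_HOST": "db.internal"}
)
```

Passing the environment as a parameter keeps the function easy to test. With python-dotenv installed, the variables could be loaded from a `.env` file before this function runs.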

Run the migrations:

python manage.py migrate

Models: Documents and Chunks

Define the models for your RAG pipeline:

# models.py
import uuid

from django.db import models
from pgvector.django import VectorField


class Document(models.Model):
    id = models.UUIDField(primary_key=True, default=uuid.uuid4, editable=False)
    title = models.CharField(max_length=500)
    content = models.TextField()
    file_name = models.CharField(max_length=255, blank=True)
    created_at = models.DateTimeField(auto_now_add=True)
    updated_at = models.DateTimeField(auto_now=True)

    class Meta:
        ordering = ['-created_at']

    def __str__(self):
        return self.title


class DocumentChunk(models.Model):
    id = models.UUIDField(primary_key=True, default=uuid.uuid4, editable=False)
    # on_delete is required; deleting a document also deletes its chunks
    document = models.ForeignKey(
        Document, related_name='chunks', on_delete=models.CASCADE
    )
    chunk_index = models.IntegerField()
    content = models.TextField()
    embedding = VectorField(dimensions=1536)  # OpenAI embedding size
    created_at = models.DateTimeField(auto_now_add=True)

    class Meta:
        ordering = ['document', 'chunk_index']
        indexes = [
            models.Index(fields=['document', 'chunk_index']),
        ]

    def __str__(self):
        return f"{self.document.title} - Chunk {self.chunk_index}"

Text Chunking: Splitting Documents into Usable Pieces

The first step after the upload is chunking. You do not want to store a 100-page PDF as a single vector. You need smaller, meaningful chunks.

# services/chunking.py
import re
from typing import List


class TextChunker:
    """Split text into chunks with overlap. Sizes are measured in words."""

    def __init__(self, chunk_size: int = 1000, overlap: int = 200):
        self.chunk_size = chunk_size
        self.overlap = overlap

    def chunk_text(self, text: str) -> List[str]:
        """Split text into overlapping chunks."""
        # First, clean the text
        text = self._clean_text(text)

        # Split by sentences to keep meaningful boundaries
        sentences = self._split_sentences(text)

        chunks = []
        current_chunk = []
        current_length = 0

        for sentence in sentences:
            sentence_length = len(sentence.split())

            # If adding this sentence exceeds chunk_size, start a new chunk
            if current_length + sentence_length > self.chunk_size and current_chunk:
                chunks.append(' '.join(current_chunk))

                # Carry over trailing sentences, up to `overlap` words
                carried, carried_length = [], 0
                for previous in reversed(current_chunk):
                    previous_length = len(previous.split())
                    if carried_length + previous_length > self.overlap:
                        break
                    carried.insert(0, previous)
                    carried_length += previous_length
                current_chunk = carried + [sentence]
                current_length = carried_length + sentence_length
            else:
                current_chunk.append(sentence)
                current_length += sentence_length

        # Don't forget the last chunk
        if current_chunk:
            chunks.append(' '.join(current_chunk))

        return chunks

    def _clean_text(self, text: str) -> str:
        """Clean text: collapse newlines and repeated spaces."""
        text = re.sub(r'\n+', ' ', text)
        text = re.sub(r' +', ' ', text)
        return text.strip()

    def _split_sentences(self, text: str) -> List[str]:
        """Split text into sentences with a simple punctuation-based rule."""
        sentence_pattern = r'(?<=[.!?]) +'
        sentences = re.split(sentence_pattern, text)
        return [s.strip() for s in sentences if s.strip()]


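# Additionally, for PDFs (the package installed above is pypdf, not PyPDF2):
from pypdf import PdfReader


def extract_text_from_pdf(pdf_path: str) -> str:
    """Extract text from a PDF file, one page after another."""
    pages = []
    with open(pdf_path, 'rb') as file:
        pdf = PdfReader(file)
        for page in pdf.pages:
            pages.append(page.extract_text() or "")
    # Separate pages with a newline so words do not run together
    return "\n".join(pages)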
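How `chunk_size` and `overlap` interact is easiest to see on a tiny input. The following sketch is a simplified word-level sliding window, not the sentence-aware chunker from the article. The function name and the sample data are made up for illustration.

```python
def sliding_window(words: list[str], chunk_size: int, overlap: int) -> list[list[str]]:
    """Split a word list into windows of chunk_size that share `overlap` words."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - overlap
    windows = []
    for start in range(0, len(words), step):
        windows.append(words[start:start + chunk_size])
        if start + chunk_size >= len(words):
            break  # this window already reaches the end of the text
    return windows


words = [f"w{i}" for i in range(10)]
windows = sliding_window(words, chunk_size=4, overlap=2)
for window in windows:
    print(" ".join(window))
```

Each window starts `chunk_size - overlap` words after the previous one, so a fact that straddles a chunk boundary still appears complete in at least one chunk. A larger overlap improves recall at the cost of more chunks to embed and store.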

Embedding Generation with OpenAI and Local Models

Now we convert text to vectors. You can use OpenAI or local models for this. Anthropic does not offer an embeddings API of its own, so we use it only for the generation step later on. Here we use OpenAI, which is cheap for embeddings, and optionally a local Hugging Face model:

# services/embeddings.py
import os
from typing import List

from openai import OpenAI


class EmbeddingService:
    """Generate embeddings for text chunks."""

    def __init__(self, model: str = "text-embedding-3-small"):
        self.model = model
        self.client = OpenAI(api_key=os.getenv('OPENAI_API_KEY'))

    def embed_text(self, text: str) -> List[float]:
        """Generate embedding for a single text."""
        response = self.client.embeddings.create(
            model=self.model,
            input=text
        )
        return response.data[0].embedding

    def embed_texts(self, texts: List[str]) -> List[List[float]]:
        """Generate embeddings for multiple texts (batch)."""
        response = self.client.embeddings.create(
            model=self.model,
            input=texts
        )

        # Sort by index to ensure correct order
        embeddings = sorted(response.data, key=lambda x: x.index)
        return [e.embedding for e in embeddings]


# Alternative: local embeddings with sentence-transformers
# (requires: pip install sentence-transformers)
from sentence_transformers import SentenceTransformer


class LocalEmbeddingService:
    """Generate embeddings locally using sentence-transformers."""

    # Note: all-MiniLM-L6-v2 produces 384-dimensional vectors, so the
    # VectorField in models.py must use dimensions=384 with this model.
    def __init__(self, model_name: str = "all-MiniLM-L6-v2"):
        self.model = SentenceTransformer(model_name)

    def embed_text(self, text: str) -> List[float]:
        """Generate embedding for a single text."""
        embedding = self.model.encode(text)
        return embedding.tolist()

    def embed_texts(self, texts: List[str]) -> List[List[float]]:
        """Generate embeddings for multiple texts."""
        embeddings = self.model.encode(texts)
        return embeddings.tolist()

With OpenAI's text-embedding-3-small you can expect:
- 1536 dimensions
- $0.02 per 1 million tokens (at the time of writing)
- Very fast embedding generation
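To get a feeling for what that price means for a real corpus, a back-of-the-envelope calculation helps. The sketch below uses the price quoted above. The ratio of 1.3 tokens per word is only a rough assumption that varies by language and tokenizer, and the corpus size is invented for the example.

```python
def estimate_embedding_cost(
    num_words: int,
    tokens_per_word: float = 1.3,          # assumption, not a measured value
    usd_per_million_tokens: float = 0.02,  # price quoted in the article
) -> float:
    """Estimate the one-time cost in USD of embedding a corpus."""
    tokens = num_words * tokens_per_word
    return tokens / 1_000_000 * usd_per_million_tokens


# Hypothetical corpus: 2,000 documents of roughly 5,000 words each
cost = estimate_embedding_cost(2_000 * 5_000)
print(f"Estimated embedding cost: ${cost:.2f}")
```

Under these assumptions, ten million words come to roughly 13 million tokens, or about a quarter of a dollar. Overlapping chunks raise the token count somewhat, but the order of magnitude stays the same.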

Document Ingestion: Upload and Processing

Now we build a view that uploads PDFs, chunks them, and embeds them:

# views.py
import json  # used by the streaming view further down
import logging
import tempfile
from pathlib import Path

from rest_framework import status
from rest_framework.permissions import IsAuthenticated
from rest_framework.response import Response
from rest_framework.views import APIView

from .models import Document, DocumentChunk
from .services.chunking import TextChunker, extract_text_from_pdf
from .services.embeddings import EmbeddingService

logger = logging.getLogger(__name__)


class DocumentUploadView(APIView):
    """Upload and process a document."""
    permission_classes = [IsAuthenticated]

    def post(self, request):
        try:
            # Get file from request
            if 'file' not in request.FILES:
                return Response(
                    {"error": "No file provided"},
                    status=status.HTTP_400_BAD_REQUEST
                )

            file = request.FILES['file']
            title = request.data.get('title', file.name)
            suffix = Path(file.name).suffix.lower()

            # Save to temp location, keeping the real file extension
            with tempfile.NamedTemporaryFile(delete=False, suffix=suffix) as tmp_file:
                for chunk in file.chunks():
                    tmp_file.write(chunk)
                tmp_path = tmp_file.name

            try:
                # Extract text
                if suffix == '.pdf':
                    text = extract_text_from_pdf(tmp_path)
                else:
                    # Assume plain text. Read the temp file, because the
                    # upload stream was already consumed by file.chunks().
                    text = Path(tmp_path).read_text(encoding='utf-8')

                # Create Document
                document = Document.objects.create(
                    title=title,
                    content=text,
                    file_name=file.name
                )

                # Chunk text
                chunker = TextChunker(chunk_size=1000, overlap=200)
                chunks = chunker.chunk_text(text)

                logger.info(f"Processing {len(chunks)} chunks for {document.title}")

                # Generate embeddings
                embedding_service = EmbeddingService()
                embeddings = embedding_service.embed_texts(chunks)

                # Save chunks with embeddings
                chunk_objects = [
                    DocumentChunk(
                        document=document,
                        chunk_index=idx,
                        content=chunk_text,
                        embedding=embedding
                    )
                    for idx, (chunk_text, embedding) in enumerate(zip(chunks, embeddings))
                ]

                # Batch create
                DocumentChunk.objects.bulk_create(chunk_objects, batch_size=100)

                logger.info(f"Successfully processed document {document.id}")

                return Response({
                    "document_id": str(document.id),
                    "title": document.title,
                    "chunks_created": len(chunks)
                }, status=status.HTTP_201_CREATED)

            finally:
                # Clean up temp file
                Path(tmp_path).unlink(missing_ok=True)

        except Exception as e:
            # logger.exception also records the traceback
            logger.exception("Upload error")
            return Response(
                {"error": f"Upload failed: {str(e)}"},
                status=status.HTTP_500_INTERNAL_SERVER_ERROR
            )
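The upload view sends all chunks of a document to the embedding service in a single call. Embedding APIs usually cap how many inputs one request may contain, so for long documents it can help to split the list into batches. The following sketch shows the idea with a fake embedding function. The helper names and the batch size are assumptions for illustration, and `fake_embed` merely stands in for `EmbeddingService.embed_texts`.

```python
from typing import Callable, Iterator, List, Sequence


def batched(items: Sequence[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield list(items[start:start + batch_size])


def embed_in_batches(
    texts: Sequence[str],
    embed_fn: Callable[[List[str]], List[List[float]]],
    batch_size: int = 100,
) -> List[List[float]]:
    """Call embed_fn once per batch and concatenate results in input order."""
    embeddings: List[List[float]] = []
    for batch in batched(texts, batch_size):
        embeddings.extend(embed_fn(batch))
    return embeddings


def fake_embed(batch: List[str]) -> List[List[float]]:
    """Stand-in for a real embedding call: one 1-d vector per text."""
    return [[float(len(text))] for text in batch]


vectors = embed_in_batches(["a", "bb", "ccc", "dddd", "eeeee"], fake_embed, batch_size=2)
```

Because the batches are processed in order and each call returns its results in input order, the resulting list lines up with the original chunks and can be zipped with them as in the view.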

Retrieval: Vector Search with pgvector

Now comes the magic part: intelligent search with pgvector.

# services/retrieval.py
from typing import List

from pgvector.django import CosineDistance

from ..models import DocumentChunk  # models.py lives one package level up
from .embeddings import EmbeddingService


class RetrievalService:
    """Retrieve relevant chunks for a query."""

    def __init__(self):
        self.embedding_service = EmbeddingService()

    def retrieve(
        self,
        query: str,
        top_k: int = 5,
        similarity_threshold: float = 0.5
    ) -> List[dict]:
        """Retrieve the top-k most relevant chunks for a query."""
        # Embed the query
        query_embedding = self.embedding_service.embed_text(query)

        # Cosine distance: 0 = identical, 2 = opposite,
        # so similarity = 1 - distance
        similar_chunks = (
            DocumentChunk.objects
            .select_related('document')  # avoid one extra query per chunk
            .annotate(distance=CosineDistance('embedding', query_embedding))
            .filter(distance__lte=1 - similarity_threshold)  # Convert to distance
            .order_by('distance')[:top_k]
        )

        # Format results
        return [
            {
                'document_id': str(chunk.document_id),
                'document_title': chunk.document.title,
                'content': chunk.content,
                'chunk_index': chunk.chunk_index,
                'similarity': 1 - chunk.distance  # Convert back to similarity
            }
            for chunk in similar_chunks
        ]


# Simpler alternative without a similarity threshold:
def retrieve_simple(query: str, top_k: int = 5) -> List[dict]:
    """Simple retrieval that returns the raw cosine distance."""
    embedding_service = EmbeddingService()
    query_embedding = embedding_service.embed_text(query)

    chunks = (
        DocumentChunk.objects
        .select_related('document')
        .annotate(distance=CosineDistance('embedding', query_embedding))
        .order_by('distance')[:top_k]
    )

    return [
        {
            'document_id': str(chunk.document_id),
            'document_title': chunk.document.title,
            'content': chunk.content,
            'distance': float(chunk.distance)  # Raw cosine distance
        }
        for chunk in chunks
    ]

Cosine distance in pgvector is the heart of the RAG pipeline. The database finds the most similar vectors and returns them sorted by distance, all inside PostgreSQL and, with a suitable index, very fast.
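For readers who want to see what the database computes, here is a plain-Python version of cosine distance, defined as one minus the cosine similarity. It is a small sketch for building intuition with made-up two-dimensional vectors, not a replacement for the pgvector operator.

```python
import math


def cosine_distance(a: list[float], b: list[float]) -> float:
    """Cosine distance: 1 minus the cosine similarity of two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


same = cosine_distance([1.0, 0.0], [2.0, 0.0])        # same direction
orthogonal = cosine_distance([1.0, 0.0], [0.0, 3.0])  # unrelated directions
opposite = cosine_distance([1.0, 0.0], [-1.0, 0.0])   # opposite direction

# A minimum similarity of 0.5 corresponds to a maximum distance of 0.5
similarity_threshold = 0.5
max_distance = 1.0 - similarity_threshold
```

The three cases give distances of 0, 1, and 2. Vector length does not matter, only direction, which is why a short chunk and a long chunk on the same topic can still be close to each other.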

RAG Pipeline: From Question to Answer

Now we connect everything: Retrieval + LLM = RAG.

# services/rag.py
from typing import Iterator, List, Tuple

from anthropic import Anthropic

from .retrieval import retrieve_simple

MODEL = "claude-3-5-sonnet-20241022"


class RAGService:
    """Retrieval Augmented Generation."""

    def __init__(self):
        # Reads ANTHROPIC_API_KEY from the environment
        self.client = Anthropic()

    def _build_system_prompt(self, chunks: List[dict]) -> str:
        """Build the system prompt from the retrieved chunks."""
        context = "\n\n".join(
            f"[{c['document_title']}]\n{c['content']}"
            for c in chunks
        )
        return (
            "You are a helpful assistant.\n"
            "You have access to the following documents:\n\n"
            f"{context}\n\n"
            "Use this information to answer the user's question accurately.\n"
            "If the information is not in the documents, say so."
        )

    def generate_response(
        self,
        query: str,
        top_k: int = 5
    ) -> Tuple[str, List[dict]]:
        """Generate a response using RAG.

        Returns:
            (response_text, retrieved_chunks)
        """
        # Retrieve relevant chunks
        chunks = retrieve_simple(query, top_k=top_k)

        if not chunks:
            return "No relevant documents found.", []

        # Call LLM
        response = self.client.messages.create(
            model=MODEL,
            max_tokens=1024,
            system=self._build_system_prompt(chunks),
            messages=[{"role": "user", "content": query}]
        )

        return response.content[0].text, chunks

    def stream_response(
        self,
        query: str,
        top_k: int = 5
    ) -> Iterator[str]:
        """Stream a RAG response."""
        # Retrieve relevant chunks
        chunks = retrieve_simple(query, top_k=top_k)

        if not chunks:
            yield "No relevant documents found."
            return

        # Stream response
        with self.client.messages.stream(
            model=MODEL,
            max_tokens=1024,
            system=self._build_system_prompt(chunks),
            messages=[{"role": "user", "content": query}]
        ) as stream:
            for text in stream.text_stream:
                yield text
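The system prompt grows with every retrieved chunk, so it can be useful to cap how much context is passed to the LLM. One simple option, sketched below, is to keep chunks in ranked order until a word budget is used up. The function name, the budget, and the sample chunks are assumptions for illustration. A real system might count tokens instead of words.

```python
def select_chunks_within_budget(chunks: list[dict], max_words: int) -> list[dict]:
    """Keep chunks in ranked order until the word budget is used up."""
    selected = []
    used = 0
    for chunk in chunks:
        words = len(chunk["content"].split())
        if used + words > max_words:
            break  # stop at the first chunk that no longer fits
        selected.append(chunk)
        used += words
    return selected


# Chunks are assumed to be sorted by relevance, most relevant first
ranked = [
    {"document_title": "Returns", "content": "Refunds take fourteen days."},
    {"document_title": "Shipping", "content": "Shipping takes three to five business days."},
    {"document_title": "Office", "content": "The office is closed on public holidays."},
]
kept = select_chunks_within_budget(ranked, max_words=12)
```

Because the input is sorted by relevance, the budget cuts off the least relevant chunks first. The same chunk dictionaries that `retrieve_simple` returns could be passed through such a filter before the prompt is built.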

And the API endpoint:

# views.py - continued
from django.http import StreamingHttpResponse

from .services.rag import RAGService


class RAGQueryView(APIView):
    """Query RAG system."""
    permission_classes = [IsAuthenticated]

    def post(self, request):
        query = request.data.get('query')
        if not query:
            return Response(
                {"error": "Query is required"},
                status=status.HTTP_400_BAD_REQUEST
            )

        rag_service = RAGService()
        response_text, chunks = rag_service.generate_response(query, top_k=5)

        return Response({
            "response": response_text,
            "sources": chunks
        })


class RAGQueryStreamView(APIView):
    """Stream RAG query."""
    permission_classes = [IsAuthenticated]

    def post(self, request):
        query = request.data.get('query')
        if not query:
            return Response(
                {"error": "Query is required"},
                status=status.HTTP_400_BAD_REQUEST
            )

        def stream():
            rag_service = RAGService()
            for chunk in rag_service.stream_response(query, top_k=5):
                # One Server-Sent Events frame per text chunk
                yield f"data: {json.dumps({'chunk': chunk})}\n\n"

        return StreamingHttpResponse(
            stream(),
            content_type='text/event-stream'
        )
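The streaming view emits Server-Sent Events frames of the form `data: {"chunk": "..."}` followed by a blank line. A consumer has to split the stream into frames and decode the JSON payload of each. The sketch below does this for a complete response body held in a string. The function name and the sample body are made up for this example, and a real client would parse incrementally as bytes arrive.

```python
import json


def parse_sse_chunks(raw: str) -> list[str]:
    """Extract the 'chunk' payloads from a text/event-stream body."""
    chunks = []
    for frame in raw.split("\n\n"):
        frame = frame.strip()
        if not frame.startswith("data: "):
            continue  # skip empty frames and anything that is not a data line
        payload = json.loads(frame[len("data: "):])
        chunks.append(payload["chunk"])
    return chunks


body = (
    'data: {"chunk": "Refunds "}\n\n'
    'data: {"chunk": "take 14 days."}\n\n'
)
answer = "".join(parse_sse_chunks(body))
```

Wrapping each text fragment in JSON, as the view does, keeps newlines and quotes inside the model output from breaking the frame format.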

Production Best Practices

A few important things before the system goes into production:

1. Batch processing for large documents: If you have many large documents, run the embedding step in Celery tasks, not in the HTTP request.

# tasks.py
from celery import shared_task

from .models import Document, DocumentChunk
from .services.chunking import TextChunker
from .services.embeddings import EmbeddingService


@shared_task
def process_document(document_id: str):
    """Process document chunks and generate embeddings (Celery task)."""
    document = Document.objects.get(id=document_id)

    chunker = TextChunker()
    chunks = chunker.chunk_text(document.content)

    embedding_service = EmbeddingService()
    embeddings = embedding_service.embed_texts(chunks)

    chunk_objects = [
        DocumentChunk(
            document=document,
            chunk_index=idx,
            content=chunk_text,
            embedding=embedding
        )
        for idx, (chunk_text, embedding) in enumerate(zip(chunks, embeddings))
    ]

    DocumentChunk.objects.bulk_create(chunk_objects, batch_size=100)

2. Indexing for speed: pgvector can create an approximate index for large vector tables:

-- rag_app_documentchunk is Django's default table name for rag_app.DocumentChunk
CREATE INDEX ON rag_app_documentchunk
USING ivfflat (embedding vector_cosine_ops)
WITH (lists = 100);
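The value `lists = 100` is a starting point, not a universal setting. The pgvector documentation suggests roughly rows / 1000 lists for tables up to one million rows and the square root of the row count above that, with the number of probes at query time starting around the square root of `lists`. The helper below encodes that rule of thumb. The function name is made up, and the results are starting values to benchmark, not guarantees.

```python
import math


def suggest_ivfflat_params(row_count: int) -> dict:
    """Suggest starting values for IVFFlat lists and probes from a row count."""
    if row_count <= 1_000_000:
        lists = max(1, row_count // 1000)
    else:
        lists = int(math.sqrt(row_count))
    probes = max(1, int(math.sqrt(lists)))
    return {"lists": lists, "probes": probes}


small = suggest_ivfflat_params(100_000)
large = suggest_ivfflat_params(4_000_000)
print(small, large)
```

More probes improve recall but slow the query down. In pgvector the probe count is set per session with `SET ivfflat.probes = ...;`. An IVFFlat index should be created after the table already contains data, because the list centers are derived from the existing rows.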

3. Monitoring: Log retrieval quality and latency.

# models.py - continued
from django.db import models


class RetrievalLog(models.Model):
    """One row per retrieval call, for monitoring quality and latency."""
    query = models.TextField()
    num_results = models.IntegerField()
    latency_ms = models.FloatField()
    timestamp = models.DateTimeField(auto_now_add=True)
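To fill such a log, the retrieval call has to be timed. One way to do this, sketched here without Django so that it runs on its own, is a small context manager built on `time.perf_counter`. The helper name is an assumption, and the list of results is a placeholder for a real retrieval call.

```python
import time
from contextlib import contextmanager


@contextmanager
def measure_latency(record: dict):
    """Store the elapsed wall-clock time in ms under record['latency_ms']."""
    start = time.perf_counter()
    try:
        yield record
    finally:
        record["latency_ms"] = (time.perf_counter() - start) * 1000.0


log_entry = {"query": "How long do refunds take?"}
with measure_latency(log_entry):
    results = ["chunk-1", "chunk-2"]  # placeholder for the retrieval call
log_entry["num_results"] = len(results)
```

After the block, `log_entry` holds the same fields as the `RetrievalLog` model apart from the automatic timestamp, so it could be stored with `RetrievalLog.objects.create(**log_entry)`.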


Conclusion

You now have a complete RAG system that:
- uploads and processes documents
- chunks text intelligently
- generates embeddings with OpenAI
- finds similar chunks with pgvector
- feeds an LLM with context

This is not just a cool prototype. It is a production setup of the kind many companies use. The combination of Django, PostgreSQL, and pgvector is stable, cheap, and easy to debug.

Next steps: fine-tune your chunking, experiment with different embedding models, or build a React frontend that uses the streaming RAG endpoint. Or contact e-laborat for an AI readiness check and a code review of your system.