Docker, Monitoring, and Scaling
From Development to Production: Build Once, Deploy Everywhere
You've built the pieces. A robust backend with RAD-System patterns, a beautiful Angular frontend, orchestrated workflows, and flexible LLM provider switching. Now comes the infrastructure: how do you reliably deploy, scale, and monitor this system in production?
This article is the bridge from architecture to operations: Docker, Docker Compose, environment parity, health checks, monitoring, and the scaling strategies that let you handle 10 users or 10,000.
The Deployment Philosophy: Artifact-Based Distribution
The RAG System follows a strict principle: build once, deploy everywhere.
┌─────────────────────────────────────┐
│ Development Environment             │
│   npm run build                     │
│   npm test                          │
│   docker build                      │
└──────────────┬──────────────────────┘
               │
               ▼
      ┌────────────────┐
      │  Docker Image  │
      │  (immutable)   │
      └────────┬───────┘
               │
     ┌─────────┴──┬────────────┐
     ▼            ▼            ▼
┌──────────┐ ┌──────────┐ ┌──────────┐
│   Dev    │ │ Staging  │ │   Prod   │
│   env    │ │   env    │ │   env    │
│ (custom) │ │(mirrored)│ │ (locked) │
└──────────┘ └──────────┘ └──────────┘
Same artifact everywhere. Only environment variables and secrets differ.
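In Compose terms, this can look like the following sketch (the override file name and registry path are illustrative, not from the repo): every environment runs the identical image and only swaps its `env_file`.

```yaml
# docker-compose.prod.yml — hypothetical per-environment override
services:
  backend:
    image: myregistry.com/rag-backend:v1.2.0   # same immutable image in every environment
    env_file:
      - .env.prod   # the only thing that changes between dev, staging, and prod
```

Launched with `docker compose -f docker-compose.yml -f docker-compose.prod.yml up -d`, the override is layered onto the base file, so the image reference never varies between environments.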
Docker Compose Architecture
The system's Docker Compose orchestrates all services:
# Core services
services:
  # Qdrant: Vector database
  qdrant:
    image: qdrant/qdrant:v1.16
    ports: ["6333:6333", "6334:6334"]
    volumes: ["./Volumes/qdrant:/qdrant/storage"]
    networks: ["rag_network"]

  # PostgreSQL: Relational data
  postgresdb:
    image: postgres:16-alpine
    ports: ["5432:5432"]
    volumes: ["./Volumes/postgres:/var/lib/postgresql/data"]
    networks: ["rag_network"]

  # Redis: Caching & sessions
  redis:
    image: redis:7-alpine
    ports: ["6379:6379"]
    volumes: ["./Volumes/redis:/data"]
    networks: ["rag_network"]

  # n8n: Workflow orchestration
  n8n:
    build: "./Images/n8n"
    ports: ["5678:5678"]
    volumes: ["./Volumes/n8n:/home/node/.n8n"]
    networks: ["rag_network"]

  # FastAPI: NLP services
  fastapi:
    build: "./Images/fastapi"
    ports: ["8000:8000"]
    environment:
      - EMBEDDING_MODEL=sentence-transformers/all-MiniLM-L6-v2
      - MAX_WORKERS=4
    networks: ["rag_network"]

  # NestJS Backend
  backend:
    build: "./Images/backend"
    ports: ["3000:3000"]
    environment:
      - DATABASE_URL=postgres://user:pass@postgresdb:5432/rag
      - REDIS_URL=redis://:password@redis:6379
      - N8N_BASE_URL=http://n8n:5678
    networks: ["rag_network"]
    depends_on:
      - postgresdb
      - redis
      - n8n

  # Angular Frontend (dev server)
  frontend:
    build: "./Images/frontend"
    ports: ["4200:4200"]
    networks: ["rag_network"]

networks:
  rag_network:
    driver: bridge
Key principles:
- No external databases: Everything self-contained
- Bind mounts: Data persists across restarts via filesystem volumes
- Network isolation: Services communicate via Docker network
- Environment variables: Configuration external to images
The Build Process
The DevOps/builder/ directory contains production-grade build scripts:
# Build backend
./DevOps/builder/build-be.sh dist
# Build frontend
./DevOps/builder/build-fe.sh prod
# Build Docker images
docker-compose build
# Or build specific services
docker-compose build backend fastapi
Backend Build Pipeline
#!/bin/bash
# 1. Compile TypeScript
npm run build
# 2. Prune dev dependencies
npm ci --only=production
# 3. Copy to Docker volume
cp -r dist node_modules ./Volumes/backend/app/
# 4. Create versioned tarball
tar czf ./update/backend-v1.0.0.tar.gz -C ./Volumes/backend .
echo "✅ Backend built: backend-v1.0.0.tar.gz"
Why tarballs?
- Versioning: Each release is an immutable artifact
- Rollback: Quick revert to previous version
- Offline deployment: No need for npm registry during deployment
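The rollback path falls out of this naturally: re-extract the previous tarball over the app volume. A hypothetical sketch follows; the paths, version numbers, and directory layout are illustrative, not taken from the repo.

```shell
#!/bin/bash
# Hypothetical rollback sketch: restore a previous versioned artifact.
set -euo pipefail

rollback() {
  local app_dir="$1" artifact="$2"
  rm -rf "$app_dir"            # wipe the currently deployed build
  mkdir -p "$app_dir"
  tar xzf "$artifact" -C "$app_dir"
  echo "Rolled back to $(basename "$artifact")"
}

# Self-contained demo: fabricate a fake v0.9.0 artifact, then roll back to it
work=$(mktemp -d)
mkdir -p "$work/build"
echo "v0.9.0" > "$work/build/VERSION"
tar czf "$work/backend-v0.9.0.tar.gz" -C "$work/build" .

rollback "$work/app" "$work/backend-v0.9.0.tar.gz"
cat "$work/app/VERSION"   # → v0.9.0
```

Because every release is an immutable tarball, "revert" is just "extract a different file" — no npm registry, no rebuild.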
Frontend Build Pipeline
#!/bin/bash
# 1. Install dependencies
npm ci
# 2. Build production bundle (optimized)
ng build --configuration production --output-hashing all
# 3. Copy to Docker volume
cp -r dist/rag-fe ./Volumes/frontend/html/
# 4. Compress
tar czf ./update/frontend-v1.0.0.tar.gz -C ./Volumes/frontend .
echo "✅ Frontend built: frontend-v1.0.0.tar.gz"
Environment Configuration
Each environment (dev, staging, prod) has its own .env file:
.env.dev:
NODE_ENV=development
DEBUG=true
DATABASE_URL=postgres://dev:dev@localhost:5432/rag_dev
REDIS_URL=redis://localhost:6379
JWT_SECRET=dev-secret-no-security
LOG_LEVEL=debug
.env.prod:
NODE_ENV=production
DEBUG=false
DATABASE_URL=postgres://prod_user:${SECURE_DB_PASS}@db-prod.example.com:5432/rag
REDIS_URL=redis://:${REDIS_PASSWORD}@redis-prod:6379
JWT_SECRET=${PROD_JWT_SECRET} # From vault
LOG_LEVEL=warn
RATE_LIMIT=1000 # Requests per minute
Loading order (first found wins):
- Command-line arguments
- Environment variables
- .env file
- RagConfigService (database)
- Defaults in code
This allows:
- Local overrides for debugging
- Secrets management via environment
- Database-stored dynamic config
- Safe fallbacks
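The "first found wins" order above can be sketched as a small resolver. This is an illustrative shell version — the key names and `.env` path are assumptions, and a real RagConfigService lookup would slot in before the code default.

```shell
#!/bin/bash
# Sketch of "first found wins" config resolution.
set -euo pipefail

ENV_FILE="${ENV_FILE:-.env}"

resolve() {
  local key="$1" default="$2" cli_value="${3:-}"
  if [ -n "$cli_value" ]; then echo "$cli_value"; return; fi      # 1. Command-line argument
  if [ -n "${!key:-}" ]; then echo "${!key}"; return; fi          # 2. Environment variable
  if [ -f "$ENV_FILE" ] && grep -q "^${key}=" "$ENV_FILE"; then   # 3. .env file
    grep "^${key}=" "$ENV_FILE" | head -n1 | cut -d= -f2-
    return
  fi
  echo "$default"                                                 # 4. Default in code
}

resolve LOG_LEVEL info                    # → info  (nothing set, default wins)
LOG_LEVEL=debug resolve LOG_LEVEL info    # → debug (environment beats the default)
resolve LOG_LEVEL info warn               # → warn  (CLI argument beats everything)
```

The same precedence lets you override a single value locally (`LOG_LEVEL=debug npm start`) without touching any file.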
Volume Management: Data Persistence
Bind mounts map filesystem directories directly into containers:
services:
  qdrant:
    volumes:
      - ./Volumes/qdrant:/qdrant/storage   # Host path → Container path
  postgres:
    volumes:
      - ./Volumes/postgres:/var/lib/postgresql/data
  redis:
    volumes:
      - ./Volumes/redis:/data
Why bind mounts? Direct filesystem access simplifies backups, makes volumes portable, and keeps data structured and visible.
Backup Strategy:
#!/bin/bash
# Backup all volumes
BACKUP_DIR="./backups/$(date +%Y%m%d-%H%M%S)"
mkdir -p "$BACKUP_DIR"
echo "Backing up volumes to $BACKUP_DIR..."

# Qdrant: trigger a full storage snapshot via its HTTP API
curl -s -X POST http://localhost:6333/snapshots

# PostgreSQL
docker exec rag_postgres pg_dump \
  -U postgres rag > "$BACKUP_DIR/postgres.sql"

# Redis: SAVE blocks until dump.rdb is fully written
# (BGSAVE is asynchronous and could be copied mid-write)
docker exec rag_redis redis-cli SAVE
docker cp rag_redis:/data/dump.rdb "$BACKUP_DIR/redis.rdb"

# Tar everything
tar czf "./backups/backup-$(date +%Y%m%d).tar.gz" "$BACKUP_DIR"
echo "✅ Backup complete: $BACKUP_DIR"
Health Checks and Auto-Healing
Every container declares a health check:
services:
  backend:
    image: rag-backend:latest
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:3000/health"]
      interval: 30s
      timeout: 10s
      retries: 3
      start_period: 40s

  fastapi:
    image: rag-fastapi:latest
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8000/health"]
      interval: 30s
      timeout: 5s
      retries: 2
Health endpoints (declared in each service):
@Get('/health')
@Public() // No auth required
async getHealth(): Promise<{ status: string; checks: any }> {
  const services = {
    database: await this.checkDatabase(),
    redis: await this.checkRedis(),
    n8n: await this.checkN8n()
  };
  // Only the service checks decide overall health; uptime is informational
  const allHealthy = Object.values(services)
    .every(check => check.status === 'ok');
  return {
    status: allHealthy ? 'healthy' : 'degraded',
    checks: { ...services, uptime: process.uptime() }
  };
}
A failed health check marks the container unhealthy; paired with a restart policy or an orchestrator (Docker Swarm, Kubernetes), the container is then replaced automatically.
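On its own, the Docker engine only flips the container's status to `unhealthy`. One common way to turn that status into an actual restart — an assumption, not part of this stack — is the community autoheal sidecar:

```yaml
services:
  backend:
    restart: unless-stopped       # restarts the process if it crashes or exits
    labels:
      - autoheal=true             # opt this container into autoheal's watch list
  autoheal:
    image: willfarrell/autoheal   # restarts containers whose health check fails
    restart: always
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock
```

In Swarm or Kubernetes the orchestrator handles this natively, so the sidecar is only needed for plain Compose setups.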
Networking and Service Discovery
Services communicate by name:
// Backend connecting to other services
const POSTGRES_URL = process.env.DATABASE_URL; // postgres://postgresdb:5432/rag
const REDIS_URL = process.env.REDIS_URL; // redis://redis:6379
const N8N_URL = process.env.N8N_BASE_URL; // http://n8n:5678
const FASTAPI_URL = process.env.FASTAPI_URL; // http://fastapi:8000
Docker's internal DNS resolves these names to container IPs automatically.
Port mapping:
Local Network → Docker Bridge → Container Port
localhost:3000 → 172.17.0.2:3000 → backend:3000
localhost:5432 → 172.17.0.3:5432 → postgres:5432
Scaling Strategies
Horizontal Scaling: Multiple Workers
For the compute-heavy FastAPI service:
services:
  fastapi:
    image: rag-fastapi:latest
    # Don't pin one host port per copy: with replicas, scale via
    # `docker compose up -d --scale fastapi=3` and let the reverse
    # proxy reach every replica through the service name
    deploy:
      replicas: 3
      resources:
        limits:
          cpus: "0.5"
          memory: 1G
nginx distributes incoming requests across the workers:
Client Request
      ↓
nginx (reverse proxy)
      ├→ FastAPI replica 1
      ├→ FastAPI replica 2
      └→ FastAPI replica 3
nginx configuration:
upstream fastapi_pool {
    least_conn;  # Send each request to the least-loaded worker
    # Docker's DNS resolves "fastapi" to every running replica
    server fastapi:8000 max_fails=3 fail_timeout=30s;
}

server {
    listen 8000;
    location / {
        proxy_pass http://fastapi_pool;
        proxy_set_header Host $host;
        proxy_connect_timeout 30s;
        proxy_read_timeout 60s;
    }
}
Vertical Scaling: Resource Limits
Docker resource constraints:
services:
  backend:
    deploy:
      resources:
        limits:
          cpus: "1"          # Max 1 CPU core
          memory: 512M       # Max 512MB RAM
        reservations:
          cpus: "0.5"        # Guaranteed 0.5 core
          memory: 256M       # Guaranteed 256MB RAM
Monitor resource usage:
# Check container resource consumption
docker stats
# Output:
# CONTAINER CPU % MEM USAGE / LIMIT MEM % NET I/O
# rag_backend 2.5% 195MiB / 512MiB 38% 3.2MB / 1.5MB
# rag_fastapi 15% 420MiB / 1GiB 42% 8.6MB / 4.2MB
# rag_postgres 8.2% 285MiB / 2GiB 14% 12MB / 8MB
Database Scaling
PostgreSQL replication (for reads):
# Illustrative sketch: the official postgres image needs manual replication
# setup; turnkey variables of this shape come from images like bitnami/postgresql
services:
  postgres-primary:
    image: postgres:16
    environment:
      - POSTGRES_REPLICATION_MODE=master
  postgres-replica:
    image: postgres:16
    environment:
      - POSTGRES_REPLICATION_MODE=slave
      - POSTGRES_MASTER_SERVICE=postgres-primary
Connection pooling (reduce connection overhead):
services:
  pgbouncer:
    image: edoburu/pgbouncer   # community build; there is no bare "pgbouncer" image
    environment:
      - DATABASES_HOST=postgres-primary
      - POOL_MODE=transaction      # Server connections released between transactions
      - DEFAULT_POOL_SIZE=25
      - MAX_CLIENT_CONN=1000
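For the pooler to matter, the backend must connect through it rather than straight to Postgres. A sketch (6432 is PgBouncer's default listen port; the credentials are placeholders):

```yaml
services:
  backend:
    environment:
      # Point the app at PgBouncer instead of postgres-primary
      - DATABASE_URL=postgres://user:pass@pgbouncer:6432/rag
```

With `POOL_MODE=transaction`, 1000 client connections can share just 25 real server connections, since each one is held only for the duration of a transaction.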
Qdrant Clustering
For production scale, use Qdrant cluster:
services:
  qdrant-node1:
    image: qdrant/qdrant:latest
    environment:
      - QDRANT_URI_PATH=/qdrant
      - QDRANT__CLUSTER__ENABLED=true
      - QDRANT__CLUSTER__CONSENSUS__TICK_PERIOD_MS=100
  qdrant-node2:
    image: qdrant/qdrant:latest
    environment:
      - QDRANT_URI_PATH=/qdrant
      - QDRANT__CLUSTER__ENABLED=true
      - QDRANT_BOOTSTRAP__CLUSTER__INITIAL_PEER_TIMEOUT_MS=30000
Monitoring and Logging
Application Health Dashboard
Real-time metrics are exposed at /metrics (Prometheus format):
@Get('/metrics')
@Public()
async getMetrics(): Promise<string> {
  const metrics = {
    'rag_query_duration_seconds': this.queryDurations,
    'rag_embedding_errors_total': this.embeddingErrors,
    'rag_database_connection_pool_size': this.dbPool.size,
    'rag_n8n_workflow_executions_total': this.n8nExecutions,
    'rag_llm_provider_availability': this.providerHealth
  };
  return formatPrometheus(metrics);
}
Prometheus scrapes /metrics every 30 seconds:
global:
  scrape_interval: 30s

scrape_configs:
  - job_name: 'rag-backend'
    static_configs:
      - targets: ['localhost:3000']
  - job_name: 'rag-fastapi'
    static_configs:
      - targets: ['localhost:8000']
Logging Aggregation
All services log to stdout (Docker best practice):
logger.info('Query executed', {
  userId: user.id,
  duration: 1234,
  tokens: { input: 50, output: 100 }
});
Docker captures logs:
# View backend logs
docker logs rag_backend --follow --tail 100
# View all service logs
docker-compose logs --follow
Centralized logging (ELK stack):
services:
  elasticsearch:
    image: docker.elastic.co/elasticsearch/elasticsearch:8.0.0
  logstash:
    image: docker.elastic.co/logstash/logstash:8.0.0
    volumes:
      - ./logstash.conf:/usr/share/logstash/pipeline/logstash.conf
  kibana:
    image: docker.elastic.co/kibana/kibana:8.0.0
    ports:
      - "5601:5601"
Logstash pipes Docker logs to Elasticsearch:
Docker Container → stdout → Logstash → Elasticsearch → Kibana Dashboard
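A minimal `logstash.conf` for that pipeline might look like this — an assumption, on the premise that containers ship logs via Docker's `gelf` log driver to port 12201:

```
input {
  gelf {
    port => 12201          # Docker's gelf log driver sends here
  }
}
output {
  elasticsearch {
    hosts => ["elasticsearch:9200"]
  }
}
```

Kibana then queries Elasticsearch directly; Logstash never talks to it.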
Alerting Rules
Alert when services degrade:
# prometheus-alerts.yml
groups:
- name: rag-system
rules:
- alert: HighQueryLatency
expr: histogram_quantile(0.95, rag_query_duration_seconds) > 5
for: 5m
annotations:
summary: "95th percentile query latency > 5s"
- alert: EmbeddingServiceDown
expr: up{job="rag-fastapi"} == 0
for: 1m
annotations:
summary: "FastAPI embedding service is down"
- alert: DatabaseConnectionPoolExhausted
expr: rag_database_connection_pool_usage > 0.9
for: 2m
annotations:
summary: "Database connection pool >90% full"
Deployment Workflow
Development → Staging → Production
# 1. Tag release
git tag v1.2.0
# 2. Build artifacts
./DevOps/builder/build-be.sh dist
./DevOps/builder/build-fe.sh prod
docker-compose build
# 3. Push to registry
docker tag rag-backend:latest myregistry.com/rag-backend:v1.2.0
docker push myregistry.com/rag-backend:v1.2.0
# 4. Deploy to staging
ssh staging-server
cd /apps/rag-system
docker-compose pull
docker-compose up -d
# 5. Run tests
./scripts/smoke-tests.sh staging
# 6. Deploy to production
ssh prod-server
cd /apps/rag-system
docker-compose pull
docker-compose up -d
# 7. Verify health
curl prod.example.com/health
Blue-Green Deployment
Keep two production versions running:
# Current: Blue (v1.2.0)
# New: Green (v1.3.0)
# Deploy to green
docker-compose -f docker-compose-green.yml pull
docker-compose -f docker-compose-green.yml up -d
# Test green endpoints
./scripts/smoke-tests.sh green
# Switch traffic: point nginx's upstream (producer) at green, then reload
systemctl reload nginx

# If issues appear, point the upstream back at blue and reload again

# Once green is verified, tear down the old blue stack
docker-compose -f docker-compose-blue.yml down
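The traffic switch itself is a one-line change in the nginx upstream — a sketch, with the upstream name taken from the text above and the ports purely illustrative:

```nginx
# nginx.conf — flip production traffic from blue to green
upstream producer {
    server 127.0.0.1:8100;   # green (v1.3.0); change back to 8000 for blue
}
```

Because `systemctl reload nginx` applies the new config without dropping in-flight connections, the cutover (and any rollback) is effectively zero-downtime.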
Production Deployment Checklist
Security
☐ All secrets in environment/vault (not in .env)
☐ HTTPS/TLS enabled
☐ JWT secrets strong (random, unique)
☐ Database password strong
☐ Redis password set
☐ n8n basic auth enabled
☐ Firewall restricts ports
Infrastructure
☐ Resource limits set (CPU, memory)
☐ Health checks configured
☐ Storage backup automated (daily)
☐ Load balancer in front
☐ DNS resolution working
☐ Monitoring and alerts active
Operations
☐ Logging aggregation working
☐ Error tracking (Sentry) configured
☐ On-call rotation established
☐ Incident response plan documented
☐ Runbook for common issues
☐ Change log maintained
Performance
☐ Database indexes created
☐ Redis cache configured
☐ Query N+1 problems resolved
☐ API rate limiting active
☐ CDN configured (for frontend)
Compliance
☐ Data retention policy enforced
☐ Audit logs enabled
☐ GDPR compliance verified
☐ Regular backups tested
The Complete Picture
From this article:
- Artifact-based distribution: Build once, deploy everywhere
- Docker Compose: Self-contained, reproducible environments
- Health checks: Automatic healing when services fail
- Scaling strategies: Both horizontal and vertical
- Monitoring & logging: Real-time visibility
- Backup & disaster recovery: Data protection
- Blue-green deployment: Zero-downtime updates
Looking Ahead
This concludes the RAG System blog series. You've journeyed through:
- Architecture & Privacy - The "why?" behind design decisions
- n8n Workflows - Orchestrating complex operations
- Vector Embeddings - Semantic search at scale
- FastAPI NLP - Python's NLP powerhouse
- NestJS Backend - Enterprise-grade APIs with RAD-System
- Angular Frontend - Beautiful, responsive chat interface
- Provider Switching - LLM flexibility without lock-in
- DevOps & Scaling - Taking systems to production
---
Key Takeaways:
✅ Artifact-based builds = Reproducible, reliable deployments
✅ Docker Compose = Simplified infrastructure
✅ Health checks = Self-healing systems
✅ Resource limits = Stable, predictable performance
✅ Monitoring = Visibility into system health
✅ Scaling strategies = Room to grow
You now have a production-ready RAG system that's:
- Scalable: Add workers, replicas, or resources as needed
- Resilient: Automatic recovery from failures
- Flexible: Swap components without redeployment
- Observable: Full visibility into operations
- Maintainable: Clear separation of concerns
- Cost-effective: Run locally or cloud, free or expensive models
The system is yours to deploy, modify, and scale. Build with it. Learn from it. And most importantly: ship it.
---
GitHub:
- RAD System (open-source): Github source code
- RAG System (source not published): Github RAG System Overview
---
Thank you for reading this series. Whether you're deploying your first RAG system or optimizing the tenth, I hope this architecture provides a solid foundation. The future of AI isn't just about models—it's about systems that are understandable, maintainable, and yours.
Build well. 🚀