Freedom of Choice
Configuration-Driven Architecture for Multi-LLM Support
You've spent the last three articles building a solid backend, fast embeddings, and a beautiful frontend. Now comes the moment that makes this system truly remarkable: you can swap LLM providers without touching code.
Want to use OpenAI's GPT-4 for one context and run Ollama locally for another? Configuration. Need to test Anthropic Claude before committing? Update a JSON file. Running on a restricted network where cloud APIs aren't allowed? Local inference (or an SSH tunnel) has you covered.
This article is about the architecture that makes provider agnosticism not just possible, but elegant.
The Provider Agnostic Philosophy
Most RAG systems hardcode LLM calls:
# Bad: Locked to OpenAI
import os
from openai import OpenAI

client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))
response = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": question}]
)
This approach has consequences:
- Changing providers requires code changes
- Different provider APIs mean different code paths
- Testing alternatives is friction-filled
- Lock-in to provider-specific features
The RAG System's Approach:
Configuration Layer (decide: which provider, which model, where to send requests)
↓
Provider Interface Layer (normalize requests and responses)
↓
n8n Orchestration Layer (route to appropriate service)
↓
Provider-Specific Services (FastAPI, LM Studio, etc.)
The key: configuration drives behavior. No code changes after deployment.
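Before diving into the full implementation, the core idea fits in a few lines: provider selection is a pure function of configuration, so changing behavior means editing data, not code. A minimal sketch (the provider names and shapes here are illustrative, not the system's actual types):

```typescript
// Sketch: configuration decides the provider at request time.
type ProviderName = "openai" | "ollama";

interface RoutingConfig {
  default: ProviderName;
  contextSpecific: Record<string, ProviderName>;
}

// Pure function: no provider logic is hardcoded in the call path.
function pickProvider(routing: RoutingConfig, contextKey?: string): ProviderName {
  if (contextKey && routing.contextSpecific[contextKey]) {
    return routing.contextSpecific[contextKey];
  }
  return routing.default;
}

const routing: RoutingConfig = {
  default: "openai",
  contextSpecific: { sensitive_data: "ollama" },
};

console.log(pickProvider(routing, "sensitive_data")); // "ollama"
console.log(pickProvider(routing));                   // "openai"
```

Re-pointing a context at a different provider is a one-line config change; the function never changes.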
Supported Providers
The system supports:
Cloud APIs
- OpenAI: GPT-4, GPT-3.5-turbo, GPT-4 Turbo
- Google Gemini: Gemini 2.0 Flash, Pro, Vision
- Anthropic Claude: Claude 3.5 Sonnet, Opus, Haiku
- Groq: Models optimized for inference speed
Local Engines
- Ollama: Easy local model management (llama2, mistral, etc.)
- LM Studio: GUI + API for local models
- vLLM: High-throughput LLM server
- llama.cpp: Optimized C++ inference
Custom HTTP Endpoints
- Any service following a standard interface
Configuration Structure
Provider configurations live in the database under the rag_config table, keyed as PROVIDER_CONFIG:
{
  "version": "1.0.0",
  "providers": {
    "openai": {
      "enabled": true,
      "type": "http_api",
      "endpoint": "https://api.openai.com/v1",
      "apiKey": "${OPENAI_API_KEY}",
      "models": [
        {
          "id": "gpt-4",
          "name": "GPT-4",
          "type": "chat",
          "costPer1kTokens": 0.03,
          "contextWindow": 8192
        }
      ],
      "defaultModel": "gpt-4",
      "timeout": 120,
      "retryPolicy": {
        "maxRetries": 3,
        "backoffMultiplier": 2
      }
    },
    "ollama": {
      "enabled": true,
      "type": "http_api",
      "endpoint": "http://localhost:11434",
      "models": [
        {
          "id": "llama2:13b",
          "name": "Llama 2 (13B)",
          "type": "chat",
          "costPer1kTokens": 0,
          "contextWindow": 4096
        }
      ],
      "defaultModel": "llama2:13b",
      "timeout": 300,
      "local": true
    },
    "claude": {
      "enabled": false,
      "type": "http_api",
      "endpoint": "https://api.anthropic.com",
      "apiKey": "${ANTHROPIC_API_KEY}",
      "models": [
        {
          "id": "claude-3-5-sonnet",
          "name": "Claude 3.5 Sonnet",
          "type": "chat",
          "costPer1kTokens": 0.015,
          "contextWindow": 200000
        }
      ],
      "defaultModel": "claude-3-5-sonnet"
    }
  },
  "routing": {
    "default": "openai",
    "contextSpecific": {
      "hr_policies": "openai",
      "internal_wiki": "ollama",
      "sensitive_data": "ollama"
    },
    "fallback": "openai"
  }
}
Key Points:
- enabled flag: Disable providers without removing config
- type: Either http_api or SSH-based (for remote inference)
- endpoint: URL to the provider service
- apiKey: Referenced as env var (${OPENAI_API_KEY}) for security
- models: Available models with capabilities and costs
- defaultModel: Assumed if not specified in request
- local flag: Marks providers that don't require internet
- routing: Which provider to use per context + fallback strategy
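One common way to honor the ${OPENAI_API_KEY}-style references is to substitute environment variables when the config is loaded, so secrets never sit in the database. A minimal sketch (the function name is mine, not the system's):

```typescript
// Sketch: resolve "${VAR}" placeholders in stored config against the
// environment at load time; fail fast if a referenced variable is missing.
function resolveSecrets(
  value: string,
  env: Record<string, string | undefined>
): string {
  return value.replace(/\$\{([A-Z0-9_]+)\}/g, (_match, name) => {
    const resolved = env[name];
    if (resolved === undefined) {
      throw new Error(`Missing environment variable: ${name}`);
    }
    return resolved;
  });
}

const apiKey = resolveSecrets("${OPENAI_API_KEY}", { OPENAI_API_KEY: "sk-test" });
console.log(apiKey); // "sk-test"
```

In production, `env` would be `process.env`; failing loudly on a missing variable beats sending a literal `${OPENAI_API_KEY}` string to a provider.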
How It Works: The Routing Layer
When a user asks a question:
User Query
↓
NestJS Backend /chat/query endpoint
↓
Extract: question, contextId, optional modelId
↓
RagProxyService.proxyQuery(dto)
↓
Check routing rules:
- Is modelId specified? Use that provider
- Is context-specific routing defined? Use that
- Fall back to "default" provider
↓
Build normalized request for chosen provider
↓
n8n Workflow: Embed query, Search Qdrant, Retrieve context
↓
Send context + question to chosen LLM
↓
LLM response is normalized to standard format
↓
Return to frontend
The Routing Service
@Injectable()
export class LlmRoutingService {
  constructor(
    private providerConfigService: ProviderConfigService,
    private httpService: HttpService
  ) {}

  /**
   * Determine which provider to use for this request
   */
  resolveProvider(
    requestedModelId?: string,
    contextId?: number
  ): IProviderConfig {
    const config = this.providerConfigService.getConfig();

    // 1. Explicit model requested
    if (requestedModelId) {
      for (const providerConfig of Object.values(config.providers)) {
        if (providerConfig.enabled) {
          const model = providerConfig.models.find(m => m.id === requestedModelId);
          if (model) {
            return providerConfig;
          }
        }
      }
    }

    // 2. Context-specific routing
    if (contextId && config.routing.contextSpecific[contextId]) {
      const providerName = config.routing.contextSpecific[contextId];
      const provider = config.providers[providerName];
      if (provider?.enabled) {
        return provider;
      }
    }

    // 3. Default provider
    const defaultProviderName = config.routing.default;
    const defaultProvider = config.providers[defaultProviderName];
    if (defaultProvider?.enabled) {
      return defaultProvider;
    }

    // 4. Fallback: first enabled provider
    for (const providerConfig of Object.values(config.providers)) {
      if (providerConfig.enabled) {
        return providerConfig;
      }
    }

    throw new Error('No LLM provider available');
  }

  /**
   * Get request/response normalizer for a specific provider
   */
  getNormalizer(provider: IProviderConfig): LlmNormalizer {
    switch (provider.name) {
      case 'openai':
        return new OpenAiNormalizer();
      case 'claude':
        return new ClaudeNormalizer();
      case 'gemini':
        return new GeminiNormalizer();
      case 'ollama':
        return new OllamaNormalizer();
      default:
        return new GenericHttpNormalizer();
    }
  }
}
Provider Normalizers: Abstracting Away Differences
Different providers have different request/response formats. A normalizer converts between them.
OpenAI Format (de facto standard)
{
  "model": "gpt-4",
  "messages": [
    {"role": "system", "content": "You are helpful..."},
    {"role": "user", "content": "What is RAG?"}
  ],
  "temperature": 0.7,
  "max_tokens": 500
}
Response:
{
  "choices": [{
    "message": {"role": "assistant", "content": "RAG is..."},
    "finish_reason": "stop"
  }],
  "usage": {"prompt_tokens": 50, "completion_tokens": 100}
}
Claude Format (Different structure)
{
  "model": "claude-3-5-sonnet-20241022",
  "max_tokens": 500,
  "system": "You are helpful...",
  "messages": [
    {"role": "user", "content": "What is RAG?"}
  ]
}
Response:
{
  "content": [{"type": "text", "text": "RAG is..."}],
  "usage": {"input_tokens": 50, "output_tokens": 100}
}
Normalizer Pattern
interface LlmNormalizer {
  normalize(query: string, context: string): any;
  denormalize(response: any): { answer: string; usage: Usage };
}

class OpenAiNormalizer implements LlmNormalizer {
  normalize(query: string, context: string): any {
    return {
      model: "gpt-4",
      messages: [
        { role: "system", content: `Context:\n${context}` },
        { role: "user", content: query }
      ],
      temperature: 0.7,
      max_tokens: 1000
    };
  }

  denormalize(response: any): { answer: string; usage: Usage } {
    return {
      answer: response.choices[0].message.content,
      usage: {
        inputTokens: response.usage.prompt_tokens,
        outputTokens: response.usage.completion_tokens
      }
    };
  }
}

class ClaudeNormalizer implements LlmNormalizer {
  normalize(query: string, context: string): any {
    return {
      model: "claude-3-5-sonnet-20241022",
      max_tokens: 1000,
      system: `Context:\n${context}`,
      messages: [
        { role: "user", content: query }
      ]
    };
  }

  denormalize(response: any): { answer: string; usage: Usage } {
    return {
      answer: response.content[0].text,
      usage: {
        inputTokens: response.usage.input_tokens,
        outputTokens: response.usage.output_tokens
      }
    };
  }
}

class OllamaNormalizer implements LlmNormalizer {
  normalize(query: string, context: string): any {
    return {
      model: "llama2:13b",
      prompt: `${context}\n\nQuestion: ${query}\n\nAnswer:`,
      stream: false
    };
  }

  denormalize(response: any): { answer: string; usage: Usage } {
    return {
      answer: response.response,
      usage: {
        inputTokens: response.prompt_eval_count || 0,
        outputTokens: response.eval_count || 0
      }
    };
  }
}
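The property worth verifying here: whatever raw shape comes back from the provider, the denormalized result is identical. The self-contained check below mirrors the OpenAI and Claude response formats shown earlier (mock payloads, not live API calls):

```typescript
// Sanity check: two different provider responses denormalize to one shape.
interface Usage { inputTokens: number; outputTokens: number }

function denormalizeOpenAi(r: any): { answer: string; usage: Usage } {
  return {
    answer: r.choices[0].message.content,
    usage: { inputTokens: r.usage.prompt_tokens, outputTokens: r.usage.completion_tokens },
  };
}

function denormalizeClaude(r: any): { answer: string; usage: Usage } {
  return {
    answer: r.content[0].text,
    usage: { inputTokens: r.usage.input_tokens, outputTokens: r.usage.output_tokens },
  };
}

// Mock payloads shaped like the formats above
const fromOpenAi = denormalizeOpenAi({
  choices: [{ message: { role: "assistant", content: "RAG is..." }, finish_reason: "stop" }],
  usage: { prompt_tokens: 50, completion_tokens: 100 },
});

const fromClaude = denormalizeClaude({
  content: [{ type: "text", text: "RAG is..." }],
  usage: { input_tokens: 50, output_tokens: 100 },
});

console.log(JSON.stringify(fromOpenAi) === JSON.stringify(fromClaude)); // true
```

Everything downstream of the normalizer (frontend, cost tracking, logging) only ever sees `{ answer, usage }`.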
SSH-Based Remote Inference
For organizations with isolated networks (no internet), the rag-ssh module enables remote inference on a secure machine.
Scenario:
- Main RAG system is in a restricted network
- A remote machine (with internet) runs LLM services
- SSH tunnel connects them securely
# On the remote machine (the one with internet access)
ssh -N -R 8000:localhost:8000 rag_service@10.0.0.1
# Now http://localhost:8000 on the main RAG server (10.0.0.1)
# forwards to http://localhost:8000 on the remote machine
In configuration:
{
  "providers": {
    "openai_remote": {
      "enabled": true,
      "type": "ssh",
      "sshHost": "10.0.0.1",
      "sshPort": 22,
      "sshUser": "rag_service",
      "sshKeyPath": "/secure/keys/rag_ssh_key",
      "localPort": 8000,
      "remoteEndpoint": "https://api.openai.com/v1"
    }
  }
}
When a query uses this provider:
- SSH connection authenticates
- Request is tunneled to remote machine
- Remote machine calls OpenAI API
- Response comes back through tunnel
- No sensitive data on local network
Hybrid Mode: Different Providers for Different Contexts
The RAG system shines with hybrid setups:
{
  "routing": {
    "contextSpecific": {
      "public_docs": "openai",           // Fast, high quality
      "internal_policies": "ollama",     // Local, no data leaving
      "research_papers": "claude",       // Better for long context
      "customer_data": "ollama_isolated" // Airgapped, highest security
    }
  }
}
Benefits:
- Cost optimization: Use expensive API for complex queries, cheap local for simple ones
- Data privacy: Sensitive data stays on local Ollama
- Performance: Route complex analysis to better models
- Compliance: Different contexts follow different data handling rules
Model Evaluation and Cost Analysis
Configuration includes model capabilities and costs:
const models = config.providers.openai.models;

// Find most cost-effective model for this query
const cheapest = models.reduce((a, b) =>
  a.costPer1kTokens < b.costPer1kTokens ? a : b
);

// Find best quality model (quality is an admin-maintained score)
const bestQuality = models.reduce((a, b) =>
  a.quality > b.quality ? a : b
);

// Adaptive routing based on query length (~4 characters per token)
const estimatedTokens = question.length / 4; // Rough estimate
if (estimatedTokens > 10000) {
  // Use a model with a larger context window
  return models.find(m => m.contextWindow > 20000);
}
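The same idea, made concrete as a small testable helper: pick the cheapest model whose context window can hold the estimated prompt. The model data and the 4-characters-per-token heuristic are illustrative assumptions, not values from the system:

```typescript
// Illustrative model metadata; real values come from the provider config.
interface ModelInfo {
  id: string;
  costPer1kTokens: number;
  contextWindow: number;
}

// Rough heuristic: ~4 characters per token for English text.
function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4);
}

// Cheapest model whose context window can hold the estimated prompt.
function pickModel(models: ModelInfo[], prompt: string): ModelInfo | undefined {
  const needed = estimateTokens(prompt);
  return models
    .filter(m => m.contextWindow >= needed)
    .sort((a, b) => a.costPer1kTokens - b.costPer1kTokens)[0];
}

const models: ModelInfo[] = [
  { id: "gpt-3.5-turbo", costPer1kTokens: 0.002, contextWindow: 16385 },
  { id: "gpt-4-turbo", costPer1kTokens: 0.01, contextWindow: 128000 },
];

console.log(pickModel(models, "x".repeat(8000))?.id);   // "gpt-3.5-turbo"
console.log(pickModel(models, "x".repeat(100000))?.id); // "gpt-4-turbo"
```

Short prompts land on the cheap model automatically; only prompts that overflow its window pay for the larger one.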
Admins can update /admin/models to add cost data gleaned from recent API calls:
{
  "models": [
    {
      "id": "gpt-4",
      "actualCostPer1kTokens": 0.031, // Updated from last month's billing
      "avgLatency": 1500,             // milliseconds
      "errorRate": 0.002,             // 0.2%
      "quality": 9.5                  // 1-10 scale
    }
  ]
}
Cost Reporting
@Injectable()
export class CostTrackingService {
  async trackQuery(
    provider: string,
    model: string,
    inputTokens: number,
    outputTokens: number
  ): Promise<void> {
    const config = this.providerConfigService.getConfig();
    const modelConfig = config.providers[provider]
      .models.find(m => m.id === model);
    if (!modelConfig) return;

    const costPer1k = modelConfig.costPer1kTokens || 0;
    const totalTokens = inputTokens + outputTokens;
    const cost = (totalTokens / 1000) * costPer1k;

    // Log for billing and analytics
    await this.costRepository.save({
      provider,
      model,
      inputTokens,
      outputTokens,
      cost,
      timestamp: new Date()
    });
  }
}
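A quick sanity check of the per-query arithmetic, using the costPer1kTokens rate from the provider config (the helper function is mine, for illustration):

```typescript
// cost = (inputTokens + outputTokens) / 1000 * costPer1kTokens
function queryCost(
  inputTokens: number,
  outputTokens: number,
  costPer1kTokens: number
): number {
  return ((inputTokens + outputTokens) / 1000) * costPer1kTokens;
}

// 1,500 prompt tokens + 500 completion tokens at $0.03 per 1K tokens:
console.log(queryCost(1500, 500, 0.03)); // 0.06 (six cents)
```

At GPT-4's configured rate, a typical RAG query with a few retrieved chunks costs a handful of cents; the same query on local Ollama costs zero, which is what makes hybrid routing pay off.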
Generate usage reports:
async getCostReport(startDate: Date, endDate: Date): Promise<CostReport> {
  const entries = await this.costRepository.find({
    where: { timestamp: Between(startDate, endDate) }
  });

  const byProvider = groupBy(entries, 'provider');
  const byModel = groupBy(entries, 'model');

  return {
    totalCost: entries.reduce((sum, e) => sum + e.cost, 0),
    byProvider: Object.entries(byProvider).map(([provider, costs]) => ({
      provider,
      totalCost: costs.reduce((sum, c) => sum + c.cost, 0),
      count: costs.length
    })),
    byModel: Object.entries(byModel).map(([model, costs]) => ({
      model,
      totalCost: costs.reduce((sum, c) => sum + c.cost, 0),
      count: costs.length
    }))
  };
}
Fallback Strategies
What if the primary provider is down?
async proxyQuery(dto: RagQueryDto): Promise<any> {
  const config = this.providerConfigService.getConfig();

  // Build provider candidates in order of preference (deduplicated,
  // so the same provider isn't retried twice)
  const candidates = [...new Set([
    config.routing.contextSpecific[dto.contextId],
    config.routing.default,
    ...Object.keys(config.providers) // All others as fallback
  ])].filter(name => config.providers[name]?.enabled);

  for (const providerName of candidates) {
    try {
      const provider = config.providers[providerName];
      const normalizer = this.getNormalizer(provider);
      const normalized = normalizer.normalize(dto.question, dto.context);
      const response = await this.callProvider(provider, normalized);
      return normalizer.denormalize(response);
    } catch (error) {
      this.logger.warn(
        `Provider ${providerName} failed: ${error.message}`
      );
      continue; // Try next provider
    }
  }

  throw new Error('All LLM providers failed');
}
Configuration:
{
  "routing": {
    "default": "openai",
    "fallback": "claude",        // If openai fails
    "tertiaryFallback": "ollama" // If both fail
  }
}
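Failures within a single provider are a separate concern from failing over between providers: that is what the retryPolicy fields in the config (maxRetries, backoffMultiplier) are for. A sketch of how they could be applied, not the system's actual implementation:

```typescript
// Apply a provider's retryPolicy with exponential backoff:
// delays of base, base*m, base*m^2, ... between attempts.
interface RetryPolicy {
  maxRetries: number;
  backoffMultiplier: number;
}

const sleep = (ms: number) => new Promise<void>(resolve => setTimeout(resolve, ms));

async function withRetry<T>(
  call: () => Promise<T>,
  policy: RetryPolicy,
  baseDelayMs = 1000
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 0; attempt <= policy.maxRetries; attempt++) {
    try {
      return await call();
    } catch (error) {
      lastError = error;
      if (attempt < policy.maxRetries) {
        await sleep(baseDelayMs * Math.pow(policy.backoffMultiplier, attempt));
      }
    }
  }
  throw lastError;
}

// Demo: fails twice, succeeds on the third attempt.
let attempts = 0;
withRetry(
  async () => {
    attempts++;
    if (attempts < 3) throw new Error("transient failure");
    return "ok";
  },
  { maxRetries: 3, backoffMultiplier: 2 },
  1 // 1ms base delay, for the demo only
).then(result => console.log(result, attempts));
```

The division of labor: `withRetry` absorbs transient errors (rate limits, timeouts) inside one provider; only when retries are exhausted does the candidate loop above move to the next provider.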
Monitoring and Alerting
Track provider health:
@Injectable()
export class ProviderHealthService {
  async checkProviderHealth(): Promise<ProviderStatus[]> {
    const config = this.providerConfigService.getConfig();
    const statuses: ProviderStatus[] = [];

    for (const [name, provider] of Object.entries(config.providers)) {
      let status: 'healthy' | 'degraded' | 'down' = 'healthy';
      let latency = 0;
      let errorRate = 0;

      try {
        const startTime = Date.now();
        await this.pingProvider(provider);
        latency = Date.now() - startTime;

        // Get error rate from last 100 queries
        const recentErrors = await this.getRecentErrorCount(name);
        errorRate = recentErrors / 100;

        if (latency > 5000 || errorRate > 0.1) {
          status = 'degraded';
        }
      } catch (error) {
        status = 'down';
      }

      statuses.push({
        provider: name,
        status,
        latency,
        errorRate,
        timestamp: new Date()
      });
    }

    return statuses;
  }

  async generateHealthReport(): Promise<void> {
    const statuses = await this.checkProviderHealth();

    for (const status of statuses) {
      if (status.status === 'down') {
        await this.alertService.sendAlert(
          `Provider ${status.provider} is DOWN`,
          `severity: critical`
        );
      } else if (status.status === 'degraded') {
        await this.alertService.sendAlert(
          `Provider ${status.provider} is degraded`,
          `latency: ${status.latency}ms, error_rate: ${status.errorRate}`
        );
      }
    }
  }
}
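The thresholds buried inside checkProviderHealth can be factored into a pure function, which makes them trivially testable. This is a refactoring suggestion using the same numbers as the code above (5s latency, 10% error rate), not the system's code:

```typescript
// Health classification with the thresholds from checkProviderHealth():
// unreachable = down; >5s latency or >10% error rate = degraded.
type Health = "healthy" | "degraded" | "down";

function classify(reachable: boolean, latencyMs: number, errorRate: number): Health {
  if (!reachable) return "down";
  if (latencyMs > 5000 || errorRate > 0.1) return "degraded";
  return "healthy";
}

console.log(classify(true, 800, 0.01)); // "healthy"
console.log(classify(true, 6000, 0));   // "degraded"
console.log(classify(false, 0, 0));     // "down"
```

Keeping the classification pure means the alerting thresholds can be unit-tested and later moved into the config alongside everything else.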
Prompt Engineering Per Provider
Different models benefit from different prompts:
interface PromptTemplate {
  system: string;
  userPrefix: string;
  citations: string;
}

const promptTemplates: Record<string, PromptTemplate> = {
  'openai': {
    system: 'You are a helpful assistant. Answer based on provided context.',
    userPrefix: 'Based on the following context:\n',
    citations: '\nCite specific sources [1], [2], etc.'
  },
  'claude': {
    system: 'You are Claude, an AI assistant made by Anthropic.',
    userPrefix: 'I have the following context:\n',
    citations: '\nPlease include citations to source materials.'
  },
  'ollama': {
    system: 'You are a helpful AI assistant.',
    userPrefix: 'Context:\n',
    citations: '\nInclude citations.'
  }
};

function buildPrompt(provider: string, question: string, context: string): string {
  // Fall back to the generic local template for providers without a custom one
  const template = promptTemplates[provider] ?? promptTemplates['ollama'];
  return `${template.system}\n\n${template.userPrefix}${context}\n\nQuestion: ${question}${template.citations}`;
}
Looking Ahead
The LLM provider switching layer makes your system resilient, cost-effective, and future-proof. You're not betting on OpenAI. You're not locked into local inference for everything.
In the final article, we'll bring everything together: DevOps, Deployment, and Scaling. Docker, Kubernetes, monitoring, and taking this system from laptop to production.
---
Key Takeaways:
✅ Configuration-driven: Swap providers without code changes
✅ Normalized interfaces: Abstracts provider differences
✅ Hybrid routing: Different contexts use different providers
✅ Cost tracking: Monitor spending across providers
✅ Fallback strategies: System stays up when a provider fails
✅ Remote inference: SSH tunnels for isolated networks
✅ Health monitoring: Alerts when providers degrade
No more "We can only use OpenAI." Now it's "We can use whoever makes sense."
---
GitHub:
- RAD System (open-source): GitHub source code
- RAG System (source not published): GitHub RAG System Overview