Freedom of Choice

Configuration-Driven Architecture for Multi-LLM Support

Over the last three articles you've built a solid backend, fast embeddings, and a polished frontend. Now comes the moment that makes this system truly remarkable: you can swap LLM providers without touching code.

Want to use OpenAI's GPT-4 for one context and run Ollama locally for another? Configuration. Need to test Anthropic Claude before committing? Update a JSON file. Running on a restricted network where cloud APIs aren't allowed? Point the same configuration at a local engine or an SSH tunnel.

This article is about the architecture that makes provider agnosticism not just possible, but elegant.

The Provider Agnostic Philosophy

Most RAG systems hardcode LLM calls:


# Bad: Locked to OpenAI
import os
from openai import OpenAI

client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))
response = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": question}]
)

This approach has consequences:

  • Changing providers requires code changes
  • Different provider APIs mean different code paths
  • Testing alternatives is friction-filled
  • Lock-in to provider-specific features

The RAG System's Approach:


Configuration Layer (decide: which provider, which model, where to send requests)
         ↓
Provider Interface Layer (normalize requests and responses)
         ↓
n8n Orchestration Layer (route to appropriate service)
         ↓
Provider-Specific Services (FastAPI, LM Studio, etc.)

The key: configuration drives behavior. No code changes after deployment.

Supported Providers

The system supports:

Cloud APIs

  • OpenAI: GPT-4, GPT-3.5-turbo, GPT-4 Turbo
  • Google Gemini: Gemini 2.0 Flash, Pro, Vision
  • Anthropic Claude: Claude 3.5 Sonnet, Opus, Haiku
  • Groq: Models optimized for inference speed

Local Engines

  • Ollama: Easy local model management (llama2, mistral, etc.)
  • LM Studio: GUI + API for local models
  • vLLM: High-throughput LLM server
  • llama.cpp: Optimized C++ inference

Custom HTTP Endpoints

  • Any service following a standard interface

Configuration Structure

Provider configurations live in the database under the rag_config table, keyed as PROVIDER_CONFIG:


{
  "version": "1.0.0",
  "providers": {
    "openai": {
      "enabled": true,
      "type": "http_api",
      "endpoint": "https://api.openai.com/v1",
      "apiKey": "${OPENAI_API_KEY}",
      "models": [
        {
          "id": "gpt-4",
          "name": "GPT-4",
          "type": "chat",
          "costPer1kTokens": 0.03,
          "contextWindow": 8192
        }
      ],
      "defaultModel": "gpt-4",
      "timeout": 120,
      "retryPolicy": {
        "maxRetries": 3,
        "backoffMultiplier": 2
      }
    },
    "ollama": {
      "enabled": true,
      "type": "http_api",
      "endpoint": "http://localhost:11434",
      "models": [
        {
          "id": "llama2:13b",
          "name": "Llama 2 (13B)",
          "type": "chat",
          "costPer1kTokens": 0,
          "contextWindow": 4096
        }
      ],
      "defaultModel": "llama2:13b",
      "timeout": 300,
      "local": true
    },
    "claude": {
      "enabled": false,
      "type": "http_api",
      "endpoint": "https://api.anthropic.com",
      "apiKey": "${ANTHROPIC_API_KEY}",
      "models": [
        {
          "id": "claude-3-5-sonnet",
          "name": "Claude 3.5 Sonnet",
          "type": "chat",
          "costPer1kTokens": 0.015,
          "contextWindow": 200000
        }
      ],
      "defaultModel": "claude-3-5-sonnet"
    }
  },
  "routing": {
    "default": "openai",
    "contextSpecific": {
      "hr_policies": "openai",
      "internal_wiki": "ollama",
      "sensitive_data": "ollama"
    },
    "fallback": "openai"
  }
}

Key Points:

  1. enabled flag: Disable providers without removing config
  2. type: Either http_api or ssh (SSH-tunneled remote inference)
  3. endpoint: URL to the provider service
  4. apiKey: Referenced as env var (${OPENAI_API_KEY}) for security
  5. models: Available models with capabilities and costs
  6. defaultModel: Assumed if not specified in request
  7. local flag: Marks providers that don't require internet
  8. routing: Which provider to use per context + fallback strategy

How It Works: The Routing Layer

When a user asks a question:


User Query
    ↓
NestJS Backend /chat/query endpoint
    ↓
Extract: question, contextId, optional modelId
    ↓
RagProxyService.proxyQuery(dto)
    ↓
Check routing rules:
  - Is modelId specified? Use that provider
  - Is context-specific routing defined? Use that
  - Fall back to "default" provider
    ↓
Build normalized request for chosen provider
    ↓
n8n Workflow: Embed query, Search Qdrant, Retrieve context
    ↓
Send context + question to chosen LLM
    ↓
LLM response is normalized to standard format
    ↓
Return to frontend

The Routing Service


@Injectable()
export class LlmRoutingService {
  constructor(
    private providerConfigService: ProviderConfigService,
    private httpService: HttpService
  ) {}

  /**
   * Determine which provider to use for this request
   */
  resolveProvider(
    requestedModelId?: string,
    contextId?: number
  ): IProviderConfig {
    const config = this.providerConfigService.getConfig();

    // 1. Explicit model requested
    if (requestedModelId) {
      for (const [providerName, providerConfig] of Object.entries(config.providers)) {
        if (providerConfig.enabled) {
          const model = providerConfig.models.find(m => m.id === requestedModelId);
          if (model) {
            return providerConfig;
          }
        }
      }
    }

    // 2. Context-specific routing
    if (contextId && config.routing.contextSpecific[contextId]) {
      const providerName = config.routing.contextSpecific[contextId];
      const provider = config.providers[providerName];
      if (provider?.enabled) {
        return provider;
      }
    }

    // 3. Default provider
    const defaultProviderName = config.routing.default;
    const defaultProvider = config.providers[defaultProviderName];
    if (defaultProvider?.enabled) {
      return defaultProvider;
    }

    // 4. Fallback: First enabled provider
    for (const [providerName, providerConfig] of Object.entries(config.providers)) {
      if (providerConfig.enabled) {
        return providerConfig;
      }
    }

    throw new Error('No LLM provider available');
  }

  /**
   * Get request/response format for a specific provider
   */
  getNormalizer(provider: IProviderConfig): LlmNormalizer {
    switch (provider.name) {
      case 'openai':
        return new OpenAiNormalizer();
      case 'claude':
        return new ClaudeNormalizer();
      case 'gemini':
        return new GeminiNormalizer();
      case 'ollama':
        return new OllamaNormalizer();
      default:
        return new GenericHttpNormalizer();
    }
  }
}

Provider Normalizers: Abstracting Away Differences

Different providers have different request/response formats. A normalizer converts between them.

OpenAI Format (de facto standard)


{
  "model": "gpt-4",
  "messages": [
    {"role": "system", "content": "You are helpful..."},
    {"role": "user", "content": "What is RAG?"}
  ],
  "temperature": 0.7,
  "max_tokens": 500
}

Response:


{
  "choices": [{
    "message": {"role": "assistant", "content": "RAG is..."},
    "finish_reason": "stop"
  }],
  "usage": {"prompt_tokens": 50, "completion_tokens": 100}
}

Claude Format (different structure)


{
  "model": "claude-3-5-sonnet-20241022",
  "max_tokens": 500,
  "system": "You are helpful...",
  "messages": [
    {"role": "user", "content": "What is RAG?"}
  ]
}

Response:


{
  "content": [{"type": "text", "text": "RAG is..."}],
  "usage": {"input_tokens": 50, "output_tokens": 100}
}

Normalizer Pattern


interface LlmNormalizer {
  normalize(query: string, context: string): any;
  denormalize(response: any): { answer: string; usage: Usage };
}

class OpenAiNormalizer implements LlmNormalizer {
  normalize(query: string, context: string): any {
    return {
      model: "gpt-4",
      messages: [
        { role: "system", content: `Context:\n${context}` },
        { role: "user", content: query }
      ],
      temperature: 0.7,
      max_tokens: 1000
    };
  }

  denormalize(response: any): { answer: string; usage: Usage } {
    return {
      answer: response.choices[0].message.content,
      usage: {
        inputTokens: response.usage.prompt_tokens,
        outputTokens: response.usage.completion_tokens
      }
    };
  }
}

class ClaudeNormalizer implements LlmNormalizer {
  normalize(query: string, context: string): any {
    return {
      model: "claude-3-5-sonnet-20241022",
      max_tokens: 1000,
      system: `Context:\n${context}`,
      messages: [
        { role: "user", content: query }
      ]
    };
  }

  denormalize(response: any): { answer: string; usage: Usage } {
    return {
      answer: response.content[0].text,
      usage: {
        inputTokens: response.usage.input_tokens,
        outputTokens: response.usage.output_tokens
      }
    };
  }
}

class OllamaNormalizer implements LlmNormalizer {
  normalize(query: string, context: string): any {
    return {
      model: "llama2:13b",
      prompt: `${context}\n\nQuestion: ${query}\n\nAnswer:`,
      stream: false
    };
  }

  denormalize(response: any): { answer: string; usage: Usage } {
    return {
      answer: response.response,
      usage: {
        inputTokens: response.prompt_eval_count || 0,
        outputTokens: response.eval_count || 0
      }
    };
  }
}
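To see the pattern end to end, here is a round trip through the Ollama normalizer with a mocked provider response (the class is reproduced minimally so the snippet is self-contained; the mock mirrors Ollama's /api/generate response shape):

```typescript
interface Usage { inputTokens: number; outputTokens: number; }

interface LlmNormalizer {
  normalize(query: string, context: string): any;
  denormalize(response: any): { answer: string; usage: Usage };
}

class OllamaNormalizer implements LlmNormalizer {
  normalize(query: string, context: string): any {
    return {
      model: "llama2:13b",
      prompt: `${context}\n\nQuestion: ${query}\n\nAnswer:`,
      stream: false,
    };
  }
  denormalize(response: any): { answer: string; usage: Usage } {
    return {
      answer: response.response,
      usage: {
        inputTokens: response.prompt_eval_count || 0,
        outputTokens: response.eval_count || 0,
      },
    };
  }
}

// 1. Build the provider-specific request
const normalizer = new OllamaNormalizer();
const request = normalizer.normalize("What is RAG?", "RAG stands for...");

// 2. Pretend the provider answered (mocked Ollama response)
const mockResponse = {
  response: "RAG is retrieval-augmented generation.",
  prompt_eval_count: 42,
  eval_count: 17,
};

// 3. Convert back to the system-wide shape
const result = normalizer.denormalize(mockResponse);
console.log(result.answer);             // "RAG is retrieval-augmented generation."
console.log(result.usage.outputTokens); // 17
```

The caller never sees `prompt_eval_count` or `choices[0].message` — only the normalized `{ answer, usage }` shape.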

SSH-Based Remote Inference

For organizations with isolated networks (no internet), the rag-ssh module enables remote inference on a secure machine.

Scenario:

  • Main RAG system is in a restricted network
  • A remote machine (with internet) runs LLM services
  • SSH tunnel connects them securely

# From the main RAG server: open the tunnel to the remote machine
ssh -i /secure/keys/rag_ssh_key -N -L 8000:localhost:8000 rag_service@10.0.0.1
# Now http://localhost:8000 on the main RAG server reaches port 8000 on the remote machine

In configuration:


{
  "providers": {
    "openai_remote": {
      "enabled": true,
      "type": "ssh",
      "sshHost": "10.0.0.1",
      "sshPort": 22,
      "sshUser": "rag_service",
      "sshKeyPath": "/secure/keys/rag_ssh_key",
      "localPort": 8000,
      "remoteEndpoint": "https://api.openai.com/v1"
    }
  }
}

When a query uses this provider:

  1. SSH connection authenticates
  2. Request is tunneled to remote machine
  3. Remote machine calls OpenAI API
  4. Response comes back through tunnel
  5. No sensitive data on local network
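One way to turn that provider block into a running tunnel is to spawn the system ssh client with a local port forward. The argument builder below is a hypothetical sketch using the field names from the config above; the actual rag-ssh module may work differently:

```typescript
interface SshProviderConfig {
  sshHost: string;
  sshPort: number;
  sshUser: string;
  sshKeyPath: string;
  localPort: number;
}

// Build the argv for:
//   ssh -i <key> -p <port> -N -L <localPort>:localhost:<localPort> user@host
// -N: no remote command, tunnel only
// -o ExitOnForwardFailure=yes: fail fast if the local port is already taken
function buildTunnelArgs(cfg: SshProviderConfig): string[] {
  return [
    "-i", cfg.sshKeyPath,
    "-p", String(cfg.sshPort),
    "-N",
    "-o", "ExitOnForwardFailure=yes",
    "-L", `${cfg.localPort}:localhost:${cfg.localPort}`,
    `${cfg.sshUser}@${cfg.sshHost}`,
  ];
}

const args = buildTunnelArgs({
  sshHost: "10.0.0.1",
  sshPort: 22,
  sshUser: "rag_service",
  sshKeyPath: "/secure/keys/rag_ssh_key",
  localPort: 8000,
});
console.log(args.join(" "));
// -i /secure/keys/rag_ssh_key -p 22 -N -o ExitOnForwardFailure=yes -L 8000:localhost:8000 rag_service@10.0.0.1
```

Launching this with `child_process.spawn("ssh", args)` and watching for exit keeps the tunnel alive for the lifetime of the proxy process.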

Hybrid Mode: Different Providers for Different Contexts

The RAG system shines with hybrid setups:


{
  "routing": {
    "contextSpecific": {
      "public_docs": "openai",           // Fast, high quality
      "internal_policies": "ollama",     // Local, no data leaving
      "research_papers": "claude",       // Better for long context
      "customer_data": "ollama_isolated" // Airgapped, highest security
    }
  }
}

Benefits:

  • Cost optimization: Use expensive API for complex queries, cheap local for simple ones
  • Data privacy: Sensitive data stays on local Ollama
  • Performance: Route complex analysis to better models
  • Compliance: Different contexts follow different data handling rules

Model Evaluation and Cost Analysis

Configuration includes model capabilities and costs:


const models = config.providers.openai.models;

// Find most cost-effective model for this query
const cheapest = models.reduce((a, b) => 
  (a.costPer1kTokens < b.costPer1kTokens) ? a : b
);

// Find best quality model
const bestQuality = models.reduce((a, b) => 
  (a.qualityScore > b.qualityScore) ? a : b
);

// Adaptive routing based on query length
const estimatedTokens = question.length / 4; // Rough estimate
if (estimatedTokens > 10000) {
  // Use model with larger context window
  return models.find(m => m.contextWindow > 20000);
}

Admins can update model metadata through /admin/models, folding in cost data gleaned from recent API calls:


{
  "models": [
    {
      "id": "gpt-4",
      "actualCostPer1kTokens": 0.031,  // Updated from last month's billing
      "avgLatency": 1500,              // milliseconds
      "errorRate": 0.002,              // 0.2%
      "quality": 9.5                   // 1-10 scale
    }
  ]
}

Cost Reporting


@Injectable()
export class CostTrackingService {
  async trackQuery(
    provider: string,
    model: string,
    inputTokens: number,
    outputTokens: number
  ): Promise<void> {
    const config = this.providerConfigService.getConfig();
    const modelConfig = config.providers[provider]
      .models.find(m => m.id === model);

    if (!modelConfig) return;

    const costPer1k = modelConfig.costPer1kTokens || 0;
    const totalTokens = inputTokens + outputTokens;
    const cost = (totalTokens / 1000) * costPer1k;

    // Log for billing and analytics
    await this.costRepository.save({
      provider,
      model,
      inputTokens,
      outputTokens,
      cost,
      timestamp: new Date()
    });
  }
}
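As a concrete check of the arithmetic: a GPT-4 query with 50 prompt tokens and 100 completion tokens at $0.03 per 1K tokens costs (150 / 1000) × 0.03 ≈ $0.0045. The same computation as a standalone helper:

```typescript
// Cost of a single query given the per-1K-token price from the provider config
function queryCost(
  inputTokens: number,
  outputTokens: number,
  costPer1kTokens: number
): number {
  return ((inputTokens + outputTokens) / 1000) * costPer1kTokens;
}

console.log(queryCost(50, 100, 0.03)); // ≈ 0.0045, one GPT-4 query
console.log(queryCost(50, 100, 0));    // 0, local Ollama is free
```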

Generate usage reports:


async getCostReport(startDate: Date, endDate: Date): Promise<CostReport> {
  const entries = await this.costRepository.find({
    where: { timestamp: Between(startDate, endDate) }
  });

  const byProvider = groupBy(entries, 'provider');
  const byModel = groupBy(entries, 'model');

  return {
    totalCost: entries.reduce((sum, e) => sum + e.cost, 0),
    byProvider: Object.entries(byProvider).map(([provider, costs]) => ({
      provider,
      totalCost: costs.reduce((sum, c) => sum + c.cost, 0),
      count: costs.length
    })),
    byModel: Object.entries(byModel).map(([model, costs]) => ({
      model,
      totalCost: costs.reduce((sum, c) => sum + c.cost, 0),
      count: costs.length
    }))
  };
}

Fallback Strategies

What if the primary provider is down?


async proxyQuery(dto: RagQueryDto): Promise<any> {
  const config = this.providerConfigService.getConfig();
  
  // Build provider candidates in order of preference
  const candidates = [
    config.routing.contextSpecific[dto.contextId],
    config.routing.default,
    ...Object.keys(config.providers) // All others as fallback
  ].filter(name => config.providers[name]?.enabled);

  for (const providerName of candidates) {
    try {
      const provider = config.providers[providerName];
      const normalizer = this.getNormalizer(provider);
      
      const normalized = normalizer.normalize(dto.question, dto.context);
      const response = await this.callProvider(provider, normalized);
      
      return normalizer.denormalize(response);
    } catch (error) {
      this.logger.warn(
        `Provider ${providerName} failed: ${error.message}`
      );
      // Try next provider
      continue;
    }
  }

  throw new Error('All LLM providers failed');
}

Configuration:


{
  "routing": {
    "default": "openai",
    "fallback": "claude",  // If openai fails
    "tertiaryFallback": "ollama"  // If both fail
  }
}
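The routing block above can be flattened into the ordered candidate list that proxyQuery walks. A small helper for that, deduplicating since default and fallback may name the same provider (the routing shape is taken from the config above):

```typescript
interface RoutingConfig {
  default: string;
  fallback?: string;
  tertiaryFallback?: string;
  contextSpecific?: Record<string, string>;
}

// Ordered, de-duplicated provider names to try for a given context
function buildFallbackChain(routing: RoutingConfig, contextKey?: string): string[] {
  const ordered = [
    contextKey ? routing.contextSpecific?.[contextKey] : undefined,
    routing.default,
    routing.fallback,
    routing.tertiaryFallback,
  ].filter((name): name is string => Boolean(name));
  return [...new Set(ordered)]; // keep first occurrence, drop repeats
}

const routing: RoutingConfig = {
  default: "openai",
  fallback: "claude",
  tertiaryFallback: "ollama",
  contextSpecific: { sensitive_data: "ollama" },
};

console.log(buildFallbackChain(routing, "sensitive_data")); // ollama, openai, claude
console.log(buildFallbackChain(routing));                   // openai, claude, ollama
```

Notice the context-specific provider jumps to the front of the chain, and its later appearance as tertiary fallback is dropped.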

Monitoring and Alerting

Track provider health:


@Injectable()
export class ProviderHealthService {
  async checkProviderHealth(): Promise<ProviderStatus[]> {
    const config = this.providerConfigService.getConfig();
    const statuses: ProviderStatus[] = [];

    for (const [name, provider] of Object.entries(config.providers)) {
      let status: 'healthy' | 'degraded' | 'down' = 'healthy';
      let latency = 0;
      let errorRate = 0;

      try {
        const startTime = Date.now();
        await this.pingProvider(provider);
        latency = Date.now() - startTime;

        // Get error rate from last 100 queries
        const recentErrors = await this.getRecentErrorCount(name);
        errorRate = recentErrors / 100;

        if (latency > 5000 || errorRate > 0.1) {
          status = 'degraded';
        }
      } catch (error) {
        status = 'down';
      }

      statuses.push({
        provider: name,
        status,
        latency,
        errorRate,
        timestamp: new Date()
      });
    }

    return statuses;
  }

  async generateHealthReport(): Promise<void> {
    const statuses = await this.checkProviderHealth();
    
    for (const status of statuses) {
      if (status.status === 'down') {
        await this.alertService.sendAlert(
          `Provider ${status.provider} is DOWN`,
          `severity: critical`
        );
      } else if (status.status === 'degraded') {
        await this.alertService.sendAlert(
          `Provider ${status.provider} is degraded`,
          `latency: ${status.latency}ms, error_rate: ${status.errorRate}`
        );
      }
    }
  }
}
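The thresholds buried in checkProviderHealth (5 s latency, 10% error rate) can be factored into a pure classifier, which makes them easy to unit test and tune:

```typescript
type HealthStatus = "healthy" | "degraded" | "down";

// Same thresholds as checkProviderHealth:
// unreachable → down; >5000 ms latency or >10% errors → degraded
function classifyHealth(
  reachable: boolean,
  latencyMs: number,
  errorRate: number
): HealthStatus {
  if (!reachable) return "down";
  if (latencyMs > 5000 || errorRate > 0.1) return "degraded";
  return "healthy";
}

console.log(classifyHealth(true, 800, 0.01)); // healthy
console.log(classifyHealth(true, 6500, 0));   // degraded
console.log(classifyHealth(false, 0, 0));     // down
```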

Prompt Engineering Per Provider

Different models benefit from different prompts:


interface PromptTemplate {
  system: string;
  userPrefix: string;
  citations: string;
}

const promptTemplates = {
  'openai': {
    system: 'You are a helpful assistant. Answer based on provided context.',
    userPrefix: 'Based on the following context:\n',
    citations: '\nCite specific sources [1], [2], etc.'
  },
  'claude': {
    system: 'You are Claude, an AI assistant made by Anthropic.',
    userPrefix: 'I have the following context:\n',
    citations: '\nPlease include citations to source materials.'
  },
  'ollama': {
    system: 'You are a helpful AI assistant.',
    userPrefix: 'Context:\n',
    citations: '\nInclude citations.'
  }
};

function buildPrompt(provider: string, question: string, context: string): string {
  const template = promptTemplates[provider];
  return `${template.system}\n\n${template.userPrefix}${context}\n\nQuestion: ${question}${template.citations}`;
}

Looking Ahead

The LLM provider switching layer makes your system resilient, cost-effective, and future-proof. You're not betting on OpenAI. You're not locked into local inference for everything.

In the final article, we'll bring everything together: DevOps, Deployment, and Scaling. Docker, Kubernetes, monitoring, and taking this system from laptop to production.

---

Key Takeaways:

Configuration-driven: Swap providers without code changes

Normalized interfaces: Abstracts provider differences

Hybrid routing: Different contexts use different providers

Cost tracking: Monitor spending across providers

Fallback strategies: System stays up when a provider fails

Remote inference: SSH tunnels for isolated networks

Health monitoring: Alerts when providers degrade

No more "We can only use OpenAI." Now it's "We can use whoever makes sense."

---

GitHub: