# Multi-Model Strategy: Cost-Performance Trade-offs

**Last updated:** 2026-05 | Verified: MCP 2026-05 | **Status:** GA | **Category:** Cost Optimization & FinOps for AI

---

## Introduction

Modern AI solutions often need different model capabilities for different tasks. A multi-model strategy routes each request to the most cost-effective model that still meets the quality requirements. Azure OpenAI models range from GPT-4.1-nano (59,400 TPM per PTU) to GPT-5 (4,750 TPM per PTU), so the savings can be substantial: up to a 90% cost difference between models for simple tasks.

Model Router from Microsoft is a trained language model that automates this decision in real time. It analyzes prompt complexity, reasoning requirements, and task type to pick the best model from a set of up to 18 underlying models (including the GPT series, Claude, DeepSeek, Llama, and Grok). The result is a single deployment surface that combines cost efficiency with quality.

For organizations that want more control, custom gateway solutions (via Azure API Management or custom code) allow user-defined routing rules based on client identity, quota management, blue-green deployments, or data sovereignty requirements.

This knowledge file covers both managed (Model Router) and custom gateway strategies for multi-model deployments.
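The throughput gap quoted in the introduction can be checked with a few lines of arithmetic. The sketch below uses only the two TPM-per-PTU figures from this section and assumes, as a simplification that the source does not state, that one PTU costs the same per hour for both models, so cost per token scales inversely with throughput. The 500,000 TPM load is a hypothetical value.

```python
# Input TPM per PTU, as quoted in the introduction.
TPM_PER_PTU = {"gpt-4.1-nano": 59_400, "gpt-5": 4_750}


def ptus_needed(model: str, target_tpm: int) -> int:
    """PTUs required to sustain target_tpm, rounded up (ignores minimum deployment sizes)."""
    per_ptu = TPM_PER_PTU[model]
    return -(-target_tpm // per_ptu)  # ceiling division


def relative_saving(cheap: str, expensive: str) -> float:
    """Fraction of PTU capacity saved by serving the same load on the cheaper model."""
    return 1 - TPM_PER_PTU[expensive] / TPM_PER_PTU[cheap]


if __name__ == "__main__":
    load = 500_000  # hypothetical sustained input tokens per minute
    for name in TPM_PER_PTU:
        print(name, ptus_needed(name, load), "PTU")
    print(f"capacity saved: {relative_saving('gpt-4.1-nano', 'gpt-5'):.0%}")
```

Under the equal-price assumption, the same load needs 9 PTU on gpt-4.1-nano versus 106 PTU on gpt-5, about 92% less capacity, which is in line with the "up to 90%" figure.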
## Core components

### Model Router (Managed Multi-Model Strategy)

| Component | Description | Version/Status |
|-----------|-------------|----------------|
| **Model Router** | Trained LLM that routes prompts to the best underlying model | `2025-11-18` (GA) |
| **Routing Modes** | Quality (max accuracy), Balanced (default), Cost (max savings) | GA |
| **Model Subset** | Custom selection of underlying models for routing | GA |
| **Deployment Types** | Global Standard, Data Zone Standard | Regional: East US 2, Sweden Central |
| **Underlying Models** | 18 models: GPT-4.1/5 series, o-series, Claude, DeepSeek, Llama, Grok | Varies per model |

**Underlying models in Model Router `2025-11-18`:**

- **OpenAI models:** gpt-4.1, gpt-4.1-mini, gpt-4.1-nano, gpt-5, gpt-5-mini, gpt-5-nano, gpt-5-chat, o4-mini, gpt-4o, gpt-4o-mini
- **Reasoning models:** o4-mini (preview)
- **3rd-party models:** DeepSeek-V3.1, gpt-oss-120b, Llama-4-Maverick-17B-128E-Instruct-FP8, grok-4, grok-4-fast
- **Claude (requires its own deployment):** claude-haiku-4-5, claude-opus-4-1, claude-sonnet-4-5

**Rate limits (Model Router `2025-11-18`):**

| Deployment Type | Default RPM | Default TPM | Enterprise RPM | Enterprise TPM |
|-----------------|-------------|-------------|----------------|----------------|
| GlobalStandard | 250 | 250,000 | 400 | 400,000 |
| DataZoneStandard | 150 | 150,000 | 300 | 300,000 |

### Custom Gateway Architectures

| Topology | Use Case | Tools |
|----------|----------|-------|
| **Single Instance + Multiple Deployments** | Routing between model versions or fine-tuned models | Azure API Management |
| **Multiple Instances (Same Region)** | Security segmentation, chargeback, failover, quota spillover (Provisioned → Standard) | Azure API Management |
| **Multiple Instances (Multi-Region)** | Regional failover, data residency, mixed model availability | Azure API Management (multi-region) or custom code (ACA/AKS) |

**Gateway implementations:**

- **Azure API Management:** PaaS solution with backend pools, circuit breaker, and policy-based routing
- **Custom Code:** Full control, typically Azure Container Apps or AKS, fronted by Azure Front Door/Traffic Manager

## Architecture patterns

### 1. Model Router: Managed Multi-Model Routing

**Scenario:** Automatic routing without custom gateway code.

**Architecture:**

```
Client → Model Router Deployment → [Auto-selected underlying model]
```

**Routing modes:**

- **Balanced (default):** Chooses among models within a 1-2% quality range of the best model, prioritizing cost
- **Cost:** Wider quality band (5-6% from the best model), maximizes savings
- **Quality:** Always the highest quality, ignores cost

**Model subset:** Custom deploy with an explicit subset (for example only GPT-4.1, GPT-4.1-mini, o4-mini) to meet compliance or budget constraints.

**Advantages:**

- One deployment surface, no gateway code
- Real-time routing with no added gateway layer
- Supports tools/function calling (agentic scenarios)

**Disadvantages:**

- Less control over routing logic
- Context window limited to that of the smallest underlying model (128k for the GPT-4.1 series)
- Routing is based on text input only (not images)

**Costs:**

- Input prompt: charged per the pricing page (from Nov 2025)
- No extra hosting cost (included in the model deployment)

---

### 2. Static Model Routing (Task-Specific Models)

**Scenario:** Explicit model selection per task type in client code.
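A minimal sketch of what such client-side selection could look like in Python. It assumes that deployment names equal model names and that the caller already knows the task type; the task labels, the fallback deployment, and the helper names are hypothetical.

```python
# Task-to-deployment map; deployment names are assumed to equal model names.
TASK_TO_DEPLOYMENT = {
    "simple_qa": "gpt-4.1-nano",
    "summary": "gpt-4.1-mini",
    "reasoning": "o4-mini",
}
DEFAULT_DEPLOYMENT = "gpt-4.1-mini"  # hypothetical fallback for unknown task types


def select_deployment(task: str) -> str:
    """Return the deployment for a task type, falling back to the default."""
    return TASK_TO_DEPLOYMENT.get(task, DEFAULT_DEPLOYMENT)


def build_request(task: str, prompt: str) -> dict:
    """Build chat-completion keyword arguments for the selected deployment."""
    return {
        "model": select_deployment(task),
        "messages": [{"role": "user", "content": prompt}],
    }


if __name__ == "__main__":
    print(build_request("summary", "Summarize the attached case file.")["model"])
```

The returned dictionary can be passed as keyword arguments to `client.chat.completions.create(**request)` from the `openai` package. Keeping the map in one place makes the routing rule easy to audit, at the cost of redeploying clients whenever it changes.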
**Architecture:**

```
Client Logic:
  if task == "summary":   use gpt-4.1-mini
  if task == "reasoning": use o4-mini
  if task == "simple_qa": use gpt-4.1-nano
→ Azure OpenAI deployments (direct)
```

**Decision criteria:**

| Task Type | Model | Rationale |
|-----------|-------|-----------|
| Simple Q&A, classification | gpt-4.1-nano | 59,400 TPM/PTU, lowest cost |
| Summarization, translation | gpt-4.1-mini | 14,900 TPM/PTU, good balance |
| Complex reasoning | o4-mini | Reasoning-capable, 5,400 TPM/PTU |
| High-quality content | gpt-5 | 4,750 TPM/PTU, best quality |

**Advantages:**

- Full control, no routing layer
- Predictable costs per task type

**Disadvantages:**

- Routing logic lives in client code (maintainability)
- No dynamic fallback on throttling

---

### 3. Dynamic Complexity-Based Routing (Custom Gateway)

**Scenario:** The gateway analyzes prompt complexity and routes dynamically.

**Architecture:**

```
Client → Azure API Management (or custom gateway)
  ├─ Complexity Score (token count, question marks, "explain", "analyze")
  ├─ Score < 50: route to gpt-4.1-nano
  ├─ Score 50-200: route to gpt-4.1-mini
  └─ Score > 200: route to gpt-5
→ Azure OpenAI instances (multiple deployments)
```

**Implementation (Azure API Management policy):**

```xml
```

**Advantages:**

- Server-side logic (client-agnostic)
- Supports versioning/blue-green deployments
- Usage tracking per client (via API Management analytics)

**Disadvantages:**

- The gateway is a single point of failure (requires multi-region for HA)
- Complexity in the policy logic

---

### 4. Cascading Model Pipeline (Quality Fallback)

**Scenario:** Start with a cheap model and retry with a more expensive one when confidence is low.
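The scenario above depends on a numeric confidence signal, which chat models do not return directly. One possible proxy, offered here as an assumption rather than a recommended metric, is the geometric mean of the token probabilities, computed from per-token log probabilities (the Chat Completions API can return these via `logprobs=True` for models that support it). The function names, the default threshold, and the sample values are hypothetical.

```python
import math


def confidence_from_logprobs(token_logprobs: list[float]) -> float:
    """Geometric-mean token probability in [0, 1]; 0.0 for an empty completion."""
    if not token_logprobs:
        return 0.0
    return math.exp(sum(token_logprobs) / len(token_logprobs))


def should_escalate(token_logprobs: list[float], threshold: float = 0.7) -> bool:
    """True when the response is too uncertain and the next model tier should be tried."""
    return confidence_from_logprobs(token_logprobs) < threshold


if __name__ == "__main__":
    confident = [-0.05, -0.10, -0.02]  # high-probability tokens
    uncertain = [-0.9, -1.4, -0.6]     # low-probability tokens
    print(should_escalate(confident), should_escalate(uncertain))
```

Token-level probability measures fluency more than factual correctness, so a task-specific check (schema validation, a grader prompt, or a classifier) may be a better escalation trigger for some workloads.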
**Architecture:**

```
Client → Gateway
  ├─ Try gpt-4.1-nano
  ├─ If confidence < 0.7: retry with gpt-4.1-mini
  └─ If confidence < 0.7: retry with gpt-5
→ Multiple Azure OpenAI deployments
```

**Implementation (sketch; `call_model` and `.confidence` are placeholders):**

```python
# Escalation order: cheapest model first.
MODEL_TIERS = ["gpt-4.1-nano", "gpt-4.1-mini", "gpt-5"]

def cascade(prompt, call_model, threshold=0.7):
    # call_model(model, prompt) must return an object with a .confidence score.
    response = None
    for model in MODEL_TIERS:
        response = call_model(model, prompt)
        if response.confidence >= threshold:
            break  # good enough, stop escalating
    return response  # last tier's answer if no tier met the threshold
```

**Advantages:**

- Quality guarantee combined with cost optimization
- Automatic escalation

**Disadvantages:**

- Added latency on retries
- Confidence scoring is complex (requires logprobs or custom metrics)

---

### 5. Provisioned + Standard Spillover (Cost + Elasticity)

**Scenario:** Provisioned PTU for baseline traffic, a Standard deployment for burst traffic.

**Architecture:**

```
Client → Azure API Management
  ├─ Primary: Provisioned PTU deployment (300 PTU)
  └─ Spillover (on 429): Standard deployment
→ Same Azure OpenAI instance or multiple instances
```

**Cost model:**

- **Provisioned:** Fixed hourly cost ($/PTU/hr), sized for 80-90% of traffic
- **Standard:** Pay-per-token for bursts (10-20% of traffic)

**Implementation (Azure API Management policy):**

```xml
```

**Advantages:**

- Cost optimization: provisioned for baseline, pay-as-you-go for peaks
- Latency guarantee via PTU

**Disadvantages:**

- Provisioned capacity must be right-sized (use the [Azure AI Foundry PTU calculator](https://ai.azure.com/resource/calculator))
- Standard quotas are subscription-level (not instance-level)

## Decision guidance

### When to use Model Router vs. Custom Gateway

| Criterion | Model Router | Custom Gateway |
|-----------|--------------|----------------|
| **Deployment complexity** | Low (one deployment) | High (infrastructure + policy) |
| **Routing control** | Modes + subset | Full control (logic, rules, client identity) |
| **Data residency** | Data Zone Standard (single zone) | Requires per-region gateways for compliance |
| **Multi-region failover** | No (single deployment) | Yes (with API Management multi-region or custom HA) |
| **Client segmentation** | No | Yes (quota per client, chargeback models) |
| **Blue-green deployments** | No | Yes (route to different model versions) |
| **Cost** | Model Router input charge + token usage | Gateway hosting + token usage |
| **Latency** | Real-time routing (minimal overhead) | Gateway hop (~5-20 ms, depending on region) |

**Rule of thumb:**

- **Model Router:** For most use cases with standard routing needs
- **Custom Gateway:** When you need client identity routing, data sovereignty, multi-region HA, or quota management

---

### Decision Tree: Choosing a Multi-Model Strategy

```
START: Do you need multi-model routing?
├─ NO: Use a single model deployment (Standard or Provisioned)
└─ YES:
   ├─ Do you need data residency compliance across regions?
   │  ├─ YES: Custom gateway per region (API Management multi-region)
   │  └─ NO: Continue
   ├─ Do you need client-specific quota or chargeback?
   │  ├─ YES: Custom gateway (API Management + client identity routing)
   │  └─ NO: Continue
   ├─ Do you need blue-green deployments or model versioning?
   │  ├─ YES: Custom gateway (API Management policies)
   │  └─ NO: Continue
   └─ Default: Model Router (Balanced mode)
      ├─ Cost-sensitive workload: Model Router (Cost mode)
      └─ Quality-critical workload: Model Router (Quality mode)
```

---

### Common mistakes

| Mistake | Consequence | Fix |
|---------|-------------|-----|
| **Routing to different model versions** | Inconsistent responses, breaking changes | Always use the same model and version within a load-balancing/failover pool |
| **Ignoring the `Retry-After` header** | Aggressive retries make throttling worse | Circuit breaker logic that honors `Retry-After` |
| **Gateway in a single region for multi-region backends** | Latency + egress costs | Multi-region gateway deployment (API Management multi-region) |
| **Cross-geopolitical routing** | Data residency violation | Isolated gateways per geopolitical region |
| **Standard deployments across multiple instances (same subscription and region)** | No additional quota (quota is subscription-level) | Use Global/Data Zone Standard deployments instead |
| **Under-sized Provisioned PTU** | Spillover to Standard = cost overruns | Right-size with the [PTU calculator](https://ai.azure.com/resource/calculator) |

---

### Red flags

- 🚩 **Gateway as a single point of failure:** Deploy an HA gateway (multi-region or availability zones)
- 🚩 **No health checks on the gateway:** Use synthetic transactions or a `/status` endpoint for upstream health
- 🚩 **Complex routing logic in gateway policies:** Consider a custom code gateway (ACA/AKS) for better testability
- 🚩 **Model Router with a required context window > 128k:** Restrict the subset to models that support it (for example the GPT-5 series with 400k context)
- 🚩 **Scaling Provisioned PTU on demand:** PTU capacity is not guaranteed on demand; secure capacity and reservations ahead of production

## Integration with the Microsoft stack

### Azure OpenAI + Model Router

**Quick Deploy:**

```bash
# Foundry portal: Model catalog → Model Router → Quick Deploy
# Deployment type: Global Standard or Data Zone Standard
# Routing mode: Balanced (default), Cost, Quality
```

**Custom Deploy (with Model Subset):**

```bash
# Foundry portal: Model catalog → Model Router → Custom Deploy
# 1. Choose deployment type
# 2. Set Routing mode: Cost
# 3. Model subset: select only gpt-4.1-mini, gpt-4.1-nano, o4-mini
# 4. Deploy
```

**Python SDK (using Model Router):**

```python
import os

from openai import OpenAI

client = OpenAI(
    api_key=os.getenv("AZURE_OPENAI_API_KEY"),
    base_url="https://YOUR-RESOURCE.openai.azure.com/openai/v1/",
)

response = client.chat.completions.create(
    model="model-router",  # Model Router deployment name
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain quantum computing in simple terms."},
    ],
)

print(response.choices[0].message.content)
# The underlying model that Model Router selected is reported in response.model.
print(response.model)
```

---

### Azure API Management (Custom Gateway)

**Backend pools for load balancing:**

```xml
https://aoai-instance1.openai.azure.com
https://aoai-instance2.openai.azure.com
https://aoai-instance3.openai.azure.com
```

**Circuit breaker policy (preview):**

```xml
```

**Reference architectures:**

- [Smart load balancing for Azure OpenAI using Azure API Management](https://github.com/Azure-Samples/openai-apim-lb) (GitHub)
- [Scaling Azure OpenAI using Azure API Management](https://github.com/Azure/aoai-apim/) (GitHub, Provisioned + Standard spillover)
- [GenAI gateway toolkit](https://github.com/Azure-Samples/apim-genai-gateway-toolkit) (Load testing + policies)

---

### Semantic Kernel (Application layer routing)

```csharp
using Microsoft.SemanticKernel;
using Microsoft.SemanticKernel.ChatCompletion;

// Static routing per task type: register one chat service per deployment.
var kernel = Kernel.CreateBuilder()
    .AddAzureOpenAIChatCompletion(
        deploymentName: "gpt-4.1-nano",
        endpoint: "https://YOUR-RESOURCE.openai.azure.com",
        apiKey: apiKey,
        serviceId: "simple-tasks")
    .AddAzureOpenAIChatCompletion(
        deploymentName: "gpt-5",
        endpoint: "https://YOUR-RESOURCE.openai.azure.com",
        apiKey: apiKey,
        serviceId: "complex-tasks")
    .Build();

// Select the service dynamically by its serviceId key.
var chatService = taskComplexity > threshold
    ? kernel.GetRequiredService<IChatCompletionService>("complex-tasks")
    : kernel.GetRequiredService<IChatCompletionService>("simple-tasks");
```

---

### AI Foundry Model Catalog

**Tiered inference (outside Azure OpenAI):**

- **Foundry Model Catalog:** Meta Llama, Mistral, Cohere, Phi models
- **Deployment options:** Managed compute, Serverless API, Pay-as-you-go
- **Use case:** Combine Azure OpenAI with open-source models for a cost-tier strategy

Example: GPT-4.1 for critical tasks, Phi-4 (Microsoft open model) for simple classification.

## Public sector (Norway)

### Data sovereignty and Multi-Model Routing

**Model Router:**

- **Data Zone Standard:** Keeps data within the Microsoft-specified data zone (for example the EU Data Boundary)
- **Underlying models:** Must be deployed in the same data zone (except Claude, which requires separate deployments)

**Custom Gateway (multi-region):**

- **Geopolitical boundaries:** Deploy isolated gateways per region (for example Norway East, West Europe)
- **Data residency:** Ensure no cross-region routing (NSG rules, policy enforcement)
- **Compliance:** Azure Policy for consistency (model versions, encryption, network perimeter)

**GDPR/Schrems II:**

- Prefer Data Zone Standard deployments
- Audit gateway logs for data flows (Azure Monitor, Log Analytics)

---

### Budget processes and cost control

**Challenge:** Public agencies work with annual budgets, so AI costs must be forecast.

**Multi-model strategy for budget predictability:**

1. **Baseline with Provisioned PTU:**
   - Allocate a fixed cost ($/PTU/hr) for 80-90% of expected traffic
   - Use the [PTU calculator](https://ai.azure.com/resource/calculator) for sizing
   - Purchase Azure Reservations (1-year or 3-year) for cost savings (up to 50%)
2. **Burst traffic with Standard:**
   - Standard deployment for peak periods (budget 10-20% extra)
   - Azure Cost Management alerts at a threshold (for example 90% of the monthly budget)
3. **Model Router (Cost mode) for volume workloads:**
   - Batch processing of documents: Cost mode routes to the cheapest model
   - Quality-critical work (for example legal analysis): Quality mode for accuracy

**Cost Management integration:**

```bash
# Azure consumption data: list usage records whose instance name contains 'model-router'
az consumption usage list --start-date 2026-02-01 --end-date 2026-02-28 \
  --query "[?contains(instanceName, 'model-router')]" \
  --output table
```

---

### Compliance requirements (Schrems II, NIS2)

**Multi-region gateway for compliance:**

- **NIS2 (Network and Information Security Directive):** Requires high availability and incident response
- **Multi-region deployment:** Active-active gateways (Azure API Management multi-region) for SLA > 99.9%
- **Incident response:** Azure Monitor alerts on gateway health, automatic failover

**Audit trail:**

- The gateway logs all routing decisions (Azure Log Analytics)
- Include client identity, selected model, response time, and cost per request

## Cost and licensing

### Price comparison between models

**Standard Deployment (Pay-as-you-go, NOK per 1M tokens, estimated 2026 rates):**

| Model | Input (NOK/1M tokens) | Output (NOK/1M tokens) | Ratio (Output:Input) |
|-------|-----------------------|------------------------|----------------------|
| gpt-4.1-nano | ~50 | ~200 | 4:1 |
| gpt-4.1-mini | ~150 | ~600 | 4:1 |
| gpt-4.1 | ~300 | ~1,200 | 4:1 |
| gpt-5-mini | ~100 | ~400 | 4:1 |
| gpt-5 | ~500 | ~2,000 | 4:1 |
| gpt-5-chat | ~250 | ~1,000 | 4:1 |
| o4-mini | ~350 | ~1,400 | 4:1 |
| gpt-4o | ~250 | ~1,000 | 4:1 |
| gpt-4o-mini | ~75 | ~300 | 4:1 |

*(Prices are estimates based on USD pricing + exchange rate. Verify exact NOK prices with the [Azure Pricing Calculator](https://azure.microsoft.com/pricing/calculator).)*

**Provisioned Throughput (PTU, NOK per PTU/hr, estimated):**

| Model | TPM per PTU (Input) | PTU/hr cost (NOK, estimated) |
|-------|---------------------|------------------------------|
| gpt-4.1-nano | 59,400 | ~80-120 |
| gpt-4.1-mini | 14,900 | ~80-120 |
| gpt-4.1 | 3,000 | ~120-180 |
| gpt-5-mini | 23,750 | ~100-150 |
| gpt-5 | 4,750 | ~180-250 |
| o4-mini | 5,400 | ~150-200 |

*(Provisioned pricing varies per region and reservation type. Use the [Azure Pricing Calculator](https://azure.microsoft.com/pricing/calculator).)*

---

### Savings potential

**Example: Document summarization (public agency, 10M tokens/month):**

| Strategy | Model(s) | Monthly Cost (NOK, estimated) | Savings |
|----------|----------|-------------------------------|---------|
| **Baseline (all GPT-5)** | gpt-5 | ~25,000 (10M input + 2M output) | - |
| **Static routing** | 70% gpt-4.1-mini, 30% gpt-5 | ~10,000 | 60% |
| **Model Router (Balanced)** | Auto-routing | ~8,000 | 68% |
| **Model Router (Cost mode)** | Auto-routing (larger quality band) | ~6,000 | 76% |

**Provisioned PTU scenario (high-volume, 100M tokens/month):**

| Strategy | Setup | Monthly Cost (NOK, estimated) | Savings |
|----------|-------|-------------------------------|---------|
| **Standard pay-as-you-go** | 100M input, 20M output | ~200,000 | - |
| **Provisioned (300 PTU gpt-5)** | 300 PTU × 730 hrs × ~200 NOK/PTU/hr | ~43,800 + token overage | 78% |
| **Provisioned + Standard spillover** | 200 PTU + Standard for 20% burst | ~35,000 | 82% |

*(Estimates depend on traffic patterns and are illustrative. The monthly figures are not derived from the per-token and per-PTU tables above, and the expression 300 PTU × 730 hrs × ~200 NOK/PTU/hr evaluates to roughly 43.8 million NOK rather than ~43,800. Recompute with the [PTU calculator](https://ai.azure.com/resource/calculator) and current prices before using these numbers for sizing.)*

---

### Optimization tips

1. **Right-size Provisioned PTU:**
   - Benchmark the actual workload (not estimates)
   - Start with 80% of expected peak and use Standard spillover for the remaining 20%
   - Purchase Azure Reservations (1-year) for 30-50% savings on PTU cost
2. **Model Router for varying workloads:**
   - Use Balanced mode as the default
   - Cost mode for batch processing (not time-sensitive)
   - Quality mode for compliance-critical outputs (legal, health)
3. **Cache optimization:**
   - Prompt caching (GPT-4.1+): up to 100% discount on cached input tokens for Provisioned deployments; Standard deployments bill cached tokens at a reduced rate
   - Semantic Kernel memory: cache embeddings for RAG
4. **Fine-tuning for cost reduction:**
   - A fine-tuned gpt-4o-mini can match gpt-4o quality for specific tasks
   - Cost: $1.70/hour hosting + token usage (same rate as the base model)
   - Example: fine-tune for domain-specific summarization → replace GPT-5 with gpt-4.1-mini
5. **Monitor and adjust:**
   - Azure Cost Management: set budgets + alerts
   - Gateway analytics: track cost per client, per model, per task type
   - Monthly review: adjust the Model Router subset or gateway rules based on cost/quality metrics

## For the architect (Cosmo)

### Questions to ask the customer

1. **Traffic patterns:**
   - What are the expected requests per minute (peak and average)?
   - Is traffic even across the day, or are there clear peak periods?
   - How many tokens per request (input + output)?
2. **Quality vs. cost priority:**
   - Is there room for a 1-2% quality reduction in exchange for savings (Balanced mode)?
   - Or is maximum quality non-negotiable (Quality mode)?
   - Which tasks can use cheaper models (classification, simple Q&A)?
3. **Compliance and data residency:**
   - Must data stay within Norway, the EU, or another specific geography?
   - Is an audit trail required for model selection decisions?
   - Is this a multi-tenant scenario with chargeback requirements?
4. **Existing infrastructure:**
   - Do you already use Azure API Management, or must the gateway be deployed from scratch?
   - Are there multi-region requirements for HA/DR?
   - What latency is acceptable for the gateway hop (5-20 ms)?
5. **Budget and forecasting:**
   - Is there a fixed annual budget, or pay-as-you-go flexibility?
   - Can you commit to a 1-year reservation for PTU savings?
   - What is the threshold for cost alerts (90% of budget)?
6. **Deployment strategy:**
   - Do you need blue-green deployments for model versioning?
   - Do you want to start with Model Router and consider a custom gateway later?
   - Is client-specific quota needed (per team, per project)?
7. **Monitoring and optimization:**
   - Who owns cost management (IT, finance, product team)?
   - How often should cost/quality metrics be reviewed (monthly, quarterly)?
   - Are there baseline quality metrics (for example F1 score, BLEU)?

---

### Pitfalls

| Pitfall | Impact | Mitigation |
|---------|--------|------------|
| **Over-provisioning PTU** | Waste (paying for unused capacity) | Start with 80% of peak, use Standard spillover |
| **Under-provisioning PTU** | Poor UX (throttling, latency) + cost overruns (Standard overage) | Benchmark actual traffic, right-size monthly |
| **Ignoring context window limits (Model Router)** | Failed requests (if a prompt > 128k is routed to a model that does not support it) | Model subset selection (only models with the required context window) |
| **Complex routing logic in gateway policies** | Heavy maintenance burden, hard to debug | Start simple (token count) and iterate. Consider a custom code gateway for complex logic. |
| **No circuit breaker** | Cascade failures, throttling amplification | Azure API Management circuit breaker policy (honor `Retry-After`) |
| **Single-region gateway for multi-region backends** | Latency + egress costs + SPoF | Deploy multi-region API Management or a custom HA gateway |
| **Cross-geopolitical routing** | Compliance violation (GDPR, Schrems II) | Isolated gateways per region, NSG rule enforcement |
| **No cost monitoring** | Budget overruns discovered too late | Azure Cost Management alerts, monthly reviews, gateway analytics |

---

### Recommendations per maturity level

**Level 1 (Pilot/POC):**

- Start with **Model Router (Balanced mode)** for minimal complexity
- Single deployment (Global Standard or Data Zone Standard)
- Monitor cost vs. quality over 1-2 months
- Decision point: are savings and quality acceptable? If yes, move to production. If no, consider a custom gateway.

**Level 2 (Production, single-region):**

- **Model Router (Custom deploy)** with a model subset for compliance
- Or **Azure API Management** for simple routing (token count, task type)
- Provisioned PTU for baseline + Standard spillover
- Azure Cost Management alerts + monthly reviews

**Level 3 (Enterprise, multi-region, multi-tenant):**

- **Custom gateway** (Azure API Management multi-region or ACA/AKS + Azure Front Door)
- Client identity-based routing, chargeback models
- Provisioned PTU with 1-year reservations per region
- Automated cost optimization (dynamic model selection based on budget thresholds)
- Compliance audit trail (Log Analytics, Azure Policy)

**Level 4 (Advanced optimization):**

- **Hybrid multi-model strategy:** Azure OpenAI (premium tasks) + AI Foundry open models (commodity tasks)
- Fine-tuned models for domain-specific cost reduction
- Real-time cost/quality feedback loop (A/B testing of routing strategies)
- FinOps team ownership with automated chargebacks

## Sources and verification

**Microsoft Learn (MCP-verified):**

1. [Model router for Azure AI Foundry](https://learn.microsoft.com/en-us/azure/ai-foundry/openai/concepts/model-router): **Verified** (MCP fetch, 2026-04)
2. [Use a gateway in front of multiple Azure OpenAI deployments](https://learn.microsoft.com/en-us/azure/architecture/ai-ml/guide/azure-openai-gateway-multi-backend): **Verified** (MCP fetch, 2026-04). The document confirms that (a) terminating and re-establishing credentials at the gateway is recommended over passing client credentials through, (b) the gateway provides client-based usage tracking and chargeback support, and (c) Azure OpenAI is now tagged as "Foundry Tools / Azure OpenAI in Foundry Models".
3. [Understanding costs associated with provisioned throughput units (PTU)](https://learn.microsoft.com/en-us/azure/ai-foundry/openai/how-to/provisioned-throughput-onboarding): **Verified** (MCP search, 2026-04)
4. [Azure OpenAI in Azure AI Foundry Models](https://learn.microsoft.com/en-us/azure/ai-foundry/openai/concepts/models): **Verified** (MCP search, 2026-04)
5. [GPT-4o vs GPT-4o mini model selection](https://learn.microsoft.com/en-us/azure/ai-foundry/openai/whats-new): **Verified** (MCP search, 2026-04)

**GitHub samples (MCP-referenced):**

1. [Smart load balancing for Azure OpenAI (Azure API Management)](https://github.com/Azure-Samples/openai-apim-lb): **Verified**
2. [Scaling Azure OpenAI using Azure API Management](https://github.com/Azure/aoai-apim/): **Verified**
3. [GenAI gateway toolkit](https://github.com/Azure-Samples/apim-genai-gateway-toolkit): **Verified**

**Pricing and calculators:**

1. [Azure Pricing Calculator](https://azure.microsoft.com/pricing/calculator): **Baseline** (pricing subject to change)
2. [Azure AI Foundry PTU calculator](https://ai.azure.com/resource/calculator): **Verified** (MCP-referenced)

**Confidence level per section:**

| Section | Confidence | Source |
|---------|------------|--------|
| Model Router (components, modes, models) | **Verified** | MCP microsoft-learn fetch |
| Custom Gateway architectures | **Verified** | MCP microsoft-learn fetch |
| Architecture patterns (1-5) | **Verified** | MCP microsoft-learn + GitHub samples |
| Price comparison | **Baseline** | Estimated from USD pricing + currency conversion (verify with Azure Pricing Calculator) |
| Savings potential | **Baseline** | Example calculations (actual savings depend on workload) |
| Public sector (compliance, budget) | **Baseline** | General best practices (verify with legal/compliance team) |
| Integration (API Management policies) | **Verified** | MCP code samples + GitHub repos |

---

**Last updated:** 2026-04 (based on Model Router version `2025-11-18` and Azure OpenAI pricing as of April 2026). Verified (MCP 2026-04).

**Next review:** On new Model Router versions or major pricing changes.