# Multi-Model Strategy: Cost-Performance Trade-offs
**Last updated:** 2026-05 | Verified: MCP 2026-05
**Status:** GA
**Category:** Cost Optimization & FinOps for AI
---
## Introduction
Modern AI solutions often need different model capabilities for different tasks. A multi-model strategy routes each request to the most cost-effective model that still meets the quality requirements. Azure OpenAI models range from GPT-4.1-nano (59 400 TPM per PTU) to GPT-5 (4 750 TPM per PTU), so the savings can be substantial: up to a 90% cost difference between models for simple tasks.
Microsoft's Model Router is a trained language model that automates this decision in real time. It analyzes prompt complexity, reasoning requirements, and task type to pick the best model from a set of up to 18 underlying models (including the GPT series, Claude, DeepSeek, Llama, and Grok). The result is a single deployment surface that combines cost efficiency and quality.
For organizations that want more control, custom gateway solutions (via Azure API Management or custom code) allow custom routing rules based on client identity, quota management, blue-green deployments, or data sovereignty requirements. This knowledge file covers both managed (Model Router) and custom gateway strategies for multi-model deployments.
## Core Components
### Model Router (Managed Multi-Model Strategy)
| Component | Description | Version/Status |
|-----------|-------------|----------------|
| **Model Router** | Trained LLM that routes prompts to the best underlying model | `2025-11-18` (GA) |
| **Routing Modes** | Quality (max accuracy), Balanced (default), Cost (max savings) | GA |
| **Model Subset** | Custom selection of underlying models for routing | GA |
| **Deployment Types** | Global Standard, Data Zone Standard | Regional: East US 2, Sweden Central |
| **Underlying Models** | 18 models: GPT-4.1/5 series, o-series, Claude, DeepSeek, Llama, Grok | Varies per model |
**Underlying models in Model Router `2025-11-18`:**
- **OpenAI models:** gpt-4.1, gpt-4.1-mini, gpt-4.1-nano, gpt-5, gpt-5-mini, gpt-5-nano, gpt-5-chat, o4-mini, gpt-4o, gpt-4o-mini
- **Reasoning models:** o4-mini (preview)
- **Third-party models:** DeepSeek-V3.1, gpt-oss-120b, Llama-4-Maverick-17B-128E-Instruct-FP8, grok-4, grok-4-fast
- **Claude (requires a separate deployment):** claude-haiku-4-5, claude-opus-4-1, claude-sonnet-4-5
**Rate limits (Model Router `2025-11-18`):**
| Deployment Type | Default RPM | Default TPM | Enterprise RPM | Enterprise TPM |
|-----------------|-------------|-------------|----------------|----------------|
| GlobalStandard | 250 | 250 000 | 400 | 400 000 |
| DataZoneStandard | 150 | 150 000 | 300 | 300 000 |
### Custom Gateway Architectures
| Topology | Use Case | Tools |
|----------|----------|-------|
| **Single Instance + Multiple Deployments** | Routing between model versions or fine-tuned models | Azure API Management |
| **Multiple Instances (Same Region)** | Security segmentation, chargeback, failover, quota spillover (Provisioned → Standard) | Azure API Management |
| **Multiple Instances (Multi-Region)** | Regional failover, data residency, mixed model availability | Azure API Management (multi-region) or custom code (ACA/AKS) |
**Gateway implementations:**
- **Azure API Management:** PaaS solution with backend pools, circuit breaker, and policy-based routing
- **Custom Code:** Full control, typically hosted on Azure Container Apps or AKS and fronted by Azure Front Door/Traffic Manager
## Architecture Patterns
### 1. Model Router: Managed Multi-Model Routing
**Scenario:** Automatic routing without custom gateway code.
**Architecture:**
```
Client → Model Router Deployment → [Auto-selected underlying model]
```
**Routing modes:**
- **Balanced (default):** Chooses among models within a 1-2% quality range of the best model and prioritizes cost within that range
- **Cost:** Uses a wider quality band (5-6% from the best model) to maximize savings
- **Quality:** Always picks the highest-quality model and ignores cost
**Model subset:** A custom deployment with an explicit subset (e.g. only GPT-4.1, GPT-4.1-mini, o4-mini) for compliance or budget constraints.
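The way the routing modes and the model subset interact can be illustrated with a small sketch. This is not how Model Router is implemented (it is a trained model); the quality scores, relative costs, and band values below are hypothetical numbers chosen only to mirror the bands described above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    name: str
    quality: float  # hypothetical predicted quality for this prompt, 0..1
    cost: float     # hypothetical relative cost per request


# Assumed quality bands per mode, expressed as the allowed relative drop
# from the best candidate's score (upper ends of the ranges in the text).
BANDS = {"quality": 0.0, "balanced": 0.02, "cost": 0.06}


def pick_model(candidates, mode="balanced", subset=None):
    """Return the cheapest candidate whose quality lies inside the mode's band."""
    pool = [c for c in candidates if subset is None or c.name in subset]
    if not pool:
        raise ValueError("no candidate models left after applying the subset")
    best = max(c.quality for c in pool)
    floor = best * (1.0 - BANDS[mode])
    eligible = [c for c in pool if c.quality >= floor]
    # Cheapest eligible model wins; ties are broken by higher quality.
    return min(eligible, key=lambda c: (c.cost, -c.quality))


candidates = [
    Candidate("gpt-5", 0.95, 10.0),
    Candidate("gpt-4.1-mini", 0.94, 3.0),
    Candidate("gpt-4.1-nano", 0.90, 1.0),
]
print(pick_model(candidates, "balanced").name)
print(pick_model(candidates, "cost").name)
```

With these made-up scores, Balanced stays within 2% of the best model and Cost accepts a wider band, which is why the two modes can land on different models for the same prompt.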
**Advantages:**
- One deployment surface, no gateway code
- Real-time routing with minimal overhead
- Supports tools/function calling (agentic scenarios)
**Disadvantages:**
- Less control over the routing logic
- Context window is limited to that of the smallest underlying model (128k for the GPT-4.1 series)
- Routing is based on text input only (not images)
**Costs:**
- Input prompt: charged per the pricing page (from Nov 2025)
- No extra hosting cost (included in the model deployment)
---
### 2. Static Model Routing (Task-Specific Models)
**Scenario:** Explicit model selection per task type in client code.
**Architecture:**
```
Client Logic:
if task == "summary": use gpt-4.1-mini
if task == "reasoning": use o4-mini
if task == "simple_qa": use gpt-4.1-nano
→ Azure OpenAI deployments (direct)
```
**Decision criteria:**
| Task Type | Model | Rationale |
|-----------|-------|-----------|
| Simple Q&A, classification | gpt-4.1-nano | 59 400 TPM/PTU, lowest cost |
| Summarization, translation | gpt-4.1-mini | 14 900 TPM/PTU, good balance |
| Complex reasoning | o4-mini | Reasoning-capable, 5 400 TPM/PTU |
| High-quality content | gpt-5 | 4 750 TPM/PTU, best quality |
**Advantages:**
- Full control, no routing layer
- Predictable costs per task type
**Disadvantages:**
- Routing logic lives in client code (maintainability)
- No dynamic fallback on throttling
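A minimal sketch of the static mapping, assuming the deployment names equal the model names from the decision criteria table. The task labels and the fallback deployment are illustrative assumptions, not part of any SDK.

```python
# Task type -> deployment name, following the decision criteria table.
# Task labels are hypothetical; use whatever labels the client already has.
TASK_TO_DEPLOYMENT = {
    "simple_qa": "gpt-4.1-nano",
    "classification": "gpt-4.1-nano",
    "summary": "gpt-4.1-mini",
    "translation": "gpt-4.1-mini",
    "reasoning": "o4-mini",
    "high_quality_content": "gpt-5",
}

# Assumption: unknown task types fall back to a mid-tier model.
DEFAULT_DEPLOYMENT = "gpt-4.1-mini"


def deployment_for(task: str) -> str:
    """Return the deployment to call for a given task type."""
    return TASK_TO_DEPLOYMENT.get(task, DEFAULT_DEPLOYMENT)


print(deployment_for("reasoning"))
```

Keeping the mapping in one table rather than in scattered `if` statements limits the maintainability cost noted above, although it still has to be redeployed with the client.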
---
### 3. Dynamic Complexity-Based Routing (Custom Gateway)
**Scenario:** The gateway analyzes prompt complexity and routes dynamically.
**Architecture:**
```
Client → Azure API Management (eller custom gateway)
├─ Complexity Score (token count, question marks, "explain", "analyze")
├─ Score < 50: route to gpt-4.1-nano
├─ Score 50-200: route to gpt-4.1-mini
└─ Score > 200: route to gpt-5
→ Azure OpenAI instances (multiple deployments)
```
**Implementation (Azure API Management policy):**
```xml
```
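The scoring rules from the architecture diagram can be sketched in Python, for example as the core of a custom code gateway. The weights (20 points per question mark, 50 per keyword) and the whitespace-based token estimate are assumptions made for illustration; only the signals and the thresholds 50 and 200 come from the diagram.

```python
import re

# Keywords named in the diagram; matching is case-insensitive.
KEYWORDS = ("explain", "analyze")


def complexity_score(prompt: str) -> int:
    """Crude complexity score: token estimate plus weighted signals."""
    tokens = len(prompt.split())  # whitespace split as a rough token proxy
    questions = prompt.count("?")
    lowered = prompt.lower()
    keyword_hits = sum(len(re.findall(rf"\b{k}", lowered)) for k in KEYWORDS)
    return tokens + 20 * questions + 50 * keyword_hits  # assumed weights


def route(prompt: str) -> str:
    """Map the score onto the three tiers from the diagram."""
    score = complexity_score(prompt)
    if score < 50:
        return "gpt-4.1-nano"
    if score <= 200:
        return "gpt-4.1-mini"
    return "gpt-5"


print(route("What time is it?"))
```

A heuristic like this is cheap to evaluate but easy to fool, so it is worth logging the score next to the selected model and revisiting the weights against real traffic.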
**Advantages:**
- Server-side logic (client-agnostic)
- Supports versioning/blue-green deployments
- Usage tracking per client (via API Management analytics)
**Disadvantages:**
- The gateway is a single point of failure (needs multi-region deployment for HA)
- Complexity in the policy logic
---
### 4. Cascading Model Pipeline (Quality Fallback)
**Scenario:** Start with a cheap model and retry with a more expensive one when confidence is low.
**Architecture:**
```
Client → Gateway
├─ Try gpt-4.1-nano
├─ If confidence < 0.7: retry with gpt-4.1-mini
└─ If confidence < 0.7: retry with gpt-5
→ Multiple Azure OpenAI deployments
```
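**Implementation (pseudocode):**
```python
# call_model is a hypothetical helper that returns an object with a
# .confidence value in [0, 1] (e.g. derived from logprobs or a custom metric).
CONFIDENCE_THRESHOLD = 0.7

def cascade(prompt):
    response = call_model("gpt-4.1-nano", prompt)
    if response.confidence < CONFIDENCE_THRESHOLD:
        response = call_model("gpt-4.1-mini", prompt)  # first escalation
    if response.confidence < CONFIDENCE_THRESHOLD:
        response = call_model("gpt-5", prompt)  # final escalation
    return response
```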
**Advantages:**
- Quality guarantee combined with cost optimization
- Automatic escalation
**Disadvantages:**
- Added latency on retries
- Confidence scoring is complex (requires logprobs or custom metrics)
---
### 5. Provisioned + Standard Spillover (Cost + Elasticity)
**Scenario:** Provisioned PTU for baseline, Standard deployment for burst traffic.
**Architecture:**
```
Client → Azure API Management
├─ Primary: Provisioned PTU deployment (300 PTU)
└─ Spillover (on 429): Standard deployment
→ Same Azure OpenAI instance or multiple instances
```
**Cost model:**
- **Provisioned:** Fixed hourly cost ($/PTU/hr), sized for 80-90% of traffic
- **Standard:** Pay-per-token for bursts (10-20% of traffic)
**Implementation (Azure API Management policy):**
```xml
```
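The spillover flow can also be sketched in Python, for example inside a custom code gateway. `Throttled` and the two callables are hypothetical stand-ins for whatever client the gateway uses; the only behavior taken from the pattern above is "try Provisioned first, fall back to Standard on HTTP 429".

```python
class Throttled(Exception):
    """Raised by a backend call when the service answers HTTP 429."""


def call_with_spillover(prompt, primary, spillover):
    """Try the Provisioned deployment first; spill over to Standard on 429.

    `primary` and `spillover` are callables taking the prompt and returning
    a response. Returns (tier, response) so the caller can track cost.
    """
    try:
        return "provisioned", primary(prompt)
    except Throttled:
        return "standard", spillover(prompt)


def busy_ptu(prompt):
    # Simulates a saturated Provisioned deployment.
    raise Throttled()


def standard(prompt):
    return f"answer to: {prompt}"


print(call_with_spillover("hello", busy_ptu, standard))
```

Returning the tier alongside the response makes it straightforward to measure how much traffic actually spills over, which feeds back into PTU right-sizing.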
**Advantages:**
- Cost optimization: provisioned for baseline, pay-as-you-go for peaks
- Latency guarantee via PTU
**Disadvantages:**
- Provisioned capacity must be right-sized (use the [Azure AI Foundry PTU calculator](https://ai.azure.com/resource/calculator))
- Standard quotas are subscription-level (not instance-level)
## Decision Guide
### When to Use Model Router vs. Custom Gateway
| Criterion | Model Router | Custom Gateway |
|-----------|--------------|----------------|
| **Deployment complexity** | Low (one deployment) | High (infrastructure + policy) |
| **Routing control** | Modes + subset | Full control (logic, rules, client identity) |
| **Data residency** | Data Zone Standard (single zone) | Requires per-region gateways for compliance |
| **Multi-region failover** | No (single deployment) | Yes (with API Management multi-region or custom HA) |
| **Client segmentation** | No | Yes (quota per client, chargeback models) |
| **Blue-green deployments** | No | Yes (route to different model versions) |
| **Cost** | Model Router input charge + token usage | Gateway hosting + token usage |
| **Latency** | Real-time routing (minimal overhead) | Gateway hop (~5-20ms, depending on region) |
**Rule of thumb:**
- **Model Router:** For most use cases with standard routing needs
- **Custom Gateway:** When you need client identity routing, data sovereignty, multi-region HA, or quota management
---
### Decision Tree: Choosing a Multi-Model Strategy
```
START: Do you need multi-model routing?
├─ NO: Use a single model deployment (Standard or Provisioned)
└─ YES:
   ├─ Do you need data residency compliance across regions?
   │  ├─ YES: Custom gateway per region (API Management multi-region)
   │  └─ NO: Continue
   ├─ Do you need client-specific quota or chargeback?
   │  ├─ YES: Custom gateway (API Management + client identity routing)
   │  └─ NO: Continue
   ├─ Do you need blue-green deployments or model versioning?
   │  ├─ YES: Custom gateway (API Management policies)
   │  └─ NO: Continue
   └─ Default: Model Router (Balanced mode)
      ├─ Cost-sensitive workload: Model Router (Cost mode)
      └─ Quality-critical workload: Model Router (Quality mode)
```
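The same decision tree can be written as a small function, which is handy for documenting the choice in an architecture decision record or a pre-sales checklist. The parameter names and returned labels are illustrative assumptions; the branch order follows the tree above.

```python
def choose_strategy(
    needs_multi_model: bool,
    cross_region_residency: bool = False,
    client_quota_or_chargeback: bool = False,
    blue_green: bool = False,
    workload: str = "default",
) -> str:
    """Walk the decision tree top to bottom and return the recommendation."""
    if not needs_multi_model:
        return "Single model deployment (Standard or Provisioned)"
    if cross_region_residency:
        return "Custom gateway per region (API Management multi-region)"
    if client_quota_or_chargeback:
        return "Custom gateway (API Management + client identity routing)"
    if blue_green:
        return "Custom gateway (API Management policies)"
    # Default branch: Model Router, with the mode chosen by workload type.
    modes = {"cost-sensitive": "Cost", "quality-critical": "Quality"}
    return f"Model Router ({modes.get(workload, 'Balanced')} mode)"


print(choose_strategy(True, workload="cost-sensitive"))
```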
---
### Common Mistakes
| Mistake | Consequence | Fix |
|---------|-------------|-----|
| **Load balancing or failover across different model versions** | Inconsistent responses, breaking changes | Always use the same model + version within a load-balancing/failover group |
| **Ignoring the `Retry-After` header** | Aggressive retries make throttling worse | Circuit breaker logic that respects `Retry-After` |
| **Gateway in a single region for multi-region backends** | Latency + egress costs | Multi-region gateway deployment (API Management multi-region) |
| **Cross-geopolitical routing** | Data residency violation | Isolated gateways per geopolitical region |
| **Standard deployments in multiple instances within one subscription (same region)** | No additional quota (quota is subscription-level) | Use Global/Data Zone Standard deployments instead |
| **Undersized Provisioned PTU** | Spillover to Standard = cost overruns | Right-size with the [PTU calculator](https://ai.azure.com/resource/calculator) |
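Respecting `Retry-After` is simple to get wrong in client or gateway code, so here is a minimal sketch. It assumes the header carries a number of seconds (the HTTP-date form is not parsed here) and falls back to capped exponential backoff otherwise; the function name and default values are illustrative.

```python
def retry_delay(headers, attempt, base=1.0, cap=60.0):
    """Return the number of seconds to wait before the next retry.

    Uses the server's Retry-After value when it is a number of seconds;
    otherwise falls back to exponential backoff capped at `cap`.
    """
    normalized = {k.lower(): v for k, v in headers.items()}
    value = normalized.get("retry-after")
    if value is not None:
        try:
            # The server's hint is honored as-is (never shortened).
            return max(float(value), 0.0)
        except ValueError:
            pass  # HTTP-date form: not handled in this sketch
    return min(base * (2 ** attempt), cap)


print(retry_delay({"Retry-After": "7"}, attempt=0))
```

The server value is deliberately not capped: retrying sooner than the service asked for is exactly the behavior that makes throttling worse.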
---
### Red Flags
- 🚩 **Gateway as a single point of failure:** Deploy an HA gateway (multi-region or availability zones)
- 🚩 **No health checks on the gateway:** Use synthetic transactions or a `/status` endpoint for upstream health
- 🚩 **Complex routing logic in gateway policies:** Consider a custom code gateway (ACA/AKS) for better testability
- 🚩 **Model Router with a required context window > 128k:** Select a subset of only the models that support it (e.g. the GPT-5 series with 400k context)
- 🚩 **Provisioned PTU scaling on-demand:** PTU capacity is not guaranteed when you scale up, so create production deployments ahead of demand and cover them with reservations for the discount
## Integration with the Microsoft Stack
### Azure OpenAI + Model Router
**Quick Deploy:**
```bash
# Foundry portal: Model catalog → Model Router → Quick Deploy
# Deployment type: Global Standard or Data Zone Standard
# Routing mode: Balanced (default), Cost, Quality
```
**Custom Deploy (med Model Subset):**
```bash
# Foundry portal: Model catalog → Model Router → Custom Deploy
# 1. Choose the deployment type
# 2. Set Routing mode: Cost
# 3. Model subset: select only gpt-4.1-mini, gpt-4.1-nano, o4-mini
# 4. Deploy
```
**Python SDK (using Model Router):**
```python
import os
from openai import OpenAI

client = OpenAI(
    api_key=os.getenv("AZURE_OPENAI_API_KEY"),
    base_url="https://YOUR-RESOURCE.openai.azure.com/openai/v1/",
)
response = client.chat.completions.create(
    model="model-router",  # Model Router deployment name
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain quantum computing in simple terms."},
    ],
)
print(response.choices[0].message.content)
# The underlying model that Model Router selected is reported in response.model
print(response.model)
```
---
### Azure API Management (Custom Gateway)
**Backend pools for load balancing:**
```xml
https://aoai-instance1.openai.azure.com
https://aoai-instance2.openai.azure.com
https://aoai-instance3.openai.azure.com
```
**Circuit breaker policy (preview):**
```xml
```
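For a custom code gateway, the backend pool and circuit breaker ideas can be sketched as follows. This is not the API Management implementation; the class, its method names, and the default trip duration are hypothetical, and a backend is simply skipped until its trip window has passed.

```python
import time


class BackendPool:
    """Round-robin over backends, skipping those that are temporarily tripped."""

    def __init__(self, urls, trip_seconds=10.0, clock=time.monotonic):
        self._urls = list(urls)
        self._open_until = {u: 0.0 for u in self._urls}  # trip expiry per backend
        self._next = 0
        self._trip = trip_seconds
        self._clock = clock  # injectable clock, which keeps the class testable

    def pick(self):
        """Return the next healthy backend URL in round-robin order."""
        now = self._clock()
        for _ in range(len(self._urls)):
            url = self._urls[self._next]
            self._next = (self._next + 1) % len(self._urls)
            if self._open_until[url] <= now:
                return url
        raise RuntimeError("all backends are tripped")

    def report_throttled(self, url, retry_after=None):
        """Trip a backend after a 429, honoring Retry-After when provided."""
        wait = self._trip if retry_after is None else retry_after
        self._open_until[url] = self._clock() + wait


pool = BackendPool([
    "https://aoai-instance1.openai.azure.com",
    "https://aoai-instance2.openai.azure.com",
    "https://aoai-instance3.openai.azure.com",
])
print(pool.pick())
```

A production breaker would usually also count consecutive failures before tripping; this sketch trips on the first reported 429 to stay short.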
**Reference architectures:**
- [Smart load balancing for Azure OpenAI using Azure API Management](https://github.com/Azure-Samples/openai-apim-lb) (GitHub)
- [Scaling Azure OpenAI using Azure API Management](https://github.com/Azure/aoai-apim/) (GitHub, Provisioned + Standard spillover)
- [GenAI gateway toolkit](https://github.com/Azure-Samples/apim-genai-gateway-toolkit) (Load testing + policies)
---
### Semantic Kernel (Application layer routing)
```csharp
using Microsoft.SemanticKernel;
using Microsoft.SemanticKernel.ChatCompletion;

// Static routing per task type: register one chat service per deployment
var kernel = Kernel.CreateBuilder()
    .AddAzureOpenAIChatCompletion(
        deploymentName: "gpt-4.1-nano",
        endpoint: "https://YOUR-RESOURCE.openai.azure.com",
        apiKey: apiKey,
        serviceId: "simple-tasks")
    .AddAzureOpenAIChatCompletion(
        deploymentName: "gpt-5",
        endpoint: "https://YOUR-RESOURCE.openai.azure.com",
        apiKey: apiKey,
        serviceId: "complex-tasks")
    .Build();

// Select the service dynamically by its serviceId key
var chatService = taskComplexity > threshold
    ? kernel.GetRequiredService<IChatCompletionService>("complex-tasks")
    : kernel.GetRequiredService<IChatCompletionService>("simple-tasks");
```
---
### AI Foundry Model Catalog
**Tiered inference (outside Azure OpenAI):**
- **Foundry Model Catalog:** Meta Llama, Mistral, Cohere, Phi models
- **Deployment options:** Managed compute, Serverless API, Pay-as-you-go
- **Use case:** Combine Azure OpenAI with open-source models for a cost-tier strategy
Example: GPT-4.1 for critical tasks, Phi-4 (Microsoft open model) for simple classification.
## Public Sector (Norway)
### Data Sovereignty and Multi-Model Routing
**Model Router:**
- **Data Zone Standard:** Keeps data within the Microsoft-specified data zone (e.g. EU Data Boundary)
- **Underlying models:** Must be deployed in the same data zone (except Claude, which requires separate deployments)
**Custom Gateway (multi-region):**
- **Geopolitical boundaries:** Deploy isolated gateways per region (e.g. Norway East, West Europe)
- **Data residency:** Ensure no cross-region routing (NSG rules, policy enforcement)
- **Compliance:** Azure Policy for consistency (model versions, encryption, network perimeter)
**GDPR/Schrems II:**
- Prefer Data Zone Standard deployments
- Audit gateway logs for data flows (Azure Monitor, Log Analytics)
---
### Budget Processes and Cost Control
**Challenge:** Public agencies work with annual budgets, so AI costs must be forecast.
**Multi-model strategy for budget predictability:**
1. **Baseline with Provisioned PTU:**
- Allocate a fixed cost ($/PTU/hr) for 80-90% of expected traffic
- Use the [PTU calculator](https://ai.azure.com/resource/calculator) for sizing
- Purchase Azure Reservations (1-year or 3-year) for cost savings (up to 50%)
2. **Burst traffic with Standard:**
- Standard deployment for peak periods (budget 10-20% extra)
- Azure Cost Management alerts at a threshold (e.g. 90% of the monthly budget)
3. **Model Router (Cost mode) for volume workloads:**
- Batch processing of documents: Cost mode routes to the cheapest model
- Quality-critical work (e.g. legal analysis): Quality mode for accuracy
**Cost Management integration:**
```bash
# Azure CLI: list usage records for resources whose name contains 'model-router'
az consumption usage list --start-date 2026-02-01 --end-date 2026-02-28 \
  --query "[?contains(instanceName, 'model-router')]" \
  --output table
```
---
### Compliance Requirements (Schrems II, NIS2)
**Multi-region gateway for compliance:**
- **NIS2 (Network and Information Security Directive):** Requires high availability and incident response
- **Multi-region deployment:** Active-active gateways (Azure API Management multi-region) for SLA > 99.9%
- **Incident response:** Azure Monitor alerts on gateway health, automatic failover
**Audit trail:**
- The gateway logs all routing decisions (Azure Log Analytics)
- Include client identity, selected model, response time, cost per request
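One possible shape for such an audit record, as a sketch: the field names are assumptions chosen to match the items listed above, and how the JSON line reaches Log Analytics (agent, API, diagnostic settings) is left out.

```python
import json
from dataclasses import asdict, dataclass
from datetime import datetime, timezone


@dataclass
class RoutingAuditRecord:
    """One routing decision, serialized as a JSON line for log ingestion."""

    client_id: str           # hypothetical field names, not a fixed schema
    selected_model: str
    response_time_ms: float
    cost_nok: float
    timestamp: str = ""      # filled with the current UTC time when empty

    def to_json(self) -> str:
        record = asdict(self)
        if not record["timestamp"]:
            record["timestamp"] = datetime.now(timezone.utc).isoformat()
        return json.dumps(record, sort_keys=True)


print(RoutingAuditRecord("team-a", "gpt-4.1-mini", 412.5, 0.03).to_json())
```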
## Cost and Licensing
### Price Comparison Between Models
**Standard Deployment (Pay-as-you-go, NOK per 1M tokens, estimated 2026 rates):**
| Model | Input (NOK/1M tokens) | Output (NOK/1M tokens) | Ratio (Output:Input) |
|-------|-----------------------|------------------------|----------------------|
| gpt-4.1-nano | ~50 | ~200 | 4:1 |
| gpt-4.1-mini | ~150 | ~600 | 4:1 |
| gpt-4.1 | ~300 | ~1200 | 4:1 |
| gpt-5-mini | ~100 | ~400 | 4:1 |
| gpt-5 | ~500 | ~2000 | 4:1 |
| gpt-5-chat | ~250 | ~1000 | 4:1 |
| o4-mini | ~350 | ~1400 | 4:1 |
| gpt-4o | ~250 | ~1000 | 4:1 |
| gpt-4o-mini | ~75 | ~300 | 4:1 |
*(Prices are estimates based on USD pricing + exchange rate. Verify exact NOK prices with the [Azure Pricing Calculator](https://azure.microsoft.com/pricing/calculator).)*
**Provisioned Throughput (PTU, NOK per PTU/hr, estimated):**
| Model | TPM per PTU (Input) | PTU/hr cost (NOK, estimated) |
|-------|---------------------|------------------------------|
| gpt-4.1-nano | 59 400 | ~80-120 |
| gpt-4.1-mini | 14 900 | ~80-120 |
| gpt-4.1 | 3 000 | ~120-180 |
| gpt-5-mini | 23 750 | ~100-150 |
| gpt-5 | 4 750 | ~180-250 |
| o4-mini | 5 400 | ~150-200 |
*(Provisioned pricing varies per region and reservation type. Use the [Azure Pricing Calculator](https://azure.microsoft.com/pricing/calculator).)*
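As a rough first-pass sizing aid, the TPM-per-PTU column can be turned into a PTU estimate. This sketch only divides peak input TPM by the table value and rounds up; it is an assumption-laden simplification that ignores output tokens, minimum deployment sizes, and purchase increments, all of which the official PTU calculator accounts for.

```python
import math

# Input TPM per PTU, taken from the table above.
TPM_PER_PTU = {
    "gpt-4.1-nano": 59_400,
    "gpt-4.1-mini": 14_900,
    "gpt-4.1": 3_000,
    "gpt-5-mini": 23_750,
    "gpt-5": 4_750,
    "o4-mini": 5_400,
}


def ptus_needed(peak_input_tpm: int, model: str) -> int:
    """Rough lower bound on the PTUs needed to serve a peak input TPM."""
    return math.ceil(peak_input_tpm / TPM_PER_PTU[model])


print(ptus_needed(100_000, "gpt-5"))
print(ptus_needed(100_000, "gpt-4.1-nano"))
```

The same peak load needs far fewer PTUs on a small model, which is the capacity-side view of the per-token price differences shown earlier.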
---
### Savings Potential
**Example: Document summarization (public agency, 10M tokens/month):**
| Strategy | Model(s) | Monthly Cost (NOK, estimated) | Savings |
|----------|----------|------------------------------|---------|
| **Baseline (all GPT-5)** | gpt-5 | ~25 000 (10M input + 2M output) | - |
| **Static routing** | 70% gpt-4.1-mini, 30% gpt-5 | ~10 000 | 60% |
| **Model Router (Balanced)** | Auto-routing | ~8 000 | 68% |
| **Model Router (Cost mode)** | Auto-routing (larger quality band) | ~6 000 | 76% |
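The arithmetic behind a comparison like this can be reproduced with a few lines of Python. The sketch below plugs in the estimated per-token rates from the price comparison table and a 10M input + 2M output workload. With those rates the absolute amounts come out lower than the rounded figures in the table above, so treat both as illustrations and rerun the calculation with verified prices. Model Router rows cannot be computed this way because the routing mix is not known in advance.

```python
# (input, output) rates in NOK per 1M tokens, from the estimated price table.
RATES_NOK_PER_M = {
    "gpt-4.1-nano": (50, 200),
    "gpt-4.1-mini": (150, 600),
    "gpt-5": (500, 2000),
}


def monthly_cost(mix, input_m, output_m):
    """Cost in NOK for a traffic mix {model: share}, volumes in millions of tokens."""
    total = 0.0
    for model, share in mix.items():
        rate_in, rate_out = RATES_NOK_PER_M[model]
        total += share * (input_m * rate_in + output_m * rate_out)
    return total


baseline = monthly_cost({"gpt-5": 1.0}, input_m=10, output_m=2)
blended = monthly_cost({"gpt-4.1-mini": 0.7, "gpt-5": 0.3}, input_m=10, output_m=2)
savings = 1 - blended / baseline
print(round(baseline), round(blended), f"{savings:.0%}")
```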
**Provisioned PTU scenario (high-volume, 100M tokens/month):**
| Strategy | Setup | Monthly Cost (NOK, estimated) | Savings |
|----------|-------|-------------------------------|---------|
| **Standard pay-as-you-go** | 100M input, 20M output | ~200 000 | - |
| **Provisioned (300 PTU gpt-5)** | 300 PTU × 730 hrs × hourly PTU rate | ~43 800 + token overage | 78% |
| **Provisioned + Standard spillover** | 200 PTU + Standard for 20% burst | ~35 000 | 82% |
*(Estimates depend on traffic patterns. Use the [PTU calculator](https://ai.azure.com/resource/calculator) for accurate sizing.)*
---
### Optimization Tips
1. **Right-size Provisioned PTU:**
- Benchmark the actual workload (not estimates)
- Start with 80% of expected peak, use Standard spillover for the remaining 20%
- Purchase Azure Reservations (1-year) for 30-50% savings on PTU cost
2. **Model Router for varying workloads:**
- Use Balanced mode as the default
- Cost mode for batch processing (not time-sensitive)
- Quality mode for compliance-critical outputs (legal, health)
3. **Cache optimization:**
- Prompt caching (GPT-4.1+): up to 100% discount on cached input tokens for Provisioned deployments; Standard deployments bill cached input tokens at a reduced rate
- Semantic Kernel memory: Cache embeddings for RAG
4. **Fine-tuning for cost reduction:**
- Fine-tuned gpt-4o-mini can match gpt-4o quality for specific tasks
- Cost: $1.70/hour hosting + token usage (same rate as base model)
- Example: Fine-tune for domain-specific summarization → replace GPT-5 with gpt-4.1-mini
5. **Monitor and adjust:**
- Azure Cost Management: Set budgets + alerts
- Gateway analytics: Track cost per client, per model, per task type
- Monthly review: Adjust Model Router subset or gateway rules based on cost/quality metrics
## For the Architect (Cosmo)
### Questions to Ask the Customer
1. **Traffic patterns:**
- What are the expected requests per minute (peak and average)?
- Is traffic even across the day, or are there clear peak periods?
- How many tokens per request (input + output)?
2. **Quality vs. cost prioritization:**
- Is there room for a 1-2% quality reduction in exchange for cost savings (Balanced mode)?
- Or is 100% quality non-negotiable (Quality mode)?
- Which tasks can use cheaper models (classification, simple Q&A)?
3. **Compliance and data residency:**
- Must data stay within Norway/EU/a specific geography?
- Is an audit trail required for model selection decisions?
- Is this a multi-tenant scenario with chargeback requirements?
4. **Existing infrastructure:**
- Do you already use Azure API Management, or must the gateway be deployed from scratch?
- Are there multi-region requirements for HA/DR?
- What is the acceptable latency for the gateway hop (5-20ms)?
5. **Budget and forecasting:**
- Is there a fixed annual budget, or pay-as-you-go flexibility?
- Can you commit to a 1-year reservation for PTU savings?
- What is the threshold for cost alerts (90% of budget)?
6. **Deployment strategy:**
- Do you need blue-green deployments for model versioning?
- Do you want to start with Model Router and evaluate a custom gateway later?
- Is there a need for client-specific quota (per team, per project)?
7. **Monitoring and optimization:**
- Who owns cost management (IT, finance, product team)?
- How often should cost/quality metrics be reviewed (monthly, quarterly)?
- Are there baseline metrics for quality (e.g. F1 score, BLEU)?
---
### Pitfalls
| Pitfall | Impact | Mitigation |
|---------|--------|------------|
| **Over-provisioning PTU** | Waste (paying for unused capacity) | Start with 80% of peak, use Standard spillover |
| **Under-provisioning PTU** | Poor UX (throttling, latency) + cost overruns (Standard overage) | Benchmark actual traffic, right-size monthly |
| **Ignoring context window limits (Model Router)** | Failed requests (if a prompt > 128k reaches a model that does not support it) | Model subset selection (only models with the required context window) |
| **Complex routing logic in gateway policies** | Maintenance hell, hard to debug | Start simple (token count), iterate. Consider a custom code gateway for complex logic. |
| **No circuit breaker** | Cascade failures, throttling amplification | Azure API Management circuit breaker policy (respect `Retry-After`) |
| **Single-region gateway for multi-region backends** | Latency + egress costs + SPoF | Deploy multi-region API Management or a custom HA gateway |
| **Cross-geopolitical routing** | Compliance violation (GDPR, Schrems II) | Isolated gateways per region, NSG rules enforcement |
| **No cost monitoring** | Budget overruns discovered too late | Azure Cost Management alerts, monthly reviews, gateway analytics |
---
### Recommendations per Maturity Level
**Level 1 (Pilot/POC):**
- Start with **Model Router (Balanced mode)** for minimal complexity
- Single deployment (Global Standard or Data Zone Standard)
- Monitor cost vs. quality over 1-2 months
- Decision point: Are savings and quality acceptable? → Move to production. If not → consider a custom gateway.
**Level 2 (Production, single-region):**
- **Model Router (Custom deploy)** with a model subset for compliance
- Or **Azure API Management** for simple routing (token count, task type)
- Provisioned PTU for baseline + Standard spillover
- Azure Cost Management alerts + monthly reviews
**Level 3 (Enterprise, multi-region, multi-tenant):**
- **Custom gateway** (Azure API Management multi-region or ACA/AKS + Azure Front Door)
- Client identity-based routing, chargeback models
- Provisioned PTU with 1-year reservations per region
- Automated cost optimization (dynamic model selection based on budget thresholds)
- Compliance audit trail (Log Analytics, Azure Policy)
**Level 4 (Advanced optimization):**
- **Hybrid multi-model strategy:** Azure OpenAI (premium tasks) + AI Foundry open models (commodity tasks)
- Fine-tuned models for domain-specific cost reduction
- Real-time cost/quality feedback loop (A/B testing of routing strategies)
- FinOps team ownership with automated chargebacks
## Sources and Verification
**Microsoft Learn (MCP-verified):**
1. [Model router for Azure AI Foundry](https://learn.microsoft.com/en-us/azure/ai-foundry/openai/concepts/model-router) — **Verified** (MCP fetch, 2026-04)
2. [Use a gateway in front of multiple Azure OpenAI deployments](https://learn.microsoft.com/en-us/azure/architecture/ai-ml/guide/azure-openai-gateway-multi-backend): **Verified** (MCP fetch, 2026-04). The document confirms: (a) terminating and re-establishing credentials at the gateway is recommended over passing client credentials through, (b) the gateway enables client-based usage tracking and chargeback support, (c) Azure OpenAI is now tagged as "Foundry Tools / Azure OpenAI in Foundry Models".
3. [Understanding costs associated with provisioned throughput units (PTU)](https://learn.microsoft.com/en-us/azure/ai-foundry/openai/how-to/provisioned-throughput-onboarding) — **Verified** (MCP search, 2026-04)
4. [Azure OpenAI in Azure AI Foundry Models](https://learn.microsoft.com/en-us/azure/ai-foundry/openai/concepts/models) — **Verified** (MCP search, 2026-04)
5. [GPT-4o vs GPT-4o mini model selection](https://learn.microsoft.com/en-us/azure/ai-foundry/openai/whats-new) — **Verified** (MCP search, 2026-04)
**GitHub samples (MCP-referenced):**
1. [Smart load balancing for Azure OpenAI (Azure API Management)](https://github.com/Azure-Samples/openai-apim-lb) — **Verified**
2. [Scaling Azure OpenAI using Azure API Management](https://github.com/Azure/aoai-apim/) — **Verified**
3. [GenAI gateway toolkit](https://github.com/Azure-Samples/apim-genai-gateway-toolkit) — **Verified**
**Pricing and calculators:**
1. [Azure Pricing Calculator](https://azure.microsoft.com/pricing/calculator) — **Baseline** (pricing subject to change)
2. [Azure AI Foundry PTU calculator](https://ai.azure.com/resource/calculator) — **Verified** (MCP-referenced)
**Confidence level per section:**
| Section | Confidence | Source |
|---------|------------|--------|
| Model Router (components, modes, models) | **Verified** | MCP microsoft-learn fetch |
| Custom Gateway architectures | **Verified** | MCP microsoft-learn fetch |
| Architecture patterns (1-5) | **Verified** | MCP microsoft-learn + GitHub samples |
| Price comparison | **Baseline** | Estimated from USD pricing + currency conversion (verify with Azure Pricing Calculator) |
| Savings potential | **Baseline** | Example calculations (actual savings depend on workload) |
| Public sector (compliance, budget) | **Baseline** | General best practices (verify with legal/compliance team) |
| Integration (API Management policies) | **Verified** | MCP code samples + GitHub repos |
---
**Last updated:** 2026-04 (based on Model Router version `2025-11-18` and Azure OpenAI pricing as of April 2026). Verified (MCP 2026-04).
**Next review:** On new Model Router versions or major pricing changes.