Multi-Model Strategy: Cost-Performance Trade-offs
Last updated: 2026-05 | Verified: MCP 2026-05 | Status: GA | Category: Cost Optimization & FinOps for AI
Introduction
Modern AI solutions often require different model capabilities for different tasks. A multi-model strategy means intelligently routing requests to the most cost-effective model that still satisfies the quality requirements. With Azure OpenAI models ranging from GPT-4.1-nano (59,400 tokens/PTU) to GPT-5 (4,750 tokens/PTU), the savings can be substantial: up to a 90% cost difference between models for simple tasks.
Model Router from Microsoft is a trained language model that automates this decision in real time. It analyzes prompt complexity, reasoning requirements, and task type to select the optimal model from a set of up to 18 underlying models (including the GPT series, Claude, DeepSeek, Llama, and Grok). This provides a single deployment surface with combined cost efficiency and quality.
For organizations that want more control, custom gateway solutions (via Azure API Management or custom code) enable user-defined routing rules based on client identity, quota management, blue-green deployments, or data sovereignty requirements. This knowledge file covers both managed (Model Router) and custom gateway strategies for multi-model deployments.
Core Components
Model Router (Managed Multi-Model Strategy)
| Component | Description | Version/Status |
|---|---|---|
| Model Router | Trained LLM that routes prompts to the best underlying model | 2025-11-18 (GA) |
| Routing Modes | Quality (max accuracy), Balanced (default), Cost (max savings) | GA |
| Model Subset | Custom selection of underlying models for routing | GA |
| Deployment Types | Global Standard, Data Zone Standard | Regional: East US 2, Sweden Central |
| Underlying Models | 18 models: GPT-4.1/5 series, o-series, Claude, DeepSeek, Llama, Grok | Varies per model |
Underlying models in Model Router 2025-11-18:
- OpenAI models: gpt-4.1, gpt-4.1-mini, gpt-4.1-nano, gpt-5, gpt-5-mini, gpt-5-nano, gpt-5-chat, o4-mini, gpt-4o, gpt-4o-mini
- Reasoning models: o4-mini (preview)
- 3rd-party models: DeepSeek-V3.1, gpt-oss-120b, Llama-4-Maverick-17B-128E-Instruct-FP8, grok-4, grok-4-fast
- Claude (requires separate deployment): claude-haiku-4-5, claude-opus-4-1, claude-sonnet-4-5
Rate limits (Model Router 2025-11-18):
| Deployment Type | Default RPM | Default TPM | Enterprise RPM | Enterprise TPM |
|---|---|---|---|---|
| GlobalStandard | 250 | 250,000 | 400 | 400,000 |
| DataZoneStandard | 150 | 150,000 | 300 | 300,000 |
Custom Gateway Architectures
| Topology | Use Case | Tools |
|---|---|---|
| Single Instance + Multiple Deployments | Routing between model versions or fine-tuned models | Azure API Management |
| Multiple Instances (Same Region) | Security segmentation, chargeback, failover, quota spillover (Provisioned → Standard) | Azure API Management |
| Multiple Instances (Multi-Region) | Regional failover, data residency, mixed model availability | Azure API Management (multi-region) or custom code (ACA/AKS) |
Gateway implementations:
- Azure API Management: PaaS solution with backend pools, circuit breaker, and policy-based routing
- Custom Code: Full control, typically Azure Container Apps or AKS, fronted by Azure Front Door/Traffic Manager
Architecture Patterns
1. Model Router: Managed Multi-Model Routing
Scenario: Automatic routing without custom gateway code.
Architecture:
Client → Model Router Deployment → [Auto-selected underlying model]
Routing modes:
- Balanced (default): Chooses among models within a 1-2% quality range of the best model, prioritizing cost
- Cost: Wider quality band (5-6% from the best), maximizes savings
- Quality: Always the highest quality, ignores cost
Model subset: Custom deploy with an explicit subset (e.g. only GPT-4.1, GPT-4.1-mini, o4-mini) for compliance or budget constraints.
Advantages:
- Single deployment surface, no gateway code
- Real-time routing without an extra routing layer
- Supports tools/function calling (agentic scenarios)
Disadvantages:
- Less control over routing logic
- Context window limited to the smallest underlying model (128k for the GPT-4.1 series)
- Routing is based on text input only (not images)
Costs:
- Input prompt: charged per the pricing page (from November 2025)
- No extra hosting cost (included in the model deployment)
2. Static Model Routing (Task-Specific Models)
Scenario: Explicit model selection per task type in client code.
Architecture:
Client Logic:
if task == "summary": use gpt-4.1-mini
if task == "reasoning": use o4-mini
if task == "simple_qa": use gpt-4.1-nano
→ Azure OpenAI deployments (direct)
Decision criteria:
| Task Type | Model | Rationale |
|---|---|---|
| Simple Q&A, classification | gpt-4.1-nano | 59,400 TPM/PTU, lowest cost |
| Summarization, translation | gpt-4.1-mini | 14,900 TPM/PTU, good balance |
| Complex reasoning | o4-mini | Reasoning-capable, 5,400 TPM/PTU |
| High-quality content | gpt-5 | 4,750 TPM/PTU, best quality |
Advantages:
- Full control, no routing layer
- Predictable costs per task type
Disadvantages:
- Logic lives in client code (maintainability burden)
- No dynamic fallback on throttling
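The client logic above stays maintainable if the task-to-deployment mapping is centralized in one place. A minimal sketch, assuming illustrative task names and the deployment names from the decision table:

```python
# Static task-based routing. Task names and deployment names are
# illustrative; map them to your own Azure OpenAI deployments.

# Task type -> deployment name, mirroring the decision table above.
TASK_MODEL_MAP = {
    "simple_qa": "gpt-4.1-nano",   # cheapest, 59,400 TPM/PTU
    "summary": "gpt-4.1-mini",     # good cost/quality balance
    "translation": "gpt-4.1-mini",
    "reasoning": "o4-mini",        # reasoning-capable
    "high_quality": "gpt-5",       # best quality, highest cost
}

DEFAULT_MODEL = "gpt-4.1-nano"

def route(task: str) -> str:
    """Return the deployment name for a task; unknown tasks get the cheapest model."""
    return TASK_MODEL_MAP.get(task, DEFAULT_MODEL)
```

Used with the OpenAI SDK this becomes `client.chat.completions.create(model=route("summary"), messages=[...])`, keeping all routing decisions in one testable function.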
3. Dynamic Complexity-Based Routing (Custom Gateway)
Scenario: The gateway analyzes prompt complexity and routes dynamically.
Architecture:
Client → Azure API Management (eller custom gateway)
├─ Complexity Score (token count, question marks, "explain", "analyze")
├─ Score < 50: route to gpt-4.1-nano
├─ Score 50-200: route to gpt-4.1-mini
└─ Score > 200: route to gpt-5
→ Azure OpenAI instances (multiple deployments)
Implementation (Azure API Management policy):
<choose>
  <when condition="@(context.Request.Body.As<JObject>(preserveContent: true)["messages"][0]["content"].ToString().Length < 200)">
    <set-backend-service backend-id="aoai-nano-backend" />
  </when>
  <when condition="@(context.Request.Body.As<JObject>(preserveContent: true)["messages"][0]["content"].ToString().Length < 1000)">
    <set-backend-service backend-id="aoai-mini-backend" />
  </when>
  <otherwise>
    <set-backend-service backend-id="aoai-gpt5-backend" />
  </otherwise>
</choose>
Advantages:
- Server-side logic (client-agnostic)
- Supports versioning/blue-green deployments
- Usage tracking per client (via API Management analytics)
Disadvantages:
- The gateway is a single point of failure (requires multi-region for HA)
- Complexity concentrates in policy logic
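The complexity score in this pattern can also live in a custom-code gateway. A sketch of the heuristic described above (token count, question marks, keywords like "explain"/"analyze"); the weights and thresholds are illustrative assumptions, not benchmarked values:

```python
# Heuristic prompt-complexity scorer. Weights and thresholds are
# illustrative; tune them against your own workload.

COMPLEX_KEYWORDS = ("explain", "analyze", "compare", "reason", "prove")

def complexity_score(prompt: str) -> int:
    """Rough complexity score: word-count proxy plus question/keyword bonuses."""
    score = len(prompt.split())          # crude token-count proxy
    score += 25 * prompt.count("?")      # questions tend to need reasoning
    lowered = prompt.lower()
    score += sum(50 for kw in COMPLEX_KEYWORDS if kw in lowered)
    return score

def select_model(prompt: str) -> str:
    """Map the score onto the model tiers used in the routing diagram above."""
    score = complexity_score(prompt)
    if score < 50:
        return "gpt-4.1-nano"
    if score <= 200:
        return "gpt-4.1-mini"
    return "gpt-5"
```

A custom-code gateway makes this logic unit-testable, which is the main advantage over embedding it in API Management policy expressions.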
4. Cascading Model Pipeline (Quality Fallback)
Scenario: Start with a cheap model; retry with a more expensive one on low confidence.
Architecture:
Client → Gateway
├─ Try gpt-4.1-nano
├─ If confidence < 0.7: retry with gpt-4.1-mini
└─ If confidence < 0.7: retry with gpt-5
→ Multiple Azure OpenAI deployments
Implementation (pseudocode):
response = call_model("gpt-4.1-nano", prompt)
if response.confidence < 0.7:
response = call_model("gpt-4.1-mini", prompt)
if response.confidence < 0.7:
response = call_model("gpt-5", prompt)
return response
Advantages:
- Quality guarantee with cost optimization
- Automatic escalation
Disadvantages:
- Latency on retries
- Complexity in confidence scoring (requires logprobs or custom metrics)
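The pseudocode above can be sketched as runnable Python. `call_model` is a hypothetical stand-in for an Azure OpenAI call, and the confidence value is assumed to come from logprobs or a custom metric, as noted:

```python
# Cascading pipeline sketch. `call_model` is a stub standing in for an
# Azure OpenAI call; confidence might be derived from mean token logprobs.
from dataclasses import dataclass
from typing import Callable

@dataclass
class ModelResponse:
    text: str
    confidence: float  # 0.0-1.0

# Cheapest first; escalate on low confidence.
CASCADE = ["gpt-4.1-nano", "gpt-4.1-mini", "gpt-5"]
THRESHOLD = 0.7

def cascade(prompt: str, call_model: Callable[[str, str], ModelResponse]) -> ModelResponse:
    """Try models cheapest-first; return the first confident answer (or the last one)."""
    response = None
    for model in CASCADE:
        response = call_model(model, prompt)
        if response.confidence >= THRESHOLD:
            break
    return response
```

Keeping the cascade order and threshold as data makes it easy to A/B test different escalation strategies without touching the control flow.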
5. Provisioned + Standard Spillover (Cost + Elasticity)
Scenario: Provisioned PTU for baseline, Standard deployment for burst traffic.
Architecture:
Client → Azure API Management
├─ Primary: Provisioned PTU deployment (300 PTU)
└─ Spillover (on 429): Standard deployment
→ Same Azure OpenAI instance or multiple instances
Cost model:
- Provisioned: Fixed hourly cost ($/PTU/hr), predictable for 80-90% of traffic
- Standard: Pay-per-token for bursts (10-20% of traffic)
Implementation (Azure API Management policy):
<retry condition="@(context.Response.StatusCode == 429)" count="3" interval="1" first-fast-retry="true">
  <!-- Sketch: default to the provisioned backend; after a 429, retry against Standard -->
  <choose>
    <when condition="@(context.Response != null && context.Response.StatusCode == 429)">
      <set-backend-service backend-id="aoai-standard-backend" />
    </when>
    <otherwise>
      <set-backend-service backend-id="aoai-provisioned-backend" />
    </otherwise>
  </choose>
  <forward-request buffer-request-body="true" />
</retry>
Advantages:
- Cost optimization: provisioned for baseline, pay-as-you-go for peaks
- Latency guarantee via PTU
Disadvantages:
- Provisioned capacity must be right-sized (use the Azure AI Foundry PTU calculator)
- Standard quotas are subscription-level (not instance-level)
Decision Guidance
When to Use Model Router vs. Custom Gateway
| Criterion | Model Router | Custom Gateway |
|---|---|---|
| Deployment complexity | Low (single deployment) | High (infrastructure + policy) |
| Routing control | Modes + subset | Full control (logic, rules, client identity) |
| Data residency | Data Zone Standard (single zone) | Requires per-region gateways for compliance |
| Multi-region failover | No (single deployment) | Yes (with API Management multi-region or custom HA) |
| Client segmentation | No | Yes (quota per client, chargeback models) |
| Blue-green deployments | No | Yes (route to different model versions) |
| Cost | Model Router input charge + token usage | Gateway hosting + token usage |
| Latency | Real-time routing (minimal overhead) | Gateway hop (~5-20 ms, depending on region) |
Rule of thumb:
- Model Router: for most use cases with standard routing needs
- Custom Gateway: when you need client-identity routing, data sovereignty, multi-region HA, or quota management
Decision Tree: Choosing a Multi-Model Strategy
START: Do you need multi-model routing?
├─ NO: Use a single model deployment (Standard or Provisioned)
└─ YES:
   ├─ Do you need data residency compliance across regions?
   │  ├─ YES: Custom gateway per region (API Management multi-region)
   │  └─ NO: Continue
   ├─ Do you need client-specific quota or chargeback?
   │  ├─ YES: Custom gateway (API Management + client identity routing)
   │  └─ NO: Continue
   ├─ Do you need blue-green deployments or model versioning?
   │  ├─ YES: Custom gateway (API Management policies)
   │  └─ NO: Continue
   └─ Default: Model Router (Balanced mode)
      ├─ Cost-sensitive workload: Model Router (Cost mode)
      └─ Quality-critical workload: Model Router (Quality mode)
Common Mistakes
| Mistake | Consequence | Fix |
|---|---|---|
| Routing to different model versions | Inconsistent responses, breaking changes | Always the same model + version in load balancing/failover |
| Ignoring the Retry-After header | Aggressive retries worsen throttling | Circuit breaker logic that respects Retry-After |
| Gateway in a single region for multi-region backends | Latency + egress costs | Multi-region gateway deployment (API Management multi-region) |
| Cross-geopolitical routing | Data residency violation | Isolated gateways per geopolitical region |
| Multiple Standard deployments in the same subscription and region | No quota increase (quota is subscription-level) | Use Global/Data Zone Standard deployments instead |
| Undersized Provisioned PTU | Spillover to Standard = cost overruns | Use the PTU calculator, right-size |
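Respecting Retry-After, as recommended above, can be sketched in a few lines of client code. `send` is a hypothetical stand-in for any HTTP call returning status, headers, and body:

```python
# Client-side backoff that honors Retry-After instead of hammering the API.
# `send` is a stand-in for an HTTP call returning (status, headers, body).
import time
from typing import Callable, Tuple

def call_with_backoff(send: Callable[[], Tuple[int, dict, object]],
                      max_retries: int = 3,
                      default_wait: float = 1.0):
    """Retry on 429, preferring the server-provided Retry-After wait."""
    for attempt in range(max_retries + 1):
        status, headers, body = send()
        if status != 429:
            return status, body
        if attempt == max_retries:
            break
        # Prefer the server-provided wait; fall back to exponential backoff.
        wait = float(headers.get("Retry-After", default_wait * 2 ** attempt))
        time.sleep(wait)
    return status, body
```

The same principle applies inside a gateway: a circuit breaker that trips for the Retry-After duration avoids amplifying throttling across clients.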
Red Flags
- 🚩 Gateway as a single point of failure: Deploy an HA gateway (multi-region or availability zones)
- 🚩 No health checks on the gateway: Use synthetic transactions or a /status endpoint for upstream health
- 🚩 Complex routing logic in gateway policies: Consider a custom-code gateway (ACA/AKS) for better testability
- 🚩 Model Router with a required context window > 128k: Subset-select only models that support it (e.g. the GPT-5 series with 400k context)
- 🚩 Scaling Provisioned PTU on demand: PTU capacity is not guaranteed; use reservations for production
Integration with the Microsoft Stack
Azure OpenAI + Model Router
Quick Deploy:
# Foundry portal: Model catalog → Model Router → Quick Deploy
# Deployment type: Global Standard or Data Zone Standard
# Routing mode: Balanced (default), Cost, Quality
Custom Deploy (with Model Subset):
# Foundry portal: Model catalog → Model Router → Custom Deploy
# 1. Choose deployment type
# 2. Set routing mode: Cost
# 3. Model subset: select only gpt-4.1-mini, gpt-4.1-nano, o4-mini
# 4. Deploy
Python SDK (using Model Router):
import os
from openai import OpenAI
client = OpenAI(
api_key=os.getenv("AZURE_OPENAI_API_KEY"),
base_url="https://YOUR-RESOURCE.openai.azure.com/openai/v1/"
)
response = client.chat.completions.create(
model="model-router", # Model Router deployment name
messages=[
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "Explain quantum computing in simple terms."}
]
)
print(response.choices[0].message.content)
# Model Router automatically selected underlying model (visible in response.model field)
Azure API Management (Custom Gateway)
Backend pools for load balancing:
<backend-pool>
<backend id="aoai-nano-backend">
<url>https://aoai-instance1.openai.azure.com</url>
</backend>
<backend id="aoai-mini-backend">
<url>https://aoai-instance2.openai.azure.com</url>
</backend>
<backend id="aoai-gpt5-backend">
<url>https://aoai-instance3.openai.azure.com</url>
</backend>
</backend-pool>
Circuit breaker policy (preview):
<backends>
<backend>
<circuit-breaker rules="@{
new CircuitBreakerRule(
failureCondition: new HttpStatusCodeCondition(statusCodes: new[] { HttpStatusCode.TooManyRequests }),
tripDuration: TimeSpan.FromSeconds(60),
retryAfterHeader: true
)
}" />
</backend>
</backends>
Reference architectures:
- Smart load balancing for Azure OpenAI using Azure API Management (GitHub)
- Scaling Azure OpenAI using Azure API Management (GitHub, Provisioned + Standard spillover)
- GenAI gateway toolkit (Load testing + policies)
Semantic Kernel (Application layer routing)
// Static routing per task type
var kernel = Kernel.CreateBuilder()
.AddAzureOpenAIChatCompletion(
deploymentName: "gpt-4.1-nano",
endpoint: "https://YOUR-RESOURCE.openai.azure.com",
apiKey: apiKey,
serviceId: "simple-tasks")
.AddAzureOpenAIChatCompletion(
deploymentName: "gpt-5",
endpoint: "https://YOUR-RESOURCE.openai.azure.com",
apiKey: apiKey,
serviceId: "complex-tasks")
.Build();
// Select service dynamically
var chatService = taskComplexity > threshold
? kernel.GetRequiredService<IChatCompletionService>("complex-tasks")
: kernel.GetRequiredService<IChatCompletionService>("simple-tasks");
AI Foundry Model Catalog
Tiered inference (outside Azure OpenAI):
- Foundry Model Catalog: Meta Llama, Mistral, Cohere, Phi models
- Deployment options: Managed compute, Serverless API, Pay-as-you-go
- Use case: Combine Azure OpenAI with open-source models for a cost-tier strategy
Example: GPT-4.1 for critical tasks, Phi-4 (Microsoft open model) for simple classification.
Public Sector (Norway)
Data Sovereignty and Multi-Model Routing
Model Router:
- Data Zone Standard: Keeps data within a Microsoft-specified data zone (e.g. the EU Data Boundary)
- Underlying models: Must be deployed in the same data zone (except Claude, which requires separate deployments)
Custom Gateway (multi-region):
- Geopolitical boundaries: Deploy isolated gateways per region (e.g. Norway East, West Europe)
- Data residency: Ensure no cross-region routing (NSG rules, policy enforcement)
- Compliance: Azure Policy for consistency (model versions, encryption, network perimeter)
GDPR/Schrems II:
- Prefer Data Zone Standard deployments
- Audit gateway logs for data flows (Azure Monitor, Log Analytics)
Budget Processes and Cost Control
Challenge: Public agencies run on annual budgets; AI costs must be forecastable.
Multi-model strategy for budget predictability:
- Baseline with Provisioned PTU:
  - Allocate a fixed cost ($/PTU/hr) for 80-90% of expected traffic
  - Use the PTU calculator for sizing
  - Purchase Azure Reservations (1-year or 3-year) for cost savings (up to 50%)
- Burst traffic with Standard:
  - Standard deployment for peak periods (budget 10-20% extra)
  - Azure Cost Management alerts at thresholds (e.g. 90% of the monthly budget)
- Model Router (Cost mode) for volume workloads:
  - Batch processing of documents: Cost mode routes to the cheapest model
  - Quality-critical work (e.g. legal analysis): Quality mode for accuracy
Cost Management integration:
# Azure Cost Management API: Track costs per resource group
az consumption usage list --start-date 2026-02-01 --end-date 2026-02-28 \
--query "[?contains(instanceName, 'model-router')]" \
--output table
Compliance Requirements (Schrems II, NIS2)
Multi-region gateway for compliance:
- NIS2 (Network and Information Security Directive): Requires high availability and incident response
- Multi-region deployment: Active-active gateways (Azure API Management multi-region) for SLA > 99.9%
- Incident response: Azure Monitor alerts on gateway health, automatic failover
Audit trail:
- The gateway logs all routing decisions (Azure Log Analytics)
- Include client identity, selected model, response time, and cost per request
Cost and Licensing
Model Price Comparison
Standard Deployment (pay-as-you-go, NOK per 1M tokens, estimated 2026 rates):
| Model | Input (NOK/1M tokens) | Output (NOK/1M tokens) | Ratio (Output:Input) |
|---|---|---|---|
| gpt-4.1-nano | ~50 | ~200 | 4:1 |
| gpt-4.1-mini | ~150 | ~600 | 4:1 |
| gpt-4.1 | ~300 | ~1200 | 4:1 |
| gpt-5-mini | ~100 | ~400 | 4:1 |
| gpt-5 | ~500 | ~2000 | 4:1 |
| gpt-5-chat | ~250 | ~1000 | 4:1 |
| o4-mini | ~350 | ~1400 | 4:1 |
| gpt-4o | ~250 | ~1000 | 4:1 |
| gpt-4o-mini | ~75 | ~300 | 4:1 |
(Prices are estimates based on USD pricing plus currency conversion. Verify exact NOK prices with the Azure Pricing Calculator.)
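Monthly cost follows directly from token volume and the per-1M-token rates. As a sanity check on tables like the one above, this sketch uses the estimated rates (placeholders, to be re-run with Pricing Calculator values):

```python
# Cost estimator over estimated NOK-per-1M-token rates. The rates mirror the
# table above and are placeholders; verify with the Azure Pricing Calculator.

RATES_NOK_PER_1M = {              # (input, output)
    "gpt-4.1-nano": (50, 200),
    "gpt-4.1-mini": (150, 600),
    "gpt-5": (500, 2000),
}

def monthly_cost_nok(model: str, input_tokens: int, output_tokens: int) -> float:
    """Estimated monthly cost in NOK for a given token volume."""
    in_rate, out_rate = RATES_NOK_PER_1M[model]
    return (input_tokens / 1e6) * in_rate + (output_tokens / 1e6) * out_rate
```

Running this over a month of logged token counts per model gives a quick bottom-up estimate to compare against the Azure Cost Management figures.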
Provisioned Throughput (PTU, NOK per PTU/hr, estimated):
| Model | TPM per PTU (Input) | PTU/hr cost (NOK, estimated) |
|---|---|---|
| gpt-4.1-nano | 59,400 | ~80-120 |
| gpt-4.1-mini | 14,900 | ~80-120 |
| gpt-4.1 | 3,000 | ~120-180 |
| gpt-5-mini | 23,750 | ~100-150 |
| gpt-5 | 4,750 | ~180-250 |
| o4-mini | 5,400 | ~150-200 |
(Provisioned pricing varies by region and reservation type. Use the Azure Pricing Calculator.)
Savings Potential
Example: Document summarization (public agency, 10M tokens/month):
| Strategy | Model(s) | Monthly Cost (NOK, estimated) | Savings |
|---|---|---|---|
| Baseline (all GPT-5) | gpt-5 | ~25,000 (10M input + 2M output) | - |
| Static routing | 70% gpt-4.1-mini, 30% gpt-5 | ~10,000 | 60% |
| Model Router (Balanced) | Auto-routing | ~8,000 | 68% |
| Model Router (Cost mode) | Auto-routing (wider quality band) | ~6,000 | 76% |
Provisioned PTU scenario (high-volume, 100M tokens/month):
| Strategy | Setup | Monthly Cost (NOK, estimated) | Savings |
|---|---|---|---|
| Standard pay-as-you-go | 100M input, 20M output | ~200,000 | - |
| Provisioned (300 PTU gpt-5) | 300 PTU × 730 hrs at the hourly PTU rate | ~43,800 + token overage | 78% |
| Provisioned + Standard spillover | 200 PTU + Standard for 20% burst | ~35,000 | 82% |
(Estimates depend on traffic patterns. Use the PTU calculator for accurate sizing.)
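The PTU-vs-pay-as-you-go comparison behind these scenarios reduces to a break-even between a fixed hourly PTU cost and per-token pricing. A sketch with illustrative rates (use the Azure AI Foundry PTU calculator for real sizing):

```python
# Break-even sketch: provisioned PTU vs pay-as-you-go. All rates are
# illustrative inputs, not actual Azure pricing.

HOURS_PER_MONTH = 730

def ptu_monthly_cost_nok(ptu_count: int, nok_per_ptu_hr: float) -> float:
    """Fixed monthly cost of a provisioned deployment."""
    return ptu_count * nok_per_ptu_hr * HOURS_PER_MONTH

def paygo_monthly_cost_nok(tokens: int, nok_per_1m: float) -> float:
    """Pay-as-you-go cost for a monthly token volume."""
    return tokens / 1e6 * nok_per_1m

def ptu_is_cheaper(ptu_count: int, nok_per_ptu_hr: float,
                   monthly_tokens: int, nok_per_1m: float) -> bool:
    """True when the fixed PTU cost undercuts pay-as-you-go for the volume."""
    return ptu_monthly_cost_nok(ptu_count, nok_per_ptu_hr) < \
        paygo_monthly_cost_nok(monthly_tokens, nok_per_1m)
```

Because the PTU cost is fixed while the pay-as-you-go cost scales with volume, the break-even point shifts in favor of provisioned capacity as baseline traffic grows, which is why the spillover pattern reserves PTU only for the predictable 80-90%.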
Optimization Tips
- Right-size Provisioned PTU:
  - Benchmark the actual workload (not estimates)
  - Start at 80% of expected peak; use Standard spillover for the remaining 20%
  - Purchase Azure Reservations (1-year) for 30-50% savings on PTU cost
- Model Router for variable workloads:
  - Use Balanced mode as the default
  - Cost mode for batch processing (not time-sensitive)
  - Quality mode for compliance-critical outputs (legal, health)
- Cache optimization:
  - Prompt caching (GPT-4.1+): 100% discount on cached tokens for provisioned deployments (discounted rate on Standard)
  - Semantic Kernel memory: cache embeddings for RAG
- Fine-tuning for cost reduction:
  - A fine-tuned gpt-4o-mini can match gpt-4o quality for specific tasks
  - Cost: $1.70/hour hosting + token usage (same rate as the base model)
  - Example: fine-tune for domain-specific summarization → replace GPT-5 with gpt-4.1-mini
- Monitor and adjust:
  - Azure Cost Management: set budgets + alerts
  - Gateway analytics: track cost per client, per model, per task type
  - Monthly review: adjust the Model Router subset or gateway rules based on cost/quality metrics
For the Architect (Cosmo)
Questions to Ask the Customer
- Traffic patterns:
  - What is the expected requests per minute (peak and average)?
  - Is traffic even across the day, or are there clear peak periods?
  - How many tokens per request (input + output)?
- Quality vs. cost priorities:
  - Is there room for a 1-2% quality reduction in exchange for cost savings (Balanced mode)?
  - Or is maximum quality non-negotiable (Quality mode)?
  - Which tasks can use cheaper models (classification, simple Q&A)?
- Compliance and data residency:
  - Must data stay within Norway/the EU/a specific geography?
  - Is an audit trail required for model selection decisions?
  - Is this a multi-tenant scenario with chargeback requirements?
- Existing infrastructure:
  - Do you already use Azure API Management, or must a gateway be deployed from scratch?
  - Are there multi-region requirements for HA/DR?
  - What latency is acceptable for the gateway hop (5-20 ms)?
- Budget and forecasting:
  - Is there a fixed annual budget, or pay-as-you-go flexibility?
  - Can you commit to a 1-year reservation for PTU savings?
  - What is the threshold for cost alerts (e.g. 90% of budget)?
- Deployment strategy:
  - Do you need blue-green deployments for model versioning?
  - Would you start with Model Router and evaluate a custom gateway later?
  - Is client-specific quota needed (per team, per project)?
- Monitoring and optimization:
  - Who owns cost management (IT, finance, product team)?
  - How often should cost/quality metrics be reviewed (monthly, quarterly)?
  - Are there baseline quality metrics (e.g. F1 score, BLEU)?
Pitfalls
| Pitfall | Impact | Mitigation |
|---|---|---|
| Over-provisioning PTU | Waste (paying for unused capacity) | Start at 80% of peak; use Standard spillover |
| Under-provisioning PTU | Poor UX (throttling, latency) + cost overruns (Standard overage) | Benchmark actual traffic, right-size monthly |
| Ignoring context window limits (Model Router) | Failed requests (if the prompt exceeds 128k for a model that does not support it) | Model subset selection (only models with the required context window) |
| Complex routing logic in gateway policies | Maintenance burden, hard to debug | Start simple (token count), iterate. Consider a custom-code gateway for complex cases. |
| No circuit breaker | Cascade failures, throttling amplification | Azure API Management circuit breaker policy (respect Retry-After) |
| Single-region gateway for multi-region backends | Latency + egress costs + SPoF | Deploy multi-region API Management or a custom HA gateway |
| Cross-geopolitical routing | Compliance violation (GDPR, Schrems II) | Isolated gateways per region, NSG rule enforcement |
| No cost monitoring | Budget overruns discovered too late | Azure Cost Management alerts, monthly reviews, gateway analytics |
Recommendations by Maturity Level
Level 1 (Pilot/POC):
- Start with Model Router (Balanced mode) for minimal complexity
- Single deployment (Global Standard or Data Zone Standard)
- Monitor cost vs. quality over 1-2 months
- Decision point: Are the savings and quality acceptable? If yes, productionize. If no, evaluate a custom gateway.
Level 2 (Production, single-region):
- Model Router (custom deploy) with a model subset for compliance
- Or Azure API Management for simple routing (token count, task type)
- Provisioned PTU for baseline + Standard spillover
- Azure Cost Management alerts + monthly reviews
Level 3 (Enterprise, multi-region, multi-tenant):
- Custom gateway (Azure API Management multi-region, or ACA/AKS + Azure Front Door)
- Client identity-based routing, chargeback models
- Provisioned PTU with 1-year reservations per region
- Automated cost optimization (dynamic model selection based on budget thresholds)
- Compliance audit trail (Log Analytics, Azure Policy)
Level 4 (Advanced optimization):
- Hybrid multi-model strategy: Azure OpenAI (premium tasks) + AI Foundry open models (commodity tasks)
- Fine-tuned models for domain-specific cost reduction
- Real-time cost/quality feedback loop (A/B testing of routing strategies)
- FinOps team ownership with automated chargebacks
Sources and Verification
Microsoft Learn (MCP-verified):
- Model router for Azure AI Foundry — Verified (MCP fetch, 2026-04)
- Use a gateway in front of multiple Azure OpenAI deployments — Verified (MCP fetch, 2026-04). The document confirms: (a) credential termination and re-establishment at the gateway is recommended over passing through client credentials, (b) the gateway enables client-based usage tracking and chargeback support, (c) Azure OpenAI is now tagged as "Foundry Tools / Azure OpenAI in Foundry Models".
- Understanding costs associated with provisioned throughput units (PTU) — Verified (MCP search, 2026-04)
- Azure OpenAI in Azure AI Foundry Models — Verified (MCP search, 2026-04)
- GPT-4o vs GPT-4o mini model selection — Verified (MCP search, 2026-04)
GitHub samples (MCP-referenced):
- Smart load balancing for Azure OpenAI (Azure API Management) — Verified
- Scaling Azure OpenAI using Azure API Management — Verified
- GenAI gateway toolkit — Verified
Pricing and calculators:
- Azure Pricing Calculator — Baseline (pricing subject to change)
- Azure AI Foundry PTU calculator — Verified (MCP-referenced)
Confidence level per section:
| Section | Confidence | Source |
|---|---|---|
| Model Router (components, modes, models) | Verified | MCP microsoft-learn fetch |
| Custom Gateway architectures | Verified | MCP microsoft-learn fetch |
| Architecture patterns (1-5) | Verified | MCP microsoft-learn + GitHub samples |
| Price comparison | Baseline | Estimated from USD pricing + currency conversion (verify with the Azure Pricing Calculator) |
| Savings potential | Baseline | Example calculations (actual savings depend on workload) |
| Public sector (compliance, budget) | Baseline | General best practices (verify with your legal/compliance team) |
| Integration (API Management policies) | Verified | MCP code samples + GitHub repos |
Last updated: 2026-04 (based on Model Router version 2025-11-18 and Azure OpenAI pricing as of April 2026). Verified (MCP 2026-04).
Next review: upon new Model Router versions or major pricing changes.