Multi-Model Strategy: Cost-Performance Trade-offs

Last updated: 2026-05 | Verified: MCP 2026-05 | Status: GA | Category: Cost Optimization & FinOps for AI


Introduction

Modern AI solutions often require different model capabilities for different tasks. A multi-model strategy means intelligently routing requests to the most cost-effective model that satisfies the quality requirements. With Azure OpenAI models ranging from GPT-4.1-nano (59,400 tokens/PTU) to GPT-5 (4,750 tokens/PTU), the savings can be substantial: up to a 90% cost difference between models for simple tasks.

Model Router from Microsoft is a trained language model that automates this decision in real time. It analyzes prompt complexity, reasoning requirements, and task type to select the optimal model from a set of up to 18 underlying models (including the GPT series, Claude, DeepSeek, Llama, and Grok). This provides a single deployment surface that combines cost efficiency and quality.

For organizations that want more control, custom gateway solutions (via Azure API Management or custom code) offer user-defined routing rules based on client identity, quota management, blue-green deployments, or data sovereignty requirements. This knowledge file covers both managed (Model Router) and custom gateway strategies for multi-model deployments.

Core Components

Model Router (Managed Multi-Model Strategy)

| Component | Description | Version/Status |
|---|---|---|
| Model Router | Trained LLM that routes prompts to the best underlying model | 2025-11-18 (GA) |
| Routing Modes | Quality (max accuracy), Balanced (default), Cost (max savings) | GA |
| Model Subset | Custom selection of underlying models for routing | GA |
| Deployment Types | Global Standard, Data Zone Standard | Regional: East US 2, Sweden Central |
| Underlying Models | 18 models: GPT-4.1/5 series, o-series, Claude, DeepSeek, Llama, Grok | Varies per model |

Underlying models in Model Router 2025-11-18:

  • OpenAI models: gpt-4.1, gpt-4.1-mini, gpt-4.1-nano, gpt-5, gpt-5-mini, gpt-5-nano, gpt-5-chat, o4-mini, gpt-4o, gpt-4o-mini
  • Reasoning models: o4-mini (preview)
  • Third-party models: DeepSeek-V3.1, gpt-oss-120b, Llama-4-Maverick-17B-128E-Instruct-FP8, grok-4, grok-4-fast
  • Claude (requires its own deployment): claude-haiku-4-5, claude-opus-4-1, claude-sonnet-4-5

Rate limits (Model Router 2025-11-18):

| Deployment Type | Default RPM | Default TPM | Enterprise RPM | Enterprise TPM |
|---|---|---|---|---|
| GlobalStandard | 250 | 250,000 | 400 | 400,000 |
| DataZoneStandard | 150 | 150,000 | 300 | 300,000 |

Custom Gateway Architectures

| Topology | Use Case | Tools |
|---|---|---|
| Single Instance + Multiple Deployments | Routing between model versions or fine-tuned models | Azure API Management |
| Multiple Instances (Same Region) | Security segmentation, chargeback, failover, quota spillover (Provisioned → Standard) | Azure API Management |
| Multiple Instances (Multi-Region) | Regional failover, data residency, mixed model availability | Azure API Management (multi-region) or custom code (ACA/AKS) |

Gateway implementations:

  • Azure API Management: PaaS solution with backend pools, circuit breaker, and policy-based routing
  • Custom Code: Full control, typically Azure Container Apps or AKS, fronted by Azure Front Door/Traffic Manager

Architecture Patterns

1. Model Router: Managed Multi-Model Routing

Scenario: Automatic routing without custom gateway code.

Architecture:

Client → Model Router Deployment → [Auto-selected underlying model]

Routing modes:

  • Balanced (default): Chooses among models within a 1-2% quality range of the best model, prioritizing cost
  • Cost: Wider quality band (5-6% from the best), maximizing savings
  • Quality: Always the highest quality, ignoring cost

Model subset: Custom deploy with an explicit subset (e.g. only GPT-4.1, GPT-4.1-mini, o4-mini) for compliance or budget constraints.

Advantages:

  • A single deployment surface, no gateway code
  • Real-time routing without an extra layer
  • Supports tools/function calling (agentic scenarios)

Disadvantages:

  • Less control over the routing logic
  • Context window limited to the smallest underlying model (128k for the GPT-4.1 series)
  • Routing is based on text input only (not images)

Costs:

  • Input prompt: Charged per the pricing page (from Nov 2025)
  • No extra hosting cost (included in the model deployment)

2. Static Model Routing (Task-Specific Models)

Scenario: Explicit model selection per task type in client code.

Architecture:

Client Logic:
  if task == "summary": use gpt-4.1-mini
  if task == "reasoning": use o4-mini
  if task == "simple_qa": use gpt-4.1-nano
→ Azure OpenAI deployments (direct)

Decision criteria:

| Task Type | Model | Rationale |
|---|---|---|
| Simple Q&A, classification | gpt-4.1-nano | 59,400 TPM/PTU, lowest cost |
| Summarization, translation | gpt-4.1-mini | 14,900 TPM/PTU, good balance |
| Complex reasoning | o4-mini | Reasoning-capable, 5,400 TPM/PTU |
| High-quality content | gpt-5 | 4,750 TPM/PTU, best quality |

Advantages:

  • Full control, no routing layer
  • Predictable costs per task type

Disadvantages:

  • Logic lives in client code (maintainability)
  • No dynamic fallback on throttling
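The static routing logic above can be captured as a small lookup table in client code. A minimal sketch; the task names and the assumption that deployment names match the model names are illustrative, not from the source:

```python
# Static task-to-deployment routing (assumed deployment names; adjust to
# your Azure OpenAI resource).
TASK_DEPLOYMENTS = {
    "simple_qa": "gpt-4.1-nano",
    "classification": "gpt-4.1-nano",
    "summary": "gpt-4.1-mini",
    "translation": "gpt-4.1-mini",
    "reasoning": "o4-mini",
    "content": "gpt-5",
}

def select_deployment(task: str) -> str:
    """Return the deployment name for a task type; default to the cheapest."""
    return TASK_DEPLOYMENTS.get(task, "gpt-4.1-nano")
```

The default-to-cheapest fallback keeps unknown task types on the lowest-cost model rather than failing.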

3. Dynamic Complexity-Based Routing (Custom Gateway)

Scenario: The gateway analyzes prompt complexity and routes dynamically.

Architecture:

Client → Azure API Management (or custom gateway)
  ├─ Complexity Score (token count, question marks, "explain", "analyze")
  ├─ Score < 50: route to gpt-4.1-nano
  ├─ Score 50-200: route to gpt-4.1-mini
  └─ Score > 200: route to gpt-5
→ Azure OpenAI instances (multiple deployments)

Implementation (Azure API Management policy):

<!-- String length is used as a cheap proxy for the complexity score.
     preserveContent: true is required to read the request body more than once. -->
<choose>
  <when condition="@(context.Request.Body.As<JObject>(preserveContent: true)["messages"][0]["content"].ToString().Length < 200)">
    <set-backend-service backend-id="aoai-nano-backend" />
  </when>
  <when condition="@(context.Request.Body.As<JObject>(preserveContent: true)["messages"][0]["content"].ToString().Length < 1000)">
    <set-backend-service backend-id="aoai-mini-backend" />
  </when>
  <otherwise>
    <set-backend-service backend-id="aoai-gpt5-backend" />
  </otherwise>
</choose>

Advantages:

  • Server-side logic (client-agnostic)
  • Supports versioning/blue-green deployments
  • Usage tracking per client (via API Management analytics)

Disadvantages:

  • The gateway is a single point of failure (requires multi-region for HA)
  • Complexity in the policy logic
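For a custom-code gateway (the ACA/AKS variant mentioned above), the same complexity scoring can be done in application code. A sketch under stated assumptions: word count approximates token count, and the keyword list and weights are illustrative heuristics, not a Microsoft algorithm:

```python
# Heuristic complexity score mirroring the thresholds in the architecture
# sketch: < 50 → nano, 50-200 → mini, > 200 → gpt-5.
REASONING_KEYWORDS = ("explain", "analyze", "compare", "why")

def complexity_score(prompt: str) -> int:
    words = prompt.split()
    score = len(words)  # rough token-count proxy
    score += 25 * sum(1 for w in words if w.lower().strip("?.,") in REASONING_KEYWORDS)
    score += 10 * prompt.count("?")
    return score

def route(prompt: str) -> str:
    score = complexity_score(prompt)
    if score < 50:
        return "gpt-4.1-nano"
    if score <= 200:
        return "gpt-4.1-mini"
    return "gpt-5"
```

Putting this logic in testable application code (rather than policy expressions) is exactly the trade-off the red flags section recommends when routing rules grow complex.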

4. Cascading Model Pipeline (Quality Fallback)

Scenario: Start med billig modell, retry med dyrere ved lav confidence.

Arkitektur:

Client → Gateway
  ├─ Try gpt-4.1-nano
  ├─ If confidence < 0.7: retry with gpt-4.1-mini
  └─ If confidence < 0.7: retry with gpt-5
→ Multiple Azure OpenAI deployments

Implementation (pseudokode):

response = call_model("gpt-4.1-nano", prompt)
if response.confidence < 0.7:
    response = call_model("gpt-4.1-mini", prompt)
if response.confidence < 0.7:
    response = call_model("gpt-5", prompt)
return response

Advantages:

  • Quality guarantee with cost optimization
  • Automatic escalation

Disadvantages:

  • Latency on retries
  • Complexity in confidence scoring (requires logprobs or custom metrics)
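The pseudocode above leaves the confidence metric open. A runnable sketch using mean token probability derived from logprobs as one possible metric; `call_model` is a stand-in for your SDK call, and the escalation order and threshold follow the pattern above:

```python
import math

ESCALATION_ORDER = ["gpt-4.1-nano", "gpt-4.1-mini", "gpt-5"]
THRESHOLD = 0.7

def mean_token_probability(logprobs: list[float]) -> float:
    """Average per-token probability (exp of log-probabilities)."""
    if not logprobs:
        return 0.0
    return sum(math.exp(lp) for lp in logprobs) / len(logprobs)

def cascade(prompt: str, call_model) -> tuple[str, str]:
    """Try models cheapest-first; escalate while confidence < THRESHOLD."""
    text = ""
    for model in ESCALATION_ORDER:
        text, logprobs = call_model(model, prompt)
        if mean_token_probability(logprobs) >= THRESHOLD:
            return model, text
    # All models were below threshold: accept the strongest model's answer.
    return ESCALATION_ORDER[-1], text
```

Mean token probability is a coarse signal (long answers dilute it); task-specific validators or self-evaluation prompts are common alternatives.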

5. Provisioned + Standard Spillover (Cost + Elasticity)

Scenario: Provisioned PTU for the baseline, a Standard deployment for burst traffic.

Architecture:

Client → Azure API Management
  ├─ Primary: Provisioned PTU deployment (300 PTU)
  └─ Spillover (on 429): Standard deployment
→ Same Azure OpenAI instance or multiple instances

Cost model:

  • Provisioned: Fixed hourly cost ($/PTU/hr), predictable for 80-90% of traffic
  • Standard: Pay-per-token for burst (10-20% of traffic)
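The cost model above reduces to simple arithmetic. A sketch with placeholder rates (the PTU hourly price and per-token price below are assumptions; substitute Azure Pricing Calculator figures for your region):

```python
HOURS_PER_MONTH = 730

def blended_monthly_cost(ptus: int, ptu_hr_nok: float,
                         burst_million_tokens: float,
                         nok_per_million: float) -> float:
    """Provisioned baseline (fixed hourly cost) plus Standard pay-per-token burst."""
    provisioned = ptus * ptu_hr_nok * HOURS_PER_MONTH
    burst = burst_million_tokens * nok_per_million
    return provisioned + burst

# Hypothetical example: 10 PTU at 10 NOK/PTU/hr plus 5M burst tokens at 500 NOK/1M.
cost = blended_monthly_cost(10, 10.0, 5, 500.0)  # 75500.0
```

Because the provisioned term is fixed, the blended cost only beats pure pay-as-you-go when baseline utilization is high; that is why the pattern targets 80-90% of traffic on PTU.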

Implementation (Azure API Management policy):

<set-backend-service backend-id="aoai-provisioned-backend" />
<retry condition="@(context.Response.StatusCode == 429)" count="3" interval="1">
  <forward-request buffer-request-body="true" />
  <choose>
    <when condition="@(context.Response.StatusCode == 429)">
      <set-backend-service backend-id="aoai-standard-backend" />
    </when>
  </choose>
</retry>

Advantages:

  • Cost optimization: provisioned for the baseline, pay-as-you-go for peaks
  • Latency guarantee via PTU

Disadvantages:

  • Requires accurate PTU sizing (under-provisioning pushes traffic to Standard and erodes the savings)
  • Latency and cost characteristics differ between the provisioned and spillover paths

Decision Guidance

When to Use Model Router vs. a Custom Gateway

| Criterion | Model Router | Custom Gateway |
|---|---|---|
| Deployment complexity | Low (one deployment) | High (infrastructure + policies) |
| Routing control | Modes + subset | Full control (logic, rules, client identity) |
| Data residency | Data Zone Standard (single zone) | Requires per-region gateways for compliance |
| Multi-region failover | No (single deployment) | Yes (with API Management multi-region or custom HA) |
| Client segmentation | No | Yes (quota per client, chargeback models) |
| Blue-green deployments | No | Yes (route to different model versions) |
| Cost | Model Router input charge + token usage | Gateway hosting + token usage |
| Latency | Real-time routing (minimal overhead) | Gateway hop (~5-20 ms, region-dependent) |

Rule of thumb:

  • Model Router: For most use cases with standard routing needs
  • Custom Gateway: When you need client identity routing, data sovereignty, multi-region HA, or quota management

Decision Tree: Choosing a Multi-Model Strategy

START: Do you need multi-model routing?
  ├─ NO: Use a single model deployment (Standard or Provisioned)
  └─ YES:
      ├─ Do you need data residency compliance across regions?
      │   ├─ YES: Custom gateway per region (API Management multi-region)
      │   └─ NO: Continue
      ├─ Do you need client-specific quota or chargeback?
      │   ├─ YES: Custom gateway (API Management + client identity routing)
      │   └─ NO: Continue
      ├─ Do you need blue-green deployments or model versioning?
      │   ├─ YES: Custom gateway (API Management policies)
      │   └─ NO: Continue
      └─ Default: Model Router (Balanced mode)
          ├─ Cost-sensitive workload: Model Router (Cost mode)
          └─ Quality-critical workload: Model Router (Quality mode)
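The decision tree can be encoded as a function for use in architecture reviews or automation. The parameter names are illustrative; the branch order matches the tree:

```python
# Decision-tree sketch: answers to the architect's questions in, strategy out.
def choose_strategy(
    needs_multi_model: bool,
    cross_region_residency: bool = False,
    client_quota_or_chargeback: bool = False,
    blue_green_or_versioning: bool = False,
    workload: str = "balanced",  # "balanced" | "cost" | "quality"
) -> str:
    if not needs_multi_model:
        return "single model deployment (Standard or Provisioned)"
    if cross_region_residency:
        return "custom gateway per region (API Management multi-region)"
    if client_quota_or_chargeback:
        return "custom gateway (API Management + client identity routing)"
    if blue_green_or_versioning:
        return "custom gateway (API Management policies)"
    return f"Model Router ({workload.capitalize()} mode)"
```

Note the precedence: any compliance or segmentation requirement forces a custom gateway before the Model Router default is considered.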

Common Mistakes

| Mistake | Consequence | Fix |
|---|---|---|
| Routing to different model versions | Inconsistent responses, breaking changes | Always the same model + version in load balancing/failover |
| Ignoring the Retry-After header | Aggressive retries worsen throttling | Circuit breaker logic that respects Retry-After |
| Gateway in a single region for multi-region backends | Latency + egress costs | Multi-region gateway deployment (API Management multi-region) |
| Cross-geopolitical routing | Data residency violation | Isolated gateways per geopolitical region |
| Standard deployments in multiple subscriptions (same region) | No quota increase (subscription-level quota) | Use Global/Data Zone Standard deployments instead |
| Undersized Provisioned PTU | Spillover to Standard = cost overruns | Use the PTU calculator, rightsize |

Red Flags

  • 🚩 Gateway as a single point of failure: Deploy an HA gateway (multi-region or availability zones)
  • 🚩 No health checks on the gateway: Synthetic transactions or a /status endpoint for upstream health
  • 🚩 Complex routing logic in gateway policies: Consider a custom code gateway (ACA/AKS) for better testability
  • 🚩 Model Router with a required context window > 128k: Subset-select only models that support it (e.g. the GPT-5 series with 400k context)
  • 🚩 Scaling Provisioned PTU on demand: PTU capacity is not guaranteed; use reservations for production

Integration with the Microsoft Stack

Azure OpenAI + Model Router

Quick Deploy:

# Foundry portal: Model catalog → Model Router → Quick Deploy
# Deployment type: Global Standard or Data Zone Standard
# Routing mode: Balanced (default), Cost, Quality

Custom Deploy (with Model Subset):

# Foundry portal: Model catalog → Model Router → Custom Deploy
# 1. Choose deployment type
# 2. Set Routing mode: Cost
# 3. Model subset: select only gpt-4.1-mini, gpt-4.1-nano, o4-mini
# 4. Deploy

Python SDK (using Model Router):

import os
from openai import OpenAI

client = OpenAI(
    api_key=os.getenv("AZURE_OPENAI_API_KEY"),
    base_url="https://YOUR-RESOURCE.openai.azure.com/openai/v1/"
)

response = client.chat.completions.create(
    model="model-router",  # Model Router deployment name
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain quantum computing in simple terms."}
    ]
)

print(response.choices[0].message.content)
# Model Router selects the underlying model automatically (visible in response.model)

Azure API Management (Custom Gateway)

Backend pools for load balancing (illustrative; in API Management, pools are configured on backend resources rather than in policy XML):

<backend-pool>
  <backend id="aoai-nano-backend">
    <url>https://aoai-instance1.openai.azure.com</url>
  </backend>
  <backend id="aoai-mini-backend">
    <url>https://aoai-instance2.openai.azure.com</url>
  </backend>
  <backend id="aoai-gpt5-backend">
    <url>https://aoai-instance3.openai.azure.com</url>
  </backend>
</backend-pool>

Circuit breaker configuration (preview; illustrative, defined on the backend resource):

<backends>
  <backend>
    <circuit-breaker rules="@{
      new CircuitBreakerRule(
        failureCondition: new HttpStatusCodeCondition(statusCodes: new[] { HttpStatusCode.TooManyRequests }),
        tripDuration: TimeSpan.FromSeconds(60),
        retryAfterHeader: true
      )
    }" />
  </backend>
</backends>

Reference architectures: see the GitHub samples under Sources and Verification.


Semantic Kernel (Application layer routing)

// Static routing per task type
var kernel = Kernel.CreateBuilder()
    .AddAzureOpenAIChatCompletion(
        deploymentName: "gpt-4.1-nano",
        endpoint: "https://YOUR-RESOURCE.openai.azure.com",
        apiKey: apiKey,
        serviceId: "simple-tasks")
    .AddAzureOpenAIChatCompletion(
        deploymentName: "gpt-5",
        endpoint: "https://YOUR-RESOURCE.openai.azure.com",
        apiKey: apiKey,
        serviceId: "complex-tasks")
    .Build();

// Select service dynamically
var chatService = taskComplexity > threshold
    ? kernel.GetRequiredService<IChatCompletionService>("complex-tasks")
    : kernel.GetRequiredService<IChatCompletionService>("simple-tasks");

AI Foundry Model Catalog

Tiered inference (outside Azure OpenAI):

  • Foundry Model Catalog: Meta Llama, Mistral, Cohere, and Phi models
  • Deployment options: Managed compute, Serverless API, Pay-as-you-go
  • Use case: Combine Azure OpenAI with open-source models for a cost-tier strategy

Example: GPT-4.1 for critical tasks, Phi-4 (a Microsoft open model) for simple classification.

Public Sector (Norway)

Data Sovereignty and Multi-Model Routing

Model Router:

  • Data Zone Standard: Keeps data within a Microsoft-specified data zone (e.g. the EU Data Boundary)
  • Underlying models: Must be deployed in the same data zone (except Claude, which requires separate deployments)

Custom Gateway (multi-region):

  • Geopolitical boundaries: Deploy isolated gateways per region (e.g. Norway East, West Europe)
  • Data residency: Ensure no cross-region routing (NSG rules, policy enforcement)
  • Compliance: Azure Policy for consistency (model versions, encryption, network perimeter)

GDPR/Schrems II:

  • Prefer Data Zone Standard deployments
  • Audit gateway logs for data flows (Azure Monitor, Log Analytics)

Budget Processes and Cost Control

Challenge: Public agencies have fixed annual budgets, so AI costs must be forecast.

Multi-model strategy for budget predictability:

  1. Baseline with Provisioned PTU:

    • Allocate a fixed cost ($/PTU/hr) for 80-90% of expected traffic
    • Use the PTU calculator for sizing
    • Purchase Azure Reservations (1-year or 3-year) for cost savings (up to 50%)
  2. Burst traffic with Standard:

    • Standard deployment for peak periods (budget 10-20% extra)
    • Azure Cost Management alerts at a threshold (e.g. 90% of the monthly budget)
  3. Model Router (Cost mode) for volume workloads:

    • Batch processing of documents: Cost mode routes to the cheapest model
    • Quality-critical work (e.g. legal analysis): Quality mode for accuracy
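The budgeting steps above can be sketched as arithmetic. The rates and burst share below are placeholder assumptions; size with the PTU calculator and the Azure Pricing Calculator:

```python
# Budget forecast sketch: fixed PTU baseline plus a burst allowance, and a
# simple alert check at a spend threshold.
HOURS_PER_YEAR = 8760

def annual_budget_nok(ptus: int, ptu_hr_nok: float,
                      burst_share: float = 0.15) -> float:
    """Fixed PTU baseline cost plus a burst allowance (share of baseline)."""
    baseline = ptus * ptu_hr_nok * HOURS_PER_YEAR
    return baseline * (1 + burst_share)

def should_alert(spend_to_date_nok: float, monthly_budget_nok: float,
                 threshold: float = 0.9) -> bool:
    """True when spend crosses the alert threshold (e.g. 90% of budget)."""
    return spend_to_date_nok >= threshold * monthly_budget_nok
```

The fixed baseline term is what makes the annual figure predictable; only the burst allowance varies with actual traffic.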

Cost Management integration:

# Azure Cost Management API: Track costs per resource group
az consumption usage list --start-date 2026-02-01 --end-date 2026-02-28 \
  --query "[?contains(instanceName, 'model-router')]" \
  --output table

Compliance Requirements (Schrems II, NIS2)

Multi-region gateway for compliance:

  • NIS2 (Network and Information Security Directive): Requires high availability and incident response
  • Multi-region deployment: Active-active gateways (Azure API Management multi-region) for SLA > 99.9%
  • Incident response: Azure Monitor alerts on gateway health, automatic failover

Audit trail:

  • The gateway logs all routing decisions (Azure Log Analytics)
  • Include client identity, selected model, response time, and cost per request

Cost and Licensing

Price Comparison Between Models

Standard Deployment (pay-as-you-go, NOK per 1M tokens, estimated 2026 rates):

| Model | Input (NOK/1M tokens) | Output (NOK/1M tokens) | Ratio (Output:Input) |
|---|---|---|---|
| gpt-4.1-nano | ~50 | ~200 | 4:1 |
| gpt-4.1-mini | ~150 | ~600 | 4:1 |
| gpt-4.1 | ~300 | ~1200 | 4:1 |
| gpt-5-mini | ~100 | ~400 | 4:1 |
| gpt-5 | ~500 | ~2000 | 4:1 |
| gpt-5-chat | ~250 | ~1000 | 4:1 |
| o4-mini | ~350 | ~1400 | 4:1 |
| gpt-4o | ~250 | ~1000 | 4:1 |
| gpt-4o-mini | ~75 | ~300 | 4:1 |

(Prices are estimates based on USD pricing plus currency conversion. Verify exact NOK prices in the Azure Pricing Calculator.)

Provisioned Throughput (PTU, NOK per PTU/hr, estimated):

| Model | TPM per PTU (Input) | PTU/hr Cost (NOK, estimated) |
|---|---|---|
| gpt-4.1-nano | 59,400 | ~80-120 |
| gpt-4.1-mini | 14,900 | ~80-120 |
| gpt-4.1 | 3,000 | ~120-180 |
| gpt-5-mini | 23,750 | ~100-150 |
| gpt-5 | 4,750 | ~180-250 |
| o4-mini | 5,400 | ~150-200 |

(Provisioned pricing varies by region and reservation type. Use the Azure Pricing Calculator.)


Savings Potential

Example: Document summarization (public agency, 10M tokens/month):

| Strategy | Model(s) | Monthly Cost (NOK, estimated) | Savings |
|---|---|---|---|
| Baseline (all GPT-5) | gpt-5 | ~25,000 (10M input + 2M output) | - |
| Static routing | 70% gpt-4.1-mini, 30% gpt-5 | ~10,000 | 60% |
| Model Router (Balanced) | Auto-routing | ~8,000 | 68% |
| Model Router (Cost mode) | Auto-routing (larger quality band) | ~6,000 | 76% |

Provisioned PTU scenario (high-volume, 100M tokens/month):

| Strategy | Setup | Monthly Cost (NOK, estimated) | Savings |
|---|---|---|---|
| Standard pay-as-you-go | 100M input, 20M output | ~200,000 | - |
| Provisioned (300 PTU gpt-5) | 300 PTU × 730 hrs | ~43,800 + token overage | 78% |
| Provisioned + Standard spillover | 200 PTU + Standard for 20% burst | ~35,000 | 82% |

(Estimates depend on traffic patterns. Use the PTU calculator for accurate sizing.)
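The savings methodology behind the first table can be reproduced as a cost-mix calculation. Per-model rates below come from the price comparison table; note the table's rounded figures rest on additional assumptions (caching, actual routing mix), so treat this strictly as a method sketch and verify rates in the Azure Pricing Calculator:

```python
# Cost-mix calculator: blended monthly cost for a traffic mix across models.
PRICES_NOK_PER_M = {  # model -> (input NOK/1M tokens, output NOK/1M tokens)
    "gpt-4.1-mini": (150, 600),
    "gpt-5": (500, 2000),
}

def monthly_cost(mix: dict[str, float], input_m: float = 10.0,
                 output_m: float = 2.0) -> float:
    """Blended monthly cost for a traffic mix (shares must sum to 1)."""
    total = 0.0
    for model, share in mix.items():
        inp, out = PRICES_NOK_PER_M[model]
        total += share * (input_m * inp + output_m * out)
    return total

baseline = monthly_cost({"gpt-5": 1.0})                      # 9000.0 at these raw rates
static = monthly_cost({"gpt-4.1-mini": 0.7, "gpt-5": 0.3})
savings = 1 - static / baseline                              # roughly 49%
```

With these raw per-token rates the pure pay-as-you-go arithmetic gives lower absolute figures and savings than the rounded table; the point is the mechanism: shifting traffic share to a cheaper model scales cost down linearly.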


Optimization Tips

  1. Right-size Provisioned PTU:

    • Benchmark the actual workload (not estimates)
    • Start with 80% of expected peak; use Standard spillover for the remaining 20%
    • Purchase Azure Reservations (1-year) for 30-50% savings on PTU cost
  2. Model Router for variable workloads:

    • Use Balanced mode as the default
    • Cost mode for batch processing (not time-sensitive)
    • Quality mode for compliance-critical outputs (legal, health)
  3. Cache optimization:

    • Prompt caching (GPT-4.1+): Up to 100% discount on cached input tokens (discount varies by deployment type)
    • Semantic Kernel memory: Cache embeddings for RAG
  4. Fine-tuning for cost reduction:

    • A fine-tuned gpt-4o-mini can match gpt-4o quality for specific tasks
    • Cost: $1.70/hour hosting + token usage (same rate as the base model)
    • Example: Fine-tune for domain-specific summarization → replace GPT-5 with gpt-4.1-mini
  5. Monitor and adjust:

    • Azure Cost Management: Set budgets + alerts
    • Gateway analytics: Track cost per client, per model, per task type
    • Monthly review: Adjust the Model Router subset or gateway rules based on cost/quality metrics

For the Architect (Cosmo)

Questions to Ask the Customer

  1. Traffic patterns:

    • What are the expected requests per minute (peak and average)?
    • Is traffic even across the day, or are there clear peak periods?
    • How many tokens per request (input + output)?
  2. Quality vs. cost prioritization:

    • Is there room for a 1-2% quality reduction in exchange for cost savings (Balanced mode)?
    • Or is 100% quality non-negotiable (Quality mode)?
    • Which tasks can use cheaper models (classification, simple Q&A)?
  3. Compliance and data residency:

    • Must data remain within Norway/the EU/a specific geography?
    • Is an audit trail required for model selection decisions?
    • Is this a multi-tenant scenario with chargeback requirements?
  4. Existing infrastructure:

    • Do you already use Azure API Management, or must a gateway be deployed from scratch?
    • Are there multi-region requirements for HA/DR?
    • What is the acceptable latency for a gateway hop (5-20 ms)?
  5. Budget and forecasting:

    • Is there a fixed annual budget, or pay-as-you-go flexibility?
    • Can you commit to a 1-year reservation for PTU savings?
    • What is the threshold for cost alerts (90% of budget)?
  6. Deployment strategy:

    • Do you need blue-green deployments for model versioning?
    • Do you want to start with Model Router and evaluate a custom gateway later?
    • Is client-specific quota needed (per team, per project)?
  7. Monitoring and optimization:

    • Who owns cost management (IT, finance, the product team)?
    • How often should cost/quality metrics be reviewed (monthly, quarterly)?
    • Are there baseline metrics for quality (e.g. F1 score, BLEU)?

Pitfalls

| Pitfall | Impact | Mitigation |
|---|---|---|
| Over-provisioning PTU | Waste (paying for unused capacity) | Start with 80% of peak; use Standard spillover |
| Under-provisioning PTU | Poor UX (throttling, latency) + cost overruns (Standard overage) | Benchmark actual traffic, rightsize monthly |
| Ignoring context window limits (Model Router) | Failed requests (if the prompt exceeds 128k for a model that does not support it) | Model subset selection (only models with the required context window) |
| Complex routing logic in gateway policies | Maintenance burden, hard to debug | Start simple (token count) and iterate; consider a custom code gateway for complex logic |
| No circuit breaker | Cascade failures, throttling amplification | Azure API Management circuit breaker policy (respect Retry-After) |
| Single-region gateway for multi-region backends | Latency + egress costs + SPoF | Deploy multi-region API Management or a custom HA gateway |
| Cross-geopolitical routing | Compliance violation (GDPR, Schrems II) | Isolated gateways per region, NSG rule enforcement |
| No cost monitoring | Budget overruns discovered too late | Azure Cost Management alerts, monthly reviews, gateway analytics |

Recommendations per Maturity Level

Level 1 (Pilot/POC):

  • Start with Model Router (Balanced mode) for minimal complexity
  • Single deployment (Global Standard or Data Zone Standard)
  • Monitor cost vs. quality for 1-2 months
  • Decision point: Are the savings and quality acceptable? → Productionize. If not → evaluate a custom gateway.

Level 2 (Production, single-region):

  • Model Router (custom deploy) with a model subset for compliance
  • Or Azure API Management for simple routing (token count, task type)
  • Provisioned PTU for the baseline + Standard spillover
  • Azure Cost Management alerts + monthly reviews

Level 3 (Enterprise, multi-region, multi-tenant):

  • Custom gateway (Azure API Management multi-region, or ACA/AKS + Azure Front Door)
  • Client identity-based routing, chargeback models
  • Provisioned PTU with 1-year reservations per region
  • Automated cost optimization (dynamic model selection based on budget thresholds)
  • Compliance audit trail (Log Analytics, Azure Policy)

Level 4 (Advanced optimization):

  • Hybrid multi-model strategy: Azure OpenAI (premium tasks) + AI Foundry open models (commodity tasks)
  • Fine-tuned models for domain-specific cost reduction
  • Real-time cost/quality feedback loop (A/B testing of routing strategies)
  • FinOps team ownership with automated chargebacks

Sources and Verification

Microsoft Learn (MCP-verified):

  1. Model router for Azure AI Foundry: Verified (MCP fetch, 2026-04)
  2. Use a gateway in front of multiple Azure OpenAI deployments: Verified (MCP fetch, 2026-04). The document confirms: (a) credential termination and re-establishment at the gateway is recommended over passing through client credentials, (b) the gateway enables client-based usage tracking and chargeback support, (c) Azure OpenAI is now tagged as "Foundry Tools / Azure OpenAI in Foundry Models".
  3. Understanding costs associated with provisioned throughput units (PTU): Verified (MCP search, 2026-04)
  4. Azure OpenAI in Azure AI Foundry Models: Verified (MCP search, 2026-04)
  5. GPT-4o vs GPT-4o mini model selection: Verified (MCP search, 2026-04)

GitHub samples (MCP-referenced):

  1. Smart load balancing for Azure OpenAI (Azure API Management): Verified
  2. Scaling Azure OpenAI using Azure API Management: Verified
  3. GenAI gateway toolkit: Verified

Pricing and calculators:

  1. Azure Pricing Calculator: Baseline (pricing subject to change)
  2. Azure AI Foundry PTU calculator: Verified (MCP-referenced)

Confidence level per section:

| Section | Confidence | Source |
|---|---|---|
| Model Router (components, modes, models) | Verified | MCP microsoft-learn fetch |
| Custom gateway architectures | Verified | MCP microsoft-learn fetch |
| Architecture patterns (1-5) | Verified | MCP microsoft-learn + GitHub samples |
| Price comparison | Baseline | Estimated from USD pricing + currency conversion (verify with the Azure Pricing Calculator) |
| Savings potential | Baseline | Example calculations (actual savings depend on workload) |
| Public sector (compliance, budget) | Baseline | General best practices (verify with legal/compliance team) |
| Integration (API Management policies) | Verified | MCP code samples + GitHub repos |

Last updated: 2026-04 (based on Model Router version 2025-11-18 and Azure OpenAI pricing as of April 2026). Verified (MCP 2026-04).

Next review: On new Model Router versions or major pricing changes.