Kjell Tore Guttormsen ff6a50d14f docs(architect): weekly KB update — 106 files refreshed (2026-04)

Updates across all 5 skills: ms-ai-advisor, ms-ai-engineering,
ms-ai-governance, ms-ai-security, ms-ai-infrastructure.

Key changes:
- Language Services (Custom Text Classification, Text Analytics, QnA):
  retirement warning 2029-03-31, migration guides to Foundry/GPT-4o
- Agentic Retrieval: 50M free reasoning tokens/month (Public Preview)
- Computer Use: Claude Sonnet 4.5 (preview) + OpenAI CUA models
- Agent Registry: Risks column (M365 E7), user-shared/org-published types
- Declarative agents: schema v1.5 → v1.6, Store validation requirements
- MLflow 3: 13 built-in LLM judges, production monitoring, Genie Code
- AG-UI HITL: ApprovalRequiredAIFunction (C#) + @tool(approval_mode) (Python)
- Entra ID Ignite 2025: Agent ID Admin/Developer RBAC roles, Conditional Access
- Security Copilot: 400 SCU/month per 1000 M365 E5 licenses, auto-provisioned
- Fast Transcription API: phrase lists, 14-language multi-lingual transcription
- Azure Monitor Workbooks: Bicep support, RBAC specifics
- Power Platform Copilot: data residency (Norway/Europe → EU DB, Bing → USA)
- RAG security-rbac: 4-approach table (GA + 3 preview access control methods)
- IaC MLOps: Well-Architected OE:05 principles, Bicep/Terraform patterns
- Translator: image file batch translation Preview (JPEG/PNG/BMP/WebP)

All 106 files: Last updated 2026-04 | Verified: MCP 2026-04

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

2026-04-10 09:13:24 +02:00


Azure AI Services - API Design and Best Practices

Last updated: 2026-04 | Verified: MCP 2026-04 | Status: GA | Category: Azure AI Services (Foundry Tools)

Introduction

When you build production-ready applications on Azure AI Services (Azure OpenAI, Content Safety, Translator, Document Intelligence, Computer Vision, etc.), robust API design and error handling are critical. Distributed cloud services require applications to handle transient failures, throttling, network problems, and unexpected responses in a structured way.

This reference covers best practices for:

Error handling: structured error handling with the Azure SDK exception hierarchy
Retry logic: exponential backoff, rate limiting, and retry storms
Rate limiting: throttling handling and quota management
Batching: efficient use of the Batch API for high-volume operations
Connection management: connection pooling and timeout configuration
Idempotency: designing so that identical requests can be handled safely
Authentication patterns: Managed Identity vs. API keys

Source: Microsoft Learn (verified via MCP 2026-02)

Core Components / Key Properties

1. Azure SDK Exception Hierarchy

Azure SDK for Python and .NET use a hierarchical exception model that supports both generic and specific error handling.

Exception hierarchy:

AzureError (base)
├── ServiceRequestError
├── ServiceResponseError
└── HttpResponseError
    ├── ClientAuthenticationError
    ├── ResourceNotFoundError
    ├── ResourceExistsError
    ├── ResourceModifiedError
    └── ResourceNotModifiedError

Important exception types:

Exception	HTTP Status	When raised	Retry?
`ClientAuthenticationError`	401	Authentication failure	❌ No: fix credentials
`ResourceNotFoundError`	404	Resource doesn't exist	❌ No (unless transient)
`ResourceExistsError`	409	Resource already exists	❌ No: handle duplicate
`HttpResponseError` (429)	429	Rate limit exceeded	✅ Yes: with backoff
`HttpResponseError` (500-504)	500-504	Server/gateway error	✅ Yes: transient
`ServiceRequestError`	N/A	Network/DNS failure	✅ Yes: network transient

2. HTTP Error Codes (Azure OpenAI)

Status Code	Error Type	Retry Strategy
400	Bad Request	❌ Fix input — don't retry
401	Authentication Error	❌ Fix credentials
403	Permission Denied	❌ Fix RBAC assignments
404	Not Found	❌ Verify resource exists
408	Request Timeout	✅ Retry with backoff
422	Unprocessable Entity	❌ Fix input validation
429	Rate Limit Error	✅ Retry with `retry-after` header
500	Internal Server Error	✅ Retry with backoff
502	Bad Gateway	✅ Retry with backoff
503	Service Unavailable	✅ Retry with backoff
504	Gateway Timeout	✅ Retry with backoff

The Azure OpenAI SDKs (Python, .NET, Go) automatically retry 408, 429, 500, 502, 503, and 504, up to 3 times with exponential backoff.

3. Retry Logic Patterns

Exponential backoff (recommended):

from azure.core.pipeline.policies import RetryPolicy
from azure.identity import DefaultAzureCredential
from azure.storage.blob import BlobServiceClient

retry_policy = RetryPolicy(
    retry_total=5,                  # Max retry attempts
    retry_backoff_factor=2,         # Base factor; delay grows exponentially per retry
    retry_backoff_max=60,           # Max backoff: 60s
    retry_on_status_codes=[408, 429, 500, 502, 503, 504]
)

client = BlobServiceClient(
    account_url="https://...",
    credential=DefaultAzureCredential(),
    retry_policy=retry_policy
)
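
To make the schedule implied by settings such as retry_backoff_factor and retry_backoff_max concrete, here is a minimal stdlib-only sketch of capped exponential backoff with optional full jitter. The helper name `backoff_delays` and its parameters are hypothetical, and the formula is the generic textbook one; it is not claimed to match the exact formula the Azure SDK applies internally.

```python
import random


def backoff_delays(retries, factor=2.0, max_delay=60.0, jitter=False, rng=None):
    """Return the wait time in seconds before each retry attempt.

    Hypothetical helper: delay n is factor * 2**n, capped at max_delay.
    With jitter=True each delay is drawn uniformly from [0, capped delay].
    """
    rng = rng or random.Random()
    delays = []
    for attempt in range(retries):
        capped = min(max_delay, factor * (2 ** attempt))
        delays.append(rng.uniform(0, capped) if jitter else capped)
    return delays


print(backoff_delays(5))  # [2.0, 4.0, 8.0, 16.0, 32.0]
```

Jitter spreads retries from many clients over time, which helps avoid the synchronized retry bursts that cause retry storms.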

Azure OpenAI custom retry (Python):

import os

from openai import AzureOpenAI

client = AzureOpenAI(
    azure_endpoint=os.getenv("AZURE_OPENAI_ENDPOINT"),
    api_key=os.getenv("AZURE_OPENAI_API_KEY"),
    api_version="2024-10-21",
    max_retries=5  # Default: 2
)

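C# error handling with manual exponential backoff:

using System;
using System.Threading;
using Azure;
using Azure.AI.Inference;

const int maxRetries = 5;
for (int retryCount = 0; ; retryCount++) {
    try {
        var response = client.Complete(requestOptions);
        break;  // Success
    } catch (RequestFailedException ex) when (ex.ErrorCode == "content_filter") {
        Console.WriteLine($"Content filter triggered: {ex.Message}");
        break;  // Not retryable
    } catch (RequestFailedException ex) when (ex.Status == 429 && retryCount < maxRetries) {
        // Exponential backoff: 1s, 2s, 4s, ...
        Thread.Sleep(TimeSpan.FromSeconds(Math.Pow(2, retryCount)));
    }
    // Any other exception, or a 429 after maxRetries, propagates to the caller
}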

4. Rate Limiting and 429 Responses

Azure OpenAI Provisioned Throughput:

A 429 response means the provisioned PTUs are fully utilized
The service returns retry-after and retry-after-ms headers
Default SDK behavior: respects retry-after and retries automatically
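
For clients that do not go through an SDK, the header handling described above can be sketched as follows. This is a minimal illustration under stated assumptions: the helper name `retry_delay_seconds` is hypothetical, header values are assumed to be plain numbers, and HTTP-date values in retry-after are not handled.

```python
def retry_delay_seconds(headers, default=1.0):
    """Pick a wait time from 429 response headers (hypothetical helper).

    Prefers retry-after-ms (milliseconds), then retry-after (seconds),
    and falls back to `default` when neither header parses as a number.
    """
    lowered = {key.lower(): value for key, value in headers.items()}
    for name, divisor in (("retry-after-ms", 1000.0), ("retry-after", 1.0)):
        raw = lowered.get(name)
        if raw is None:
            continue
        try:
            return max(0.0, float(raw) / divisor)
        except ValueError:
            continue  # e.g. an HTTP-date value; not handled in this sketch
    return default


print(retry_delay_seconds({"Retry-After-Ms": "1500", "Retry-After": "2"}))  # 1.5
```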

Handling 429:

Strategy	When to use	Latency Impact
Client-side retry	Higher latency is acceptable	⬆️ Higher (waits for retry-after)
Fallback to another deployment	Low-latency requirements	⬇️ Lower (immediate failover)
Fallback to global-standard	Cost/availability balance	➡️ Moderate (somewhat higher cost)

Rate limiting pattern (for bulk operations):

# Bad practice: Naive retry storm
for record in records:
    try:
        client.process(record)
    except RateLimitError:
        time.sleep(1)  # Fixed delay — overwhelms service

# Good practice: Rate limiter + durable queue
# 1. Enqueue to Azure Event Hubs/Service Bus
# 2. Job processor dequeues at controlled rate
# 3. Tracks PTU utilization via Azure Monitor

5. Batching (Azure OpenAI Batch API)

Batch API: asynchronous batch operations at 50% lower cost than the real-time API.

Use cases:

Large-scale data processing (embeddings, summarization)
Content generation (product descriptions, translations)
Document review (legal, compliance)
NLP tasks (sentiment analysis, classification)

Batch limits:

Parameter	Limit
Max batch files (no expiration)	500
Max batch files (with expiration)	10,000
Max input file size	200 MB (BYOS: 1 GB)
Max requests per file	100,000
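
A large job has to be split so that every input file stays within these limits. The sketch below shows one way to do that with the standard library only. The helper name `split_into_batch_files` is hypothetical, and treating 200 MB as 200 * 1024 * 1024 bytes is an assumption.

```python
import json


def split_into_batch_files(requests, max_requests=100_000, max_bytes=200 * 1024 * 1024):
    """Group request dicts into JSONL chunks that respect both limits.

    Hypothetical helper: returns a list of chunks, each a list of JSONL lines.
    """
    chunks, current, current_bytes = [], [], 0
    for request in requests:
        line = json.dumps(request)
        size = len(line.encode("utf-8")) + 1  # +1 for the trailing newline
        if size > max_bytes:
            raise ValueError("single request exceeds the file size limit")
        # Start a new chunk when adding this line would break either limit
        if current and (len(current) >= max_requests or current_bytes + size > max_bytes):
            chunks.append(current)
            current, current_bytes = [], 0
        current.append(line)
        current_bytes += size
    if current:
        chunks.append(current)
    return chunks


demo = [{"custom_id": str(i)} for i in range(5)]
print([len(chunk) for chunk in split_into_batch_files(demo, max_requests=2)])  # [2, 2, 1]
```

Each chunk can then be written to its own .jsonl file (one line per request) before upload.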

Queueing with exponential backoff (Python):

import time

max_retries = 10
retry_count = 0
batch_job = None

while retry_count < max_retries:
    try:
        batch_job = client.batches.create(
            input_file_id=file_id,
            endpoint="/chat/completions",
            completion_window="24h"
        )
        break  # Success
    except Exception as e:
        if "token limit exceeded" in str(e):
            retry_count += 1
            wait_time = 2 ** retry_count
            time.sleep(wait_time)
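        else:
            raise
else:
    # Loop ended without break: every attempt hit the token limit
    raise RuntimeError("Batch job was not queued after max_retries attempts")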

Fail-fast regions (for batching): Australia East, East US, Germany West Central, Italy North, North Central US, Poland Central, Sweden Central, Switzerland North, East US 2, West US.

6. Connection Pooling and Timeouts

HTTP connection pooling (Python):

import requests

# Keep-alive enabled by default
session = requests.Session()
response = session.get("https://api.example.com")

Azure OpenAI timeout configuration (Python):

from openai import AzureOpenAI

client = AzureOpenAI(
    azure_endpoint="...",
    api_key="...",
    timeout=300.0  # 5 minutes (default: 600s/10 min)
)

Connection pooling for database SDKs:

SDK	Module
MySQL	`mysql.connector.pooling`
PostgreSQL	`psycopg2.pool`
SQLAlchemy	`sqlalchemy.pool`
Pyodbc	Built-in pooling

Best practice:

✅ Use connection pools for database/HTTP clients
✅ Set realistic timeouts (not 10 minutes for user-facing apps)
✅ Implement keepalives for long-running connections
❌ Do NOT create a new connection for every request

7. Idempotency

Definition: An operation is idempotent if it can be called multiple times without producing additional side effects after the first call.

HTTP idempotency:

HTTP Method	Idempotent?	Description
`GET`	✅ Yes	Read-only, no side effects
`PUT`	✅ Yes	Replaces resource at URI
`DELETE`	✅ Yes	Deletes resource (same outcome)
`POST`	❌ No	Creates a new resource each time
`PATCH`	❌ No	Partial update (depends on the operation)

Idempotency-teknikker for Azure AI Services:

# 1. Check if already processed (database lookup)
def process_document(doc_id):
    if already_processed(doc_id):
        return cached_result(doc_id)

    result = client.analyze_document(...)
    save_result(doc_id, result)
    return result

# 2. Event-carried state transfer (Event Hubs)
event = {
    "doc_id": "12345",
    "operation": "set_status",
    "status": "completed",  # Not "increment_count" — idempotent
    "timestamp": "2026-02-03T10:00:00Z"
}

# 3. Deduplication window (Service Bus)
# Enable duplicate detection with MessageId
message.message_id = f"{order_id}-{timestamp}"

Duplicate detection (Azure Service Bus):

Default deduplication window: 10 minutes
Min: 20 seconds, Max: 7 days
Based on MessageId (or MessageId + PartitionKey if partitioned)
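
The semantics of a deduplication window can be illustrated with a small in-memory stand-in. This is only a sketch for reasoning about the behavior, not how Service Bus implements it; the class name `DeduplicationWindow` and the injectable `clock` parameter are hypothetical.

```python
import time


class DeduplicationWindow:
    """In-memory stand-in for a broker's duplicate detection (illustrative only)."""

    def __init__(self, window_seconds=600, clock=time.monotonic):
        self.window_seconds = window_seconds
        self.clock = clock
        self._seen = {}  # message_id -> time it was first accepted

    def accept(self, message_id):
        """Return True for a new message, False for a duplicate inside the window."""
        now = self.clock()
        # Forget IDs whose window has expired
        self._seen = {
            mid: seen_at for mid, seen_at in self._seen.items()
            if now - seen_at < self.window_seconds
        }
        if message_id in self._seen:
            return False
        self._seen[message_id] = now
        return True


window = DeduplicationWindow(window_seconds=600)
print(window.accept("order-42"), window.accept("order-42"))  # True False
```

Note that a message ID only deduplicates retries if the sender reuses the same ID for the same logical message, so any timestamp embedded in the ID should come from the business event and not from the send time.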

Architecture Patterns

Pattern 1: Rate Limiting with Durable Messaging

Problem: Bulk ingestion into a throttled service (Azure Cosmos DB, Azure AI Search) results in retry storms and a high failure rate.

Solution: Use Azure Event Hubs/Service Bus as a buffer, plus a job processor with rate limiting.

User API → Event Hubs → Job Processor (rate-limited) → Azure AI Service
             (buffer)      (100 req/s controlled)

Implementation:

API enqueues messages (millions per second capacity)
Job processor leases partitions from blob storage (15s lease)
- Each partition = 100 PTUs (requests/s)
- Process dequeues only what it can handle in 1s
Monitor utilization via Azure Monitor (Provisioned-Managed Utilization V2)
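
The "dequeue only what it can handle" step can be sketched with a token bucket. This is a minimal single-process illustration under stated assumptions: `TokenBucket` and `drain_once` are hypothetical names, the queue is an in-memory deque standing in for Event Hubs or Service Bus, and partition leasing is left out.

```python
import time
from collections import deque


class TokenBucket:
    """Allow up to `rate` operations per second, with bursts up to `capacity`."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.clock = clock
        self.tokens = float(capacity)
        self.updated = clock()

    def take(self, wanted):
        """Return how many of `wanted` operations may run right now."""
        now = self.clock()
        # Refill in proportion to elapsed time, never above capacity
        self.tokens = min(self.capacity, self.tokens + (now - self.updated) * self.rate)
        self.updated = now
        granted = min(wanted, int(self.tokens))
        self.tokens -= granted
        return granted


def drain_once(queue, bucket, handler):
    """Dequeue and handle only as many items as the bucket currently allows."""
    allowed = bucket.take(len(queue))
    for _ in range(allowed):
        handler(queue.popleft())
    return allowed


pending = deque(range(250))
bucket = TokenBucket(rate=100, capacity=100)
print(drain_once(pending, bucket, handler=lambda item: None))  # 100
```

Calling `drain_once` on a timer keeps the downstream call rate at or below `rate` per second while the remaining items stay safely in the queue.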

Benefits:

✅ Reduces 429 errors from 80% to <5%
✅ Predictable throughput
✅ No data loss on crash (durable queue)
✅ Scales horizontally (multiple job processors)

Pattern 2: Circuit Breaker (for transient faults)

Problem: Repeated calls to an unavailable service make the problem worse (thundering herd).

Solution: Circuit Breaker pattern.

States:

State	Behavior
Closed	Normal operation — forwards requests
Open	Service unavailable — fails fast (no requests)
Half-open	Test if service recovered — 1 request

Implementation (Python):

import time

class CircuitBreaker:
    def __init__(self, failure_threshold=5, recovery_timeout=60):
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self.failure_count = 0
        self.state = 'closed'
        self.last_failure_time = None

    def call(self, func, *args, **kwargs):
        if self.state == 'open':
            if time.time() - self.last_failure_time > self.recovery_timeout:
                self.state = 'half-open'
            else:
                raise Exception("Circuit breaker open")

        try:
            result = func(*args, **kwargs)
            if self.state == 'half-open':
                self.state = 'closed'
                self.failure_count = 0
            return result
        except Exception:
            self.failure_count += 1
            self.last_failure_time = time.time()
            if self.failure_count >= self.failure_threshold:
                self.state = 'open'
            raise

Pattern 3: Idempotent Consumer (Event Hubs + Functions)

Problem: Event Hubs guarantees at-least-once delivery, so events can be processed more than once.

Solution: Idempotent function design.

Techniques:

Duplicate detection via database:

def process_event(event):
    if db.exists(event.id):
        return  # Already processed

    result = ai_client.analyze(event.data)
    db.save(event.id, result)

Event-carried state transfer:

{
  "account_id": "12345",
  "operation": "set_balance",
  "new_balance": 1000  // Not "withdraw 100" — idempotent
}

PeekLock receive mode (Service Bus):
- The consumer gets an exclusive lock (configurable duration)
- Sends an acknowledgment on success
- The message returns to the queue on timeout/failure

Pattern 4: Fallback Strategy (429 Handling)

Multi-tier fallback:

from openai import AzureOpenAI, RateLimitError

def generate_completion(prompt):
    try:
        # 1. Try provisioned deployment (lowest latency)
        return provisioned_client.chat.completions.create(...)
    except RateLimitError:
        # 2. HTTP 429: fall back to standard deployment
        return standard_client.chat.completions.create(...)

# Alternative: Retry with backoff
client = AzureOpenAI(
    max_retries=5,
    timeout=300.0
)
response = client.with_options(max_retries=5).chat.completions.create(...)

Decision Guidance

When to use the Batch API vs. the Real-time API?

Criterion	Batch API	Real-time API
Latency requirement	24-hour turnaround acceptable	<1 second required
Volume	>10,000 requests	<1,000 requests
Cost sensitivity	High (50% saving)	Moderate
Use case	Offline analytics, bulk processing	User-facing chat, real-time translation

Retry Strategy Decision Tree

429 Error?
├─ Yes → Check retry-after header → Wait and retry (max 5x)
│       └─ If still 429 → Fallback to another deployment
│
└─ 500-504? → Exponential backoff (2^n seconds, max 60s)
    ├─ Transient → Retry up to 5 times
    └─ Persistent → Log error + alert ops team

401/403? → Do NOT retry → Fix authentication/RBAC
400/422? → Do NOT retry → Fix input validation
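
The decision tree can be expressed as a small pure function, which makes the policy easy to unit test. This is a sketch only: the function name `next_action` and the action strings are hypothetical, and 408 is grouped with the 5xx codes because the status code table earlier in this document marks it as retryable with backoff.

```python
RETRY_WITH_BACKOFF = {408, 500, 502, 503, 504}


def next_action(status_code, retries_done, max_retries=5):
    """Map an HTTP status and the number of retries already made to an action."""
    if status_code == 429:
        if retries_done < max_retries:
            return "wait_retry_after"
        return "fallback_deployment"
    if status_code in RETRY_WITH_BACKOFF:
        if retries_done < max_retries:
            return "backoff_retry"
        return "log_and_alert"
    if status_code in (401, 403):
        return "fix_auth_or_rbac"  # never retry
    if status_code in (400, 422):
        return "fix_input"  # never retry
    return "raise"


print(next_action(429, retries_done=0))  # wait_retry_after
print(next_action(429, retries_done=5))  # fallback_deployment
```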

Rate Limiting Strategy

Scenario	Recommended Solution
Single client, moderate load	SDK built-in retry logic (e.g. max_retries=5)
Multiple uncoordinated clients	Distributed lease system (blob storage) + partitions
Bulk ingestion	Event Hubs + job processor with rate limiter
User-facing app	Fallback to standard deployment on 429

Integration with the Microsoft Stack

Azure AI Foundry Integration

SDKs that support Azure AI Foundry:

Python: azure-ai-inference, openai (Azure variant)
.NET: Azure.AI.Inference, Azure.AI.OpenAI
JavaScript/TypeScript: @azure/openai, @azure/ai-inference
Go: github.com/openai/openai-go (with Azure endpoint)

Authentication patterns:

# 1. DefaultAzureCredential (recommended for production)
from azure.identity import DefaultAzureCredential
from azure.ai.inference import ChatCompletionsClient

credential = DefaultAzureCredential()
client = ChatCompletionsClient(
    endpoint="https://<resource>.openai.azure.com",
    credential=credential
)

# 2. Managed Identity (Azure-hosted apps)
from azure.identity import ManagedIdentityCredential

credential = ManagedIdentityCredential()

# 3. API Key (development only)
import os

from azure.core.credentials import AzureKeyCredential

credential = AzureKeyCredential(os.getenv("AZURE_OPENAI_API_KEY"))

Azure Monitor Integration

Metrics to monitor:

Metric	Threshold	Alert
`Provisioned-Managed Utilization V2`	>95%	Scale up PTUs
`Dependency failures`	>10%	Check retry logic
`Request duration`	>10s	Optimize prompts/batching
`429 error rate`	>5%	Increase quota or add fallback

Kusto query (Log Analytics):

AzureDiagnostics
| where ResourceType == "COGNITIVE-SERVICES"
| where Category == "RequestResponse"
| where resultCode_d == 429
| summarize count() by bin(TimeGenerated, 5m), clientIp_s
| order by count_ desc

Power Automate / Logic Apps Integration

Error handling in flows:

Configure retry policy:
- Retry count: 4
- Retry interval: Exponential (PT10S, PT20S, PT40S, PT80S)
- Retry on: 408, 429, 500, 502, 503, 504

Handle 429 with condition:

{
  "condition": "@equals(actions('Call_Azure_AI').statusCode, 429)",
  "ifTrue": {
    "Wait": "@actions('Call_Azure_AI').outputs.headers['retry-after']"
  }
}

Public Sector (Norway)

Compliance and Error Handling

GDPR / Norwegian Personal Data Act (Personopplysningsloven):

✅ NEVER log personally identifiable information in error logs
✅ Use correlation IDs (not user IDs) in telemetry
✅ Respect retry-after headers (do not spam APIs)

Example (sanitized logging):

import hashlib
import logging

from azure.core.exceptions import HttpResponseError

logger = logging.getLogger(__name__)

try:
    result = client.analyze_document(doc_id)
except HttpResponseError as e:
    logger.error(
        "Document analysis failed",
        extra={
            "correlation_id": e.response.headers.get("x-ms-request-id"),
            "status_code": e.status_code,
            # Stable SHA-256 digest; the built-in hash() is salted per process
            "doc_id_hash": hashlib.sha256(str(doc_id).encode("utf-8")).hexdigest(),
            "error_code": e.error.code if e.error else None,
        },
    )

Idempotency for Public Sector Use Cases

Case management systems:

✅ Use MessageId = {caseID}-{operation}-{timestamp}
✅ Enable duplicate detection (Service Bus)
✅ Check the database before processing (deduplication table)

Email notification (must be idempotent):

def send_notification(case_id, notification_type):
    message_id = f"{case_id}-{notification_type}"

    if already_sent(message_id):
        return  # Idempotent — don't resend

    send_email(...)
    mark_sent(message_id)

Cost and Licensing

Cost Implications of API Design

429 errors cost nothing (no PTU consumption), BUT:

❌ 400 errors (content filter) are billed (the prompt was processed)
❌ 408 timeouts are billed (partial processing)
❌ finish_reason: content_filter is billed (the completion was filtered)

Batch API savings:

Scenario	Real-time Cost	Batch Cost	Savings
1M tokens (GPT-4o)	~$10	~$5	50%
Embeddings (1M tokens)	~$0.13	~$0.065	50%

Provisioned vs. Standard:

Provisioned: Fixed cost (per PTU/hour), predictable latency
Standard: Pay-per-token, no guarantees under high traffic

Reservation discounts (Provisioned):

1-year commitment: ~37% discount
3-year commitment: ~57% discount

For the Architect (Cosmo)

Design Principles for Robust API Integration

Error Handling Hierarchy:

Try specific exceptions first → HttpResponseError → AzureError → generic Exception

Retry Decision Matrix:
- Transient (retry): 408, 429, 500-504, network errors
- Permanent (don't retry): 400, 401, 403, 404, 422
- Custom logic: 429 with fallback

Rate Limiting Strategy:
- Low volume (<100 req/s): SDK default retry
- High volume (>1000 req/s): Event Hubs + job processor
- Provisioned deployments: Monitor utilization, implement fallback

Batching Decision:
- Latency >1 min? → Batch API
- Volume >10k requests? → Batch API
- Cost critical? → Batch API

Idempotency Checklist:
- Operations designed for identical input?
- Duplicate detection enabled (if using Service Bus)?
- Database check before processing?
- Correlation IDs for tracing?

Common Anti-Patterns (and how to avoid them)

Anti-Pattern	Problem	Solution
while(true) retry loop	Retry storm → overwhelms service	Max retries + exponential backoff
Fixed 1-second delays	Ignores `retry-after` header	Use SDK retry or respect the header
No connection pooling	SNAT port exhaustion	Enable connection pooling
Hardcoded API keys	Security risk	Use Managed Identity + Key Vault
No timeout configuration	Hanging requests (10 min default)	Set realistic timeouts (30-300s)
Logging sensitive data	GDPR violation	Hash/mask PII in logs

Monitoring and Alerting

Critical metrics:

// Azure Monitor query for error rate trends
AzureDiagnostics
| where ResourceType == "COGNITIVE-SERVICES"
| where TimeGenerated > ago(1h)
| summarize
    total_requests = count(),
    errors = countif(resultCode_d >= 400)
    by bin(TimeGenerated, 5m)
| extend error_rate = (errors * 100.0) / total_requests
| where error_rate > 5  // Alert if >5% error rate

Alert rules:

429 rate >5% → Scale PTUs or enable fallback
500-504 errors → Check service health dashboard
Average latency >5s → Optimize prompts or use batch processing

Architecture Decision Records (ADR) Triggers

When should you write an ADR?

Choosing the Batch API over the real-time API for production
Implementing custom retry logic (deviating from SDK defaults)
Using distributed rate limiting (blob leases)
Choosing Provisioned over Standard (cost/latency trade-off)
Implementing a multi-region fallback strategy

Sources and Verification

Verification status: ✅ Verified via Microsoft Learn MCP (2026-02)

Primary sources (fetched):

Handle errors produced by the Azure SDK for Python
- URL: https://learn.microsoft.com/en-us/azure/developer/python/sdk/fundamentals/errors
- Confidence: Verified (MCP fetch)
Rate Limiting pattern
- URL: https://learn.microsoft.com/en-us/azure/architecture/patterns/rate-limiting-pattern
- Confidence: Verified (MCP fetch)
Retry Storm antipattern
- URL: https://learn.microsoft.com/en-us/azure/architecture/antipatterns/retry-storm
- Confidence: Verified (MCP fetch)
Get started using provisioned deployments on Azure OpenAI
- URL: https://learn.microsoft.com/en-us/azure/ai-foundry/openai/how-to/provisioned-get-started
- Confidence: Verified (MCP fetch)
Getting started with Azure OpenAI batch deployments
- URL: https://learn.microsoft.com/en-us/azure/ai-foundry/openai/how-to/batch
- Confidence: Verified (MCP search)
Azure AI services authentication and authorization using .NET
- URL: https://learn.microsoft.com/en-us/dotnet/ai/azure-ai-services-authentication
- Confidence: Verified (MCP search)
Designing Azure Functions for identical input (idempotency)
- URL: https://learn.microsoft.com/en-us/azure/azure-functions/functions-idempotent
- Confidence: Verified (MCP search)
Duplicate detection (Azure Service Bus)
- URL: https://learn.microsoft.com/en-us/azure/service-bus-messaging/duplicate-detection
- Confidence: Verified (MCP search)

Code samples (verified):

Azure.AI.Inference (C#) error handling
Azure SDK Python retry policies
OpenAI Python SDK custom retry configuration

Related documentation:

Azure Monitor metrics and logging
Circuit Breaker pattern (Azure Architecture Center)
Connection pooling (Azure App Service best practices)

Baseline knowledge (model):

HTTP idempotency semantics (RFC 7231)
Exponential backoff algorithms
Connection pooling concepts

MCP call summary: 7 microsoft_docs_search + 4 microsoft_docs_fetch + 1 microsoft_code_sample_search = 12 total MCP calls
