# Azure AI Services - API Design and Best Practices **Last updated:** 2026-04 | Verified: MCP 2026-04 **Status:** GA **Category:** Azure AI Services (Foundry Tools) --- ## Introduksjon Når du bygger produksjonsklare applikasjoner med Azure AI Services (Azure OpenAI, Content Safety, Translator, Document Intelligence, Computer Vision, etc.), er robust API-design og feilhåndtering kritisk. Distribuerte skytjenester krever at applikasjoner håndterer midlertidige feil, throttling, nettverksproblemer og uventede responser på en strukturert måte. Denne referansen dekker best practices for: - **Error handling** — Strukturert feilhåndtering med Azure SDK exception hierarchy - **Retry logic** — Eksponentiell backoff, rate limiting og retry storms - **Rate limiting** — Throttling-håndtering og quota management - **Batching** — Effektiv bruk av Batch API for høyvolum-operasjoner - **Connection management** — Connection pooling og timeout-konfigurering - **Idempotency** — Design for at identiske requests kan håndteres trygt - **Authentication patterns** — Managed Identity vs. API keys **Kilde:** Microsoft Learn (verified via MCP 2026-02) --- ## Kjernekomponenter / Nøkkelegenskaper ### 1. Azure SDK Exception Hierarchy Azure SDK for Python og .NET bruker en hierarkisk exception-modell som gir både generiske og spesifikke error-handling capabilities. **Exception-hierarki:** ``` AzureError (base) ├── ClientAuthenticationError ├── ResourceNotFoundError ├── ResourceExistsError ├── ResourceModifiedError ├── ResourceNotModifiedError ├── ServiceRequestError ├── ServiceResponseError └── HttpResponseError ``` **Viktige exception-typer:** | Exception | HTTP Status | Når den kastes | Retry? | |-----------|-------------|----------------|--------| | `ClientAuthenticationError` | 401 | Authentication failure | ❌ Nei — fix credentials | | `ResourceNotFoundError` | 404 | Resource doesn't exist | ❌ Nei (unless transient) | | `ResourceExistsError` | 409 | Resource already exists | ❌ Nei — handle duplicate | | `HttpResponseError` (429) | 429 | Rate limit exceeded | ✅ Ja — med backoff | | `HttpResponseError` (500-504) | 500-504 | Server/gateway error | ✅ Ja — transient | | `ServiceRequestError` | N/A | Network/DNS failure | ✅ Ja — network transient | ### 2. HTTP Error Codes (Azure OpenAI) | Status Code | Error Type | Retry Strategy | |-------------|-----------|----------------| | 400 | Bad Request | ❌ Fix input — don't retry | | 401 | Authentication Error | ❌ Fix credentials | | 403 | Permission Denied | ❌ Fix RBAC assignments | | 404 | Not Found | ❌ Verify resource exists | | 408 | Request Timeout | ✅ Retry with backoff | | 422 | Unprocessable Entity | ❌ Fix input validation | | 429 | Rate Limit Error | ✅ Retry with `retry-after` header | | 500 | Internal Server Error | ✅ Retry with backoff | | 502 | Bad Gateway | ✅ Retry with backoff | | 503 | Service Unavailable | ✅ Retry with backoff | | 504 | Gateway Timeout | ✅ Retry with backoff | **Azure OpenAI SDKs** (Python, .NET, Go) retry automatisk 408, 429, 500, 502, 503, 504 — opptil 3 ganger med exponentiell backoff. ### 3. Retry Logic Patterns **Eksponentiell backoff (anbefalt):** ```python from azure.core.pipeline.policies import RetryPolicy retry_policy = RetryPolicy( retry_total=5, # Max retry attempts retry_backoff_factor=2, # 2^n seconds retry_backoff_max=60, # Max backoff: 60s retry_on_status_codes=[408, 429, 500, 502, 503, 504] ) client = BlobServiceClient( account_url="https://...", credential=credential, retry_policy=retry_policy ) ``` **Azure OpenAI custom retry (Python):** ```python from openai import AzureOpenAI client = AzureOpenAI( azure_endpoint=os.getenv("AZURE_OPENAI_ENDPOINT"), api_key=os.getenv("AZURE_OPENAI_API_KEY"), api_version="2024-10-21", max_retries=5 # Default: 2 ) ``` **C# retry med Polly:** ```csharp using Azure; using Azure.AI.Inference; try { var response = client.Complete(requestOptions); } catch (RequestFailedException ex) { if (ex.ErrorCode == "content_filter") { Console.WriteLine($"Content filter triggered: {ex.Message}"); } else if (ex.Status == 429) { // Implement exponential backoff Thread.Sleep(TimeSpan.FromSeconds(Math.Pow(2, retryCount))); } else { throw; } } ``` ### 4. Rate Limiting og 429 Responses **Azure OpenAI Provisioned Throughput:** - **429 respons** betyr at provisjonerte PTU-er er fullt benyttet - Service returnerer `retry-after` og `retry-after-ms` headers - **Standard SDK-oppførsel:** Respekterer `retry-after` og retrier automatisk **Håndtering av 429:** | Strategi | Når bruke | Latency Impact | |----------|-----------|----------------| | **Client-side retry** | OK med høyere latency | ⬆️ Høyere (venter på retry-after) | | **Fallback til annen deployment** | Low-latency krav | ⬇️ Lavere (umiddelbar failover) | | **Fallback til global-standard** | Cost/availability balance | ➡️ Moderat (noe høyere cost) | **Rate limiting pattern (for bulk operations):** ```python # Bad practice: Naive retry storm for record in records: try: client.process(record) except RateLimitError: time.sleep(1) # Fixed delay — overwhelms service # Good practice: Rate limiter + durable queue # 1. Enqueue to Azure Event Hubs/Service Bus # 2. Job processor dequeues at controlled rate # 3. Tracks PTU utilization via Azure Monitor ``` ### 5. Batching (Azure OpenAI Batch API) **Batch API:** Asynkrone batch-operasjoner med 50% lavere kostnad enn real-time API. **Bruksområder:** - Large-scale data processing (embeddings, summarization) - Content generation (product descriptions, translations) - Document review (legal, compliance) - NLP tasks (sentiment analysis, classification) **Batch limits:** | Parameter | Limit | |-----------|-------| | Max batch files (no expiration) | 500 | | Max batch files (with expiration) | 10,000 | | Max input file size | 200 MB (BYOS: 1 GB) | | Max requests per file | 100,000 | **Queueing with exponential backoff (Python):** ```python import time max_retries = 10 retry_count = 0 batch_job = None while retry_count < max_retries: try: batch_job = client.batches.create( input_file_id=file_id, endpoint="/chat/completions", completion_window="24h" ) break # Success except Exception as e: if "token limit exceeded" in str(e): retry_count += 1 wait_time = 2 ** retry_count time.sleep(wait_time) else: raise ``` **Fail-fast regions (for batching):** Australia East, East US, Germany West Central, Italy North, North Central US, Poland Central, Sweden Central, Switzerland North, East US 2, West US. ### 6. Connection Pooling og Timeouts **HTTP connection pooling (Python):** ```python import requests # Keep-alive enabled by default session = requests.Session() response = session.get("https://api.example.com") ``` **Azure OpenAI timeout configuration (Python):** ```python from openai import AzureOpenAI client = AzureOpenAI( azure_endpoint="...", api_key="...", timeout=300.0 # 5 minutes (default: 600s/10 min) ) ``` **Connection pooling for database SDKs:** | SDK | Module | |-----|--------| | MySQL | `mysql.connector.pooling` | | PostgreSQL | `psycopg2.pool` | | SQLAlchemy | `sqlalchemy.pool` | | Pyodbc | Built-in pooling | **Best practice:** - ✅ Bruk connection pools for database/HTTP clients - ✅ Sett realistiske timeouts (ikke 10 min for user-facing apps) - ✅ Implementer keepalives for long-running connections - ❌ IKKE opprett nye connections for hver request ### 7. Idempotency **Definisjon:** En operasjon er idempotent hvis den kan kalles flere ganger uten å produsere flere side-effekter etter første kall. **HTTP idempotency:** | HTTP Method | Idempotent? | Beskrivelse | |-------------|-------------|-------------| | `GET` | ✅ Ja | Read-only, ingen side-effekter | | `PUT` | ✅ Ja | Replaces resource at URI | | `DELETE` | ✅ Ja | Deletes resource (samme outcome) | | `POST` | ❌ Nei | Creates new resource hver gang | | `PATCH` | ❌ Nei | Partial update (depends) | **Idempotency-teknikker for Azure AI Services:** ```python # 1. Check if already processed (database lookup) def process_document(doc_id): if already_processed(doc_id): return cached_result(doc_id) result = client.analyze_document(...) save_result(doc_id, result) return result # 2. Event-carried state transfer (Event Hubs) event = { "doc_id": "12345", "operation": "set_status", "status": "completed", # Not "increment_count" — idempotent "timestamp": "2026-02-03T10:00:00Z" } # 3. Deduplication window (Service Bus) # Enable duplicate detection with MessageId message.message_id = f"{order_id}-{timestamp}" ``` **Duplicate detection (Azure Service Bus):** - Default deduplication window: 10 minutes - Min: 20 seconds, Max: 7 days - Based on `MessageId` (or `MessageId + PartitionKey` if partitioned) --- ## Arkitekturmønstre ### Pattern 1: Rate Limiting med Durable Messaging **Problem:** Bulk ingestion til throttled service (Azure Cosmos DB, Azure AI Search) resulterer i retry storms og høy feilrate. **Løsning:** Bruk Azure Event Hubs/Service Bus som buffer + job processor med rate limiting. ``` User API → Event Hubs → Job Processor (rate-limited) → Azure AI Service (buffer) (100 req/s controlled) ``` **Implementering:** 1. **API enqueues messages** (millions per second capacity) 2. **Job processor** leases partitions from blob storage (15s lease) - Each partition = 100 PTUs (requests/s) - Process dequeues only what it can handle in 1s 3. **Monitor utilization** via Azure Monitor (`Provisioned-Managed Utilization V2`) **Fordeler:** - ✅ Reduserer 429 errors fra 80% til <5% - ✅ Predikterbar throughput - ✅ Ingen data loss ved crash (durable queue) - ✅ Skalerer horisontalt (multiple job processors) ### Pattern 2: Circuit Breaker (for transient faults) **Problem:** Gjentatte kall til utilgjengelig service forverrer problemet (thundering herd). **Løsning:** Circuit Breaker pattern. **States:** | State | Oppførsel | |-------|-----------| | **Closed** | Normal operation — forwards requests | | **Open** | Service unavailable — fails fast (no requests) | | **Half-open** | Test if service recovered — 1 request | **Implementering (Python):** ```python class CircuitBreaker: def __init__(self, failure_threshold=5, recovery_timeout=60): self.failure_threshold = failure_threshold self.recovery_timeout = recovery_timeout self.failure_count = 0 self.state = 'closed' self.last_failure_time = None def call(self, func, *args, **kwargs): if self.state == 'open': if time.time() - self.last_failure_time > self.recovery_timeout: self.state = 'half-open' else: raise Exception("Circuit breaker open") try: result = func(*args, **kwargs) if self.state == 'half-open': self.state = 'closed' self.failure_count = 0 return result except Exception: self.failure_count += 1 self.last_failure_time = time.time() if self.failure_count >= self.failure_threshold: self.state = 'open' raise ``` ### Pattern 3: Idempotent Consumer (Event Hubs + Functions) **Problem:** Event Hubs garanterer at-least-once delivery — events kan prosesseres flere ganger. **Løsning:** Idempotent function design. **Teknikker:** 1. **Duplicate detection via database:** ```python def process_event(event): if db.exists(event.id): return # Already processed result = ai_client.analyze(event.data) db.save(event.id, result) ``` 2. **Event-carried state transfer:** ```json { "account_id": "12345", "operation": "set_balance", "new_balance": 1000 // Not "withdraw 100" — idempotent } ``` 3. **PeekLock receive mode (Service Bus):** - Consumer får exclusive lock (configurable duration) - Sender acknowledgment ved success - Message returneres til queue ved timeout/failure ### Pattern 4: Fallback Strategy (429 Handling) **Multi-tier fallback:** ```python from openai import AzureOpenAI def generate_completion(prompt): try: # 1. Try provisioned deployment (lowest latency) return provisioned_client.chat.completions.create(...) except Exception as e: if e.status_code == 429: # 2. Fallback to standard deployment return standard_client.chat.completions.create(...) raise # Alternative: Retry with backoff client = AzureOpenAI( max_retries=5, timeout=300.0 ) response = client.with_options(max_retries=5).chat.completions.create(...) ``` --- ## Beslutningsveiledning ### Når bruke Batch API vs. Real-time API? | Kriterium | Batch API | Real-time API | |-----------|-----------|---------------| | **Latency krav** | >24 timer OK | <1 sekund nødvendig | | **Volume** | >10,000 requests | <1,000 requests | | **Cost sensitivity** | Høy (50% saving) | Moderat | | **Use case** | Offline analytics, bulk processing | User-facing chat, real-time translation | ### Retry Strategy Decision Tree ``` 429 Error? ├─ Ja → Sjekk retry-after header → Vent og retry (max 5x) │ └─ Hvis fortsatt 429 → Fallback til annen deployment │ └─ 500-504? → Exponential backoff (2^n seconds, max 60s) ├─ Transient → Retry opptil 5 ganger └─ Persistent → Log error + alert ops team 401/403? → IKKE retry → Fix authentication/RBAC 400/422? → IKKE retry → Fix input validation ``` ### Rate Limiting Strategy | Scenario | Anbefalt Løsning | |----------|------------------| | **Single client, moderate load** | SDK default retry logic (max_retries=5) | | **Multiple uncoordinated clients** | Distributed lease system (blob storage) + partitions | | **Bulk ingestion** | Event Hubs + job processor med rate limiter | | **User-facing app** | Fallback til standard deployment ved 429 | --- ## Integrasjon med Microsoft-stakken ### Azure AI Foundry Integration **SDK-er som støtter Azure AI Foundry:** - **Python:** `azure-ai-inference`, `openai` (Azure variant) - **.NET:** `Azure.AI.Inference`, `Azure.AI.OpenAI` - **JavaScript/TypeScript:** `@azure/openai`, `@azure/ai-inference` - **Go:** `github.com/openai/openai-go` (med Azure endpoint) **Authentication patterns:** ```python # 1. DefaultAzureCredential (anbefalt for prod) from azure.identity import DefaultAzureCredential from azure.ai.inference import ChatCompletionsClient credential = DefaultAzureCredential() client = ChatCompletionsClient( endpoint="https://.openai.azure.com", credential=credential ) # 2. Managed Identity (Azure-hosted apps) from azure.identity import ManagedIdentityCredential credential = ManagedIdentityCredential() # 3. API Key (development only) from azure.core.credentials import AzureKeyCredential credential = AzureKeyCredential(os.getenv("AZURE_OPENAI_API_KEY")) ``` ### Azure Monitor Integration **Metrics å overvåke:** | Metric | Threshold | Alert | |--------|-----------|-------| | `Provisioned-Managed Utilization V2` | >95% | Scale up PTUs | | `Dependency failures` | >10% | Check retry logic | | `Request duration` | >10s | Optimize prompts/batching | | `429 error rate` | >5% | Increase quota or add fallback | **Kusto query (Log Analytics):** ```kusto AzureDiagnostics | where ResourceType == "COGNITIVE-SERVICES" | where Category == "RequestResponse" | where resultCode_d == 429 | summarize count() by bin(TimeGenerated, 5m), clientIp_s | order by count_ desc ``` ### Power Automate / Logic Apps Integration **Error handling i flows:** 1. **Configure retry policy:** - Retry count: 4 - Retry interval: Exponential (PT10S, PT20S, PT40S, PT80S) - Retry on: 408, 429, 500, 502, 503, 504 2. **Handle 429 with condition:** ```json { "condition": "@equals(actions('Call_Azure_AI').statusCode, 429)", "ifTrue": { "Wait": "@actions('Call_Azure_AI').outputs.headers['retry-after']" } } ``` --- ## Offentlig sektor (Norge) ### Compliance og Error Handling **GDPR/Personopplysningsloven:** - ✅ Logg ALDRI personidentifiserende informasjon i error logs - ✅ Bruk correlation IDs (ikke bruker-ID) i telemetry - ✅ Respekter `retry-after` headers (ikke spam API-er) **Eksempel (sanitized logging):** ```python import logging logger = logging.getLogger(__name__) try: result = client.analyze_document(doc_id) except HttpResponseError as e: logger.error( "Document analysis failed", extra={ "correlation_id": e.response.headers.get('x-ms-request-id'), "status_code": e.status_code, "doc_id": hash(doc_id), # Hash, not plaintext "error_code": e.error.code if e.error else None } ) ``` ### Idempotency for Offentlig Sektor Use Cases **Saksbehandlingssystemer:** - ✅ Bruk MessageId = `{saksID}-{operasjon}-{timestamp}` - ✅ Aktiver duplicate detection (Service Bus) - ✅ Check database før processing (deduplication table) **E-post varsling (som må være idempotent):** ```python def send_notification(case_id, notification_type): message_id = f"{case_id}-{notification_type}" if already_sent(message_id): return # Idempotent — don't resend send_email(...) mark_sent(message_id) ``` --- ## Kostnad og lisensiering ### Kostnad-konsekvenser av API Design **429 Errors kosten ingenting** (ingen PTU consumption), MEN: - ❌ 400 errors (content filter) **koster** (prompt ble prosessert) - ❌ 408 timeout **koster** (delvis processing) - ❌ `finish_reason: content_filter` **koster** (completion ble filtrert) **Batch API savings:** | Scenario | Real-time Cost | Batch Cost | Savings | |----------|----------------|------------|---------| | 1M tokens (GPT-4o) | ~$10 | ~$5 | 50% | | Embeddings (1M tokens) | ~$0.13 | ~$0.065 | 50% | **Provisioned vs. Standard:** - **Provisioned:** Fast kostnad (per PTU/hour), predictable latency - **Standard:** Pay-per-token, ingen garantier ved high traffic **Reservation discounts (Provisioned):** - 1-årig commitment: ~37% discount - 3-årig commitment: ~57% discount --- ## For arkitekten (Cosmo) ### Design Principles for Robust API Integration 1. **Error Handling Hierarchy:** ``` Try specific exceptions first → HttpResponseError → AzureError → generic Exception ``` 2. **Retry Decision Matrix:** - **Transient (retry):** 408, 429, 500-504, network errors - **Permanent (don't retry):** 400, 401, 403, 404, 422 - **Custom logic:** 429 with fallback 3. **Rate Limiting Strategy:** - **Low volume (<100 req/s):** SDK default retry - **High volume (>1000 req/s):** Event Hubs + job processor - **Provisioned deployments:** Monitor utilization, implement fallback 4. **Batching Decision:** - Latency >1 min? → Batch API - Volume >10k requests? → Batch API - Cost critical? → Batch API 5. **Idempotency Checklist:** - [ ] Operations designed for identical input? - [ ] Duplicate detection enabled (if using Service Bus)? - [ ] Database check before processing? - [ ] Correlation IDs for tracing? ### Common Anti-Patterns (og hvordan unngå dem) | Anti-Pattern | Problem | Løsning | |--------------|---------|---------| | **while(true) retry loop** | Retry storm → overwhelms service | Max retries + exponential backoff | | **Fixed 1-second delays** | Ignores `retry-after` header | Use SDK retry eller respekter header | | **Ingen connection pooling** | SNAT port exhaustion | Enable connection pooling | | **Hardcoded API keys** | Security risk | Use Managed Identity + Key Vault | | **No timeout configuration** | Hanging requests (10 min default) | Set realistic timeouts (30-300s) | | **Logging sensitive data** | GDPR violation | Hash/mask PII in logs | ### Monitoring og Alerting **Kritiske metrics:** ```python # Azure Monitor query for error rate trends AzureDiagnostics | where ResourceType == "COGNITIVE-SERVICES" | where TimeGenerated > ago(1h) | summarize total_requests = count(), errors = countif(resultCode_d >= 400) by bin(TimeGenerated, 5m) | extend error_rate = (errors * 100.0) / total_requests | where error_rate > 5 # Alert if >5% error rate ``` **Alert rules:** - **429 rate >5%** → Scale PTUs eller enable fallback - **500-504 errors** → Check service health dashboard - **Average latency >5s** → Optimize prompts eller batch processing ### Architecture Decision Records (ADR) Triggers **Når skal du lage en ADR?** - [ ] Velger Batch API over real-time API for produksjon - [ ] Implementerer custom retry logic (avviker fra SDK defaults) - [ ] Bruker distributed rate limiting (blob leases) - [ ] Velger Provisioned over Standard (cost/latency trade-off) - [ ] Implementerer multi-region fallback strategy --- ## Kilder og verifisering **Verification status:** ✅ Verified via Microsoft Learn MCP (2026-02) **Primary sources (fetched):** 1. **Handle errors produced by the Azure SDK for Python** - URL: https://learn.microsoft.com/en-us/azure/developer/python/sdk/fundamentals/errors - Confidence: **Verified** (MCP fetch) 2. **Rate Limiting pattern** - URL: https://learn.microsoft.com/en-us/azure/architecture/patterns/rate-limiting-pattern - Confidence: **Verified** (MCP fetch) 3. **Retry Storm antipattern** - URL: https://learn.microsoft.com/en-us/azure/architecture/antipatterns/retry-storm - Confidence: **Verified** (MCP fetch) 4. **Get started using provisioned deployments on Azure OpenAI** - URL: https://learn.microsoft.com/en-us/azure/ai-foundry/openai/how-to/provisioned-get-started - Confidence: **Verified** (MCP fetch) 5. **Getting started with Azure OpenAI batch deployments** - URL: https://learn.microsoft.com/en-us/azure/ai-foundry/openai/how-to/batch - Confidence: **Verified** (MCP search) 6. **Azure AI services authentication and authorization using .NET** - URL: https://learn.microsoft.com/en-us/dotnet/ai/azure-ai-services-authentication - Confidence: **Verified** (MCP search) 7. **Designing Azure Functions for identical input (idempotency)** - URL: https://learn.microsoft.com/en-us/azure/azure-functions/functions-idempotent - Confidence: **Verified** (MCP search) 8. **Duplicate detection (Azure Service Bus)** - URL: https://learn.microsoft.com/en-us/azure/service-bus-messaging/duplicate-detection - Confidence: **Verified** (MCP search) **Code samples (verified):** - Azure.AI.Inference (C#) error handling - Azure SDK Python retry policies - OpenAI Python SDK custom retry configuration **Related documentation:** - Azure Monitor metrics and logging - Circuit Breaker pattern (Azure Architecture Center) - Connection pooling (Azure App Service best practices) **Baseline knowledge (model):** - HTTP idempotency semantics (RFC 7231) - Exponential backoff algorithms - Connection pooling concepts **MCP call summary:** 7 microsoft_docs_search + 4 microsoft_docs_fetch + 1 microsoft_code_sample_search = 12 total MCP calls