Azure AI Services - API Design and Best Practices
Last updated: 2026-04 | Verified: MCP 2026-04 Status: GA Category: Azure AI Services (Foundry Tools)
Introduction
When you build production-ready applications on Azure AI Services (Azure OpenAI, Content Safety, Translator, Document Intelligence, Computer Vision, and others), robust API design and error handling are critical. Distributed cloud services require applications to handle transient failures, throttling, network problems, and unexpected responses in a structured way.
This reference covers best practices for:
- Error handling: structured error handling with the Azure SDK exception hierarchy
- Retry logic: exponential backoff, rate limiting, and retry storms
- Rate limiting: throttling handling and quota management
- Batching: efficient use of the Batch API for high-volume operations
- Connection management: connection pooling and timeout configuration
- Idempotency: designing so that identical requests can be handled safely
- Authentication patterns: Managed Identity vs. API keys
Source: Microsoft Learn (verified via MCP 2026-02)
Core Components / Key Features
1. Azure SDK Exception Hierarchy
The Azure SDKs for Python and .NET use a hierarchical exception model that supports both generic and specific error handling.
Exception hierarchy (Python, azure.core.exceptions):
AzureError (base)
├── ServiceRequestError
├── ServiceResponseError
└── HttpResponseError
    ├── ClientAuthenticationError
    ├── ResourceNotFoundError
    ├── ResourceExistsError
    ├── ResourceModifiedError
    └── ResourceNotModifiedError
Key exception types:
| Exception | HTTP Status | When raised | Retry? |
|---|---|---|---|
| ClientAuthenticationError | 401 | Authentication failure | ❌ No (fix credentials) |
| ResourceNotFoundError | 404 | Resource doesn't exist | ❌ No (unless transient) |
| ResourceExistsError | 409 | Resource already exists | ❌ No (handle duplicate) |
| HttpResponseError (429) | 429 | Rate limit exceeded | ✅ Yes (with backoff) |
| HttpResponseError (500-504) | 500-504 | Server/gateway error | ✅ Yes (transient) |
| ServiceRequestError | N/A | Network/DNS failure | ✅ Yes (network transient) |
2. HTTP Error Codes (Azure OpenAI)
| Status Code | Error Type | Retry Strategy |
|---|---|---|
| 400 | Bad Request | ❌ Fix input — don't retry |
| 401 | Authentication Error | ❌ Fix credentials |
| 403 | Permission Denied | ❌ Fix RBAC assignments |
| 404 | Not Found | ❌ Verify resource exists |
| 408 | Request Timeout | ✅ Retry with backoff |
| 422 | Unprocessable Entity | ❌ Fix input validation |
| 429 | Rate Limit Error | ✅ Retry with retry-after header |
| 500 | Internal Server Error | ✅ Retry with backoff |
| 502 | Bad Gateway | ✅ Retry with backoff |
| 503 | Service Unavailable | ✅ Retry with backoff |
| 504 | Gateway Timeout | ✅ Retry with backoff |
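The Azure OpenAI SDKs (Python, .NET, Go) automatically retry 408, 429, 500, 502, 503, and 504 responses with exponential backoff. The default retry count varies by SDK: up to 3 in some clients, while the Python client defaults to 2.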
3. Retry Logic Patterns
Exponential backoff (recommended):
from azure.core.pipeline.policies import RetryPolicy
from azure.storage.blob import BlobServiceClient
retry_policy = RetryPolicy(
    retry_total=5,            # Max retry attempts
    retry_backoff_factor=2,   # Base factor; the delay grows exponentially per retry
    retry_backoff_max=60,     # Cap each backoff at 60 s
    retry_on_status_codes=[408, 429, 500, 502, 503, 504],
)
# `credential` is assumed to be created elsewhere, e.g. DefaultAzureCredential()
client = BlobServiceClient(
    account_url="https://...",
    credential=credential,
    retry_policy=retry_policy,
)
Azure OpenAI custom retry (Python):
import os
from openai import AzureOpenAI
client = AzureOpenAI(
    azure_endpoint=os.getenv("AZURE_OPENAI_ENDPOINT"),
    api_key=os.getenv("AZURE_OPENAI_API_KEY"),
    api_version="2024-10-21",
    max_retries=5,  # Default: 2
)
C# error handling with manual backoff:
using Azure;
using Azure.AI.Inference;
// `client`, `requestOptions`, and `retryCount` are assumed to be defined by the caller.
try {
    var response = client.Complete(requestOptions);
} catch (RequestFailedException ex) {
    if (ex.ErrorCode == "content_filter") {
        Console.WriteLine($"Content filter triggered: {ex.Message}");
    } else if (ex.Status == 429) {
        // Exponential backoff: wait 2^retryCount seconds before the caller retries
        Thread.Sleep(TimeSpan.FromSeconds(Math.Pow(2, retryCount)));
    } else {
        throw;
    }
}
4. Rate Limiting and 429 Responses
Azure OpenAI Provisioned Throughput:
- A 429 response means the provisioned PTUs are fully utilized
- The service returns retry-after and retry-after-ms headers
- Default SDK behavior: respects retry-after and retries automatically
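For clients that handle 429 responses themselves instead of relying on the SDK, the two headers above can be turned into a wait time. The following is a minimal sketch, not the SDK's own logic: the function name `retry_delay_seconds`, the 1-second default, and the 60-second cap are assumptions chosen for illustration, and only numeric header values are handled.

```python
from typing import Mapping


def retry_delay_seconds(headers: Mapping[str, str],
                        default: float = 1.0,
                        cap: float = 60.0) -> float:
    """Return how long to wait before retrying, based on response headers."""
    lowered = {k.lower(): v for k, v in headers.items()}
    # Prefer the millisecond header when present; it is more precise.
    for name, divisor in (("retry-after-ms", 1000.0), ("retry-after", 1.0)):
        raw = lowered.get(name)
        if raw is None:
            continue
        try:
            delay = float(raw) / divisor
        except ValueError:
            continue  # e.g. an HTTP-date value; this sketch only handles numbers
        if delay >= 0:
            return min(delay, cap)
    return default  # neither header was usable
```

Capping the delay keeps a single oversized header value from stalling a worker indefinitely; whether a cap is appropriate depends on the workload.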
Handling 429:
| Strategy | When to use | Latency impact |
|---|---|---|
| Client-side retry | Higher latency is acceptable | ⬆️ Higher (waits for retry-after) |
| Fallback to another deployment | Low-latency requirements | ⬇️ Lower (immediate failover) |
| Fallback to global-standard | Cost/availability balance | ➡️ Moderate (somewhat higher cost) |
Rate limiting pattern (for bulk operations):
# Bad practice: naive retry storm
for record in records:
    try:
        client.process(record)
    except RateLimitError:
        time.sleep(1)  # Fixed delay; many clients retrying this way overwhelm the service
# Good practice: Rate limiter + durable queue
# 1. Enqueue to Azure Event Hubs/Service Bus
# 2. Job processor dequeues at controlled rate
# 3. Tracks PTU utilization via Azure Monitor
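The "rate limiter" half of the good-practice outline can be illustrated with a token bucket. This is a generic sketch rather than anything Azure-specific: the class name `TokenBucket` and its parameters are hypothetical, and the injectable `clock` exists only so the behavior can be checked without real waiting.

```python
import time


class TokenBucket:
    """Allow at most `rate` operations per second, with bursts up to `capacity`."""

    def __init__(self, rate: float, capacity: float, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.clock = clock
        self.tokens = capacity      # start with a full bucket
        self.updated = clock()

    def try_acquire(self, cost: float = 1.0) -> bool:
        """Take `cost` tokens if available; return False when the caller must wait."""
        now = self.clock()
        # Refill in proportion to elapsed time, never above capacity.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.updated) * self.rate)
        self.updated = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False
```

A job processor would call `try_acquire()` before each request and leave the message on the queue when it returns False, so bursts are absorbed by the queue rather than by the throttled service.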
5. Batching (Azure OpenAI Batch API)
Batch API: asynchronous batch operations at 50% lower cost than the real-time API.
Use cases:
- Large-scale data processing (embeddings, summarization)
- Content generation (product descriptions, translations)
- Document review (legal, compliance)
- NLP tasks (sentiment analysis, classification)
Batch limits:
| Parameter | Limit |
|---|---|
| Max batch files (no expiration) | 500 |
| Max batch files (with expiration) | 10,000 |
| Max input file size | 200 MB (BYOS: 1 GB) |
| Max requests per file | 100,000 |
Queueing with exponential backoff (Python):
import time
max_retries = 10
retry_count = 0
batch_job = None
while retry_count < max_retries:
    try:
        batch_job = client.batches.create(
            input_file_id=file_id,
            endpoint="/chat/completions",
            completion_window="24h",
        )
        break  # Success
    except Exception as e:
        if "token limit exceeded" in str(e):
            retry_count += 1
            wait_time = 2 ** retry_count  # 2, 4, 8, ... seconds
            time.sleep(wait_time)
        else:
            raise
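The loop above doubles the wait on every attempt without an upper bound or randomization. When many workers submit batch jobs at once, a capped delay with jitter is a common refinement. The sketch below shows one such variant ("full jitter"); the function name `backoff_delay`, the 1-second base, and the 60-second cap are illustrative assumptions, not values taken from the Batch API documentation.

```python
import random


def backoff_delay(attempt: int, base: float = 1.0, cap: float = 60.0,
                  rng=random.random) -> float:
    """Return a randomized delay in seconds for a 1-based retry attempt."""
    # The exponential ceiling doubles per attempt and is bounded by `cap`.
    ceiling = min(cap, base * (2 ** attempt))
    # Full jitter: pick uniformly in [0, ceiling) so clients do not retry in lockstep.
    return rng() * ceiling
```

In the loop above, `time.sleep(backoff_delay(retry_count))` could replace the fixed `2 ** retry_count` wait.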
Fail-fast regions (for batching): Australia East, East US, Germany West Central, Italy North, North Central US, Poland Central, Sweden Central, Switzerland North, East US 2, West US.
6. Connection Pooling and Timeouts
HTTP connection pooling (Python):
import requests
# Keep-alive enabled by default
session = requests.Session()
response = session.get("https://api.example.com")
Azure OpenAI timeout configuration (Python):
from openai import AzureOpenAI
client = AzureOpenAI(
azure_endpoint="...",
api_key="...",
timeout=300.0 # 5 minutes (default: 600s/10 min)
)
Connection pooling for database SDKs:
| SDK | Module |
|---|---|
| MySQL | mysql.connector.pooling |
| PostgreSQL | psycopg2.pool |
| SQLAlchemy | sqlalchemy.pool |
| Pyodbc | Built-in pooling |
Best practice:
- ✅ Use connection pools for database/HTTP clients
- ✅ Set realistic timeouts (not 10 minutes for user-facing apps)
- ✅ Implement keepalives for long-running connections
- ❌ Do NOT open a new connection for every request
7. Idempotency
Definition: An operation is idempotent if it can be called multiple times without producing additional side effects after the first call.
HTTP idempotency:
| HTTP Method | Idempotent? | Description |
|---|---|---|
| GET | ✅ Yes | Read-only, no side effects |
| PUT | ✅ Yes | Replaces resource at URI |
| DELETE | ✅ Yes | Deletes resource (same outcome) |
| POST | ❌ No | Creates a new resource on every call |
| PATCH | ❌ No | Partial update (depends on the operation) |
Idempotency techniques for Azure AI Services:
# 1. Check if already processed (database lookup)
def process_document(doc_id):
    if already_processed(doc_id):
        return cached_result(doc_id)
    result = client.analyze_document(...)
    save_result(doc_id, result)
    return result
# 2. Event-carried state transfer (Event Hubs)
event = {
    "doc_id": "12345",
    "operation": "set_status",
    "status": "completed",  # Not "increment_count", so replays are idempotent
    "timestamp": "2026-02-03T10:00:00Z",
}
# 3. Deduplication window (Service Bus)
# Enable duplicate detection with MessageId
message.message_id = f"{order_id}-{timestamp}"
Duplicate detection (Azure Service Bus):
- Default deduplication window: 10 minutes
- Min: 20 seconds, Max: 7 days
- Based on MessageId (or MessageId + PartitionKey if partitioned)
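To make the deduplication window concrete, the sketch below mimics the idea in memory: an ID seen within the window is flagged as a duplicate, and IDs older than the window are forgotten. It is an illustration only and does not reproduce how Service Bus implements the feature; the class name `DedupWindow` and the injectable `clock` are assumptions.

```python
import time


class DedupWindow:
    """Remember message IDs for `window_seconds` and flag repeats."""

    def __init__(self, window_seconds: float = 600.0, clock=time.monotonic):
        self.window = window_seconds   # 600 s mirrors the 10-minute default above
        self.clock = clock
        self.seen = {}                 # message_id -> time first seen

    def is_duplicate(self, message_id: str) -> bool:
        now = self.clock()
        # Drop entries that have aged out of the window.
        self.seen = {m: t for m, t in self.seen.items() if now - t < self.window}
        if message_id in self.seen:
            return True
        self.seen[message_id] = now
        return False
```

The sketch also shows why the ID must be stable across retries: an ID that embeds the send time changes on every resend and is never recognized as a duplicate.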
Architecture Patterns
Pattern 1: Rate Limiting with Durable Messaging
Problem: Bulk ingestion into a throttled service (Azure Cosmos DB, Azure AI Search) results in retry storms and a high failure rate.
Solution: Use Azure Event Hubs/Service Bus as a buffer, plus a job processor with rate limiting.
User API → Event Hubs → Job Processor (rate-limited) → Azure AI Service
(buffer) (100 req/s controlled)
Implementation:
- API enqueues messages (millions per second capacity)
- Job processor leases partitions from blob storage (15s lease)
- Each partition = 100 PTUs (requests/s)
- Process dequeues only what it can handle in 1s
- Monitor utilization via Azure Monitor (Provisioned-Managed Utilization V2)
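The "dequeue only what it can handle in 1s" step can be sketched with an in-memory queue standing in for Event Hubs or Service Bus. Everything here is hypothetical scaffolding (the function name `drain_at_rate`, the `handle` callback, the injectable `sleep`); a real processor would also handle partition leases, checkpoints, and the time spent inside `handle`.

```python
import time
from collections import deque
from typing import Callable, Deque


def drain_at_rate(queue: Deque[str], handle: Callable[[str], None],
                  per_second: int, sleep=time.sleep) -> int:
    """Process every queued item, taking at most `per_second` items per tick."""
    processed = 0
    while queue:
        # Take only as much work as one tick's budget allows.
        for _ in range(min(per_second, len(queue))):
            handle(queue.popleft())
            processed += 1
        if queue:
            sleep(1.0)  # wait for the next one-second tick before taking more
    return processed
```

Because unprocessed items stay in the queue between ticks, a spike in enqueued messages lengthens the backlog rather than raising the request rate against the downstream service.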
Benefits:
- ✅ Reduces 429 errors from 80% to <5%
- ✅ Predictable throughput
- ✅ No data loss on crash (durable queue)
- ✅ Scales horizontally (multiple job processors)
Pattern 2: Circuit Breaker (for transient faults)
Problem: Repeated calls to an unavailable service make the problem worse (thundering herd).
Solution: Circuit Breaker pattern.
States:
| State | Behavior |
|---|---|
| Closed | Normal operation — forwards requests |
| Open | Service unavailable — fails fast (no requests) |
| Half-open | Test if service recovered — 1 request |
Implementation (Python):
import time
class CircuitBreaker:
    def __init__(self, failure_threshold=5, recovery_timeout=60):
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self.failure_count = 0
        self.state = 'closed'
        self.last_failure_time = None
    def call(self, func, *args, **kwargs):
        if self.state == 'open':
            # After the recovery timeout, let one trial request through
            if time.time() - self.last_failure_time > self.recovery_timeout:
                self.state = 'half-open'
            else:
                raise Exception("Circuit breaker open")
        try:
            result = func(*args, **kwargs)
            if self.state == 'half-open':
                # Trial request succeeded: close the circuit again
                self.state = 'closed'
                self.failure_count = 0
            return result
        except Exception:
            self.failure_count += 1
            self.last_failure_time = time.time()
            if self.failure_count >= self.failure_threshold:
                self.state = 'open'
            raise
Pattern 3: Idempotent Consumer (Event Hubs + Functions)
Problem: Event Hubs guarantees at-least-once delivery, so events can be processed more than once.
Solution: Idempotent function design.
Techniques:
- Duplicate detection via database:
def process_event(event):
    if db.exists(event.id):
        return  # Already processed
    result = ai_client.analyze(event.data)
    db.save(event.id, result)
- Event-carried state transfer:
{
  "account_id": "12345",
  "operation": "set_balance",
  "new_balance": 1000  // Not "withdraw 100", so replays are idempotent
}
- PeekLock receive mode (Service Bus):
  - Consumer gets an exclusive lock (configurable duration)
  - Consumer sends an acknowledgment on success
  - Message returns to the queue on timeout/failure
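The difference between an absolute-state event ("set_balance") and a delta event ("withdraw 100") shows up as soon as an event is delivered twice. The sketch below replays both kinds; the function name `apply_event` and the `amount` field are made up for illustration.

```python
def apply_event(balances: dict, event: dict) -> None:
    """Apply an account event to `balances` in place."""
    account = event["account_id"]
    if event["operation"] == "set_balance":
        # Absolute state: applying the same event again changes nothing.
        balances[account] = event["new_balance"]
    elif event["operation"] == "withdraw":
        # Delta: every replay subtracts again, so duplicates corrupt the state.
        balances[account] = balances.get(account, 0) - event["amount"]
    else:
        raise ValueError(f"unknown operation: {event['operation']}")


balances = {"12345": 1200}
set_event = {"account_id": "12345", "operation": "set_balance", "new_balance": 1000}
apply_event(balances, set_event)
apply_event(balances, set_event)          # duplicate delivery
after_set = balances["12345"]             # still 1000

withdraw = {"account_id": "12345", "operation": "withdraw", "amount": 100}
apply_event(balances, withdraw)
apply_event(balances, withdraw)           # duplicate delivery
after_withdraw = balances["12345"]        # 800: the duplicate was applied twice
```

With at-least-once delivery, delta-style events need one of the other techniques (duplicate detection or a processed-ID table) to stay correct.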
Pattern 4: Fallback Strategy (429 Handling)
Multi-tier fallback:
from openai import AzureOpenAI, RateLimitError
# `provisioned_client` and `standard_client` are AzureOpenAI clients created elsewhere
def generate_completion(prompt):
    try:
        # 1. Try provisioned deployment (lowest latency)
        return provisioned_client.chat.completions.create(...)
    except RateLimitError:
        # 2. On 429, fall back to the standard deployment; other errors propagate
        return standard_client.chat.completions.create(...)
# Alternative: Retry with backoff
client = AzureOpenAI(
    max_retries=5,
    timeout=300.0,
)
response = client.with_options(max_retries=5).chat.completions.create(...)
Decision Guidance
When to use the Batch API vs. the real-time API?
| Criterion | Batch API | Real-time API |
|---|---|---|
| Latency requirement | Results within 24 hours are acceptable | <1 second required |
| Volume | >10,000 requests | <1,000 requests |
| Cost sensitivity | High (50% saving) | Moderate |
| Use case | Offline analytics, bulk processing | User-facing chat, real-time translation |
Retry Strategy Decision Tree
429 Error?
├─ Yes → Check retry-after header → Wait and retry (max 5x)
│ └─ If still 429 → Fallback to another deployment
│
└─ 500-504? → Exponential backoff (2^n seconds, max 60s)
├─ Transient → Retry up to 5 times
└─ Persistent → Log error + alert ops team
401/403? → Do NOT retry → Fix authentication/RBAC
400/422? → Do NOT retry → Fix input validation
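The decision tree can be written down as a small pure function, which makes the policy easy to unit test. This is one possible encoding, under the assumption that attempts are counted from 1 and that the action names (`"wait_retry_after"`, `"fallback"`, `"backoff"`, `"alert"`, `"fail"`) are labels invented here for the caller to act on.

```python
def next_action(status: int, attempt: int, max_attempts: int = 5) -> str:
    """Map an HTTP status and a 1-based attempt number to the next step."""
    if status in (400, 401, 403, 404, 422):
        return "fail"  # permanent: fix input, credentials, or RBAC instead of retrying
    if status == 429:
        # Honor retry-after while attempts remain, then switch deployment.
        return "wait_retry_after" if attempt <= max_attempts else "fallback"
    if status in (408, 500, 502, 503, 504):
        # Transient server-side errors: back off, then escalate if they persist.
        return "backoff" if attempt <= max_attempts else "alert"
    return "fail"  # unknown status: do not retry blindly
```

Keeping the policy separate from the HTTP client means the same function can drive SDK callbacks, queue workers, and tests.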
Rate Limiting Strategy
| Scenario | Recommended solution |
|---|---|
| Single client, moderate load | SDK built-in retry logic (for example max_retries=5) |
| Multiple uncoordinated clients | Distributed lease system (blob storage) + partitions |
| Bulk ingestion | Event Hubs + job processor with rate limiter |
| User-facing app | Fallback to standard deployment on 429 |
Integration with the Microsoft Stack
Azure AI Foundry Integration
SDKs that support Azure AI Foundry:
- Python: azure-ai-inference, openai (Azure variant)
- .NET: Azure.AI.Inference, Azure.AI.OpenAI
- JavaScript/TypeScript: @azure/openai, @azure/ai-inference
- Go: github.com/openai/openai-go (with Azure endpoint)
Authentication patterns:
# 1. DefaultAzureCredential (recommended for prod)
from azure.identity import DefaultAzureCredential
from azure.ai.inference import ChatCompletionsClient
credential = DefaultAzureCredential()
client = ChatCompletionsClient(
    endpoint="https://<resource>.openai.azure.com",
    credential=credential,
)
# 2. Managed Identity (Azure-hosted apps)
from azure.identity import ManagedIdentityCredential
credential = ManagedIdentityCredential()
# 3. API Key (development only)
import os
from azure.core.credentials import AzureKeyCredential
credential = AzureKeyCredential(os.getenv("AZURE_OPENAI_API_KEY"))
Azure Monitor Integration
Metrics to monitor:
| Metric | Threshold | Alert |
|---|---|---|
| Provisioned-Managed Utilization V2 | >95% | Scale up PTUs |
| Dependency failures | >10% | Check retry logic |
| Request duration | >10s | Optimize prompts/batching |
| 429 error rate | >5% | Increase quota or add fallback |
Kusto query (Log Analytics):
AzureDiagnostics
| where ResourceType == "COGNITIVE-SERVICES"
| where Category == "RequestResponse"
| where resultCode_d == 429
| summarize count() by bin(TimeGenerated, 5m), clientIp_s
| order by count_ desc
Power Automate / Logic Apps Integration
Error handling in flows:
- Configure retry policy:
  - Retry count: 4
  - Retry interval: Exponential (PT10S, PT20S, PT40S, PT80S)
  - Retry on: 408, 429, 500, 502, 503, 504
- Handle 429 with condition:
{
  "condition": "@equals(actions('Call_Azure_AI').statusCode, 429)",
  "ifTrue": {
    "Wait": "@actions('Call_Azure_AI').outputs.headers['retry-after']"
  }
}
Public Sector (Norway)
Compliance and Error Handling
GDPR/Personopplysningsloven (the Norwegian Personal Data Act):
- ✅ NEVER log personally identifiable information in error logs
- ✅ Use correlation IDs (not user IDs) in telemetry
- ✅ Respect retry-after headers (do not spam APIs)
Example (sanitized logging):
import hashlib
import logging
from azure.core.exceptions import HttpResponseError
logger = logging.getLogger(__name__)
try:
    result = client.analyze_document(doc_id)
except HttpResponseError as e:
    logger.error(
        "Document analysis failed",
        extra={
            "correlation_id": e.response.headers.get("x-ms-request-id"),
            "status_code": e.status_code,
            # SHA-256 digest, not plaintext; the built-in hash() differs between processes
            "doc_id": hashlib.sha256(str(doc_id).encode()).hexdigest(),
            "error_code": e.error.code if e.error else None,
        },
    )
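A plain hash of a short or guessable identifier can be reversed by hashing candidate values, so a keyed digest is often a safer choice for identifiers in logs. The helper below is a minimal sketch using the standard library's HMAC support; the function name `pseudonymize` is hypothetical, and it assumes the secret key is loaded from a secure store such as Key Vault rather than hardcoded.

```python
import hashlib
import hmac


def pseudonymize(value: str, secret: bytes) -> str:
    """Return a keyed, stable SHA-256 digest of `value` for use in logs."""
    # The same value and key always give the same digest, so log lines can
    # still be correlated without exposing the original identifier.
    return hmac.new(secret, value.encode("utf-8"), hashlib.sha256).hexdigest()
```

Whether a keyed digest counts as sufficient pseudonymization under GDPR is a legal question that this sketch does not settle.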
Idempotency for Public Sector Use Cases
Case management systems:
- ✅ Use MessageId = {caseID}-{operation}-{timestamp}
- ✅ Enable duplicate detection (Service Bus)
- ✅ Check the database before processing (deduplication table)
Email notification (which must be idempotent):
def send_notification(case_id, notification_type):
    message_id = f"{case_id}-{notification_type}"
    if already_sent(message_id):
        return  # Idempotent: don't resend
    send_email(...)
    mark_sent(message_id)
Cost and Licensing
Cost Implications of API Design
429 errors cost nothing (no PTU consumption), BUT:
- ❌ 400 errors (content filter) are billed (the prompt was processed)
- ❌ 408 timeouts are billed (partial processing)
- ❌ finish_reason: content_filter is billed (the completion was filtered)
Batch API savings:
| Scenario | Real-time Cost | Batch Cost | Savings |
|---|---|---|---|
| 1M tokens (GPT-4o) | ~$10 | ~$5 | 50% |
| Embeddings (1M tokens) | ~$0.13 | ~$0.065 | 50% |
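The savings column is simple arithmetic on the 50% batch discount. As a quick worked check, the sketch below recomputes the table's batch costs from its approximate real-time figures; the function name `batch_cost` is made up, and the dollar amounts are the table's own estimates, not current list prices.

```python
def batch_cost(realtime_cost: float, discount: float = 0.50) -> float:
    """Return the batch price given the real-time price and the batch discount."""
    return realtime_cost * (1 - discount)


gpt4o_batch = batch_cost(10.0)        # real-time estimate of about $10 per 1M tokens
embeddings_batch = batch_cost(0.13)   # real-time estimate of about $0.13 per 1M tokens
```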
Provisioned vs. Standard:
- Provisioned: Fixed cost (per PTU/hour), predictable latency
- Standard: Pay-per-token, no guarantees under high traffic
Reservation discounts (Provisioned):
- 1-year commitment: ~37% discount
- 3-year commitment: ~57% discount
For the Architect (Cosmo)
Design Principles for Robust API Integration
- Error Handling Hierarchy:
  Try specific exceptions first → HttpResponseError → AzureError → generic Exception
- Retry Decision Matrix:
  - Transient (retry): 408, 429, 500-504, network errors
  - Permanent (don't retry): 400, 401, 403, 404, 422
  - Custom logic: 429 with fallback
- Rate Limiting Strategy:
  - Low volume (<100 req/s): SDK default retry
  - High volume (>1000 req/s): Event Hubs + job processor
  - Provisioned deployments: Monitor utilization, implement fallback
- Batching Decision:
  - Latency >1 min acceptable? → Batch API
  - Volume >10k requests? → Batch API
  - Cost critical? → Batch API
- Idempotency Checklist:
  - Operations designed for identical input?
  - Duplicate detection enabled (if using Service Bus)?
  - Database check before processing?
  - Correlation IDs for tracing?
Common Anti-Patterns (and how to avoid them)
| Anti-Pattern | Problem | Solution |
|---|---|---|
| while(true) retry loop | Retry storm → overwhelms service | Max retries + exponential backoff |
| Fixed 1-second delays | Ignores retry-after header | Use SDK retry or respect the header |
| No connection pooling | SNAT port exhaustion | Enable connection pooling |
| Hardcoded API keys | Security risk | Use Managed Identity + Key Vault |
| No timeout configuration | Hanging requests (10 min default) | Set realistic timeouts (30-300s) |
| Logging sensitive data | GDPR violation | Hash/mask PII in logs |
Monitoring and Alerting
Critical metrics:
// Azure Monitor query for error rate trends
AzureDiagnostics
| where ResourceType == "COGNITIVE-SERVICES"
| where TimeGenerated > ago(1h)
| summarize
    total_requests = count(),
    errors = countif(resultCode_d >= 400)
    by bin(TimeGenerated, 5m)
| extend error_rate = (errors * 100.0) / total_requests
| where error_rate > 5 // Alert if >5% error rate
Alert rules:
- 429 rate >5% → Scale PTUs or enable fallback
- 500-504 errors → Check service health dashboard
- Average latency >5s → Optimize prompts or use batch processing
Architecture Decision Records (ADR) Triggers
When should you write an ADR?
- Choosing the Batch API over the real-time API for production
- Implementing custom retry logic (deviating from SDK defaults)
- Using distributed rate limiting (blob leases)
- Choosing Provisioned over Standard (cost/latency trade-off)
- Implementing a multi-region fallback strategy
Sources and Verification
Verification status: ✅ Verified via Microsoft Learn MCP (2026-02)
Primary sources (fetched):
- Handle errors produced by the Azure SDK for Python
  - URL: https://learn.microsoft.com/en-us/azure/developer/python/sdk/fundamentals/errors
  - Confidence: Verified (MCP fetch)
- Rate Limiting pattern
  - URL: https://learn.microsoft.com/en-us/azure/architecture/patterns/rate-limiting-pattern
  - Confidence: Verified (MCP fetch)
- Retry Storm antipattern
  - URL: https://learn.microsoft.com/en-us/azure/architecture/antipatterns/retry-storm
  - Confidence: Verified (MCP fetch)
- Get started using provisioned deployments on Azure OpenAI
  - URL: https://learn.microsoft.com/en-us/azure/ai-foundry/openai/how-to/provisioned-get-started
  - Confidence: Verified (MCP fetch)
- Getting started with Azure OpenAI batch deployments
  - URL: https://learn.microsoft.com/en-us/azure/ai-foundry/openai/how-to/batch
  - Confidence: Verified (MCP search)
- Azure AI services authentication and authorization using .NET
  - URL: https://learn.microsoft.com/en-us/dotnet/ai/azure-ai-services-authentication
  - Confidence: Verified (MCP search)
- Designing Azure Functions for identical input (idempotency)
  - URL: https://learn.microsoft.com/en-us/azure/azure-functions/functions-idempotent
  - Confidence: Verified (MCP search)
- Duplicate detection (Azure Service Bus)
  - URL: https://learn.microsoft.com/en-us/azure/service-bus-messaging/duplicate-detection
  - Confidence: Verified (MCP search)
Code samples (verified):
- Azure.AI.Inference (C#) error handling
- Azure SDK Python retry policies
- OpenAI Python SDK custom retry configuration
Related documentation:
- Azure Monitor metrics and logging
- Circuit Breaker pattern (Azure Architecture Center)
- Connection pooling (Azure App Service best practices)
Baseline knowledge (model):
- HTTP idempotency semantics (RFC 7231)
- Exponential backoff algorithms
- Connection pooling concepts
MCP call summary: 7 microsoft_docs_search + 4 microsoft_docs_fetch + 1 microsoft_code_sample_search = 12 total MCP calls