# Azure AI Services - API Design and Best Practices

**Last updated:** 2026-04 | Verified: MCP 2026-04
**Status:** GA
**Category:** Azure AI Services (Foundry Tools)

---

## Introduction

When you build production-ready applications on Azure AI Services (Azure OpenAI, Content Safety, Translator, Document Intelligence, Computer Vision, etc.), robust API design and error handling are critical. Distributed cloud services require applications to handle transient failures, throttling, network problems, and unexpected responses in a structured way.

This reference covers best practices for:
- **Error handling**: structured error handling with the Azure SDK exception hierarchy
- **Retry logic**: exponential backoff, rate limiting, and retry storms
- **Rate limiting**: throttling handling and quota management
- **Batching**: efficient use of the Batch API for high-volume operations
- **Connection management**: connection pooling and timeout configuration
- **Idempotency**: designing so that identical requests can be handled safely
- **Authentication patterns**: Managed Identity vs. API keys

**Source:** Microsoft Learn (verified via MCP 2026-02)

---

## Core Components / Key Features

### 1. Azure SDK Exception Hierarchy

The Azure SDK for Python uses a hierarchical exception model (`azure.core.exceptions`) that supports both generic and specific error handling. The Azure SDK for .NET surfaces service errors through `RequestFailedException` instead, as in the C# example in section 3.

**Exception hierarchy:**

```
AzureError (base)
├── ServiceRequestError
├── ServiceResponseError
└── HttpResponseError
    ├── ClientAuthenticationError
    ├── ResourceNotFoundError
    ├── ResourceExistsError
    ├── ResourceModifiedError
    └── ResourceNotModifiedError
```

**Key exception types:**

| Exception | HTTP Status | When raised | Retry? |
|-----------|-------------|-------------|--------|
| `ClientAuthenticationError` | 401 | Authentication failure | ❌ No: fix credentials |
| `ResourceNotFoundError` | 404 | Resource doesn't exist | ❌ No (unless transient) |
| `ResourceExistsError` | 409 | Resource already exists | ❌ No: handle duplicate |
| `HttpResponseError` (429) | 429 | Rate limit exceeded | ✅ Yes: with backoff |
| `HttpResponseError` (500-504) | 500-504 | Server/gateway error | ✅ Yes: transient |
| `ServiceRequestError` | N/A | Network/DNS failure | ✅ Yes: network transient |

### 2. HTTP Error Codes (Azure OpenAI)

| Status Code | Error Type | Retry Strategy |
|-------------|-----------|----------------|
| 400 | Bad Request | ❌ Fix input — don't retry |
| 401 | Authentication Error | ❌ Fix credentials |
| 403 | Permission Denied | ❌ Fix RBAC assignments |
| 404 | Not Found | ❌ Verify resource exists |
| 408 | Request Timeout | ✅ Retry with backoff |
| 422 | Unprocessable Entity | ❌ Fix input validation |
| 429 | Rate Limit Error | ✅ Retry with `retry-after` header |
| 500 | Internal Server Error | ✅ Retry with backoff |
| 502 | Bad Gateway | ✅ Retry with backoff |
| 503 | Service Unavailable | ✅ Retry with backoff |
| 504 | Gateway Timeout | ✅ Retry with backoff |

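**Azure OpenAI SDKs** (Python, .NET, Go) automatically retry 408, 429, 500, 502, 503, and 504 with exponential backoff. The default number of retries depends on the SDK (the `openai` Python package defaults to 2, configurable via `max_retries`).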
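
The status-code table above can be condensed into a small lookup. This is a minimal sketch for illustration only; `RETRYABLE_STATUS_CODES` and `is_retryable` are hypothetical names and not part of any Azure or OpenAI SDK.

```python
# Hypothetical helper: status codes the table above marks as retryable.
RETRYABLE_STATUS_CODES = frozenset({408, 429, 500, 502, 503, 504})


def is_retryable(status_code: int) -> bool:
    """Return True when the table above lists the status code as transient."""
    return status_code in RETRYABLE_STATUS_CODES


for code in (400, 404, 429, 503):
    print(code, "retry" if is_retryable(code) else "do not retry")
```

Keeping the set in one place makes it easier to hold custom retry logic in line with what the SDK already retries.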

### 3. Retry Logic Patterns

**Exponential backoff (recommended):**

```python
from azure.core.pipeline.policies import RetryPolicy
from azure.identity import DefaultAzureCredential
from azure.storage.blob import BlobServiceClient

retry_policy = RetryPolicy(
    retry_total=5,                  # Max retry attempts
    retry_backoff_factor=2,         # 2^n seconds
    retry_backoff_max=60,           # Max backoff: 60s
    retry_on_status_codes=[408, 429, 500, 502, 503, 504]
)

credential = DefaultAzureCredential()
client = BlobServiceClient(
    account_url="https://...",
    credential=credential,
    retry_policy=retry_policy
)
```

**Azure OpenAI custom retry (Python):**

```python
import os

from openai import AzureOpenAI

client = AzureOpenAI(
    azure_endpoint=os.getenv("AZURE_OPENAI_ENDPOINT"),
    api_key=os.getenv("AZURE_OPENAI_API_KEY"),
    api_version="2024-10-21",
    max_retries=5  # Default: 2
)
```

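**C# error handling with manual backoff:**

```csharp
using System;
using System.Threading;
using Azure;
using Azure.AI.Inference;

// Assumes `client` (ChatCompletionsClient) and `requestOptions` already exist.
const int maxRetries = 5;
for (int retryCount = 0; ; retryCount++) {
    try {
        var response = client.Complete(requestOptions);
        break; // Success
    } catch (RequestFailedException ex) when (ex.ErrorCode == "content_filter") {
        Console.WriteLine($"Content filter triggered: {ex.Message}");
        break; // Not retryable
    } catch (RequestFailedException ex) when (ex.Status == 429 && retryCount < maxRetries) {
        // Exponential backoff before the next attempt: 1s, 2s, 4s, ...
        Thread.Sleep(TimeSpan.FromSeconds(Math.Pow(2, retryCount)));
    }
    // Any other RequestFailedException propagates to the caller.
}
```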

### 4. Rate Limiting and 429 Responses

**Azure OpenAI Provisioned Throughput:**

- A **429 response** means the provisioned PTUs are fully utilized
- The service returns `retry-after` and `retry-after-ms` headers
- **Default SDK behavior:** honors `retry-after` and retries automatically

**Handling 429:**

| Strategy | When to use | Latency impact |
|----------|-------------|----------------|
| **Client-side retry** | Higher latency is acceptable | ⬆️ Higher (waits for `retry-after`) |
| **Fallback to another deployment** | Low-latency requirements | ⬇️ Lower (immediate failover) |
| **Fallback to global-standard** | Cost/availability balance | ➡️ Moderate (somewhat higher cost) |

**Rate limiting pattern (for bulk operations):**

```python
# Bad practice: Naive retry storm
for record in records:
    try:
        client.process(record)
    except RateLimitError:
        time.sleep(1)  # Fixed delay — overwhelms service

# Good practice: Rate limiter + durable queue
# 1. Enqueue to Azure Event Hubs/Service Bus
# 2. Job processor dequeues at controlled rate
# 3. Tracks PTU utilization via Azure Monitor
```
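
When custom retry logic is unavoidable, the wait time can be derived from the server's hint instead of a fixed delay. The sketch below is one possible approach, not SDK behavior: `compute_delay` is a hypothetical helper, it assumes `retry-after` carries a number of seconds (not an HTTP date), and it falls back to capped exponential backoff with full jitter when no header is present.

```python
import random


def compute_delay(attempt, headers=None, base=1.0, cap=60.0):
    """Return seconds to wait before retry number `attempt` (0-based).

    Prefers the server hint (`retry-after-ms`, then `retry-after`);
    otherwise uses capped exponential backoff with full jitter.
    """
    headers = {k.lower(): v for k, v in (headers or {}).items()}
    if "retry-after-ms" in headers:
        return float(headers["retry-after-ms"]) / 1000.0
    if "retry-after" in headers:
        # Assumption: the value is in seconds, not an HTTP date
        return float(headers["retry-after"])
    # Full jitter: pick uniformly in [0, min(cap, base * 2^attempt)]
    return random.uniform(0.0, min(cap, base * (2 ** attempt)))


print(compute_delay(0, {"retry-after-ms": "1500"}))  # 1.5
print(compute_delay(3, {"Retry-After": "7"}))        # 7.0
```

The jitter spreads retries from many clients over time, which is what keeps a burst of 429s from turning into a retry storm.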

### 5. Batching (Azure OpenAI Batch API)

**Batch API:** Asynchronous batch operations at 50% lower cost than the real-time API.

**Use cases:**
- Large-scale data processing (embeddings, summarization)
- Content generation (product descriptions, translations)
- Document review (legal, compliance)
- NLP tasks (sentiment analysis, classification)

**Batch limits:**

| Parameter | Limit |
|-----------|-------|
| Max batch files (no expiration) | 500 |
| Max batch files (with expiration) | 10,000 |
| Max input file size | 200 MB (BYOS: 1 GB) |
| Max requests per file | 100,000 |
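
A large workload has to be split so that each input file stays within these limits. The following is a minimal sketch under stated assumptions: `split_into_batch_files` is a hypothetical helper, each request is a JSON-serializable dict written as one JSONL line, and "200 MB" is interpreted as 200 × 1024 × 1024 bytes.

```python
import json

MAX_REQUESTS_PER_FILE = 100_000
MAX_FILE_BYTES = 200 * 1024 * 1024  # Assumed interpretation of "200 MB"


def split_into_batch_files(requests, max_requests=MAX_REQUESTS_PER_FILE,
                           max_bytes=MAX_FILE_BYTES):
    """Group request dicts into JSONL strings that respect both limits."""
    chunks, current, current_bytes = [], [], 0
    for request in requests:
        line = json.dumps(request) + "\n"
        size = len(line.encode("utf-8"))
        # Start a new file when adding this line would break either limit
        if current and (len(current) >= max_requests
                        or current_bytes + size > max_bytes):
            chunks.append("".join(current))
            current, current_bytes = [], 0
        current.append(line)
        current_bytes += size
    if current:
        chunks.append("".join(current))
    return chunks


files = split_into_batch_files(
    [{"custom_id": str(i)} for i in range(5)], max_requests=2
)
print(len(files))  # 3
```

A single request larger than `max_bytes` still ends up in its own chunk here; a production version would reject it before upload.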

**Queueing with exponential backoff (Python):**

```python
import time

max_retries = 10
retry_count = 0
batch_job = None

while retry_count < max_retries:
    try:
        batch_job = client.batches.create(
            input_file_id=file_id,
            endpoint="/chat/completions",
            completion_window="24h"
        )
        break  # Success
    except Exception as e:
        if "token limit exceeded" in str(e):
            retry_count += 1
            wait_time = 2 ** retry_count
            time.sleep(wait_time)
        else:
            raise
```

**Fail-fast regions (for batching):** Australia East, East US, Germany West Central, Italy North, North Central US, Poland Central, Sweden Central, Switzerland North, East US 2, West US.

### 6. Connection Pooling and Timeouts

**HTTP connection pooling (Python):**

```python
import requests

# Keep-alive enabled by default
session = requests.Session()
response = session.get("https://api.example.com")
```

**Azure OpenAI timeout configuration (Python):**

```python
from openai import AzureOpenAI

client = AzureOpenAI(
    azure_endpoint="...",
    api_key="...",
    api_version="2024-10-21",  # Required unless OPENAI_API_VERSION is set
    timeout=300.0  # 5 minutes (default: 600s/10 min)
)
```

**Connection pooling for database SDKs:**

| SDK | Module |
|-----|--------|
| MySQL | `mysql.connector.pooling` |
| PostgreSQL | `psycopg2.pool` |
| SQLAlchemy | `sqlalchemy.pool` |
| Pyodbc | Built-in pooling |

**Best practice:**
- ✅ Use connection pools for database/HTTP clients
- ✅ Set realistic timeouts (not 10 minutes for user-facing apps)
- ✅ Implement keepalives for long-running connections
- ❌ Do NOT open a new connection for every request

### 7. Idempotency

**Definition:** An operation is idempotent if it can be called multiple times without producing additional side effects after the first call.

**HTTP idempotency:**

| HTTP Method | Idempotent? | Description |
|-------------|-------------|-------------|
| `GET` | ✅ Yes | Read-only, no side effects |
| `PUT` | ✅ Yes | Replaces resource at URI |
| `DELETE` | ✅ Yes | Deletes resource (same outcome) |
| `POST` | ❌ No | Creates a new resource each time |
| `PATCH` | ❌ No | Partial update (not guaranteed; depends on the operation) |

**Idempotency techniques for Azure AI Services:**

```python
# 1. Check if already processed (database lookup)
def process_document(doc_id):
    if already_processed(doc_id):
        return cached_result(doc_id)

    result = client.analyze_document(...)
    save_result(doc_id, result)
    return result

# 2. Event-carried state transfer (Event Hubs)
event = {
    "doc_id": "12345",
    "operation": "set_status",
    "status": "completed",  # Not "increment_count" — idempotent
    "timestamp": "2026-02-03T10:00:00Z"
}

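# 3. Deduplication window (Service Bus)
# Enable duplicate detection and set a deterministic MessageId.
# `timestamp` must come from the business event, not the send time,
# otherwise a resend gets a new ID and bypasses deduplication.
message.message_id = f"{order_id}-{timestamp}"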
```

**Duplicate detection (Azure Service Bus):**
- Default deduplication window: 10 minutes
- Min: 20 seconds, Max: 7 days
- Based on `MessageId` (or `MessageId + PartitionKey` if partitioned)
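
The effect of a deduplication window can be illustrated without a broker. The class below is a simplified in-memory stand-in, not the Service Bus implementation: `DedupWindow` and its `clock` parameter are hypothetical, and the default of 600 seconds mirrors the 10-minute default mentioned above.

```python
import time


class DedupWindow:
    """In-memory stand-in for a time-bounded duplicate-detection window."""

    def __init__(self, window_seconds=600.0, clock=time.monotonic):
        self.window_seconds = window_seconds
        self._clock = clock
        self._seen = {}  # message_id -> time the ID was first accepted

    def accept(self, message_id):
        """Return True if the ID is new within the window, else False."""
        now = self._clock()
        # Forget IDs whose window has expired
        self._seen = {m: t for m, t in self._seen.items()
                      if now - t < self.window_seconds}
        if message_id in self._seen:
            return False
        self._seen[message_id] = now
        return True


fake_now = [0.0]  # Controllable clock for the demonstration
window = DedupWindow(window_seconds=600.0, clock=lambda: fake_now[0])
print(window.accept("order-1"))  # True
print(window.accept("order-1"))  # False
fake_now[0] = 601.0
print(window.accept("order-1"))  # True
```

The last call shows the limitation of any window-based scheme: a duplicate that arrives after the window has expired is accepted again, so long-delayed retries still need a durable check.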

---

## Architecture Patterns

### Pattern 1: Rate Limiting with Durable Messaging

**Problem:** Bulk ingestion into a throttled service (Azure Cosmos DB, Azure AI Search) results in retry storms and a high failure rate.

**Solution:** Use Azure Event Hubs/Service Bus as a buffer, plus a job processor with rate limiting.

```
User API → Event Hubs → Job Processor (rate-limited) → Azure AI Service
             (buffer)      (100 req/s controlled)
```

**Implementation:**

1. **API enqueues messages** (millions per second capacity)
2. **Job processor** leases partitions from blob storage (15s lease)
   - Each partition represents a fixed slice of capacity (here, 100 requests/s)
   - The processor dequeues only what it can handle in 1s
3. **Monitor utilization** via Azure Monitor (`Provisioned-Managed Utilization V2`)

**Benefits:**
- ✅ Reduces 429 errors (illustratively, from 80% to <5%)
- ✅ Predictable throughput
- ✅ No data loss on crash (durable queue)
- ✅ Scales horizontally (multiple job processors)
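
The core of the job processor is that it takes only a bounded amount of work per time slice. The sketch below shows that idea with an in-process queue; `RateLimitedProcessor` is a hypothetical class, and a real deployment would read from Event Hubs or Service Bus and coordinate partitions through leases instead.

```python
from collections import deque


class RateLimitedProcessor:
    """Dequeues at most `capacity_per_tick` items per tick (e.g. per second)."""

    def __init__(self, handler, capacity_per_tick=100):
        self.handler = handler
        self.capacity_per_tick = capacity_per_tick
        self.queue = deque()

    def enqueue(self, item):
        self.queue.append(item)

    def tick(self):
        """Process one tick's worth of work; return how many items ran."""
        handled = 0
        while self.queue and handled < self.capacity_per_tick:
            self.handler(self.queue.popleft())
            handled += 1
        return handled


processed = []
processor = RateLimitedProcessor(processed.append, capacity_per_tick=3)
for i in range(7):
    processor.enqueue(i)
print(processor.tick(), processor.tick(), processor.tick())  # 3 3 1
```

Work that exceeds the per-tick capacity simply waits in the queue, so the downstream service sees a flat request rate instead of a burst.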

### Pattern 2: Circuit Breaker (for transient faults)

**Problem:** Repeated calls to an unavailable service make the problem worse (thundering herd).

**Solution:** Circuit Breaker pattern.

**States:**

| State | Behavior |
|-------|-----------|
| **Closed** | Normal operation: forwards requests |
| **Open** | Service unavailable: fails fast (no requests) |
| **Half-open** | Tests whether the service has recovered: 1 request |

**Implementation (Python):**

```python
import time


class CircuitBreaker:
    def __init__(self, failure_threshold=5, recovery_timeout=60):
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self.failure_count = 0
        self.state = 'closed'
        self.last_failure_time = None

    def call(self, func, *args, **kwargs):
        if self.state == 'open':
            if time.time() - self.last_failure_time > self.recovery_timeout:
                self.state = 'half-open'  # Allow one trial request
            else:
                raise Exception("Circuit breaker open")

        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failure_count += 1
            self.last_failure_time = time.time()
            # A failed trial request, or too many failures, opens the circuit
            if self.state == 'half-open' or self.failure_count >= self.failure_threshold:
                self.state = 'open'
            raise

        # Success: close the circuit and reset the consecutive-failure count
        self.state = 'closed'
        self.failure_count = 0
        return result
```

### Pattern 3: Idempotent Consumer (Event Hubs + Functions)

**Problem:** Event Hubs guarantees at-least-once delivery, so events can be processed more than once.

**Solution:** Idempotent function design.

**Techniques:**

1. **Duplicate detection via database:**
   ```python
   def process_event(event):
       if db.exists(event.id):
           return  # Already processed

       result = ai_client.analyze(event.data)
       db.save(event.id, result)
   ```

2. **Event-carried state transfer** (the event sets an absolute value such as `set_balance`, not a relative change such as "withdraw 100", so replaying it is idempotent):
   ```json
   {
     "account_id": "12345",
     "operation": "set_balance",
     "new_balance": 1000
   }
   ```

3. **PeekLock receive mode (Service Bus):**
   - The consumer gets an exclusive lock (configurable duration)
   - The consumer sends an acknowledgment on success
   - The message returns to the queue on timeout or failure

### Pattern 4: Fallback Strategy (429 Handling)

**Multi-tier fallback:**

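```python
from openai import AzureOpenAI, RateLimitError

# Assumes provisioned_client and standard_client are AzureOpenAI clients
# for the provisioned and standard deployments (set max_retries=0 on
# provisioned_client if failover should be immediate).
def generate_completion(prompt):
    try:
        # 1. Try provisioned deployment (lowest latency)
        return provisioned_client.chat.completions.create(...)
    except RateLimitError:
        # 2. HTTP 429: fall back to standard deployment
        return standard_client.chat.completions.create(...)

# Alternative: Retry with backoff
# (endpoint, key, and API version are read from environment variables here)
client = AzureOpenAI(
    max_retries=5,
    timeout=300.0
)
response = client.with_options(max_retries=5).chat.completions.create(...)
```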

---

## Decision Guide

### When to Use the Batch API vs. the Real-time API

| Criterion | Batch API | Real-time API |
|-----------|-----------|---------------|
| **Latency requirement** | Results within 24 hours are acceptable | <1 second required |
| **Volume** | >10,000 requests | <1,000 requests |
| **Cost sensitivity** | High (50% saving) | Moderate |
| **Use case** | Offline analytics, bulk processing | User-facing chat, real-time translation |

### Retry Strategy Decision Tree

```
429 Error?
├─ Yes → Check retry-after header → Wait and retry (max 5x)
│       └─ If still 429 → Fall back to another deployment
│
└─ 500-504? → Exponential backoff (2^n seconds, max 60s)
    ├─ Transient → Retry up to 5 times
    └─ Persistent → Log error + alert ops team

401/403? → Do NOT retry → Fix authentication/RBAC
400/422? → Do NOT retry → Fix input validation
```
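
The 429 branch of the tree (wait, retry a bounded number of times, then fall back) can be sketched as a small wrapper. Everything here is illustrative: `Throttled` is a hypothetical stand-in for an SDK's rate-limit exception, and `primary` and `fallback` are placeholder callables for calls to two different deployments.

```python
import time


class Throttled(Exception):
    """Hypothetical stand-in for an SDK's HTTP 429 exception."""

    def __init__(self, retry_after=1.0):
        super().__init__("429 Too Many Requests")
        self.retry_after = retry_after


def call_with_fallback(primary, fallback, max_retries=5, sleep=time.sleep):
    """Call `primary`; on throttling wait and retry, then use `fallback`."""
    for attempt in range(max_retries + 1):
        try:
            return primary()
        except Throttled as exc:
            if attempt < max_retries:
                sleep(exc.retry_after)  # Wait as long as the service asked
    return fallback()  # Still throttled after all retries


def always_throttled():
    raise Throttled(retry_after=0.01)


print(call_with_fallback(always_throttled, lambda: "fallback", max_retries=2))
```

Passing `sleep` as a parameter keeps the wrapper testable without real delays. Non-throttling errors are deliberately not caught, so 4xx input errors surface immediately instead of being retried.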

### Rate Limiting Strategy

| Scenario | Recommended solution |
|----------|------------------|
| **Single client, moderate load** | SDK built-in retry logic (for example, `max_retries=5`) |
| **Multiple uncoordinated clients** | Distributed lease system (blob storage) + partitions |
| **Bulk ingestion** | Event Hubs + job processor with rate limiter |
| **User-facing app** | Fall back to standard deployment on 429 |

---

## Integration with the Microsoft Stack

### Azure AI Foundry Integration

**SDKs that support Azure AI Foundry:**

- **Python:** `azure-ai-inference`, `openai` (Azure variant)
- **.NET:** `Azure.AI.Inference`, `Azure.AI.OpenAI`
- **JavaScript/TypeScript:** `@azure/openai`, `@azure/ai-inference`
- **Go:** `github.com/openai/openai-go` (with an Azure endpoint)

**Authentication patterns:**

```python
import os

# 1. DefaultAzureCredential (recommended for prod)
from azure.identity import DefaultAzureCredential
from azure.ai.inference import ChatCompletionsClient

credential = DefaultAzureCredential()
client = ChatCompletionsClient(
    endpoint="https://<resource>.openai.azure.com",
    credential=credential,
    # Azure OpenAI endpoints typically also need the Cognitive Services scope
    credential_scopes=["https://cognitiveservices.azure.com/.default"]
)

# 2. Managed Identity (Azure-hosted apps)
from azure.identity import ManagedIdentityCredential

credential = ManagedIdentityCredential()

# 3. API Key (development only)
from azure.core.credentials import AzureKeyCredential

credential = AzureKeyCredential(os.getenv("AZURE_OPENAI_API_KEY"))
```

### Azure Monitor Integration

**Metrics to monitor:**

| Metric | Threshold | Alert |
|--------|-----------|-------|
| `Provisioned-Managed Utilization V2` | >95% | Scale up PTUs |
| `Dependency failures` | >10% | Check retry logic |
| `Request duration` | >10s | Optimize prompts/batching |
| `429 error rate` | >5% | Increase quota or add fallback |

**Kusto query (Log Analytics):**

```kusto
AzureDiagnostics
| where ResourceType == "COGNITIVE-SERVICES"
| where Category == "RequestResponse"
| where resultCode_d == 429
| summarize count() by bin(TimeGenerated, 5m), clientIp_s
| order by count_ desc
```

### Power Automate / Logic Apps Integration

**Error handling in flows:**

1. **Configure retry policy:**
   - Retry count: 4
   - Retry interval: Exponential (PT10S, PT20S, PT40S, PT80S)
   - Retry on: 408, 429, 500, 502, 503, 504

2. **Handle 429 with condition:**
   ```json
   {
     "condition": "@equals(actions('Call_Azure_AI').statusCode, 429)",
     "ifTrue": {
       "Wait": "@actions('Call_Azure_AI').outputs.headers['retry-after']"
     }
   }
   ```

---

## Public Sector (Norway)

### Compliance and Error Handling

**GDPR / Personopplysningsloven (Norwegian Personal Data Act):**
- ✅ NEVER log personally identifiable information in error logs
- ✅ Use correlation IDs (not user IDs) in telemetry
- ✅ Respect `retry-after` headers (do not spam APIs)

**Example (sanitized logging):**

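```python
import hashlib
import logging

from azure.core.exceptions import HttpResponseError

logger = logging.getLogger(__name__)

try:
    result = client.analyze_document(doc_id)
except HttpResponseError as e:
    headers = e.response.headers if e.response is not None else {}
    logger.error(
        "Document analysis failed",
        extra={
            "correlation_id": headers.get("x-ms-request-id"),
            "status_code": e.status_code,
            # Stable digest, not plaintext (built-in hash() varies between processes)
            "doc_id": hashlib.sha256(str(doc_id).encode("utf-8")).hexdigest(),
            "error_code": e.error.code if e.error else None
        }
    )
```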

### Idempotency for Public Sector Use Cases

**Case management systems:**
- ✅ Use MessageId = `{saksID}-{operasjon}-{timestamp}` (case ID, operation, timestamp)
- ✅ Enable duplicate detection (Service Bus)
- ✅ Check the database before processing (deduplication table)

**Email notification (which must be idempotent):**
```python
def send_notification(case_id, notification_type):
    message_id = f"{case_id}-{notification_type}"

    if already_sent(message_id):
        return  # Idempotent — don't resend

    send_email(...)
    mark_sent(message_id)
```

---

## Cost and Licensing

### Cost Implications of API Design

**429 errors cost nothing** (no PTU consumption), BUT:
- ❌ 400 errors (content filter) **are billed** (the prompt was processed)
- ❌ 408 timeouts **are billed** (partial processing)
- ❌ `finish_reason: content_filter` **is billed** (the completion was filtered)

**Batch API savings:**

| Scenario | Real-time Cost | Batch Cost | Savings |
|----------|----------------|------------|---------|
| 1M tokens (GPT-4o) | ~$10 | ~$5 | 50% |
| Embeddings (1M tokens) | ~$0.13 | ~$0.065 | 50% |
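
The savings column follows directly from the 50% discount. As a worked check of the table's arithmetic (the dollar figures are the approximate values from the table, and `batch_cost` is a hypothetical helper):

```python
def batch_cost(realtime_cost, discount=0.50):
    """Cost of the same workload on the Batch API, given the stated discount."""
    return realtime_cost * (1.0 - discount)


for label, realtime in (("1M tokens (GPT-4o)", 10.00),
                        ("Embeddings (1M tokens)", 0.13)):
    print(f"{label}: ${realtime:.3f} -> ${batch_cost(realtime):.3f}")
```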

**Provisioned vs. Standard:**

- **Provisioned:** Fixed cost (per PTU/hour), predictable latency
- **Standard:** Pay-per-token, no guarantees under high traffic

**Reservation discounts (Provisioned):**
- 1-year commitment: ~37% discount
- 3-year commitment: ~57% discount

---

## For the Architect (Cosmo)

### Design Principles for Robust API Integration

1. **Error Handling Hierarchy:**
   ```
   Try specific exceptions first → HttpResponseError → AzureError → generic Exception
   ```

2. **Retry Decision Matrix:**
   - **Transient (retry):** 408, 429, 500-504, network errors
   - **Permanent (don't retry):** 400, 401, 403, 404, 422
   - **Custom logic:** 429 with fallback

3. **Rate Limiting Strategy:**
   - **Low volume (<100 req/s):** SDK default retry
   - **High volume (>1000 req/s):** Event Hubs + job processor
   - **Provisioned deployments:** Monitor utilization, implement fallback

4. **Batching Decision:**
   - Results can wait up to 24 hours? → Batch API
   - Volume >10k requests? → Batch API
   - Cost critical? → Batch API

5. **Idempotency Checklist:**
   - [ ] Operations designed for identical input?
   - [ ] Duplicate detection enabled (if using Service Bus)?
   - [ ] Database check before processing?
   - [ ] Correlation IDs for tracing?

### Common Anti-Patterns (and How to Avoid Them)

| Anti-Pattern | Problem | Solution |
|--------------|---------|---------|
| **while(true) retry loop** | Retry storm → overwhelms service | Max retries + exponential backoff |
| **Fixed 1-second delays** | Ignores `retry-after` header | Use SDK retry or honor the header |
| **No connection pooling** | SNAT port exhaustion | Enable connection pooling |
| **Hardcoded API keys** | Security risk | Use Managed Identity + Key Vault |
| **No timeout configuration** | Hanging requests (10 min default) | Set realistic timeouts (30-300s) |
| **Logging sensitive data** | GDPR violation | Hash/mask PII in logs |

### Monitoring and Alerting

**Critical metrics:**

```kusto
// Azure Monitor query for error rate trends
AzureDiagnostics
| where ResourceType == "COGNITIVE-SERVICES"
| where TimeGenerated > ago(1h)
| summarize
    total_requests = count(),
    errors = countif(resultCode_d >= 400)
    by bin(TimeGenerated, 5m)
| extend error_rate = (errors * 100.0) / total_requests
| where error_rate > 5  // Alert if >5% error rate
```

**Alert rules:**
- **429 rate >5%** → Scale PTUs or enable fallback
- **500-504 errors** → Check service health dashboard
- **Average latency >5s** → Optimize prompts or use batch processing

### Architecture Decision Records (ADR) Triggers

**When should you create an ADR?**

- [ ] Choosing the Batch API over the real-time API for production
- [ ] Implementing custom retry logic (deviating from SDK defaults)
- [ ] Using distributed rate limiting (blob leases)
- [ ] Choosing Provisioned over Standard (cost/latency trade-off)
- [ ] Implementing a multi-region fallback strategy

---

## Sources and Verification

**Verification status:** ✅ Verified via Microsoft Learn MCP (2026-02)

**Primary sources (fetched):**

1. **Handle errors produced by the Azure SDK for Python**
   - URL: https://learn.microsoft.com/en-us/azure/developer/python/sdk/fundamentals/errors
   - Confidence: **Verified** (MCP fetch)

2. **Rate Limiting pattern**
   - URL: https://learn.microsoft.com/en-us/azure/architecture/patterns/rate-limiting-pattern
   - Confidence: **Verified** (MCP fetch)

3. **Retry Storm antipattern**
   - URL: https://learn.microsoft.com/en-us/azure/architecture/antipatterns/retry-storm
   - Confidence: **Verified** (MCP fetch)

4. **Get started using provisioned deployments on Azure OpenAI**
   - URL: https://learn.microsoft.com/en-us/azure/ai-foundry/openai/how-to/provisioned-get-started
   - Confidence: **Verified** (MCP fetch)

5. **Getting started with Azure OpenAI batch deployments**
   - URL: https://learn.microsoft.com/en-us/azure/ai-foundry/openai/how-to/batch
   - Confidence: **Verified** (MCP search)

6. **Azure AI services authentication and authorization using .NET**
   - URL: https://learn.microsoft.com/en-us/dotnet/ai/azure-ai-services-authentication
   - Confidence: **Verified** (MCP search)

7. **Designing Azure Functions for identical input (idempotency)**
   - URL: https://learn.microsoft.com/en-us/azure/azure-functions/functions-idempotent
   - Confidence: **Verified** (MCP search)

8. **Duplicate detection (Azure Service Bus)**
   - URL: https://learn.microsoft.com/en-us/azure/service-bus-messaging/duplicate-detection
   - Confidence: **Verified** (MCP search)

**Code samples (verified):**

- Azure.AI.Inference (C#) error handling
- Azure SDK Python retry policies
- OpenAI Python SDK custom retry configuration

**Related documentation:**

- Azure Monitor metrics and logging
- Circuit Breaker pattern (Azure Architecture Center)
- Connection pooling (Azure App Service best practices)

**Baseline knowledge (model):**
- HTTP idempotency semantics (RFC 7231)
- Exponential backoff algorithms
- Connection pooling concepts

**MCP call summary:** 7 microsoft_docs_search + 4 microsoft_docs_fetch + 1 microsoft_code_sample_search = 12 total MCP calls