Azure AI Services - API Design and Best Practices

Last updated: 2026-04 | Verified: MCP 2026-04 | Status: GA | Category: Azure AI Services (Foundry Tools)


Introduction

When building production-ready applications on Azure AI Services (Azure OpenAI, Content Safety, Translator, Document Intelligence, Computer Vision, etc.), robust API design and error handling are critical. Distributed cloud services require applications to handle transient failures, throttling, network problems, and unexpected responses in a structured way.

This reference covers best practices for:

  • Error handling: structured error handling with the Azure SDK exception hierarchy
  • Retry logic: exponential backoff, rate limiting, and retry storms
  • Rate limiting: throttling handling and quota management
  • Batching: efficient use of the Batch API for high-volume operations
  • Connection management: connection pooling and timeout configuration
  • Idempotency: designing so that identical requests can be handled safely
  • Authentication patterns: Managed Identity vs. API keys

Source: Microsoft Learn (verified via MCP 2026-02)


Core components / Key features

1. Azure SDK Exception Hierarchy

The Azure SDKs for Python and .NET use a hierarchical exception model that supports both generic and specific error handling. The hierarchy shown here is the Python one (azure.core.exceptions).

Exception hierarchy:

AzureError (base)
├── ServiceRequestError
├── ServiceResponseError
└── HttpResponseError
    ├── ClientAuthenticationError
    ├── ResourceNotFoundError
    ├── ResourceExistsError
    ├── ResourceModifiedError
    └── ResourceNotModifiedError

Key exception types:

| Exception | HTTP status | When it is raised | Retry? |
|---|---|---|---|
| ClientAuthenticationError | 401 | Authentication failure | No: fix credentials |
| ResourceNotFoundError | 404 | Resource doesn't exist | No (unless transient) |
| ResourceExistsError | 409 | Resource already exists | No: handle duplicate |
| HttpResponseError (429) | 429 | Rate limit exceeded | Yes, with backoff |
| HttpResponseError (500-504) | 500-504 | Server/gateway error | Yes: transient |
| ServiceRequestError | N/A | Network/DNS failure | Yes: network transient |

2. HTTP Error Codes (Azure OpenAI)

| Status code | Error type | Retry strategy |
|---|---|---|
| 400 | Bad Request | Fix input, don't retry |
| 401 | Authentication Error | Fix credentials |
| 403 | Permission Denied | Fix RBAC assignments |
| 404 | Not Found | Verify resource exists |
| 408 | Request Timeout | Retry with backoff |
| 422 | Unprocessable Entity | Fix input validation |
| 429 | Rate Limit Error | Retry with retry-after header |
| 500 | Internal Server Error | Retry with backoff |
| 502 | Bad Gateway | Retry with backoff |
| 503 | Service Unavailable | Retry with backoff |
| 504 | Gateway Timeout | Retry with backoff |

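The Azure OpenAI SDKs (Python, .NET, Go) automatically retry 408, 429, 500, 502, 503, and 504 with exponential backoff. The default retry count varies by SDK; the OpenAI Python client, for example, defaults to 2 retries (max_retries).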
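
The status table can be condensed into a small lookup for custom clients. This is a minimal sketch; `classify_status` and the returned labels are illustrative names, not part of any SDK:

```python
# Hypothetical helper: maps the status codes from the table to a handling action.
RETRY_WITH_BACKOFF = {408, 500, 502, 503, 504}
FIX_BEFORE_RESEND = {
    400: "fix input",
    401: "fix credentials",
    403: "fix RBAC assignments",
    404: "verify resource exists",
    422: "fix input validation",
}

def classify_status(status_code: int) -> str:
    """Return the handling action for an HTTP status code."""
    if status_code == 429:
        return "retry after retry-after header"
    if status_code in RETRY_WITH_BACKOFF:
        return "retry with backoff"
    # Codes outside the table are treated conservatively (assumption).
    return FIX_BEFORE_RESEND.get(status_code, "unknown: do not retry blindly")
```

Keeping the mapping in data rather than in nested conditionals makes it easy to review against the table when the service adds or changes codes.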

3. Retry Logic Patterns

Exponential backoff (recommended):

from azure.core.pipeline.policies import RetryPolicy
from azure.identity import DefaultAzureCredential
from azure.storage.blob import BlobServiceClient

credential = DefaultAzureCredential()

retry_policy = RetryPolicy(
    retry_total=5,                  # Max retry attempts
    retry_backoff_factor=2,         # Delay grows as factor * 2^(retry - 1) seconds
    retry_backoff_max=60,           # Cap on any single delay: 60s
    retry_on_status_codes=[408, 429, 500, 502, 503, 504]
)

client = BlobServiceClient(
    account_url="https://...",
    credential=credential,
    retry_policy=retry_policy
)
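
The delay schedule that an exponential backoff policy produces can be sketched in plain Python. This is an illustration under assumed parameters (`factor`, `cap`, and the optional jitter are hypothetical names chosen here), not the SDK's internal implementation:

```python
import random

def backoff_delay(attempt: int, factor: float = 2.0, cap: float = 60.0,
                  jitter: bool = False) -> float:
    """Delay in seconds before retry number `attempt` (1-based)."""
    if attempt < 1:
        raise ValueError("attempt must be >= 1")
    # Exponential growth, capped so a long outage never produces huge waits.
    delay = min(cap, factor * (2 ** (attempt - 1)))
    # Optional "full jitter": pick uniformly in [0, delay] so that many
    # clients do not retry in lockstep.
    return random.uniform(0, delay) if jitter else delay

# Delays for the first five retries without jitter: 2, 4, 8, 16, 32 seconds.
schedule = [backoff_delay(n) for n in range(1, 6)]
```

Jitter is off by default here only to keep the schedule easy to read; with many concurrent clients, enabling it is usually the safer choice.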

Azure OpenAI custom retry (Python):

import os

from openai import AzureOpenAI

client = AzureOpenAI(
    azure_endpoint=os.getenv("AZURE_OPENAI_ENDPOINT"),
    api_key=os.getenv("AZURE_OPENAI_API_KEY"),
    api_version="2024-10-21",
    max_retries=5  # Default: 2
)

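C# error handling with retry on 429:

using System;
using System.Threading;
using Azure;
using Azure.AI.Inference;

// Assumes `client` (ChatCompletionsClient) and `requestOptions` are created elsewhere.
const int maxRetries = 5;

for (int retryCount = 0; ; retryCount++) {
    try {
        var response = client.Complete(requestOptions);
        break;  // Success
    } catch (RequestFailedException ex) when (ex.ErrorCode == "content_filter") {
        Console.WriteLine($"Content filter triggered: {ex.Message}");
        break;  // Not retryable
    } catch (RequestFailedException ex) when (ex.Status == 429 && retryCount < maxRetries) {
        // Exponential backoff: 1s, 2s, 4s, ...
        Thread.Sleep(TimeSpan.FromSeconds(Math.Pow(2, retryCount)));
    }
    // Any other RequestFailedException propagates to the caller.
}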

4. Rate Limiting and 429 Responses

Azure OpenAI Provisioned Throughput:

  • A 429 response means the provisioned PTUs are fully utilized
  • The service returns retry-after and retry-after-ms headers
  • Default SDK behavior: respects retry-after and retries automatically
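
For clients that do not go through an SDK, the retry-after and retry-after-ms headers can be read with a few lines of Python. A minimal sketch, assuming both headers carry plain numbers (the HTTP-date form of retry-after is deliberately not handled); `retry_after_seconds` is a hypothetical helper name:

```python
from typing import Mapping, Optional

def retry_after_seconds(headers: Mapping[str, str]) -> Optional[float]:
    """Return the server-requested wait in seconds, or None if absent or unparseable."""
    # Header names are case-insensitive, so normalise them before lookup.
    lowered = {name.lower(): value for name, value in headers.items()}
    # Prefer the millisecond header because it is more precise.
    for name, scale in (("retry-after-ms", 1000.0), ("retry-after", 1.0)):
        value = lowered.get(name)
        if value is None:
            continue
        try:
            return max(0.0, float(value) / scale)
        except ValueError:
            continue  # e.g. an HTTP-date; fall through to the next header
    return None
```

When the function returns None, falling back to an exponential backoff delay keeps the client from retrying immediately.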

Handling 429:

| Strategy | When to use | Latency impact |
|---|---|---|
| Client-side retry | Higher latency is acceptable | ⬆️ Higher (waits for retry-after) |
| Fallback to another deployment | Low-latency requirements | ⬇️ Lower (immediate failover) |
| Fallback to global-standard | Cost/availability balance | ➡️ Moderate (somewhat higher cost) |

Rate limiting pattern (for bulk operations):

# Bad practice: Naive retry storm
for record in records:
    try:
        client.process(record)
    except RateLimitError:
        time.sleep(1)  # Fixed delay — overwhelms service

# Good practice: Rate limiter + durable queue
# 1. Enqueue to Azure Event Hubs/Service Bus
# 2. Job processor dequeues at controlled rate
# 3. Tracks PTU utilization via Azure Monitor
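
The "controlled rate" in step 2 is commonly implemented with a token bucket. The following is a minimal in-process sketch, not taken from the source; `TokenBucket`, its parameters, and the injectable `clock` are assumptions made for illustration:

```python
import time
from typing import Callable

class TokenBucket:
    """Allows roughly `rate` calls per second, with bursts up to `capacity`."""

    def __init__(self, rate: float, capacity: float,
                 clock: Callable[[], float] = time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.clock = clock            # injectable so the limiter can be tested
        self.tokens = capacity        # start with a full bucket
        self.updated = clock()

    def try_acquire(self) -> bool:
        """Take one token if available; never blocks."""
        now = self.clock()
        # Refill in proportion to elapsed time, never above capacity.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.updated) * self.rate)
        self.updated = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False
```

A job processor would call `try_acquire()` before each request and leave the record on the queue when it returns False, instead of sleeping for a fixed second and hoping the service has recovered.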

5. Batching (Azure OpenAI Batch API)

Batch API: Asynchronous batch operations at 50% lower cost than the real-time API.

Use cases:

  • Large-scale data processing (embeddings, summarization)
  • Content generation (product descriptions, translations)
  • Document review (legal, compliance)
  • NLP tasks (sentiment analysis, classification)

Batch limits:

| Parameter | Limit |
|---|---|
| Max batch files (no expiration) | 500 |
| Max batch files (with expiration) | 10,000 |
| Max input file size | 200 MB (BYOS: 1 GB) |
| Max requests per file | 100,000 |

Queueing with exponential backoff (Python):

import time

max_retries = 10
retry_count = 0
batch_job = None

while retry_count < max_retries:
    try:
        batch_job = client.batches.create(
            input_file_id=file_id,
            endpoint="/chat/completions",
            completion_window="24h"
        )
        break  # Success
    except Exception as e:
        if "token limit exceeded" in str(e):
            retry_count += 1
            wait_time = 2 ** retry_count
            time.sleep(wait_time)
        else:
            raise

Fail-fast regions (for batching): Australia East, East US, Germany West Central, Italy North, North Central US, Poland Central, Sweden Central, Switzerland North, East US 2, West US.

6. Connection Pooling and Timeouts

HTTP connection pooling (Python):

import requests

# Keep-alive enabled by default
session = requests.Session()
response = session.get("https://api.example.com")

Azure OpenAI timeout configuration (Python):

from openai import AzureOpenAI

client = AzureOpenAI(
    azure_endpoint="...",
    api_key="...",
    api_version="2024-10-21",  # Required unless OPENAI_API_VERSION is set
    timeout=300.0  # 5 minutes (default: 600s/10 min)
)

Connection pooling for database SDKs:

| SDK | Module |
|---|---|
| MySQL | mysql.connector.pooling |
| PostgreSQL | psycopg2.pool |
| SQLAlchemy | sqlalchemy.pool |
| Pyodbc | Built-in pooling |

Best practice:

  • Use connection pools for database/HTTP clients
  • Set realistic timeouts (not 10 min for user-facing apps)
  • Implement keepalives for long-running connections
  • Do NOT create a new connection for every request

7. Idempotency

Definition: An operation is idempotent if it can be called multiple times without producing additional side effects after the first call.

HTTP idempotency:

| HTTP method | Idempotent? | Description |
|---|---|---|
| GET | Yes | Read-only, no side effects |
| PUT | Yes | Replaces resource at URI |
| DELETE | Yes | Deletes resource (same outcome) |
| POST | No | Creates a new resource every time |
| PATCH | No | Partial update (depends) |

Idempotency-teknikker for Azure AI Services:

# 1. Check if already processed (database lookup)
def process_document(doc_id):
    if already_processed(doc_id):
        return cached_result(doc_id)

    result = client.analyze_document(...)
    save_result(doc_id, result)
    return result

# 2. Event-carried state transfer (Event Hubs)
event = {
    "doc_id": "12345",
    "operation": "set_status",
    "status": "completed",  # Not "increment_count" — idempotent
    "timestamp": "2026-02-03T10:00:00Z"
}

# 3. Deduplication window (Service Bus)
# Enable duplicate detection with MessageId
message.message_id = f"{order_id}-{timestamp}"

Duplicate detection (Azure Service Bus):

  • Default deduplication window: 10 minutes
  • Min: 20 seconds, Max: 7 days
  • Based on MessageId (or MessageId + PartitionKey if partitioned)
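
The idea of a deduplication window can be illustrated in memory. This sketch only mirrors the concept (IDs are remembered for a fixed window, 600 seconds matching the 10-minute default); it is not how Service Bus implements the feature, and `DedupWindow` is a hypothetical class:

```python
import time
from typing import Callable, Dict

class DedupWindow:
    """Remembers message IDs for `window_seconds` and flags repeats inside that window."""

    def __init__(self, window_seconds: float = 600.0,
                 clock: Callable[[], float] = time.monotonic):
        self.window = window_seconds
        self.clock = clock
        self._seen: Dict[str, float] = {}   # message_id -> time first accepted

    def is_duplicate(self, message_id: str) -> bool:
        now = self.clock()
        # Forget IDs whose window has expired.
        self._seen = {m: t for m, t in self._seen.items()
                      if now - t < self.window}
        if message_id in self._seen:
            return True
        self._seen[message_id] = now
        return False
```

The sketch also shows why the ID must be stable across resends: if a retried message is given a fresh ID, for example by embedding the send time, the window never sees it as a repeat.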

Architecture patterns

Pattern 1: Rate Limiting with Durable Messaging

Problem: Bulk ingestion into a throttled service (Azure Cosmos DB, Azure AI Search) results in retry storms and a high failure rate.

Solution: Use Azure Event Hubs/Service Bus as a buffer, plus a job processor with rate limiting.

User API → Event Hubs → Job Processor (rate-limited) → Azure AI Service
             (buffer)      (100 req/s controlled)

Implementation:

  1. API enqueues messages (millions per second capacity)
  2. Job processor leases partitions from blob storage (15s lease)
    • Each partition = 100 PTUs (requests/s)
    • Process dequeues only what it can handle in 1s
  3. Monitor utilization via Azure Monitor (Provisioned-Managed Utilization V2)
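
The rule "dequeue only what it can handle in 1s" can be sketched as a per-tick budget. An in-memory `deque` stands in for the durable queue here, and `process_tick`, `budget`, and `handler` are hypothetical names; lease handling and checkpointing are left out:

```python
from collections import deque
from typing import Callable, Deque, List

def process_tick(queue: Deque[str], budget: int,
                 handler: Callable[[str], None]) -> List[str]:
    """Dequeue at most `budget` items; one tick represents one second of capacity."""
    processed: List[str] = []
    while queue and len(processed) < budget:
        item = queue.popleft()
        handler(item)
        processed.append(item)
    # Anything left in the queue simply waits for the next tick.
    return processed
```

Calling `process_tick` once per second with `budget` set to the partition's capacity keeps the downstream service at or below its limit regardless of how fast producers enqueue.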

Benefits:

  • Reduces 429 errors from 80% to <5%
  • Predictable throughput
  • No data loss on crash (durable queue)
  • Scales horizontally (multiple job processors)

Pattern 2: Circuit Breaker (for transient faults)

Problem: Repeated calls to an unavailable service make the problem worse (thundering herd).

Solution: Circuit Breaker pattern.

States:

| State | Behavior |
|---|---|
| Closed | Normal operation: forwards requests |
| Open | Service unavailable: fails fast (no requests) |
| Half-open | Tests whether the service has recovered: 1 request |

Implementation (Python):

import time

class CircuitBreaker:
    def __init__(self, failure_threshold=5, recovery_timeout=60):
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self.failure_count = 0
        self.state = 'closed'
        self.last_failure_time = None

    def call(self, func, *args, **kwargs):
        if self.state == 'open':
            if time.time() - self.last_failure_time > self.recovery_timeout:
                self.state = 'half-open'  # Let one trial request through
            else:
                raise Exception("Circuit breaker open")

        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failure_count += 1
            self.last_failure_time = time.time()
            # A failed trial request, or too many failures, (re)opens the circuit
            if self.state == 'half-open' or self.failure_count >= self.failure_threshold:
                self.state = 'open'
            raise

        # Success: close the circuit and reset the consecutive-failure counter
        self.state = 'closed'
        self.failure_count = 0
        return result

Pattern 3: Idempotent Consumer (Event Hubs + Functions)

Problem: Event Hubs guarantees at-least-once delivery, so events can be processed more than once.

Solution: Idempotent function design.

Techniques:

  1. Duplicate detection via database:

    def process_event(event):
        if db.exists(event.id):
            return  # Already processed
    
        result = ai_client.analyze(event.data)
        db.save(event.id, result)
    
  2. Event-carried state transfer:

    {
      "account_id": "12345",
      "operation": "set_balance",
      "new_balance": 1000  // Not "withdraw 100" — idempotent
    }
    
  3. PeekLock receive mode (Service Bus):

    • The consumer gets an exclusive lock (configurable duration)
    • It sends an acknowledgment on success
    • The message returns to the queue on timeout/failure

Pattern 4: Fallback Strategy (429 Handling)

Multi-tier fallback:

from openai import AzureOpenAI, RateLimitError

# provisioned_client and standard_client are AzureOpenAI clients for the
# provisioned and standard deployments, created elsewhere.
def generate_completion(prompt):
    try:
        # 1. Try provisioned deployment (lowest latency)
        return provisioned_client.chat.completions.create(...)
    except RateLimitError:
        # 2. 429: fall back to standard deployment
        return standard_client.chat.completions.create(...)

# Alternative: Retry with backoff
# (endpoint, key, and API version are read from environment variables here)
client = AzureOpenAI(
    max_retries=5,
    timeout=300.0
)
response = client.with_options(max_retries=5).chat.completions.create(...)

Decision guidance

When to use Batch API vs. Real-time API?

| Criterion | Batch API | Real-time API |
|---|---|---|
| Latency requirement | Up to 24 hours acceptable | <1 second required |
| Volume | >10,000 requests | <1,000 requests |
| Cost sensitivity | High (50% saving) | Moderate |
| Use case | Offline analytics, bulk processing | User-facing chat, real-time translation |

Retry Strategy Decision Tree

429 Error?
├─ Yes → Check retry-after header → Wait and retry (max 5x)
│       └─ If still 429 → Fallback to another deployment
│
└─ 500-504? → Exponential backoff (2^n seconds, max 60s)
    ├─ Transient → Retry up to 5 times
    └─ Persistent → Log error + alert ops team

401/403? → Do NOT retry → Fix authentication/RBAC
400/422? → Do NOT retry → Fix input validation
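
The decision tree translates fairly directly into a function. This is a sketch under stated assumptions: `next_action` and the action labels are hypothetical, `attempt` counts the failed attempts so far, and the 1-second default when no retry-after value is available is an arbitrary choice:

```python
from typing import Optional, Tuple

MAX_ATTEMPTS = 5            # "max 5x" in the tree
MAX_BACKOFF_SECONDS = 60    # "max 60s" in the tree

def next_action(status: int, attempt: int,
                retry_after: Optional[float] = None) -> Tuple[str, float]:
    """Return (action, wait_seconds) for a failed call; `attempt` is 1-based."""
    if status in (400, 401, 403, 422):
        return ("fail", 0.0)                      # permanent: fix input or auth
    if status == 429:
        if attempt > MAX_ATTEMPTS:
            return ("fallback", 0.0)              # still throttled: other deployment
        # Assumed default of 1s when the server sent no retry-after value.
        return ("retry", retry_after if retry_after is not None else 1.0)
    if 500 <= status <= 504:
        if attempt > MAX_ATTEMPTS:
            return ("alert", 0.0)                 # persistent: log and alert ops
        return ("retry", float(min(MAX_BACKOFF_SECONDS, 2 ** attempt)))
    return ("fail", 0.0)                          # anything else: do not retry
```

Returning the decision as data rather than sleeping inside the function keeps it easy to unit test and lets the caller choose between blocking, scheduling, or re-queueing.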

Rate Limiting Strategy

| Scenario | Recommended solution |
|---|---|
| Single client, moderate load | SDK built-in retry logic (e.g. max_retries=5) |
| Multiple uncoordinated clients | Distributed lease system (blob storage) + partitions |
| Bulk ingestion | Event Hubs + job processor with rate limiter |
| User-facing app | Fallback to standard deployment on 429 |

Integration with the Microsoft stack

Azure AI Foundry Integration

SDKs that support Azure AI Foundry:

  • Python: azure-ai-inference, openai (Azure variant)
  • .NET: Azure.AI.Inference, Azure.AI.OpenAI
  • JavaScript/TypeScript: @azure/openai, @azure/ai-inference
  • Go: github.com/openai/openai-go (with Azure endpoint)

Authentication patterns:

import os

# 1. DefaultAzureCredential (recommended for production)
from azure.identity import DefaultAzureCredential
from azure.ai.inference import ChatCompletionsClient

credential = DefaultAzureCredential()
client = ChatCompletionsClient(
    endpoint="https://<resource>.openai.azure.com",
    credential=credential
)

# 2. Managed Identity (Azure-hosted apps)
from azure.identity import ManagedIdentityCredential

credential = ManagedIdentityCredential()

# 3. API Key (development only)
from azure.core.credentials import AzureKeyCredential

credential = AzureKeyCredential(os.getenv("AZURE_OPENAI_API_KEY"))

Azure Monitor Integration

Metrics to monitor:

| Metric | Threshold | Alert |
|---|---|---|
| Provisioned-Managed Utilization V2 | >95% | Scale up PTUs |
| Dependency failures | >10% | Check retry logic |
| Request duration | >10s | Optimize prompts/batching |
| 429 error rate | >5% | Increase quota or add fallback |

Kusto query (Log Analytics):

AzureDiagnostics
| where ResourceType == "COGNITIVE-SERVICES"
| where Category == "RequestResponse"
| where resultCode_d == 429
| summarize count() by bin(TimeGenerated, 5m), clientIp_s
| order by count_ desc

Power Automate / Logic Apps Integration

Error handling in flows:

  1. Configure retry policy:

    • Retry count: 4
    • Retry interval: Exponential (PT10S, PT20S, PT40S, PT80S)
    • Retry on: 408, 429, 500, 502, 503, 504
  2. Handle 429 with condition:

    {
      "condition": "@equals(actions('Call_Azure_AI').statusCode, 429)",
      "ifTrue": {
        "Wait": "@actions('Call_Azure_AI').outputs.headers['retry-after']"
      }
    }
    

Public sector (Norway)

Compliance and Error Handling

GDPR / Personopplysningsloven (the Norwegian Personal Data Act):

  • NEVER log personally identifying information in error logs
  • Use correlation IDs (not user IDs) in telemetry
  • Respect retry-after headers (don't spam APIs)

Example (sanitized logging):

import hashlib
import logging

from azure.core.exceptions import HttpResponseError

logger = logging.getLogger(__name__)

try:
    result = client.analyze_document(doc_id)
except HttpResponseError as e:
    logger.error(
        "Document analysis failed",
        extra={
            "correlation_id": e.response.headers.get('x-ms-request-id'),
            "status_code": e.status_code,
            # Stable digest, not plaintext (built-in hash() varies between processes)
            "doc_id": hashlib.sha256(str(doc_id).encode()).hexdigest(),
            "error_code": e.error.code if e.error else None
        }
    )
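
When a hashed identifier has to be correlated across processes or services, a keyed digest is one option. Python's built-in hash() is salted per process for strings, so it is not suitable for this. A minimal sketch, assuming the key comes from a secret store; `pseudonymize` and the 16-character truncation are illustrative choices:

```python
import hashlib
import hmac

def pseudonymize(value: str, secret_key: bytes) -> str:
    """Keyed SHA-256 digest: stable across processes, not reversible without the key."""
    digest = hmac.new(secret_key, value.encode("utf-8"), hashlib.sha256).hexdigest()
    # Shortened for log readability (assumption); keep the full digest
    # if collisions between identifiers would matter.
    return digest[:16]
```

Using a key rather than a bare hash means someone with access to the logs cannot confirm a guessed identifier simply by hashing it themselves.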

Idempotency for Public-Sector Use Cases

Case management systems:

  • Use MessageId = {saksID}-{operasjon}-{timestamp} (case ID, operation, timestamp)
  • Enable duplicate detection (Service Bus)
  • Check the database before processing (deduplication table)

Email notification (which must be idempotent):

def send_notification(case_id, notification_type):
    message_id = f"{case_id}-{notification_type}"

    if already_sent(message_id):
        return  # Idempotent — don't resend

    send_email(...)
    mark_sent(message_id)

Cost and licensing

Cost consequences of API design

429 errors cost nothing (no PTU consumption), BUT:

  • 400 errors (content filter) are billed (the prompt was processed)
  • 408 timeouts are billed (partial processing)
  • finish_reason: content_filter is billed (the completion was filtered)

Batch API savings:

| Scenario | Real-time cost | Batch cost | Savings |
|---|---|---|---|
| 1M tokens (GPT-4o) | ~$10 | ~$5 | 50% |
| Embeddings (1M tokens) | ~$0.13 | ~$0.065 | 50% |
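
The table rows follow from a single multiplication. A small worked example, using the approximate prices from the table and a hypothetical `batch_cost` helper:

```python
BATCH_DISCOUNT = 0.50   # batch is billed at 50% of the real-time price

def batch_cost(realtime_cost: float, discount: float = BATCH_DISCOUNT) -> float:
    """Estimated batch cost for a workload with a known real-time cost."""
    return round(realtime_cost * (1 - discount), 4)

gpt4o = batch_cost(10.00)       # row 1: ~$10 real-time
embeddings = batch_cost(0.13)   # row 2: ~$0.13 real-time
```

Both results match the "Batch cost" column: about $5 and about $0.065 respectively.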

Provisioned vs. Standard:

  • Provisioned: Fixed cost (per PTU/hour), predictable latency
  • Standard: Pay-per-token, no guarantees under high traffic

Reservation discounts (Provisioned):

  • 1-year commitment: ~37% discount
  • 3-year commitment: ~57% discount

For the architect (Cosmo)

Design Principles for Robust API Integration

  1. Error Handling Hierarchy:

    Try specific exceptions first → HttpResponseError → AzureError → generic Exception
    
  2. Retry Decision Matrix:

    • Transient (retry): 408, 429, 500-504, network errors
    • Permanent (don't retry): 400, 401, 403, 404, 422
    • Custom logic: 429 with fallback
  3. Rate Limiting Strategy:

    • Low volume (<100 req/s): SDK default retry
    • High volume (>1000 req/s): Event Hubs + job processor
    • Provisioned deployments: Monitor utilization, implement fallback
  4. Batching Decision:

    • Latency >1 min? → Batch API
    • Volume >10k requests? → Batch API
    • Cost critical? → Batch API
  5. Idempotency Checklist:

    • Operations designed for identical input?
    • Duplicate detection enabled (if using Service Bus)?
    • Database check before processing?
    • Correlation IDs for tracing?

Common Anti-Patterns (and how to avoid them)

| Anti-pattern | Problem | Solution |
|---|---|---|
| while(true) retry loop | Retry storm → overwhelms service | Max retries + exponential backoff |
| Fixed 1-second delays | Ignores retry-after header | Use SDK retry or respect the header |
| No connection pooling | SNAT port exhaustion | Enable connection pooling |
| Hardcoded API keys | Security risk | Use Managed Identity + Key Vault |
| No timeout configuration | Hanging requests (10 min default) | Set realistic timeouts (30-300s) |
| Logging sensitive data | GDPR violation | Hash/mask PII in logs |

Monitoring og Alerting

Critical metrics:

// Azure Monitor query for error rate trends
AzureDiagnostics
| where ResourceType == "COGNITIVE-SERVICES"
| where TimeGenerated > ago(1h)
| summarize
    total_requests = count(),
    errors = countif(resultCode_d >= 400)
    by bin(TimeGenerated, 5m)
| extend error_rate = (errors * 100.0) / total_requests
| where error_rate > 5  // Alert if >5% error rate

Alert rules:

  • 429 rate >5% → Scale PTUs or enable fallback
  • 500-504 errors → Check service health dashboard
  • Average latency >5s → Optimize prompts or use batch processing

Architecture Decision Records (ADR) Triggers

When should you write an ADR?

  • Choosing Batch API over real-time API for production
  • Implementing custom retry logic (deviating from SDK defaults)
  • Using distributed rate limiting (blob leases)
  • Choosing Provisioned over Standard (cost/latency trade-off)
  • Implementing a multi-region fallback strategy

Sources and verification

Verification status: Verified via Microsoft Learn MCP (2026-02)

Primary sources (fetched):

  1. Handle errors produced by the Azure SDK for Python

  2. Rate Limiting pattern

  3. Retry Storm antipattern

  4. Get started using provisioned deployments on Azure OpenAI

  5. Getting started with Azure OpenAI batch deployments

  6. Azure AI services authentication and authorization using .NET

  7. Designing Azure Functions for identical input (idempotency)

  8. Duplicate detection (Azure Service Bus)

Code samples (verified):

  • Azure.AI.Inference (C#) error handling
  • Azure SDK Python retry policies
  • OpenAI Python SDK custom retry configuration

Related documentation:

  • Azure Monitor metrics and logging
  • Circuit Breaker pattern (Azure Architecture Center)
  • Connection pooling (Azure App Service best practices)

Baseline knowledge (model):

  • HTTP idempotency semantics (RFC 7231)
  • Exponential backoff algorithms
  • Connection pooling concepts

MCP call summary: 7 microsoft_docs_search + 4 microsoft_docs_fetch + 1 microsoft_code_sample_search = 12 total MCP calls