# Chaos Engineering for AI Systems

**Last updated:** 2026-02
**Status:** GA
**Category:** Business Continuity & Disaster Recovery

---

## Introduction

Chaos engineering is the practice of deliberately injecting faults into a system to test its resilience and expose weaknesses before they cause production incidents. This is especially valuable for AI systems, because AI workloads have complex dependency chains (model endpoints, search indexes, embedding pipelines, data stores) in which a failure in one component can cascade unpredictably.

Azure Chaos Studio is Azure's native chaos engineering platform and offers both agent-based and service-direct fault injection. For AI systems, Chaos Studio can simulate everything from network partitions to CPU pressure and DNS failures, which lets teams validate that circuit breakers, retry logic, and graceful degradation work as expected.

For the Norwegian public sector, chaos engineering is an important part of meeting NSM's requirement for regular testing of security measures (basic principle 4.3). Organizations are advised to run structured fault injection tests at least quarterly, and after every major change to the AI architecture.
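To make the resilience mechanisms named in the introduction concrete, the following is a minimal sketch of a circuit breaker with a fallback path, which is the kind of behavior a chaos experiment is meant to exercise. The class `CircuitBreaker`, its parameters, and the helper functions `flaky_model_call` and `cached_answer` are hypothetical names chosen for illustration; they are not part of Azure Chaos Studio or any Azure SDK.

```python
import time
from typing import Callable, Optional


class CircuitBreaker:
    """Illustrative breaker: opens after `max_failures` consecutive errors."""

    def __init__(self, max_failures: int = 3, reset_seconds: float = 30.0,
                 clock: Callable[[], float] = time.monotonic):
        self.max_failures = max_failures
        self.reset_seconds = reset_seconds
        self.clock = clock  # injectable so tests can control time
        self.failures = 0
        self.opened_at: Optional[float] = None

    def is_open(self) -> bool:
        if self.opened_at is None:
            return False
        if self.clock() - self.opened_at >= self.reset_seconds:
            # Half-open: allow one probe; a single failure re-opens the circuit.
            self.opened_at = None
            self.failures = self.max_failures - 1
            return False
        return True

    def call(self, primary: Callable, fallback: Callable):
        """Run `primary`; use `fallback` when the circuit is open or the call fails."""
        if self.is_open():
            return fallback()
        try:
            result = primary()
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()
            return fallback()
        self.failures = 0
        return result


def flaky_model_call():
    # Hypothetical stand-in for a model endpoint hit by an injected fault.
    raise ConnectionError("injected fault")


def cached_answer():
    # Hypothetical degraded response served while the dependency is down.
    return "fallback: cached answer"


breaker = CircuitBreaker(max_failures=2, reset_seconds=60)
responses = [breaker.call(flaky_model_call, cached_answer) for _ in range(3)]
print(responses, "circuit open:", breaker.is_open())
```

During an experiment, the interesting observations are whether the fallback is served while the fault is active and whether the breaker closes again once the fault is removed.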
## Fault injection strategies for AI services

### Fault catalog for AI workloads

| Fault type | Simulation | Affected component | Expected response |
|------------|------------|--------------------|-------------------|
| Regional outage | DNS failure or network block | Azure OpenAI | Failover to secondary region |
| API throttling | Synthetic 429 response | Azure OpenAI | Retry with backoff, graceful degradation |
| Search unavailable | Network block to search | AI Search | Fallback to keyword search |
| High latency | Network delay | All API calls | Timeout → circuit breaker |
| Data corruption | Incorrect embedding values | Cosmos DB / Search | Validation and rebuild |
| Memory pressure | VM memory stress | App Service | Auto-restart, scaling |
| Dependency failure | DNS poisoning | Key Vault, App Config | Cached config, graceful degradation |

### Azure Chaos Studio experiments

```bash
# Enable Chaos Studio on the target resources ({sub} is a subscription ID placeholder)

# Step 1: Register the App Service as a Chaos Studio target
az rest --method PUT \
  --url "https://management.azure.com/subscriptions/{sub}/resourceGroups/rg-ai-prod/providers/Microsoft.Web/sites/ai-app-prod/providers/Microsoft.Chaos/targets/Microsoft-AppService?api-version=2024-01-01" \
  --body '{"properties":{}}'

# Step 2: Enable the capability (App Service Stop)
az rest --method PUT \
  --url "https://management.azure.com/subscriptions/{sub}/resourceGroups/rg-ai-prod/providers/Microsoft.Web/sites/ai-app-prod/providers/Microsoft.Chaos/targets/Microsoft-AppService/capabilities/Stop-1.0?api-version=2024-01-01" \
  --body '{"properties":{}}'
```

### Chaos experiment: simulate an Azure OpenAI regional outage

```json
{
  "identity": { "type": "SystemAssigned" },
  "location": "norwayeast",
  "properties": {
    "selectors": [
      {
        "id": "selector-nsg-block-openai",
        "type": "List",
        "targets": [
          {
            "id": "/subscriptions/{sub}/resourceGroups/rg-ai-prod/providers/Microsoft.Network/networkSecurityGroups/nsg-ai-app/providers/Microsoft.Chaos/targets/Microsoft-NetworkSecurityGroup",
            "type": "ChaosTarget"
          }
        ]
      }
    ],
"steps": [ { "name": "Block-OpenAI-Traffic", "branches": [ { "name": "branch-1", "actions": [ { "name": "urn:csci:microsoft:networkSecurityGroup:securityRule/1.1", "type": "continuous", "selectorId": "selector-nsg-block-openai", "duration": "PT10M", "parameters": [ { "key": "name", "value": "chaos-block-openai" }, { "key": "protocol", "value": "*" }, { "key": "sourceAddresses", "value": "[\"*\"]" }, { "key": "destinationAddresses", "value": "[\"CognitiveServicesManagement\"]" }, { "key": "destinationPortRanges", "value": "[\"443\"]" }, { "key": "access", "value": "Deny" }, { "key": "priority", "value": "100" }, { "key": "direction", "value": "Outbound" } ] } ] } ] } ] } } ``` ## Nettverkspartisjonssimulering ### Simuler cross-region nettverkspartisjon ```bash # Chaos experiment: Simuler nettverkspartisjon mellom regioner # Blokkerer VNet peering-trafikk for å teste failover # Metode 1: NSG-basert blokkering az network nsg rule create \ --resource-group "rg-networking" \ --nsg-name "nsg-ai-app" \ --name "chaos-block-cross-region" \ --priority 50 \ --direction Outbound \ --access Deny \ --protocol "*" \ --destination-address-prefixes "10.2.0.0/16" \ --description "CHAOS TEST: Block cross-region traffic" # Vent og observer (10 minutter) sleep 600 # Fjern blokkeringen az network nsg rule delete \ --resource-group "rg-networking" \ --nsg-name "nsg-ai-app" \ --name "chaos-block-cross-region" ``` ### DNS-feil simulering ```python # Python: Simuler DNS-feil for testing # Bruk Azure Private DNS zone override for å simulere DNS-feil import subprocess def simulate_dns_failure(target_fqdn: str, duration_minutes: int = 10): """Simulate DNS failure by overriding DNS resolution.""" print(f"Simulating DNS failure for {target_fqdn} for {duration_minutes} min") # Opprett en DNS record som peker til en ikke-eksisterende IP subprocess.run([ "az", "network", "private-dns", "record-set", "a", "add-record", "--resource-group", "rg-networking", "--zone-name", 
"privatelink.openai.azure.com", "--record-set-name", "chaos-test", "--ipv4-address", "10.255.255.255" # Ikke-ruterbar IP ]) print(f"DNS poisoned. Observing for {duration_minutes} minutes...") import time time.sleep(duration_minutes * 60) # Rydd opp subprocess.run([ "az", "network", "private-dns", "record-set", "a", "remove-record", "--resource-group", "rg-networking", "--zone-name", "privatelink.openai.azure.com", "--record-set-name", "chaos-test", "--ipv4-address", "10.255.255.255" ]) print("DNS restored.") ``` ## Last- og stresstesting ### Load testing med Azure Load Testing ```yaml # JMeter test plan for AI API stress testing # azure-load-test-config.yaml version: v0.1 testId: ai-stress-test testPlan: ai-load-test.jmx engineInstances: 5 configurationFiles: - ai-load-test.jmx failureCriteria: - avg(response_time_ms) > 5000 - percentage(error) > 5 env: - name: AOAI_ENDPOINT value: https://aoai-prod.openai.azure.com - name: SEARCH_ENDPOINT value: https://search-prod.search.windows.net ``` ```bash # Opprett og kjør load test az load test create \ --name "ai-stress-test" \ --resource-group "rg-ai-test" \ --load-test-resource "lt-ai-prod" \ --test-plan "ai-load-test.jmx" \ --engine-instances 5 # Kjør test med failover-scenario az load test-run create \ --name "failover-stress-run" \ --resource-group "rg-ai-test" \ --load-test-resource "lt-ai-prod" \ --test-id "ai-stress-test" \ --description "Stress test during simulated failover" ``` ### Gradvis belastningsøkning ```python # Gradvis belastningsøkning for å finne breaking point import asyncio import aiohttp import time async def ramp_up_test( endpoint: str, start_rps: int = 10, end_rps: int = 500, step_rps: int = 10, step_duration_seconds: int = 60 ): """Gradually increase load to find service breaking point.""" current_rps = start_rps results = [] while current_rps <= end_rps: print(f"Testing at {current_rps} RPS for {step_duration_seconds}s...") interval = 1.0 / current_rps success_count = 0 error_count = 0 
        total_latency = 0
        start = time.time()

        # Reuse one session per step instead of opening one per request.
        # Requests are sent one at a time, so the achieved rate falls below
        # the nominal RPS once response latency becomes significant.
        async with aiohttp.ClientSession() as session:
            while time.time() - start < step_duration_seconds:
                try:
                    req_start = time.time()
                    async with session.post(endpoint, json={"query": "test"}) as resp:
                        if resp.status < 400:
                            success_count += 1
                        else:
                            error_count += 1
                        total_latency += (time.time() - req_start) * 1000
                except Exception:
                    error_count += 1

                await asyncio.sleep(interval)

        total = success_count + error_count
        error_rate = error_count / max(total, 1) * 100
        avg_latency = total_latency / max(total, 1)

        results.append({
            "rps": current_rps,
            "success": success_count,
            "errors": error_count,
            "error_rate": round(error_rate, 2),
            "avg_latency_ms": round(avg_latency, 1)
        })

        print(f"  Results: {error_rate:.1f}% errors, {avg_latency:.0f}ms avg latency")

        # Stop if the error rate is too high
        if error_rate > 20:
            print(f"Breaking point found at {current_rps} RPS")
            break

        current_rps += step_rps

    return results
```

## Recovery time measurement and validation

### RTO measurement during chaos testing

```python
# Automatic RTO measurement during a failover test
import time
import requests
from datetime import datetime, timezone


class RTOMeasurement:
    """Measure actual RTO during failover tests."""

    def __init__(self, health_endpoint: str, check_interval_seconds: float = 1.0):
        self.health_endpoint = health_endpoint
        self.check_interval = check_interval_seconds
        self.measurements = []

    def measure_rto(self, max_wait_seconds: int = 600) -> dict:
        """Continuously check health and measure recovery time."""
        failure_detected = None
        recovery_detected = None
        was_healthy = True
        checks = []

        start = time.time()
        while time.time() - start < max_wait_seconds:
            try:
                resp = requests.get(self.health_endpoint, timeout=5)
                is_healthy = resp.status_code == 200
            except requests.RequestException:
                # Timeouts and connection errors count as unhealthy
                is_healthy = False

            check = {
                "timestamp": datetime.now(timezone.utc).isoformat(),
                "elapsed_seconds": round(time.time() - start, 1),
                "healthy": is_healthy
            }
            checks.append(check)

            if was_healthy and not is_healthy and failure_detected is None:
                failure_detected = time.time()
                print(f"Failure detected at {check['elapsed_seconds']}s")

            if not was_healthy and is_healthy and failure_detected and recovery_detected is None:
                recovery_detected = time.time()
                rto = recovery_detected - failure_detected
                print(f"Recovery detected at {check['elapsed_seconds']}s; RTO: {rto:.1f}s")

            was_healthy = is_healthy
            time.sleep(self.check_interval)

        result = {
            "failure_detected": failure_detected is not None,
            "recovery_detected": recovery_detected is not None,
            # None when no failure or no recovery was observed within the window
            "rto_seconds": round(recovery_detected - failure_detected, 1)
            if recovery_detected and failure_detected else None,
            "total_checks": len(checks),
            "healthy_checks": sum(1 for c in checks if c["healthy"]),
            "unhealthy_checks": sum(1 for c in checks if not c["healthy"]),
            "availability_pct": round(
                sum(1 for c in checks if c["healthy"]) / max(len(checks), 1) * 100, 2
            )
        }
        self.measurements.append(result)
        return result


# Usage
rto_meter = RTOMeasurement("https://ai-app-prod.azurewebsites.net/health")
result = rto_meter.measure_rto(max_wait_seconds=600)
print(f"Measured RTO: {result['rto_seconds']}s")
```

## Tools and platforms for chaos engineering

### Azure Chaos Studio

| Feature | Description | Supported resources |
|---------|-------------|---------------------|
| Service-direct faults | Faults injected through the Azure API | App Service, AKS, Cosmos DB, NSG |
| Agent-based faults | Faults injected through a VM agent | CPU/memory stress, network faults |
| Experiments | Structured fault sequences | All supported resources |
| Permissions | RBAC-based access control | Dedicated Chaos role |

### Complementary tools

| Tool | Use case | Azure integration |
|------|----------|-------------------|
| Azure Chaos Studio | Native Azure fault injection | Built in |
| Azure Load Testing | Load testing | Built in, JMeter-based |
| Litmus Chaos | Kubernetes chaos testing | AKS-compatible |
| Toxiproxy | Network faults during development | Manual setup |
| PyRIT | AI-specific red teaming | Azure AI |

### Chaos testing CI/CD integration

```yaml
# Azure DevOps pipeline: chaos testing as part of a release
trigger: none

stages:
  - stage: DeployToStaging
    displayName: 'Deploy to Staging'
    jobs:
      - job: Deploy
        steps:
          # Service connection and package inputs are omitted here
          - task: AzureWebApp@1
            inputs:
              appName: 'ai-app-staging'

  - stage: ChaosTests
    displayName: 'Run Chaos Experiments'
    dependsOn: DeployToStaging
    jobs:
      - job: RunChaosExperiment
        steps:
          - task: AzureCLI@2
            displayName: 'Start chaos experiment'
            inputs:
              azureSubscription: 'chaos-service-connection'
              scriptType: 'bash'
              scriptLocation: 'inlineScript'
              inlineScript: |
                # Start the chaos experiment and capture the returned status URL
                STATUS_URL=$(az rest --method POST \
                  --url "https://management.azure.com/subscriptions/{sub}/resourceGroups/rg-ai-test/providers/Microsoft.Chaos/experiments/openai-failover-test/start?api-version=2024-01-01" \
                  --query "statusUrl" -o tsv)
                echo "Experiment started: $STATUS_URL"

                # Wait and measure RTO (measure_rto.py is expected to write rto_result.json)
                python measure_rto.py \
                  --endpoint "https://ai-app-staging.azurewebsites.net/health" \
                  --max-wait 300

          - task: AzureCLI@2
            displayName: 'Validate results'
            inputs:
              azureSubscription: 'chaos-service-connection'
              scriptType: 'bash'
              scriptLocation: 'inlineScript'
              inlineScript: |
                # Check that the RTO is within target
                RTO=$(python -c "import json; print(json.load(open('rto_result.json'))['rto_seconds'])")
                # rto_seconds is null when no failure or recovery was observed
                if [ "$RTO" = "None" ]; then
                  echo "##vso[task.logissue type=error]RTO could not be measured"
                  exit 1
                fi
                if [ "$(echo "$RTO > 900" | bc)" -eq 1 ]; then
                  echo "##vso[task.logissue type=error]RTO exceeded 15 minutes: ${RTO}s"
                  exit 1
                fi
                echo "RTO within target: ${RTO}s"
```

## References

- [What is Azure Chaos Studio?](https://learn.microsoft.com/en-us/azure/chaos-studio/chaos-studio-overview): Chaos Studio overview
- [Understand chaos engineering and resilience](https://learn.microsoft.com/en-us/azure/chaos-studio/chaos-studio-chaos-engineering-overview): chaos engineering concepts
- [Architecture strategies for designing a reliability testing strategy](https://learn.microsoft.com/en-us/azure/well-architected/reliability/testing-strategy): WAF testing strategy
- [Continuous validation with Azure
Load Testing and Chaos Studio](https://learn.microsoft.com/en-us/azure/architecture/guide/testing/mission-critical-deployment-testing): combined testing
- [Shift right to test in production](https://learn.microsoft.com/en-us/devops/deliver/shift-right-test-production): fault injection in production
- [Chaos Agent overview](https://learn.microsoft.com/en-us/azure/chaos-studio/chaos-agent-overview): agent-based fault injection

## For Cosmo

- **Use this reference** when the customer wants to implement chaos engineering for AI systems, or needs to validate their DR procedures.
- Start with tabletop exercises before real fault injection: understand the expected behavior before you break things.
- Use Azure Chaos Studio in staging environments first, then gradually in production with a limited blast radius.
- Integrate chaos testing into CI/CD: automated failover tests should run after every infrastructure change.
- RTO measurement is the most important output: document actual versus planned RTO to identify gaps.
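
The last point, documenting actual versus planned RTO, could be recorded along the following lines. This is only a sketch: the `RTOComparison` class, the `summarize` helper, the scenario names, and all the numbers are hypothetical and chosen for illustration, not taken from a real test run.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class RTOComparison:
    scenario: str
    planned_seconds: float
    measured_seconds: Optional[float]  # None if recovery was never observed

    @property
    def gap_seconds(self) -> Optional[float]:
        """Positive when recovery took longer than planned."""
        if self.measured_seconds is None:
            return None
        return self.measured_seconds - self.planned_seconds

    @property
    def within_target(self) -> bool:
        return (self.measured_seconds is not None
                and self.measured_seconds <= self.planned_seconds)


def summarize(comparisons: List[RTOComparison]) -> List[str]:
    """Render one report line per scenario."""
    lines = []
    for c in comparisons:
        if c.measured_seconds is None:
            lines.append(
                f"{c.scenario}: no recovery observed (planned {c.planned_seconds:.0f}s)"
            )
            continue
        status = "OK" if c.within_target else "GAP"
        lines.append(
            f"{c.scenario}: measured {c.measured_seconds:.0f}s "
            f"vs planned {c.planned_seconds:.0f}s ({c.gap_seconds:+.0f}s) {status}"
        )
    return lines


# Hypothetical example values
items = [
    RTOComparison("openai-regional-outage", 900, 412.5),
    RTOComparison("search-unavailable", 300, 540.0),
    RTOComparison("dns-failure", 600, None),
]
for line in summarize(items):
    print(line)
```

Treating a missing measurement as a failed target keeps a test in which recovery was never observed from being read as a pass.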