Chaos Engineering for AI Systems
Last updated: 2026-02 | Status: GA | Category: Business Continuity & Disaster Recovery
Introduction
Chaos engineering is the practice of deliberately injecting faults into a system to test its resilience and uncover weaknesses before they cause production incidents. It is especially valuable for AI systems because AI workloads have complex dependency chains (model endpoints, search indexes, embedding pipelines, datastores) in which a failure in one component can cascade unpredictably.
Azure Chaos Studio is Azure's native platform for chaos engineering and offers both agent-based and service-direct fault injection. For AI systems, Chaos Studio can simulate everything from network partitioning to CPU pressure and DNS failures, which lets teams validate that circuit breakers, retry logic, and graceful degradation work as expected.
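The circuit-breaker behavior that such experiments are meant to validate can be sketched as follows. This is a minimal illustration and not part of Azure Chaos Studio; the class name `CircuitBreaker`, its thresholds, and the injectable clock are assumptions made for the example.

```python
import time
from typing import Callable, Optional


class CircuitBreaker:
    """Minimal circuit breaker: opens after N consecutive failures."""

    def __init__(self, failure_threshold: int = 3, reset_timeout: float = 30.0,
                 clock: Callable[[], float] = time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock
        self.failures = 0
        self.opened_at: Optional[float] = None

    @property
    def state(self) -> str:
        if self.opened_at is None:
            return "closed"
        if self.clock() - self.opened_at >= self.reset_timeout:
            return "half-open"  # one trial call is allowed through
        return "open"

    def call(self, func: Callable, fallback: Callable):
        """Run func unless the breaker is open; use fallback on failure."""
        if self.state == "open":
            return fallback()
        try:
            result = func()
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold or self.opened_at is not None:
                # Open the breaker, or re-open it after a failed trial call
                self.opened_at = self.clock()
            return fallback()
        # A success closes the breaker and resets the failure count
        self.failures = 0
        self.opened_at = None
        return result
```

During a chaos experiment, the expectation would be that the breaker reports `open` shortly after the fault starts and returns to `closed` once the dependency recovers.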
For the Norwegian public sector, chaos engineering is an important part of meeting NSM's requirement for regular testing of security measures (basic principle 4.3). Organizations are advised to run structured fault-injection tests at least quarterly and after every major change to the AI architecture.
Fault injection strategies for AI services
Fault catalog for AI workloads
| Fault type | Simulation | Affected component | Expected response |
|---|---|---|---|
| Regional outage | DNS failure or network block | Azure OpenAI | Failover to secondary region |
| API throttling | Artificial 429 response | Azure OpenAI | Retry with backoff, graceful degradation |
| Search unavailable | Network block to search | AI Search | Fallback to keyword search |
| High latency | Network delay | All API calls | Timeout → circuit breaker |
| Data corruption | Incorrect embedding values | Cosmos DB / Search | Validation and rebuild |
| Memory pressure | VM memory stress | App Service | Auto-restart, scaling |
| Dependency failure | DNS poisoning | Key Vault, App Config | Cached config, graceful degradation |
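The "retry with backoff, graceful degradation" response listed for API throttling in the table above could look roughly like this. It is a sketch under stated assumptions: `ThrottledError` is a hypothetical exception standing in for an HTTP 429 response, and the retry count and delays are illustrative rather than recommended values.

```python
import random
import time
from typing import Callable


class ThrottledError(Exception):
    """Hypothetical error representing an HTTP 429 response."""


def call_with_backoff(func: Callable, fallback: Callable, max_attempts: int = 4,
                      base_delay: float = 0.5,
                      sleep: Callable[[float], None] = time.sleep):
    """Retry func on throttling with exponential backoff, then degrade."""
    for attempt in range(max_attempts):
        try:
            return func()
        except ThrottledError:
            if attempt == max_attempts - 1:
                break  # out of attempts: fall through to the fallback
            # Exponential backoff with jitter: base * 2^attempt plus up to 10 %
            delay = base_delay * (2 ** attempt)
            sleep(delay + random.uniform(0, delay * 0.1))
    # Graceful degradation: serve a reduced answer instead of failing
    return fallback()
```

A chaos experiment that injects 429 responses can then assert that callers receive either the real answer after a few retries or the degraded fallback, and never an unhandled error.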
Azure Chaos Studio experiments
# Enable Chaos Studio for resources
# Step 1: Register the target
az rest --method PUT \
  --url "https://management.azure.com/subscriptions/{sub}/resourceGroups/rg-ai-prod/providers/Microsoft.Web/sites/ai-app-prod/providers/Microsoft.Chaos/targets/Microsoft-AppService?api-version=2024-01-01" \
  --body '{"properties":{}}'
# Step 2: Enable the capability (App Service Stop)
az rest --method PUT \
  --url "https://management.azure.com/subscriptions/{sub}/resourceGroups/rg-ai-prod/providers/Microsoft.Web/sites/ai-app-prod/providers/Microsoft.Chaos/targets/Microsoft-AppService/capabilities/Stop-1.0?api-version=2024-01-01" \
  --body '{"properties":{}}'
Chaos Experiment: Simulate Azure OpenAI Regional Outage
{
  "identity": {
    "type": "SystemAssigned"
  },
  "location": "norwayeast",
  "properties": {
    "selectors": [
      {
        "id": "selector-nsg-block-openai",
        "type": "List",
        "targets": [
          {
            "id": "/subscriptions/{sub}/resourceGroups/rg-ai-prod/providers/Microsoft.Network/networkSecurityGroups/nsg-ai-app/providers/Microsoft.Chaos/targets/Microsoft-NetworkSecurityGroup",
            "type": "ChaosTarget"
          }
        ]
      }
    ],
    "steps": [
      {
        "name": "Block-OpenAI-Traffic",
        "branches": [
          {
            "name": "branch-1",
            "actions": [
              {
                "name": "urn:csci:microsoft:networkSecurityGroup:securityRule/1.1",
                "type": "continuous",
                "selectorId": "selector-nsg-block-openai",
                "duration": "PT10M",
                "parameters": [
                  { "key": "name", "value": "chaos-block-openai" },
                  { "key": "protocol", "value": "*" },
                  { "key": "sourceAddresses", "value": "[\"*\"]" },
                  { "key": "destinationAddresses", "value": "[\"CognitiveServicesManagement\"]" },
                  { "key": "destinationPortRanges", "value": "[\"443\"]" },
                  { "key": "access", "value": "Deny" },
                  { "key": "priority", "value": "100" },
                  { "key": "direction", "value": "Outbound" }
                ]
              }
            ]
          }
        ]
      }
    ]
  }
}
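Before submitting an experiment definition like the one above, a quick structural check can catch mistakes such as an action that references a selector that was never declared. The helper below is a hypothetical sketch that only inspects the JSON structure shown here; it does not call Azure and is not a substitute for the service's own validation.

```python
def validate_experiment(experiment: dict) -> list[str]:
    """Return a list of structural problems found in an experiment definition."""
    problems = []
    props = experiment.get("properties", {})
    selector_ids = {s.get("id") for s in props.get("selectors", [])}
    for step in props.get("steps", []):
        for branch in step.get("branches", []):
            for action in branch.get("actions", []):
                name = action.get("name", "<unnamed>")
                selector = action.get("selectorId")
                # Every action must point at a selector declared in the experiment
                if selector not in selector_ids:
                    problems.append(f"{name}: unknown selectorId {selector!r}")
                # Continuous faults run for a fixed time and need a duration
                if action.get("type") == "continuous" and not action.get("duration"):
                    problems.append(f"{name}: continuous action has no duration")
    return problems
```

An empty list means no structural problems were found; it could be run against the parsed JSON file as a pre-deployment step.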
Network partition simulation
Simulate a cross-region network partition
# Chaos experiment: simulate a network partition between regions
# Blocks VNet peering traffic to test failover
# Method 1: NSG-based blocking
az network nsg rule create \
  --resource-group "rg-networking" \
  --nsg-name "nsg-ai-app" \
  --name "chaos-block-cross-region" \
  --priority 50 \
  --direction Outbound \
  --access Deny \
  --protocol "*" \
  --destination-address-prefixes "10.2.0.0/16" \
  --description "CHAOS TEST: Block cross-region traffic"
# Wait and observe (10 minutes)
sleep 600
# Remove the block
az network nsg rule delete \
  --resource-group "rg-networking" \
  --nsg-name "nsg-ai-app" \
  --name "chaos-block-cross-region"
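If the shell session above is interrupted during the wait, the deny rule stays in place. One way to guard against that is to wrap the create and delete commands in a context manager so cleanup always runs. The sketch below assumes the same `az` commands as above; the function name `temporary_nsg_rule` and the injectable `run` parameter are illustrative.

```python
import subprocess
from contextlib import contextmanager
from typing import Callable, Sequence


@contextmanager
def temporary_nsg_rule(create_cmd: Sequence[str], delete_cmd: Sequence[str],
                       run: Callable = subprocess.run):
    """Create a chaos NSG rule and always delete it afterwards."""
    run(list(create_cmd), check=True)
    try:
        yield
    finally:
        # Runs on normal exit, on exceptions, and on Ctrl+C alike
        run(list(delete_cmd), check=True)


# Example usage (commented out so that importing this file has no side effects):
# rule = ["--resource-group", "rg-networking", "--nsg-name", "nsg-ai-app",
#         "--name", "chaos-block-cross-region"]
# create = ["az", "network", "nsg", "rule", "create", *rule,
#           "--priority", "50", "--direction", "Outbound", "--access", "Deny",
#           "--protocol", "*", "--destination-address-prefixes", "10.2.0.0/16"]
# delete = ["az", "network", "nsg", "rule", "delete", *rule]
# with temporary_nsg_rule(create, delete):
#     time.sleep(600)  # observation window
```

The `run` parameter exists only so the cleanup logic can be exercised without touching a real subscription.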
DNS failure simulation
# Python: simulate a DNS failure for testing
# Uses an Azure Private DNS zone override to simulate the failure
import subprocess
import time


def simulate_dns_failure(target_fqdn: str, duration_minutes: int = 10):
    """Simulate DNS failure by overriding DNS resolution."""
    print(f"Simulating DNS failure for {target_fqdn} for {duration_minutes} min")
    record_args = [
        "--resource-group", "rg-networking",
        "--zone-name", "privatelink.openai.azure.com",
        "--record-set-name", "chaos-test",
        "--ipv4-address", "10.255.255.255",  # Address assumed to be unreachable
    ]
    # Create a DNS record that points to an address nothing listens on.
    # Note: a record set named "chaos-test" only affects
    # chaos-test.privatelink.openai.azure.com; the name must match the host
    # label of the endpoint under test for its resolution to change.
    subprocess.run(
        ["az", "network", "private-dns", "record-set", "a", "add-record", *record_args],
        check=True,
    )
    try:
        print(f"DNS poisoned. Observing for {duration_minutes} minutes...")
        time.sleep(duration_minutes * 60)
    finally:
        # Clean up even if the observation is interrupted
        subprocess.run(
            ["az", "network", "private-dns", "record-set", "a", "remove-record", *record_args],
            check=True,
        )
        print("DNS restored.")
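While the DNS fault is active, it helps to confirm from the application's network that resolution really is affected; otherwise the experiment may pass simply because nothing was broken. The probe below is a small sketch using only the standard library; the function names and the injectable `resolver` are assumptions for the example.

```python
import socket
from typing import Callable


def resolved_addresses(hostname: str,
                       resolver: Callable = socket.getaddrinfo) -> set[str]:
    """Return the set of IP addresses the hostname currently resolves to."""
    try:
        infos = resolver(hostname, 443)
    except socket.gaierror:
        return set()  # resolution failed entirely
    # Each entry is (family, type, proto, canonname, sockaddr); sockaddr[0] is the IP
    return {info[4][0] for info in infos}


def dns_fault_is_active(hostname: str, poisoned_ip: str,
                        resolver: Callable = socket.getaddrinfo) -> bool:
    """True if the name fails to resolve or resolves to the poisoned address."""
    addresses = resolved_addresses(hostname, resolver)
    return not addresses or poisoned_ip in addresses
```

Note that client-side DNS caching can delay the effect, so the probe may need to be repeated for a short period after the record is changed.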
Load and stress testing
Load testing with Azure Load Testing
# JMeter test plan for AI API stress testing
# azure-load-test-config.yaml
version: v0.1
testId: ai-stress-test
testPlan: ai-load-test.jmx
engineInstances: 5
configurationFiles:
  - ai-load-test.jmx
failureCriteria:
  - avg(response_time_ms) > 5000
  - percentage(error) > 5
env:
  - name: AOAI_ENDPOINT
    value: https://aoai-prod.openai.azure.com
  - name: SEARCH_ENDPOINT
    value: https://search-prod.search.windows.net
# Create and run the load test
# (--test-id and --test-run-id identify the test and the run;
# --load-test-resource names the Azure Load Testing resource)
az load test create \
  --test-id "ai-stress-test" \
  --resource-group "rg-ai-test" \
  --load-test-resource "lt-ai-prod" \
  --test-plan "ai-load-test.jmx" \
  --engine-instances 5
# Run the test with a failover scenario
az load test-run create \
  --test-run-id "failover-stress-run" \
  --resource-group "rg-ai-test" \
  --load-test-resource "lt-ai-prod" \
  --test-id "ai-stress-test" \
  --description "Stress test during simulated failover"
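The two failure criteria in the configuration above (average response time over 5000 ms, error percentage over 5) can also be checked locally against raw samples, for example when replaying results outside Azure Load Testing. This is an illustrative sketch; the function name and the `(response_time_ms, is_error)` sample format are assumptions.

```python
from statistics import mean


def evaluate_failure_criteria(samples: list[tuple[float, bool]],
                              max_avg_ms: float = 5000.0,
                              max_error_pct: float = 5.0) -> list[str]:
    """Return the criteria that were violated; an empty list means pass.

    samples: (response_time_ms, is_error) pairs, one per request.
    """
    if not samples:
        return ["no samples collected"]
    failures = []
    avg_ms = mean(ms for ms, _ in samples)
    error_pct = 100.0 * sum(1 for _, is_error in samples if is_error) / len(samples)
    if avg_ms > max_avg_ms:
        failures.append(f"avg(response_time_ms) = {avg_ms:.0f} > {max_avg_ms:.0f}")
    if error_pct > max_error_pct:
        failures.append(f"percentage(error) = {error_pct:.1f} > {max_error_pct:.1f}")
    return failures
```

Both thresholds are strict, matching the `>` comparisons in the YAML: a run with exactly 5 % errors still passes.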
Gradual load ramp-up
# Gradual load ramp-up to find the breaking point
import asyncio
import time

import aiohttp


async def ramp_up_test(
    endpoint: str,
    start_rps: int = 10,
    end_rps: int = 500,
    step_rps: int = 10,
    step_duration_seconds: int = 60,
):
    """Gradually increase load to find service breaking point."""
    current_rps = start_rps
    results = []

    async def send_one(session: aiohttp.ClientSession) -> tuple[bool, float]:
        """Send one request; return (success, latency in ms)."""
        req_start = time.monotonic()
        try:
            async with session.post(endpoint, json={"query": "test"}) as resp:
                await resp.read()
                ok = resp.status < 400
        except Exception:
            ok = False
        return ok, (time.monotonic() - req_start) * 1000

    # Reuse one session for the whole test instead of one per request
    async with aiohttp.ClientSession() as session:
        while current_rps <= end_rps:
            print(f"Testing at {current_rps} RPS for {step_duration_seconds}s...")
            interval = 1.0 / current_rps
            tasks = []
            start = time.monotonic()
            # Launch requests without awaiting each response, so that slow
            # responses do not lower the offered request rate
            while time.monotonic() - start < step_duration_seconds:
                tasks.append(asyncio.create_task(send_one(session)))
                await asyncio.sleep(interval)
            outcomes = await asyncio.gather(*tasks)

            total = len(outcomes)
            success_count = sum(1 for ok, _ in outcomes if ok)
            error_count = total - success_count
            error_rate = error_count / max(total, 1) * 100
            avg_latency = sum(ms for _, ms in outcomes) / max(total, 1)
            results.append({
                "rps": current_rps,
                "success": success_count,
                "errors": error_count,
                "error_rate": round(error_rate, 2),
                "avg_latency_ms": round(avg_latency, 1),
            })
            print(f"  Results: {error_rate:.1f}% errors, {avg_latency:.0f}ms avg latency")
            # Stop if the error rate is too high
            if error_rate > 20:
                print(f"Breaking point found at {current_rps} RPS")
                break
            current_rps += step_rps
    return results
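The list returned by `ramp_up_test` can be reduced to a single capacity figure: the highest load step that still met the service's own limits. The helper below is a sketch; its name and the default thresholds (1 % errors, 2000 ms average latency) are assumptions and should be replaced with the service's actual objectives.

```python
from typing import Optional


def max_sustainable_rps(results: list[dict], max_error_rate: float = 1.0,
                        max_latency_ms: float = 2000.0) -> Optional[int]:
    """Return the highest RPS step before the first step that breaks a limit.

    Expects the result dictionaries produced by ramp_up_test, with the keys
    "rps", "error_rate" (percent) and "avg_latency_ms".
    """
    sustainable = None
    for step in sorted(results, key=lambda r: r["rps"]):
        if step["error_rate"] > max_error_rate or step["avg_latency_ms"] > max_latency_ms:
            break  # everything above this step is considered unsustainable
        sustainable = step["rps"]
    return sustainable
```

Stopping at the first failing step, rather than taking the maximum over all passing steps, avoids reporting a capacity that was only reached after the service had already started to degrade.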
Recovery time measurement and validation
RTO measurement during chaos testing
# Automatic RTO measurement during a failover test
import time
from datetime import datetime, timezone

import requests


class RTOMeasurement:
    """Measure actual RTO during failover tests."""

    def __init__(self, health_endpoint: str, check_interval_seconds: float = 1.0):
        self.health_endpoint = health_endpoint
        self.check_interval = check_interval_seconds
        self.measurements = []

    def measure_rto(self, max_wait_seconds: int = 600) -> dict:
        """Continuously check health and measure recovery time."""
        failure_detected = None
        recovery_detected = None
        was_healthy = True
        checks = []
        start = time.time()
        while time.time() - start < max_wait_seconds:
            try:
                resp = requests.get(self.health_endpoint, timeout=5)
                is_healthy = resp.status_code == 200
            except requests.RequestException:
                is_healthy = False
            check = {
                "timestamp": datetime.now(timezone.utc).isoformat(),
                "elapsed_seconds": round(time.time() - start, 1),
                "healthy": is_healthy,
            }
            checks.append(check)
            # First transition from healthy to unhealthy marks the failure
            if was_healthy and not is_healthy and failure_detected is None:
                failure_detected = time.time()
                print(f"Failure detected at {check['elapsed_seconds']}s")
            # First transition back to healthy after a failure marks recovery
            if (not was_healthy and is_healthy
                    and failure_detected is not None and recovery_detected is None):
                recovery_detected = time.time()
                rto = recovery_detected - failure_detected
                print(f"Recovery detected at {check['elapsed_seconds']}s, RTO: {rto:.1f}s")
            was_healthy = is_healthy
            time.sleep(self.check_interval)

        healthy_checks = sum(1 for c in checks if c["healthy"])
        recovered = failure_detected is not None and recovery_detected is not None
        result = {
            "failure_detected": failure_detected is not None,
            "recovery_detected": recovery_detected is not None,
            "rto_seconds": round(recovery_detected - failure_detected, 1) if recovered else None,
            "total_checks": len(checks),
            "healthy_checks": healthy_checks,
            "unhealthy_checks": len(checks) - healthy_checks,
            "availability_pct": round(healthy_checks / max(len(checks), 1) * 100, 2),
        }
        self.measurements.append(result)
        return result


# Usage
if __name__ == "__main__":
    rto_meter = RTOMeasurement("https://ai-app-prod.azurewebsites.net/health")
    result = rto_meter.measure_rto(max_wait_seconds=600)
    print(f"Measured RTO: {result['rto_seconds']}s")
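The pipeline later in this document invokes `python measure_rto.py --endpoint ... --max-wait 300` and then reads `rto_result.json`, so the measurement needs a command-line entry point that writes that file. A possible sketch of that glue is shown below; the `--output` flag and the `run` function are assumptions, and the measurement itself is passed in as a callable so the glue can be tested without a live endpoint.

```python
import argparse
import json
from typing import Callable, Optional, Sequence


def parse_args(argv: Optional[Sequence[str]] = None) -> argparse.Namespace:
    """Parse the flags used by the pipeline step."""
    parser = argparse.ArgumentParser(description="Measure RTO during a chaos experiment")
    parser.add_argument("--endpoint", required=True, help="Health endpoint URL")
    parser.add_argument("--max-wait", type=int, default=600, help="Seconds to observe")
    parser.add_argument("--output", default="rto_result.json", help="Result file path")
    return parser.parse_args(argv)


def run(argv: Optional[Sequence[str]], measure: Callable[[str, int], dict]) -> dict:
    """Run the measurement and persist the result as JSON."""
    args = parse_args(argv)
    result = measure(args.endpoint, args.max_wait)
    with open(args.output, "w", encoding="utf-8") as fh:
        json.dump(result, fh, indent=2)
    return result


# In measure_rto.py this could be wired to the class above, for example:
# run(None, lambda url, wait: RTOMeasurement(url).measure_rto(max_wait_seconds=wait))
```

Passing `None` as `argv` makes `argparse` read the real command line, which is what the pipeline step would rely on.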
Tools and platforms for chaos engineering
Azure Chaos Studio
| Feature | Description | Supported resources |
|---|---|---|
| Service-direct faults | Faults injected through the Azure API | App Service, AKS, Cosmos DB, NSG |
| Agent-based faults | Faults injected through a VM agent | CPU/memory stress, network faults |
| Experiments | Structured fault sequences | All supported resources |
| Permissions | RBAC-based access control | Dedicated Chaos role |
Complementary tools
| Tool | Use case | Azure integration |
|---|---|---|
| Azure Chaos Studio | Native Azure fault injection | Built in |
| Azure Load Testing | Load testing | Built in, JMeter-based |
| Litmus Chaos | Kubernetes chaos testing | AKS-compatible |
| Toxiproxy | Network faults during development | Manual setup |
| PyRIT | AI-specific red teaming | Azure AI |
Chaos testing CI/CD integration
# Azure DevOps Pipeline: chaos testing as part of the release
trigger: none

stages:
  - stage: DeployToStaging
    displayName: 'Deploy to Staging'
    jobs:
      - job: Deploy
        steps:
          - task: AzureWebApp@1
            inputs:
              appName: 'ai-app-staging'
  - stage: ChaosTests
    displayName: 'Run Chaos Experiments'
    dependsOn: DeployToStaging
    jobs:
      - job: RunChaosExperiment
        steps:
          - task: AzureCLI@2
            displayName: 'Start chaos experiment'
            inputs:
              azureSubscription: 'chaos-service-connection'
              scriptType: 'bash'
              scriptLocation: 'inlineScript'
              inlineScript: |
                # Start the chaos experiment; the response contains a status URL
                STATUS_URL=$(az rest --method POST \
                  --url "https://management.azure.com/subscriptions/{sub}/resourceGroups/rg-ai-test/providers/Microsoft.Chaos/experiments/openai-failover-test/start?api-version=2024-01-01" \
                  --query "statusUrl" -o tsv)
                echo "Experiment started: $STATUS_URL"
                # Wait and measure RTO (measure_rto.py is expected to write rto_result.json)
                python measure_rto.py \
                  --endpoint "https://ai-app-staging.azurewebsites.net/health" \
                  --max-wait 300
          - task: AzureCLI@2
            displayName: 'Validate results'
            inputs:
              azureSubscription: 'chaos-service-connection'
              scriptType: 'bash'
              scriptLocation: 'inlineScript'
              inlineScript: |
                # Check that RTO is within target (900 s = 15 minutes)
                RTO=$(python -c "import json; print(json.load(open('rto_result.json'))['rto_seconds'])")
                # rto_seconds is null when no failure or no recovery was observed
                if [ "$RTO" = "None" ]; then
                  echo "##vso[task.logissue type=error]No RTO was measured"
                  exit 1
                fi
                if [ "$(echo "$RTO > 900" | bc)" -eq 1 ]; then
                  echo "##vso[task.logissue type=error]RTO exceeded 15 minutes: ${RTO}s"
                  exit 1
                fi
                echo "RTO within target: ${RTO}s"
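The validation step above depends on `bc` and on parsing Python's printed output in bash. As an alternative, the same gate could be written as a small Python check that also handles a missing measurement explicitly. This is a sketch: the function name `check_rto` is hypothetical, and it assumes the `rto_result.json` file contains an `rto_seconds` field as in the measurement code earlier.

```python
import json


def check_rto(path: str = "rto_result.json",
              target_seconds: float = 900.0) -> tuple[int, str]:
    """Return (exit_code, message) for the RTO gate."""
    with open(path, encoding="utf-8") as fh:
        rto = json.load(fh).get("rto_seconds")
    if rto is None:
        # Either the fault never caused an outage or the service never recovered
        return 1, "No RTO measured (no failure or no recovery observed)"
    if rto > target_seconds:
        return 1, f"RTO exceeded target: {rto}s > {target_seconds}s"
    return 0, f"RTO within target: {rto}s"


# In a pipeline script the result could be used like this:
# code, message = check_rto()
# print(message)
# raise SystemExit(code)
```

Treating a missing measurement as a failure is a design choice: an experiment that never caused an outage has not actually validated the failover path.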
References
- What is Azure Chaos Studio? (Chaos Studio overview)
- Understand chaos engineering and resilience (chaos engineering concepts)
- Architecture strategies for designing a reliability testing strategy (WAF testing strategy)
- Continuous validation with Azure Load Testing and Chaos Studio (combined testing)
- Shift right to test in production (fault injection in production)
- Chaos Agent overview (agent-based fault injection)
For Cosmo
- Use this reference when the customer wants to implement chaos engineering for AI systems, or when they need to validate their DR procedures.
- Start with tabletop exercises before real fault injection: understand the expected behavior before you break things.
- Use Azure Chaos Studio in staging environments first, then gradually in production with a limited blast radius.
- Integrate chaos testing into CI/CD: automated failover tests should run after every infrastructure change.
- RTO measurement is the most important output: document actual versus planned RTO to identify gaps.
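Documenting actual versus planned RTO, as the last point suggests, can be as simple as a per-scenario comparison. The sketch below is illustrative only; the function name, the scenario names in the usage comment, and the report fields are assumptions.

```python
from typing import Optional


def rto_gap_report(planned: dict[str, float],
                   measured: dict[str, Optional[float]]) -> list[dict]:
    """Compare planned and measured RTO (in seconds) per scenario."""
    report = []
    for scenario, target in planned.items():
        actual = measured.get(scenario)  # None means not measured or not recovered
        report.append({
            "scenario": scenario,
            "planned_s": target,
            "actual_s": actual,
            # Positive gap: recovery took longer than planned
            "gap_s": None if actual is None else round(actual - target, 1),
            "within_target": actual is not None and actual <= target,
        })
    return report


# Example with hypothetical scenario names:
# rto_gap_report({"openai-regional-outage": 900, "search-unavailable": 300},
#                {"openai-regional-outage": 412.3})
```

Scenarios that have a planned RTO but no measurement appear in the report with `within_target` set to `False`, which makes untested scenarios visible rather than silently omitted.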