Chaos Engineering for AI Systems

Last updated: 2026-02 | Status: GA | Category: Business Continuity & Disaster Recovery


Introduction

Chaos engineering is the practice of deliberately injecting faults into a system to test its resilience and uncover weaknesses before they cause production incidents. It is especially valuable for AI systems because AI workloads have complex dependency chains (model endpoints, search indexes, embedding pipelines, datastores) in which a failure in one component can cascade unpredictably.

Azure Chaos Studio is Azure's native platform for chaos engineering and offers both agent-based and service-direct fault injection. For AI systems, Chaos Studio can simulate everything from network partitions to CPU pressure and DNS failures, which lets teams validate that circuit breakers, retry logic, and graceful degradation work as expected.
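To make concrete what such an experiment is meant to validate, here is a minimal circuit breaker sketch. It is not taken from any Azure SDK: the class name `CircuitBreaker`, its thresholds, and the `fallback` callable are assumptions chosen for illustration. The breaker opens after a number of consecutive failures, serves the fallback while open, and lets a single trial call through once the cooldown has elapsed.

```python
import time


class CircuitBreaker:
    """Illustrative circuit breaker: opens after consecutive failures, retries after a cooldown."""

    def __init__(self, failure_threshold=3, reset_timeout_seconds=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout_seconds
        self.clock = clock      # injectable so tests can control time
        self.failures = 0       # consecutive failures seen so far
        self.opened_at = None   # None means the circuit is closed

    def call(self, func, fallback):
        """Run func, or return fallback() while the circuit is open."""
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                return fallback()  # open: fail fast with the degraded answer
            # Cooldown elapsed: half-open, let one trial call through
        try:
            result = func()
        except Exception:
            self.failures += 1
            # A failed trial call or too many consecutive failures (re)opens the circuit
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()
            return fallback()
        # Success closes the circuit and resets the failure counter
        self.failures = 0
        self.opened_at = None
        return result
```

During a chaos experiment, the observable signal is that calls to the failing dependency stop shortly after the fault starts and resume only after the cooldown, while users keep receiving the fallback response.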

For the Norwegian public sector, chaos engineering is an important part of meeting the requirement from the Norwegian National Security Authority (NSM) for regular testing of security measures (basic principle 4.3). Organizations are advised to run structured fault injection tests at least quarterly, and after every major change to the AI architecture.

Fault injection strategies for AI services

Fault catalog for AI workloads

| Fault type | Simulation | Affected component | Expected response |
|---|---|---|---|
| Regional outage | DNS failure or network block | Azure OpenAI | Failover to secondary region |
| API throttling | Artificial 429 response | Azure OpenAI | Retry with backoff, graceful degradation |
| Search unavailable | Network block to search | AI Search | Fallback to keyword search |
| High latency | Network delay | All API calls | Timeout → circuit breaker |
| Data corruption | Incorrect embedding values | Cosmos DB / Search | Validation and rebuild |
| Memory pressure | VM memory stress | App Service | Auto-restart, scaling |
| Dependency failure | DNS poisoning | Key Vault, App Config | Cached config, graceful degradation |
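The expected response to API throttling in the catalog above (retry with backoff, then graceful degradation) can be sketched as follows. This is a hedged illustration, not an Azure OpenAI client: `ThrottledError` is a hypothetical exception that a client wrapper would raise on HTTP 429, and `call_with_backoff`, its defaults, and the `fallback` callable are assumed names.

```python
import random
import time


class ThrottledError(Exception):
    """Hypothetical error raised by a client wrapper on HTTP 429."""

    def __init__(self, retry_after=None):
        super().__init__("throttled")
        self.retry_after = retry_after  # seconds from a Retry-After header, if any


def call_with_backoff(func, fallback, max_attempts=4, base_delay=1.0,
                      max_delay=30.0, sleep=time.sleep):
    """Retry func on throttling, then degrade gracefully via fallback()."""
    for attempt in range(max_attempts):
        try:
            return func()
        except ThrottledError as err:
            if attempt == max_attempts - 1:
                break  # out of attempts: fall through to the degraded response
            if err.retry_after is not None:
                delay = err.retry_after  # honor the server's hint
            else:
                # Exponential backoff with full jitter, capped at max_delay
                delay = random.uniform(0, min(max_delay, base_delay * 2 ** attempt))
            sleep(delay)
    return fallback()
```

In a throttling experiment, the things to check are that the number of attempts stays bounded and that the degraded response is returned instead of an unhandled error.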

Azure Chaos Studio experiments

# Enable Chaos Studio for resources
# Step 1: Register the target
az rest --method PUT \
  --url "https://management.azure.com/subscriptions/{sub}/resourceGroups/rg-ai-prod/providers/Microsoft.Web/sites/ai-app-prod/providers/Microsoft.Chaos/targets/Microsoft-AppService?api-version=2024-01-01" \
  --body '{"properties":{}}'

# Step 2: Enable the capability (App Service Stop)
az rest --method PUT \
  --url "https://management.azure.com/subscriptions/{sub}/resourceGroups/rg-ai-prod/providers/Microsoft.Web/sites/ai-app-prod/providers/Microsoft.Chaos/targets/Microsoft-AppService/capabilities/Stop-1.0?api-version=2024-01-01" \
  --body '{"properties":{}}'

Chaos experiment: Simulate an Azure OpenAI regional outage

{
  "identity": {
    "type": "SystemAssigned"
  },
  "location": "norwayeast",
  "properties": {
    "selectors": [
      {
        "id": "selector-nsg-block-openai",
        "type": "List",
        "targets": [
          {
            "id": "/subscriptions/{sub}/resourceGroups/rg-ai-prod/providers/Microsoft.Network/networkSecurityGroups/nsg-ai-app/providers/Microsoft.Chaos/targets/Microsoft-NetworkSecurityGroup",
            "type": "ChaosTarget"
          }
        ]
      }
    ],
    "steps": [
      {
        "name": "Block-OpenAI-Traffic",
        "branches": [
          {
            "name": "branch-1",
            "actions": [
              {
                "name": "urn:csci:microsoft:networkSecurityGroup:securityRule/1.1",
                "type": "continuous",
                "selectorId": "selector-nsg-block-openai",
                "duration": "PT10M",
                "parameters": [
                  { "key": "name", "value": "chaos-block-openai" },
                  { "key": "protocol", "value": "Any" },
                  { "key": "sourceAddresses", "value": "[\"*\"]" },
                  { "key": "destinationAddresses", "value": "[\"CognitiveServicesManagement\"]" },
                  { "key": "sourcePortRanges", "value": "[\"*\"]" },
                  { "key": "destinationPortRanges", "value": "[\"443\"]" },
                  { "key": "action", "value": "Deny" },
                  { "key": "priority", "value": "100" },
                  { "key": "direction", "value": "Outbound" }
                ]
              }
            ]
          }
        ]
      }
    ]
  }
}
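A typo in a `selectorId` is easy to miss in an experiment definition like the one above. The following sketch checks a definition before it is submitted. It assumes the definition has been loaded into a Python dict with the same shape as the JSON above. The helper name `find_selector_errors` is hypothetical and is not part of any Azure tooling.

```python
def find_selector_errors(experiment: dict) -> list:
    """Return one message per action whose selectorId is not a defined selector."""
    props = experiment.get("properties", {})
    defined = {selector["id"] for selector in props.get("selectors", [])}
    errors = []
    for step in props.get("steps", []):
        for branch in step.get("branches", []):
            for action in branch.get("actions", []):
                selector_id = action.get("selectorId")
                # Actions without a selectorId (for example delays) are skipped
                if selector_id is not None and selector_id not in defined:
                    errors.append(
                        f"{step.get('name')}/{branch.get('name')}: "
                        f"unknown selector '{selector_id}'"
                    )
    return errors


# Trimmed-down definition with the same structure as the experiment above
experiment = {
    "properties": {
        "selectors": [{"id": "selector-nsg-block-openai", "type": "List", "targets": []}],
        "steps": [{
            "name": "Block-OpenAI-Traffic",
            "branches": [{
                "name": "branch-1",
                "actions": [{"selectorId": "selector-nsg-block-openai"}],
            }],
        }],
    }
}
print(find_selector_errors(experiment))
```

An empty list means every action points at a selector that exists. The check does not verify that the targets themselves are registered in Chaos Studio.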

Network partition simulation

Simulate a cross-region network partition

# Chaos experiment: simulate a network partition between regions
# Blocks VNet peering traffic to test failover

# Method 1: NSG-based blocking
# NSG rule priorities must be in the range 100-4096, and the destination
# port range is set explicitly so the rule blocks all ports
az network nsg rule create \
  --resource-group "rg-networking" \
  --nsg-name "nsg-ai-app" \
  --name "chaos-block-cross-region" \
  --priority 100 \
  --direction Outbound \
  --access Deny \
  --protocol "*" \
  --destination-address-prefixes "10.2.0.0/16" \
  --destination-port-ranges "*" \
  --description "CHAOS TEST: Block cross-region traffic"

# Wait and observe (10 minutes)
sleep 600

# Remove the block
az network nsg rule delete \
  --resource-group "rg-networking" \
  --nsg-name "nsg-ai-app" \
  --name "chaos-block-cross-region"

DNS failure simulation

# Python: simulate a DNS failure for testing
# Uses an Azure Private DNS zone override to simulate the failure

import subprocess
import time

def simulate_dns_failure(target_fqdn: str, duration_minutes: int = 10):
    """Simulate DNS failure by overriding DNS resolution."""
    print(f"Simulating DNS failure for {target_fqdn} for {duration_minutes} min")

    # The record set must use the target's host label; a record under any
    # other name leaves lookups for target_fqdn unaffected. If the record set
    # already holds the real address, both are returned and failures are
    # intermittent rather than total.
    record_args = [
        "--resource-group", "rg-networking",
        "--zone-name", "privatelink.openai.azure.com",
        "--record-set-name", target_fqdn.split(".")[0],
        "--ipv4-address", "10.255.255.255",  # assumed unreachable in this network
    ]

    # Add an A record that points to the unreachable IP
    subprocess.run(
        ["az", "network", "private-dns", "record-set", "a", "add-record", *record_args],
        check=True,
    )
    try:
        print(f"DNS poisoned. Observing for {duration_minutes} minutes...")
        time.sleep(duration_minutes * 60)
    finally:
        # Clean up, even if the observation window is interrupted
        subprocess.run(
            ["az", "network", "private-dns", "record-set", "a", "remove-record", *record_args],
            check=True,
        )
        print("DNS restored.")

Load and stress testing

Load testing with Azure Load Testing

# JMeter test plan for AI API stress testing
# azure-load-test-config.yaml
version: v0.1
testId: ai-stress-test
testPlan: ai-load-test.jmx
engineInstances: 5
configurationFiles:
  - ai-load-test.jmx
failureCriteria:
  - avg(response_time_ms) > 5000
  - percentage(error) > 5
env:
  - name: AOAI_ENDPOINT
    value: https://aoai-prod.openai.azure.com
  - name: SEARCH_ENDPOINT
    value: https://search-prod.search.windows.net

# Create the load test
# (--test-id identifies the test; --name is an alias for --load-test-resource)
az load test create \
  --test-id "ai-stress-test" \
  --resource-group "rg-ai-test" \
  --load-test-resource "lt-ai-prod" \
  --test-plan "ai-load-test.jmx" \
  --engine-instances 5

# Run the test during a failover scenario
az load test-run create \
  --test-run-id "failover-stress-run" \
  --resource-group "rg-ai-test" \
  --load-test-resource "lt-ai-prod" \
  --test-id "ai-stress-test" \
  --description "Stress test during simulated failover"

Gradual load ramp-up

# Gradual load ramp-up to find the breaking point
import asyncio
import time

import aiohttp


async def send_request(session, endpoint, stats):
    """Send one request and record its outcome and latency in stats."""
    req_start = time.monotonic()
    try:
        async with session.post(endpoint, json={"query": "test"}) as resp:
            await resp.read()
            stats["success" if resp.status < 400 else "errors"] += 1
            stats["latency_ms"] += (time.monotonic() - req_start) * 1000
            stats["completed"] += 1
    except (aiohttp.ClientError, asyncio.TimeoutError):
        stats["errors"] += 1


async def ramp_up_test(
    endpoint: str,
    start_rps: int = 10,
    end_rps: int = 500,
    step_rps: int = 10,
    step_duration_seconds: int = 60
):
    """Gradually increase load to find service breaking point."""
    current_rps = start_rps
    results = []

    # One shared session reuses connections instead of reconnecting per request
    async with aiohttp.ClientSession() as session:
        while current_rps <= end_rps:
            print(f"Testing at {current_rps} RPS for {step_duration_seconds}s...")
            interval = 1.0 / current_rps
            stats = {"success": 0, "errors": 0, "latency_ms": 0.0, "completed": 0}
            tasks = []

            start = time.monotonic()
            while time.monotonic() - start < step_duration_seconds:
                # Fire requests without awaiting each one, so slow responses
                # do not lower the send rate below the target RPS
                tasks.append(asyncio.create_task(send_request(session, endpoint, stats)))
                await asyncio.sleep(interval)
            await asyncio.gather(*tasks)

            total = stats["success"] + stats["errors"]
            error_rate = stats["errors"] / max(total, 1) * 100
            # Average only over requests that returned a response
            avg_latency = stats["latency_ms"] / max(stats["completed"], 1)

            results.append({
                "rps": current_rps,
                "success": stats["success"],
                "errors": stats["errors"],
                "error_rate": round(error_rate, 2),
                "avg_latency_ms": round(avg_latency, 1)
            })

            print(f"  Results: {error_rate:.1f}% errors, {avg_latency:.0f}ms avg latency")

            # Stop if the error rate is too high
            if error_rate > 20:
                print(f"Breaking point found at {current_rps} RPS")
                break

            current_rps += step_rps

    return results

Recovery time measurement and validation

RTO measurement during chaos testing

# Automatic RTO measurement during a failover test
import time
import requests
from datetime import datetime, timezone

class RTOMeasurement:
    """Measure actual RTO during failover tests."""

    def __init__(self, health_endpoint: str, check_interval_seconds: float = 1.0):
        self.health_endpoint = health_endpoint
        self.check_interval = check_interval_seconds
        self.measurements = []

    def measure_rto(self, max_wait_seconds: int = 600) -> dict:
        """Continuously check health and measure recovery time."""
        failure_detected = None
        recovery_detected = None
        was_healthy = True
        checks = []

        start = time.time()
        while time.time() - start < max_wait_seconds:
            try:
                resp = requests.get(self.health_endpoint, timeout=5)
                is_healthy = resp.status_code == 200
            except requests.RequestException:
                # Connection errors and timeouts count as unhealthy
                is_healthy = False

            check = {
                # Timezone-aware UTC timestamp (datetime.utcnow() is deprecated)
                "timestamp": datetime.now(timezone.utc).isoformat(),
                "elapsed_seconds": round(time.time() - start, 1),
                "healthy": is_healthy
            }
            checks.append(check)

            if was_healthy and not is_healthy and failure_detected is None:
                failure_detected = time.time()
                print(f"Failure detected at {check['elapsed_seconds']}s")

            if not was_healthy and is_healthy and failure_detected and recovery_detected is None:
                recovery_detected = time.time()
                rto = recovery_detected - failure_detected
                print(f"Recovery detected at {check['elapsed_seconds']}s — RTO: {rto:.1f}s")

            was_healthy = is_healthy
            time.sleep(self.check_interval)

        result = {
            "failure_detected": failure_detected is not None,
            "recovery_detected": recovery_detected is not None,
            "rto_seconds": round(recovery_detected - failure_detected, 1) if recovery_detected and failure_detected else None,
            "total_checks": len(checks),
            "healthy_checks": sum(1 for c in checks if c["healthy"]),
            "unhealthy_checks": sum(1 for c in checks if not c["healthy"]),
            "availability_pct": round(
                sum(1 for c in checks if c["healthy"]) / max(len(checks), 1) * 100, 2
            )
        }

        self.measurements.append(result)
        return result

# Usage
rto_meter = RTOMeasurement("https://ai-app-prod.azurewebsites.net/health")
result = rto_meter.measure_rto(max_wait_seconds=600)
print(f"Measured RTO: {result['rto_seconds']}s")

Tools and platforms for chaos engineering

Azure Chaos Studio

| Feature | Description | Supported resources |
|---|---|---|
| Service-direct faults | Faults injected via the Azure API | App Service, AKS, Cosmos DB, NSG |
| Agent-based faults | Faults injected via a VM agent | CPU/memory stress, network faults |
| Experiments | Structured fault sequences | All supported resources |
| Permissions | RBAC-based access control | Dedicated Chaos role |

Complementary tools

| Tool | Use case | Integration with Azure |
|---|---|---|
| Azure Chaos Studio | Native Azure fault injection | Built in |
| Azure Load Testing | Load testing | Built in, JMeter-based |
| Litmus Chaos | Kubernetes chaos testing | AKS-compatible |
| Toxiproxy | Network faults for development | Manual setup |
| PyRIT | AI-specific red teaming | Azure AI |

Chaos testing CI/CD integration

# Azure DevOps pipeline: chaos testing as part of the release
trigger: none

stages:
  - stage: DeployToStaging
    displayName: 'Deploy to Staging'
    jobs:
      - job: Deploy
        steps:
          - task: AzureWebApp@1
            inputs:
              appName: 'ai-app-staging'

  - stage: ChaosTests
    displayName: 'Run Chaos Experiments'
    dependsOn: DeployToStaging
    jobs:
      - job: RunChaosExperiment
        steps:
          - task: AzureCLI@2
            displayName: 'Start chaos experiment'
            inputs:
              azureSubscription: 'chaos-service-connection'
              scriptType: 'bash'
              scriptLocation: 'inlineScript'
              inlineScript: |
                # Start the chaos experiment; the start call returns a status URL
                STATUS_URL=$(az rest --method POST \
                  --url "https://management.azure.com/subscriptions/{sub}/resourceGroups/rg-ai-test/providers/Microsoft.Chaos/experiments/openai-failover-test/start?api-version=2024-01-01" \
                  --query "statusUrl" -o tsv)

                echo "Experiment started: $STATUS_URL"

                # Wait and measure RTO
                python measure_rto.py \
                  --endpoint "https://ai-app-staging.azurewebsites.net/health" \
                  --max-wait 300

          - task: AzureCLI@2
            displayName: 'Validate results'
            inputs:
              azureSubscription: 'chaos-service-connection'
              scriptType: 'bash'
              scriptLocation: 'inlineScript'
              inlineScript: |
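                # Check that the RTO is within target
                RTO=$(python -c "import json; print(json.load(open('rto_result.json'))['rto_seconds'])")
                # rto_seconds is null (printed as None) when no recovery was observed
                if [ "$RTO" = "None" ]; then
                  echo "##vso[task.logissue type=error]No recovery detected; RTO could not be measured"
                  exit 1
                fi
                if [ "$(echo "$RTO > 900" | bc)" -eq 1 ]; then
                  echo "##vso[task.logissue type=error]RTO exceeded 15 minutes: ${RTO}s"
                  exit 1
                fi
                echo "RTO within target: ${RTO}s"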

References

For Cosmo

  • Use this reference when the customer wants to implement chaos engineering for AI systems, or when they need to validate their DR procedures.
  • Start with tabletop exercises before real fault injection: understand the expected behavior before you break things.
  • Use Azure Chaos Studio in staging environments first, then gradually in production with a limited blast radius.
  • Integrate chaos testing into CI/CD: automated failover tests should run after every infrastructure change.
  • RTO measurement is the most important output: document actual versus planned RTO to identify gaps.
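The comparison of actual and planned RTO from the last point can be kept as a small report per scenario. The sketch below is illustrative only: the function name `rto_gap_report`, the scenario names, and the numbers are hypothetical and would be replaced with the results recorded from real experiments.

```python
def rto_gap_report(measured: dict, planned: dict) -> list:
    """Compare measured RTO with planned RTO (both in seconds) per scenario."""
    report = []
    for scenario, target in planned.items():
        actual = measured.get(scenario)
        if actual is None:
            # No recovery observed, or the scenario has not been tested yet
            gap, status = None, "not measured"
        else:
            gap = round(actual - target, 1)
            status = "within target" if gap <= 0 else "gap"
        report.append({
            "scenario": scenario,
            "planned_s": target,
            "measured_s": actual,
            "gap_s": gap,
            "status": status,
        })
    return report


# Hypothetical numbers for illustration only
planned = {"openai-regional-outage": 900, "search-unavailable": 300, "dns-failure": 600}
measured = {"openai-regional-outage": 742.5, "search-unavailable": 415.0}

for row in rto_gap_report(measured, planned):
    print(row)
```

Scenarios marked "gap" or "not measured" are the ones to raise with the customer first.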