Kjell Tore Guttormsen 6a7632146e feat(ms-ai-architect): add plugin to open marketplace (v1.5.0 baseline)

Initial addition of ms-ai-architect plugin to the open-source marketplace.
Private content excluded: orchestrator/ (Linear tooling), docs/utredning/
(client investigation), generated test reports and PDF export script.
skill-gen tooling moved from orchestrator/ to scripts/skill-gen/.

Security scan: WARNING (risk 20/100) — no secrets, no injection found.
False positive fixed: added gitleaks:allow to Python variable reference
in output-validation-grounding-verification.md line 109.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

2026-04-07 17:17:17 +02:00

Chaos Engineering for AI Systems

Last updated: 2026-02 | Status: GA | Category: Business Continuity & Disaster Recovery

Introduction

Chaos engineering is the practice of deliberately injecting faults into a system to test its resilience and uncover weaknesses before they cause production incidents. It is especially valuable for AI systems because AI workloads have complex dependency chains (model endpoints, search indexes, embedding pipelines, data stores) in which a failure in one component can cascade unpredictably.

Azure Chaos Studio is Azure's native chaos engineering platform and offers both agent-based and service-direct fault injection. For AI systems, Chaos Studio can simulate anything from network partitions to CPU pressure and DNS failures, which lets teams verify that circuit breakers, retry logic, and graceful degradation work as expected.

For the Norwegian public sector, chaos engineering is an important part of meeting NSM's requirement for regular testing of security measures (basic principle 4.3). Organizations are advised to run structured fault injection tests at least quarterly and after every major change to the AI architecture.

Fault injection strategies for AI services

Fault catalog for AI workloads

Fault type	Simulation	Affected component	Expected response
Regional outage	DNS failure or network block	Azure OpenAI	Failover to secondary region
API throttling	Artificial 429 response	Azure OpenAI	Retry with backoff, graceful degradation
Search unavailable	Network block to search	AI Search	Fallback to keyword search
High latency	Network delay	All API calls	Timeout → circuit breaker
Data corruption	Incorrect embedding values	Cosmos DB / Search	Validation and rebuild
Memory pressure	VM memory stress	App Service	Auto-restart, scaling
Dependency failure	DNS poisoning	Key Vault, App Config	Cached config, graceful degradation
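The "retry with backoff, graceful degradation" response listed in the catalog can be sketched roughly as follows. This is a minimal illustration, not a prescribed implementation: `ThrottledError`, `call_with_backoff`, and the delay values are hypothetical names and assumptions, and a real client would map HTTP 429 responses from the model endpoint onto the exception.

```python
import random
import time
from typing import Callable


class ThrottledError(Exception):
    """Hypothetical error standing in for an HTTP 429 from the model endpoint."""


def call_with_backoff(
    operation: Callable[[], str],
    fallback: Callable[[], str],
    max_attempts: int = 4,
    base_delay_seconds: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Retry `operation` while it is throttled, then degrade to `fallback`."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except ThrottledError:
            if attempt == max_attempts - 1:
                break  # retries exhausted; fall through to the fallback
            # Exponential backoff (0.5 s, 1 s, 2 s, ...) plus up to 10% jitter
            delay = base_delay_seconds * (2 ** attempt)
            sleep(delay + random.uniform(0, delay * 0.1))
    return fallback()


if __name__ == "__main__":
    calls = {"count": 0}

    def flaky_model_call() -> str:
        # Throttled on the first two attempts, succeeds on the third
        calls["count"] += 1
        if calls["count"] < 3:
            raise ThrottledError()
        return "model answer"

    print(call_with_backoff(flaky_model_call, lambda: "cached answer",
                            sleep=lambda seconds: None))
```

Passing `sleep` as a parameter keeps the sketch testable without real delays; a chaos experiment that injects 429 responses should then show the fallback path being taken once retries run out.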

Azure Chaos Studio experiments

# Enable Chaos Studio for resources
# Step 1: Register the target
az rest --method PUT \
  --url "https://management.azure.com/subscriptions/{sub}/resourceGroups/rg-ai-prod/providers/Microsoft.Web/sites/ai-app-prod/providers/Microsoft.Chaos/targets/Microsoft-AppService?api-version=2024-01-01" \
  --body '{"properties":{}}'

# Step 2: Enable the capability (App Service Stop)
az rest --method PUT \
  --url "https://management.azure.com/subscriptions/{sub}/resourceGroups/rg-ai-prod/providers/Microsoft.Web/sites/ai-app-prod/providers/Microsoft.Chaos/targets/Microsoft-AppService/capabilities/Stop-1.0?api-version=2024-01-01" \
  --body '{"properties":{}}'

Chaos experiment: simulate an Azure OpenAI regional outage

{
  "identity": {
    "type": "SystemAssigned"
  },
  "location": "norwayeast",
  "properties": {
    "selectors": [
      {
        "id": "selector-nsg-block-openai",
        "type": "List",
        "targets": [
          {
            "id": "/subscriptions/{sub}/resourceGroups/rg-ai-prod/providers/Microsoft.Network/networkSecurityGroups/nsg-ai-app/providers/Microsoft.Chaos/targets/Microsoft-NetworkSecurityGroup",
            "type": "ChaosTarget"
          }
        ]
      }
    ],
    "steps": [
      {
        "name": "Block-OpenAI-Traffic",
        "branches": [
          {
            "name": "branch-1",
            "actions": [
              {
                "name": "urn:csci:microsoft:networkSecurityGroup:securityRule/1.1",
                "type": "continuous",
                "selectorId": "selector-nsg-block-openai",
                "duration": "PT10M",
                "parameters": [
                  { "key": "name", "value": "chaos-block-openai" },
                  { "key": "protocol", "value": "*" },
                  { "key": "sourceAddresses", "value": "[\"*\"]" },
                  { "key": "destinationAddresses", "value": "[\"CognitiveServicesManagement\"]" },
                  { "key": "destinationPortRanges", "value": "[\"443\"]" },
                  { "key": "access", "value": "Deny" },
                  { "key": "priority", "value": "100" },
                  { "key": "direction", "value": "Outbound" }
                ]
              }
            ]
          }
        ]
      }
    ]
  }
}

Network partition simulation

Simulating a cross-region network partition

# Chaos experiment: simulate a network partition between regions
# Blocks VNet peering traffic to test failover
set -euo pipefail

# Method 1: NSG-based blocking
# NSG rule priorities must be between 100 and 4096, so 100 is the highest precedence available
az network nsg rule create \
  --resource-group "rg-networking" \
  --nsg-name "nsg-ai-app" \
  --name "chaos-block-cross-region" \
  --priority 100 \
  --direction Outbound \
  --access Deny \
  --protocol "*" \
  --destination-address-prefixes "10.2.0.0/16" \
  --description "CHAOS TEST: Block cross-region traffic"

# Remove the block when the script exits, even if it is interrupted
cleanup() {
  az network nsg rule delete \
    --resource-group "rg-networking" \
    --nsg-name "nsg-ai-app" \
    --name "chaos-block-cross-region"
}
trap cleanup EXIT

# Wait and observe (10 minutes); the EXIT trap then removes the rule
sleep 600

DNS failure simulation

# Python: simulate a DNS failure for testing
# Uses an Azure Private DNS zone override to simulate the failure

import subprocess
import time

# Shared arguments for adding and removing the chaos record.
# NOTE: "chaos-test" is the placeholder record set name from the original example.
# For the override to affect target_fqdn, the record set name presumably has to
# match that host's label in the zone.
RECORD_ARGS = [
    "--resource-group", "rg-networking",
    "--zone-name", "privatelink.openai.azure.com",
    "--record-set-name", "chaos-test",
    "--ipv4-address", "10.255.255.255",  # assumed to be unused and unreachable in the VNet
]

def simulate_dns_failure(target_fqdn: str, duration_minutes: int = 10):
    """Simulate DNS failure by overriding DNS resolution."""
    print(f"Simulating DNS failure for {target_fqdn} for {duration_minutes} min")

    # Create a DNS record that points to an unreachable IP
    subprocess.run(
        ["az", "network", "private-dns", "record-set", "a", "add-record", *RECORD_ARGS],
        check=True,
    )

    try:
        print(f"DNS poisoned. Observing for {duration_minutes} minutes...")
        time.sleep(duration_minutes * 60)
    finally:
        # Clean up even if the observation window is interrupted
        subprocess.run(
            ["az", "network", "private-dns", "record-set", "a", "remove-record", *RECORD_ARGS],
            check=True,
        )
        print("DNS restored.")

Load and stress testing

Load testing with Azure Load Testing

# JMeter test plan for AI API stress testing
# azure-load-test-config.yaml
version: v0.1
testId: ai-stress-test
testPlan: ai-load-test.jmx
engineInstances: 5
configurationFiles:
  - ai-load-test.jmx
failureCriteria:
  - avg(response_time_ms) > 5000
  - percentage(error) > 5
env:
  - name: AOAI_ENDPOINT
    value: https://aoai-prod.openai.azure.com
  - name: SEARCH_ENDPOINT
    value: https://search-prod.search.windows.net

# Create the load test (the test is identified by --test-id)
az load test create \
  --test-id "ai-stress-test" \
  --resource-group "rg-ai-test" \
  --load-test-resource "lt-ai-prod" \
  --test-plan "ai-load-test.jmx" \
  --engine-instances 5

# Run the test during the failover scenario
az load test-run create \
  --test-run-id "failover-stress-run" \
  --resource-group "rg-ai-test" \
  --load-test-resource "lt-ai-prod" \
  --test-id "ai-stress-test" \
  --description "Stress test during simulated failover"

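Gradual load increase

# Gradually increase load to find the breaking point
import asyncio
import time

import aiohttp


async def _send_request(session, endpoint, stats):
    """Send one request and record its outcome and latency in `stats`."""
    req_start = time.perf_counter()
    try:
        async with session.post(endpoint, json={"query": "test"}) as resp:
            await resp.read()
            stats["latency_ms"] += (time.perf_counter() - req_start) * 1000
            stats["completed"] += 1
            if resp.status < 400:
                stats["success"] += 1
            else:
                stats["errors"] += 1
    except Exception:
        # Connection errors and timeouts count as failed requests
        stats["errors"] += 1


async def ramp_up_test(
    endpoint: str,
    start_rps: int = 10,
    end_rps: int = 500,
    step_rps: int = 10,
    step_duration_seconds: int = 60
):
    """Gradually increase load to find service breaking point."""
    current_rps = start_rps
    results = []

    # Reuse one session for all requests; the 30 s timeout is an assumed value
    timeout = aiohttp.ClientTimeout(total=30)
    async with aiohttp.ClientSession(timeout=timeout) as session:
        while current_rps <= end_rps:
            print(f"Testing at {current_rps} RPS for {step_duration_seconds}s...")
            interval = 1.0 / current_rps
            stats = {"success": 0, "errors": 0, "completed": 0, "latency_ms": 0.0}
            tasks = []

            # Start requests at a fixed rate without waiting for each response,
            # so slow responses do not lower the offered load
            start = time.monotonic()
            while time.monotonic() - start < step_duration_seconds:
                tasks.append(asyncio.create_task(_send_request(session, endpoint, stats)))
                await asyncio.sleep(interval)
            await asyncio.gather(*tasks)

            total = stats["success"] + stats["errors"]
            error_rate = stats["errors"] / max(total, 1) * 100
            # Average latency over requests that actually received a response
            avg_latency = stats["latency_ms"] / max(stats["completed"], 1)

            results.append({
                "rps": current_rps,
                "success": stats["success"],
                "errors": stats["errors"],
                "error_rate": round(error_rate, 2),
                "avg_latency_ms": round(avg_latency, 1)
            })

            print(f"  Results: {error_rate:.1f}% errors, {avg_latency:.0f}ms avg latency")

            # Stop if the error rate is too high
            if error_rate > 20:
                print(f"Breaking point found at {current_rps} RPS")
                break

            current_rps += step_rps

    return results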
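Once the ramp-up has produced its list of per-step results, a small helper can turn it into a single capacity figure. The sketch below is an illustration under stated assumptions: `highest_sustainable_rps` is a hypothetical function, and the default thresholds (5% errors, 5000 ms average latency) are borrowed from the failure criteria in the load test configuration earlier in this section.

```python
from typing import Optional


def highest_sustainable_rps(
    results: list[dict],
    max_error_rate: float = 5.0,
    max_avg_latency_ms: float = 5000.0,
) -> Optional[int]:
    """Return the highest RPS step reached before any threshold was exceeded.

    Steps are evaluated in ascending RPS order, and evaluation stops at the
    first step that breaks a threshold. Returns None if even the lowest
    step fails.
    """
    sustainable = None
    for step in sorted(results, key=lambda r: r["rps"]):
        too_many_errors = step["error_rate"] > max_error_rate
        too_slow = step["avg_latency_ms"] > max_avg_latency_ms
        if too_many_errors or too_slow:
            break
        sustainable = step["rps"]
    return sustainable


if __name__ == "__main__":
    sample_results = [
        {"rps": 10, "error_rate": 0.0, "avg_latency_ms": 420.0},
        {"rps": 20, "error_rate": 1.2, "avg_latency_ms": 910.0},
        {"rps": 30, "error_rate": 8.5, "avg_latency_ms": 2300.0},
    ]
    print(highest_sustainable_rps(sample_results))
```

Stopping at the first failing step, rather than taking the maximum passing step, avoids reporting a capacity that was only reached after the service had already degraded.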

Recovery time measurement and validation

RTO measurement during chaos testing

# Automatic RTO measurement during a failover test
import time
import requests
from datetime import datetime, timezone

class RTOMeasurement:
    """Measure actual RTO during failover tests."""

    def __init__(self, health_endpoint: str, check_interval_seconds: float = 1.0):
        self.health_endpoint = health_endpoint
        self.check_interval = check_interval_seconds
        self.measurements = []

    def measure_rto(self, max_wait_seconds: int = 600) -> dict:
        """Continuously check health and measure recovery time."""
        failure_detected = None
        recovery_detected = None
        was_healthy = True
        checks = []

        # Use the monotonic clock so wall-clock adjustments cannot skew the RTO
        start = time.monotonic()
        while time.monotonic() - start < max_wait_seconds:
            try:
                resp = requests.get(self.health_endpoint, timeout=5)
                is_healthy = resp.status_code == 200
            except requests.RequestException:
                is_healthy = False

            now = time.monotonic()
            check = {
                "timestamp": datetime.now(timezone.utc).isoformat(),
                "elapsed_seconds": round(now - start, 1),
                "healthy": is_healthy
            }
            checks.append(check)

            # First transition from healthy to unhealthy marks the failure
            if was_healthy and not is_healthy and failure_detected is None:
                failure_detected = now
                print(f"Failure detected at {check['elapsed_seconds']}s")

            # First transition back to healthy after a failure marks the recovery
            if (not was_healthy and is_healthy
                    and failure_detected is not None and recovery_detected is None):
                recovery_detected = now
                rto = recovery_detected - failure_detected
                print(f"Recovery detected at {check['elapsed_seconds']}s, RTO: {rto:.1f}s")

            was_healthy = is_healthy
            time.sleep(self.check_interval)

        rto_seconds = None
        if failure_detected is not None and recovery_detected is not None:
            rto_seconds = round(recovery_detected - failure_detected, 1)

        healthy_checks = sum(1 for c in checks if c["healthy"])
        result = {
            "failure_detected": failure_detected is not None,
            "recovery_detected": recovery_detected is not None,
            "rto_seconds": rto_seconds,
            "total_checks": len(checks),
            "healthy_checks": healthy_checks,
            "unhealthy_checks": len(checks) - healthy_checks,
            "availability_pct": round(healthy_checks / max(len(checks), 1) * 100, 2)
        }

        self.measurements.append(result)
        return result

# Usage
rto_meter = RTOMeasurement("https://ai-app-prod.azurewebsites.net/health")
result = rto_meter.measure_rto(max_wait_seconds=600)
print(f"Measured RTO: {result['rto_seconds']}s")
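The pipeline later in this document reads the measurement from a file named `rto_result.json`. The sketch below shows one way the RTO calculation could be separated from the HTTP polling and persisted in that form; `compute_rto` and `write_result` are hypothetical helpers, and the sample data is invented purely for illustration.

```python
import json
from typing import Optional


def compute_rto(checks: list[tuple[float, bool]]) -> Optional[float]:
    """Return seconds between the first unhealthy check and the next healthy one.

    `checks` holds (elapsed_seconds, healthy) samples in time order.
    Returns None when no failure or no recovery was observed.
    """
    failure_at = None
    for elapsed, healthy in checks:
        if failure_at is None:
            if not healthy:
                failure_at = elapsed  # first failed health check
        elif healthy:
            return round(elapsed - failure_at, 1)  # first healthy check after failure
    return None


def write_result(checks: list[tuple[float, bool]], path: str = "rto_result.json") -> dict:
    """Summarize the samples and write them as JSON for a later pipeline step."""
    result = {
        "rto_seconds": compute_rto(checks),
        "total_checks": len(checks),
        "healthy_checks": sum(1 for _, healthy in checks if healthy),
    }
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(result, handle, indent=2)
    return result


if __name__ == "__main__":
    # Invented samples: healthy, two failed checks, then recovery
    samples = [(0.0, True), (1.0, False), (2.0, False), (3.5, True)]
    print(compute_rto(samples))
```

Keeping `compute_rto` free of network calls makes the calculation easy to unit test with recorded health-check sequences.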

Tools and platforms for chaos engineering

Azure Chaos Studio

Feature	Description	Supported resources
Service-direct faults	Faults injected via the Azure API	App Service, AKS, Cosmos DB, NSG
Agent-based faults	Faults injected via a VM agent	CPU/memory stress, network faults
Experiments	Structured fault sequences	All supported resources
Permissions	RBAC-based access control	Dedicated Chaos role

Complementary tools

Tool	Use case	Integration with Azure
Azure Chaos Studio	Native Azure fault injection	Built in
Azure Load Testing	Load testing	Built in, JMeter-based
Litmus Chaos	Kubernetes chaos testing	AKS-compatible
Toxiproxy	Network faults for development	Manual setup
PyRIT	AI-specific red teaming	Azure AI

Chaos testing CI/CD integration

# Azure DevOps Pipeline: chaos testing as part of a release
trigger: none

stages:
  - stage: DeployToStaging
    displayName: 'Deploy to Staging'
    jobs:
      - job: Deploy
        steps:
          - task: AzureWebApp@1
            inputs:
              appName: 'ai-app-staging'

  - stage: ChaosTests
    displayName: 'Run Chaos Experiments'
    dependsOn: DeployToStaging
    jobs:
      - job: RunChaosExperiment
        steps:
          - task: AzureCLI@2
            displayName: 'Start chaos experiment'
            inputs:
              azureSubscription: 'chaos-service-connection'
              scriptType: 'bash'
              scriptLocation: 'inlineScript'
              inlineScript: |
                # Start the chaos experiment; the start call returns a status URL
                STATUS_URL=$(az rest --method POST \
                  --url "https://management.azure.com/subscriptions/{sub}/resourceGroups/rg-ai-test/providers/Microsoft.Chaos/experiments/openai-failover-test/start?api-version=2024-01-01" \
                  --query "statusUrl" -o tsv)

                echo "Experiment started: $STATUS_URL"

                # Wait and measure the RTO
                python measure_rto.py \
                  --endpoint "https://ai-app-staging.azurewebsites.net/health" \
                  --max-wait 300

          - task: AzureCLI@2
            displayName: 'Validate results'
            inputs:
              azureSubscription: 'chaos-service-connection'
              scriptType: 'bash'
              scriptLocation: 'inlineScript'
              inlineScript: |
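                # Check that the RTO is within target
                RTO=$(python -c "import json; print(json.load(open('rto_result.json'))['rto_seconds'])")
                # A null rto_seconds is printed as "None": no failure or recovery was observed
                if [ "$RTO" = "None" ]; then
                  echo "##vso[task.logissue type=error]No RTO measured: failure or recovery was not observed"
                  exit 1
                fi
                if [ "$(echo "$RTO > 900" | bc)" -eq 1 ]; then
                  echo "##vso[task.logissue type=error]RTO exceeded 15 minutes: ${RTO}s"
                  exit 1
                fi
                echo "RTO within target: ${RTO}s"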
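The same gate can also be expressed in Python, which makes the missing-measurement case explicit. This is a minimal sketch: `evaluate_rto` is a hypothetical helper, the 900-second default mirrors the 15-minute target in the pipeline, and the file is only read when a path is passed on the command line.

```python
import json
import sys


def evaluate_rto(result: dict, target_seconds: float = 900.0) -> tuple[bool, str]:
    """Return (passed, message) for a measurement dict containing 'rto_seconds'."""
    rto = result.get("rto_seconds")
    if rto is None:
        # No failure or no recovery was observed, so the test proved nothing
        return False, "No RTO measured: failure or recovery was not observed"
    if rto > target_seconds:
        return False, f"RTO exceeded target: {rto}s > {target_seconds}s"
    return True, f"RTO within target: {rto}s"


def main(path: str) -> int:
    """Read the result file and return a process exit code (0 = pass)."""
    with open(path, encoding="utf-8") as handle:
        passed, message = evaluate_rto(json.load(handle))
    if passed:
        print(message)
    else:
        # Azure DevOps logging command that surfaces the message as an error
        print(f"##vso[task.logissue type=error]{message}")
    return 0 if passed else 1


if __name__ == "__main__" and len(sys.argv) > 1:
    sys.exit(main(sys.argv[1]))
```

Treating a missing measurement as a failure is a design choice: an experiment in which no outage was detected has not demonstrated that failover works.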

References

What is Azure Chaos Studio? - Chaos Studio overview
Understand chaos engineering and resilience - Chaos engineering concepts
Architecture strategies for designing a reliability testing strategy - WAF testing strategy
Continuous validation with Azure Load Testing and Chaos Studio - Combined testing
Shift right to test in production - Fault injection in production
Chaos Agent overview - Agent-based fault injection

For Cosmo

Use this reference when the customer wants to implement chaos engineering for AI systems, or when they need to validate their DR procedures.
Start with tabletop exercises before real fault injections: understand the expected behavior before you break things.
Use Azure Chaos Studio in staging environments first, then gradually in production with a limited blast radius.
Integrate chaos testing into CI/CD: automated failover tests should run after every infrastructure change.
RTO measurement is the most important output: document actual versus planned RTO to identify gaps.
