ktg-plugin-marketplace/plugins/ms-ai-architect/skills/ms-ai-engineering/references/api-management/genai-gateway-policies.md
Kjell Tore Guttormsen 6a7632146e feat(ms-ai-architect): add plugin to open marketplace (v1.5.0 baseline)
Initial addition of ms-ai-architect plugin to the open-source marketplace.
Private content excluded: orchestrator/ (Linear tooling), docs/utredning/
(client investigation), generated test reports and PDF export script.
skill-gen tooling moved from orchestrator/ to scripts/skill-gen/.

Security scan: WARNING (risk 20/100) — no secrets, no injection found.
False positive fixed: added gitleaks:allow to Python variable reference
in output-validation-grounding-verification.md line 109.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-07 17:17:17 +02:00

22 KiB

GenAI-Specific APIM Policies & Rules

Last updated: 2026-02 Status: GA Category: API Management & AI Gateway


Introduksjon

Azure API Management (APIM) inkluderer et sett med policyer spesifikt designet for generativ AI (GenAI). Disse policyene går utover tradisjonell API-gateway-funksjonalitet og adresserer unike utfordringer ved AI-workloads: content safety-modererering, prompt-validering, token-basert rate limiting, semantic caching, og audit-logging av prompts og completions. Samlet utgjør de kjernen i APIM sin AI gateway-kapabilitet.

For norsk offentlig sektor er GenAI-spesifikke policyer kritisk viktige. Krav fra AI Act, Datatilsynet og NSM innebarer at AI-systemer må ha mekanismer for innholdssikkerhet, logging for etterprøvbarhet, og kontroll over hva slags innhold som genereres. APIM-policyer gir disse kontrollene uten at hver enkelt applikasjon må implementere dem selv — en sentralisert, konsistent tilnærming til AI governance.

Denne referansen dekker alle GenAI-spesifikke APIM-policyer med fullstendige XML-eksempler, konfigurasjonsparametre og best practices. Policyene kan kombineres fritt i APIM sin inbound/outbound policy pipeline for å bygge en komplett AI safety-stack.


Content Safety Integration

llm-content-safety Policy

Policyen sender LLM-forespørsler til Azure AI Content Safety for moderering FØR de videresendes til backend-modellen:

<inbound>
    <base />
    <llm-content-safety backend-id="content-safety-backend"
                        shield-prompt="true"
                        enforce-on-completions="true">
        <categories output-type="EightSeverityLevels">
            <category name="Hate" threshold="4" />
            <category name="Violence" threshold="4" />
            <category name="SelfHarm" threshold="2" />
            <category name="Sexual" threshold="2" />
        </categories>
        <blocklists>
            <id>custom-blocklist-pii</id>
            <id>custom-blocklist-org-specific</id>
        </blocklists>
    </llm-content-safety>
</inbound>

Prerequisites for Content Safety

// 1. Azure AI Content Safety ressurs
resource contentSafety 'Microsoft.CognitiveServices/accounts@2023-05-01' = {
  name: 'content-safety-service'
  location: 'westeurope'
  kind: 'ContentSafety'
  sku: { name: 'S0' }
  properties: {
    publicNetworkAccess: 'Disabled'
  }
}

// 2. APIM Backend for Content Safety
resource contentSafetyBackend 'Microsoft.ApiManagement/service/backends@2023-09-01-preview' = {
  name: 'ai-gateway-apim/content-safety-backend'
  properties: {
    url: 'https://content-safety-service.cognitiveservices.azure.com'
    protocol: 'http'
    credentials: {
      authorization: {
        scheme: 'managed-identity'
        parameter: 'https://cognitiveservices.azure.com'
      }
    }
  }
}

Content Safety Konfigurasjon

Attributt Beskrivelse Standard
backend-id Backend-entitet for Content Safety Obligatorisk
shield-prompt Sjekk for adversarial attacks (jailbreak) false
enforce-on-completions Sjekk også respons fra modellen false

Kategorier og Terskelverider

Kategori Beskrivelse Anbefalt terskel (offentlig sektor)
Hate Hatefullt innhold, diskriminering 2-4 (streng)
Violence Voldelig innhold 2-4 (streng)
SelfHarm Selvskading 2 (svært streng)
Sexual Seksuelt innhold 2 (svært streng)

Terskelskala: 0 (mest restriktiv) til 7 (minst restriktiv). Lavere verdi = flere forespørsler blokkeres.

Severity Level Output Types

Output Type Nivåer Bruksområde
FourSeverityLevels 0, 2, 4, 6 Standard, enklere policy
EightSeverityLevels 0-7 Finkornet kontroll

Blokkert Request-respons

Når Content Safety blokkerer en forespørsel:

{
  "statusCode": 403,
  "message": "Content safety violation detected. The request has been blocked."
}

Prompt Validation Policies

Custom Prompt Validation

Utover Azure AI Content Safety kan du implementere egne prompt-valideringsregler:

<inbound>
    <base />

    <!-- Valider prompt-lengde -->
    <set-variable name="request-body" value="@{
        return context.Request.Body.As<JObject>(preserveContent: true);
    }" />

    <choose>
        <!-- Blokkér ekstremt lange prompts -->
        <when condition="@{
            var body = (JObject)context.Variables["request-body"];
            var messages = body?["messages"] as JArray;
            if (messages == null) return false;
            var totalLength = messages.Sum(m => m["content"]?.ToString().Length ?? 0);
            return totalLength > 50000;
        }">
            <return-response>
                <set-status code="400" reason="Bad Request" />
                <set-body>{
    "error": {
        "code": "PromptTooLong",
        "message": "Total prompt length exceeds 50,000 characters."
    }
}</set-body>
            </return-response>
        </when>

        <!-- Blokkér forespørsler uten system message -->
        <when condition="@{
            var body = (JObject)context.Variables["request-body"];
            var messages = body?["messages"] as JArray;
            if (messages == null) return true;
            return !messages.Any(m => m["role"]?.ToString() == "system");
        }">
            <return-response>
                <set-status code="400" reason="Bad Request" />
                <set-body>{
    "error": {
        "code": "SystemMessageRequired",
        "message": "A system message is required for all AI requests."
    }
}</set-body>
            </return-response>
        </when>

        <!-- Blokkér forsøk på å overstyre system message -->
        <when condition="@{
            var body = (JObject)context.Variables["request-body"];
            var messages = body?["messages"] as JArray;
            if (messages == null) return false;
            var systemMessages = messages.Where(m => m["role"]?.ToString() == "system").ToList();
            return systemMessages.Count > 1;
        }">
            <return-response>
                <set-status code="400" reason="Bad Request" />
                <set-body>{
    "error": {
        "code": "MultipleSystemMessages",
        "message": "Only one system message is allowed per request."
    }
}</set-body>
            </return-response>
        </when>
    </choose>
</inbound>

Inject Mandatory System Prompt

Tving en standard system prompt for alle forespørsler:

<inbound>
    <base />

    <!-- Injiser organisasjonens standard system prompt -->
    <set-body>@{
        var body = context.Request.Body.As<JObject>(preserveContent: true);
        var messages = body["messages"] as JArray ?? new JArray();

        // Organisasjonens mandatory system prompt
        var orgSystemPrompt = new JObject {
            ["role"] = "system",
            ["content"] = "You are a helpful assistant for Statens vegvesen. " +
                          "You must respond in Norwegian unless explicitly asked otherwise. " +
                          "Never share personal data, internal processes, or confidential information. " +
                          "Always cite sources when providing factual information."
        };

        // Fjern eksisterende system messages og legg inn organisasjonens
        var userMessages = new JArray(messages.Where(m => m["role"]?.ToString() != "system"));
        userMessages.Insert(0, orgSystemPrompt);
        body["messages"] = userMessages;

        return body.ToString();
    }</set-body>
</inbound>

Response Filtering

Filtrere Sensitiv Informasjon fra Responser

<outbound>
    <base />

    <!-- Fjern potensielle PII-lekkasjer fra AI-respons -->
    <choose>
        <when condition="@(!context.Response.Headers.GetValueOrDefault("Content-Type","")
                           .Contains("text/event-stream"))">
            <set-body>@{
                var body = context.Response.Body.As<JObject>(preserveContent: true);
                var choices = body?["choices"] as JArray;
                if (choices == null) return body.ToString();

                foreach (var choice in choices)
                {
                    var content = choice["message"]?["content"]?.ToString();
                    if (content == null) continue;

                    // Fjern fødselsnumre (11 siffer)
                    content = System.Text.RegularExpressions.Regex.Replace(
                        content, @"\b\d{11}\b", "[REDACTED-PII]");

                    // Fjern e-postadresser
                    content = System.Text.RegularExpressions.Regex.Replace(
                        content, @"[\w.+-]+@[\w-]+\.[\w.-]+", "[REDACTED-EMAIL]");

                    // Fjern telefonnumre (norsk format)
                    content = System.Text.RegularExpressions.Regex.Replace(
                        content, @"\b(\+47)?\s?\d{2}\s?\d{2}\s?\d{2}\s?\d{2}\b", "[REDACTED-PHONE]");

                    choice["message"]["content"] = content;
                }

                return body.ToString();
            }</set-body>
        </when>
    </choose>
</outbound>

Legg til Disclaimer i Responser

<outbound>
    <base />
    <set-header name="X-AI-Disclaimer" exists-action="override">
        <value>AI-generated content. Verify information before use.</value>
    </set-header>
</outbound>

Rate Limiting per Model

Token Rate Limiting (llm-token-limit)

Begrens token-forbruk per forbruker, per modell:

<inbound>
    <base />

    <!-- Global token-grense per subscription -->
    <llm-token-limit
        counter-key="@(context.Subscription.Id)"
        tokens-per-minute="10000"
        estimate-prompt-tokens="true"
        remaining-tokens-variable-name="remainingTokens" />

    <!-- Ekstra grense per modell -->
    <choose>
        <when condition="@(context.Request.MatchedParameters["deployment-id"] == "gpt-4o")">
            <llm-token-limit
                counter-key="@("gpt4o-" + context.Subscription.Id)"
                tokens-per-minute="5000"
                estimate-prompt-tokens="true" />
        </when>
        <when condition="@(context.Request.MatchedParameters["deployment-id"] == "gpt-4o-mini")">
            <llm-token-limit
                counter-key="@("gpt4omini-" + context.Subscription.Id)"
                tokens-per-minute="20000"
                estimate-prompt-tokens="true" />
        </when>
    </choose>
</inbound>

Token Quota (Periodisk)

Sett token-kvoter per dag, uke eller måned:

<inbound>
    <base />
    <!-- Daglig token-kvote per avdeling -->
    <llm-token-limit
        counter-key="@(context.Request.Headers.GetValueOrDefault("X-Department", "default"))"
        tokens-per-minute="0"
        token-quota="100000"
        token-quota-period="86400"
        estimate-prompt-tokens="true"
        remaining-tokens-variable-name="dailyRemaining" />

    <!-- Legg til gjenværende kvote i respons-header -->
    <set-header name="X-Daily-Tokens-Remaining" exists-action="override">
        <value>@(context.Variables.GetValueOrDefault<int>("dailyRemaining").ToString())</value>
    </set-header>
</inbound>

Prompt Token Pre-calculation

estimate-prompt-tokens="true" lar APIM estimere prompt-tokens FØR request sendes til backend. Hvis prompten allerede overskrider grensen, returneres 429 umiddelbart:

Med pre-calculation:
  Klient → APIM (estimerer: 8000 tokens, grense: 5000) → 429 returnert
  → Ingen request til Azure OpenAI → sparer backend-kapasitet

Uten pre-calculation:
  Klient → APIM → Azure OpenAI (bruker 8000 tokens) → Respons → APIM teller → Neste request: 429
  → Tokens allerede brukt

Multi-Region Rate Limiting

Viktig: Rate limiting-policyer (llm-token-limit, rate-limit) teller SEPARAT per regional gateway i multi-region deployments:

Policy Scope Multi-region oppførsel
llm-token-limit Per gateway Separate tellere per region
rate-limit Per gateway Separate tellere per region
quota Global (instans) Én global teller
quota-by-key Global (instans) Én global teller

For å oppnå global rate limiting, bruk quota-by-key i stedet for llm-token-limit.


Audit Logging for Prompts

Aktivere LLM API-logging

1. APIM → Monitoring → Diagnostic settings
2. "+ Add diagnostic setting"
3. Velg "Logs related to generative AI gateway"
4. Destination: Log Analytics workspace
5. Save

6. APIM → APIs → [din API] → Settings → Diagnostic Logs
7. Azure Monitor → Log LLM messages: Enabled
8. Log prompts: 32768 bytes
9. Log completions: 32768 bytes
10. Save

Log-skjema: ApiManagementGatewayLlmLog

Felt Beskrivelse Eksempel
TimeGenerated Tidspunkt for request 2026-02-11T10:30:00Z
CorrelationId Unik request-ID abc-123-def
OperationName API-operasjon ChatCompletions
ModelDeployment Deployment-navn gpt-4o
PromptTokens Antall prompt-tokens 150
CompletionTokens Antall completion-tokens 250
TotalTokens Totalt token-forbruk 400
RequestMessages Prompt-innhold (JSON) [{"role":"user","content":"..."}]
ResponseMessages Completion-innhold (JSON) [{"content":"..."}]

KQL: Audit Trail for AI-requests

// Full audit trail med prompt og respons
ApiManagementGatewayLlmLog
| where TimeGenerated > ago(24h)
| extend RequestArray = parse_json(RequestMessages)
| extend ResponseArray = parse_json(ResponseMessages)
| mv-expand RequestArray
| mv-expand ResponseArray
| project
    TimeGenerated,
    CorrelationId,
    Model = ModelDeployment,
    PromptTokens,
    CompletionTokens,
    Prompt = tostring(RequestArray.content),
    Response = tostring(ResponseArray.content)
| summarize
    Input = strcat_array(make_list(Prompt), " "),
    Output = strcat_array(make_list(Response), " ")
  by CorrelationId, TimeGenerated, Model, PromptTokens, CompletionTokens
| where isnotempty(Input) and isnotempty(Output)
| order by TimeGenerated desc

KQL: Detektere Anomalier

// Finn uvanlig høy token-bruk per bruker
ApiManagementGatewayLlmLog
| where TimeGenerated > ago(7d)
| extend UserId = tostring(CustomDimensions["UserId"])
| summarize
    AvgTokens = avg(toint(TotalTokens)),
    MaxTokens = max(toint(TotalTokens)),
    P95Tokens = percentile(toint(TotalTokens), 95),
    RequestCount = count()
  by UserId, bin(TimeGenerated, 1h)
| where MaxTokens > 3 * AvgTokens  // Flagg anomalier
| order by MaxTokens desc

Event Hub-logging for Real-time Monitoring

<outbound>
    <base />
    <!-- Logg til Event Hub for real-time analyse -->
    <log-to-eventhub logger-id="ai-audit-logger">@{
        var body = context.Response.Body.As<JObject>(preserveContent: true);
        var usage = body?["usage"];

        return new JObject(
            new JProperty("timestamp", DateTime.UtcNow.ToString("o")),
            new JProperty("correlationId", context.RequestId),
            new JProperty("subscriptionId", context.Subscription?.Id),
            new JProperty("apiId", context.Api?.Id),
            new JProperty("model", body?["model"]?.ToString()),
            new JProperty("promptTokens", usage?["prompt_tokens"]),
            new JProperty("completionTokens", usage?["completion_tokens"]),
            new JProperty("totalTokens", usage?["total_tokens"]),
            new JProperty("statusCode", context.Response.StatusCode),
            new JProperty("region", context.Deployment.Region),
            new JProperty("latencyMs", context.Elapsed.TotalMilliseconds)
        ).ToString();
    }</log-to-eventhub>
</outbound>

Komplett GenAI Policy Stack

Full Inbound + Outbound Policy

<policies>
    <inbound>
        <base />

        <!-- 1. Autentisering -->
        <validate-azure-ad-token tenant-id="{{TENANT_ID}}"
                                 header-name="Authorization"
                                 failed-validation-httpcode="401" />

        <!-- 2. Ekstraher brukerinfo for logging og rate limiting -->
        <set-variable name="caller-id"
                      value="@(context.Request.Headers.GetValueOrDefault("Authorization","")
                             .AsJwt()?.Claims.GetValueOrDefault("oid", "anonymous"))" />
        <set-variable name="department"
                      value="@(context.Request.Headers.GetValueOrDefault("Authorization","")
                             .AsJwt()?.Claims.GetValueOrDefault("department", "unknown"))" />

        <!-- 3. Token rate limiting -->
        <llm-token-limit
            counter-key="@((string)context.Variables["caller-id"])"
            tokens-per-minute="10000"
            estimate-prompt-tokens="true" />

        <!-- 4. Content Safety -->
        <llm-content-safety backend-id="content-safety-backend"
                            shield-prompt="true">
            <categories output-type="EightSeverityLevels">
                <category name="Hate" threshold="4" />
                <category name="Violence" threshold="4" />
                <category name="SelfHarm" threshold="2" />
                <category name="Sexual" threshold="2" />
            </categories>
        </llm-content-safety>

        <!-- 5. Semantic cache lookup -->
        <llm-semantic-cache-lookup
            score-threshold="0.9"
            embeddings-backend-id="embedding-backend"
            embeddings-backend-auth="system-assigned" />

        <!-- 6. Backend med managed identity -->
        <set-backend-service backend-id="aoai-pool" />
        <authentication-managed-identity
            resource="https://cognitiveservices.azure.com"
            output-token-variable-name="mi-token" />
        <set-header name="Authorization" exists-action="override">
            <value>@("Bearer " + (string)context.Variables["mi-token"])</value>
        </set-header>
    </inbound>

    <backend>
        <forward-request timeout="120"
                         fail-on-error-status-code="true"
                         buffer-response="false" />
    </backend>

    <outbound>
        <base />

        <!-- 7. Semantic cache store -->
        <llm-semantic-cache-store duration="3600" />

        <!-- 8. Token-metriker -->
        <llm-emit-token-metric namespace="ai-metrics">
            <dimension name="UserId" value="@((string)context.Variables["caller-id"])" />
            <dimension name="Department" value="@((string)context.Variables["department"])" />
            <dimension name="API" value="@(context.Api.Name)" />
            <dimension name="Region" value="@(context.Deployment.Region)" />
        </llm-emit-token-metric>
    </outbound>

    <on-error>
        <base />
        <return-response>
            <set-status code="500" reason="Internal Server Error" />
            <set-body>{
    "error": {
        "code": "GatewayError",
        "message": "An error occurred processing your AI request."
    }
}</set-body>
        </return-response>
    </on-error>
</policies>

Policy Execution Order

Inbound (fra topp til bunn):
  1. Authentication (validate-azure-ad-token)
  2. Variable extraction (set-variable)
  3. Token rate limiting (llm-token-limit)
  4. Content Safety (llm-content-safety)
  5. Cache lookup (llm-semantic-cache-lookup)
  6. Backend selection (set-backend-service)
  7. Backend auth (authentication-managed-identity)

Backend:
  8. Forward request (forward-request)

Outbound (fra topp til bunn):
  9. Cache store (llm-semantic-cache-store)
  10. Emit metrics (llm-emit-token-metric)

GenAI Policy Referanse

Alle GenAI-spesifikke Policyer

Policy Fase Formål
llm-content-safety Inbound Content Safety moderering
llm-token-limit Inbound Token rate limiting
llm-semantic-cache-lookup Inbound Semantic cache oppslag
llm-semantic-cache-store Outbound Lagre i semantic cache
llm-emit-token-metric Outbound Emitter token-metriker

Kompatibilitet

Policy Classic V2 Consumption Self-hosted Workspace
llm-content-safety Ja Ja Ja Ja Ja
llm-token-limit Ja Ja Ja Ja Ja
llm-semantic-cache-lookup Ja Ja Nei Nei Ja
llm-semantic-cache-store Ja Ja Nei Nei Ja
llm-emit-token-metric Ja Ja Ja Ja Ja

Referanser


For Cosmo

  • Bruk denne referansen når kunder trenger å implementere AI safety og governance gjennom APIM-policyer, spesielt content safety, prompt-validering og audit-logging.
  • Den viktigste policyen for norsk offentlig sektor er llm-content-safety med shield-prompt="true" — dette blokkerer jailbreak-forsøk og uønsket innhold FØR det når modellen.
  • Husk rekkefølgen: Autentisering FØRST, deretter rate limiting, SÅ content safety, SÅ cache lookup. Content Safety koster tokens (kall til Content Safety API) — cache lookup etter content safety betyr at cachen kun inneholder "godkjent" innhold.
  • For audit-logging: Aktiver LLM API-logging i Diagnostic Settings. Dette gir full etterprøvbarhet for alle prompts og completions — noe som er påkrevd under AI Act for høy-risiko AI-systemer.
  • Rate limiting per modell er viktig: GPT-4o er dyrere enn GPT-4o-mini, og bør ha strengere token-grenser for å kontrollere kostnader.