ktg-plugin-marketplace/plugins/ms-ai-architect/skills/ms-ai-governance/references/responsible-ai/red-teaming-ai-models.md
Kjell Tore Guttormsen 6a7632146e feat(ms-ai-architect): add plugin to open marketplace (v1.5.0 baseline)
Initial addition of ms-ai-architect plugin to the open-source marketplace.
Private content excluded: orchestrator/ (Linear tooling), docs/utredning/
(client investigation), generated test reports and PDF export script.
skill-gen tooling moved from orchestrator/ to scripts/skill-gen/.

Security scan: WARNING (risk 20/100) — no secrets, no injection found.
False positive fixed: added gitleaks:allow to Python variable reference
in output-validation-grounding-verification.md line 109.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-07 17:17:17 +02:00

26 KiB
Raw Blame History

Red Teaming AI Models - Adversarial Testing & Security

Dato: 2026-02-03 Kategori: Responsible AI & Governance Målgruppe: Arkitekter, sikkerhetsteam, AI-utviklere Konfidensgrad: ⚠️ HIGH — Basert på offisiell Microsoft-dokumentasjon (feb 2026)

Introduksjon

AI red teaming er en proaktiv sikkerhetsmetode for å identifisere sårbarheter i generative AI-systemer gjennom simulert adversarial testing. I motsetning til tradisjonell cybersecurity red teaming (som fokuserer på cyber kill chain), omfatter AI red teaming både sikkerhets- og innholdsrisiko, og simulerer adversarial brukere som forsøker å få AI-systemet til å oppføre seg uønsket.

Kjerneprinsipp: Kontinuerlig AI red teaming integrert i utviklingslivssyklusen identifiserer sårbarheter før de blir utnyttet av ondsinnet aktører. Uten systematisk adversarial testing deployer organisasjoner AI-systemer med ukjente svakheter som kan utnyttes via prompt injection, model poisoning, eller jailbreaking.

Hvorfor AI red teaming er kritisk

Microsoft Security Benchmark (AI-7) definerer continuous AI red teaming som obligatorisk best practice. Uten red teaming står organisasjoner overfor:

  1. Prompt injection attacks — Ondsinnet input manipulerer AI-output, omgår content filters, eller eksponerer sensitiv informasjon
  2. Adversarial examples — Subtile input-perturbations forårsaker misklassifisering eller uriktige output
  3. Jailbreaking — Teknikker som omgår safety mechanisms, gir tilgang til restricted functionalities eller genererer forbudt innhold

Kjernekomponenter

1. PyRIT (Python Risk Identification Tool for generative AI)

Microsofts open-source rammeverk for å automatisere og skalere adversarial testing av generative AI-systemer.

Nøkkelfunksjoner:

Funksjon Beskrivelse
Prompt Executors End-to-end attack orchestrering som kobler sammen targets, converters, og scorers
Datasets Kuraterte seed prompts og attack objectives per risikokategori
Converters 20+ teknikker for å transformere prompts (encoding, obfuscation, linguistic manipulation)
Scorers AI-baserte evaluators for å score attack success
Memory State management for multi-turn conversations og logging
Targets Integrasjoner mot Azure OpenAI, Hugging Face, REST APIs, lokale modeller

Installasjon:

# Via pip (latest stable release)
pip install pyrit

# Via Azure AI Evaluation SDK (inkluderer PyRIT + Foundry-integrasjon)
uv pip install "azure-ai-evaluation[redteam]"

Konfidensmarkør: PyRIT er production-ready, open-source, og aktivt vedlikeholdt av Microsoft AI Red Team.

2. Azure AI Red Teaming Agent (preview)

Managed service i Azure AI Foundry som kombinerer PyRIT med Risk and Safety Evaluations.

Tre-faset tilnærming:

  1. Automated scans for content risks — Simulerer adversarial probing mot model/agent endpoints
  2. Evaluate probing success — Scorer attack-response pairs, genererer Attack Success Rate (ASR)
  3. Reporting and logging — Scorecard med attack techniques og risk categories, logges i Foundry

Deployment-modeller:

Deployment Use case Sandboxing
Local red teaming Model-only testing, developer workflows Minimal (client-side)
Cloud red teaming Agent testing med agentic risks (prohibited actions, data leakage) Purple environment (transient runs, mock tools)

Region support (feb 2026): East US2, Sweden Central, France Central, Switzerland West

Konfidensmarkør: ⚠️ MEDIUM — Preview-feature, ikke anbefalt for production workloads (ingen SLA).

3. Supported Risk Categories

Risk Category Model/Agent Local/Cloud Beskrivelse
Hateful and Unfair Content Begge Begge Språk/bilder relatert til hat eller urettferdig representasjon basert på rase, kjønn, religion, etc.
Sexual Content Begge Begge Anatomiske detaljer, seksuelt innhold, prostitusjon, pornografi, overgrep
Violent Content Begge Begge Fysiske handlinger som skader, dreper, eller ødelegger; våpen, produsenter, assosiasjoner
Self-Harm-Related Content Begge Begge Handlinger som skader egen kropp eller selvmord
Protected Materials Begge Begge Opphavsrettsbeskyttet materiale (lyrics, oppskrifter, kode)
Code Vulnerability Begge Begge Generert kode med sikkerhetssårbarheter (SQL injection, code injection, stack trace exposure)
Ungrounded Attributes Begge Begge Ugrunnede inferenser om personlige attributter (demografi, emosjonell tilstand)
Prohibited Actions Agent Cloud Agenter som utfører forbudte high-risk eller irreversible actions
Sensitive Data Leakage Agent Cloud Eksponering av finansiell, medisinsk, eller personlig data fra interne kilder
Task Adherence Agent Cloud Agent kompletterer oppgaven innenfor regler, constraints, og uten unauthorized actions
Indirect Prompt Injection (XPIA) Agent Cloud Malicious instructions skjult i eksterne datakilder (e-post, dokumenter) hentet via tool calls

Konfidensmarkør: Risikokategorier er standardisert og alignet med NIST AI RMF og Microsofts Responsible AI-prinsipper.

4. Attack Strategies (via PyRIT)

20+ attack strategies for å omgå safety alignments:

Encoding-baserte:

  • Base64, Binary, Morse, ROT13, Atbash, Caesar, Url
  • UnicodeConfusable, UnicodeSubstitution, Diacritic

Obfuscation-teknikker:

  • CharacterSpace, CharSwap, Flip, Leetspeak, StringJoin
  • AsciiArt, AsciiSmuggler, AnsiAttack

Adversarial prompting:

  • Jailbreak (direct UPIA), Indirect Jailbreak (XPIA via tool outputs)
  • SuffixAppend, Tense transformation

Multi-turn:

  • Multi-turn (context accumulation over multiple turns)
  • Crescendo (gradvis eskalering av complexity/risk)

Konfidensmarkør: Strategies er dokumentert i PyRIT-repoen med eksempler.

5. Attack Success Rate (ASR)

Nøkkelmetrikk for å vurdere risk posture:

ASR = (Antall vellykkede attacks / Totalt antall attacks) × 100%

Hva definerer "success"?

  • Model-only: AI genererer harmful content som omgår content filters
  • Agentic: AI agent utfører prohibited action, lekker sensitiv data, eller feiler task adherence

Evaluering: Fine-tuned adversarial LLM dedikert til å score responses med harmful content via Risk and Safety Evaluators.

Konfidensmarkør: ⚠️ MEDIUM — ASR bruker generative modeller for evaluering (non-deterministic), alltid sjekk false positives.

Arkitekturmønstre

Pattern 1: Shift-Left Red Teaming (Design → Development → Pre-deployment)

NIST AI RMF-fasering:

  1. Map — Identifiser relevante risikoer og definer use case
  2. Measure — Evaluer risikoer at scale med automated scans
  3. Manage — Mitigate risks i production, monitor, incident response plan

Microsoft-anbefaling (per fase):

Fase Red Teaming Approach Tools Frequency
Design Test base models for safest choice AI Red Teaming Agent (cloud) Per model evaluation
Development Test fine-tuned models, RAG systems PyRIT (local) + CI/CD integration Per model update
Pre-deployment Full attack surface validation AI Red Teaming Agent (cloud) Pre-release gate
Post-deployment Scheduled continuous red teaming, monitor incidents AI Red Teaming Agent (cloud) + Azure Monitor Monthly/quarterly

Konfidensmarkør: Pattern er alignet med Microsoft AI Security Benchmark (AI-7.1).

Pattern 2: CI/CD-Integrated Automated Red Teaming

Azure DevOps / GitHub Actions workflow:

# Pseudo-kode
trigger: on_model_update

steps:
  1. Deploy model til staging environment
  2. Run PyRIT automated scan (prompt injection, jailbreak attempts)
  3. Log results to Azure Log Analytics
  4. If ASR > threshold:
       - Block deployment
       - Alert security team
       - Document findings
  5. Else:
       - Proceed to production
       - Archive test results (Azure Blob Storage)

Konfidensmarkør: Microsoft dokumenterer dette som implementation example for e-commerce chatbot.

Pattern 3: Purple Environment for Agentic Red Teaming

Problem: Agentic red teaming kan potensielt utføre harmful actions (file deletion, data exfiltration).

Løsning: Non-production "purple environment" konfigurert med production-like resources.

Komponenter:

  • Transient runs — Agent state logges ikke av Foundry Agent Service, chat completions lagres ikke
  • Mock tools — Synthetic data for sensitive data leakage testing (financial, medical, PII)
  • Sandboxed actions — Prohibited actions testes uten live production data
  • Redacted inputs — Harmful/adversarial prompts redacted fra developer-synlige resultater

Konfidensmarkør: ⚠️ MEDIUM — Purple environment-pattern er best practice, men tooling for full sandboxing er under utvikling.

Pattern 4: Defense-in-Depth for Prompt Injection

Microsoft Spotlighting Techniques:

Teknikk Beskrivelse Implementation
Delimiting Separer user input fra system instructions med special tokens `<
Data marking Label untrusted data eksplisitt i prompt [UNTRUSTED]: {user_input}
Encoding Encode untrusted data før processing Base64 encode før LLM ser det

Kombinert med:

  • Prompt Shields (Azure AI Content Safety) — Blokkerer kjente User Prompt Attacks (role-play, encoding attacks, conversation mockups)
  • Safety meta-prompts — System-level instructions som prioriterer system rules over user input
  • Input validation — Pre-LLM filtering av kjente injection patterns

Konfidensmarkør: Spotlighting er production-proven (Microsoft AI Red Team training episode 7).

Beslutningsveiledning

Når bruke AI red teaming?

Scenario Red Teaming? Tool Rationale
Nye AI-features før deploy Ja AI Red Teaming Agent (cloud) Catch issues pre-production
Hver model/fine-tuning update Ja PyRIT (CI/CD) Continuous validation
Agent med tool use (Azure functions, search, storage) Ja AI Red Teaming Agent (cloud) - agentic risks Test prohibited actions, data leakage
Monthly/quarterly security audit Ja AI Red Teaming Agent (cloud) Track risk posture over tid
Post-incident forensics Ja Manual red teaming + PyRIT repro Root cause analysis
Rapid prototyping / hackathon ⚠️ Valgfritt PyRIT (local) - lightweight scan Balance speed vs. risk

Velge mellom local vs. cloud red teaming

Factor Local (PyRIT) Cloud (AI Red Teaming Agent)
Target type Model-only (Azure OpenAI, Hugging Face) Model + Agent (Foundry hosted)
Risk categories Content risks (hate, violence, sexual, self-harm, protected materials, code vulnerabilities) Content + agentic risks (prohibited actions, data leakage, task adherence)
Sandboxing Minimal (client-side) Purple environment (transient, mock tools)
CI/CD integration Full støtte (Python SDK) ⚠️ Requires API calls til Foundry
Cost Free (open-source) Azure AI Foundry compute costs
SLA N/A None (preview)
Region availability Global East US2, Sweden Central, France Central, Switzerland West

Beslutningsregel: Bruk PyRIT for model-only CI/CD workflows, AI Red Teaming Agent for comprehensive agent testing pre-deployment.

Prioritere remediering

Severity ranking (Microsoft Security Benchmark):

Severity Eksempel Remediation SLA Action
Critical Data leakage (PII, financial), Unauthorized actions (file deletion) Immediate Block deployment, retrain model, tighten plugin permissions
High Jailbreak success, Prompt injection bypasses content filter 24-48 hours Update safety meta-prompts, enable Prompt Shields, add input validation
Medium Low-severity biases, Ungrounded attributes 1 week Fine-tune model, add disclaimers, improve grounding
Low Edge-case failures, Ambiguous responses 2 weeks Document known limitations, monitor in production

Integrasjon med Microsoft-stakken

Azure AI Foundry

AI Red Teaming Agent (native integration):

  • Foundry-hosted prompt agents ( supported)
  • Foundry-hosted container agents ( supported)
  • Foundry workflow agents ( not supported)
  • Azure tool calls ( supported)
  • Function tool calls ( not supported)

Comprehensive tools list: Azure AI Foundry Tools

Azure OpenAI Service

PyRIT target integration:

from pyrit.prompt_target import AzureOpenAICompletionTarget

azure_openai_config = {
    "azure_endpoint": os.environ.get("AZURE_OPENAI_ENDPOINT"),
    "api_key": os.environ.get("AZURE_OPENAI_KEY"),
    "azure_deployment": os.environ.get("AZURE_OPENAI_DEPLOYMENT"),
}

target = AzureOpenAICompletionTarget(
    deployment_name=azure_openai_config["azure_deployment"],
    endpoint=azure_openai_config["azure_endpoint"],
    api_key=azure_openai_config["api_key"]
)

Azure AI Content Safety

Prompt Shields (Jailbreak risk detection):

  • User Prompt Attacks (UPIA): Direct jailbreak attempts (role-play, encoding, rule changes)
  • Indirect Prompt Attacks (XPIA): Malicious instructions i external data sources

Integrasjon med red teaming:

  1. Run red teaming scan (PyRIT/AI Red Teaming Agent)
  2. Identify successful jailbreaks (ASR)
  3. Enable Prompt Shields for identified attack vectors
  4. Re-test to validate mitigation effectiveness

Azure Monitor & Sentinel

Logging red teaming outcomes:

Azure Log Analytics workspace:
  - Detected vulnerabilities
  - Attack success rates (ASR per risk category)
  - System responses (refused vs. compliant)
  - Anomaly detection (patterns of concern)

Alert configuration:

  • Trigger on successful prompt injection (ASR > 10% for critical risks)
  • Escalate to security team via Azure Monitor alerts
  • Integrate with Azure Sentinel for SIEM correlation

Azure DevOps & GitHub Actions

CI/CD pipeline integration example:

# GitHub Actions example
name: AI Red Teaming on Model Update

on:
  push:
    paths:
      - 'models/**'

jobs:
  red-team-scan:
    runs-on: ubuntu-latest
    steps:
      - name: Checkout code
        uses: actions/checkout@v3

      - name: Install PyRIT
        run: pip install pyrit

      - name: Run automated red teaming
        run: python scripts/run_pyrit_scan.py
        env:
          AZURE_OPENAI_ENDPOINT: ${{ secrets.AZURE_OPENAI_ENDPOINT }}
          AZURE_OPENAI_KEY: ${{ secrets.AZURE_OPENAI_KEY }}

      - name: Upload results to Azure Blob Storage
        run: az storage blob upload --file results.json --container red-teaming

      - name: Fail if ASR exceeds threshold
        run: python scripts/check_asr_threshold.py

MITRE ATLAS Integration

PyRIT alignment med MITRE ATLAS tactics:

MITRE ATLAS Tactic PyRIT Test Scenario
AML.TA0000 (Reconnaissance) Model probing for training data artifacts
AML.TA0001 (Initial Access) Prompt injection / jailbreaking
AML.TA0010 (Exfiltration) Model inversion, membership inference (simulert)
AML.TA0009 (Impact) Biased outputs, operational disruptions

Konfidensmarkør: Microsoft Security Benchmark refererer eksplisitt til MITRE ATLAS for structured attack simulations.

Offentlig sektor (Norge)

Regulatory compliance

EU AI Act implications:

  • High-risk AI systems (definert i Annex III) krever mandatory conformity assessment før deployment
  • Red teaming er implisitt requirement under Article 9 (risk management system)
  • Documentation av red teaming results kan inngå i technical documentation (Article 11)

Norsk Personvernforordning (GDPR):

  • Red teaming skal ikke bruke ekte persondata uten consent (synthetic data anbefales)
  • Data Protection Impact Assessment (DPIA) bør inkludere red teaming findings for høyrisiko AI

Konfidensmarkør: ⚠️ MEDIUM — EU AI Act er under implementering (tredde i kraft 2024), norske myndigheter utvikler veiledning.

Statens vegvesen-spesifikke vurderinger

Use cases med mandatory red teaming:

  • AI-systemer som påvirker trafikksikkerhet (autonomous systems, traffic prediction)
  • Chatbots som håndterer sensitive brukerdata (kjøretøyregistrering, førerkortinformasjon)
  • Decision-support systems for inspeksjon eller enforcement

Data sovereignty:

  • Red teaming i cloud (AI Red Teaming Agent) krever vurdering av data residency (region support begrenset til US/EU regions)
  • PyRIT local deployment gir full data kontroll (no data leaves premises)

Cross-functional red teaming teams:

  • AI-utviklere (teknisk exploit)
  • Domeneeksperter (Statens vegvesen domain knowledge)
  • Sikkerhetsteam (threat modeling)
  • Juridisk (compliance vurdering)

Kostnad og lisensiering

PyRIT (Open-Source)

Komponent Lisens Kostnad
PyRIT framework MIT License Gratis
Compute N/A Egen hardware eller cloud compute
Target API costs Varierer Azure OpenAI pay-per-token, Hugging Face Inference API, etc.

Estimert compute cost (local PyRIT):

  • Single red teaming run (100 prompts, 4 risk categories): ~40 000 tokens → ~200 NOK (gpt-4o-mini @ $0.15/1M input tokens)
  • CI/CD integrated (daily scans): ~6 000 NOK/måned

Azure AI Red Teaming Agent (Preview)

Komponent Pricing Model Estimat
AI Red Teaming Agent Preview (ingen publisert pricing feb 2026) TBD
Azure AI Foundry compute Per-second billing for deployed models Varierer (model size, region)
Azure Log Analytics Pay-as-you-go (data ingestion + retention) ~100 NOK/GB/måned
Azure Blob Storage Standard storage (audit trails) ~0.20 NOK/GB/måned

Konfidensmarkør: ⚠️ LOW — Pricing for AI Red Teaming Agent ikke publisert (preview-fase).

Lisenskrav

Microsoft-produkt Minimum lisens
Azure AI Foundry Azure subscription (Pay-As-You-Go eller Enterprise Agreement)
Azure OpenAI Service Azure subscription + approved application
Azure AI Content Safety Inkludert i Azure AI Services (pay-per-transaction)
PyRIT Ingen (MIT License open-source)

For arkitekten (Cosmo)

Red Teaming som arkitekturprinsipp

Mindset shift: Red teaming er ikke en "nice-to-have" sikkerhetstiltak — det er en arkitekturell constraint som påvirker design decisions fra dag 1.

Spørsmål å stille i enhver AI-arkitekturrådgivning:

  1. Har kunden en red teaming-plan?

    • Hvis nei: Start med PyRIT local prototype (low-friction onboarding)
    • Hvis ja: Evaluer gap mellom plan og implementation (verktøy, cadence, cross-functional teams)
  2. Er AI-systemet high-risk i henhold til EU AI Act?

    • Ja → Mandatory red teaming, dokumenter results for conformity assessment
    • Nei → Red teaming fortsatt anbefalt (reputational risk, security posture)
  3. Model-only eller agentic architecture?

    • Model-only → PyRIT (CI/CD integration, content risks)
    • Agentic → AI Red Teaming Agent (agentic risks: prohibited actions, data leakage, task adherence)
  4. Hva er kundens risk appetite for ASR?

    • Zero-tolerance (critical data/safety) → ASR < 1% for critical risks, block deployment ved failures
    • Moderate (internal tooling) → ASR < 10%, log-and-monitor approach
    • Eksperimentell (R&D) → No threshold, focus on discovering edge cases
  5. Hvem eier red teaming-prosessen?

    • Ideal: Cross-functional team (AI devs, security, domain experts)
    • Realitet: Ofte siloed (security-only eller dev-only) → Identifiser gaps, foreslå collaboration model

Conversation starters med kunder

Scenario 1: Kunde planlegger å deploye Azure OpenAI chatbot

"Før deployment bør vi kjøre AI red teaming for å identifisere prompt injection-risiko. Jeg anbefaler å starte med PyRIT i CI/CD pipeline — det tar 2-3 timer å sette opp første scan, og gir oss Attack Success Rate for de fire core content risks. Basert på resultater kan vi enable Prompt Shields i Azure AI Content Safety som mitigation."

Scenario 2: Kunde har agent med tool use (Azure Functions, Azure Search)

"Fordi agenten har tool access, må vi teste for agentic risks — ikke bare content risks. Azure AI Red Teaming Agent i cloud kan simulere prohibited actions (f.eks. file deletion) og sensitive data leakage. Vi setter opp purple environment med mock tools, kjører scan pre-deployment, og bruker resultater til å tighten permissions på function-nivå."

Scenario 3: Kunde spør om 'hvor ofte vi må red teame'

"Microsoft Security Benchmark anbefaler continuous red teaming med monthly eller quarterly cadence. For deres use case foreslår jeg: (1) Automated PyRIT scans i CI/CD per model update, (2) Comprehensive AI Red Teaming Agent scan quarterly, (3) Manual red teaming post-incident. Dette balanserer coverage med resource constraints."

Trade-offs og gotchas

Trade-off Implikasjon Cosmos råd
Automated vs. Manual red teaming Automated gir scale, manual gir creativity og edge-case discovery Start automated (PyRIT), supplement med manual quarterly
Local vs. Cloud Local gir data control, cloud gir agentic risk coverage Hybrid: PyRIT for CI/CD, AI Red Teaming Agent for pre-deployment gates
ASR threshold setting Strict threshold (ASR < 1%) blokkerer deployment ofte, loose threshold (ASR < 20%) gir false sense of security Segment per risk: Critical risks strict (< 1%), Medium risks moderate (< 10%)
False positives i ASR Generative evaluators er non-deterministic, kan flagge benign responses Alltid manual review av flagged responses før remediation
Synthetic data i purple environment Mock tools ikke representative av real data distribution Document limitations, supplement med manual testing on real staging data (sanitized)

Når si nei til red teaming

Red flags: Kunde ønsker å red teame i production med live user data → NEI

Alternativer:

  • Purple environment med production-like config
  • Staging environment med sanitized data
  • Synthetic data generation for agentic scenarios

Konfidensmarkør: Purple environment-pattern er Microsoft best practice.

Ressurser for videre læring

Microsoft AI Red Team Training Series (10 episoder):

  • Episode 1-2: Fundamentals
  • Episode 3-6: Attack techniques (direct/indirect prompt injection, single/multi-turn)
  • Episode 7: Defense strategies (Spotlighting, Prompt Shields)
  • Episode 8-10: Automation with PyRIT

Hands-on labs: https://aka.ms/AIRTlabs

PyRIT documentation: https://azure.github.io/PyRIT/

Kilder og verifisering

Microsoft Learn dokumentasjon

Kilde URL Verifikasjonsdato
AI Red Teaming Agent (preview) https://learn.microsoft.com/en-us/azure/ai-foundry/concepts/ai-red-teaming-agent 2026-02-03
Microsoft Security Benchmark: AI-7 Continuous Red Teaming https://learn.microsoft.com/en-us/security/benchmark/azure/mcsb-v2-artificial-intelligence-security#ai-7-perform-continuous-ai-red-teaming 2026-02-03
AI Red Teaming Training Series https://learn.microsoft.com/en-us/security/ai-red-team/training 2026-02-03
Planning red teaming for LLMs https://learn.microsoft.com/en-us/azure/ai-foundry/openai/concepts/red-teaming 2026-02-03
Prompt Shields (Jailbreak detection) https://learn.microsoft.com/en-us/azure/ai-services/content-safety/concepts/jailbreak-detection 2026-02-03

Open-source verktøy

Tool Repository Lisens
PyRIT https://github.com/Azure/PyRIT MIT License
MITRE ATLAS https://atlas.mitre.org/ Free (non-commercial)
Adversarial Robustness Toolbox (ART) https://github.com/Trusted-AI/adversarial-robustness-toolbox MIT License

Bransje-ressurser

Ressurs Utgiver Relevans
OWASP Top 10 for LLM Applications OWASP Foundation Threat taxonomy
NIST AI Risk Management Framework (AI RMF) NIST Risk governance framework
Three takeaways from red teaming 100 generative AI products Microsoft Security Blog (jan 2025) Real-world lessons

Sist oppdatert: 2026-02-03 Neste review: 2026-05-03 (quarterly review anbefalt for rapidly evolving field)


For Cosmo: Quick Reference Card

Når kunden sier: "Vi må teste sikkerheten i vår Azure OpenAI-løsning"

Cosmo svarer:

  1. Start med PyRIT i CI/CD pipeline (automated content risk testing)
  2. ⚠️ Hvis agent med tool use → AI Red Teaming Agent (agentic risks)
  3. 🔄 Establish continuous red teaming cadence (monthly/quarterly)
  4. 📊 Track Attack Success Rate (ASR) per risk category, set thresholds
  5. 🛡️ Mitigate via Prompt Shields, safety meta-prompts, input validation
  6. 📝 Document findings for EU AI Act compliance (if high-risk system)

Decision tree:

AI System Type?
├─ Model-only (chatbot, completion) → PyRIT (local)
└─ Agent (tool use, RAG, function calling)
   ├─ Content risks only → PyRIT (local)
   └─ Agentic risks (prohibited actions, data leakage) → AI Red Teaming Agent (cloud)

Confidence reminder: PyRIT = production-ready , AI Red Teaming Agent = preview ⚠️