# Red Teaming AI Models - Adversarial Testing & Security

**Date:** 2026-02-03
**Category:** Responsible AI & Governance
**Audience:** Architects, security teams, AI developers
**Confidence:** ✅ HIGH. Based on official Microsoft documentation (Feb 2026)

## Introduction

AI red teaming is a proactive security method for identifying vulnerabilities in generative AI systems through simulated adversarial testing. Unlike traditional cybersecurity red teaming (which focuses on the cyber kill chain), AI red teaming covers both security and content risks, and simulates adversarial users who try to make the AI system behave in unwanted ways.

**Core principle:** Continuous AI red teaming integrated into the development lifecycle identifies vulnerabilities before malicious actors exploit them. Without systematic adversarial testing, organizations deploy AI systems with unknown weaknesses that can be exploited via prompt injection, model poisoning, or jailbreaking.

### Why AI red teaming is critical

Microsoft Security Benchmark (AI-7) defines continuous AI red teaming as a mandatory best practice. Without red teaming, organizations face:

1. **Prompt injection attacks**: Malicious input manipulates AI output, bypasses content filters, or exposes sensitive information
2. **Adversarial examples**: Subtle input perturbations cause misclassification or incorrect output
3. **Jailbreaking**: Techniques that bypass safety mechanisms, giving access to restricted functionality or generating prohibited content

## Core components

### 1. PyRIT (Python Risk Identification Tool for generative AI)

Microsoft's open-source framework for automating and scaling adversarial testing of generative AI systems.

**Key features:**

| Feature | Description |
|---------|-------------|
| **Prompt Executors** | End-to-end attack orchestration that connects targets, converters, and scorers |
| **Datasets** | Curated seed prompts and attack objectives per risk category |
| **Converters** | 20+ techniques for transforming prompts (encoding, obfuscation, linguistic manipulation) |
| **Scorers** | AI-based evaluators that score attack success |
| **Memory** | State management for multi-turn conversations and logging |
| **Targets** | Integrations with Azure OpenAI, Hugging Face, REST APIs, local models |

**Installation:**

```bash
# Via pip (latest stable release)
pip install pyrit

# Via Azure AI Evaluation SDK (includes PyRIT + Foundry integration)
uv pip install "azure-ai-evaluation[redteam]"
```

**Confidence marker:** ✅ PyRIT is production-ready, open-source, and actively maintained by the Microsoft AI Red Team.

### 2. Azure AI Red Teaming Agent (preview)

Managed service in Azure AI Foundry that combines PyRIT with Risk and Safety Evaluations.

**Three-phase approach:**

1. **Automated scans for content risks**: Simulates adversarial probing against model/agent endpoints
2. **Evaluate probing success**: Scores attack-response pairs and produces the Attack Success Rate (ASR)
3. **Reporting and logging**: Scorecard with attack techniques and risk categories, logged in Foundry

**Deployment models:**

| Deployment | Use case | Sandboxing |
|------------|----------|------------|
| **Local red teaming** | Model-only testing, developer workflows | Minimal (client-side) |
| **Cloud red teaming** | Agent testing with agentic risks (prohibited actions, data leakage) | Purple environment (transient runs, mock tools) |

**Region support (Feb 2026):** East US2, Sweden Central, France Central, Switzerland West

**Confidence marker:** ⚠️ MEDIUM. Preview feature, not recommended for production workloads (no SLA).

### 3. Supported Risk Categories

| Risk Category | Model/Agent | Local/Cloud | Description |
|---------------|-------------|-------------|-------------|
| **Hateful and Unfair Content** | Both | Both | Language or images related to hate or unfair representation based on race, gender, religion, etc. |
| **Sexual Content** | Both | Both | Anatomical details, sexual content, prostitution, pornography, abuse |
| **Violent Content** | Both | Both | Physical actions that injure, kill, or destroy; weapons, manufacturers, associations |
| **Self-Harm-Related Content** | Both | Both | Actions that harm one's own body, or suicide |
| **Protected Materials** | Both | Both | Copyrighted material (lyrics, recipes, code) |
| **Code Vulnerability** | Both | Both | Generated code with security vulnerabilities (SQL injection, code injection, stack trace exposure) |
| **Ungrounded Attributes** | Both | Both | Ungrounded inferences about personal attributes (demographics, emotional state) |
| **Prohibited Actions** | **Agent** | **Cloud** | Agents that perform prohibited high-risk or irreversible actions |
| **Sensitive Data Leakage** | **Agent** | **Cloud** | Exposure of financial, medical, or personal data from internal sources |
| **Task Adherence** | **Agent** | **Cloud** | Whether the agent completes the task within rules and constraints, without unauthorized actions |
| **Indirect Prompt Injection (XPIA)** | **Agent** | **Cloud** | Malicious instructions hidden in external data sources (email, documents) retrieved via tool calls |

**Confidence marker:** ✅ Risk categories are standardized and aligned with the NIST AI RMF and Microsoft's Responsible AI principles.

### 4. Attack Strategies (via PyRIT)

20+ attack strategies for bypassing safety alignment:

**Encoding-based:**
- Base64, Binary, Morse, ROT13, Atbash, Caesar, Url
- UnicodeConfusable, UnicodeSubstitution, Diacritic

**Obfuscation techniques:**
- CharacterSpace, CharSwap, Flip, Leetspeak, StringJoin
- AsciiArt, AsciiSmuggler, AnsiAttack

**Adversarial prompting:**
- Jailbreak (direct UPIA), Indirect Jailbreak (XPIA via tool outputs)
- SuffixAppend, Tense transformation

**Multi-turn:**
- Multi-turn (context accumulation over multiple turns)
- Crescendo (gradual escalation of complexity/risk)

**Confidence marker:** ✅ Strategies are documented in the PyRIT repository with examples.

### 5. Attack Success Rate (ASR)

Key metric for assessing risk posture:

```
ASR = (Number of successful attacks / Total number of attacks) × 100%
```

**What counts as "success"?**
- Model-only: The AI generates harmful content that bypasses content filters
- Agentic: The AI agent performs a prohibited action, leaks sensitive data, or fails task adherence

**Evaluation:** A fine-tuned adversarial LLM is dedicated to scoring responses for harmful content via the Risk and Safety Evaluators.

**Confidence marker:** ⚠️ MEDIUM. ASR relies on generative models for evaluation (non-deterministic), so always check for false positives.

## Architecture patterns

### Pattern 1: Shift-Left Red Teaming (Design → Development → Pre-deployment)

**NIST AI RMF phases:**

1. **Map**: Identify relevant risks and define the use case
2. **Measure**: Evaluate risks at scale with automated scans
3. **Manage**: Mitigate risks in production, monitor, and maintain an incident response plan

**Microsoft recommendation (per phase):**

| Phase | Red Teaming Approach | Tools | Frequency |
|-------|----------------------|-------|-----------|
| **Design** | Test base models for safest choice | AI Red Teaming Agent (cloud) | Per model evaluation |
| **Development** | Test fine-tuned models, RAG systems | PyRIT (local) + CI/CD integration | Per model update |
| **Pre-deployment** | Full attack surface validation | AI Red Teaming Agent (cloud) | Pre-release gate |
| **Post-deployment** | Scheduled continuous red teaming, monitor incidents | AI Red Teaming Agent (cloud) + Azure Monitor | Monthly/quarterly |

**Confidence marker:** ✅ The pattern is aligned with the Microsoft AI Security Benchmark (AI-7.1).

### Pattern 2: CI/CD-Integrated Automated Red Teaming

**Azure DevOps / GitHub Actions workflow:**

```yaml
# Pseudocode: pipeline outline, not a runnable workflow definition
trigger: on_model_update
steps:
  - deploy: model to staging environment
  - scan: PyRIT automated scan (prompt injection, jailbreak attempts)
  - log: results to Azure Log Analytics
  - gate:
      if: ASR > threshold
      then:
        - block deployment
        - alert security team
        - document findings
      else:
        - proceed to production
        - archive test results (Azure Blob Storage)
```

**Confidence marker:** ✅ Microsoft documents this as an implementation example for an e-commerce chatbot.

### Pattern 3: Purple Environment for Agentic Red Teaming

**Problem:** Agentic red teaming can potentially perform harmful actions (file deletion, data exfiltration).

**Solution:** A non-production "purple environment" configured with production-like resources.

**Components:**
- **Transient runs**: Agent state is not logged by Foundry Agent Service, and chat completions are not stored
- **Mock tools**: Synthetic data for sensitive data leakage testing (financial, medical, PII)
- **Sandboxed actions**: Prohibited actions are tested without live production data
- **Redacted inputs**: Harmful/adversarial prompts are redacted from developer-visible results

**Confidence marker:** ⚠️ MEDIUM. The purple environment pattern is best practice, but tooling for full sandboxing is still under development.

### Pattern 4: Defense-in-Depth for Prompt Injection

**Microsoft Spotlighting Techniques:**

| Technique | Description | Implementation |
|-----------|-------------|----------------|
| **Delimiting** | Separate user input from system instructions with special tokens | `<\|user\|>...<\|/user\|>` wrapper |
| **Data marking** | Label untrusted data explicitly in the prompt | `[UNTRUSTED]: {user_input}` |
| **Encoding** | Encode untrusted data before processing | Base64-encode before the LLM sees it |

**Combined with:**
- **Prompt Shields** (Azure AI Content Safety): Blocks known User Prompt Attacks (role-play, encoding attacks, conversation mockups)
- **Safety meta-prompts**: System-level instructions that prioritize system rules over user input
- **Input validation**: Pre-LLM filtering of known injection patterns

**Confidence marker:** ✅ Spotlighting is production-proven (Microsoft AI Red Team training, episode 7).

## Decision guidance

### When to use AI red teaming

| Scenario | Red Teaming? | Tool | Rationale |
|----------|--------------|------|-----------|
| New AI features before deployment | ✅ **Yes** | AI Red Teaming Agent (cloud) | Catch issues pre-production |
| Every model/fine-tuning update | ✅ **Yes** | PyRIT (CI/CD) | Continuous validation |
| Agent with tool use (Azure Functions, search, storage) | ✅ **Yes** | AI Red Teaming Agent (cloud), agentic risks | Test prohibited actions, data leakage |
| Monthly/quarterly security audit | ✅ **Yes** | AI Red Teaming Agent (cloud) | Track risk posture over time |
| Post-incident forensics | ✅ **Yes** | Manual red teaming + PyRIT repro | Root cause analysis |
| Rapid prototyping / hackathon | ⚠️ **Optional** | PyRIT (local), lightweight scan | Balance speed vs. risk |

### Choosing between local and cloud red teaming

| Factor | Local (PyRIT) | Cloud (AI Red Teaming Agent) |
|--------|---------------|-------------------------------|
| **Target type** | Model-only (Azure OpenAI, Hugging Face) | Model + Agent (Foundry hosted) |
| **Risk categories** | Content risks (hate, violence, sexual, self-harm, protected materials, code vulnerabilities) | Content + agentic risks (prohibited actions, data leakage, task adherence) |
| **Sandboxing** | Minimal (client-side) | Purple environment (transient, mock tools) |
| **CI/CD integration** | ✅ Full support (Python SDK) | ⚠️ Requires API calls to Foundry |
| **Cost** | Free (open-source) | Azure AI Foundry compute costs |
| **SLA** | N/A | None (preview) |
| **Region availability** | Global | East US2, Sweden Central, France Central, Switzerland West |

**Decision rule:** Use PyRIT for model-only CI/CD workflows, and the AI Red Teaming Agent for comprehensive agent testing pre-deployment.

### Prioritizing remediation

**Severity ranking (Microsoft Security Benchmark):**

| Severity | Example | Remediation SLA | Action |
|----------|---------|-----------------|--------|
| **Critical** | Data leakage (PII, financial), unauthorized actions (file deletion) | Immediate | Block deployment, retrain model, tighten plugin permissions |
| **High** | Jailbreak success, prompt injection bypasses content filter | 24-48 hours | Update safety meta-prompts, enable Prompt Shields, add input validation |
| **Medium** | Low-severity biases, ungrounded attributes | 1 week | Fine-tune model, add disclaimers, improve grounding |
| **Low** | Edge-case failures, ambiguous responses | 2 weeks | Document known limitations, monitor in production |

## Integration with the Microsoft stack

### Azure AI Foundry

**AI Red Teaming Agent (native integration):**
- Foundry-hosted prompt agents (✅ supported)
- Foundry-hosted container agents (✅ supported)
- Foundry workflow agents (❌ not supported)
- Azure tool calls (✅ supported)
- Function tool calls (❌ not supported)

**Comprehensive tools list:** [Azure AI Foundry Tools](https://learn.microsoft.com/en-us/azure/ai-foundry/agents/how-to/tools/overview)

### Azure OpenAI Service

**PyRIT target integration:**

```python
import os

# Class and parameter names follow the original example. They differ between
# PyRIT releases, so verify them against the installed version.
from pyrit.prompt_target import AzureOpenAICompletionTarget

# Read connection settings from the environment instead of hard-coding secrets.
azure_openai_config = {
    "azure_endpoint": os.environ.get("AZURE_OPENAI_ENDPOINT"),
    "api_key": os.environ.get("AZURE_OPENAI_KEY"),
    "azure_deployment": os.environ.get("AZURE_OPENAI_DEPLOYMENT"),
}

target = AzureOpenAICompletionTarget(
    deployment_name=azure_openai_config["azure_deployment"],
    endpoint=azure_openai_config["azure_endpoint"],
    api_key=azure_openai_config["api_key"],
)
```

### Azure AI Content Safety

**Prompt Shields (jailbreak risk detection):**
- **User Prompt Attacks (UPIA):** Direct jailbreak attempts (role-play, encoding, rule changes)
- **Indirect Prompt Attacks (XPIA):** Malicious instructions in external data sources

**Integration with red teaming:**

1. Run a red teaming scan (PyRIT/AI Red Teaming Agent)
2. Identify successful jailbreaks (ASR)
3. Enable Prompt Shields for the identified attack vectors
4. Re-test to validate mitigation effectiveness

### Azure Monitor & Sentinel

**Logging red teaming outcomes:**

```
Azure Log Analytics workspace:
- Detected vulnerabilities
- Attack success rates (ASR per risk category)
- System responses (refused vs. compliant)
- Anomaly detection (patterns of concern)
```

**Alert configuration:**
- Trigger on successful prompt injection (ASR > 10% for critical risks)
- Escalate to the security team via Azure Monitor alerts
- Integrate with Azure Sentinel for SIEM correlation

### Azure DevOps & GitHub Actions

**CI/CD pipeline integration example:**

```yaml
# GitHub Actions example
name: AI Red Teaming on Model Update
on:
  push:
    paths:
      - 'models/**'
jobs:
  red-team-scan:
    runs-on: ubuntu-latest
    steps:
      - name: Checkout code
        uses: actions/checkout@v3
      - name: Install PyRIT
        run: pip install pyrit
      - name: Run automated red teaming
        run: python scripts/run_pyrit_scan.py
        env:
          AZURE_OPENAI_ENDPOINT: ${{ secrets.AZURE_OPENAI_ENDPOINT }}
          AZURE_OPENAI_KEY: ${{ secrets.AZURE_OPENAI_KEY }}
      - name: Upload results to Azure Blob Storage
        # Assumes Azure CLI login and the storage account are configured elsewhere.
        run: az storage blob upload --file results.json --container-name red-teaming --name results.json
      - name: Fail if ASR exceeds threshold
        run: python scripts/check_asr_threshold.py
```

### MITRE ATLAS Integration

**PyRIT alignment with MITRE ATLAS tactics:**

| MITRE ATLAS Tactic | PyRIT Test Scenario |
|--------------------|---------------------|
| **AML.TA0002 (Reconnaissance)** | Model probing for training data artifacts |
| **AML.TA0004 (Initial Access)** | Prompt injection / jailbreaking |
| **AML.TA0010 (Exfiltration)** | Model inversion, membership inference (simulated) |
| **AML.TA0011 (Impact)** | Biased outputs, operational disruptions |

**Confidence marker:** ✅ Microsoft Security Benchmark explicitly references MITRE ATLAS for structured attack simulations.
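
The GitHub Actions example earlier in this section ends by calling `scripts/check_asr_threshold.py`, which the source does not show. The following is a minimal sketch of what such a gate could compute, assuming a hypothetical results format (a list of records with `risk_category` and `attack_success` keys) and hypothetical threshold values. It is not the output format of PyRIT or the AI Red Teaming Agent, so the record parsing would need to be adapted to the real scan output.

```python
from collections import defaultdict

# Hypothetical per-category ASR limits in percent (assumptions for illustration).
THRESHOLDS = {"prohibited_actions": 1.0, "sensitive_data_leakage": 1.0}
DEFAULT_THRESHOLD = 10.0


def asr_by_category(records):
    """Return ASR in percent per risk category.

    Each record is assumed to look like
    {"risk_category": "violence", "attack_success": True}.
    """
    totals = defaultdict(int)
    successes = defaultdict(int)
    for record in records:
        category = record["risk_category"]
        totals[category] += 1
        if record["attack_success"]:
            successes[category] += 1
    # ASR = successful attacks / total attacks * 100
    return {c: 100.0 * successes[c] / totals[c] for c in totals}


def failing_categories(asr, thresholds=THRESHOLDS, default=DEFAULT_THRESHOLD):
    """Return the sorted categories whose ASR is above the allowed limit."""
    return sorted(c for c, rate in asr.items() if rate > thresholds.get(c, default))


def gate_exit_code(records):
    """Return 1 (fail the pipeline step) if any category exceeds its limit, else 0."""
    return 1 if failing_categories(asr_by_category(records)) else 0
```

In a pipeline step, a wrapper script would load `results.json` with `json.load`, pass the records to `gate_exit_code`, and hand the result to `sys.exit`. Because the scorers are non-deterministic, flagged responses still deserve manual review before a failed gate triggers remediation work.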

## Public sector (Norway)

### Regulatory compliance

**EU AI Act implications:**
- High-risk AI systems (defined in Annex III) require a mandatory conformity assessment before deployment
- Red teaming is an implicit requirement under Article 9 (risk management system)
- Documentation of red teaming results can be included in the technical documentation (Article 11)

**GDPR (Norwegian personvernforordningen):**
- Red teaming must not use real personal data without consent (synthetic data is recommended)
- The Data Protection Impact Assessment (DPIA) should include red teaming findings for high-risk AI

**Confidence marker:** ⚠️ MEDIUM. The EU AI Act is being implemented (it entered into force in 2024), and Norwegian authorities are developing guidance.

### Considerations specific to Statens vegvesen

**Use cases with mandatory red teaming:**
- AI systems that affect traffic safety (autonomous systems, traffic prediction)
- Chatbots that handle sensitive user data (vehicle registration, driving licence information)
- Decision-support systems for inspection or enforcement

**Data sovereignty:**
- Red teaming in the cloud (AI Red Teaming Agent) requires an assessment of data residency (region support is limited to US and European regions)
- Local PyRIT deployment gives full data control (no data leaves the premises)

**Cross-functional red teaming teams:**
- AI developers (technical exploits)
- Domain experts (Statens vegvesen domain knowledge)
- Security team (threat modeling)
- Legal (compliance assessment)

## Cost and licensing

### PyRIT (Open-Source)

| Component | License | Cost |
|-----------|---------|------|
| **PyRIT framework** | MIT License | Free |
| **Compute** | N/A | Own hardware or cloud compute |
| **Target API costs** | Varies | Azure OpenAI pay-per-token, Hugging Face Inference API, etc. |

**Estimated compute cost (local PyRIT):**
- Single red teaming run (100 prompts, 4 risk categories): ~40,000 tokens, estimated at ~200 NOK. Token charges alone are far lower: 40,000 input tokens of gpt-4o-mini at $0.15 per 1M input tokens cost about $0.006, so treat the NOK figure as a rough planning number, not a token-price calculation
- CI/CD integrated (daily scans): ~6,000 NOK/month (30 runs at the ~200 NOK planning figure)

### Azure AI Red Teaming Agent (Preview)

| Component | Pricing Model | Estimate |
|-----------|---------------|----------|
| **AI Red Teaming Agent** | Preview (no published pricing as of Feb 2026) | TBD |
| **Azure AI Foundry compute** | Per-second billing for deployed models | Varies (model size, region) |
| **Azure Log Analytics** | Pay-as-you-go (data ingestion + retention) | ~100 NOK/GB/month |
| **Azure Blob Storage** | Standard storage (audit trails) | ~0.20 NOK/GB/month |

**Confidence marker:** ⚠️ LOW. Pricing for the AI Red Teaming Agent has not been published (preview phase).

### License requirements

| Microsoft product | Minimum license |
|-------------------|-----------------|
| **Azure AI Foundry** | Azure subscription (Pay-As-You-Go or Enterprise Agreement) |
| **Azure OpenAI Service** | Azure subscription + approved application |
| **Azure AI Content Safety** | Included in Azure AI Services (pay-per-transaction) |
| **PyRIT** | None (MIT License, open-source) |

## For the architect (Cosmo)

### Red teaming as an architecture principle

**Mindset shift:** Red teaming is not a "nice-to-have" security measure. It is an **architectural constraint** that affects design decisions from day 1.

**Questions to ask in any AI architecture engagement:**

1. **Does the customer have a red teaming plan?**
   - If no: Start with a local PyRIT prototype (low-friction onboarding)
   - If yes: Evaluate the gap between plan and implementation (tools, cadence, cross-functional teams)

2. **Is the AI system high-risk under the EU AI Act?**
   - Yes → Mandatory red teaming, document results for the conformity assessment
   - No → Red teaming still recommended (reputational risk, security posture)

3. **Model-only or agentic architecture?**
   - Model-only → PyRIT (CI/CD integration, content risks)
   - Agentic → AI Red Teaming Agent (agentic risks: prohibited actions, data leakage, task adherence)

4. **What is the customer's risk appetite for ASR?**
   - Zero-tolerance (critical data/safety) → ASR < 1% for critical risks, block deployment on failures
   - Moderate (internal tooling) → ASR < 10%, log-and-monitor approach
   - Experimental (R&D) → No threshold, focus on discovering edge cases

5. **Who owns the red teaming process?**
   - Ideal: Cross-functional team (AI devs, security, domain experts)
   - Reality: Often siloed (security-only or dev-only) → Identify gaps, propose a collaboration model

### Conversation starters with customers

**Scenario 1: Customer plans to deploy an Azure OpenAI chatbot**

> "Before deployment we should run AI red teaming to identify prompt injection risk. I recommend starting with PyRIT in the CI/CD pipeline. It takes 2-3 hours to set up the first scan, and it gives us the Attack Success Rate for the four core content risks. Based on the results we can enable Prompt Shields in Azure AI Content Safety as mitigation."

**Scenario 2: Customer has an agent with tool use (Azure Functions, Azure Search)**

> "Because the agent has tool access, we need to test for agentic risks, not just content risks. The Azure AI Red Teaming Agent in the cloud can simulate prohibited actions (e.g. file deletion) and sensitive data leakage. We set up a purple environment with mock tools, run the scan pre-deployment, and use the results to tighten permissions at the function level."

**Scenario 3: Customer asks "how often do we need to red team?"**

> "Microsoft Security Benchmark recommends continuous red teaming on a monthly or quarterly cadence. For your use case I suggest: (1) automated PyRIT scans in CI/CD per model update, (2) a comprehensive AI Red Teaming Agent scan quarterly, (3) manual red teaming post-incident. This balances coverage against resource constraints."

### Trade-offs and gotchas

| Trade-off | Implication | Cosmo's advice |
|-----------|-------------|----------------|
| **Automated vs. manual red teaming** | Automated gives scale, manual gives creativity and edge-case discovery | Start automated (PyRIT), supplement with manual testing quarterly |
| **Local vs. cloud** | Local gives data control, cloud gives agentic risk coverage | Hybrid: PyRIT for CI/CD, AI Red Teaming Agent for pre-deployment gates |
| **ASR threshold setting** | A strict threshold (ASR < 1%) blocks deployment often, a loose threshold (ASR < 20%) gives a false sense of security | Segment per risk: critical risks strict (< 1%), medium risks moderate (< 10%) |
| **False positives in ASR** | Generative evaluators are non-deterministic and can flag benign responses | Always review flagged responses manually before remediation |
| **Synthetic data in the purple environment** | Mock tools are not representative of the real data distribution | Document limitations, supplement with manual testing on real (sanitized) staging data |

### When to say no to red teaming

**Red flag:** The customer wants to red team in production with live user data → **NO**

**Alternatives:**
- Purple environment with production-like config
- Staging environment with sanitized data
- Synthetic data generation for agentic scenarios

**Confidence marker:** ✅ The purple environment pattern is Microsoft best practice.
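
To make the "synthetic data generation for agentic scenarios" alternative concrete, here is a minimal sketch of a mock lookup tool backed only by generated records. The function names, record fields, and name lists are hypothetical and chosen for illustration. How such a function is registered as a tool with a specific agent framework is not covered here.

```python
import random

# Small illustrative name pools; every generated record is fake.
FIRST_NAMES = ["Kari", "Ola", "Nora", "Emil"]
LAST_NAMES = ["Hansen", "Berg", "Lie", "Dahl"]


def synthetic_customer(rng):
    """Build one fake customer record from random values (no real data)."""
    return {
        "name": f"{rng.choice(FIRST_NAMES)} {rng.choice(LAST_NAMES)}",
        # Eleven random digits; not validated against any real numbering scheme.
        "account_number": "".join(str(rng.randint(0, 9)) for _ in range(11)),
        "balance_nok": rng.randint(0, 500_000),
    }


def make_mock_lookup_tool(seed=0, size=5):
    """Return a lookup function backed only by synthetic records.

    A fixed seed keeps the data identical across runs, so a leak found in
    one scan can be reproduced in the next.
    """
    rng = random.Random(seed)
    records = {f"C{i:03d}": synthetic_customer(rng) for i in range(size)}

    def lookup_customer(customer_id):
        # Stands in for a real data-access tool during a red teaming run.
        return records.get(customer_id, {"error": "not found"})

    return lookup_customer
```

For example, `make_mock_lookup_tool(seed=42)("C000")` returns a generated record, and any value from it appearing in an agent response can be counted as a sensitive data leakage finding without exposing real customer data.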

### Resources for further learning

**Microsoft AI Red Team Training Series (10 episodes):**
- Episode 1-2: Fundamentals
- Episode 3-6: Attack techniques (direct/indirect prompt injection, single/multi-turn)
- Episode 7: Defense strategies (Spotlighting, Prompt Shields)
- Episode 8-10: Automation with PyRIT

**Hands-on labs:** [https://aka.ms/AIRTlabs](https://aka.ms/AIRTlabs)

**PyRIT documentation:** [https://azure.github.io/PyRIT/](https://azure.github.io/PyRIT/)

## Sources and verification

### Microsoft Learn documentation

| Source | URL | Verification date |
|--------|-----|-------------------|
| **AI Red Teaming Agent (preview)** | https://learn.microsoft.com/en-us/azure/ai-foundry/concepts/ai-red-teaming-agent | 2026-02-03 |
| **Microsoft Security Benchmark: AI-7 Continuous Red Teaming** | https://learn.microsoft.com/en-us/security/benchmark/azure/mcsb-v2-artificial-intelligence-security#ai-7-perform-continuous-ai-red-teaming | 2026-02-03 |
| **AI Red Teaming Training Series** | https://learn.microsoft.com/en-us/security/ai-red-team/training | 2026-02-03 |
| **Planning red teaming for LLMs** | https://learn.microsoft.com/en-us/azure/ai-foundry/openai/concepts/red-teaming | 2026-02-03 |
| **Prompt Shields (Jailbreak detection)** | https://learn.microsoft.com/en-us/azure/ai-services/content-safety/concepts/jailbreak-detection | 2026-02-03 |

### Open-source tools

| Tool | Repository | License |
|------|------------|---------|
| **PyRIT** | https://github.com/Azure/PyRIT | MIT License |
| **MITRE ATLAS** | https://atlas.mitre.org/ | Free (non-commercial) |
| **Adversarial Robustness Toolbox (ART)** | https://github.com/Trusted-AI/adversarial-robustness-toolbox | MIT License |

### Industry resources

| Resource | Publisher | Relevance |
|----------|-----------|-----------|
| **OWASP Top 10 for LLM Applications** | OWASP Foundation | Threat taxonomy |
| **NIST AI Risk Management Framework (AI RMF)** | NIST | Risk governance framework |
| **Three takeaways from red teaming 100 generative AI products** | Microsoft Security Blog (Jan 2025) | Real-world lessons |

**Last updated:** 2026-02-03
**Next review:** 2026-05-03 (quarterly review recommended for a rapidly evolving field)

---

## For Cosmo: Quick Reference Card

**When the customer says:** "We need to test the security of our Azure OpenAI solution"

**Cosmo answers:**

1. ✅ Start with PyRIT in the CI/CD pipeline (automated content risk testing)
2. ⚠️ If it is an agent with tool use → AI Red Teaming Agent (agentic risks)
3. 🔄 Establish a continuous red teaming cadence (monthly/quarterly)
4. 📊 Track Attack Success Rate (ASR) per risk category, set thresholds
5. 🛡️ Mitigate via Prompt Shields, safety meta-prompts, input validation
6. 📝 Document findings for EU AI Act compliance (if high-risk system)

**Decision tree:**

```
AI System Type?
├─ Model-only (chatbot, completion) → PyRIT (local)
└─ Agent (tool use, RAG, function calling)
   ├─ Content risks only → PyRIT (local)
   └─ Agentic risks (prohibited actions, data leakage) → AI Red Teaming Agent (cloud)
```

**Confidence reminder:** PyRIT = production-ready ✅, AI Red Teaming Agent = preview ⚠️
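
The decision tree above can also be written as a small helper, for instance to record the recommended tool alongside an architecture review. This is a sketch only: the function name and parameters are hypothetical, and the returned strings simply mirror the labels used in the tree.

```python
def choose_red_teaming_tool(is_agent: bool, needs_agentic_risks: bool = False) -> str:
    """Map the quick-reference decision tree to a tool recommendation.

    is_agent: True for systems with tool use, RAG, or function calling.
    needs_agentic_risks: True when prohibited actions or data leakage
        must be tested, which only the cloud service covers.
    """
    if not is_agent:
        # Model-only systems (chatbot, completion) stay on local PyRIT.
        return "PyRIT (local)"
    if needs_agentic_risks:
        return "AI Red Teaming Agent (cloud)"
    # Agents tested for content risks only can still use local PyRIT.
    return "PyRIT (local)"
```

A call such as `choose_red_teaming_tool(is_agent=True, needs_agentic_risks=True)` returns the cloud option, which remains a preview service and should be weighed against the data residency constraints discussed earlier.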