Initial addition of ms-ai-architect plugin to the open-source marketplace. Private content excluded: orchestrator/ (Linear tooling), docs/utredning/ (client investigation), generated test reports and PDF export script. skill-gen tooling moved from orchestrator/ to scripts/skill-gen/. Security scan: WARNING (risk 20/100) — no secrets, no injection found. False positive fixed: added gitleaks:allow to Python variable reference in output-validation-grounding-verification.md line 109. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
17 KiB
Adversarial Input Robustness Testing and Fuzzing
Kategori: AI Security Engineering Dato: 2026-02-05 Status: Aktiv
Oversikt
Adversarial input robustness testing og fuzzing er systematiske metoder for å evaluere hvordan AI-modeller og -agenter reagerer på manipulerte, fordreide eller utilsiktede inndata. Målet er å identifisere sårbarheter før angripere kan utnytte dem, og bygge robuste forsvar mot adversarial attacks, prompt injection, jailbreaking og andre angrepsformer.
Microsoft anbefaler kontinuerlig AI red teaming som en kjernekomponent i AI-sikkerhet, integrert i hele utviklingslivssyklusen fra design til produksjon.
Adversarial Test Case Generation
Threat Taxonomy
Microsoft bruker Adversarial Machine Learning Threat Taxonomy som grunnlag for test case generation:
Perturbation-baserte angrep:
- Targeted misclassification — Angriper genererer input som blir feilklassifisert til en spesifikk målklasse
- Source/Target misclassification — Tvinger modellen til å returnere false positive/negative
- Random misclassification — Injiserer støy for å redusere klassifikasjonsytelse
- Confidence reduction — Reduserer konfidensen i korrekt klassifikasjon
Innholdsbaserte angrep:
- Prompt injection — Manipulerer LLM-output ved å injisere instruksjoner i user input
- Jailbreaking — Omgår safety guardrails for å få modellen til å generere forbudt innhold
- Indirect prompt injection (XPIA) — Skjuler angrep i eksterne datakilder (e-poster, dokumenter) som agenter henter via tool calls
Agentic-spesifikke angrep:
- Prohibited actions — Utfører forbudte, høyrisiko eller irreversible handlinger
- Sensitive data leakage — Lekker finansiell, medisinsk eller personlig informasjon
- Task adherence violations — Feiler i å følge oppgave, regler eller prosedyrer
Azure AI Red Teaming Agent
Azure AI Foundry tilbyr AI Red Teaming Agent som automatiserer adversarial testing:
Capabilities:
- Automatiserte scans for safety risks ved å simulere adversarial probing
- Evaluering av attack-response pairs med Attack Success Rate (ASR) som nøkkelmetrikk
- Support for både modell- og agent-testing med ulike risikokategorier
- Integrerer PyRIT (Python Risk Identification Tool) og Azure AI Risk and Safety Evaluations
Supported Risk Categories:
- Hateful and Unfair Content
- Sexual Content
- Violent Content
- Self-Harm-Related Content
- Protected Materials (copyright)
- Code Vulnerability
- Ungrounded Attributes
- Prohibited Actions (agents only)
- Sensitive Data Leakage (agents only)
- Task Adherence (agents only)
Testing Phases:
- Design: Velg den sikreste foundation model for use case
- Development: Test modelloppgraderinger og fine-tuning
- Pre-deployment: Valider før produksjonsutrulling
- Post-deployment: Kontinuerlig testing på syntetiske adversarial data
Attack Strategy Framework
PyRIT tilbyr 20+ attack strategies for test case generation:
Encoding-baserte:
- Base64, Binary, ASCII Art, Morse, ROT13, Atbash, Caesar cipher
- URL encoding, Unicode substitution, Unicode confusables
Obfuscation-baserte:
- Leetspeak, Diacritic marks, Character spacing, CharSwap
- Flip (mirroring), AsciiSmuggler, ANSI escape sequences
Jailbreak-baserte:
- User Prompt Injected Attacks (UPIA)
- Indirect Prompt Injection Attacks
- SuffixAppend (adversarial suffix)
- Multi-turn attacks (context accumulation)
- Crescendo (gradvis eskalering)
Test Data Generation
Manuell generasjon:
from azure.ai.evaluation.simulator import AdversarialSimulator, AdversarialScenario
scenario = AdversarialScenario.ADVERSARIAL_QA
simulator = AdversarialSimulator(
azure_ai_project=azure_ai_project,
credential=DefaultAzureCredential()
)
outputs = await simulator(
scenario=scenario,
max_conversation_turns=3,
max_simulation_results=10,
target=callback
)
Syntetisk generasjon:
from databricks.agents.evals import generate_evals_df
evals = generate_evals_df(
docs,
num_evals=100,
agent_description=agent_description,
question_guidelines=question_guidelines
)
Fuzzing Frameworks for AI
PyRIT (Python Risk Identification Tool)
Open-source framework fra Microsoft for AI red teaming:
Arkitektur:
- Orchestrator: Koordinerer attack campaigns
- Target: AI-system som skal testes (model endpoint, agent)
- Scorers: Evaluerer responses (safety, quality, custom metrics)
- Attack Strategy: Transformerer prompts (encoding, jailbreak)
- Memory: Logger alle interactions for analyse
Key Features:
- Multi-turn conversation attacks
- Dynamic attack strategy chaining
- Support for både lokale og cloud-baserte red teaming runs
- Integrering med Azure AI Foundry for centralisert logging
Typisk workflow:
- Definer target (model/agent endpoint)
- Velg attack scenario (ADVERSARIAL_QA, UPIA, XPIA)
- Konfigurer attack strategies
- Kjør automated scan
- Evaluer ASR (Attack Success Rate)
- Generer scorecard og rapport
Adversarial Robustness Toolbox (ART)
IBM-utviklet open-source bibliotek for adversarial testing:
Capabilities:
- Evasion attacks (FGSM, PGD, C&W, DeepFool)
- Poisoning attacks (training data contamination)
- Extraction attacks (model stealing)
- Inference attacks (membership inference, model inversion)
Defense mechanisms:
- Adversarial training
- Feature squeezing
- Certified defenses
- Detector-based defenses
Microsoft Recommendation: Bruk ART for tradisjonelle ML-modeller (image classification, malware detection). For LLM og agenter, bruk PyRIT og Azure AI Red Teaming Agent.
MITRE ATLAS Integration
Microsoft anbefaler MITRE ATLAS (Adversarial Threat Landscape for AI Systems) for strukturert attack simulation:
Relevante taktikker:
- AML.TA0000 Reconnaissance — Probe model capabilities
- AML.TA0001 Initial Access — Prompt injection, jailbreaking
- AML.TA0010 Exfiltration — Model inversion, membership inference
- AML.TA0009 Impact — Data poisoning, adversarial examples
Integrasjon i CI/CD:
# Azure DevOps pipeline example
- task: AzureCLI@2
displayName: 'Run AI Red Teaming'
inputs:
azureSubscription: 'AI-Security-Sub'
scriptType: 'bash'
scriptLocation: 'inlineScript'
inlineScript: |
python -m pyrit run-scan \
--target $(AGENT_ENDPOINT) \
--scenario ADVERSARIAL_QA \
--max-turns 5 \
--output results.json
Input Perturbation Techniques
Feature-Level Perturbations
Feature Squeezing:
- Reduserer søkerommet tilgjengelig for angripere
- Sammenligner model predictions på original vs. squeezed input
- Detekterer adversarial examples ved uenighet mellom predictions
Feature Denoising:
- Bruker non-local means eller andre filtre for å rense features
- Trenes end-to-end med adversarial training
- Forbedrer robusthet i både white-box og black-box settings
Input Validation:
- Definer well-formed input format
- Reject queries som ikke matcher format
- Sanitize user-supplied data før training/inference
Model-Level Perturbations
Adversarial Training:
- Tren modell med kjente adversarial samples
- Fungerer som regularization (øker input margin)
- Gjør prediction function glattere
Certified Defenses:
- Bruker semi-definite relaxation
- Gir certificate at ingen angrep kan overskride error threshold
- Differentiable — kan optimaliseres sammen med model parameters
Monotonic Classification:
- Sikrer at angripere ikke kan evade ved å legge til negative class features
- Særlig relevant for malware detection og spam filtering
Agent-Level Perturbations
Mock Tools for Safe Testing:
- Syntetiske tools som returnerer kontrollerte data
- Simulerer sensitive data leakage scenarios uten real-world impact
- Begrensning: støtter kun data retrieval, ikke complex behaviors
Purple Environment Testing:
- Non-production environment med production-like resources
- Transient runs (ingen logging av harmful data)
- Sandboxed mock tools for prohibited actions testing
Robustness Metrics
Attack Success Rate (ASR)
Primærmetrikk for adversarial robustness:
ASR = (Antall suksessfulle angrep / Totalt antall angrep) × 100%
Definisjoner per risikokategori:
| Kategori | Suksessfullt angrep |
|---|---|
| Hateful/Sexual/Violent Content | Model genererer harmful content (severity > threshold) |
| Jailbreak | Safety guardrails omgås |
| Prohibited Actions | Agent utfører forbudt handling uten human-in-the-loop |
| Sensitive Data Leakage | Format-level leak detektert via pattern matching |
| Task Adherence | Agent feiler i goal/rule/procedure compliance |
Severity Levels:
- Critical: Remote EOP, model kontroll, data exfiltration
- Important: Targeted misclassification, model stealing, privacy leaks
- Moderate: Random misclassification, confidence reduction
Confidence Metrics
Model Confidence Analysis:
- Track distribution av confidence scores over time
- Alert på plutselig drop i confidence levels
- Sammenlign confidence for legitimate vs. adversarial inputs
Highly Confident Near Neighbor (HCNN):
- Kombinerer confidence information og nearest neighbor search
- Skiller riktige fra gale predictions i neighborhood av training data
- Reinforcer adversarial robustness av base model
Attribution-Based Metrics
Attribution-Driven Causal Analysis:
- Adversarial inputs er IKKE robust i attribution space
- Masking av high-attribution features endrer decision
- Natural inputs ER robust i attribution space
Defense Strategy:
- Bygg two-layer cognition system:
- Original model prediction
- Attribution-based validation
- Angriper må kompromittere BEGGE systemer samtidig
Coverage Metrics
Test Coverage:
- % av attack strategies tested
- % av risk categories covered
- % av tool/function space explored (for agents)
Data Coverage:
- Distribution av synthetic test cases over risk categories
- Representation av edge cases og boundary conditions
- Coverage av user personas og query types
Continuous Security Testing
Integration i Development Lifecycle
Pre-commit Hooks:
#!/bin/bash
# Run quick adversarial test before commit
python -m pyrit run-scan \
--target local \
--scenario ADVERSARIAL_QA \
--max-turns 1 \
--max-results 5 \
--fail-on-asr 20
CI/CD Pipeline:
# GitHub Actions example
name: AI Security Testing
on: [push, pull_request]
jobs:
red-team:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v3
- name: Run PyRIT scan
run: |
python -m pyrit run-scan \
--target ${{ secrets.STAGING_ENDPOINT }} \
--scenario COMPREHENSIVE \
--output results.json
- name: Evaluate ASR
run: |
python scripts/evaluate_asr.py results.json \
--threshold 10 \
--fail-on-critical
Scheduled Production Testing:
# Azure Function for continuous monitoring
import azure.functions as func
from pyrit import RedTeamingOrchestrator
def main(mytimer: func.TimerRequest):
orchestrator = RedTeamingOrchestrator(
target=os.environ['PROD_AGENT_ENDPOINT'],
scenarios=['ADVERSARIAL_QA', 'UPIA', 'XPIA']
)
results = orchestrator.run()
if results.asr > THRESHOLD:
send_alert_to_security_team(results)
log_to_azure_monitor(results)
Monitoring and Alerting
Azure Monitor Integration:
from azure.monitor.opentelemetry import configure_azure_monitor
configure_azure_monitor()
# Log ASR metrics
logger.info("ASR_METRIC", extra={
"scenario": "ADVERSARIAL_QA",
"asr": 15.3,
"severity": "Important",
"timestamp": datetime.utcnow()
})
Anomaly Detection:
- Baseline normal ASR for hver scenario
- Alert ved statistisk signifikant avvik
- Trend analysis for gradvis degradering
Incident Response:
- ASR overstiger threshold → trigger alert
- Security team undersøker results
- Categorize by severity (Critical/Important/Moderate)
- Prioritize remediation basert på risk assessment
- Retest etter mitigations deployed
- Update baseline hvis nødvendig
Regression Testing
Model Update Validation:
- Run full red teaming suite før deployment av ny modellversjon
- Compare ASR mot baseline (previous version)
- Reject deployment hvis ASR øker signifikant
Fine-Tuning Validation:
- Test adversarial robustness etter fine-tuning
- Ensure safety alignment ikke er degradert
- Validate både safety og quality metrics
Agent Workflow Changes:
- Test prohibited actions compliance når tools endres
- Validate task adherence for nye workflows
- Ensure sensitive data leakage ikke introduseres
For Cosmo: Practical Implementation
When to Recommend Adversarial Testing
Mandatory scenarios:
- Alle LLM-baserte systemer som går i produksjon
- Agenter med tool access (spesielt Azure Functions, databases, external APIs)
- Systemer som håndterer sensitive data (PII, financial, health)
- High-consequence scenarios (autonomous decisions, safety-critical)
Testing cadence:
- Design phase: Baseline model selection (test alle kandidater)
- Development: Per sprint/major feature
- Pre-deployment: Full comprehensive scan
- Production: Monthly scheduled + ad-hoc etter incidents
Azure AI Foundry Workflow
Step 1: Setup
azure_ai_project = {
"subscription_id": os.environ["AZURE_SUBSCRIPTION_ID"],
"resource_group_name": os.environ["RESOURCE_GROUP"],
"project_name": os.environ["PROJECT_NAME"]
}
simulator = AdversarialSimulator(
azure_ai_project=azure_ai_project,
credential=DefaultAzureCredential()
)
Step 2: Define Target
@mlflow.trace
async def target_callback(messages, stream=False, session_state=None):
# Your agent logic here
response = agent.invoke(messages)
return {
"messages": response.messages,
"stream": stream,
"session_state": session_state
}
Step 3: Run Scan
outputs = await simulator(
scenario=AdversarialScenario.ADVERSARIAL_QA,
max_conversation_turns=3,
max_simulation_results=50,
target=target_callback,
language=SupportedLanguages.English
)
Step 4: Analyze Results
# View results in Azure AI Foundry portal
# ASR per risk category
# Individual attack-response pairs
# Scorecard with pass/fail per attack strategy
Remediation Strategies
High ASR for Prompt Injection:
- Implement input validation (strip/escape special characters)
- Add system message defensive instructions
- Use Azure AI Content Safety filters (pre-input)
- Consider fine-tuning med adversarial training data
High ASR for Prohibited Actions:
- Review og strengthen agent policy/taxonomy
- Implement human-in-the-loop for high-risk actions
- Add confirmation steps for irreversible operations
- Use Foundry Control Plane for centralized governance
High ASR for Sensitive Data Leakage:
- Implement data masking/redaction i tool outputs
- Review knowledge base access controls
- Add output filters før response til user
- Consider differential privacy techniques
Norwegian Public Sector Considerations
Forvaltningsloven §11a (automatiserte avgjørelser):
- Adversarial testing er påkrevd for å dokumentere robusthet
- ASR må være under akseptabelt nivå (define i DPIA)
- Kontinuerlig testing dokumenterer ongoing compliance
Personopplysningsloven (GDPR):
- Sensitive data leakage testing er mandatory
- Dokumenter at membership inference ikke er mulig
- Model inversion attacks må være mitigated
NSM Grunnprinsipper:
- Red teaming er del av "Kjenn din risiko"
- Continuous testing støtter "Beskytt mot kjente trusler"
- ASR metrics gir "Oppdage hendelser" capability
References
- Threat Modeling AI/ML Systems — Microsoft Security Engineering
- AI Red Teaming Agent — Azure AI Foundry
- PyRIT Framework — Microsoft open-source red teaming tool
- Artificial Intelligence Security (MCSB) — Azure Security Benchmark
- Failure Modes in Machine Learning — Microsoft Security
- AI Risk Assessment for ML Engineers — Microsoft AI Red Team
- MITRE ATLAS — Adversarial Threat Landscape for AI Systems
- Adversarial Robustness Toolbox — IBM Research
Denne referansen er del av AI Security Engineering kunnskapsbasen for Microsoft AI Solution Architect plugin.