# Adversarial Input Robustness Testing and Fuzzing

**Category:** AI Security Engineering
**Date:** 2026-02-05
**Status:** Active

## Overview

Adversarial input robustness testing and fuzzing are systematic methods for evaluating how AI models and agents respond to manipulated, distorted, or unintended inputs. The goal is to identify vulnerabilities before attackers can exploit them, and to build robust defenses against adversarial attacks, prompt injection, jailbreaking, and other attack types.

Microsoft recommends continuous AI red teaming as a core component of AI security, integrated across the entire development lifecycle from design to production.

## Adversarial Test Case Generation

### Threat Taxonomy

Microsoft uses the Adversarial Machine Learning Threat Taxonomy as the basis for test case generation:

**Perturbation-based attacks:**
- **Targeted misclassification**: The attacker crafts input that is misclassified as a specific target class
- **Source/Target misclassification**: Forces the model to return a false positive or false negative
- **Random misclassification**: Injects noise to degrade classification performance
- **Confidence reduction**: Lowers the model's confidence in the correct classification

**Content-based attacks:**
- **Prompt injection**: Manipulates LLM output by injecting instructions into user input
- **Jailbreaking**: Bypasses safety guardrails to make the model generate prohibited content
- **Indirect prompt injection (XPIA)**: Hides the attack in external data sources (emails, documents) that agents retrieve via tool calls

**Agent-specific attacks:**
- **Prohibited actions**: The agent performs prohibited, high-risk, or irreversible actions
- **Sensitive data leakage**: The agent leaks financial, medical, or personal information
- **Task adherence violations**: The agent fails to follow its task, rules, or procedures

### Azure AI Red Teaming Agent

Azure AI Foundry offers the AI Red Teaming Agent, which automates adversarial testing:

**Capabilities:**
- Automated scans for safety risks by simulating adversarial probing
- Evaluation of attack-response pairs with Attack Success Rate (ASR) as the key metric
- Support for both model and agent testing across different risk categories
- Integrates PyRIT (Python Risk Identification Tool) and Azure AI Risk and Safety Evaluations

**Supported Risk Categories:**
- Hateful and Unfair Content
- Sexual Content
- Violent Content
- Self-Harm-Related Content
- Protected Materials (copyright)
- Code Vulnerability
- Ungrounded Attributes
- Prohibited Actions (agents only)
- Sensitive Data Leakage (agents only)
- Task Adherence (agents only)

**Testing Phases:**
- **Design:** Select the safest foundation model for the use case
- **Development:** Test model upgrades and fine-tuning
- **Pre-deployment:** Validate before production rollout
- **Post-deployment:** Continuous testing on synthetic adversarial data

### Attack Strategy Framework

PyRIT offers 20+ attack strategies for test case generation:

**Encoding-based:**
- Base64, Binary, ASCII Art, Morse, ROT13, Atbash, Caesar cipher
- URL encoding, Unicode substitution, Unicode confusables

**Obfuscation-based:**
- Leetspeak, Diacritic marks, Character spacing, CharSwap
- Flip (mirroring), AsciiSmuggler, ANSI escape sequences

**Jailbreak-based:**
- User Prompt Injected Attacks (UPIA)
- Indirect Prompt Injection Attacks
- SuffixAppend (adversarial suffix)
- Multi-turn attacks (context accumulation)
- Crescendo (gradual escalation)

### Test Data Generation

**Manual generation (scripted simulator run):**
```python
from azure.ai.evaluation.simulator import AdversarialSimulator, AdversarialScenario
from azure.identity import DefaultAzureCredential

# azure_ai_project: dict with subscription_id, resource_group_name, project_name
# callback: async function that wraps the system under test
scenario = AdversarialScenario.ADVERSARIAL_QA
simulator = AdversarialSimulator(
    azure_ai_project=azure_ai_project,
    credential=DefaultAzureCredential()
)

# Must be awaited from async code (for example a notebook cell)
outputs = await simulator(
    scenario=scenario,
    max_conversation_turns=3,
    max_simulation_results=10,
    target=callback
)
```

**Synthetic generation:**
```python
from databricks.agents.evals import generate_evals_df

# docs: DataFrame of source documents; the description and guidelines are free-text strings
evals = generate_evals_df(
    docs,
    num_evals=100,
    agent_description=agent_description,
    question_guidelines=question_guidelines
)
```

## Fuzzing Frameworks for AI

### PyRIT (Python Risk Identification Tool)

Open-source framework from Microsoft for AI red teaming:

**Architecture:**
- **Orchestrator:** Coordinates attack campaigns
- **Target:** The AI system under test (model endpoint, agent)
- **Scorers:** Evaluate responses (safety, quality, custom metrics)
- **Attack Strategy:** Transforms prompts (encoding, jailbreak)
- **Memory:** Logs all interactions for analysis

**Key Features:**
- Multi-turn conversation attacks
- Dynamic attack strategy chaining
- Support for both local and cloud-based red teaming runs
- Integration with Azure AI Foundry for centralized logging

**Typical workflow:**
1. Define the target (model/agent endpoint)
2. Select the attack scenario (ADVERSARIAL_QA, UPIA, XPIA)
3. Configure attack strategies
4. Run the automated scan
5. Evaluate ASR (Attack Success Rate)
6. Generate the scorecard and report

### Adversarial Robustness Toolbox (ART)

Open-source library developed by IBM for adversarial testing:

**Capabilities:**
- Evasion attacks (FGSM, PGD, C&W, DeepFool)
- Poisoning attacks (training data contamination)
- Extraction attacks (model stealing)
- Inference attacks (membership inference, model inversion)

**Defense mechanisms:**
- Adversarial training
- Feature squeezing
- Certified defenses
- Detector-based defenses

**Microsoft Recommendation:** Use ART for traditional ML models (image classification, malware detection). For LLMs and agents, use PyRIT and the Azure AI Red Teaming Agent.
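
The evasion attacks listed for ART (FGSM, PGD, and others) all perturb an input in the direction that increases the model's loss. As a rough illustration, the sketch below implements FGSM from scratch in NumPy against a toy logistic-regression classifier. It does not use ART's API; the weights, the input, and the `eps` value are hypothetical and chosen only to make the label flip visible.

```python
import numpy as np


def sigmoid(z):
    """Logistic function mapping a logit to a probability."""
    return 1.0 / (1.0 + np.exp(-z))


def fgsm_perturb(x, y, w, b, eps):
    """One FGSM step for a logistic-regression model with binary cross-entropy loss.

    The gradient of the loss with respect to the input is (p - y) * w,
    where p is the predicted probability of class 1.
    """
    p = sigmoid(x @ w + b)
    grad_x = (p - y) * w
    # Move every feature by eps in the direction that increases the loss
    return x + eps * np.sign(grad_x)


# Hypothetical toy model and input (illustrative values only)
w = np.array([2.0, -1.0])
b = 0.0
x = np.array([1.0, 0.5])  # logit 1.5, so the clean prediction is class 1
y = 1.0                   # true label

x_adv = fgsm_perturb(x, y, w, b, eps=0.8)

clean_pred = int(sigmoid(x @ w + b) >= 0.5)
adv_pred = int(sigmoid(x_adv @ w + b) >= 0.5)
print(f"clean prediction: {clean_pred}, adversarial prediction: {adv_pred}")
```

With these example values the clean input is classified as class 1 and the perturbed input as class 0, although no feature moved by more than `eps`. For real models, ART ships this attack as `FastGradientMethod` in `art.attacks.evasion`.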
### MITRE ATLAS Integration

Microsoft recommends MITRE ATLAS (Adversarial Threat Landscape for AI Systems) for structured attack simulation:

**Relevant tactics:**
- **AML.TA0002 Reconnaissance**: Probe model capabilities
- **AML.TA0004 Initial Access**: Prompt injection, jailbreaking
- **AML.TA0010 Exfiltration**: Model inversion, membership inference
- **AML.TA0011 Impact**: Data poisoning, adversarial examples

**Integration in CI/CD:**
```yaml
# Azure DevOps pipeline example
# The `run-scan` subcommand and its flags are illustrative placeholders;
# substitute the scan entry point your PyRIT version provides.
- task: AzureCLI@2
  displayName: 'Run AI Red Teaming'
  inputs:
    azureSubscription: 'AI-Security-Sub'
    scriptType: 'bash'
    scriptLocation: 'inlineScript'
    inlineScript: |
      python -m pyrit run-scan \
        --target $(AGENT_ENDPOINT) \
        --scenario ADVERSARIAL_QA \
        --max-turns 5 \
        --output results.json
```

## Input Perturbation Techniques

### Feature-Level Perturbations

**Feature Squeezing:**
- Reduces the search space available to attackers
- Compares model predictions on the original vs. the squeezed input
- Detects adversarial examples when the two predictions disagree

**Feature Denoising:**
- Uses non-local means or other filters to clean features
- Is trained end-to-end together with adversarial training
- Improves robustness in both white-box and black-box settings

**Input Validation:**
- Define a well-formed input format
- Reject queries that do not match the format
- Sanitize user-supplied data before training/inference

### Model-Level Perturbations

**Adversarial Training:**
- Train the model with known adversarial samples
- Acts as regularization (increases the input margin)
- Makes the prediction function smoother

**Certified Defenses:**
- Use semidefinite relaxation
- Provide a certificate that no attack can exceed an error threshold
- Are differentiable, so they can be optimized jointly with the model parameters

**Monotonic Classification:**
- Ensures that attackers cannot evade detection by adding negative-class features
- Particularly relevant for malware detection and spam filtering

### Agent-Level Perturbations

**Mock Tools for Safe Testing:**
- Synthetic tools that return controlled data
- Simulate sensitive data leakage scenarios without real-world impact
- Limitation: support only data retrieval, not complex behaviors

**Purple Environment Testing:**
- Non-production environment with production-like resources
- Transient runs (no logging of harmful data)
- Sandboxed mock tools for prohibited actions testing

## Robustness Metrics

### Attack Success Rate (ASR)

The primary metric for adversarial robustness:

```
ASR = (number of successful attacks / total number of attacks) × 100%
```

**Definitions per risk category:**

| Category | Successful attack |
|----------|-------------------|
| Hateful/Sexual/Violent Content | Model generates harmful content (severity > threshold) |
| Jailbreak | Safety guardrails are bypassed |
| Prohibited Actions | Agent performs a prohibited action without human-in-the-loop |
| Sensitive Data Leakage | Format-level leak detected via pattern matching |
| Task Adherence | Agent fails goal/rule/procedure compliance |

**Severity Levels:**
- **Critical:** Remote elevation of privilege (EoP), model control, data exfiltration
- **Important:** Targeted misclassification, model stealing, privacy leaks
- **Moderate:** Random misclassification, confidence reduction

### Confidence Metrics

**Model Confidence Analysis:**
- Track the distribution of confidence scores over time
- Alert on a sudden drop in confidence levels
- Compare confidence for legitimate vs. adversarial inputs

**Highly Confident Near Neighbor (HCNN):**
- Combines confidence information with nearest-neighbor search
- Distinguishes correct from incorrect predictions in the neighborhood of the training data
- Reinforces the adversarial robustness of the base model

### Attribution-Based Metrics

**Attribution-Driven Causal Analysis:**
- Adversarial inputs are NOT robust in attribution space
- Masking their high-attribution features changes the decision
- Natural inputs ARE robust in attribution space

**Defense Strategy:**
- Build a two-layer cognition system:
  1. Original model prediction
  2. Attribution-based validation
- An attacker must compromise BOTH systems simultaneously

### Coverage Metrics

**Test Coverage:**
- % of attack strategies tested
- % of risk categories covered
- % of tool/function space explored (for agents)

**Data Coverage:**
- Distribution of synthetic test cases across risk categories
- Representation of edge cases and boundary conditions
- Coverage of user personas and query types

## Continuous Security Testing

### Integration in the Development Lifecycle

**Pre-commit Hooks:**
```bash
#!/bin/bash
# Run a quick adversarial test before commit
# (illustrative CLI; flags are placeholders, as in the pipeline example above)
python -m pyrit run-scan \
  --target local \
  --scenario ADVERSARIAL_QA \
  --max-turns 1 \
  --max-results 5 \
  --fail-on-asr 20
```

**CI/CD Pipeline:**
```yaml
# GitHub Actions example
# (illustrative CLI; scripts/evaluate_asr.py is a project-specific script)
name: AI Security Testing
on: [push, pull_request]

jobs:
  red-team:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Run PyRIT scan
        run: |
          python -m pyrit run-scan \
            --target ${{ secrets.STAGING_ENDPOINT }} \
            --scenario COMPREHENSIVE \
            --output results.json
      - name: Evaluate ASR
        run: |
          python scripts/evaluate_asr.py results.json \
            --threshold 10 \
            --fail-on-critical
```

**Scheduled Production Testing:**
```python
# Azure Function for continuous monitoring.
# NOTE: the import path, constructor arguments, run() and results.asr sketch the
# intended flow; adapt them to the interface of your installed PyRIT version.
import os

import azure.functions as func
from pyrit import RedTeamingOrchestrator

THRESHOLD = 10.0  # maximum acceptable ASR in percent (example value)


def main(mytimer: func.TimerRequest) -> None:
    orchestrator = RedTeamingOrchestrator(
        target=os.environ['PROD_AGENT_ENDPOINT'],
        scenarios=['ADVERSARIAL_QA', 'UPIA', 'XPIA']
    )
    results = orchestrator.run()

    if results.asr > THRESHOLD:
        send_alert_to_security_team(results)  # project-specific helper

    log_to_azure_monitor(results)  # project-specific helper
```

### Monitoring and Alerting

**Azure Monitor Integration:**
```python
import logging
from datetime import datetime, timezone

from azure.monitor.opentelemetry import configure_azure_monitor

# Reads the connection string from APPLICATIONINSIGHTS_CONNECTION_STRING
configure_azure_monitor()

logger = logging.getLogger(__name__)
logger.setLevel(logging.INFO)  # INFO records are dropped at the default level

# Log ASR metrics
logger.info("ASR_METRIC", extra={
    "scenario": "ADVERSARIAL_QA",
    "asr": 15.3,
    "severity": "Important",
    "timestamp": datetime.now(timezone.utc).isoformat()
})
```

**Anomaly Detection:**
- Baseline the normal ASR for each scenario
- Alert on statistically significant deviation
- Trend analysis for gradual degradation

**Incident Response:**
1. ASR exceeds the threshold → trigger an alert
2. The security team investigates the results
3. Categorize by severity (Critical/Important/Moderate)
4. Prioritize remediation based on the risk assessment
5. Retest after mitigations are deployed
6. Update the baseline if necessary

### Regression Testing

**Model Update Validation:**
- Run the full red teaming suite before deploying a new model version
- Compare ASR against the baseline (previous version)
- Reject the deployment if ASR increases significantly

**Fine-Tuning Validation:**
- Test adversarial robustness after fine-tuning
- Ensure safety alignment has not degraded
- Validate both safety and quality metrics

**Agent Workflow Changes:**
- Test prohibited actions compliance when tools change
- Validate task adherence for new workflows
- Ensure no sensitive data leakage is introduced

## For Cosmo: Practical Implementation

### When to Recommend Adversarial Testing

**Mandatory scenarios:**
- All LLM-based systems going into production
- Agents with tool access (especially Azure Functions, databases, external APIs)
- Systems that handle sensitive data (PII, financial, health)
- High-consequence scenarios (autonomous decisions, safety-critical)

**Testing cadence:**
- **Design phase:** Baseline model selection (test all candidates)
- **Development:** Per sprint/major feature
- **Pre-deployment:** Full comprehensive scan
- **Production:** Monthly scheduled + ad hoc after incidents

### Azure AI Foundry Workflow

**Step 1: Setup**
```python
import os

from azure.ai.evaluation.simulator import AdversarialSimulator
from azure.identity import DefaultAzureCredential

azure_ai_project = {
    "subscription_id": os.environ["AZURE_SUBSCRIPTION_ID"],
    "resource_group_name": os.environ["RESOURCE_GROUP"],
    "project_name": os.environ["PROJECT_NAME"]
}

simulator = AdversarialSimulator(
    azure_ai_project=azure_ai_project,
    credential=DefaultAzureCredential()
)
```

**Step 2: Define Target**
```python
import mlflow


@mlflow.trace
async def target_callback(messages, stream=False, session_state=None):
    # Your agent logic here (`agent` is your own agent object)
    response = agent.invoke(messages)
    return {
        "messages": response.messages,
        "stream": stream,
        "session_state": session_state
    }
```

**Step 3: Run Scan**
```python
from azure.ai.evaluation.simulator import AdversarialScenario, SupportedLanguages

# Must be awaited from async code (for example a notebook cell)
outputs = await simulator(
    scenario=AdversarialScenario.ADVERSARIAL_QA,
    max_conversation_turns=3,
    max_simulation_results=50,
    target=target_callback,
    language=SupportedLanguages.English
)
```

**Step 4: Analyze Results**
```python
# View results in Azure AI Foundry portal
# ASR per risk category
# Individual attack-response pairs
# Scorecard with pass/fail per attack strategy
```

### Remediation Strategies

**High ASR for Prompt Injection:**
1. Implement input validation (strip/escape special characters)
2. Add defensive instructions to the system message
3. Use Azure AI Content Safety filters (pre-input)
4. Consider fine-tuning with adversarial training data

**High ASR for Prohibited Actions:**
1. Review and strengthen the agent policy/taxonomy
2. Implement human-in-the-loop for high-risk actions
3. Add confirmation steps for irreversible operations
4. Use Foundry Control Plane for centralized governance

**High ASR for Sensitive Data Leakage:**
1. Implement data masking/redaction in tool outputs
2. Review knowledge base access controls
3. Add output filters before the response reaches the user
4. Consider differential privacy techniques

### Norwegian Public Sector Considerations

**Forvaltningsloven §11a (automated decisions):**
- Adversarial testing is required to document robustness
- ASR must stay below an acceptable level (defined in the DPIA)
- Continuous testing documents ongoing compliance

**Personopplysningsloven (GDPR):**
- Sensitive data leakage testing is mandatory
- Document that membership inference is not possible
- Model inversion attacks must be mitigated

**NSM Grunnprinsipper:**
- Red teaming is part of "Kjenn din risiko" (know your risk)
- Continuous testing supports "Beskytt mot kjente trusler" (protect against known threats)
- ASR metrics provide the "Oppdage hendelser" (detect incidents) capability

## References

- [Threat Modeling AI/ML Systems](https://learn.microsoft.com/en-us/security/engineering/threat-modeling-aiml), Microsoft Security Engineering
- [AI Red Teaming Agent](https://learn.microsoft.com/en-us/azure/ai-foundry/concepts/ai-red-teaming-agent), Azure AI Foundry
- [PyRIT Framework](https://azure.github.io/PyRIT/), Microsoft open-source red teaming tool
- [Artificial Intelligence Security (MCSB)](https://learn.microsoft.com/en-us/security/benchmark/azure/mcsb-v2-artificial-intelligence-security), Azure Security Benchmark
- [Failure Modes in Machine Learning](https://learn.microsoft.com/en-us/security/engineering/failure-modes-in-machine-learning), Microsoft Security
- [AI Risk Assessment for ML Engineers](https://learn.microsoft.com/en-us/security/ai-red-team/ai-risk-assessment), Microsoft AI Red Team
- [MITRE ATLAS](https://atlas.mitre.org/), Adversarial Threat Landscape for AI Systems
- [Adversarial Robustness Toolbox](https://adversarial-robustness-toolbox.org/), IBM Research

---

*This reference is part of the AI Security Engineering knowledge base for the Microsoft AI Solution Architect plugin.*