Speech Services - Speaker Recognition and Identification
Last updated: 2026-02 | Status: GA | Category: Azure AI Services (Foundry Tools)
Introduction
Azure Speech Services Speaker Recognition provides biometric algorithms that verify and identify speakers based on their unique voice signatures. The service answers the question "who is speaking?" through voice biometry that extracts voice characteristics from audio recordings.
Speaker Recognition covers two main scenarios: Speaker Verification (one-to-one matching for authentication) and Speaker Identification (one-to-many matching to find out who is speaking). Both APIs use voice signatures (also called voiceprints): numeric vectors that represent individual voice characteristics extracted from speech recordings.
One critical limitation to note: the APIs are not designed to detect liveness (a live person versus a recording or imitation). Replay-attack mitigation must be implemented separately, through random passphrases or other methods.
Core components
Speaker Verification
| Type | Description | Use case | Enrollment | Similarity threshold |
|---|---|---|---|---|
| Text-dependent | Requires the same passphrase at enrollment and verification | Multi-factor authentication, banking | 10 predefined English phrases | ≥ 0.5 (combined voice + text) |
| Text-independent | No restrictions on what is said | General authentication, identity confirmation | Free-form speech | ≥ 0.5 (voice similarity only) |
Text-dependent passphrases (English):
- I am going to make him an offer he cannot refuse.
- Houston we have had a problem.
- My voice is my passport verify me.
- Apple juice tastes funny after toothpaste.
- You can get in without your password.
- You can activate security system now.
- My voice is stronger than passwords.
- My password is not your business.
- My name is unknown to you.
- Be yourself everyone else is already taken.
API Response:
{
"recognitionResult": "Accept" | "Reject",
"similarityScore": 0.0-1.0
}
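A minimal Python sketch of how a client might consume this response, assuming the body arrives as a JSON string with exactly the two fields shown above. The `parse_verification` helper and the optional stricter local threshold are illustrative assumptions, not part of the service API.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class VerificationResult:
    accepted: bool
    score: float


def parse_verification(body: str, threshold: float = 0.5) -> VerificationResult:
    """Parse a verification response and apply a local acceptance threshold.

    Hypothetical helper: field names follow the response shape shown above.
    """
    data = json.loads(body)
    score = float(data["similarityScore"])
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"similarityScore out of range: {score}")
    # Accept only when the service accepted AND the score clears our own threshold.
    accepted = data["recognitionResult"] == "Accept" and score >= threshold
    return VerificationResult(accepted=accepted, score=score)
```

Requiring both conditions lets an application raise the bar above the service default without ever accepting a result the service itself rejected.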
Speaker Identification
| Property | Value |
|---|---|
| Type | Text-independent (always) |
| Max candidates | 50 speakers per request |
| Response | 1 identified ID + 5 top-ranked IDs with scores |
| Threshold | Default 0.5 (can be overridden) |
| No match handling | Returns the string "0" if no score is ≥ 0.5 |
Use case: Call center routing, meeting attribution, forensics, access control for groups.
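The limits in the table can be sketched in Python as follows. This is an illustration under assumptions: `batches` and `pick_speaker` are hypothetical helpers, the scores are supplied by the caller as a plain dictionary, and merging scores from several requests is assumed to be meaningful, which this document does not confirm.

```python
from typing import Dict, Iterator, List, Sequence, Tuple

MAX_CANDIDATES = 50  # per-request limit from the table above
NO_MATCH = "0"       # value returned when no candidate reaches the threshold


def batches(profile_ids: Sequence[str], size: int = MAX_CANDIDATES) -> Iterator[List[str]]:
    """Split a candidate pool into request-sized groups."""
    for start in range(0, len(profile_ids), size):
        yield list(profile_ids[start:start + size])


def pick_speaker(
    scores: Dict[str, float], threshold: float = 0.5, top_n: int = 5
) -> Tuple[str, List[Tuple[str, float]]]:
    """Return (identified_id, top-ranked candidates) from profile_id -> score."""
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    best_is_match = bool(ranked) and ranked[0][1] >= threshold
    identified = ranked[0][0] if best_is_match else NO_MATCH
    return identified, ranked[:top_n]
```

Keeping the ranked list alongside the identified ID makes it possible to log near misses, which helps when tuning the threshold later.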
Voice Signature Storage
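// C# SDK example: Speaker Verification (text-independent)
using Microsoft.CognitiveServices.Speech;
using Microsoft.CognitiveServices.Speech.Audio;
using Microsoft.CognitiveServices.Speech.Speaker;

var config = SpeechConfig.FromSubscription("YourKey", "YourRegion");
var audioInput = AudioConfig.FromDefaultMicrophoneInput();
var client = new VoiceProfileClient(config);

// Enrollment: create a profile, then submit audio for it
var profile = await client.CreateProfileAsync(
    VoiceProfileType.TextIndependentVerification, "en-US");
var result = await client.EnrollProfileAsync(profile, audioInput);

// Verification: wrap the enrolled profile in a model and compare new audio against it
var recognizer = new SpeakerRecognizer(config, audioInput);
var model = SpeakerVerificationModel.FromProfile(profile);
var verifyResult = await recognizer.RecognizeOnceAsync(model);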
Architecture patterns
Pattern 1: Multi-Factor Authentication (Text-Dependent)
Scenario: Banking app with voice + passphrase as a security layer.
Advantages:
- Two-factor security (voice signature + passphrase content)
- Lower false positive rate than text-independent
- Compliance-friendly (NIST AAL2-compatible)
Disadvantages:
- Poor user experience (the user must remember a specific phrase)
- Predefined phrases are English-only
- Vulnerable to replay attacks without additional measures
Implementation:
Enrollment: Speaker → chooses phrase → records 3-5 samples → voice signature is stored
Verification: Speaker → says the same phrase → Accept/Reject (voice + text matching)
Pattern 2: Transparent Identification in Teams Rooms
Scenario: Hybrid meeting where in-room participants are identified automatically for transcription.
Advantages:
- Seamless UX (no manual sign-in)
- Accurate speaker attribution for Copilot/recap
- Supports up to 50 enrolled speakers per meeting
Disadvantages:
- Requires prior enrollment of all participants
- GDPR/privacy complexity (biometric data)
- Quality depends on the microphone (Intelligent Speaker recommended)
Architecture:
Teams Room → Audio stream → Speaker Identification API (50 candidates) →
Attribution in transcript → Copilot uses names for summaries
Policy requirements:
Set-CsTeamsAIPolicy -Identity Global -SpeakerAttributionBYOD Enabled
Pattern 3: Call Center Routing (Text-Independent Verification)
Scenario: IVR system that verifies high-value customers without a PIN code.
Advantages:
- Natural conversation flow (no specific phrase)
- Faster than PIN/security questions
- Works across languages
Disadvantages:
- Higher false positive rate than text-dependent
- Requires a longer audio sample (minimum 5 seconds recommended)
- No liveness detection (vulnerable to replay)
Decision flow:
Caller → "I need help with my account" →
Voice extracted → Verification API →
Accept (score ≥ 0.5) → Route to agent with customer data
Reject → Fall back to PIN code
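The decision flow above can be sketched in Python. This is a hedged illustration: `Route` and `route_caller` are hypothetical names, the score is assumed to come from the Verification API, and the 5-second minimum is taken from the disadvantages listed for this pattern.

```python
from enum import Enum
from typing import Optional


class Route(Enum):
    AGENT_WITH_CUSTOMER_DATA = "agent_with_customer_data"
    PIN_FALLBACK = "pin_fallback"


def route_caller(
    score: Optional[float],
    audio_seconds: float,
    threshold: float = 0.5,
    min_audio_seconds: float = 5.0,
) -> Route:
    """Choose a route for a caller from a verification score.

    A score of None means verification could not be performed at all.
    """
    # Too little speech or no score: do not trust the result, use the PIN path.
    if score is None or audio_seconds < min_audio_seconds:
        return Route.PIN_FALLBACK
    if score >= threshold:
        return Route.AGENT_WITH_CUSTOMER_DATA
    return Route.PIN_FALLBACK
```

Every non-accept path ends in the PIN fallback, so a failed or missing verification never blocks the caller outright.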
Decision guidance
Choosing between Verification and Identification
| Scenario | Recommended API | Rationale |
|---|---|---|
| App login (known user) | Verification | 1:1 matching, faster, lower cost |
| "Who is this?" (unknown member of a group) | Identification | 1:N matching, returns a ranked list |
| Multi-user device | Identification | Identifies from a pool of enrolled users |
| Banking authentication | Verification (text-dependent) | Higher security via dual-factor |
| Meeting transcription | Identification | Attributes multiple speakers |
Threshold Tuning
The default threshold (0.5) suits:
- General-purpose scenarios
- A balance between security and convenience
A higher threshold (0.7-0.9) when:
- The context is high-security (banking, healthcare)
- Fewer false positives matter more than fewer false negatives
- High audio quality is expected
A lower threshold (0.3-0.4) when:
- Audio quality is poor (noisy environments)
- Convenience is prioritized over security
- Some false positives are acceptable
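One way to make this trade-off concrete is to measure false accept and false reject rates on labeled pilot scores. The Python sketch below assumes you have collected similarity scores for genuine attempts and for impostor attempts; the function names and the sample data in the usage note are hypothetical.

```python
from typing import Iterable, List, Optional, Tuple


def error_rates(
    genuine: List[float], impostor: List[float], threshold: float
) -> Tuple[float, float]:
    """Return (false_accept_rate, false_reject_rate) at a given threshold."""
    far = sum(s >= threshold for s in impostor) / len(impostor)
    frr = sum(s < threshold for s in genuine) / len(genuine)
    return far, frr


def lowest_threshold_meeting_far(
    genuine: List[float],
    impostor: List[float],
    max_far: float,
    candidates: Iterable[float],
) -> Optional[Tuple[float, float, float]]:
    """Pick the lowest candidate threshold whose false accept rate is acceptable.

    Returns (threshold, far, frr), or None if no candidate qualifies.
    """
    for t in sorted(candidates):
        far, frr = error_rates(genuine, impostor, t)
        if far <= max_far:
            return t, far, frr
    return None
```

For example, with genuine scores `[0.9, 0.8, 0.6, 0.4]` and impostor scores `[0.55, 0.3, 0.2, 0.1]`, a threshold of 0.5 accepts one impostor in four, while 0.7 accepts none at the price of rejecting two of the four genuine attempts.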
Common errors
| Error | Cause | Solution |
|---|---|---|
| Low accuracy | Enrollment audio too short | At least 20 s of total enrollment audio recommended |
| "No match" for valid users | Changed voice quality (illness, stress) | Re-enroll or lower the threshold |
| Replay attack succeeds | No liveness detection | Implement random passphrase generation |
| GDPR violation | Missing consent/purpose limitation | Explicit consent + data minimization |
| Poor speaker attribution | Suboptimal microphone | Use a certified Intelligent Speaker |
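The 20-second enrollment recommendation can be enforced on the client side before a profile is treated as ready. A small Python sketch, assuming the application tracks the speech duration of each enrollment clip itself; `remaining_enrollment_seconds` is a hypothetical helper.

```python
from typing import Iterable

MIN_ENROLLMENT_SECONDS = 20.0  # recommendation from the table above


def remaining_enrollment_seconds(
    clip_seconds: Iterable[float], minimum: float = MIN_ENROLLMENT_SECONDS
) -> float:
    """Return how many more seconds of speech are needed (0.0 when enough)."""
    total = sum(clip_seconds)
    return max(0.0, minimum - total)
```

A result above zero can drive a prompt asking the user for another sample instead of letting a weak profile go into use.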
Red flags
❌ Do NOT use Speaker Recognition for:
- Liveness detection (use a dedicated liveness API)
- Emotion analysis (use Speech Analytics instead)
- Forensic legal evidence (the API is not designed for this)
- Automatic enrollment without consent (GDPR/privacy violation)
Integration with the Microsoft stack
Azure AI Foundry Integration
Azure AI Foundry → Speech resource → Speaker Recognition
├── Custom Neural Voice: Uses Speaker Verification for voice talent consent
├── Personal Voice: Validates that consent audio matches the training prompt
└── Teams Intelligent Speaker: Attribution via Identification API
Microsoft 365 Copilot
| Feature | Speaker Recognition role |
|---|---|
| Teams Transcript | Identifies in-room speakers for accurate attribution |
| Meeting Recap | Copilot needs speaker names to summarize who said what |
| Action Items | Assigns tasks to the right person based on identification |
Policy configuration:
# Teams Rooms People Recognition
Set-CsTeamsMeetingPolicy -Identity Global -RoomAttributeUserOverride Attribute
# BYOD Rooms Speaker Attribution
Set-CsTeamsAIPolicy -Identity Global -SpeakerAttributionBYOD Enabled
Power Platform
Power Automate Cloud Flow:
Trigger: OnNewVoicemail
→ Get Recording → Speaker Verification API →
If Verified → Route to Priority Queue
Else → Standard Queue
Limitations: The Speaker Recognition API requires a custom connector (no pre-built connector exists).
Azure Communication Services
// Call Automation with Speaker Recognition
var recognizeOptions = new CallMediaRecognizeSpeechOptions(
    targetParticipant: new PhoneNumberIdentifier(callerNumber))
{
    Prompt = new TextSource("How can I help you today?", "en-US-ElizabethNeural"),
    SpeechLanguages = new List<string> { "en-US", "nb-NO" }
};
// Combine with Speaker Verification for caller authentication
// (VerifySpeaker and RouteToPrivilegedAgent are application-defined helpers)
var verifyResult = await VerifySpeaker(audioStream, enrolledProfileId);
if (verifyResult.Score >= 0.7)
{
    await RouteToPrivilegedAgent(callConnectionId);
}
Public sector (Norway)
GDPR & Biometric Data (Art. 9)
Legal basis:
- Speaker Recognition processes biometric data (voice signatures)
- Art. 9(1): Processing is prohibited by default (special categories of personal data)
- Art. 9(2)(a): Explicit consent required (implied consent is not sufficient)
Compliance checklist:
- ✅ Explicit consent from every voice talent/user before enrollment
- ✅ Purpose limitation: Use only for the purposes described at consent
- ✅ Data minimization: Delete voice signatures when no longer needed
- ✅ Transparency: Clear information that voice biometry is used
- ✅ Right to deletion: A mechanism for deleting voice profiles
Microsoft speaker verification for Custom Neural Voice:
- Microsoft uses Speaker Verification to validate that consent audio matches the training data
- Processing under DPA Legitimate Interest Business Operations
- Voice signatures are retained only for security/integrity (not reused for other purposes)
Schrems II & Data Residency
| Region | Data Location | Schrems II Impact |
|---|---|---|
| Norway East | Norway (Oslo) | ✅ Recommended: data stays within the EEA |
| West Europe | Netherlands | ✅ Acceptable: EU data residency |
| US regions | USA | ⚠️ Requires GDPR assessment: potential US government access |
Voice signature storage:
- Stored in Azure Storage in the same region as the Speech resource
- Encryption at rest via Azure Storage Encryption
- Customer-Managed Keys (CMK) can be used for additional control
AI Act (EU AI Act)
Risk Classification: Speaker Recognition = High-Risk AI System (biometric identification)
Mandatory requirements:
- Fundamental rights impact assessment (FRIA)
- Technical documentation (model cards, training data provenance)
- Human oversight mechanisms (ability for a human to override decisions)
- Transparency obligations (informing users about biometric processing)
- Accuracy, robustness, cybersecurity requirements
Norwegian implementation: Awaiting national implementing legislation (2025-2026).
Forvaltningsloven (Public Administration Act) & administrative decisions
If Speaker Recognition is used for automated administrative decisions:
- § 11a: Requirement for individual assessment in "important cases"
- §§ 24-25: Duty to give grounds (must be able to explain why a voice was rejected)
- § 28: Right of appeal (must be able to contest false rejections)
Mitigation:
- Combine voice with other factors (multi-factor)
- Always have a fallback to a manual process
- Document the decision logic for transparency
Data sovereignty
Statens Standard (DSS-001):
- Requires Norwegian data residency for "sensitive" public-sector data
- Voice signatures are normally classified as sensitive
- Recommendation: Use the Norway East region + CMK
Alternatives:
- West Europe is acceptable for data with a "normal" protection level
- US regions only for non-personally-identifiable data
Cost and licensing
Pricing model (as of 2026-02)
| API | Unit | Approx. price (NOK)* | Use case |
|---|---|---|---|
| Speaker Verification (text-dependent) | Per 1,000 transactions | 11.60 | High-security auth |
| Speaker Verification (text-independent) | Per 1,000 transactions | 11.60 | General auth |
| Speaker Identification | Per 1,000 transactions | 11.60 | Meeting attribution, call routing |
| Enrollment | Per 1,000 transactions | 11.60 | Voice profile creation |
*Estimated from USD pricing ($1.05 per 1,000 transactions → approx. 11.60 NOK per 1,000, or about 0.012 NOK per transaction). Verify current prices in the Azure Pricing Calculator.
Transaction definitions:
- 1 transaction = 1 API call (verification, identification, or enrollment)
- Enrollment typically requires 3-5 calls per user for good accuracy
- Verification/identification = 1 call per authentication attempt
Optimization tips
1. Batch enrollment:
// Avoid: 5 separate API calls for enrollment
for (int i = 0; i < 5; i++)
{
    await client.EnrollProfileAsync(profile, audioClips[i]); // 5 x 0.012 NOK
}
// Better: combine the audio before enrollment (if possible)
// CombineAudioClips is an application-defined helper, not part of the Speech SDK
var combinedAudio = CombineAudioClips(audioClips);
await client.EnrollProfileAsync(profile, combinedAudio); // 1 x 0.012 NOK
2. Caching of verification results:
- Cache positive verifications for 5-10 minutes within the same session
- Reduce re-verification frequency in low-risk scenarios
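A minimal Python sketch of such a cache, assuming an in-memory store keyed by session ID is acceptable for the deployment. `VerificationCache` is a hypothetical class; only positive results are stored, and the clock is injectable so expiry can be tested.

```python
import time
from typing import Callable, Dict


class VerificationCache:
    """Remembers successful verifications for a limited time per session."""

    def __init__(self, ttl_seconds: float = 300.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._ttl = ttl_seconds
        self._clock = clock
        self._expires_at: Dict[str, float] = {}

    def remember(self, session_id: str) -> None:
        """Record a positive verification for this session."""
        self._expires_at[session_id] = self._clock() + self._ttl

    def is_verified(self, session_id: str) -> bool:
        """Return True while a remembered verification has not expired."""
        expires = self._expires_at.get(session_id)
        if expires is None:
            return False
        if self._clock() >= expires:
            # Expired: drop the entry so the next request re-verifies.
            del self._expires_at[session_id]
            return False
        return True
```

Rejections are deliberately never cached, so a failed attempt always triggers a fresh verification or the fallback path.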
3. Threshold tuning for cost vs. security:
- Lower threshold → fewer re-attempts → lower cost
- Higher threshold → more security but more retries
4. Regional pricing:
- Norway East and West Europe are in the same pricing tier
- Choose Norway East for compliance at the same cost
TCO estimate (10,000 users, banking scenario)
Assumptions:
- 10,000 enrolled users
- 5 enrollment attempts per user (initial setup)
- 2 verifications per user per day (login frequency)
- 250 working days per year
Enrollment cost: 10,000 users × 5 attempts × 0.012 NOK = 600 NOK (one-time)
Annual verification: 10,000 × 2 × 250 × 0.012 NOK = 60,000 NOK
Total first year: 60,600 NOK (~$5,500 USD)
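The arithmetic above can be reproduced with a short Python sketch. It uses the assumptions listed in this section and the estimated price of 0.012 NOK per transaction; `first_year_cost` is a hypothetical helper, and `Decimal` is used to avoid floating-point rounding in currency values.

```python
from decimal import Decimal
from typing import Tuple


def first_year_cost(
    users: int,
    enroll_calls_per_user: int,
    verifications_per_day: int,
    working_days: int,
    price_per_call_nok: Decimal,
) -> Tuple[Decimal, Decimal, Decimal]:
    """Return (one-time enrollment, annual verification, first-year total) in NOK."""
    enrollment = users * enroll_calls_per_user * price_per_call_nok
    annual = users * verifications_per_day * working_days * price_per_call_nok
    return enrollment, annual, enrollment + annual


enrollment, annual, total = first_year_cost(
    users=10_000,
    enroll_calls_per_user=5,
    verifications_per_day=2,
    working_days=250,
    price_per_call_nok=Decimal("0.012"),  # estimate from this section, not a quoted price
)
print(enrollment, annual, total)
```

Changing a single input, such as verifications per day, shows how strongly the annual figure depends on login frequency rather than on enrollment.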
Alternative cost: A PIN code reset typically carries a support cost of 50-100 NOK per incident. With 5% of users resetting annually (500 users), that is 25,000-50,000 NOK in support cost saved.
Licensing
| Component | License requirement |
|---|---|
| Speaker Recognition API | No special license (consumption-based) |
| Teams Intelligent Speaker | Teams Rooms Pro (not Standard/Basic) |
| Copilot Speaker Attribution | Teams Premium or Copilot license |
| Speech SDK | Free of charge (SDK samples are MIT-licensed) |
For the architect (Cosmo)
5-8 questions to ask the customer
- Consent framework: "Have you established a process for obtaining explicit consent to biometric processing from each individual user/employee? What documentation do you have for this?"
- Liveness detection: "Are you aware that Speaker Recognition does not detect replay attacks or deepfakes? Are you planning additional security measures such as random passphrases or challenge-response?"
- Data residency: "Do you have data sovereignty requirements that mandate Norwegian/European storage of voice signatures? Are you comfortable with Microsoft retaining copies of voice models for security purposes?"
- Fallback strategy: "What is plan B when voice recognition fails? PIN code, security questions, or human-in-the-loop? How often do you expect false rejections?"
- Use case classification: "Is this authentication (1:1 verification) or identification (1:N)? How many candidates must be searched at once (max 50 per call)?"
- Audio quality: "What microphone/device quality do you expect? Background noise level? Telephony quality (8 kHz) or HD audio (16 kHz+)?"
- Re-enrollment frequency: "How often must voice profiles be updated? Do you expect voice changes over time (aging, illness) that affect accuracy?"
- Compliance readiness: "Have you carried out a fundamental rights impact assessment (FRIA) for biometric processing? Is the DPO involved in this decision?"
Pitfalls
| Pitfall | Consequence | Mitigation |
|---|---|---|
| Assuming liveness detection | Replay attacks get through | Combine with a random passphrase or a dedicated liveness API |
| Missing consent | GDPR violation (Art. 9) | Implement an explicit consent flow before enrollment |
| Enrollment audio too short | Low accuracy (< 70%) | Require at least 20 s of total enrollment audio |
| Hard-coded threshold of 0.5 | Suboptimal for the use case | Tune the threshold based on the ROC curve for your data |
| Expecting multilingual support | Text-dependent is English-only | Use text-independent if multiple languages are required |
| Ignoring the AI Act | Legal/regulatory risk | Start with a FRIA, document model governance |
| No human override | Poor UX on false rejection | Always have a fallback mechanism |
Recommendations per maturity level
Beginner (Proof of Concept):
- Start with text-independent verification for a simpler UX
- Use the default threshold (0.5) and the Speech SDK quickstart samples
- Norway East region for compliance
- 10-20 test users to validate accuracy in realistic scenarios
Experienced (Pilot Production):
- Tune a custom threshold based on pilot data
- Implement a consent management workflow
- Intelligent Speaker for Teams Rooms scenarios
- Monitor the similarity score distribution and rejection rate
Advanced (Enterprise Scale):
- Customer-Managed Keys (CMK) for voice signature encryption
- Multi-region deployment for redundancy (Norway East + West Europe)
- Integration with Identity Governance (Entra ID verification)
- Automated re-enrollment when accuracy degrades
- SIEM integration to detect replay attack patterns
Enterprise Security Add-ons:
Speaker Recognition + Azure AD Conditional Access
→ Require voice verification for high-value transactions
→ Step-up authentication based on risk score
→ Anomaly detection if the voice matches but location/device is unusual
Decision Checklist
Before you recommend Speaker Recognition:
- The customer has a legal basis for biometric processing (consent/legal obligation)
- Data residency requirements are mapped (Norway East vs. West Europe)
- The liveness detection gap is understood and mitigated
- A fallback mechanism is designed for false rejections
- Audio quality from target devices is validated
- A threshold tuning plan exists (not the default 0.5 for production)
- AI Act compliance is assessed (FRIA for high-risk systems)
- The cost model is approved (transactions vs. support cost tradeoff)
Sources and verification
Microsoft Learn (Verified via MCP)
- Speaker Recognition REST API Reference
  - URL: https://learn.microsoft.com/en-us/rest/api/speakerrecognition/
  - Confidence: Verified (MCP fetch 2026-02-03)
  - Coverage: API endpoints, text-dependent/independent specs, similarity scoring
- Speaker Recognition Overview
  - URL: https://learn.microsoft.com/en-us/azure/ai-services/speech-service/speaker-recognition-overview
  - Confidence: Verified (MCP fetch 2026-02-03)
  - Coverage: Feature overview, verification vs. identification, use cases
- Data Privacy and Security for Text-to-Speech
  - URL: https://learn.microsoft.com/en-us/azure/ai-foundry/responsible-ai/speech-service/text-to-speech/data-privacy-security
  - Confidence: Verified (MCP fetch 2026-02-03)
  - Coverage: Speaker Verification for voice talent consent, voice signature processing, DPA compliance
- Speech SDK Code Samples
  - URL: https://github.com/Azure-Samples/cognitive-services-speech-sdk
  - Confidence: Verified (MCP code sample search 2026-02-03)
  - Coverage: C# enrollment/verification examples, Speech SDK patterns
- Teams Rooms Voice Recognition
  - URL: https://learn.microsoft.com/en-us/microsoftteams/rooms/voice-recognition
  - Confidence: Verified (MCP search 2026-02-03)
  - Coverage: Intelligent Speaker, policy configuration, speaker attribution
Confidence Markers per Section
| Section | Confidence | Source |
|---|---|---|
| Core components | Verified | REST API ref + Overview docs (MCP) |
| Architecture patterns | Baseline + Verified | Model knowledge + Teams docs (MCP) |
| Decision guidance | Baseline | Practical experience + threshold best practices |
| Microsoft integration | Verified | Teams, Custom Voice docs (MCP) |
| GDPR/Public sector | Baseline | Legal framework knowledge (update after legal review) |
| Cost | Baseline | Estimated from USD pricing (verify in the Azure calculator) |
Areas that should be verified further
⚠️ Pricing model: Estimated from a USD → NOK conversion. Verify exact NOK pricing in the Azure Portal.
⚠️ AI Act compliance: General interpretation of the high-risk classification. Requires legal review for production.
⚠️ Norway East availability: Assumed available based on Speech Services regional presence. Verify in the Azure Portal.
This reference was generated 2026-02-03 based on Microsoft Learn documentation retrieved via MCP (microsoft-learn server). For production decisions, always verify against the Azure Portal and consult the legal team on compliance questions.