Data Quality for Responsible AI - Ensuring Training Data Integrity
Last updated: 2026-04 Status: GA Category: Responsible AI & Governance
Introduction
Data quality is the foundation of responsible AI. Machine learning models learn from the historical decisions and actions captured in their training data, and their performance in production depends directly on the quality of that data. Poor data quality leads to bias, unfairness, incorrect predictions, and loss of trust.
This reference covers Microsoft's approach to ensuring data integrity throughout the ML lifecycle, from data collection and preprocessing to maintenance and lineage tracking. For public sector organizations (Norway in particular), this is critical for meeting requirements for accountability, transparency, and fair treatment.
Core principle: Trustworthy training data is more likely to produce trustworthy outcomes. Data quality is not a one-off task but a continuous process that must be integrated into MLOps practice.
Core components
1. Data Sources and Diversity
Sources of training data:
| Type | Description | Quality risk |
|---|---|---|
| Proprietary data | The organization's own data | Label bias, underrepresentation |
| Public sources | Wikipedia, PubMed, public datasets | Variable quality, insufficient curation |
| User-generated data | User interactions, feedback, collaboration | Noise, malicious inputs, drifting patterns |
Quality challenges:
- Imbalanced datasets → models that favor majority classes
- Underrepresentation → poor performance for minority groups
- Skewed feature distribution → incorrect predictions for underrepresented segments
Balancing techniques:
- SMOTE (Synthetic Minority Oversampling Technique): generates synthetic examples for minority classes
- Undersampling: reduces the majority classes
- Synthetic data generation (Azure AI Foundry): generates representative datasets
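To make the SMOTE idea concrete, the sketch below interpolates between a minority sample and one of its nearest minority neighbours, using only the Python standard library. It illustrates the principle and is not the imbalanced-learn or Azure ML implementation; the function name `smote_like_oversample`, the parameter `k`, and the sample points are assumptions made for this example.

```python
import math
import random


def smote_like_oversample(minority, n_new, k=3, seed=0):
    """Create n_new synthetic points by interpolating between a minority
    sample and one of its k nearest minority-class neighbours."""
    if len(minority) < 2:
        raise ValueError("need at least two minority samples to interpolate")
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Rank the other minority samples by Euclidean distance to the base point.
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append(
            tuple(b + gap * (n - b) for b, n in zip(base, neighbour))
        )
    return synthetic


# Hypothetical two-feature minority class.
minority_class = [(1.0, 2.0), (1.5, 1.8), (1.2, 2.4), (0.9, 2.1)]
new_points = smote_like_oversample(minority_class, n_new=6)
```

Because every synthetic point lies on a segment between two real minority samples, the new data stays inside the region the minority class already occupies. That is also the limitation: interpolation cannot add information the original samples lack.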
2. Exploratory Data Analysis (EDA)
Perform EDA early in feature design to identify:
- Characteristics, relationships, patterns
- Quality problems (missing values, outliers, noise)
- Over- and underrepresentation
- Statistical bias
Platform support:
- Azure Machine Learning Responsible AI dashboard → Data Analysis component
- Visualizations: aggregate plots, scatter plots, cohort-based analysis
- Filter on predicted outcome, dataset features, error groups
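Before opening a dashboard, two of the checks listed above (missing values and over- or underrepresentation) can be approximated in a few lines. This is a minimal standard-library sketch; the field names, the sample records, and the 10% minimum share are assumptions chosen for illustration.

```python
from collections import Counter


def missing_rate(records, field):
    """Share of records where the field is absent or None."""
    missing = sum(1 for r in records if r.get(field) is None)
    return missing / len(records)


def group_shares(records, field):
    """Relative frequency of each value of a categorical field."""
    counts = Counter(r.get(field) for r in records)
    return {value: n / len(records) for value, n in counts.items()}


def underrepresented(shares, min_share=0.10):
    """Groups whose share falls below an assumed minimum threshold."""
    return sorted((g for g, s in shares.items() if s < min_share), key=str)


# Hypothetical dataset: 20 records with a region and an income field.
records = (
    [{"region": "east", "income": 500}] * 14
    + [{"region": "west", "income": None}] * 2
    + [{"region": "west", "income": 420}] * 3
    + [{"region": "north", "income": 380}]
)

income_missing = missing_rate(records, "income")
shares = group_shares(records, "region")
flagged = underrepresented(shares)
```

Here `flagged` contains the groups that deserve a closer look in cohort-based analysis before any model is trained.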
3. Data Preprocessing
Four key techniques (verified against Microsoft Docs):
| Technique | Purpose | Example |
|---|---|---|
| Quality filtering | Remove noise and incomplete observations | Eliminate product reviews that are too short |
| Rescoping | Broaden overly specific fields | Address → city/state instead of street/house number |
| Deduplication | Remove redundancy | 1000 identical log entries → 1 observation |
| Sensitive data handling | Eliminate personal data unless it is critical | Anonymize PII, remove unnecessary personal information |
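The four techniques can be combined in a single pass over the raw records. The sketch below assumes a hypothetical list of product reviews with `text`, `name`, `email`, and `address` fields; the field names and the three-word minimum are illustrative choices, not values taken from the Microsoft guidance.

```python
def preprocess(reviews, min_words=3):
    """Apply the four preprocessing techniques to a list of review dicts."""
    seen = set()
    cleaned = []
    for review in reviews:
        text = review["text"].strip()
        # Quality filtering: drop reviews that are too short to be informative.
        if len(text.split()) < min_words:
            continue
        # Deduplication: keep only the first copy of identical text.
        key = text.lower()
        if key in seen:
            continue
        seen.add(key)
        cleaned.append({
            "text": text,
            # Rescoping: keep the city, drop street and house number.
            "city": review["address"]["city"],
            # Sensitive data handling: name and e-mail are not copied over.
        })
    return cleaned


raw_reviews = [
    {"text": "Great product, works well", "name": "A", "email": "a@example.com",
     "address": {"street": "Storgata 1", "city": "Oslo"}},
    {"text": "great product, works well", "name": "B", "email": "b@example.com",
     "address": {"street": "Bryggen 2", "city": "Bergen"}},
    {"text": "Bad", "name": "C", "email": "c@example.com",
     "address": {"street": "Kongens gate 3", "city": "Trondheim"}},
]
cleaned_reviews = preprocess(raw_reviews)
```

Building the output record from an explicit allow-list of fields, rather than deleting known sensitive fields, means a newly added personal field is excluded by default.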
Standardized transformation:
- Convert to ML-compatible formats
- Image → text (OCR for scanned documents)
- Adjust orientations/aspect ratios for model compatibility
4. Data Validation and Guardrails
Azure Machine Learning AutoML Data Guardrails:
| Guardrail | Status | Condition |
|---|---|---|
| Class balancing detection | Alerted/Passed | Detects imbalanced classes |
| Memory issues detection | Done/Passed | Checks that horizon/lag/rolling window settings do not cause out-of-memory errors |
| Frequency detection | Done/Passed | Verifies time-series alignment |
Data quality expectations (Azure Databricks / Lakeflow Spark Declarative Pipelines): (Verified MCP 2026-04)
Note: Delta Live Tables has been officially renamed Lakeflow Spark Declarative Pipelines. The code examples (@dp.table, @dp.expect_all_or_drop) are still valid.

    from pyspark import pipelines as dp

    # Each expectation maps a name to a SQL boolean constraint.
    valid_pages = {
        "valid_count": "count > 0",
        "valid_current_page": "current_page_id IS NOT NULL AND current_page_title IS NOT NULL",
    }

    @dp.table
    @dp.expect_all_or_drop(valid_pages)
    def prepared_data():
        # Records that fail any expectation are dropped.
        # Assumption: "raw_data" is a hypothetical upstream table; spark is provided by the pipeline runtime.
        return spark.read.table("raw_data")

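The drop semantics can be illustrated outside a pipeline with plain Python. This sketch is not the Databricks API: `drop_invalid` is a hypothetical helper, and the SQL constraints from the example above are rewritten as Python predicates purely for illustration.

```python
def drop_invalid(records, expectations):
    """Keep only records that satisfy every named expectation.

    Returns the kept records and a per-expectation count of violations.
    """
    kept = []
    violations = {name: 0 for name in expectations}
    for record in records:
        failed = [name for name, check in expectations.items() if not check(record)]
        for name in failed:
            violations[name] += 1
        if not failed:
            kept.append(record)
    return kept, violations


# Python equivalents of the two SQL constraints (an assumption for this sketch).
valid_pages = {
    "valid_count": lambda r: r["count"] > 0,
    "valid_current_page": lambda r: r["current_page_id"] is not None
    and r["current_page_title"] is not None,
}

pages = [
    {"count": 3, "current_page_id": 1, "current_page_title": "Home"},
    {"count": 0, "current_page_id": 2, "current_page_title": "About"},
    {"count": 5, "current_page_id": None, "current_page_title": None},
]
kept, violations = drop_invalid(pages, valid_pages)
```

Counting violations per expectation, instead of silently discarding rows, keeps the dropped share visible so that a sudden rise can itself be treated as a data quality signal.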
5. Feature Stores
A centralized repository for features that ensures:
- Consistency between training and inference
- Feature reuse across models and teams
- Versioning and immutability
- Automated data drift detection
Implementation patterns:
- Centralized → single source of truth, strong governance
- Distributed → team autonomy, requires coordination
- Hybrid → common features centralized, domain-specific features distributed
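Versioning and immutability are what make training and inference consistent: both can pin the same feature version. The sketch below is a hypothetical in-memory model of that idea and does not reflect the Azure ML Feature Store API; the class names and the `avg_spend` feature are invented for illustration.

```python
from dataclasses import dataclass
from types import MappingProxyType


@dataclass(frozen=True)
class FeatureVersion:
    name: str
    version: int
    values: MappingProxyType  # read-only view: entity id -> feature value


class FeatureStore:
    """Hypothetical in-memory store: every write creates a new version."""

    def __init__(self):
        self._versions = {}  # feature name -> list of FeatureVersion

    def publish(self, name, values):
        history = self._versions.setdefault(name, [])
        # Copy the input so later changes by the caller cannot leak in.
        fv = FeatureVersion(name, len(history) + 1, MappingProxyType(dict(values)))
        history.append(fv)
        return fv

    def get(self, name, version=None):
        history = self._versions[name]
        return history[-1] if version is None else history[version - 1]


store = FeatureStore()
store.publish("avg_spend", {"c1": 100.0, "c2": 40.0})
store.publish("avg_spend", {"c1": 120.0, "c2": 45.0})

training_features = store.get("avg_spend", version=1)   # pinned for reproducibility
latest_features = store.get("avg_spend")                # newest version
```

A model trained against version 1 can be served against version 1 even after newer values are published, which avoids silent training/serving skew.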
6. Data Lineage Tracking
Trace the data's path from source to model training for:
- Explainability and transparency
- Debugging and root cause analysis
- Identifying bias introduced in preprocessing
- Compliance and auditability
Platform integration:
- Azure Machine Learning + Microsoft Purview → automatic lineage tracking
- Version control (Git, Azure DevOps) → track changes to training datasets
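At its core, lineage means recording which input produced which output at every step. The sketch below shows that idea with content hashes from the standard library; it is an assumption-laden illustration, not how Purview stores lineage, and the step name and sample rows are hypothetical.

```python
import hashlib
import json


def fingerprint(rows):
    """Deterministic SHA-256 hash of a dataset (list of JSON-serialisable rows)."""
    payload = json.dumps(rows, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def lineage_step(step_name, inputs, outputs):
    """Record which input fingerprint a step turned into which output fingerprint."""
    return {
        "step": step_name,
        "input_hash": fingerprint(inputs),
        "output_hash": fingerprint(outputs),
    }


source = [{"id": 1, "age": 34}, {"id": 2, "age": None}]
cleaned = [row for row in source if row["age"] is not None]
lineage = [lineage_step("drop_missing_age", source, cleaned)]
```

If a later audit needs to confirm that a model was trained on a particular dataset, recomputing the fingerprint and comparing it with the recorded `output_hash` gives a simple tamper check.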
7. Decision Integrity and Security
Threats to training data (verified against the Microsoft Security whitepaper):
| Threat | Description | Mitigation |
|---|---|---|
| Malicious data injection | Attackers introduce crafted inputs | Data resilience, decision integrity checks |
| Target leakage | The model "cheats" by using data from the future | Validate features, temporal consistency |
| Training data tampering | Modification of trusted training data | Access controls, immutable datasets |
Overtraining pitfalls:
- Overfitting → the model memorizes the training data and fails on test data
- Target leakage → abnormally high accuracy (95%+) is a warning sign of probable leakage
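A temporal consistency check is one practical way to catch target leakage: any feature that only becomes known after the moment of prediction cannot legitimately be used. The sketch below assumes each feature carries the date on which its value was recorded; the feature names and dates are hypothetical.

```python
from datetime import date


def leaking_features(feature_dates, prediction_date):
    """Names of features whose values were recorded after the prediction date."""
    return sorted(
        name for name, recorded in feature_dates.items()
        if recorded > prediction_date
    )


feature_dates = {
    "income_last_year": date(2024, 12, 31),
    "open_cases": date(2025, 3, 1),
    "benefit_granted_on": date(2025, 6, 15),  # only known after the decision
}
suspects = leaking_features(feature_dates, prediction_date=date(2025, 4, 1))
```

Features returned by the check should be removed or replaced with a version that was available at prediction time before the model is retrained and re-evaluated.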
Architecture patterns
Pattern 1: Centralized Training Data Pipeline
Source Data (Production/External)
↓
Data Collection Store (localized)
↓
Exploratory Data Analysis (EDA)
↓
Preprocessing (quality, rescoping, deduplication, PII removal)
↓
Feature Store (versioned, immutable features)
↓
Training Data (train/validation/test split)
↓
Model Training
↓
Responsible AI Dashboard → Data Analysis
When to use:
- Strong data governance
- Compliance requirements (GDPR, public sector)
- Multiple teams share the same dataset
Pattern 2: Segmented Data Pipeline
Use case: Separate pipelines for data with distinct security requirements.
Geo Region A Data → Pipeline A → Model A
Geo Region B Data → Pipeline B → Model B
↓
(Optional) Federated Training → Combined Model
Requirements:
- Access controls per segment
- Same security rigor across all segments
- Regulatory constraints (data residency)
Pattern 3: Continuous Data Quality Monitoring
Production Data → Real-time Ingestion
↓
Data Quality Checks (expectations, guardrails)
↓
[Pass] → Feature Store → Retraining
[Fail] → Alert → Manual Review
↓
Monitor for Data Drift / Concept Drift
↓
Trigger Retraining (condition-based or scheduled)
Platform:
- Azure Machine Learning Model Monitoring → data drift, data quality signals
- Databricks Expectations → inline quality checks
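One common way to quantify the drift monitored in this pattern is the Population Stability Index (PSI). The sketch below is a standard-library illustration, not what Azure Machine Learning Model Monitoring computes internally; the bin edges, the 0.2 alert threshold (a widely quoted rule of thumb), and the sample values are assumptions.

```python
import math


def psi(reference, production, edges):
    """Population Stability Index over the bins defined by ascending `edges`."""
    def shares(values):
        counts = [0] * (len(edges) + 1)
        for v in values:
            # Bin index = number of edges that are less than or equal to v.
            counts[sum(v >= e for e in edges)] += 1
        # A small floor avoids log(0) for empty bins.
        return [max(c / len(values), 1e-6) for c in counts]

    ref, prod = shares(reference), shares(production)
    return sum((p - r) * math.log(p / r) for r, p in zip(ref, prod))


# Hypothetical "age" feature at training time and in production.
reference_ages = [20, 25, 30, 35, 40, 45, 50, 55, 60, 65]
production_ages = [45, 50, 55, 60, 65, 70, 75, 80, 85, 90]
age_edges = [30, 50]

drift_score = psi(reference_ages, production_ages, age_edges)
action = "trigger_retraining" if drift_score > 0.2 else "no_action"
```

PSI is zero when both distributions fill the bins identically and grows as production data moves into bins that were rare in the reference data.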
Pattern 4: Foundation Model Fine-Tuning Data Pipeline
Smaller volume, higher quality requirements:
High-Quality Domain-Specific Examples
↓
Manual Curation / Expert Review
↓
Small Training Set (100-1000s examples)
↓
Fine-Tune Pre-Trained Model
↓
Validate on Hold-Out Test Set
Example: Fine-tune GPT-4 for medical documentation → training examples must accurately represent medical terminology and clinical reasoning.
Decision guidance
When to use a centralized vs. distributed feature store?
| Kriterium | Centralized | Distributed |
|---|---|---|
| Organization size | Large, standardized | Multiple autonomous teams |
| Governance maturity | High | Moderate |
| Feature overlap | High (many shared features) | Low (domain-specific) |
| Compliance | Strict centralized control | Team-level flexibility |
How often should you retrain?
| Trigger | Frequency | Use Case |
|---|---|---|
| Scheduled | Daily/weekly | Routine maintenance, stable data |
| Trigger-based | On data drift detection | Dynamic environments, rapid change |
| Hybrid | Both | Fail-proof operations (scheduled + triggered) |
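The hybrid row in the table above can be expressed as a small policy function: retrain when the model is older than the schedule allows or when drift exceeds a threshold. The seven-day interval and the 0.2 drift threshold below are illustrative assumptions, as is the function name.

```python
from datetime import date, timedelta


def should_retrain(last_trained, today, drift_score,
                   max_age=timedelta(days=7), drift_threshold=0.2):
    """Hybrid policy: returns the list of reasons to retrain (empty = no retrain)."""
    reasons = []
    if today - last_trained >= max_age:
        reasons.append("scheduled")
    if drift_score > drift_threshold:
        reasons.append("drift")
    return reasons


last_run = date(2026, 4, 1)
no_action = should_retrain(last_run, date(2026, 4, 3), drift_score=0.05)
drift_only = should_retrain(last_run, date(2026, 4, 3), drift_score=0.35)
schedule_only = should_retrain(last_run, date(2026, 4, 9), drift_score=0.05)
```

Returning the reasons rather than a bare boolean makes it easy to log why a retraining run was started, which helps when auditing the pipeline later.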
How long should training data be retained?
| Scenario | Retention Policy | Rationale |
|---|---|---|
| Data unchanged | Delete after training | Reduce storage costs, minimize risk |
| Model drift detected | Retain for comparison | Rebuild/retrain with historical data |
| Compliance | Follow RTBF (Right to Be Forgotten) | Remove personal data on request |
| Disaster recovery | Secondary pipeline with redundancy | Regenerate model exactly as before |
How to handle imbalanced data?

    IF minority class < 10% THEN
        IF synthetic data acceptable THEN
            Apply SMOTE
        ELSE
            Oversample minority OR Undersample majority
        END IF
    ELSE IF minority class is 10-30% THEN
        Use class weights in model training
    ELSE
        Standard training (sufficient balance)
    END IF

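The decision logic above translates directly into a function that inspects the label distribution. This is a minimal sketch: the strategy names it returns are labels invented for the example, and the 10% and 30% cut-offs are taken from the pseudocode.

```python
from collections import Counter


def balancing_strategy(labels, synthetic_ok=True):
    """Pick a balancing strategy from the share of the rarest class."""
    counts = Counter(labels)
    minority_share = min(counts.values()) / len(labels)
    if minority_share < 0.10:
        # Severe imbalance: synthesize if allowed, otherwise resample.
        return "smote" if synthetic_ok else "resample"
    if minority_share <= 0.30:
        return "class_weights"
    return "standard"


severe = ["approved"] * 95 + ["rejected"] * 5
moderate = ["approved"] * 80 + ["rejected"] * 20
balanced = ["approved"] * 50 + ["rejected"] * 50

choice_severe = balancing_strategy(severe)
choice_moderate = balancing_strategy(moderate)
choice_balanced = balancing_strategy(balanced)
```

Passing `synthetic_ok=False` covers cases where synthetic records are not acceptable, for example when every training record must be traceable to a real observation.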
Integration with the Microsoft stack
Azure Machine Learning
| Component | Capability | Data quality support |
|---|---|---|
| Responsible AI Dashboard | Data analysis, fairness, error analysis | Visualize distribution, identify bias |
| AutoML Data Guardrails | Class balancing, memory, frequency checks | Automated alerts |
| Model Monitoring | Data drift, data quality signals | Continuous monitoring |
| ML Datasets | Versioned, registered datasets | Lineage tracking |
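Code Sample (Data Quality Signal):

    from azure.ai.ml.entities import (
        DataQualityMetricsCategorical,
        DataQualityMetricsNumerical,
        DataQualityMetricThreshold,
        DataQualitySignal,
    )

    # Thresholds: 1% null-value rate for numerical features, 2% out-of-bounds rate for categorical features.
    metric_thresholds = DataQualityMetricThreshold(
        numerical=DataQualityMetricsNumerical(null_value_rate=0.01),
        categorical=DataQualityMetricsCategorical(out_of_bounds_rate=0.02),
    )

    # production_data and reference_data_training are assumed to be defined earlier in the monitoring setup.
    data_quality_signal = DataQualitySignal(
        production_data=production_data,
        reference_data=reference_data_training,
        features=['feature_A', 'feature_B', 'feature_C'],
        metric_thresholds=metric_thresholds,
        alert_enabled=True,
    )
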
Azure AI Foundry
- Evaluation tools → assess data quality before training
- Synthetic data generation → generate balanced datasets
- Content Safety → filter harmful training data (protected material detection)
Azure Databricks
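Expectations pattern:

    from pyspark import pipelines as dp

    @dp.table
    @dp.expect_all_or_fail({"valid_count": "count > 0"})
    def customer_facing_data():
        # The pipeline fails if the expectation is not met. Assumption: "prepared_data" is a hypothetical upstream table.
        return spark.read.table("prepared_data")
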
Microsoft Purview
- Data discovery and classification → automated tagging
- Lineage tracking → full data provenance
- Compliance policies → enforce GDPR/data residency
Azure DevOps
- Version control for training datasets
- CI/CD pipelines → automated data validation
- Rollback → revert to previous dataset version if quality degrades
Public sector (Norway)
Special requirements
| Requirement | Implementation | Microsoft support |
|---|---|---|
| Accountability (etterrettelighet) | Full lineage tracking from source to model | Azure ML + Purview |
| Transparency | Responsible AI Scorecard (PDF for stakeholders) | Azure ML RAI dashboard |
| Fair treatment | Fairness assessment, class balancing | AutoML guardrails, RAI dashboard |
| Privacy (GDPR) | PII removal, RTBF compliance | Data preprocessing, anonymization |
| Data residency | Segmented pipelines per region | Norway East/West regions |
Example: NAV (the Norwegian Labour and Welfare Administration)
Scenario: Prediction model for disability benefit (uføretrygd).
Data quality challenges:
- Historical data may contain bias (underrepresentation of groups)
- Personal data must be anonymized
- The model must be transparent to auditors
Solution:
- EDA → identify underrepresentation (age, gender, region)
- Balancing → SMOTE for minority groups
- PII removal → anonymize national identity numbers (fødselsnummer) and addresses
- Fairness assessment → RAI dashboard → verify that accuracy is comparable across demographic groups
- Scorecard → generate a PDF for political stakeholders and auditors
- Lineage → Purview → document that all data was lawfully collected
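The fairness assessment step amounts to comparing a metric across groups. The sketch below computes accuracy per group and the largest gap between groups using the standard library only; it does not reproduce the RAI dashboard's calculations, and the age bands and predictions are made-up illustration data.

```python
from collections import defaultdict


def accuracy_by_group(rows):
    """rows: iterable of (group, y_true, y_pred). Returns group -> accuracy."""
    hits, totals = defaultdict(int), defaultdict(int)
    for group, y_true, y_pred in rows:
        totals[group] += 1
        hits[group] += int(y_true == y_pred)
    return {g: hits[g] / totals[g] for g in totals}


def accuracy_gap(per_group):
    """Largest difference in accuracy between any two groups."""
    return max(per_group.values()) - min(per_group.values())


# Hypothetical evaluation results: (age band, true label, predicted label).
evaluation = [
    ("18-39", 1, 1), ("18-39", 0, 0), ("18-39", 1, 0), ("18-39", 0, 0),
    ("40-67", 1, 1), ("40-67", 0, 0), ("40-67", 1, 1), ("40-67", 0, 0),
]
per_group = accuracy_by_group(evaluation)
gap = accuracy_gap(per_group)
```

What counts as an acceptable gap is a policy decision for the organization; the code only makes the difference visible so that it can be documented in the scorecard.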
Cost and licensing
Cost components
| Component | Cost driver | Estimate (NOK/month) |
|---|---|---|
| Data storage (localized) | Azure Blob/ADLS Gen2 | 500-5000 (depends on volume) |
| Compute (EDA, preprocessing) | Databricks/Synapse Spark | 2000-20000 (depends on scale) |
| Feature store | Azure ML Feature Store | Included in Azure ML |
| Purview (lineage) | Data governance scanning | 3000-10000 (depends on data sources) |
| AutoML (guardrails) | Compute for training | 1000-10000 per experiment |
Optimization tips:
- Use serverless Spark (pay-per-use) for EDA
- Delete stale training data
- Share feature stores across teams
License requirements
| Capability | License | Comment |
|---|---|---|
| Azure Machine Learning | Azure subscription | Pay-as-you-go compute |
| Responsible AI Dashboard | Included in Azure ML | No extra cost |
| Microsoft Purview | Separate license | Data governance add-on |
| Databricks Expectations | Databricks license | Premium/Enterprise tier |
| Azure AI Foundry | Azure subscription | Separate compute charges |
For the architect (Cosmo)
Quick Decision Tree
START: Customer needs an AI model
1. Do they have existing training data?
   NO → Recommend a data collection strategy (proprietary vs. public vs. synthetic)
   YES → Continue to 2
2. Is the dataset balanced?
   NO → Recommend SMOTE/oversampling/synthetic data
   YES → Continue to 3
3. Have they done EDA?
   NO → Recommend Azure ML RAI Dashboard → Data Analysis
   YES → Continue to 4
4. Is there PII in the dataset?
   YES → CRITICAL: Recommend preprocessing (anonymization/removal)
   NO → Continue to 5
5. Do they need compliance/auditability?
   YES → Recommend Purview + RAI Scorecard
   NO → Continue to 6
6. Do they have data drift in production?
   YES → Recommend Model Monitoring with data quality signals
   NO → Basic training pipeline is OK; continue to 7
7. Is this foundation model fine-tuning?
   YES → Recommend a small, high-quality curated dataset
   NO → Standard training pipeline
Red Flags (alert immediately)
| Symptom | Problem | Solution |
|---|---|---|
| "We have 95%+ accuracy on test" | Probable target leakage | Validate features, temporal consistency |
| "User data goes straight into training" | Malicious injection risk | Data validation, guardrails |
| "We deleted the bad examples" | Selection bias | Keep representative samples |
| "We train on all historical data" | Overfitting, stale data | Implement temporal windowing |
| "We don't have a test set" | Cannot validate generalization | 80/10/10 split (train/val/test) |
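The 80/10/10 split in the last row can be made reproducible by deriving the split from a hash of each record's identifier instead of a random shuffle. This is a sketch under the assumption that every record has a stable id; the function name and the integer ids are illustrative.

```python
import hashlib


def assign_split(record_id, train=0.8, val=0.1):
    """Map a record id to 'train', 'val' or 'test' via a stable hash."""
    digest = hashlib.sha256(str(record_id).encode("utf-8")).hexdigest()
    bucket = int(digest[:8], 16) / 0x100000000  # uniform value in [0, 1)
    if bucket < train:
        return "train"
    if bucket < train + val:
        return "val"
    return "test"


# Hypothetical dataset of 10,000 records identified by integer ids.
splits = [assign_split(i) for i in range(10_000)]
train_share = splits.count("train") / len(splits)
val_share = splits.count("val") / len(splits)
test_share = splits.count("test") / len(splits)
```

Because the assignment depends only on the id, a record never moves between train and test when the dataset grows or is reprocessed, which protects the test set from contamination across retraining runs.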
Cosmo's Talking Points
When the customer says: "We have lots of data, so quality isn't that important."
Answer: "It's the opposite: more data amplifies bias if the quality is poor. A model trained on 1M bad examples is worse than one trained on 10K good ones. Let's start with EDA to understand what you actually have."
When the customer says: "We can't delete personal data, it's important for the model."
Answer: "There are two questions: 1) Is it critical for predictive power, or can we anonymize it? 2) If it is critical, you need GDPR compliance (RTBF policy, consent management). I recommend Purview for tracking this."
When the customer says: "The model performs poorly for some groups."
Answer: "That is probably underrepresentation in the training data. Let's run a Fairness Assessment in the RAI Dashboard and see whether we need oversampling or more data."
Tool selection per scenario
| Scenario | Recommended tool | Alternative |
|---|---|---|
| EDA | Azure ML Notebooks + RAI Dashboard | Databricks Notebooks |
| Data validation | AutoML Guardrails | Databricks Expectations |
| Lineage tracking | Purview | Manual documentation (not recommended) |
| Class balancing | SMOTE (Azure ML) | Synthetic data (AI Foundry) |
| PII removal | Custom preprocessing scripts | Azure AI Language (PII detection) |
| Monitoring | Azure ML Model Monitoring | Custom dashboards (Grafana) |
Sources and verification
Verified (from Microsoft Learn MCP):
- Design training data for AI workloads on Azure https://learn.microsoft.com/en-us/azure/well-architected/ai/training-data-design Confidence: Verified → Covering data sources, preprocessing, feature stores, lineage, maintenance
- Understand your datasets (Responsible AI) https://learn.microsoft.com/en-us/azure/machine-learning/concept-data-analysis?view=azureml-api-2 Confidence: Verified → Data analysis component, cohorts, over/underrepresentation
- What is Responsible AI? https://learn.microsoft.com/en-us/azure/machine-learning/concept-responsible-ai?view=azureml-api-2 Confidence: Verified → Six principles, RAI dashboard components, transparency
- Responsible AI in Azure workloads https://learn.microsoft.com/en-us/azure/well-architected/ai/responsible-ai Confidence: Verified → User data handling, RTBF, explainability, privacy
- Prevent overfitting and imbalanced data with Automated ML https://learn.microsoft.com/en-us/azure/machine-learning/concept-manage-ml-pitfalls?view=azureml-api-2 Confidence: Verified → Overfitting, target leakage, class imbalance, SMOTE
- Securing AI and Machine Learning at Microsoft https://learn.microsoft.com/en-us/security/engineering/securing-artificial-intelligence-machine-learning Confidence: Verified → Malicious data injection, decision integrity, training data security
- Govern Azure platform services for AI https://learn.microsoft.com/en-us/azure/cloud-adoption-framework/scenarios/ai/platform/governance Confidence: Verified → Data discovery, classification, Purview, version control
- Model performance and fairness https://learn.microsoft.com/en-us/azure/machine-learning/concept-fairness-ml?view=azureml-api-2 Confidence: Verified → Parity constraints, mitigation algorithms (Fairlearn)
- Data featurization in AutoML https://learn.microsoft.com/en-us/azure/machine-learning/how-to-configure-auto-features?view=azureml-api-1 Confidence: Verified → Data guardrails (class balancing, memory, frequency detection)
- Azure Databricks Data Expectations https://learn.microsoft.com/en-us/azure/databricks/ldp/expectations Confidence: Verified → expect_all, expect_all_or_drop, expect_all_or_fail patterns
Baseline (model knowledge):
- Feature store patterns (centralized/distributed/hybrid)
- Decision trees for trigger-based vs. scheduled retraining
- Norwegian public sector requirements (etterrettelighet, GDPR)
Code samples:
- Azure ML Data Quality Signal (Python SDK) → Verified
- Databricks Expectations decorator pattern → Verified
Last updated: 2026-02 Next review: 2026-08 (or upon major Microsoft AI updates)