ktg-plugin-marketplace/plugins/ms-ai-architect/skills/ms-ai-governance/references/responsible-ai/data-quality-responsible-ai.md
Kjell Tore Guttormsen ad8a411f38 docs(architect): weekly KB update — 66 files refreshed (2026-04)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-09 22:41:26 +02:00


Data Quality for Responsible AI - Ensuring Training Data Integrity

Last updated: 2026-04 | Status: GA | Category: Responsible AI & Governance


Introduction

Data quality is the foundation of responsible AI. Machine learning models learn from the historical decisions and actions captured in training data, and their performance in production depends directly on the quality of that data. Poor data quality leads to bias, unfairness, wrong predictions, and loss of trust.

This reference covers Microsoft's approach to ensuring data integrity across the whole ML lifecycle, from data collection and preprocessing to maintenance and lineage tracking. For public sector organizations (Norway in particular), this is critical for meeting requirements for accountability (etterrettelighet), transparency, and fair treatment.

Core principle: trustworthy training data is more likely to produce trustworthy outcomes. Data quality is not a one-off task but a continuous process that must be built into MLOps practice.


Core components

1. Data Sources and Diversity

Sources of training data:

| Type | Description | Quality risk |
| --- | --- | --- |
| Proprietary data | The organization's own data | Label bias, underrepresentation |
| Public sources | Wikipedia, PubMed, public datasets | Variable quality, insufficient curation |
| User-generated data | User interactions, feedback, collaboration | Noise, malicious inputs, drifting patterns |

Quality challenges:

  • Imbalanced datasets → models that favor majority classes
  • Underrepresentation → poor performance for minority groups
  • Skewed feature distribution → wrong predictions for underrepresented segments

Balancing techniques:

  • SMOTE (Synthetic Minority Oversampling Technique): generates synthetic examples for minority classes
  • Undersampling: reduces majority classes
  • Synthetic data generation (Azure AI Foundry): generates representative datasets
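As a rough illustration of the core idea behind SMOTE, the sketch below creates synthetic minority points by interpolating between a minority sample and one of its nearest minority neighbours. It is a simplified, standard-library-only approximation and not the reference implementation; the function name `smote_like_oversample` and the sample data are hypothetical.

```python
import random

def smote_like_oversample(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points by interpolating between a minority
    sample and one of its k nearest minority neighbours (SMOTE's core idea)."""
    if len(minority) < 2:
        raise ValueError("need at least two minority samples to interpolate")
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Rank the other minority points by squared Euclidean distance to base.
        others = sorted(
            (p for p in minority if p is not base),
            key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)),
        )
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(base, neighbour)))
    return synthetic

# Hypothetical minority-class feature vectors (two numeric features each).
minority = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.3)]
new_points = smote_like_oversample(minority, n_new=6)
```

Because every synthetic point lies on a segment between two real minority samples, the new data stays inside the region the minority class already occupies. In practice a maintained library would normally be preferred over a hand-rolled version like this one.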

2. Exploratory Data Analysis (EDA)

Run EDA early in feature design to identify:

  • Characteristics, relationships, patterns
  • Quality problems (missing values, outliers, noise)
  • Over-/underrepresentation
  • Statistical bias

Platform support:

  • Azure Machine Learning Responsible AI dashboard → Data Analysis component
  • Visualizations: aggregate plots, scatter plots, cohort-based analysis
  • Filter by predicted outcome, dataset features, error groups
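Two of the EDA checks listed above, missing values and over-/underrepresentation, can be sketched with the standard library alone. This is a minimal illustration under assumed data; the helper names `missing_rate` and `group_shares` and the sample rows are hypothetical and unrelated to the Responsible AI dashboard API.

```python
from collections import Counter

def missing_rate(rows, column):
    """Share of rows where the column is absent or None."""
    return sum(1 for r in rows if r.get(column) is None) / len(rows)

def group_shares(rows, column):
    """Relative frequency of each value in a column (None counted separately)."""
    counts = Counter(r.get(column) for r in rows)
    return {value: n / len(rows) for value, n in counts.items()}

# Hypothetical sample of a training table.
rows = [
    {"age": 34, "region": "north", "label": 1},
    {"age": None, "region": "south", "label": 0},
    {"age": 51, "region": "south", "label": 0},
    {"age": 29, "region": "south", "label": 0},
]
age_missing = missing_rate(rows, "age")        # quality problem: missing values
region_share = group_shares(rows, "region")    # representation per region
label_share = group_shares(rows, "label")      # class balance
```

A strongly skewed `region_share` or `label_share` is the kind of signal that would prompt the balancing techniques described earlier.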

3. Data Preprocessing

Four key techniques (verified against Microsoft Docs):

| Technique | Purpose | Example |
| --- | --- | --- |
| Quality filtering | Remove noise and incomplete observations | Eliminate product reviews that are too short |
| Rescoping | Broaden overly specific fields | Address → city/state instead of street/house number |
| Deduplication | Remove redundancy | 1000 identical log entries → 1 observation |
| Sensitive data handling | Eliminate personal data unless it is critical | Anonymize PII, remove unnecessary personal data |

Standardized transformation:

  • Convert to ML-compatible formats
  • Image → text (OCR for scanned documents)
  • Adjust orientations/aspect ratios for model compatibility
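The four preprocessing techniques can be combined in a single pass over the records. The sketch below is a minimal illustration, assuming a hypothetical review dataset with `text`, `city`, `street`, and `email` fields; the function name `preprocess` and the three-word minimum are assumptions, not values from Microsoft guidance.

```python
def preprocess(reviews, min_words=3):
    """Apply quality filtering, rescoping, PII removal, and deduplication."""
    seen = set()
    cleaned = []
    for r in reviews:
        text = r["text"].strip()
        # Quality filtering: drop reviews that are too short to be useful.
        if len(text.split()) < min_words:
            continue
        # Rescoping + sensitive data handling: keep only the city and drop
        # the street address and e-mail entirely.
        record = {"text": text, "city": r["city"]}
        # Deduplication: keep the first occurrence of each identical record.
        key = (record["text"].lower(), record["city"])
        if key in seen:
            continue
        seen.add(key)
        cleaned.append(record)
    return cleaned

# Hypothetical raw records.
reviews = [
    {"text": "Great product, works well", "city": "Oslo", "street": "Storgata 1", "email": "a@example.com"},
    {"text": "Great product, works well", "city": "Oslo", "street": "Storgata 1", "email": "a@example.com"},
    {"text": "Bad", "city": "Bergen", "street": "Bryggen 2", "email": "b@example.com"},
]
cleaned = preprocess(reviews)
```

Building a new record from an allow-list of fields, rather than deleting known sensitive fields, means that any PII column added upstream later is excluded by default.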

4. Data Validation and Guardrails

Azure Machine Learning AutoML Data Guardrails:

| Guardrail | Status | Condition |
| --- | --- | --- |
| Class balancing detection | Alerted/Passed | Detects imbalanced classes |
| Memory issues detection | Done/Passed | Checks that horizon/lag/rolling window do not cause OOM |
| Frequency detection | Done/Passed | Verifies time-series alignment |

Data quality expectations (Azure Databricks / Lakeflow Spark Declarative Pipelines): (Verified MCP 2026-04)

Note: Delta Live Tables has been officially renamed Lakeflow Spark Declarative Pipelines. The code examples (@dp.table, @dp.expect_all_or_drop) remain valid.

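from pyspark import pipelines as dp

# Expectation name -> SQL boolean constraint evaluated for each record
valid_pages = {
    "valid_count": "count > 0",
    "valid_current_page": "current_page_id IS NOT NULL AND current_page_title IS NOT NULL"
}

@dp.table
@dp.expect_all_or_drop(valid_pages)
def prepared_data():
    # Records that fail any expectation are dropped from the target table.
    # "raw_data" is a placeholder source table, not part of the original example.
    return spark.read.table("raw_data")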

5. Feature Stores

A centralized repository for features that ensures:

  • Consistency between training and inference
  • Feature reuse across models and teams
  • Versioning and immutability
  • Automated data drift detection

Implementation patterns:

  • Centralized → single source of truth, strong governance
  • Distributed → team autonomy, requires coordination
  • Hybrid → common features centrally, domain-specific features distributed
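To make the versioning and immutability properties concrete, here is a minimal in-memory sketch. It is not the Azure ML Feature Store API; the classes `FeatureVersion` and `FeatureStore` and the feature name `avg_basket` are hypothetical stand-ins.

```python
from dataclasses import dataclass
from types import MappingProxyType

@dataclass(frozen=True)
class FeatureVersion:
    name: str
    version: int
    values: MappingProxyType  # read-only mapping of entity id -> feature value

class FeatureStore:
    """In-memory stand-in for a versioned, immutable feature store."""
    def __init__(self):
        self._versions = {}

    def register(self, name, values):
        # Every registration creates a new version; existing ones never change.
        version = 1 + max((v for (n, v) in self._versions if n == name), default=0)
        self._versions[(name, version)] = FeatureVersion(
            name, version, MappingProxyType(dict(values))
        )
        return version

    def get(self, name, version):
        # Training and inference both read by (name, version),
        # so they see exactly the same values.
        return self._versions[(name, version)]

store = FeatureStore()
v1 = store.register("avg_basket", {"c1": 12.5, "c2": 40.0})
v2 = store.register("avg_basket", {"c1": 13.0, "c2": 41.5})
training_view = store.get("avg_basket", v1)
inference_view = store.get("avg_basket", v1)
```

Pinning a model to an explicit feature version is what keeps training and inference consistent when newer versions are published.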

6. Data Lineage Tracking

Trace the data's path from source to model training to support:

  • Explainability and transparency
  • Debugging and root cause analysis
  • Identifying bias introduced during preprocessing
  • Compliance and auditability

Platform integration:

  • Azure Machine Learning + Microsoft Purview → automatic lineage tracking
  • Version control (Git, Azure DevOps) → track changes to training datasets
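Where a managed lineage tool is not yet in place, the basic idea can be illustrated by fingerprinting each dataset version and recording which step turned one into the other. This is a minimal sketch, not how Purview stores lineage; the functions `dataset_fingerprint` and `lineage_record` and the record layout are assumptions.

```python
import hashlib
import json

def dataset_fingerprint(rows):
    """Deterministic SHA-256 over a canonical JSON rendering of the rows."""
    canonical = json.dumps(rows, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

def lineage_record(step, source_rows, output_rows, parent=None):
    """One lineage entry linking a step's input and output fingerprints."""
    return {
        "step": step,
        "input_sha256": dataset_fingerprint(source_rows),
        "output_sha256": dataset_fingerprint(output_rows),
        "parent": parent,
    }

# Hypothetical data before and after a deduplication step.
raw = [{"id": 1, "city": "Oslo"}, {"id": 1, "city": "Oslo"}, {"id": 2, "city": "Bergen"}]
deduped = [{"id": 1, "city": "Oslo"}, {"id": 2, "city": "Bergen"}]
entry = lineage_record("deduplication", raw, deduped)
```

Storing such entries alongside the model makes it possible to check later whether the dataset used for training still matches what is in storage.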

7. Decision Integrity and Security

Threats to training data (verified against the Microsoft Security whitepaper):

| Threat | Description | Mitigation |
| --- | --- | --- |
| Malicious data injection | Attackers introduce crafted inputs | Data resilience, decision integrity checks |
| Target leakage | The model "cheats" by using data from the future | Validate features, temporal consistency |
| Training data tampering | Modification of trusted training data | Access controls, immutable datasets |

Overtraining pitfalls:

  • Overfitting → the model memorizes the training data and fails on test data
  • Target leakage → abnormally high accuracy (95%+) is a strong hint that leakage is present
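The temporal consistency check named as a mitigation can be sketched as a comparison between when each feature value was recorded and when the prediction is made. The function `leaking_features`, the feature names, and the dates below are hypothetical.

```python
from datetime import date

def leaking_features(feature_dates, prediction_date):
    """Return names of features recorded after the prediction date."""
    return sorted(
        name for name, recorded in feature_dates.items() if recorded > prediction_date
    )

# Hypothetical features with the date each value became known.
feature_dates = {
    "income_last_year": date(2024, 12, 31),
    "num_prior_claims": date(2025, 3, 1),
    "claim_outcome_code": date(2025, 6, 15),  # known only after the decision
}
leaks = leaking_features(feature_dates, prediction_date=date(2025, 3, 10))
```

Any feature returned here would not be available at inference time, so a model trained on it would report accuracy it cannot reproduce in production.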

Architecture patterns

Pattern 1: Centralized Training Data Pipeline

Source Data (Production/External)
    ↓
Data Collection Store (localized)
    ↓
Exploratory Data Analysis (EDA)
    ↓
Preprocessing (quality, rescoping, deduplication, PII removal)
    ↓
Feature Store (versioned, immutable features)
    ↓
Training Data (train/validation/test split)
    ↓
Model Training
    ↓
Responsible AI Dashboard → Data Analysis
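The train/validation/test split step in this pipeline can be sketched as follows, using the 80/10/10 ratio this reference recommends elsewhere. The function name `split_80_10_10` and the fixed seed are assumptions for illustration.

```python
import random

def split_80_10_10(rows, seed=42):
    """Shuffle once, then cut into train/validation/test partitions."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    # Integer arithmetic avoids floating-point rounding in the cut points.
    n_train = len(shuffled) * 8 // 10
    n_val = len(shuffled) // 10
    return (
        shuffled[:n_train],
        shuffled[n_train:n_train + n_val],
        shuffled[n_train + n_val:],
    )

train, val, test = split_80_10_10(range(100))
```

A random split like this suits data without time dependence; for time-ordered data a chronological split is the safer choice, since shuffling can leak future observations into training.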

When to use:

  • Strong data governance is required
  • Compliance requirements (GDPR, public sector)
  • Several teams share the same datasets

Pattern 2: Segmented Data Pipeline

Use case: separate pipelines for data with distinct security requirements.

Geo Region A Data → Pipeline A → Model A
Geo Region B Data → Pipeline B → Model B
    ↓
(Optional) Federated Training → Combined Model

Requirements:

  • Access controls per segment
  • Same security rigor across all segments
  • Regulatory constraints (data residency)

Pattern 3: Continuous Data Quality Monitoring

Production Data → Real-time Ingestion
    ↓
Data Quality Checks (expectations, guardrails)
    ↓
[Pass] → Feature Store → Retraining
[Fail] → Alert → Manual Review
    ↓
Monitor for Data Drift / Concept Drift
    ↓
Trigger Retraining (condition-based or scheduled)

Platform:

  • Azure Machine Learning Model Monitoring → data drift, data quality signals
  • Databricks Expectations → inline quality checks
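The drift-monitoring step in this pattern can be illustrated with the Population Stability Index (PSI), one common way to compare a production distribution against a reference. This is a generic sketch, not the metric Azure ML Model Monitoring computes; the function `psi`, the bin edges, and the 0.2 threshold are assumptions.

```python
import math

def psi(reference, production, edges):
    """Population Stability Index of one numeric feature over fixed bins.
    `edges` must be sorted ascending; values >= an edge fall in the next bin."""
    def shares(values):
        counts = [0] * (len(edges) + 1)
        for v in values:
            counts[sum(v >= e for e in edges)] += 1
        # A small floor avoids log(0) for empty bins.
        return [max(c / len(values), 1e-6) for c in counts]
    ref, prod = shares(reference), shares(production)
    return sum((p - r) * math.log(p / r) for r, p in zip(ref, prod))

# Hypothetical feature values at training time and in production.
reference = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
same = psi(reference, reference, edges=[4, 7])
shifted = psi(reference, [8, 9, 9, 10, 10, 11, 12, 12, 13, 14], edges=[4, 7])
needs_review = shifted > 0.2  # assumed rule-of-thumb threshold
```

An identical distribution scores zero, and the score grows as production data moves away from the reference bins, which is what would feed the retraining trigger at the end of the pipeline.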

Pattern 4: Foundation Model Fine-Tuning Data Pipeline

Lower volume, higher quality requirements:

High-Quality Domain-Specific Examples
    ↓
Manual Curation / Expert Review
    ↓
Small Training Set (100-1000s examples)
    ↓
Fine-Tune Pre-Trained Model
    ↓
Validate on Hold-Out Test Set

Example: when fine-tuning GPT-4 for medical documentation, the training examples must accurately represent medical terminology and clinical reasoning.


Decision guidance

When to use a centralized vs. distributed feature store?

| Criterion | Centralized | Distributed |
| --- | --- | --- |
| Organization size | Large, standardized | Multiple autonomous teams |
| Governance maturity | High | Moderate |
| Feature overlap | High (many shared features) | Low (domain-specific) |
| Compliance | Strict centralized control | Team-level flexibility |

How often to retrain?

| Trigger | Frequency | Use case |
| --- | --- | --- |
| Scheduled | Daily/weekly | Routine maintenance, stable data |
| Trigger-based | On data drift detection | Dynamic environments, rapid change |
| Hybrid | Both | Fail-proof operations (scheduled + triggered) |
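The hybrid option can be expressed as a small policy function that retrains either on schedule or when drift crosses a threshold. The function `should_retrain` and its default values (7 days, drift score 0.2) are illustrative assumptions, not recommended settings.

```python
def should_retrain(days_since_training, drift_score, max_age_days=7, drift_threshold=0.2):
    """Hybrid policy: retrain on schedule OR when drift crosses a threshold."""
    if drift_score >= drift_threshold:
        return "triggered"
    if days_since_training >= max_age_days:
        return "scheduled"
    return None  # no retraining needed yet

decisions = [
    should_retrain(days_since_training=2, drift_score=0.35),  # drift detected early
    should_retrain(days_since_training=9, drift_score=0.05),  # model is simply old
    should_retrain(days_since_training=2, drift_score=0.05),  # fresh and stable
]
```

Returning the reason rather than a bare boolean makes it easier to audit why a given retraining run happened.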

How long to retain training data?

| Scenario | Retention policy | Rationale |
| --- | --- | --- |
| Data unchanged | Delete after training | Reduce storage costs, minimize risk |
| Model drift detected | Retain for comparison | Rebuild/retrain with historical data |
| Compliance | Follow RTBF (Right to Be Forgotten) | Remove personal data on request |
| Disaster recovery | Secondary pipeline with redundancy | Regenerate model exactly as before |

How to handle imbalanced data?

IF minority class < 10% THEN
    IF synthetic data acceptable THEN
        Apply SMOTE
    ELSE
        Oversample minority OR Undersample majority
    END IF
ELSE IF minority class is 10-30% THEN
    Use class weights in model training
ELSE
    Standard training (sufficient balance)
END IF
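The same rules written as a Python function, as a minimal sketch. The function name `balancing_strategy` and the returned labels are hypothetical; the thresholds are the ones from the rules above.

```python
def balancing_strategy(minority_share, synthetic_ok):
    """Map the minority-class share (0..1) to the strategy in the rules above."""
    if not 0 <= minority_share <= 1:
        raise ValueError("minority_share must be between 0 and 1")
    if minority_share < 0.10:
        # Severe imbalance: SMOTE if synthetic data is acceptable, else resample.
        return "smote" if synthetic_ok else "resample"
    if minority_share <= 0.30:
        return "class_weights"
    return "standard"

choices = [
    balancing_strategy(0.04, synthetic_ok=True),
    balancing_strategy(0.04, synthetic_ok=False),
    balancing_strategy(0.22, synthetic_ok=True),
    balancing_strategy(0.45, synthetic_ok=True),
]
```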

Integration with the Microsoft stack

Azure Machine Learning

| Component | Capability | Data quality support |
| --- | --- | --- |
| Responsible AI Dashboard | Data analysis, fairness, error analysis | Visualize distribution, identify bias |
| AutoML Data Guardrails | Class balancing, memory, frequency checks | Automated alerts |
| Model Monitoring | Data drift, data quality signals | Continuous monitoring |
| ML Datasets | Versioned, registered datasets | Lineage tracking |

Code Sample (Data Quality Signal):

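from azure.ai.ml.entities import (
    DataQualityMetricsCategorical,
    DataQualityMetricsNumerical,
    DataQualityMetricThreshold,
    DataQualitySignal,
)

# Thresholds: 1% null values for numerical features,
# 2% out-of-bounds values for categorical features.
metric_thresholds = DataQualityMetricThreshold(
    numerical=DataQualityMetricsNumerical(null_value_rate=0.01),
    categorical=DataQualityMetricsCategorical(out_of_bounds_rate=0.02)
)

# production_data and reference_data_training are assumed to be ProductionData
# and ReferenceData entities defined earlier in the monitoring setup.
data_quality_signal = DataQualitySignal(
    production_data=production_data,
    reference_data=reference_data_training,
    features=['feature_A', 'feature_B', 'feature_C'],
    metric_thresholds=metric_thresholds,
    alert_enabled=True
)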

Azure AI Foundry

  • Evaluation tools → assess data quality before training
  • Synthetic data generation → generate balanced datasets
  • Content Safety → filter harmful training data (protected material detection)

Azure Databricks

Expectations pattern:

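from pyspark import pipelines as dp

@dp.table
@dp.expect_all_or_fail({"valid_count": "count > 0"})
def customer_facing_data():
    # The update fails if any record violates the expectation; "prepared_data" is a placeholder source.
    return spark.read.table("prepared_data")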

Microsoft Purview

  • Data discovery and classification → automated tagging
  • Lineage tracking → full data provenance
  • Compliance policies → enforce GDPR/data residency

Azure DevOps

  • Version control for training datasets
  • CI/CD pipelines → automated data validation
  • Rollback → revert to previous dataset version if quality degrades

Public sector (Norway)

Specific requirements

| Requirement | Implementation | Microsoft support |
| --- | --- | --- |
| Accountability (etterrettelighet) | Full lineage tracking from source to model | Azure ML + Purview |
| Transparency | Responsible AI Scorecard (PDF for stakeholders) | Azure ML RAI dashboard |
| Fair treatment | Fairness assessment, class balancing | AutoML guardrails, RAI dashboard |
| Privacy (GDPR) | PII removal, RTBF compliance | Data preprocessing, anonymization |
| Data residency | Segmented pipelines per region | Norway East/West regions |

Example: NAV (the Norwegian Labour and Welfare Administration)

Scenario: prediction model for disability benefit (uføretrygd).

Data quality challenges:

  • Historical data may contain bias (underrepresentation of groups)
  • Personal data must be anonymized
  • The model must be transparent to auditors

Solution:

  1. EDA → identify underrepresentation (age, gender, region)
  2. Balancing → SMOTE for minority groups
  3. PII removal → anonymize national identity numbers (fødselsnummer) and addresses
  4. Fairness assessment → RAI dashboard → verify that accuracy is comparable across demographic groups
  5. Scorecard → generate a PDF for political stakeholders and auditors
  6. Lineage → Purview → document that all data was lawfully collected
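The fairness check in step 4 comes down to comparing a metric across groups. The sketch below computes accuracy per group and the largest gap between groups; it is an illustration only and not the RAI dashboard's fairness assessment. The function `accuracy_by_group`, the group labels, and the records are hypothetical.

```python
from collections import defaultdict

def accuracy_by_group(records):
    """Accuracy per group from (group, y_true, y_pred) triples."""
    hits, totals = defaultdict(int), defaultdict(int)
    for group, y_true, y_pred in records:
        totals[group] += 1
        hits[group] += int(y_true == y_pred)
    return {g: hits[g] / totals[g] for g in totals}

# Hypothetical evaluation results: (age group, actual outcome, predicted outcome).
records = [
    ("under_40", 1, 1), ("under_40", 0, 0), ("under_40", 1, 1), ("under_40", 0, 0),
    ("over_60", 1, 0), ("over_60", 0, 0), ("over_60", 1, 1), ("over_60", 1, 0),
]
per_group = accuracy_by_group(records)
gap = max(per_group.values()) - min(per_group.values())
```

A large gap points back to steps 1 and 2: the weaker group is likely underrepresented or labelled less reliably in the training data.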

Cost and licensing

Cost components

| Component | Cost driver | Estimate (NOK/month) |
| --- | --- | --- |
| Data storage (localized) | Azure Blob/ADLS Gen2 | 500-5000 (depends on volume) |
| Compute (EDA, preprocessing) | Databricks/Synapse Spark | 2000-20000 (depends on scale) |
| Feature store | Azure ML Feature Store | Included in Azure ML |
| Purview (lineage) | Data governance scanning | 3000-10000 (depends on data sources) |
| AutoML (guardrails) | Compute for training | 1000-10000 per experiment |

Optimization tips:

  • Use serverless Spark (pay-per-use) for EDA
  • Delete stale training data
  • Share feature stores across teams

License requirements

| Capability | License | Comment |
| --- | --- | --- |
| Azure Machine Learning | Azure subscription | Pay-as-you-go compute |
| Responsible AI Dashboard | Included in Azure ML | No additional cost |
| Microsoft Purview | Separate license | Data governance add-on |
| Databricks Expectations | Databricks license | Premium/Enterprise tier |
| Azure AI Foundry | Azure subscription | Separate compute charges |

For the architect (Cosmo)

Quick Decision Tree

START: The customer needs an AI model

1. Do they have existing training data?
   NO → Recommend a data collection strategy (proprietary vs. public vs. synthetic)
   YES → Continue to 2

2. Is the dataset balanced?
   NO → Recommend SMOTE/oversampling/synthetic data
   YES → Continue to 3

3. Have they done EDA?
   NO → Recommend Azure ML RAI Dashboard → Data Analysis
   YES → Continue to 4

4. Is there PII in the dataset?
   YES → CRITICAL: Recommend preprocessing (anonymization/removal)
   NO → Continue to 5

5. Do they need compliance/auditability?
   YES → Recommend Purview + RAI Scorecard
   NO → Continue to 6

6. Do they have data drift in production?
   YES → Recommend Model Monitoring with data quality signals
   NO → Basic training pipeline is sufficient; continue to 7

7. Is this foundation model fine-tuning?
   YES → Recommend a small, high-quality curated dataset
   NO → Standard training pipeline

Red Flags (raise immediately)

| Symptom | Problem | Solution |
| --- | --- | --- |
| "We have 95%+ accuracy on test" | Probable target leakage | Validate features, temporal consistency |
| "User data goes straight into training" | Malicious injection risk | Data validation, guardrails |
| "We deleted the bad examples" | Selection bias | Keep representative samples |
| "We train on all historical data" | Overfitting, stale data | Implement temporal windowing |
| "We don't have a test set" | Cannot validate generalization | 80/10/10 split (train/val/test) |

Cosmo's Talking Points

When the customer says: "We have lots of data, so quality isn't that important."

Answer: "It's the opposite: more data amplifies bias when the quality is poor. A model trained on 1M poor examples is worse than one trained on 10K good ones. Let's start with EDA to understand what you actually have."

When the customer says: "We can't delete personal data, it's important for the model."

Answer: "There are two questions: 1) Is it critical for predictive power, or can we anonymize it? 2) If it is critical, you need GDPR compliance (RTBF policy, consent management). I recommend Purview to track this."

When the customer says: "The model performs poorly for some groups."

Answer: "That is probably underrepresentation in the training data. Let's run a Fairness Assessment in the RAI Dashboard and see whether we need oversampling or more data."

Tool selection by scenario

| Scenario | Recommended tool | Alternative |
| --- | --- | --- |
| EDA | Azure ML Notebooks + RAI Dashboard | Databricks Notebooks |
| Data validation | AutoML Guardrails | Databricks Expectations |
| Lineage tracking | Purview | Manual documentation (not recommended) |
| Class balancing | SMOTE (Azure ML) | Synthetic data (AI Foundry) |
| PII removal | Custom preprocessing scripts | Azure Cognitive Services (PII detection) |
| Monitoring | Azure ML Model Monitoring | Custom dashboards (Grafana) |

Sources and verification

Verified (from Microsoft Learn MCP):

  1. Design training data for AI workloads on Azure https://learn.microsoft.com/en-us/azure/well-architected/ai/training-data-design Confidence: Verified → Covering data sources, preprocessing, feature stores, lineage, maintenance

  2. Understand your datasets (Responsible AI) https://learn.microsoft.com/en-us/azure/machine-learning/concept-data-analysis?view=azureml-api-2 Confidence: Verified → Data analysis component, cohorts, over/underrepresentation

  3. What is Responsible AI? https://learn.microsoft.com/en-us/azure/machine-learning/concept-responsible-ai?view=azureml-api-2 Confidence: Verified → Six principles, RAI dashboard components, transparency

  4. Responsible AI in Azure workloads https://learn.microsoft.com/en-us/azure/well-architected/ai/responsible-ai Confidence: Verified → User data handling, RTBF, explainability, privacy

  5. Prevent overfitting and imbalanced data with Automated ML https://learn.microsoft.com/en-us/azure/machine-learning/concept-manage-ml-pitfalls?view=azureml-api-2 Confidence: Verified → Overfitting, target leakage, class imbalance, SMOTE

  6. Securing AI and Machine Learning at Microsoft https://learn.microsoft.com/en-us/security/engineering/securing-artificial-intelligence-machine-learning Confidence: Verified → Malicious data injection, decision integrity, training data security

  7. Govern Azure platform services for AI https://learn.microsoft.com/en-us/azure/cloud-adoption-framework/scenarios/ai/platform/governance Confidence: Verified → Data discovery, classification, Purview, version control

  8. Model performance and fairness https://learn.microsoft.com/en-us/azure/machine-learning/concept-fairness-ml?view=azureml-api-2 Confidence: Verified → Parity constraints, mitigation algorithms (Fairlearn)

  9. Data featurization in AutoML https://learn.microsoft.com/en-us/azure/machine-learning/how-to-configure-auto-features?view=azureml-api-1 Confidence: Verified → Data guardrails (class balancing, memory, frequency detection)

  10. Azure Databricks Data Expectations https://learn.microsoft.com/en-us/azure/databricks/ldp/expectations Confidence: Verified → expect_all, expect_all_or_drop, expect_all_or_fail patterns

Baseline (model knowledge):

  • Feature store patterns (centralized/distributed/hybrid)
  • Decision trees for trigger-based vs. scheduled retraining
  • Norwegian public sector requirements (accountability, GDPR)

Code samples:

  • Azure ML Data Quality Signal (Python SDK) → Verified
  • Databricks Expectations decorator pattern → Verified

Last updated: 2026-04 | Next review: 2026-08 (or upon major Microsoft AI updates)