Kjell Tore Guttormsen 34c6db36fa docs(architect): weekly KB update — 52 files refreshed (2026-04)

Key content changes:
- MLOps: MLflow 3 scorers expanded (RetrievalRelevance, Fluency, multi-turn judges)
- MLflow 3 A/B eval: mirror_traffic GA confirmed, new scorer catalog
- CI/CD: OIDC auth replaces deprecated --sdk-auth (Azure ML GitHub Actions)
- Agent framework A2A: updated SDK patterns (A2ACardResolver, BearerAuth)
- AG-UI backend tool rendering: accurate TOOL_CALL_* event shapes
- Computer Use agents: US region requirement, credentials patterns
- Purview governance: bulk term edit, expire/delete workflows
- CAF AI Secure: 3-phase structure confirmed current
- Copilot Studio: Claude Sonnet 4.5/4.6 GA, new orchestration controls
- M365 manifest: v1.26 GA (April 2026), copilotAgents node
- Power Platform: agent flow capacity enforcement corrected
- Azure Monitor: Simple Log Alerts GA, AMBA for policy-based alerting
- Security Copilot: SCU capacity model (400 SCU/1000 users)
- EU Data Boundary: all EU + EFTA countries confirmed
- gateway-multi-backend: added 4th topology, subscription-level quota note

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

2026-04-10 11:31:11 +02:00


Feedback Loops and Continuous Improvement

Category: MLOps & GenAIOps Date: 2026-02-04 Last updated: 2026-04 Confidence: HIGH (based on official Microsoft documentation)

Verified: MCP 2026-04

Introduction

Feedback loops and continuous improvement are critical components of modern AI operations. Unlike traditional software, where functionality is deterministic, AI models can show quality drift or unexpected behavior when they meet real-world data. A well-functioning feedback system keeps models accurate, relevant, and safe throughout their lifecycle.

Key concept: Feedback loops connect production data, user insight, and performance metrics back to the development process, creating a continuous cycle of measurement, learning, and improvement.

Why this matters

Model decay: AI models degrade over time because of changes in data, usage patterns, or business context
Quality assurance: Automated and manual evaluation uncovers gaps between expected and actual performance
User value: Direct feedback from end users gives insight that technical metrics do not capture
Compliance: Regulatory requirements (AI Act, GDPR) demand traceability and continuous monitoring

Core components

1. Production Monitoring & Telemetry

Azure services:

Azure Monitor + Application Insights: Collects telemetry from endpoints and tracks latency, error rates, and token usage
Azure Machine Learning Model Monitoring: Automatic detection of data drift, prediction drift, and model performance degradation
MLflow Tracing: Detailed tracing of every inference interaction, including inputs, outputs, and intermediate steps

Key metrics:

Dimension	Metrics	Confidence
Operational	Request volume, latency (p50/p95), error rates, token usage	HIGH
Quality	Groundedness, relevance, coherence, safety pass rate	HIGH (GenAI)
User Feedback	Thumbs up/down, ratings, explicit reports	MEDIUM
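
As a rough illustration of the operational row in the table above, the sketch below derives request volume, p50/p95 latency, error rate, and token usage from a list of request records. The record fields (`latency_ms`, `status`, `tokens`) and the nearest-rank percentile method are assumptions made for this example; they are not the Application Insights schema or its aggregation logic.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile: the smallest value with at least pct% of the data at or below it."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def summarize_requests(requests):
    """Aggregate hypothetical request records into the operational metrics from the table."""
    latencies = [r["latency_ms"] for r in requests]
    server_errors = sum(1 for r in requests if r["status"] >= 500)
    return {
        "request_volume": len(requests),
        "latency_p50_ms": percentile(latencies, 50),
        "latency_p95_ms": percentile(latencies, 95),
        "error_rate": server_errors / len(requests),
        "total_tokens": sum(r["tokens"] for r in requests),
    }


# Hypothetical sample data, not real telemetry
sample = [
    {"latency_ms": 120, "status": 200, "tokens": 300},
    {"latency_ms": 180, "status": 200, "tokens": 450},
    {"latency_ms": 950, "status": 500, "tokens": 0},
    {"latency_ms": 140, "status": 200, "tokens": 320},
]
summary = summarize_requests(sample)
print(summary)
```

With so few samples the p95 simply lands on the slowest request; in practice the same aggregation would run over a full monitoring window.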

Code example: Logging user feedback (MLflow)

import mlflow
from mlflow.entities import AssessmentSource
import time

# Wait for trace to be ready
time.sleep(1)

# Extract span and trace IDs from response
# `response` is assumed to be the result of an earlier call to the serving endpoint (not shown)
response_dict = response.as_dict()
first_prediction = response_dict["predictions"][0]
first_result = first_prediction["results"][0]

span_id = first_result["span_id"]
trace_id = first_prediction["trace_id"]

# Log user feedback
mlflow.log_feedback(
    trace_id=trace_id,
    span_id=span_id,
    name="user_feedback",
    value=True,  # True for positive, False for negative
    source=AssessmentSource(source_type="HUMAN"),
    rationale="Answer was accurate and well-reasoned",
)

2. Data Collection & Evaluation Datasets

Process:

Production traces → Evaluation set: Use inference table logs to identify problematic interactions
Synthetic data generation: Generate a starter dataset before production data is available
Expert curation: SMEs validate and annotate edge cases and gold-standard answers
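
The first and third steps can be sketched as a small filter that turns flagged traces into a worklist for SMEs. The trace fields used here (`trace_id`, `request`, `user_feedback`, `judge_score`) and the 0.7 cut-off are hypothetical; real inference table columns will differ, and the resulting records are an intermediate curation format rather than something guaranteed to be accepted by an evaluation API as-is.

```python
def build_eval_worklist(traces, judge_threshold=0.7):
    """Pick traces with negative user feedback or a low judge score.

    Returns one record per distinct question, with `expectations` left empty
    so that an SME can fill in expected facts during curation.
    """
    worklist = []
    seen_questions = set()
    for trace in traces:
        negative_feedback = trace.get("user_feedback") is False
        low_score = trace.get("judge_score", 1.0) < judge_threshold
        question = trace["request"]
        if (negative_feedback or low_score) and question not in seen_questions:
            seen_questions.add(question)
            worklist.append({
                "inputs": {"question": question},
                "expectations": {},  # to be filled in by an SME
                "source_trace_id": trace["trace_id"],
            })
    return worklist


# Hypothetical traces, shaped for this example only
traces = [
    {"trace_id": "t1", "request": "What is MLflow?", "user_feedback": True, "judge_score": 0.9},
    {"trace_id": "t2", "request": "How do I track experiments?", "user_feedback": False, "judge_score": 0.8},
    {"trace_id": "t3", "request": "How do I deploy?", "user_feedback": None, "judge_score": 0.4},
    {"trace_id": "t4", "request": "How do I deploy?", "user_feedback": False, "judge_score": 0.3},
]
worklist = build_eval_worklist(traces)
print([record["source_trace_id"] for record in worklist])
```

Deduplicating on the question text keeps the worklist short; a production version would probably also keep a count of how often each question was flagged.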

Azure services:

MLflow Datasets: Versioned storage of eval datasets in Unity Catalog
Azure AI Foundry Agent Evaluation: Evaluation with LLM judges (correctness, relevance, groundedness, safety)
Databricks Review App: Collect feedback from domain experts on production traces

Best practices:

Include both expected and unexpected usage patterns in the eval set
Test for edge cases (long/short inputs, misspellings, prompt injection)
Combine expected_facts (flexible) with guidelines (tone, style, policy)

Code example: Evaluation with MLflow

import mlflow
from mlflow.genai.scorers import Correctness, RelevanceToQuery

# Define evaluation dataset
eval_data = [
    {
        "inputs": {"question": "What is MLflow?"},
        "expectations": {
            "expected_facts": ["open-source platform", "ML lifecycle management"]
        }
    },
    {
        "inputs": {"question": "How do I track experiments?"},
        "expectations": {
            "expected_facts": ["mlflow.start_run()", "log metrics", "log parameters"]
        }
    }
]

# Run evaluation
results = mlflow.genai.evaluate(
    data=eval_data,
    predict_fn=my_agent,
    scorers=[Correctness(), RelevanceToQuery()],
)

print(f"Correctness score: {results.metrics['correctness/mean']:.2f}")

3. Automated Retraining & Model Promotion

Strategies:

Strategy	When to use	Trade-offs
Online training	Daily/continuous updates with new data	High cost, requires robust automation
Offline training	Less frequent updates (weekly/monthly)	Lower cost, risk of model decay
Threshold-based	Retrain when performance falls below a threshold	Balances accuracy against energy use

Azure services:

Azure Machine Learning Pipelines: CI/CD for model training and deployment
Azure DevOps / GitHub Actions: Automated triggers on model registration
Azure Arc: Hybrid/multicloud deployment orchestration

Triggers for retraining:

Data drift: Statistical properties of the input data have changed (detected via monitoring)
Prediction drift: The output distribution deviates from the baseline
Performance degradation: Metrics (accuracy, F1-score) fall below a threshold
Manual trigger: Human-in-the-loop approval for critical models
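
To make the data drift trigger concrete, here is a minimal sketch of a Jensen-Shannon distance check between a baseline histogram and a production histogram of one feature. It assumes the distance is defined as the square root of the JS divergence with log base 2 (range 0 to 1); whether Azure ML uses exactly this convention is not stated in this document, so treat thresholds as illustrative. The function names are made up for the example.

```python
import math


def js_distance(baseline_counts, production_counts):
    """Jensen-Shannon distance (sqrt of the JS divergence, log base 2) between two histograms."""
    if len(baseline_counts) != len(production_counts):
        raise ValueError("histograms must use the same bins")
    p_total, q_total = sum(baseline_counts), sum(production_counts)
    p = [x / p_total for x in baseline_counts]
    q = [x / q_total for x in production_counts]
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(a, b):
        # Terms with a zero probability contribute nothing to the sum
        return sum(x * math.log2(x / y) for x, y in zip(a, b) if x > 0)

    divergence = 0.5 * kl(p, m) + 0.5 * kl(q, m)
    return math.sqrt(max(divergence, 0.0))


def drift_detected(baseline_counts, production_counts, threshold=0.01):
    """Return True when the distance exceeds the alert threshold."""
    return js_distance(baseline_counts, production_counts) > threshold


# Hypothetical bin counts for one numerical feature
baseline = [50, 30, 20]
production = [20, 30, 50]
distance = js_distance(baseline, production)
print(f"JS distance: {distance:.3f}, drift: {drift_detected(baseline, production)}")
```

A check like this could gate a retraining pipeline, with the manual trigger kept as an override for critical models.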

Code example: Model monitoring setup

from azure.ai.ml import MLClient
from azure.ai.ml.entities import (
    MonitorSchedule,
    RecurrenceTrigger,
    RecurrencePattern,
    MonitorDefinition,
    ServerlessSparkCompute,
    MonitoringTarget,
    AlertNotification,
    DataDriftSignal,
    DataDriftMetricThreshold,
    NumericalDriftMetrics,
)

# Setup monitoring for data drift
ml_client = MLClient(...)

spark_compute = ServerlessSparkCompute(
    instance_type="standard_e4s_v3",
    runtime_version="3.3"
)

monitoring_target = MonitoringTarget(
    ml_task="classification",
    endpoint_deployment_id="azureml:fraud-detection-endpoint:main"
)

# Define drift thresholds
metric_thresholds = DataDriftMetricThreshold(
    numerical=NumericalDriftMetrics(
        jensen_shannon_distance=0.01  # Alert when the Jensen-Shannon distance exceeds 0.01
    )
)

# training_data is assumed to be a ReferenceData object defined elsewhere (not shown)
data_drift_signal = DataDriftSignal(
    reference_data=training_data,
    metric_thresholds=metric_thresholds,
    alert_enabled=True
)

# Create monitoring schedule
monitor_definition = MonitorDefinition(
    compute=spark_compute,
    monitoring_target=monitoring_target,
    monitoring_signals={"data_drift": data_drift_signal},
    alert_notification=AlertNotification(emails=["ml-team@example.com"])
)

recurrence_trigger = RecurrenceTrigger(
    frequency="day",
    interval=1,
    schedule=RecurrencePattern(hours=3, minutes=0)
)

model_monitor = MonitorSchedule(
    name="fraud_detection_monitor",
    trigger=recurrence_trigger,
    create_monitor=monitor_definition
)

ml_client.schedules.begin_create_or_update(model_monitor)

4. Human-in-the-Loop (HITL) Workflows

Components:

Review App (Databricks): Thumbs up/down and textual feedback on agent responses
Expert labeling: SMEs annotate traces with expected outputs and policy violations
Approval gates: Human approval before deploying to prod (critical models)

Azure services:

Azure Logic Apps / Power Automate: Workflow automation for HITL review
AI Builder Feedback Loop: Automatic routing of low-confidence predictions to human review

Best practices:

Balance automation vs HITL: Review only low-confidence outputs (< 70% score)
Avoid reviewer fatigue: Sample strategically, not every interaction
Incorporate feedback quickly: Weekly review cycles, not monthly
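
The first two practices can be combined in a small routing function: keep only outputs below the confidence threshold, then draw a random sample so reviewers are not asked to look at everything. This is a sketch under assumptions; the `score` field, the 20% sample rate, and the fixed seed are illustrative choices, not values prescribed by any of the services above.

```python
import random


def select_for_review(outputs, score_threshold=0.7, sample_rate=0.2, seed=0):
    """Return a random sample of low-confidence outputs for human review."""
    low_confidence = [o for o in outputs if o["score"] < score_threshold]
    if not low_confidence:
        return []
    rng = random.Random(seed)  # seeded so the selection is reproducible in audits
    sample_size = max(1, round(len(low_confidence) * sample_rate))
    return rng.sample(low_confidence, sample_size)


# Hypothetical outputs: ten below the threshold, ten above it
outputs = [{"id": i, "score": 0.5} for i in range(10)]
outputs += [{"id": i, "score": 0.9} for i in range(10, 20)]
review_queue = select_for_review(outputs)
print([item["id"] for item in review_queue])
```

Uniform sampling is the simplest option; stratifying by topic or by score band may give reviewers a more informative mix.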

5. Continuous Improvement Cycle (MLflow for GenAI)

10-step cycle:

🚀 Production App: The deployed agent generates traces with inputs/outputs
👍 👎 User Feedback: Thumbs up/down on every interaction
🔍 Monitor & Score: LLM judges (correctness, safety, relevance) score automatically
⚠️ Identify Issues: The Trace UI shows patterns in low-scoring traces
👥 Domain Expert Review: A sample is sent to SMEs via the Review App
📋 Build Eval Dataset: Curate problematic + high-quality traces into an eval set
🎯 Tune Scorers: Use expert feedback to align LLM judges with human judgment
🧪 Evaluate New Versions: Test improvements against the eval set with the same scorers
📈 Compare Results: MLflow evaluation runs compare versions
✅ Deploy or Iterate: Deploy if quality improves without regression

Code example: Version comparison

import mlflow

# Evaluate v1
with mlflow.start_run(run_name="v1"):
    eval_results_v1 = mlflow.genai.evaluate(
        data=eval_dataset,
        predict_fn=generate_sales_email_v1,
        scorers=email_judges,
    )

# Evaluate v2
with mlflow.start_run(run_name="v2"):
    eval_results_v2 = mlflow.genai.evaluate(
        data=eval_dataset,
        predict_fn=generate_sales_email_v2,
        scorers=email_judges,  # Same judges for fairness
    )

# Compare results
run_v1_df = mlflow.search_runs(filter_string=f"run_id = '{eval_results_v1.run_id}'")
run_v2_df = mlflow.search_runs(filter_string=f"run_id = '{eval_results_v2.run_id}'")

metric_cols = [col for col in run_v1_df.columns
               if col.startswith('metrics.') and col.endswith('/mean')]

for metric in metric_cols:
    v1_score = run_v1_df[metric].iloc[0]
    v2_score = run_v2_df[metric].iloc[0]
    improvement = v2_score - v1_score
    print(f"{metric}: {v1_score:.3f} → {v2_score:.3f} ({improvement:+.3f})")
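
The final step of the cycle, deploying only when quality improves without regression, can be expressed as a simple gate over the per-metric means. The sketch below works on plain dictionaries so it does not depend on MLflow; the metric names, the 0.01 tolerance, and the `promotion_decision` helper are assumptions for illustration.

```python
def promotion_decision(baseline, candidate, tolerance=0.01):
    """Decide whether a candidate version may replace the baseline.

    Deploy only if at least one metric improves by more than `tolerance`,
    no metric drops by more than `tolerance`, and no baseline metric is missing.
    """
    missing = sorted(name for name in baseline if name not in candidate)
    shared = [name for name in baseline if name in candidate]
    regressions = sorted(n for n in shared if candidate[n] < baseline[n] - tolerance)
    improvements = sorted(n for n in shared if candidate[n] > baseline[n] + tolerance)
    deploy = bool(improvements) and not regressions and not missing
    return {
        "deploy": deploy,
        "improvements": improvements,
        "regressions": regressions,
        "missing": missing,
    }


# Hypothetical mean scores for three versions
v1 = {"correctness/mean": 0.70, "relevance_to_query/mean": 0.90}
v2 = {"correctness/mean": 0.80, "relevance_to_query/mean": 0.90}
v3 = {"correctness/mean": 0.85, "relevance_to_query/mean": 0.60}

decision_v2 = promotion_decision(v1, v2)
decision_v3 = promotion_decision(v1, v3)
print(decision_v2)
print(decision_v3)
```

Here v3 scores higher on correctness but is still rejected because relevance regressed, which is the behavior the "without regression" rule asks for.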

Architecture patterns

Pattern 1: Automated MLOps Loop (Classical ML)

┌─────────────────────────────────────────────────────────┐
│ Production Deployment (Managed Online Endpoint)         │
│   ├─ Data Collection (inference tables)                │
│   └─ Monitoring (Azure Monitor, drift detection)       │
└─────────────────────┬───────────────────────────────────┘
                      │ Drift detected / Threshold reached
                      ▼
┌─────────────────────────────────────────────────────────┐
│ CI/CD Pipeline (Azure Pipelines / GitHub Actions)      │
│   ├─ Pull production data                              │
│   ├─ Retrain model (Azure ML Compute)                  │
│   ├─ Evaluate (test set + validation metrics)          │
│   └─ Promote to staging (if quality gates pass)        │
└─────────────────────┬───────────────────────────────────┘
                      │ Human approval (HITL)
                      ▼
┌─────────────────────────────────────────────────────────┐
│ Staging Environment                                     │
│   ├─ A/B testing (champion vs challenger)              │
│   ├─ Responsible AI checks (bias, fairness)            │
│   └─ Final validation                                  │
└─────────────────────┬───────────────────────────────────┘
                      │ Deploy to prod
                      ▼
                  [Production]

When to use:

Tabular ML (classification, regression, forecasting)
Automated retraining is justified (cost-effective)
The model has clear performance metrics (accuracy, RMSE, F1)

Pattern 2: GenAI Feedback Loop (LLM Applications)

┌─────────────────────────────────────────────────────────┐
│ Production Agent (Model Serving Endpoint)              │
│   ├─ MLflow Tracing (span-level telemetry)             │
│   ├─ User feedback (thumbs up/down)                    │
│   └─ Inference tables (Unity Catalog)                  │
└─────────────────────┬───────────────────────────────────┘
                      │ Daily batch evaluation
                      ▼
┌─────────────────────────────────────────────────────────┐
│ Production Monitoring (Agent Evaluation)                │
│   ├─ LLM Judges (correctness, safety, relevance)       │
│   ├─ Sampling rate: 10-100% of traffic                 │
│   └─ Alerts on quality degradation                     │
└─────────────────────┬───────────────────────────────────┘
                      │ Export low-scoring traces
                      ▼
┌─────────────────────────────────────────────────────────┐
│ Evaluation Dataset Curation                             │
│   ├─ Filter by user feedback + LLM judge scores        │
│   ├─ SME review (Review App)                           │
│   └─ Add to versioned eval dataset (MLflow Datasets)   │
└─────────────────────┬───────────────────────────────────┘
                      │ Trigger improvement cycle
                      ▼
┌─────────────────────────────────────────────────────────┐
│ Agent Development (Inner Loop)                          │
│   ├─ Refine prompts / retrieval logic / tools          │
│   ├─ Run offline evaluation (eval dataset + scorers)   │
│   └─ Compare to baseline (MLflow tracking)             │
└─────────────────────┬───────────────────────────────────┘
                      │ Quality improved?
                      ▼
                  [Yes: Deploy]  [No: Iterate]

When to use:

Agentic RAG, chatbots, content generation
Quality is subjective (tone, style, policy compliance)
Frequent prompt/logic changes, not just model retraining

Pattern 3: Hybrid (CV/NLP with Human Annotation)

┌─────────────────────────────────────────────────────────┐
│ Production Model (Batch/Online Endpoint)                │
│   └─ Model performance monitoring (accuracy on new data)│
└─────────────────────┬───────────────────────────────────┘
                      │ Performance drops
                      ▼
┌─────────────────────────────────────────────────────────┐
│ Human-in-the-Loop Annotation                            │
│   ├─ Sample low-confidence predictions                 │
│   ├─ Annotators label new data (Azure ML Labeling)     │
│   └─ Quality review by SMEs                            │
└─────────────────────┬───────────────────────────────────┘
                      │ New labeled data
                      ▼
┌─────────────────────────────────────────────────────────┐
│ Model Development (Inner Loop)                          │
│   ├─ Update training set with new annotations          │
│   ├─ Retrain model (not automated)                     │
│   └─ Evaluate on test set + new edge cases             │
└─────────────────────┬───────────────────────────────────┘
                      │ Quality gates pass?
                      ▼
                  [Staging → Production]

When to use:

Computer vision (image classification, object detection)
NLP tasks (text classification, NER)
Automated retraining is not desirable (resource-intensive, requires human review)

Decision guidance

When to implement automated vs manual retraining?

Factor	Automated Retraining	Manual Retraining
Data volume	High (new data daily)	Low (weekly/monthly)
Model stability	High (proven architecture)	Low (experimental)
Cost tolerance	High (compute budget ok)	Low (cost-sensitive)
Regulatory	Low risk (non-critical)	High risk (health, finance)
Expertise	Available (MLOps team)	Limited (manual review required)

Rule of thumb:

Classical ML (tabular): Automate if data volume > 1000 new rows/day
GenAI (LLM): Manual iteration (prompt refinement) more often than retraining
CV/NLP: Hybrid (automated monitoring → manual annotation → triggered retraining)

When to use LLM judges vs human evaluation?

Scenario	LLM Judges	Human Evaluation
Factual correctness	✅ (with expected_facts)	✅ (gold standard)
Safety (toxicity, bias)	✅ (high recall)	✅ (final validation)
Style/tone compliance	✅ (guidelines judge)	✅ (subjective quality)
Edge cases	⚠️ (may miss nuance)	✅ (domain expertise)
Volume	✅ (scale to 100% traffic)	❌ (sample 1-10%)
Cost	Medium (LLM inference)	High (SME time)

Best practice:

Start with LLM judges for bulk evaluation (development + production monitoring)
Sample 10-20% of low-scoring traces for human review
Use human feedback to tune LLM judges (few-shot examples)
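
Before tuning a judge with human feedback it helps to know how far the judge and the reviewers actually disagree. One common way to quantify this, offered here as a suggestion rather than something the source prescribes, is raw agreement plus Cohen's kappa on paired pass/fail labels. The label lists below are invented for the example.

```python
def judge_agreement(judge_labels, human_labels):
    """Compare binary pass/fail labels from an LLM judge with human labels.

    Returns the observed agreement and Cohen's kappa, which corrects the
    agreement for what two independent raters would reach by chance.
    """
    if len(judge_labels) != len(human_labels) or not judge_labels:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(judge_labels)
    observed = sum(j == h for j, h in zip(judge_labels, human_labels)) / n
    judge_pass_rate = sum(judge_labels) / n
    human_pass_rate = sum(human_labels) / n
    expected = (judge_pass_rate * human_pass_rate
                + (1 - judge_pass_rate) * (1 - human_pass_rate))
    kappa = 1.0 if expected == 1 else (observed - expected) / (1 - expected)
    return {"observed_agreement": observed, "cohens_kappa": kappa}


# Hypothetical labels for eight reviewed traces (True = pass)
judge = [True, True, True, False, False, True, False, True]
human = [True, True, False, False, False, True, True, True]
agreement = judge_agreement(judge, human)
print(agreement)
```

The traces where the two disagree are natural candidates for few-shot examples when the judge prompt is revised.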

Integration with the Microsoft stack

Azure Machine Learning (Classical ML)

Feedback loop components:

Component	Azure service	Purpose
Data collection	Inference tables (managed endpoints)	Capture production inputs/outputs
Monitoring	Model Monitor (Azure ML)	Data drift, prediction drift, performance
Alerting	Azure Monitor Alerts	Email/webhook on threshold breach
Retraining	Azure ML Pipelines	Triggered retraining workflow
A/B testing	Staging endpoints	Champion vs challenger validation
Deployment	Managed Online Endpoints	Blue-green deployment

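Code example: Alert notification on data drift

# spark_compute, monitoring_target and data_drift_signal are defined as in the model monitoring setup example
from azure.ai.ml.entities import AlertNotification, MonitorDefinition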

alert_notification = AlertNotification(
    emails=['ml-team@example.com', 'data-science-lead@example.com']
)

monitor_definition = MonitorDefinition(
    compute=spark_compute,
    monitoring_target=monitoring_target,
    monitoring_signals={"data_drift": data_drift_signal},
    alert_notification=alert_notification  # Sends email when drift detected
)

Azure AI Foundry (GenAI)

Feedback loop components:

Component	Azure service	Purpose
Production tracing	MLflow Tracing (Databricks)	Span-level telemetry
User feedback	Review App	Thumbs up/down, textual feedback
LLM judges	Agent Evaluation	Automated quality scoring
Monitoring dashboard	Azure AI Foundry Observability	Quality trends, latency, errors
Eval datasets	MLflow Datasets (Unity Catalog)	Versioned test sets
Red teaming	AI Red Teaming Agent	Adversarial testing for safety

Code example: Production monitoring setup (GenAI)

from azure.ai.ml import Input, MLClient
from azure.ai.ml.constants import MonitorTargetTasks
from azure.ai.ml.entities import (
    AlertNotification,
    MonitorSchedule,
    CronTrigger,
    MonitorDefinition,
    ServerlessSparkCompute,
    MonitoringTarget,
    GenerationSafetyQualitySignal,
    GenerationSafetyQualityMonitoringMetricThreshold,
    LlmData,
    BaselineDataRange,
)

ml_client = MLClient(...)

# Define quality thresholds (70% passing rate)
quality_thresholds = GenerationSafetyQualityMonitoringMetricThreshold(
    groundedness={"aggregated_groundedness_pass_rate": 0.7},
    relevance={"aggregated_relevance_pass_rate": 0.7},
    coherence={"aggregated_coherence_pass_rate": 0.7},
    fluency={"aggregated_fluency_pass_rate": 0.7},
)

# Reference production data (app traces)
data_window = BaselineDataRange(lookback_window_size="P7D", lookback_window_offset="P0D")
production_data = LlmData(
    data_column_names={
        "prompt_column": "question",
        "completion_column": "answer",
        "context_column": "context"
    },
    input_data=Input(type="uri_folder", path="endpoint-deployment-app_traces:1"),
    data_window=data_window,
)

# Create quality signal
gsq_signal = GenerationSafetyQualitySignal(
    connection_id=f"/subscriptions/{sub_id}/resourceGroups/{rg}/providers/Microsoft.MachineLearningServices/workspaces/{workspace}/connections/{aoai_connection}",
    metric_thresholds=quality_thresholds,
    production_data=[production_data],
    sampling_rate=1.0,  # Evaluate 100% of traffic
)

# Schedule daily evaluation
monitor_definition = MonitorDefinition(
    compute=ServerlessSparkCompute(instance_type="standard_e4s_v3", runtime_version="3.3"),
    monitoring_target=MonitoringTarget(
        ml_task=MonitorTargetTasks.QUESTION_ANSWERING,
        endpoint_deployment_id=f"azureml:{endpoint_name}:{deployment_name}"
    ),
    monitoring_signals={"quality_signal": gsq_signal},
    alert_notification=AlertNotification(emails=["genai-team@example.com"])
)

trigger = CronTrigger(expression="15 10 * * *")  # Daily at 10:15 AM

model_monitor = MonitorSchedule(
    name="chatbot_quality_monitor",
    trigger=trigger,
    create_monitor=monitor_definition
)

ml_client.schedules.begin_create_or_update(model_monitor)

Power Platform AI (Citizen Developer Scenario)

Feedback loop components:

Component	Power Platform service	Purpose
Automated feedback collection	Power Automate	Route low-confidence predictions to human review
Storage	Dataverse / SharePoint	Store feedback data
Model improvement	AI Builder Feedback Loop	Automatically add reviewed samples to training set
Retraining	AI Builder	Manual/scheduled retraining

Example workflow (Power Automate):

Trigger: AI Builder prediction (e.g., document processing)
Condition: If confidence score < 0.7
Action: Save file + prediction output to AI Builder feedback loop storage
Notification: Send email to reviewer

Result: Reviewed documents are automatically available in the "Feedback loop" data source when the model is retrained.

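Public sector (Norway)

Regulatory requirements

EU AI Act + Norwegian implementation:

High-risk AI: Continuous monitoring and logging are mandatory (Article 12 on record-keeping, Article 72 on post-market monitoring)
Traceability: Automatic logs of inputs, outputs, decisions
Human oversight: HITL review for critical decisions (Article 14)
Retesting: Periodic evaluation against the original test set + new edge cases

Implementation in the Microsoft stack:

# Logging example for traceability (GDPR + AI Act)
import hashlib

import mlflow

# user_query, model_output, confidence_score, trace_id and send_to_human_review
# are assumed to be provided by the surrounding application (not shown)

# Log input/output + rationale (Article 12: Record-keeping)
# hashlib gives a stable digest; the built-in hash() is salted per process for strings
input_hash = hashlib.sha256(user_query.encode("utf-8")).hexdigest()
mlflow.log_param("input_hash", input_hash)  # Pseudonymized
mlflow.log_metric("confidence_score", confidence_score)
# log_text takes the text first, then the artifact file name
mlflow.log_text("Retrieved relevant documents from internal KB", "rationale.txt")

# Human review trigger (Article 14: Human oversight)
if confidence_score < 0.7:
    send_to_human_review(trace_id, user_query, model_output)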

Sustainability (green AI)

Retraining frequency vs CO₂ footprint:

Strategy	CO₂ impact	When to use
Daily retraining	HIGH	Financial markets, real-time fraud detection
Weekly retraining	MEDIUM	Customer support chatbots
Threshold-based	LOW	Retrain only when accuracy < 90%
Manual trigger	VERY LOW	Static domain (image classification)

Azure support:

Carbon-aware deployment: Deploy to low-carbon regions (Sweden Central, Norway East)
Model decay detection: Avoid unnecessary retraining via threshold-based triggers
Efficient inference: Azure ML Managed Online Endpoints with auto-scaling

Data handling (privacy)

GDPR compliance in feedback loops:

Right to explanation (Article 22): Trace logging must include model reasoning
Right to be forgotten (Article 17): Ability to delete user feedback data
Data minimization (Article 5): Log only the necessary fields (not the full user profile)

Implementation:

# Pseudonymization (supports GDPR data minimization)
import hashlib

import mlflow

# user_id is assumed to be a string supplied by the application
user_id_hash = hashlib.sha256(user_id.encode()).hexdigest()

mlflow.log_param("user_id_hash", user_id_hash)  # Logged
# The original user_id is NOT stored in MLflow
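
A plain SHA-256 of a user ID can often be reversed by hashing every known ID, so a keyed hash is a common hardening step. The sketch below uses HMAC-SHA256 from the standard library and also shows how a right-to-be-forgotten request might be served against stored feedback rows. The secret key handling, the row layout, and the helper names are assumptions for illustration; pseudonymized data generally still counts as personal data, so this is not a complete compliance solution.

```python
import hashlib
import hmac


def pseudonymize(user_id: str, secret_key: bytes) -> str:
    """Keyed pseudonym: stable for the same key, not reproducible without it."""
    return hmac.new(secret_key, user_id.encode("utf-8"), hashlib.sha256).hexdigest()


def erase_user_feedback(feedback_rows, user_id: str, secret_key: bytes):
    """Return the feedback rows with every row belonging to `user_id` removed."""
    token = pseudonymize(user_id, secret_key)
    return [row for row in feedback_rows if row["user_id_hash"] != token]


# Hypothetical key and data; a real key would come from a secret store such as a key vault
key = b"example-key-do-not-use-in-production"
rows = [
    {"user_id_hash": pseudonymize("alice", key), "feedback": True},
    {"user_id_hash": pseudonymize("bob", key), "feedback": False},
    {"user_id_hash": pseudonymize("alice", key), "feedback": False},
]
remaining = erase_user_feedback(rows, "alice", key)
print(len(rows), "->", len(remaining))
```

Because the pseudonym is deterministic for a given key, erasure requests can be matched without ever storing the original identifier next to the feedback.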

Cost and licensing

Compute costs (retraining)

Azure Machine Learning:

Scenario	Compute Type	Estimated cost (NOK/month)	Confidence
Daily retraining (tabular ML)	Standard_DS3_v2 (4 vCPU)	~15 000 - 25 000	HIGH
Weekly retraining (CV)	GPU (NC6s_v3)	~8 000 - 12 000	HIGH
Threshold-based (GenAI)	Minimal (only when triggered)	~2 000 - 5 000	MEDIUM

Databricks (GenAI Evaluation):

Scenario	Compute Type	Estimate (NOK/month)	Confidence
Daily LLM judge evaluation (10k traces)	Serverless Spark (standard_e4s_v3)	~10 000 - 15 000	MEDIUM
Human review (Review App)	Minimal (UI hosting)	~500 - 1 000	HIGH

Storage costs

Inference tables + eval datasets:

Azure Storage (Delta Lake): ~0.50 NOK/GB/month
MLflow Tracking: ~1-2 NOK per experiment run (metadata)

Estimate: 10 000 daily inferences → ~5 GB/month → ~2.50 NOK/month storage
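
The arithmetic behind the storage estimate can be reproduced with a few lines. The per-record size of about 16.7 KB is not given in the source; it is back-calculated here from the stated 5 GB for 10 000 daily inferences over a 30-day month, and the function name is made up for the example.

```python
def monthly_storage_estimate(daily_inferences, kb_per_inference,
                             nok_per_gb_month=0.50, days_per_month=30):
    """Return (GB written per month, NOK per month) for newly logged inference data."""
    gb_per_month = daily_inferences * days_per_month * kb_per_inference / 1_000_000  # decimal GB
    return gb_per_month, gb_per_month * nok_per_gb_month


# Assumed record size of ~16.7 KB, chosen to match the ~5 GB figure above
gb, cost = monthly_storage_estimate(10_000, kb_per_inference=16.7)
print(f"{gb:.2f} GB/month, {cost:.2f} NOK/month")
```

This only covers data written in a single month; if logs are retained, the stored volume and the cost grow with each additional month of history.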

Licenses

Microsoft Fabric + Azure ML:

Azure ML Enterprise: Included in the subscription, per-use compute pricing
Databricks (Unity Catalog): Premium tier (~$2-3 per DBU)

Power Platform:

License	AI Builder Credits/month	Feedback Loop Support
Per User	500	✅
Per App	Not included	❌ (requires Per User)
AI Builder add-on	Custom (purchase extra)	✅

For the architect (Cosmo)

When to recommend automated feedback loops?

✅ Yes, recommend:

Production model with > 1000 daily inferences
Clear performance metrics (accuracy, F1, RMSE)
Regulatory compliance requirements (AI Act, ISO 27001)
Business-critical application (customer-facing, revenue impact)

⚠️ Consider carefully:

Proof-of-concept or pilot (manual evaluation is enough)
Low inference volume (< 100/day)
Static domain (data rarely changes)
Limited MLOps resources (prioritize automation later)

Recommended questions for the customer

Volume: How many inferences per day are expected in production?
Criticality: What is the consequence of wrong predictions? (customer impact, revenue loss)
Data dynamics: How often does the input data change? (daily, weekly, seasonal)
Expertise: Does the team have MLOps skills, or is this a first AI project?
Budget: What is an acceptable monthly cost for monitoring + retraining?
Regulatory: Does AI Act / GDPR high-risk classification apply?

Red flags (anti-patterns)

❌ "We retrain every night without checking whether it is necessary" → Suggestion: Threshold-based retraining (saves compute + CO₂)

❌ "We have no monitoring, but we deploy new models every week" → Suggestion: Implement baseline monitoring before increasing deployment frequency

❌ "Users complain about poor quality, but we have no feedback mechanism" → Suggestion: Start with a simple thumbs up/down in the UI, logged to Application Insights

❌ "We evaluate only on the original test set, never on production data" → Suggestion: Export a sample of the inference tables to the eval dataset (to catch drift)

Success metrics for feedback loops

Metric	Target	Unit of measure
Mean time to detect (MTTD)	< 24 hours	Time from quality degradation to alert
Retraining cycle time	< 7 days	Time from drift detection to new model in prod
User feedback rate	> 5%	% of inferences where the user gives feedback
False positive rate (monitoring)	< 10%	% of alerts that require no action
Quality improvement per iteration	> 5%	Accuracy/F1 gain per retraining cycle
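
Three of the targets in the table can be checked directly from event timestamps and counters. The sketch below assumes incidents are recorded with `degraded_at` and `alerted_at` timestamps and that feedback and alert counts are available as plain numbers; the field names and sample values are hypothetical.

```python
from datetime import datetime, timedelta


def mean_time_to_detect(incidents):
    """Average delay between the start of a quality degradation and its alert."""
    if not incidents:
        raise ValueError("at least one incident is required")
    total = sum((i["alerted_at"] - i["degraded_at"] for i in incidents), timedelta())
    return total / len(incidents)


def rate(part, whole):
    """Fraction of `whole` represented by `part`; 0.0 when there is no data."""
    return part / whole if whole else 0.0


# Hypothetical incident log: detection delays of 6 and 18 hours
incidents = [
    {"degraded_at": datetime(2026, 4, 1, 8, 0), "alerted_at": datetime(2026, 4, 1, 14, 0)},
    {"degraded_at": datetime(2026, 4, 3, 0, 0), "alerted_at": datetime(2026, 4, 3, 18, 0)},
]
mttd = mean_time_to_detect(incidents)

report = {
    "mttd_ok": mttd < timedelta(hours=24),                  # target: < 24 hours
    "feedback_rate_ok": rate(620, 10_000) > 0.05,           # target: > 5% of inferences
    "false_positive_rate_ok": rate(1, 12) < 0.10,           # target: < 10% of alerts
}
print(mttd, report)
```

Retraining cycle time can be computed the same way as MTTD by swapping in drift-detection and deployment timestamps.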

Sources and verification

Primary sources (Microsoft Learn):

MLflow for GenAI Apps and Agents - Continuous Improvement Cycle (Verified MCP 2026-04 — updated 10-step cycle; new: Trace UI for pattern identification, evaluation harness, version/prompt management tracking)
Machine Learning Operations v2 - Monitoring & Feedback
Generative AI App Developer Workflow - Production Monitoring
Azure AI Foundry - Observability in Generative AI
MLOps and GenAIOps for AI Workloads - Model Maintenance
AI Builder - Continuously Improve Your Model (Feedback Loop)

Code samples:

MLflow feedback logging: Azure Databricks - Agent Framework
Model monitoring setup: Azure ML - Monitor Model Performance (Verified MCP 2026-04 — supports data quality, data drift, prediction drift, feature attribution drift, and custom signals; integrates with Azure Event Grid for alerting)
GenAI evaluation: MLflow 3.x - Evaluate App (Verified MCP 2026-04 — tutorial covers RAG email app evaluation; new scorers: RetrievalGroundedness, Guidelines, RelevanceToQuery, Safety; version comparison with mlflow.genai.evaluate())

Date of last verification: 2026-04-10

MCP calls: 6 (microsoft_docs_search: 3, microsoft_docs_fetch: 3, microsoft_code_sample_search: 2)

For Cosmo

This document covers the full feedback loop cycle for both classical ML and GenAI. Key points to highlight in consultation:

Not one-size-fits-all: Automated retraining does not suit everyone (see decision guidance)
Start simple: Thumbs up/down + basic monitoring before building a complex MLOps pipeline
GenAI ≠ Classical ML: GenAI requires LLM judges + human review, not just accuracy metrics
Compliance: The AI Act requires continuous monitoring for high-risk systems (not optional)
Cost: Threshold-based retraining can save 50-70% compute vs daily retraining

Use the architecture patterns to visualize the solution for the customer. Point out that MLflow Tracing + Agent Evaluation provide "free" observability (built into Databricks).

MLflow 3 Evaluation & Feedback Loop (Verified MCP 2026-04)

MLflow 3 introduces a unified evaluation-monitoring lifecycle for GenAI feedback loops:

Iterative workflow:

Trace production requests (MLflow Tracing — end-to-end observability)
Evaluate against scorers during development (mlflow.genai.evaluate())
Monitor production with same scorers (consistent quality measurement)
Gather human feedback via Review App (expert annotations)
Improve prompts/models based on evaluation datasets

Built-in LLM judges (scorers):

RetrievalGroundedness — checks if response is grounded in retrieved data
RelevanceToQuery — checks if response addresses the user request
Safety — checks for harmful/inappropriate content
Guidelines(name, guidelines) — custom policy/tone/style checks
Correctness — factual correctness with expected_facts

Azure ML Model Monitoring signals:

Data quality: null values, out-of-range, type mismatch
Data drift: statistical distribution changes between training and production data
Prediction drift: distribution shift in model outputs
Feature attribution drift: changes in feature importance
Custom signals: user-defined metrics via custom scripts
Integrates with Azure Event Grid for alerting on threshold breaches

Evaluation dataset workflow (new 2026-04):

Search production traces → select problematic + high-quality examples
Save to versioned eval dataset in Unity Catalog (mlflow.genai.datasets.create_dataset())
Run evaluation harness with mlflow.genai.evaluate(data=eval_dataset, predict_fn=..., scorers=...)
Compare runs in UI (Evaluation runs view) or SDK (mlflow.search_runs)
Identify regressions per-metric before promoting new versions

Continuous improvement cycle: Production traces → MLflow evaluation datasets → Scorer alignment → Prompt/model update → A/B test → Production rollout
