Feedback Loops and Continuous Improvement

Category: MLOps & GenAIOps | Date: 2026-02-04 | Last updated: 2026-04 | Confidence: HIGH (based on official Microsoft documentation)

Verified: MCP 2026-04

Introduction

Feedback loops and continuous improvement are critical components of modern AI operations. Unlike traditional software, where functionality is deterministic, AI models can show quality drift or unexpected behavior when they encounter real-world data. A well-functioning feedback system keeps models accurate, relevant, and safe throughout their lifecycle.

Key concept: Feedback loops connect production data, user insight, and performance metrics back to the development process, creating a continuous cycle of measurement, learning, and improvement.

Why this matters

  • Model decay: AI models degrade over time because of changes in data, usage patterns, or business context
  • Quality assurance: Automated and manual evaluation reveals gaps between expected and actual performance
  • User value: Direct feedback from end users provides insight that technical metrics do not capture
  • Compliance: Regulatory requirements (AI Act, GDPR) demand traceability and continuous monitoring

Core components

1. Production Monitoring & Telemetry

Azure services:

  • Azure Monitor + Application Insights: Collects telemetry from endpoints and tracks latency, error rates, and token usage
  • Azure Machine Learning Model Monitoring: Automatic detection of data drift, prediction drift, and model performance degradation
  • MLflow Tracing: Detailed tracing of every inference interaction, including inputs, outputs, and intermediate steps

Key metrics:

| Dimension | Metrics | Confidence |
| --- | --- | --- |
| Operational | Request volume, latency (p50/p95), error rates, token usage | HIGH |
| Quality | Groundedness, relevance, coherence, safety pass rate | HIGH (GenAI) |
| User Feedback | Thumbs up/down, ratings, explicit reports | MEDIUM |
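To make the operational row concrete, the sketch below derives request volume, p50/p95 latency, error rate, token usage, and a thumbs-up rate from a handful of request records. The record fields (latency_ms, status, tokens, thumbs_up) and the sample values are hypothetical; in practice these numbers would come from Application Insights or trace data rather than an in-memory list.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list (pct between 0 and 100)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def summarize_requests(requests):
    """Aggregate operational and user-feedback metrics for a batch of requests."""
    latencies = [r["latency_ms"] for r in requests]
    errors = sum(1 for r in requests if r["status"] >= 500)
    # Only requests where the user actually gave a rating count toward the rate
    rated = [r["thumbs_up"] for r in requests if r.get("thumbs_up") is not None]
    return {
        "request_volume": len(requests),
        "latency_p50_ms": percentile(latencies, 50),
        "latency_p95_ms": percentile(latencies, 95),
        "error_rate": errors / len(requests),
        "total_tokens": sum(r["tokens"] for r in requests),
        "thumbs_up_rate": (sum(rated) / len(rated)) if rated else None,
    }


# Hypothetical request log
sample_requests = [
    {"latency_ms": 120, "status": 200, "tokens": 350, "thumbs_up": True},
    {"latency_ms": 180, "status": 200, "tokens": 410, "thumbs_up": None},
    {"latency_ms": 950, "status": 500, "tokens": 0, "thumbs_up": False},
    {"latency_ms": 140, "status": 200, "tokens": 295, "thumbs_up": True},
]
summary = summarize_requests(sample_requests)
print(summary)
```

The nearest-rank percentile is used here because it always returns an observed latency; interpolating definitions give slightly different values on small samples.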

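Code example: Logging user feedback (MLflow)

import time

import mlflow
from mlflow.entities import AssessmentSource

# `response` is assumed to be the result of an earlier query against the
# serving endpoint; it is not defined in this snippet.

# Give the trace a moment to be written before attaching feedback
time.sleep(1)

# Extract span and trace IDs from response
response_dict = response.as_dict()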
first_prediction = response_dict["predictions"][0]
first_result = first_prediction["results"][0]

span_id = first_result["span_id"]
trace_id = first_prediction["trace_id"]

# Log user feedback
mlflow.log_feedback(
    trace_id=trace_id,
    span_id=span_id,
    name="user_feedback",
    value=True,  # True for positive, False for negative
    source=AssessmentSource(source_type="HUMAN"),
    rationale="Answer was accurate and well-reasoned",
)

2. Data Collection & Evaluation Datasets

Process:

  1. Production traces → Evaluation set: Use inference table logs to identify problematic interactions
  2. Synthetic data generation: Generate a starter dataset before production data is available
  3. Expert curation: SMEs validate and annotate edge cases and gold-standard answers

Azure services:

  • MLflow Datasets: Versioned storage of evaluation datasets in Unity Catalog
  • Azure AI Foundry Agent Evaluation: Evaluation with LLM judges (correctness, relevance, groundedness, safety)
  • Databricks Review App: Collect feedback from domain experts on production traces

Best practices:

  • Include both expected and unexpected usage patterns in the evaluation set
  • Test for edge cases (long/short inputs, misspellings, prompt injection)
  • Combine expected_facts (flexible) with guidelines (tone, style, policy)

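Code example: Evaluation with MLflow

import mlflow
from mlflow.genai.scorers import Correctness, RelevanceToQuery

# my_agent is assumed to be defined elsewhere: a callable that accepts the
# keys of "inputs" as keyword arguments (here: question) and returns the answer.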

# Define evaluation dataset
eval_data = [
    {
        "inputs": {"question": "What is MLflow?"},
        "expectations": {
            "expected_facts": ["open-source platform", "ML lifecycle management"]
        }
    },
    {
        "inputs": {"question": "How do I track experiments?"},
        "expectations": {
            "expected_facts": ["mlflow.start_run()", "log metrics", "log parameters"]
        }
    }
]

# Run evaluation
results = mlflow.genai.evaluate(
    data=eval_data,
    predict_fn=my_agent,
    scorers=[Correctness(), RelevanceToQuery()],
)

print(f"Correctness score: {results.metrics['correctness/mean']:.2f}")

3. Automated Retraining & Model Promotion

Strategies:

| Strategy | When to use | Trade-offs |
| --- | --- | --- |
| Online training | Daily/continuous updates with new data | High cost, requires robust automation |
| Offline training | Less frequent updates (weekly/monthly) | Lower cost, risk of model decay |
| Threshold-based | Retrain when performance falls below a threshold | Balances precision against energy use |

Azure services:

  • Azure Machine Learning Pipelines: CI/CD for model training and deployment
  • Azure DevOps / GitHub Actions: Automated triggers on model registration
  • Azure Arc: Hybrid/multicloud deployment orchestration

Triggers for retraining:

  • Data drift: Statistical properties of the input data have changed (detected via monitoring)
  • Prediction drift: The output distribution deviates from the baseline
  • Performance degradation: Metrics (accuracy, F1-score) fall below a threshold
  • Manual trigger: Human-in-the-loop approval for critical models
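The four triggers can be combined into one decision function. The sketch below is illustrative only: the MonitoringSnapshot fields and the three threshold constants are hypothetical placeholders, not values recommended by Azure ML, and a real pipeline would read them from the monitoring output.

```python
from dataclasses import dataclass


@dataclass
class MonitoringSnapshot:
    data_drift: float          # drift score for the input features
    prediction_drift: float    # drift score for the model outputs
    f1_score: float            # latest measured F1 on labeled production data
    manual_request: bool = False


# Hypothetical thresholds; tune per model and per metric
DATA_DRIFT_LIMIT = 0.10
PREDICTION_DRIFT_LIMIT = 0.10
MIN_F1 = 0.80


def retraining_reasons(snapshot):
    """Return the list of triggers that fired; an empty list means no retraining."""
    reasons = []
    if snapshot.data_drift > DATA_DRIFT_LIMIT:
        reasons.append("data_drift")
    if snapshot.prediction_drift > PREDICTION_DRIFT_LIMIT:
        reasons.append("prediction_drift")
    if snapshot.f1_score < MIN_F1:
        reasons.append("performance_degradation")
    if snapshot.manual_request:
        reasons.append("manual_trigger")
    return reasons


healthy = MonitoringSnapshot(data_drift=0.02, prediction_drift=0.03, f1_score=0.91)
degraded = MonitoringSnapshot(data_drift=0.25, prediction_drift=0.03, f1_score=0.75)
print(retraining_reasons(healthy))
print(retraining_reasons(degraded))
```

Returning the reasons instead of a bare boolean keeps the audit trail: the pipeline run can record why it was started.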

Code example: Model monitoring setup

from azure.ai.ml import MLClient
from azure.ai.ml.entities import (
    MonitorSchedule,
    RecurrenceTrigger,
    RecurrencePattern,  # used by the recurrence trigger below
    MonitorDefinition,
    ServerlessSparkCompute,
    MonitoringTarget,
    AlertNotification,
    DataDriftSignal,
    DataDriftMetricThreshold,
    NumericalDriftMetrics,
)

# Setup monitoring for data drift
ml_client = MLClient(...)

spark_compute = ServerlessSparkCompute(
    instance_type="standard_e4s_v3",
    runtime_version="3.3"
)

monitoring_target = MonitoringTarget(
    ml_task="classification",
    endpoint_deployment_id="azureml:fraud-detection-endpoint:main"
)

# Define drift thresholds
metric_thresholds = DataDriftMetricThreshold(
    numerical=NumericalDriftMetrics(
        jensen_shannon_distance=0.01  # Alert when the distance exceeds 0.01
    )
)

# training_data is assumed to be defined elsewhere (the reference dataset)
data_drift_signal = DataDriftSignal(
    reference_data=training_data,
    metric_thresholds=metric_thresholds,
    alert_enabled=True
)

# Create monitoring schedule
monitor_definition = MonitorDefinition(
    compute=spark_compute,
    monitoring_target=monitoring_target,
    monitoring_signals={"data_drift": data_drift_signal},
    alert_notification=AlertNotification(emails=["ml-team@example.com"])
)

recurrence_trigger = RecurrenceTrigger(
    frequency="day",
    interval=1,
    schedule=RecurrencePattern(hours=3, minutes=0)
)

model_monitor = MonitorSchedule(
    name="fraud_detection_monitor",
    trigger=recurrence_trigger,
    create_monitor=monitor_definition
)

ml_client.schedules.begin_create_or_update(model_monitor)
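To give a feel for what a jensen_shannon_distance threshold measures, the following sketch computes the textbook Jensen-Shannon distance between two binned distributions. It is not Azure ML's implementation: the binning, the base-2 logarithm, and the sample histograms are assumptions made for illustration.

```python
import math


def _kl_divergence(p, q):
    """Kullback-Leibler divergence in bits; terms with p_i == 0 contribute 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def jensen_shannon_distance(reference_counts, production_counts):
    """Square root of the Jensen-Shannon divergence between two histograms.

    Both inputs are bin counts over the same bins. With base-2 logs the
    result lies between 0 (identical) and 1 (no overlap).
    """
    ref_total = sum(reference_counts)
    prod_total = sum(production_counts)
    p = [c / ref_total for c in reference_counts]
    q = [c / prod_total for c in production_counts]
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    divergence = 0.5 * _kl_divergence(p, m) + 0.5 * _kl_divergence(q, m)
    return math.sqrt(max(divergence, 0.0))


# Hypothetical bin counts for one numerical feature
reference = [50, 30, 20]
production = [20, 30, 50]

same = jensen_shannon_distance(reference, reference)
shifted = jensen_shannon_distance(reference, production)
disjoint = jensen_shannon_distance([10, 0], [0, 10])
print(same, shifted, disjoint)
```

Because the distance is bounded, a threshold such as 0.01 is quite sensitive; the shifted histogram above lands well beyond it.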

4. Human-in-the-Loop (HITL) Workflows

Components:

  • Review App (Databricks): Thumbs up/down and textual feedback on agent responses
  • Expert labeling: SMEs annotate traces with expected outputs and policy violations
  • Approval gates: Human approval before deploying to production (critical models)

Azure services:

  • Azure Logic Apps / Power Automate: Workflow automation for HITL review
  • AI Builder Feedback Loop: Automatic routing of low-confidence predictions to human review

Best practices:

  • Balance automation against HITL: Review only low-confidence outputs (score < 70%)
  • Avoid reviewer fatigue: Sample strategically rather than reviewing every interaction
  • Incorporate feedback quickly: Weekly review cycles, not monthly
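A minimal sketch of the first two practices: only predictions below the 70% confidence threshold are eligible for review, and a cap on the number of reviews limits reviewer load. The record shape, the max_reviews cap, and the fixed random seed are hypothetical choices for illustration.

```python
import random

REVIEW_THRESHOLD = 0.70  # low-confidence cutoff from the best practice above


def select_for_review(predictions, max_reviews, seed=0):
    """Pick low-confidence predictions for human review, capped at max_reviews."""
    low_confidence = [p for p in predictions if p["confidence"] < REVIEW_THRESHOLD]
    if len(low_confidence) <= max_reviews:
        return low_confidence
    # Random sampling avoids always reviewing the same kind of case
    rng = random.Random(seed)
    return rng.sample(low_confidence, max_reviews)


# Hypothetical predictions
predictions = [
    {"id": "p1", "confidence": 0.95},
    {"id": "p2", "confidence": 0.62},
    {"id": "p3", "confidence": 0.41},
    {"id": "p4", "confidence": 0.88},
    {"id": "p5", "confidence": 0.69},
    {"id": "p6", "confidence": 0.15},
]
review_queue = select_for_review(predictions, max_reviews=2)
print([p["id"] for p in review_queue])
```

A seeded generator makes the selection reproducible, which helps when a review batch has to be audited later.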

5. Continuous Improvement Cycle (MLflow for GenAI)

10-step cycle:

  1. 🚀 Production App: The deployed agent generates traces with inputs/outputs
  2. 👍 👎 User Feedback: Thumbs up/down on every interaction
  3. 🔍 Monitor & Score: LLM judges (correctness, safety, relevance) score automatically
  4. ⚠️ Identify Issues: The Trace UI shows patterns in low-scoring traces
  5. 👥 Domain Expert Review: A sample is sent to SMEs via the Review App
  6. 📋 Build Eval Dataset: Curate problematic and high-quality traces into an evaluation set
  7. 🎯 Tune Scorers: Use expert feedback to align LLM judges with human judgment
  8. 🧪 Evaluate New Versions: Test improvements against the evaluation set with the same scorers
  9. 📈 Compare Results: MLflow evaluation runs compare versions
  10. Deploy or Iterate: Deploy if quality improves without regression

Code example: Version comparison

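import mlflow

# eval_dataset, email_judges, and the two generate_sales_email_* functions
# are assumed to be defined elsewhere.

# Evaluate v1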
with mlflow.start_run(run_name="v1"):
    eval_results_v1 = mlflow.genai.evaluate(
        data=eval_dataset,
        predict_fn=generate_sales_email_v1,
        scorers=email_judges,
    )

# Evaluate v2
with mlflow.start_run(run_name="v2"):
    eval_results_v2 = mlflow.genai.evaluate(
        data=eval_dataset,
        predict_fn=generate_sales_email_v2,
        scorers=email_judges,  # Same judges for fairness
    )

# Compare results
run_v1_df = mlflow.search_runs(filter_string=f"run_id = '{eval_results_v1.run_id}'")
run_v2_df = mlflow.search_runs(filter_string=f"run_id = '{eval_results_v2.run_id}'")

metric_cols = [col for col in run_v1_df.columns
               if col.startswith('metrics.') and col.endswith('/mean')]

for metric in metric_cols:
    v1_score = run_v1_df[metric].iloc[0]
    v2_score = run_v2_df[metric].iloc[0]
    improvement = v2_score - v1_score
    print(f"{metric}: {v1_score:.3f} → {v2_score:.3f} ({improvement:+.3f})")
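The final step of the cycle, deploying only if quality improves without regression, can be expressed as a small gate over the per-metric means. This is a sketch under assumptions: the metric names, the sample scores, and the tolerance parameter are hypothetical, and the function works on plain dictionaries rather than on MLflow result objects.

```python
def promotion_decision(baseline, candidate, tolerance=0.0):
    """Compare per-metric means; deploy only with a gain and no regression.

    A metric counts as regressed when the candidate falls more than
    `tolerance` below the baseline.
    """
    shared = sorted(set(baseline) & set(candidate))
    regressions = {
        m: candidate[m] - baseline[m]
        for m in shared
        if candidate[m] < baseline[m] - tolerance
    }
    improvements = {
        m: candidate[m] - baseline[m]
        for m in shared
        if candidate[m] > baseline[m]
    }
    return {
        "deploy": bool(improvements) and not regressions,
        "regressions": regressions,
        "improvements": improvements,
    }


# Hypothetical scores for two versions
v1_metrics = {"correctness/mean": 0.80, "safety/mean": 0.99}
v2_metrics = {"correctness/mean": 0.86, "safety/mean": 0.99}
v3_metrics = {"correctness/mean": 0.90, "safety/mean": 0.95}

accept = promotion_decision(v1_metrics, v2_metrics)
reject = promotion_decision(v1_metrics, v3_metrics)
print(accept["deploy"], reject["deploy"])
```

The third version shows why averaging across metrics is risky: its correctness gain would hide the drop in safety.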

Architecture patterns

Pattern 1: Automated MLOps Loop (Classical ML)

┌─────────────────────────────────────────────────────────┐
│ Production Deployment (Managed Online Endpoint)         │
│   ├─ Data Collection (inference tables)                │
│   └─ Monitoring (Azure Monitor, drift detection)       │
└─────────────────────┬───────────────────────────────────┘
                      │ Drift detected / Threshold reached
                      ▼
┌─────────────────────────────────────────────────────────┐
│ CI/CD Pipeline (Azure Pipelines / GitHub Actions)      │
│   ├─ Pull production data                              │
│   ├─ Retrain model (Azure ML Compute)                  │
│   ├─ Evaluate (test set + validation metrics)          │
│   └─ Promote to staging (if quality gates pass)        │
└─────────────────────┬───────────────────────────────────┘
                      │ Human approval (HITL)
                      ▼
┌─────────────────────────────────────────────────────────┐
│ Staging Environment                                     │
│   ├─ A/B testing (champion vs challenger)              │
│   ├─ Responsible AI checks (bias, fairness)            │
│   └─ Final validation                                  │
└─────────────────────┬───────────────────────────────────┘
                      │ Deploy to prod
                      ▼
                  [Production]

When to use:

  • Tabular ML (classification, regression, forecasting)
  • Automated retraining is justified (cost-effective)
  • The model has clear performance metrics (accuracy, RMSE, F1)

Pattern 2: GenAI Feedback Loop (LLM Applications)

┌─────────────────────────────────────────────────────────┐
│ Production Agent (Model Serving Endpoint)              │
│   ├─ MLflow Tracing (span-level telemetry)             │
│   ├─ User feedback (thumbs up/down)                    │
│   └─ Inference tables (Unity Catalog)                  │
└─────────────────────┬───────────────────────────────────┘
                      │ Daily batch evaluation
                      ▼
┌─────────────────────────────────────────────────────────┐
│ Production Monitoring (Agent Evaluation)                │
│   ├─ LLM Judges (correctness, safety, relevance)       │
│   ├─ Sampling rate: 10-100% of traffic                 │
│   └─ Alerts on quality degradation                     │
└─────────────────────┬───────────────────────────────────┘
                      │ Export low-scoring traces
                      ▼
┌─────────────────────────────────────────────────────────┐
│ Evaluation Dataset Curation                             │
│   ├─ Filter by user feedback + LLM judge scores        │
│   ├─ SME review (Review App)                           │
│   └─ Add to versioned eval dataset (MLflow Datasets)   │
└─────────────────────┬───────────────────────────────────┘
                      │ Trigger improvement cycle
                      ▼
┌─────────────────────────────────────────────────────────┐
│ Agent Development (Inner Loop)                          │
│   ├─ Refine prompts / retrieval logic / tools          │
│   ├─ Run offline evaluation (eval dataset + scorers)   │
│   └─ Compare to baseline (MLflow tracking)             │
└─────────────────────┬───────────────────────────────────┘
                      │ Quality improved?
                      ▼
                  [Yes: Deploy]  [No: Iterate]

When to use:

  • Agentic RAG, chatbots, content generation
  • Quality is subjective (tone, style, policy compliance)
  • Frequent prompt/logic changes, not just model retraining

Pattern 3: Hybrid (CV/NLP med Human Annotation)

┌─────────────────────────────────────────────────────────┐
│ Production Model (Batch/Online Endpoint)                │
│   └─ Model performance monitoring (accuracy on new data)│
└─────────────────────┬───────────────────────────────────┘
                      │ Performance drops
                      ▼
┌─────────────────────────────────────────────────────────┐
│ Human-in-the-Loop Annotation                            │
│   ├─ Sample low-confidence predictions                 │
│   ├─ Annotators label new data (Azure ML Labeling)     │
│   └─ Quality review by SMEs                            │
└─────────────────────┬───────────────────────────────────┘
                      │ New labeled data
                      ▼
┌─────────────────────────────────────────────────────────┐
│ Model Development (Inner Loop)                          │
│   ├─ Update training set with new annotations          │
│   ├─ Retrain model (not automated)                     │
│   └─ Evaluate on test set + new edge cases             │
└─────────────────────┬───────────────────────────────────┘
                      │ Quality gates pass?
                      ▼
                  [Staging → Production]

When to use:

  • Computer vision (image classification, object detection)
  • NLP tasks (text classification, NER)
  • Automated retraining is not desirable (resource-intensive, requires human review)

Decision guidance

When to implement automated vs manual retraining?

| Factor | Automated retraining | Manual retraining |
| --- | --- | --- |
| Data volume | High (new data daily) | Low (weekly/monthly) |
| Model stability | High (proven architecture) | Low (experimental) |
| Cost tolerance | High (compute budget available) | Low (cost-sensitive) |
| Regulatory | Low risk (non-critical) | High risk (health, finance) |
| Expertise | Available (MLOps team) | Limited (manual review required) |

Rule of thumb:

  • Classical ML (tabular): Automate if data volume exceeds 1,000 new rows/day
  • GenAI (LLM): Manual iteration (prompt refinement) more often than retraining
  • CV/NLP: Hybrid (automated monitoring → manual annotation → triggered retraining)

When to use LLM judges vs human evaluation?

| Scenario | LLM judges | Human evaluation |
| --- | --- | --- |
| Factual correctness | (with expected_facts) | (gold standard) |
| Safety (toxicity, bias) | (high recall) | (final validation) |
| Style/tone compliance | (guidelines judge) | (subjective quality) |
| Edge cases | ⚠️ (may miss nuance) | (domain expertise) |
| Volume | (scale to 100% traffic) | (sample 1-10%) |
| Cost | Medium (LLM inference) | High (SME time) |

Best practice:

  1. Start with LLM judges for bulk evaluation (development + production monitoring)
  2. Sample 10-20% of low-scoring traces for human review
  3. Use human feedback to tune the LLM judges (few-shot examples)
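Before tuning a judge it helps to know how often it already agrees with the human reviewers. The sketch below computes a simple agreement rate and lists the disagreeing traces, which are natural candidates for few-shot examples. The tuple layout and the sample verdicts are hypothetical.

```python
def judge_agreement(pairs):
    """Compare LLM-judge and human verdicts on the same traces.

    pairs: list of (trace_id, judge_pass, human_pass) tuples.
    Returns (agreement_rate, ids_of_disagreeing_traces).
    """
    if not pairs:
        raise ValueError("at least one reviewed trace is required")
    disagreements = [trace_id for trace_id, judge, human in pairs if judge != human]
    agreement_rate = 1 - len(disagreements) / len(pairs)
    return agreement_rate, disagreements


# Hypothetical verdicts from a review batch
reviewed = [
    ("t1", True, True),
    ("t2", False, False),
    ("t3", True, False),  # judge passed a response the expert rejected
    ("t4", False, False),
]
rate, few_shot_candidates = judge_agreement(reviewed)
print(rate, few_shot_candidates)
```

Raw agreement is easy to read but can look high when almost every trace passes; a chance-corrected statistic such as Cohen's kappa is a common next step.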

Integration with the Microsoft stack

Azure Machine Learning (Classical ML)

Feedback loop components:

| Component | Azure service | Purpose |
| --- | --- | --- |
| Data collection | Inference tables (managed endpoints) | Capture production inputs/outputs |
| Monitoring | Model Monitor (Azure ML) | Data drift, prediction drift, performance |
| Alerting | Azure Monitor Alerts | Email/webhook on threshold breach |
| Retraining | Azure ML Pipelines | Triggered retraining workflow |
| A/B testing | Staging endpoints | Champion vs challenger validation |
| Deployment | Managed Online Endpoints | Blue-green deployment |

Code example: Alert notification on data drift

# spark_compute, monitoring_target, and data_drift_signal are assumed to be
# defined as in the model monitoring setup example.
from azure.ai.ml.entities import AlertNotification, MonitorDefinition

alert_notification = AlertNotification(
    emails=['ml-team@example.com', 'data-science-lead@example.com']
)

monitor_definition = MonitorDefinition(
    compute=spark_compute,
    monitoring_target=monitoring_target,
    monitoring_signals={"data_drift": data_drift_signal},
    alert_notification=alert_notification  # Sends email when drift detected
)

Azure AI Foundry (GenAI)

Feedback loop components:

| Component | Azure service | Purpose |
| --- | --- | --- |
| Production tracing | MLflow Tracing (Databricks) | Span-level telemetry |
| User feedback | Review App | Thumbs up/down, textual feedback |
| LLM judges | Agent Evaluation | Automated quality scoring |
| Monitoring dashboard | Azure AI Foundry Observability | Quality trends, latency, errors |
| Eval datasets | MLflow Datasets (Unity Catalog) | Versioned test sets |
| Red teaming | AI Red Teaming Agent | Adversarial testing for safety |

Code example: Production monitoring setup (GenAI)

from azure.ai.ml import Input, MLClient
from azure.ai.ml.constants import MonitorTargetTasks
from azure.ai.ml.entities import (
    AlertNotification,
    MonitorSchedule,
    CronTrigger,
    MonitorDefinition,
    ServerlessSparkCompute,
    MonitoringTarget,
    GenerationSafetyQualitySignal,
    GenerationSafetyQualityMonitoringMetricThreshold,
    LlmData,
    BaselineDataRange,
)

# sub_id, rg, workspace, aoai_connection, endpoint_name, and deployment_name
# are assumed to be defined elsewhere.

ml_client = MLClient(...)

# Define quality thresholds (70% passing rate)
quality_thresholds = GenerationSafetyQualityMonitoringMetricThreshold(
    groundedness={"aggregated_groundedness_pass_rate": 0.7},
    relevance={"aggregated_relevance_pass_rate": 0.7},
    coherence={"aggregated_coherence_pass_rate": 0.7},
    fluency={"aggregated_fluency_pass_rate": 0.7},
)

# Reference production data (app traces)
data_window = BaselineDataRange(lookback_window_size="P7D", lookback_window_offset="P0D")
production_data = LlmData(
    data_column_names={
        "prompt_column": "question",
        "completion_column": "answer",
        "context_column": "context"
    },
    input_data=Input(type="uri_folder", path="endpoint-deployment-app_traces:1"),
    data_window=data_window,
)

# Create quality signal
gsq_signal = GenerationSafetyQualitySignal(
    connection_id=f"/subscriptions/{sub_id}/resourceGroups/{rg}/providers/Microsoft.MachineLearningServices/workspaces/{workspace}/connections/{aoai_connection}",
    metric_thresholds=quality_thresholds,
    production_data=[production_data],
    sampling_rate=1.0,  # Evaluate 100% of traffic
)

# Schedule daily evaluation
monitor_definition = MonitorDefinition(
    compute=ServerlessSparkCompute(instance_type="standard_e4s_v3", runtime_version="3.3"),
    monitoring_target=MonitoringTarget(
        ml_task=MonitorTargetTasks.QUESTION_ANSWERING,
        endpoint_deployment_id=f"azureml:{endpoint_name}:{deployment_name}"
    ),
    monitoring_signals={"quality_signal": gsq_signal},
    alert_notification=AlertNotification(emails=["genai-team@example.com"])
)

trigger = CronTrigger(expression="15 10 * * *")  # Daily at 10:15 AM

model_monitor = MonitorSchedule(
    name="chatbot_quality_monitor",
    trigger=trigger,
    create_monitor=monitor_definition
)

ml_client.schedules.begin_create_or_update(model_monitor)

Power Platform AI (Citizen Developer Scenario)

Feedback loop components:

| Component | Power Platform service | Purpose |
| --- | --- | --- |
| Automated feedback collection | Power Automate | Route low-confidence predictions to human review |
| Storage | Dataverse / SharePoint | Store feedback data |
| Model improvement | AI Builder Feedback Loop | Automatically add reviewed samples to the training set |
| Retraining | AI Builder | Manual/scheduled retraining |

Example workflow (Power Automate):

  1. Trigger: AI Builder prediction (e.g., document processing)
  2. Condition: If confidence score < 0.7
  3. Action: Save file + prediction output to AI Builder feedback loop storage
  4. Notification: Send email to the reviewer

Result: Reviewed documents are automatically available in the "Feedback loop" data source when the model is retrained.

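Public sector (Norway)

Regulatory requirements

EU AI Act + Norwegian implementation:

  • High-risk AI: Continuous monitoring and logging are mandatory (Article 12 on record-keeping and Article 72 on post-market monitoring in the final Regulation (EU) 2024/1689; post-market monitoring was Article 61 in the 2021 proposal)
  • Traceability: Automatic logs of inputs, outputs, and decisions
  • Human oversight: HITL review for critical decisions (Article 14)
  • Retesting: Periodic evaluation against the original test set plus new edge cases

Implementation in the Microsoft stack:

# Compliance-oriented logging example (GDPR + AI Act)
import hashlib

import mlflow

# user_query, model_output, trace_id, and send_to_human_review() are assumed
# to be provided by the surrounding application.
confidence_score = 0.85

# Log input/output + rationale (Article 12: Record-keeping)
# hashlib gives a stable digest; the built-in hash() is salted per process
input_hash = hashlib.sha256(user_query.encode("utf-8")).hexdigest()
mlflow.log_param("input_hash", input_hash)  # Pseudonymized
mlflow.log_metric("confidence_score", confidence_score)
mlflow.log_text("Retrieved relevant documents from internal KB", "rationale.txt")

# Human review trigger (Article 14: Human oversight)
if confidence_score < 0.7:
    send_to_human_review(trace_id, user_query, model_output)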

Sustainability (green AI)

Retraining frequency vs CO₂ footprint:

| Strategy | CO₂ impact | When to use |
| --- | --- | --- |
| Daily retraining | HIGH | Financial markets, real-time fraud detection |
| Weekly retraining | MEDIUM | Customer support chatbots |
| Threshold-based | LOW | Retrain only when accuracy < 90% |
| Manual trigger | VERY LOW | Static domain (image classification) |

Azure support:

  • Carbon-aware deployment: Deploy to low-carbon regions (Sweden Central, Norway East)
  • Model decay detection: Avoid unnecessary retraining via threshold-based triggers
  • Efficient inference: Azure ML Managed Online Endpoints with auto-scaling
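The difference between daily and threshold-based retraining can be illustrated by counting runs over the same period. The accuracy series below is invented for the example and the 90% threshold is taken from the table; the result shows the mechanics only and is not a measured saving.

```python
ACCURACY_THRESHOLD = 0.90  # "retrain only when accuracy < 90%"


def count_retraining_runs(daily_accuracy, threshold=ACCURACY_THRESHOLD):
    """Return (runs with daily retraining, runs with threshold-based retraining)."""
    daily_runs = len(daily_accuracy)
    threshold_runs = sum(1 for accuracy in daily_accuracy if accuracy < threshold)
    return daily_runs, threshold_runs


# Hypothetical accuracy measured once per day over ten days
observed_accuracy = [0.94, 0.93, 0.92, 0.91, 0.89, 0.95, 0.94, 0.92, 0.90, 0.88]

daily_runs, threshold_runs = count_retraining_runs(observed_accuracy)
avoided_share = 1 - threshold_runs / daily_runs
print(daily_runs, threshold_runs, avoided_share)
```

Fewer runs translate into less compute only if each run costs roughly the same; a threshold-triggered run on a larger accumulated dataset may cost more than a single daily run.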

Data handling (privacy)

GDPR compliance in feedback loops:

  • Right to explanation (Article 22): Trace logging must include model reasoning
  • Right to be forgotten (Article 17): Ability to delete user feedback data
  • Data minimization (Article 5): Log only necessary fields (not the full user profile)

Implementation:

# Pseudonymization (GDPR-oriented); user_id is assumed to be defined elsewhere
import hashlib

import mlflow

user_id_hash = hashlib.sha256(user_id.encode()).hexdigest()

mlflow.log_param("user_id_hash", user_id_hash)  # Logged
# The original user_id is NOT stored in MLflow
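Pseudonymized identifiers still have to support erasure requests under Article 17. The sketch below uses a keyed hash (HMAC-SHA-256) from the Python standard library, since a plain hash of a short identifier can be reversed by enumeration, and then removes every feedback record belonging to one user. The record layout, the in-memory list, and the demo key are hypothetical; a real system would delete from its actual store and keep the key in a secret manager.

```python
import hashlib
import hmac


def pseudonymize(user_id, secret_key):
    """Keyed hash of the user id; without the key the mapping cannot be rebuilt."""
    return hmac.new(secret_key, user_id.encode("utf-8"), hashlib.sha256).hexdigest()


def erase_user_feedback(feedback_records, user_id, secret_key):
    """Drop all records for one user; returns (remaining_records, removed_count)."""
    target = pseudonymize(user_id, secret_key)
    remaining = [r for r in feedback_records if r["user_id_hash"] != target]
    return remaining, len(feedback_records) - len(remaining)


# Hypothetical key and feedback store
demo_key = b"demo-key-do-not-use-in-production"
feedback_store = [
    {"user_id_hash": pseudonymize("alice", demo_key), "value": True},
    {"user_id_hash": pseudonymize("bob", demo_key), "value": False},
    {"user_id_hash": pseudonymize("alice", demo_key), "value": False},
]

feedback_store, removed = erase_user_feedback(feedback_store, "alice", demo_key)
print(removed, len(feedback_store))
```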

Cost and licensing

Compute costs (retraining)

Azure Machine Learning:

| Scenario | Compute type | Estimated cost (NOK/month) | Confidence |
| --- | --- | --- | --- |
| Daily retraining (tabular ML) | Standard_DS3_v2 (4 vCPU) | ~15,000 - 25,000 | HIGH |
| Weekly retraining (CV) | GPU (NC6s_v3) | ~8,000 - 12,000 | HIGH |
| Threshold-based (GenAI) | Minimal (only when triggered) | ~2,000 - 5,000 | MEDIUM |

Databricks (GenAI Evaluation):

| Scenario | Compute type | Estimated cost (NOK/month) | Confidence |
| --- | --- | --- | --- |
| Daily LLM judge evaluation (10k traces) | Serverless Spark (standard_e4s_v3) | ~10,000 - 15,000 | MEDIUM |
| Human review (Review App) | Minimal (UI hosting) | ~500 - 1,000 | HIGH |

Storage costs

Inference tables + eval datasets:

  • Azure Storage (Delta Lake): ~0.50 NOK/GB/month
  • MLflow Tracking: ~1-2 NOK per experiment run (metadata)

Estimate: 10,000 daily inferences → ~5 GB/month → ~2.50 NOK/month storage
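The storage estimate can be reproduced with a few lines of arithmetic. The figures for daily inferences, monthly volume, and price per GB come from the text; the 30-day month, the decimal unit conversion, and the retention scenario are assumptions added here.

```python
DAILY_INFERENCES = 10_000
DAYS_PER_MONTH = 30             # assumption
NEW_DATA_GB_PER_MONTH = 5.0     # from the estimate above
PRICE_NOK_PER_GB_MONTH = 0.50   # Azure Storage figure above

monthly_inferences = DAILY_INFERENCES * DAYS_PER_MONTH

# Average record size implied by the estimate (1 GB = 1,000,000 KB)
implied_kb_per_inference = NEW_DATA_GB_PER_MONTH * 1_000_000 / monthly_inferences

# Cost of storing one month of new data
monthly_cost_nok = NEW_DATA_GB_PER_MONTH * PRICE_NOK_PER_GB_MONTH


def retained_cost_nok(months_retained):
    """Monthly bill when all data from the last `months_retained` months is kept."""
    return months_retained * NEW_DATA_GB_PER_MONTH * PRICE_NOK_PER_GB_MONTH


print(monthly_inferences, round(implied_kb_per_inference, 1), monthly_cost_nok)
print(retained_cost_nok(12))
```

The 2.50 NOK figure covers one month of new data; if traces are retained, the monthly bill grows with the accumulated volume, as the last function shows.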

Licenses

Microsoft Fabric + Azure ML:

  • Azure ML Enterprise: Included in the subscription, per-use compute pricing
  • Databricks (Unity Catalog): Premium tier (~$2-3 per DBU)

Power Platform:

| License | AI Builder credits/month | Feedback loop support |
| --- | --- | --- |
| Per User | 500 | |
| Per App | Not included (requires Per User) | |
| AI Builder add-on | Custom (purchase extra) | |

For the architect (Cosmo)

When to recommend automated feedback loops?

Yes, recommend:

  • Production model with > 1,000 daily inferences
  • Clear performance metrics (accuracy, F1, RMSE)
  • Regulatory compliance requirements (AI Act, ISO 27001)
  • Business-critical application (customer-facing, revenue impact)

⚠️ Consider carefully:

  • Proof of concept or pilot (manual evaluation is sufficient)
  • Low inference volume (< 100/day)
  • Static domain (data rarely changes)
  • Limited MLOps resources (prioritize automation later)

Recommended questions for the customer

  1. Volume: How many inferences per day are expected in production?
  2. Criticality: What is the consequence of wrong predictions? (customer impact, revenue loss)
  3. Data dynamics: How often does the input data change? (daily, weekly, seasonal)
  4. Expertise: Does the team have MLOps skills, or is this its first AI project?
  5. Budget: What is an acceptable monthly cost for monitoring + retraining?
  6. Regulatory: Does the AI Act / GDPR high-risk classification apply?

Red flags (anti-patterns)

"We retrain every night without checking whether it is necessary" → Suggestion: Threshold-based retraining (saves compute and CO₂)

"We have no monitoring, but we deploy new models every week" → Suggestion: Implement baseline monitoring before increasing deployment frequency

"Users complain about poor quality, but we have no feedback mechanism" → Suggestion: Start with a simple thumbs up/down in the UI and log it to Application Insights

"We evaluate only on the original test set, never on production data" → Suggestion: Export a sample of the inference tables to the eval dataset (to catch drift)

Success metrics for feedback loops

| Metric | Target | Measured as |
| --- | --- | --- |
| Mean time to detect (MTTD) | < 24 hours | Time from quality degradation to alert |
| Retraining cycle time | < 7 days | Time from drift detection to new model in production |
| User feedback rate | > 5% | % of inferences where the user gives feedback |
| False positive rate (monitoring) | < 10% | % of alerts that require no action |
| Quality improvement per iteration | > 5% | Accuracy/F1 gain per retraining cycle |
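Three of these metrics can be computed directly from incident and alert logs. The sketch below shows one way to do so; the record fields (degraded_at, alerted_at, action_required) and all sample values are hypothetical.

```python
from datetime import datetime


def mean_time_to_detect_hours(incidents):
    """Average gap between quality degradation and the resulting alert."""
    gaps = [
        (i["alerted_at"] - i["degraded_at"]).total_seconds() / 3600
        for i in incidents
    ]
    return sum(gaps) / len(gaps)


def feedback_rate(inference_count, feedback_count):
    """Share of inferences for which a user left feedback."""
    return feedback_count / inference_count


def false_positive_rate(alerts):
    """Share of alerts that turned out to require no action."""
    return sum(1 for a in alerts if not a["action_required"]) / len(alerts)


# Hypothetical logs
incidents = [
    {"degraded_at": datetime(2026, 4, 1, 8, 0), "alerted_at": datetime(2026, 4, 1, 14, 0)},
    {"degraded_at": datetime(2026, 4, 5, 0, 0), "alerted_at": datetime(2026, 4, 5, 18, 0)},
]
alerts = [
    {"action_required": True},
    {"action_required": True},
    {"action_required": False},
    {"action_required": True},
]

mttd = mean_time_to_detect_hours(incidents)
fb_rate = feedback_rate(inference_count=10_000, feedback_count=620)
fp_rate = false_positive_rate(alerts)

# Compare against the targets in the table
print(mttd < 24, fb_rate > 0.05, fp_rate < 0.10)
```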

Sources and verification

Primary sources (Microsoft Learn):

  1. MLflow for GenAI Apps and Agents - Continuous Improvement Cycle (Verified MCP 2026-04 — updated 10-step cycle; new: Trace UI for pattern identification, evaluation harness, version/prompt management tracking)
  2. Machine Learning Operations v2 - Monitoring & Feedback
  3. Generative AI App Developer Workflow - Production Monitoring
  4. Azure AI Foundry - Observability in Generative AI
  5. MLOps and GenAIOps for AI Workloads - Model Maintenance
  6. AI Builder - Continuously Improve Your Model (Feedback Loop)

Code samples:

  • MLflow feedback logging: Azure Databricks - Agent Framework
  • Model monitoring setup: Azure ML - Monitor Model Performance (Verified MCP 2026-04 — supports data quality, data drift, prediction drift, feature attribution drift, and custom signals; integrates with Azure Event Grid for alerting)
  • GenAI evaluation: MLflow 3.x - Evaluate App (Verified MCP 2026-04 — tutorial covers RAG email app evaluation; new scorers: RetrievalGroundedness, Guidelines, RelevanceToQuery, Safety; version comparison with mlflow.genai.evaluate())

Date of last verification: 2026-04-10

MCP calls: 6 (microsoft_docs_search: 3, microsoft_docs_fetch: 3, microsoft_code_sample_search: 2)


For Cosmo

This document covers the full feedback-loop cycle for both classical ML and GenAI. Key points to emphasize in consultations:

  1. Not one-size-fits-all: Automated retraining does not suit everyone (see the decision guidance)
  2. Start simple: Thumbs up/down plus basic monitoring before building a complex MLOps pipeline
  3. GenAI ≠ Classical ML: GenAI requires LLM judges plus human review, not just accuracy metrics
  4. Compliance: The AI Act requires continuous monitoring for high-risk systems (not optional)
  5. Cost: Threshold-based retraining can save 50-70% of compute compared with daily retraining

Use the architecture patterns to visualize the solution for the customer. Point out that MLflow Tracing + Agent Evaluation provide "free" observability (built into Databricks).

MLflow 3 Evaluation & Feedback Loop (Verified MCP 2026-04)

MLflow 3 introduces a unified evaluation-monitoring lifecycle for GenAI feedback loops:

Iterative workflow:

  1. Trace production requests (MLflow Tracing — end-to-end observability)
  2. Evaluate against scorers during development (mlflow.genai.evaluate())
  3. Monitor production with same scorers (consistent quality measurement)
  4. Gather human feedback via Review App (expert annotations)
  5. Improve prompts/models based on evaluation datasets

Built-in LLM judges (scorers):

  • RetrievalGroundedness — checks if response is grounded in retrieved data
  • RelevanceToQuery — checks if response addresses the user request
  • Safety — checks for harmful/inappropriate content
  • Guidelines(name, guidelines) — custom policy/tone/style checks
  • Correctness — factual correctness with expected_facts

Azure ML Model Monitoring signals:

  • Data quality: null values, out-of-range, type mismatch
  • Data drift: statistical distribution changes between training and production data
  • Prediction drift: distribution shift in model outputs
  • Feature attribution drift: changes in feature importance
  • Custom signals: user-defined metrics via custom scripts
  • Integrates with Azure Event Grid for alerting on threshold breaches

Evaluation dataset workflow (new 2026-04):

  1. Search production traces → select problematic + high-quality examples
  2. Save to versioned eval dataset in Unity Catalog (mlflow.genai.datasets.create_dataset())
  3. Run evaluation harness with mlflow.genai.evaluate(data=eval_dataset, predict_fn=..., scorers=...)
  4. Compare runs in UI (Evaluation runs view) or SDK (mlflow.search_runs)
  5. Identify regressions per-metric before promoting new versions
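Step 1 of this workflow, choosing which traces go into the evaluation dataset, can be sketched without any MLflow calls. The trace fields, the two score cutoffs, and the output record shape are hypothetical; the records would afterwards be saved with the dataset API named in step 2.

```python
def select_eval_candidates(traces, low_score=0.5, high_score=0.9):
    """Keep problematic and high-quality traces; skip the unremarkable middle."""
    records = []
    for trace in traces:
        thumbs_down = trace["user_feedback"] is False
        if trace["judge_score"] < low_score or thumbs_down:
            bucket = "problematic"
        elif trace["judge_score"] >= high_score:
            bucket = "high_quality"
        else:
            continue
        records.append(
            {
                "inputs": {"question": trace["question"]},
                "source_trace_id": trace["trace_id"],
                "bucket": bucket,
            }
        )
    return records


# Hypothetical production traces (user_feedback: True, False, or None)
traces = [
    {"trace_id": "t1", "question": "What is MLflow?", "judge_score": 0.30, "user_feedback": None},
    {"trace_id": "t2", "question": "How do I log a metric?", "judge_score": 0.95, "user_feedback": True},
    {"trace_id": "t3", "question": "How do I deploy?", "judge_score": 0.95, "user_feedback": False},
    {"trace_id": "t4", "question": "What is a run?", "judge_score": 0.70, "user_feedback": None},
]
candidates = select_eval_candidates(traces)
print([(c["source_trace_id"], c["bucket"]) for c in candidates])
```

A thumbs-down overrides a high judge score here, on the assumption that disagreement between users and judges is exactly what the evaluation set should capture.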

Continuous improvement cycle: Production traces → MLflow evaluation datasets → Scorer alignment → Prompt/model update → A/B test → Production rollout