Feedback Loops and Continuous Improvement
Category: MLOps & GenAIOps Date: 2026-02-04 Last updated: 2026-04 Confidence: HIGH (based on official Microsoft documentation)
Verified: MCP 2026-04
Introduction
Feedback loops and continuous improvement are critical components of modern AI operations. Unlike traditional software, where behaviour is deterministic, AI models can drift in quality or behave unexpectedly when they meet real-world data. A well-functioning feedback system keeps models accurate, relevant and safe throughout their life cycle.
Key concept: Feedback loops connect production data, user insight and performance metrics back to the development process, creating a continuous cycle of measurement, learning and improvement.
Why this matters
- Model decay: AI models degrade over time as data, user patterns or business context change
- Quality assurance: Automated and manual evaluation exposes gaps between expected and actual performance
- User value: Direct feedback from end users gives insight that technical metrics do not capture
- Compliance: Regulatory requirements (AI Act, GDPR) call for traceability and continuous monitoring
Core components
1. Production Monitoring & Telemetry
Azure services:
- Azure Monitor + Application Insights: Collects telemetry from endpoints and tracks latency, error rates and token usage
- Azure Machine Learning Model Monitoring: Automatic detection of data drift, prediction drift and model performance degradation
- MLflow Tracing: Detailed tracing of every inference interaction, including inputs, outputs and intermediate steps
Key metrics:
| Dimension | Metrics | Confidence |
|---|---|---|
| Operational | Request volume, latency (p50/p95), error rates, token usage | HIGH |
| Quality | Groundedness, relevance, coherence, safety pass rate | HIGH (GenAI) |
| User Feedback | Thumbs up/down, ratings, explicit reports | MEDIUM |
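The operational row of the table can be illustrated with a small aggregation over request records. This is a minimal sketch using only the standard library; the record fields (`latency_ms`, `status`, `tokens`) are a hypothetical schema, not the format Application Insights exports, and the percentile uses the nearest-rank method.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile: smallest value with at least pct% of the data at or below it."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def summarize_requests(requests):
    """Aggregate operational metrics from request records (hypothetical schema)."""
    latencies = [r["latency_ms"] for r in requests]
    server_errors = sum(1 for r in requests if r["status"] >= 500)
    return {
        "request_count": len(requests),
        "latency_p50_ms": percentile(latencies, 50),
        "latency_p95_ms": percentile(latencies, 95),
        "error_rate": server_errors / len(requests),
        "total_tokens": sum(r["tokens"] for r in requests),
    }


requests = [
    {"latency_ms": 120, "status": 200, "tokens": 300},
    {"latency_ms": 180, "status": 200, "tokens": 450},
    {"latency_ms": 950, "status": 500, "tokens": 0},
    {"latency_ms": 140, "status": 200, "tokens": 280},
]
summary = summarize_requests(requests)
print(summary)
```

In a real deployment these aggregates would normally come from the monitoring service itself; the sketch only shows what the numbers mean.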
Code example: Logging user feedback (MLflow)
import mlflow
from mlflow.entities import AssessmentSource
import time
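# `response` is assumed to be the object returned by an earlier serving-endpoint query (not shown)
# Wait briefly so the trace has been persisted before feedback is attached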
time.sleep(1)
# Extract span and trace IDs from response
response_dict = response.as_dict()
first_prediction = response_dict["predictions"][0]
first_result = first_prediction["results"][0]
span_id = first_result["span_id"]
trace_id = first_prediction["trace_id"]
# Log user feedback
mlflow.log_feedback(
trace_id=trace_id,
span_id=span_id,
name="user_feedback",
value=True, # True for positive, False for negative
source=AssessmentSource(source_type="HUMAN"),
rationale="Answer was accurate and well-reasoned",
)
2. Data Collection & Evaluation Datasets
Process:
- Production traces → Evaluation set: Use inference table logs to identify problematic interactions
- Synthetic data generation: Generate a starter dataset before production data is available
- Expert curation: SMEs validate and annotate edge cases and gold-standard answers
Azure services:
- MLflow Datasets: Versioned storage of evaluation datasets in Unity Catalog
- Azure AI Foundry Agent Evaluation: Evaluation with LLM judges (correctness, relevance, groundedness, safety)
- Databricks Review App: Collect feedback from domain experts on production traces
Best practices:
- Include both expected and unexpected usage patterns in the evaluation set
- Test for edge cases (long/short inputs, misspellings, prompt injection)
- Combine expected_facts (flexible) with guidelines (tone, style, policy)
Code example: Evaluation with MLflow
import mlflow
from mlflow.genai.scorers import Correctness, RelevanceToQuery
# Define evaluation dataset
eval_data = [
{
"inputs": {"question": "What is MLflow?"},
"expectations": {
"expected_facts": ["open-source platform", "ML lifecycle management"]
}
},
{
"inputs": {"question": "How do I track experiments?"},
"expectations": {
"expected_facts": ["mlflow.start_run()", "log metrics", "log parameters"]
}
}
]
# Run evaluation
results = mlflow.genai.evaluate(
data=eval_data,
predict_fn=my_agent,
scorers=[Correctness(), RelevanceToQuery()],
)
print(f"Correctness score: {results.metrics['correctness/mean']:.2f}")
3. Automated Retraining & Model Promotion
Strategies:
| Strategy | When to use | Trade-offs |
|---|---|---|
| Online training | Daily/continuous updates with new data | High cost, requires robust automation |
| Offline training | Less frequent updates (weekly/monthly) | Lower cost, risk of model decay |
| Threshold-based | Retrain when performance falls below a threshold | Balances accuracy against compute and energy use |
Azure services:
- Azure Machine Learning Pipelines: CI/CD for modelltrening og deployment
- Azure DevOps / GitHub Actions: Automatiserte triggers ved model registration
- Azure Arc: Hybrid/multicloud deployment-orkestrering
Triggers for retraining:
- Data drift: Statistical properties of the input data have changed (detected via monitoring)
- Prediction drift: The output distribution deviates from the baseline
- Performance degradation: Metrics (accuracy, F1-score) fall below a threshold
- Manual trigger: Human-in-the-loop approval for critical models
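The four triggers above can be combined into a single decision function. The sketch below is illustrative only: the class names, field names and default thresholds are hypothetical and would need tuning per model.

```python
from dataclasses import dataclass


@dataclass
class MonitoringSnapshot:
    data_drift: float        # drift score on inputs, e.g. a Jensen-Shannon distance
    prediction_drift: float  # same kind of score on model outputs
    f1_score: float          # latest F1 measured on labeled production data


@dataclass
class Thresholds:
    # Hypothetical defaults, not values recommended by any service
    max_data_drift: float = 0.1
    max_prediction_drift: float = 0.1
    min_f1_score: float = 0.8


def retraining_reasons(snapshot, thresholds, manual_request=False):
    """Return the list of triggers that fired; an empty list means no retraining."""
    reasons = []
    if snapshot.data_drift > thresholds.max_data_drift:
        reasons.append("data_drift")
    if snapshot.prediction_drift > thresholds.max_prediction_drift:
        reasons.append("prediction_drift")
    if snapshot.f1_score < thresholds.min_f1_score:
        reasons.append("performance_degradation")
    if manual_request:
        reasons.append("manual_trigger")
    return reasons


healthy = MonitoringSnapshot(data_drift=0.02, prediction_drift=0.03, f1_score=0.91)
degraded = MonitoringSnapshot(data_drift=0.25, prediction_drift=0.04, f1_score=0.74)
print(retraining_reasons(healthy, Thresholds()))
print(retraining_reasons(degraded, Thresholds()))
```

Returning the reasons rather than a bare boolean keeps an audit trail of why a retraining run was started.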
Code example: Model monitoring setup
from azure.ai.ml import MLClient
from azure.ai.ml.entities import (
    MonitorSchedule,
    RecurrenceTrigger,
    RecurrencePattern,
    MonitorDefinition,
    ServerlessSparkCompute,
    MonitoringTarget,
    AlertNotification,
    DataDriftSignal,
    DataDriftMetricThreshold,
    NumericalDriftMetrics,
)
# Setup monitoring for data drift
ml_client = MLClient(...)
spark_compute = ServerlessSparkCompute(
instance_type="standard_e4s_v3",
runtime_version="3.3"
)
monitoring_target = MonitoringTarget(
ml_task="classification",
endpoint_deployment_id="azureml:fraud-detection-endpoint:main"
)
# Define drift thresholds
metric_thresholds = DataDriftMetricThreshold(
    numerical=NumericalDriftMetrics(
        jensen_shannon_distance=0.01  # Alert when the Jensen-Shannon distance exceeds 0.01
    )
)
data_drift_signal = DataDriftSignal(
    reference_data=training_data,  # a ReferenceData object defined elsewhere (not shown)
    metric_thresholds=metric_thresholds,
    alert_enabled=True
)
# Create monitoring schedule
monitor_definition = MonitorDefinition(
compute=spark_compute,
monitoring_target=monitoring_target,
monitoring_signals={"data_drift": data_drift_signal},
alert_notification=AlertNotification(emails=["ml-team@example.com"])
)
recurrence_trigger = RecurrenceTrigger(
frequency="day",
interval=1,
schedule=RecurrencePattern(hours=3, minutes=0)
)
model_monitor = MonitorSchedule(
name="fraud_detection_monitor",
trigger=recurrence_trigger,
create_monitor=monitor_definition
)
ml_client.schedules.begin_create_or_update(model_monitor)
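The drift threshold in the setup above is a Jensen-Shannon distance, not a percentage. To give a feel for the scale, here is a minimal pure-Python sketch of that metric over two histograms. It assumes base-2 logarithms, which bound the distance to [0, 1]; how Azure ML bins features and normalizes the value internally is not covered here.

```python
import math


def jensen_shannon_distance(p, q):
    """Jensen-Shannon distance (base-2 logs) between two histograms of equal length."""
    if len(p) != len(q):
        raise ValueError("histograms must have the same number of bins")
    p_total, q_total = sum(p), sum(q)
    p = [x / p_total for x in p]
    q = [x / q_total for x in q]
    mixture = [(a + b) / 2 for a, b in zip(p, q)]

    def kl_divergence(a, b):
        # Bins where a is zero contribute nothing; b is never zero where a is positive
        return sum(x * math.log2(x / y) for x, y in zip(a, b) if x > 0)

    divergence = 0.5 * kl_divergence(p, mixture) + 0.5 * kl_divergence(q, mixture)
    return math.sqrt(max(divergence, 0.0))


reference_counts = [50, 30, 20]   # hypothetical training histogram
production_counts = [20, 30, 50]  # hypothetical production histogram
drift = jensen_shannon_distance(reference_counts, production_counts)
print(f"JS distance: {drift:.3f}")
```

Identical distributions give 0 and fully disjoint ones give 1, so a threshold of 0.01 is a sensitive setting that will fire on small shifts.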
4. Human-in-the-Loop (HITL) Workflows
Components:
- Review App (Databricks): Thumbs up/down and textual feedback on agent responses
- Expert labeling: SMEs annotate traces with expected outputs and policy violations
- Approval gates: Human approval before deploying to production (critical models)
Azure services:
- Azure Logic Apps / Power Automate: Workflow automation for HITL review
- AI Builder Feedback Loop: Automatic routing of low-confidence predictions to human review
Best practices:
- Balance automation against HITL: Review only low-confidence outputs (score < 70%)
- Avoid reviewer fatigue: Sample strategically rather than reviewing every interaction
- Incorporate feedback quickly: Weekly review cycles, not monthly
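The first two practices can be sketched as a review-queue selector: keep only outputs below the score threshold, then sample a fraction of them. The record shape, the 20% sample rate and the fixed seed are assumptions made for illustration.

```python
import random


def select_for_review(records, score_threshold=0.7, sample_rate=0.2, seed=0):
    """Pick a random sample of low-confidence records for human review."""
    rng = random.Random(seed)  # fixed seed keeps the selection reproducible
    low_confidence = [r for r in records if r["score"] < score_threshold]
    if not low_confidence:
        return []
    sample_size = max(1, round(len(low_confidence) * sample_rate))
    return rng.sample(low_confidence, sample_size)


# Hypothetical traces with scores 0.00, 0.05, ..., 0.95
records = [{"trace_id": f"t{i}", "score": i / 20} for i in range(20)]
review_queue = select_for_review(records)
print([r["trace_id"] for r in review_queue])
```

Uniform sampling is the simplest choice; stratifying by topic or customer segment is a reasonable refinement once volumes grow.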
5. Continuous Improvement Cycle (MLflow for GenAI)
10-step cycle:
- 🚀 Production App: The deployed agent generates traces with inputs/outputs
- 👍 👎 User Feedback: Thumbs up/down on every interaction
- 🔍 Monitor & Score: LLM judges (correctness, safety, relevance) score traces automatically
- ⚠️ Identify Issues: The Trace UI shows patterns in low-scoring traces
- 👥 Domain Expert Review: A sample is sent to SMEs via the Review App
- 📋 Build Eval Dataset: Curate problematic and high-quality traces into an evaluation set
- 🎯 Tune Scorers: Use expert feedback to align LLM judges with human judgment
- 🧪 Evaluate New Versions: Test improvements against the evaluation set with the same scorers
- 📈 Compare Results: MLflow evaluation runs compare versions
- ✅ Deploy or Iterate: Deploy if quality improves without regression
Code example: Version comparison
import mlflow

# Evaluate v1
with mlflow.start_run(run_name="v1"):
    eval_results_v1 = mlflow.genai.evaluate(
        data=eval_dataset,
        predict_fn=generate_sales_email_v1,
        scorers=email_judges,
    )

# Evaluate v2
with mlflow.start_run(run_name="v2"):
    eval_results_v2 = mlflow.genai.evaluate(
        data=eval_dataset,
        predict_fn=generate_sales_email_v2,
        scorers=email_judges,  # Same judges for fairness
    )

# Compare results
run_v1_df = mlflow.search_runs(filter_string=f"run_id = '{eval_results_v1.run_id}'")
run_v2_df = mlflow.search_runs(filter_string=f"run_id = '{eval_results_v2.run_id}'")
metric_cols = [col for col in run_v1_df.columns
               if col.startswith('metrics.') and col.endswith('/mean')]
for metric in metric_cols:
    v1_score = run_v1_df[metric].iloc[0]
    v2_score = run_v2_df[metric].iloc[0]
    improvement = v2_score - v1_score
    print(f"{metric}: {v1_score:.3f} → {v2_score:.3f} ({improvement:+.3f})")
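The last step of the cycle, deploying only when quality improves without regression, can be sketched as a gate over the per-metric means. The function names and the 0.01 tolerance are hypothetical, and the sketch assumes that higher is better for every metric.

```python
def find_regressions(baseline, candidate, tolerance=0.01):
    """Metrics where the candidate is worse than the baseline by more than the tolerance."""
    return {
        name: candidate[name] - baseline[name]
        for name in baseline
        if name in candidate and candidate[name] < baseline[name] - tolerance
    }


def should_deploy(baseline, candidate, tolerance=0.01):
    """Deploy only if at least one metric improves and none regresses."""
    improved = any(
        name in candidate and candidate[name] > value + tolerance
        for name, value in baseline.items()
    )
    return improved and not find_regressions(baseline, candidate, tolerance)


# Hypothetical per-metric means for three versions
v1 = {"correctness/mean": 0.72, "safety/mean": 0.98}
v2 = {"correctness/mean": 0.80, "safety/mean": 0.98}
v3 = {"correctness/mean": 0.85, "safety/mean": 0.90}

print(should_deploy(v1, v2))      # correctness up, safety unchanged
print(find_regressions(v1, v3))   # safety dropped despite better correctness
```

A tolerance avoids blocking a release on noise from the LLM judges; a safety metric may warrant a tolerance of zero.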
Architecture patterns
Pattern 1: Automated MLOps Loop (Classical ML)
┌─────────────────────────────────────────────────────────┐
│ Production Deployment (Managed Online Endpoint) │
│ ├─ Data Collection (inference tables) │
│ └─ Monitoring (Azure Monitor, drift detection) │
└─────────────────────┬───────────────────────────────────┘
│ Drift detected / Threshold reached
▼
┌─────────────────────────────────────────────────────────┐
│ CI/CD Pipeline (Azure Pipelines / GitHub Actions) │
│ ├─ Pull production data │
│ ├─ Retrain model (Azure ML Compute) │
│ ├─ Evaluate (test set + validation metrics) │
│ └─ Promote to staging (if quality gates pass) │
└─────────────────────┬───────────────────────────────────┘
│ Human approval (HITL)
▼
┌─────────────────────────────────────────────────────────┐
│ Staging Environment │
│ ├─ A/B testing (champion vs challenger) │
│ ├─ Responsible AI checks (bias, fairness) │
│ └─ Final validation │
└─────────────────────┬───────────────────────────────────┘
│ Deploy to prod
▼
[Production]
When to use:
- Tabular ML (classification, regression, forecasting)
- Automated retraining is justified (cost-effective)
- The model has clear performance metrics (accuracy, RMSE, F1)
Pattern 2: GenAI Feedback Loop (LLM Applications)
┌─────────────────────────────────────────────────────────┐
│ Production Agent (Model Serving Endpoint) │
│ ├─ MLflow Tracing (span-level telemetry) │
│ ├─ User feedback (thumbs up/down) │
│ └─ Inference tables (Unity Catalog) │
└─────────────────────┬───────────────────────────────────┘
│ Daily batch evaluation
▼
┌─────────────────────────────────────────────────────────┐
│ Production Monitoring (Agent Evaluation) │
│ ├─ LLM Judges (correctness, safety, relevance) │
│ ├─ Sampling rate: 10-100% of traffic │
│ └─ Alerts on quality degradation │
└─────────────────────┬───────────────────────────────────┘
│ Export low-scoring traces
▼
┌─────────────────────────────────────────────────────────┐
│ Evaluation Dataset Curation │
│ ├─ Filter by user feedback + LLM judge scores │
│ ├─ SME review (Review App) │
│ └─ Add to versioned eval dataset (MLflow Datasets) │
└─────────────────────┬───────────────────────────────────┘
│ Trigger improvement cycle
▼
┌─────────────────────────────────────────────────────────┐
│ Agent Development (Inner Loop) │
│ ├─ Refine prompts / retrieval logic / tools │
│ ├─ Run offline evaluation (eval dataset + scorers) │
│ └─ Compare to baseline (MLflow tracking) │
└─────────────────────┬───────────────────────────────────┘
│ Quality improved?
▼
[Yes: Deploy] [No: Iterate]
When to use:
- Agentic RAG, chatbots, content generation
- Quality is subjective (tone, style, policy compliance)
- Frequent prompt/logic changes, not just model retraining
Pattern 3: Hybrid (CV/NLP med Human Annotation)
┌─────────────────────────────────────────────────────────┐
│ Production Model (Batch/Online Endpoint) │
│ └─ Model performance monitoring (accuracy on new data)│
└─────────────────────┬───────────────────────────────────┘
│ Performance drops
▼
┌─────────────────────────────────────────────────────────┐
│ Human-in-the-Loop Annotation │
│ ├─ Sample low-confidence predictions │
│ ├─ Annotators label new data (Azure ML Labeling) │
│ └─ Quality review by SMEs │
└─────────────────────┬───────────────────────────────────┘
│ New labeled data
▼
┌─────────────────────────────────────────────────────────┐
│ Model Development (Inner Loop) │
│ ├─ Update training set with new annotations │
│ ├─ Retrain model (not automated) │
│ └─ Evaluate on test set + new edge cases │
└─────────────────────┬───────────────────────────────────┘
│ Quality gates pass?
▼
[Staging → Production]
When to use:
- Computer vision (image classification, object detection)
- NLP tasks (text classification, NER)
- Automated retraining is not desirable (resource-intensive, requires human review)
Decision guidance
When to implement automated vs manual retraining?
| Factor | Automated Retraining | Manual Retraining |
|---|---|---|
| Data volume | High (new data daily) | Low (weekly/monthly) |
| Model stability | High (proven architecture) | Low (experimental) |
| Cost tolerance | High (compute budget ok) | Low (cost-sensitive) |
| Regulatory | Low risk (non-critical) | High risk (health, finance) |
| Expertise | Available (MLOps team) | Limited (manual review required) |
Rule of thumb:
- Classical ML (tabular): Automate if data volume > 1000 new rows/day
- GenAI (LLM): Manual iteration (prompt refinement) more often than retraining
- CV/NLP: Hybrid (automated monitoring → manual annotation → triggered retraining)
When to use LLM judges vs human evaluation?
| Scenario | LLM Judges | Human Evaluation |
|---|---|---|
| Factual correctness | ✅ (with expected_facts) | ✅ (gold standard) |
| Safety (toxicity, bias) | ✅ (high recall) | ✅ (final validation) |
| Style/tone compliance | ✅ (guidelines judge) | ✅ (subjective quality) |
| Edge cases | ⚠️ (may miss nuance) | ✅ (domain expertise) |
| Volume | ✅ (scale to 100% traffic) | ❌ (sample 1-10%) |
| Cost | Medium (LLM inference) | High (SME time) |
Best practice:
- Start with LLM judges for bulk evaluation (development + production monitoring)
- Sample 10-20% of low-scoring traces for human review
- Use human feedback to tune LLM judges (few-shot examples)
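Before tuning a judge it helps to measure how far it is from the human reviewers. One common way is raw agreement plus Cohen's kappa, which corrects for agreement expected by chance. The sketch below uses made-up pass/fail labels and is not tied to any MLflow API.

```python
def agreement_and_kappa(judge_labels, human_labels):
    """Return (observed agreement, Cohen's kappa) for two equally long label lists."""
    if len(judge_labels) != len(human_labels) or not judge_labels:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(judge_labels)
    observed = sum(j == h for j, h in zip(judge_labels, human_labels)) / n
    labels = set(judge_labels) | set(human_labels)
    # Agreement expected if both raters labeled independently at their own base rates
    expected = sum(
        (judge_labels.count(label) / n) * (human_labels.count(label) / n)
        for label in labels
    )
    if expected == 1.0:
        return observed, 1.0  # both raters used a single identical label throughout
    return observed, (observed - expected) / (1 - expected)


# Hypothetical pass/fail verdicts on the same eight traces
judge = [True, True, False, True, False, False, True, True]
human = [True, False, False, True, False, True, True, True]
observed, kappa = agreement_and_kappa(judge, human)
print(f"agreement={observed:.2f}, kappa={kappa:.2f}")
```

A low kappa on the sampled traces suggests the judge prompt needs better few-shot examples before its scores are trusted for monitoring.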
Integration with the Microsoft stack
Azure Machine Learning (Classical ML)
Feedback loop components:
| Component | Azure service | Purpose |
|---|---|---|
| Data collection | Inference tables (managed endpoints) | Capture production inputs/outputs |
| Monitoring | Model Monitor (Azure ML) | Data drift, prediction drift, performance |
| Alerting | Azure Monitor Alerts | Email/webhook on threshold breach |
| Retraining | Azure ML Pipelines | Triggered retraining workflow |
| A/B testing | Staging endpoints | Champion vs challenger validation |
| Deployment | Managed Online Endpoints | Blue-green deployment |
Code example: Alert notification on data drift
from azure.ai.ml.entities import AlertNotification, MonitorDefinition

# spark_compute, monitoring_target and data_drift_signal are defined in the
# model monitoring setup example earlier in this document
alert_notification = AlertNotification(
    emails=['ml-team@example.com', 'data-science-lead@example.com']
)
monitor_definition = MonitorDefinition(
    compute=spark_compute,
    monitoring_target=monitoring_target,
    monitoring_signals={"data_drift": data_drift_signal},
    alert_notification=alert_notification  # Sends email when drift is detected
)
Azure AI Foundry (GenAI)
Feedback loop components:
| Component | Azure service | Purpose |
|---|---|---|
| Production tracing | MLflow Tracing (Databricks) | Span-level telemetry |
| User feedback | Review App | Thumbs up/down, textual feedback |
| LLM judges | Agent Evaluation | Automated quality scoring |
| Monitoring dashboard | Azure AI Foundry Observability | Quality trends, latency, errors |
| Eval datasets | MLflow Datasets (Unity Catalog) | Versioned test sets |
| Red teaming | AI Red Teaming Agent | Adversarial testing for safety |
Code example: Production monitoring setup (GenAI)
from azure.ai.ml import Input, MLClient
from azure.ai.ml.constants import MonitorTargetTasks
from azure.ai.ml.entities import (
    AlertNotification,
    MonitorSchedule,
    CronTrigger,
    MonitorDefinition,
    ServerlessSparkCompute,
    MonitoringTarget,
    GenerationSafetyQualitySignal,
    GenerationSafetyQualityMonitoringMetricThreshold,
    LlmData,
    BaselineDataRange,
)
ml_client = MLClient(...)
# Define quality thresholds (70% passing rate)
quality_thresholds = GenerationSafetyQualityMonitoringMetricThreshold(
groundedness={"aggregated_groundedness_pass_rate": 0.7},
relevance={"aggregated_relevance_pass_rate": 0.7},
coherence={"aggregated_coherence_pass_rate": 0.7},
fluency={"aggregated_fluency_pass_rate": 0.7},
)
# Reference production data (app traces)
data_window = BaselineDataRange(lookback_window_size="P7D", lookback_window_offset="P0D")
production_data = LlmData(
data_column_names={
"prompt_column": "question",
"completion_column": "answer",
"context_column": "context"
},
input_data=Input(type="uri_folder", path="endpoint-deployment-app_traces:1"),
data_window=data_window,
)
# Create quality signal
gsq_signal = GenerationSafetyQualitySignal(
connection_id=f"/subscriptions/{sub_id}/resourceGroups/{rg}/providers/Microsoft.MachineLearningServices/workspaces/{workspace}/connections/{aoai_connection}",
metric_thresholds=quality_thresholds,
production_data=[production_data],
sampling_rate=1.0, # Evaluate 100% of traffic
)
# Schedule daily evaluation
monitor_definition = MonitorDefinition(
compute=ServerlessSparkCompute(instance_type="standard_e4s_v3", runtime_version="3.3"),
monitoring_target=MonitoringTarget(
ml_task=MonitorTargetTasks.QUESTION_ANSWERING,
endpoint_deployment_id=f"azureml:{endpoint_name}:{deployment_name}"
),
monitoring_signals={"quality_signal": gsq_signal},
alert_notification=AlertNotification(emails=["genai-team@example.com"])
)
trigger = CronTrigger(expression="15 10 * * *") # Daily at 10:15 AM
model_monitor = MonitorSchedule(
name="chatbot_quality_monitor",
trigger=trigger,
create_monitor=monitor_definition
)
ml_client.schedules.begin_create_or_update(model_monitor)
Power Platform AI (Citizen Developer Scenario)
Feedback loop components:
| Component | Power Platform service | Purpose |
|---|---|---|
| Automated feedback collection | Power Automate | Route low-confidence predictions to human review |
| Storage | Dataverse / SharePoint | Store feedback data |
| Model improvement | AI Builder Feedback Loop | Automatically add reviewed samples to training set |
| Retraining | AI Builder | Manual/scheduled retraining |
Example workflow (Power Automate):
- Trigger: AI Builder prediction (e.g., document processing)
- Condition: If confidence score < 0.7
- Action: Save file + prediction output to AI Builder feedback loop storage
- Notification: Send email to the reviewer
Result: Reviewed documents are automatically available in the "Feedback loop" data source when the model is retrained.
Public sector (Norway)
Regulatory requirements
EU AI Act + Norwegian implementation:
- High-risk AI: Continuous monitoring and logging are mandatory (Article 12 on record-keeping and Article 72 on post-market monitoring in the adopted Regulation (EU) 2024/1689; post-market monitoring was Article 61 in the 2021 proposal)
- Traceability: Automatic logs of inputs, outputs and decisions
- Human oversight: HITL review for critical decisions (Article 14)
- Retesting: Periodic evaluation against the original test set + new edge cases
Implementation in the Microsoft stack:
# Logging sketch for traceability (GDPR + AI Act)
import hashlib

import mlflow

confidence_score = 0.85  # example value; produced by the model in practice

# Log input digest, score and rationale (AI Act Article 12: record-keeping)
# SHA-256 is stable across processes, unlike the built-in hash()
mlflow.log_param("input_hash", hashlib.sha256(user_query.encode("utf-8")).hexdigest())
mlflow.log_metric("confidence_score", confidence_score)
mlflow.log_text("Retrieved relevant documents from internal KB", "rationale.txt")

# Human review trigger (AI Act Article 14: human oversight)
if confidence_score < 0.7:
    send_to_human_review(trace_id, user_query, model_output)
Sustainability (green AI)
Retraining frequency vs CO₂ footprint:
| Strategy | CO₂ impact | When to use |
|---|---|---|
| Daily retraining | HIGH | Financial markets, real-time fraud detection |
| Weekly retraining | MEDIUM | Customer support chatbots |
| Threshold-based | LOW | Retrain only when accuracy < 90% |
| Manual trigger | VERY LOW | Static domain (image classification) |
Azure support:
- Carbon-aware deployment: Deploy to low-carbon regions (Sweden Central, Norway East)
- Model decay detection: Avoid unnecessary retraining via threshold-based triggers
- Efficient inference: Azure ML Managed Online Endpoints med auto-scaling
Data handling (privacy)
GDPR compliance in feedback loops:
- Automated decision-making (Article 22, often discussed as a "right to explanation"): Trace logging should capture the model's reasoning so that decisions can be explained
- Right to erasure, "right to be forgotten" (Article 17): It must be possible to delete user feedback data
- Data minimization (Article 5): Log only the necessary fields (not the full user profile)
Implementation:
# Pseudonymization sketch (pseudonymized data is still personal data under GDPR)
import hashlib

import mlflow

user_id_hash = hashlib.sha256(user_id.encode()).hexdigest()
mlflow.log_param("user_id_hash", user_id_hash)  # Only the digest is logged
# The original user_id is NOT stored in MLflow
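An unsalted digest of a guessable identifier can be reversed by hashing candidate IDs, so a keyed hash is a common hardening step. The sketch below uses HMAC-SHA256 from the standard library and shows how the same token supports an erasure request. The key value, row layout and function names are hypothetical; in practice the key would come from a secret store such as Azure Key Vault.

```python
import hashlib
import hmac


def pseudonymize(user_id: str, secret_key: bytes) -> str:
    """Keyed digest: stable for the same key, not reproducible without it."""
    return hmac.new(secret_key, user_id.encode("utf-8"), hashlib.sha256).hexdigest()


def erase_user_feedback(feedback_rows, user_id, secret_key):
    """Return the rows that remain after removing one user's feedback (Article 17)."""
    token = pseudonymize(user_id, secret_key)
    return [row for row in feedback_rows if row["user_id_hash"] != token]


key = b"example-key-load-from-a-secret-store"  # hypothetical; never hard-code in production
feedback_rows = [
    {"user_id_hash": pseudonymize("user-123", key), "value": True},
    {"user_id_hash": pseudonymize("user-456", key), "value": False},
]
remaining = erase_user_feedback(feedback_rows, "user-123", key)
print(len(remaining))
```

Because the token is deterministic for a given key, erasure works without ever storing the raw identifier; rotating the key breaks that link, so rotation needs its own migration plan.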
Cost and licensing
Compute costs (retraining)
Azure Machine Learning:
| Scenario | Compute Type | Estimated cost (NOK/month) | Confidence |
|---|---|---|---|
| Daily retraining (tabular ML) | Standard_DS3_v2 (4 vCPU) | ~15 000 - 25 000 | HIGH |
| Weekly retraining (CV) | GPU (NC6s_v3) | ~8 000 - 12 000 | HIGH |
| Threshold-based (GenAI) | Minimal (only when triggered) | ~2 000 - 5 000 | MEDIUM |
Databricks (GenAI Evaluation):
| Scenario | Compute Type | Estimated cost (NOK/month) | Confidence |
|---|---|---|---|
| Daily LLM judge evaluation (10k traces) | Serverless Spark (standard_e4s_v3) | ~10 000 - 15 000 | MEDIUM |
| Human review (Review App) | Minimal (UI hosting) | ~500 - 1 000 | HIGH |
Storage costs
Inference tables + eval datasets:
- Azure Storage (Delta Lake): ~0.50 NOK/GB/month
- MLflow Tracking: ~1-2 NOK per experiment run (metadata)
Estimate: 10 000 daily inferences → ~5 GB/month → ~2.50 NOK/month storage
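The estimate implies roughly 17 KB per logged inference (5 GB spread over about 300 000 records in a 30-day month). That record size is derived here from the figures above, not stated in the sources, so treat the sketch as a way to redo the arithmetic with your own numbers.

```python
def monthly_storage_estimate(daily_inferences, avg_record_kb,
                             price_per_gb_nok=0.50, days_per_month=30):
    """Return (GB written per month, NOK per month) for newly logged inference data."""
    # Decimal units: 1 GB = 1 000 000 KB
    gb_per_month = daily_inferences * days_per_month * avg_record_kb / 1_000_000
    return gb_per_month, gb_per_month * price_per_gb_nok


# 16.7 KB per record is an assumed average implied by the estimate above
gb, cost_nok = monthly_storage_estimate(daily_inferences=10_000, avg_record_kb=16.7)
print(f"{gb:.2f} GB/month, {cost_nok:.2f} NOK/month")
```

The figure covers one month of new data only; if traces are retained, the stored volume and the bill grow each month until a retention policy removes old records.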
Licenses
Microsoft Fabric + Azure ML:
- Azure ML Enterprise: Included in the subscription, per-use compute pricing
- Databricks (Unity Catalog): Premium tier (~$2-3 per DBU)
Power Platform:
| License | AI Builder credits/month | Feedback Loop Support |
|---|---|---|
| Per User | 500 | ✅ |
| Per App | Not included | ❌ (requires Per User) |
| AI Builder add-on | Custom (purchased separately) | ✅ |
For the architect (Cosmo)
When to recommend automated feedback loops?
✅ Yes, recommend:
- Production model with > 1000 daily inferences
- Clear performance metrics (accuracy, F1, RMSE)
- Regulatory compliance requirements (AI Act, ISO 27001)
- Business-critical application (customer-facing, revenue impact)
⚠️ Consider carefully:
- Proof of concept or pilot (manual evaluation is sufficient)
- Low inference volume (< 100/day)
- Static domain (data rarely changes)
- Limited MLOps resources (defer automation until later)
Recommended questions for the customer
- Volume: How many inferences per day are expected in production?
- Criticality: What is the consequence of wrong predictions? (customer impact, revenue loss)
- Data dynamics: How often does the input data change? (daily, weekly, seasonal)
- Expertise: Does the team have MLOps experience, or is this its first AI project?
- Budget: What monthly cost is acceptable for monitoring + retraining?
- Regulatory: Does an AI Act / GDPR high-risk classification apply?
Red flags (anti-patterns)
❌ "We retrain every night without checking whether it is needed" → Suggestion: Threshold-based retraining (saves compute and CO₂)
❌ "We have no monitoring, but we deploy new models every week" → Suggestion: Implement baseline monitoring before increasing deployment frequency
❌ "Users complain about poor quality, but we have no feedback mechanism" → Suggestion: Start with a simple thumbs up/down in the UI, logged to Application Insights
❌ "We only evaluate on the original test set, never on production data" → Suggestion: Export a sample of the inference tables to the eval dataset (catches drift)
Success metrics for feedback loops
| Metric | Target | What is measured |
|---|---|---|
| Mean time to detect (MTTD) | < 24 hours | Time from quality degradation to alert |
| Retraining cycle time | < 7 days | Time from drift detection to new model in prod |
| User feedback rate | > 5% | % of inferences where the user gives feedback |
| False positive rate (monitoring) | < 10% | % of alerts that require no action |
| Quality improvement per iteration | > 5% | Accuracy/F1 gain per retraining cycle |
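Three of the metrics in the table can be computed directly from incident and alert logs. The sketch below assumes a hypothetical log format: incidents as (degradation start, alert time) pairs and alerts as dictionaries with an `action_required` flag.

```python
from datetime import datetime, timedelta


def mean_time_to_detect(incidents):
    """Average delay between the start of a degradation and its alert."""
    delays = [alert_time - start for start, alert_time in incidents]
    return sum(delays, timedelta()) / len(delays)


def feedback_rate(inference_count, feedback_count):
    """Share of inferences for which a user left feedback."""
    return feedback_count / inference_count


def alert_false_positive_rate(alerts):
    """Share of alerts that needed no action."""
    return sum(1 for a in alerts if not a["action_required"]) / len(alerts)


# Hypothetical incident and alert history
incidents = [
    (datetime(2026, 1, 1, 0, 0), datetime(2026, 1, 1, 6, 0)),   # detected after 6 h
    (datetime(2026, 1, 5, 12, 0), datetime(2026, 1, 6, 6, 0)),  # detected after 18 h
]
alerts = [{"action_required": True}] * 9 + [{"action_required": False}]

mttd = mean_time_to_detect(incidents)
print(mttd, feedback_rate(10_000, 620), alert_false_positive_rate(alerts))
```

Measuring when a degradation actually started usually requires a retrospective look at the quality scores, so MTTD is best reviewed after each incident rather than computed live.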
Sources and verification
Primary sources (Microsoft Learn):
- MLflow for GenAI Apps and Agents - Continuous Improvement Cycle (Verified MCP 2026-04: updated 10-step cycle; new: Trace UI for pattern identification, evaluation harness, version/prompt management tracking)
- Machine Learning Operations v2 - Monitoring & Feedback
- Generative AI App Developer Workflow - Production Monitoring
- Azure AI Foundry - Observability in Generative AI
- MLOps and GenAIOps for AI Workloads - Model Maintenance
- AI Builder - Continuously Improve Your Model (Feedback Loop)
Code samples:
- MLflow feedback logging: Azure Databricks - Agent Framework
- Model monitoring setup: Azure ML - Monitor Model Performance (Verified MCP 2026-04: supports data quality, data drift, prediction drift, feature attribution drift, and custom signals; integrates with Azure Event Grid for alerting)
- GenAI evaluation: MLflow 3.x - Evaluate App (Verified MCP 2026-04: tutorial covers RAG email app evaluation; new scorers: RetrievalGroundedness, Guidelines, RelevanceToQuery, Safety; version comparison with mlflow.genai.evaluate())
Date of last verification: 2026-04-10
MCP calls: 8 (microsoft_docs_search: 3, microsoft_docs_fetch: 3, microsoft_code_sample_search: 2)
For Cosmo
This document covers the full feedback loop cycle for both classical ML and GenAI. Key points to emphasize in a consultation:
- Not one-size-fits-all: Automated retraining does not suit every case (see the decision guidance)
- Start simple: Thumbs up/down + basic monitoring before building a complex MLOps pipeline
- GenAI ≠ Classical ML: GenAI requires LLM judges + human review, not just accuracy metrics
- Compliance: The AI Act requires continuous monitoring for high-risk systems (not optional)
- Cost: Threshold-based retraining can save 50-70% compute compared with daily retraining
Use the architecture patterns to visualize the solution for the customer. Point out that MLflow Tracing and Agent Evaluation are built into Databricks, so observability needs no separate tooling (evaluation compute and storage are still billed, see Cost and licensing).
MLflow 3 Evaluation & Feedback Loop (Verified MCP 2026-04)
MLflow 3 introduces a unified evaluation-monitoring lifecycle for GenAI feedback loops:
Iterative workflow:
- Trace production requests (MLflow Tracing: end-to-end observability)
- Evaluate against scorers during development (mlflow.genai.evaluate())
- Monitor production with same scorers (consistent quality measurement)
- Gather human feedback via Review App (expert annotations)
- Improve prompts/models based on evaluation datasets
Built-in LLM judges (scorers):
- RetrievalGroundedness: checks if the response is grounded in retrieved data
- RelevanceToQuery: checks if the response addresses the user request
- Safety: checks for harmful/inappropriate content
- Guidelines(name, guidelines): custom policy/tone/style checks
- Correctness: factual correctness with expected_facts
Azure ML Model Monitoring signals:
- Data quality: null values, out-of-range, type mismatch
- Data drift: statistical distribution changes between training and production data
- Prediction drift: distribution shift in model outputs
- Feature attribution drift: changes in feature importance
- Custom signals: user-defined metrics via custom scripts
- Integrates with Azure Event Grid for alerting on threshold breaches
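To make the data quality signal concrete, the sketch below computes the three rates it names (nulls, type mismatches, out-of-range values) over a list of rows. The schema format and column names are hypothetical; Azure ML computes its own versions of these metrics inside the monitoring job.

```python
def data_quality_report(rows, schema):
    """Per-column null, type-mismatch and out-of-range rates.

    schema maps column name -> (expected type or tuple of types, min, max);
    min/max may be None for columns without a numeric range.
    """
    total = len(rows)
    report = {}
    for column, (expected_type, low, high) in schema.items():
        nulls = mismatches = out_of_range = 0
        for row in rows:
            value = row.get(column)
            if value is None:
                nulls += 1
            elif not isinstance(value, expected_type):
                mismatches += 1
            elif low is not None and not (low <= value <= high):
                out_of_range += 1
        report[column] = {
            "null_rate": nulls / total,
            "type_mismatch_rate": mismatches / total,
            "out_of_range_rate": out_of_range / total,
        }
    return report


# Hypothetical production rows for a fraud model
rows = [
    {"amount": 120.0, "country": "NO"},
    {"amount": None, "country": "SE"},
    {"amount": -5.0, "country": "NO"},
    {"amount": "n/a", "country": None},
]
schema = {"amount": ((int, float), 0, 10_000), "country": (str, None, None)}
report = data_quality_report(rows, schema)
print(report)
```

Each value is counted in at most one category, so the three rates for a column never sum to more than 1.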
Evaluation dataset workflow (new 2026-04):
- Search production traces → select problematic + high-quality examples
- Save to versioned eval dataset in Unity Catalog (mlflow.genai.datasets.create_dataset())
- Run evaluation harness with mlflow.genai.evaluate(data=eval_dataset, predict_fn=..., scorers=...)
- Compare runs in UI (Evaluation runs view) or SDK (mlflow.search_runs)
- Identify regressions per-metric before promoting new versions
Continuous improvement cycle: Production traces → MLflow evaluation datasets → Scorer alignment → Prompt/model update → A/B test → Production rollout