# Feedback Loops and Continuous Improvement

**Category:** MLOps & GenAIOps
**Date:** 2026-02-04
**Last updated:** 2026-04
**Confidence:** HIGH (based on official Microsoft documentation)
**Verified:** MCP 2026-04

## Introduction

Feedback loops and continuous improvement are critical components of modern AI operations. Unlike traditional software, where functionality is deterministic, AI models can show quality drift or unexpected behavior when they encounter real-world data. A well-functioning feedback system keeps models accurate, relevant, and safe throughout their lifecycle.

**Key concept:** Feedback loops connect production data, user insight, and performance metrics back to the development process, creating a continuous cycle of measurement, learning, and improvement.

### Why this matters

- **Model decay:** AI models degrade over time because of changes in data, usage patterns, or business context
- **Quality assurance:** Automated and manual evaluation reveals gaps between expected and actual performance
- **User value:** Direct feedback from end users provides insight that technical metrics do not capture
- **Compliance:** Regulatory requirements (AI Act, GDPR) demand traceability and continuous monitoring

## Core components

### 1. Production Monitoring & Telemetry

**Azure services:**

- **Azure Monitor + Application Insights:** Collects telemetry from endpoints and tracks latency, error rates, and token consumption
- **Azure Machine Learning Model Monitoring:** Automatic detection of data drift, prediction drift, and model performance degradation
- **MLflow Tracing:** Detailed tracing of every inference interaction, including inputs, outputs, and intermediate steps

**Key metrics:**

| Dimension | Metrics | Confidence |
|-----------|---------|------------|
| **Operational** | Request volume, latency (p50/p95), error rates, token usage | HIGH |
| **Quality** | Groundedness, relevance, coherence, safety pass rate | HIGH (GenAI) |
| **User Feedback** | Thumbs up/down, ratings, explicit reports | MEDIUM |

**Code example: Logging user feedback (MLflow)**

```python
import time

import mlflow
from mlflow.entities import AssessmentSource

# `response` is the result returned by the deployed agent endpoint.
# Wait briefly so the trace is available before attaching feedback.
time.sleep(1)

# Extract span and trace IDs from the response
response_dict = response.as_dict()
first_prediction = response_dict["predictions"][0]
first_result = first_prediction["results"][0]
span_id = first_result["span_id"]
trace_id = first_prediction["trace_id"]

# Log user feedback
mlflow.log_feedback(
    trace_id=trace_id,
    span_id=span_id,
    name="user_feedback",
    value=True,  # True for positive, False for negative
    source=AssessmentSource(source_type="HUMAN"),
    rationale="Answer was accurate and well-reasoned",
)
```

### 2. Data Collection & Evaluation Datasets

**Process:**

1. **Production traces → Evaluation set:** Use inference table logs to identify problematic interactions
2. **Synthetic data generation:** Generate a starter dataset before production data is available
3. **Expert curation:** SMEs validate and annotate edge cases and gold-standard answers

**Azure services:**

- **MLflow Datasets:** Versioned storage of evaluation datasets in Unity Catalog
- **Azure AI Foundry Agent Evaluation:** Evaluation with LLM judges (correctness, relevance, groundedness, safety)
- **Databricks Review App:** Collects feedback from domain experts on production traces

**Best practices:**

- Include both expected and unexpected usage patterns in the evaluation set
- Test for edge cases (long/short inputs, misspellings, prompt injection)
- Combine `expected_facts` (flexible) with `guidelines` (tone, style, policy)

**Code example: Evaluation with MLflow**

```python
import mlflow
from mlflow.genai.scorers import Correctness, RelevanceToQuery

# Define evaluation dataset
eval_data = [
    {
        "inputs": {"question": "What is MLflow?"},
        "expectations": {
            "expected_facts": ["open-source platform", "ML lifecycle management"]
        },
    },
    {
        "inputs": {"question": "How do I track experiments?"},
        "expectations": {
            "expected_facts": ["mlflow.start_run()", "log metrics", "log parameters"]
        },
    },
]

# Run evaluation (`my_agent` is the application function under test)
results = mlflow.genai.evaluate(
    data=eval_data,
    predict_fn=my_agent,
    scorers=[Correctness(), RelevanceToQuery()],
)

print(f"Correctness score: {results.metrics['correctness/mean']:.2f}")
```

### 3. Automated Retraining & Model Promotion

**Strategies:**

| Strategy | When to use | Trade-offs |
|----------|-------------|------------|
| **Online training** | Daily/continuous updates with new data | High cost, requires robust automation |
| **Offline training** | Less frequent updates (weekly/monthly) | Lower cost, risk of model decay |
| **Threshold-based** | Retrain when performance falls below a threshold | Balances accuracy against energy consumption |

**Azure services:**

- **Azure Machine Learning Pipelines:** CI/CD for model training and deployment
- **Azure DevOps / GitHub Actions:** Automated triggers on model registration
- **Azure Arc:** Hybrid/multicloud deployment orchestration

**Triggers for retraining:**

- **Data drift:** Statistical properties of the input data have changed (detected via monitoring)
- **Prediction drift:** The output distribution deviates from the baseline
- **Performance degradation:** Metrics (accuracy, F1-score) fall below a threshold
- **Manual trigger:** Human-in-the-loop approval for critical models

**Code example: Model monitoring setup**

```python
from azure.ai.ml import MLClient
from azure.ai.ml.entities import (
    MonitorSchedule,
    RecurrenceTrigger,
    RecurrencePattern,
    MonitorDefinition,
    ServerlessSparkCompute,
    MonitoringTarget,
    AlertNotification,
    DataDriftSignal,
    DataDriftMetricThreshold,
    NumericalDriftMetrics,
)

# Set up monitoring for data drift
ml_client = MLClient(...)

spark_compute = ServerlessSparkCompute(
    instance_type="standard_e4s_v3",
    runtime_version="3.3",
)

monitoring_target = MonitoringTarget(
    ml_task="classification",
    endpoint_deployment_id="azureml:fraud-detection-endpoint:main",
)

# Define drift thresholds
metric_thresholds = DataDriftMetricThreshold(
    numerical=NumericalDriftMetrics(
        jensen_shannon_distance=0.01  # Signal drift when the Jensen-Shannon distance exceeds 0.01
    )
)

# `training_data` is the reference (training) dataset definition, created elsewhere
data_drift_signal = DataDriftSignal(
    reference_data=training_data,
    metric_thresholds=metric_thresholds,
    alert_enabled=True,
)

# Create monitoring schedule
monitor_definition = MonitorDefinition(
    compute=spark_compute,
    monitoring_target=monitoring_target,
    monitoring_signals={"data_drift": data_drift_signal},
    alert_notification=AlertNotification(emails=["ml-team@example.com"]),
)

recurrence_trigger = RecurrenceTrigger(
    frequency="day",
    interval=1,
    schedule=RecurrencePattern(hours=3, minutes=0),
)

model_monitor = MonitorSchedule(
    name="fraud_detection_monitor",
    trigger=recurrence_trigger,
    create_monitor=monitor_definition,
)

ml_client.schedules.begin_create_or_update(model_monitor)
```

### 4. Human-in-the-Loop (HITL) Workflows

**Components:**

- **Review App (Databricks):** Thumbs up/down and textual feedback on agent responses
- **Expert labeling:** SMEs annotate traces with expected outputs and policy violations
- **Approval gates:** Human approval before deployment to production (critical models)

**Azure services:**

- **Azure Logic Apps / Power Automate:** Workflow automation for HITL review
- **AI Builder Feedback Loop:** Automatic routing of low-confidence predictions to human review

**Best practices:**

- Balance automation against HITL: review only low-confidence outputs (score < 70%)
- Avoid reviewer fatigue: sample strategically instead of reviewing every interaction
- Incorporate feedback quickly: weekly review cycles, not monthly

### 5. Continuous Improvement Cycle (MLflow for GenAI)

**10-step cycle:**

1. **🚀 Production App:** The deployed agent generates traces with inputs/outputs
2. **👍 👎 User Feedback:** Thumbs up/down on each interaction
3. **🔍 Monitor & Score:** LLM judges (correctness, safety, relevance) score automatically
4. **⚠️ Identify Issues:** The Trace UI shows patterns in low-scoring traces
5. **👥 Domain Expert Review:** A sample is sent to SMEs via the Review App
6. **📋 Build Eval Dataset:** Curate problematic and high-quality traces into an evaluation set
7. **🎯 Tune Scorers:** Use expert feedback to align LLM judges with human judgment
8. **🧪 Evaluate New Versions:** Test improvements against the evaluation set with the same scorers
9. **📈 Compare Results:** MLflow evaluation runs compare versions
10. **✅ Deploy or Iterate:** Deploy if quality improves without regression

**Code example: Version comparison**

```python
import mlflow

# Evaluate v1
with mlflow.start_run(run_name="v1"):
    eval_results_v1 = mlflow.genai.evaluate(
        data=eval_dataset,
        predict_fn=generate_sales_email_v1,
        scorers=email_judges,
    )

# Evaluate v2
with mlflow.start_run(run_name="v2"):
    eval_results_v2 = mlflow.genai.evaluate(
        data=eval_dataset,
        predict_fn=generate_sales_email_v2,
        scorers=email_judges,  # Same judges for fairness
    )

# Compare results
run_v1_df = mlflow.search_runs(filter_string=f"run_id = '{eval_results_v1.run_id}'")
run_v2_df = mlflow.search_runs(filter_string=f"run_id = '{eval_results_v2.run_id}'")

metric_cols = [
    col for col in run_v1_df.columns
    if col.startswith("metrics.") and col.endswith("/mean")
]

for metric in metric_cols:
    v1_score = run_v1_df[metric].iloc[0]
    v2_score = run_v2_df[metric].iloc[0]
    improvement = v2_score - v1_score
    print(f"{metric}: {v1_score:.3f} → {v2_score:.3f} ({improvement:+.3f})")
```

## Architecture patterns

### Pattern 1: Automated MLOps Loop (Classical ML)

```
┌─────────────────────────────────────────────────────────┐
│  Production Deployment (Managed Online Endpoint)        │
│  ├─ Data Collection (inference tables)                  │
│  └─ Monitoring (Azure Monitor, drift detection)         │
└─────────────────────┬───────────────────────────────────┘
                      │ Drift detected / Threshold reached
                      ▼
┌─────────────────────────────────────────────────────────┐
│  CI/CD Pipeline (Azure Pipelines / GitHub Actions)      │
│  ├─ Pull production data                                │
│  ├─ Retrain model (Azure ML Compute)                    │
│  ├─ Evaluate (test set + validation metrics)            │
│  └─ Promote to staging (if quality gates pass)          │
└─────────────────────┬───────────────────────────────────┘
                      │ Human approval (HITL)
                      ▼
┌─────────────────────────────────────────────────────────┐
│  Staging Environment                                    │
│  ├─ A/B testing (champion vs challenger)                │
│  ├─ Responsible AI checks (bias, fairness)              │
│  └─ Final validation                                    │
└─────────────────────┬───────────────────────────────────┘
                      │ Deploy to prod
                      ▼
                 [Production]
```

**When to use:**

- Tabular ML (classification, regression, forecasting)
- Automated retraining is justified (cost-effective)
- The model has clear performance metrics (accuracy, RMSE, F1)

### Pattern 2: GenAI Feedback Loop (LLM Applications)

```
┌─────────────────────────────────────────────────────────┐
│  Production Agent (Model Serving Endpoint)              │
│  ├─ MLflow Tracing (span-level telemetry)               │
│  ├─ User feedback (thumbs up/down)                      │
│  └─ Inference tables (Unity Catalog)                    │
└─────────────────────┬───────────────────────────────────┘
                      │ Daily batch evaluation
                      ▼
┌─────────────────────────────────────────────────────────┐
│  Production Monitoring (Agent Evaluation)               │
│  ├─ LLM Judges (correctness, safety, relevance)         │
│  ├─ Sampling rate: 10-100% of traffic                   │
│  └─ Alerts on quality degradation                       │
└─────────────────────┬───────────────────────────────────┘
                      │ Export low-scoring traces
                      ▼
┌─────────────────────────────────────────────────────────┐
│  Evaluation Dataset Curation                            │
│  ├─ Filter by user feedback + LLM judge scores          │
│  ├─ SME review (Review App)                             │
│  └─ Add to versioned eval dataset (MLflow Datasets)     │
└─────────────────────┬───────────────────────────────────┘
                      │ Trigger improvement cycle
                      ▼
┌─────────────────────────────────────────────────────────┐
│  Agent Development (Inner Loop)                         │
│  ├─ Refine prompts / retrieval logic / tools            │
│  ├─ Run offline evaluation (eval dataset + scorers)     │
│  └─ Compare to baseline (MLflow tracking)               │
└─────────────────────┬───────────────────────────────────┘
                      │ Quality improved?
                      ▼
          [Yes: Deploy]  [No: Iterate]
```

**When to use:**

- Agentic RAG, chatbots, content generation
- Quality is subjective (tone, style, policy compliance)
- Frequent prompt/logic changes, not just model retraining

### Pattern 3: Hybrid (CV/NLP with Human Annotation)

```
┌─────────────────────────────────────────────────────────┐
│  Production Model (Batch/Online Endpoint)               │
│  └─ Model performance monitoring (accuracy on new data) │
└─────────────────────┬───────────────────────────────────┘
                      │ Performance drops
                      ▼
┌─────────────────────────────────────────────────────────┐
│  Human-in-the-Loop Annotation                           │
│  ├─ Sample low-confidence predictions                   │
│  ├─ Annotators label new data (Azure ML Labeling)       │
│  └─ Quality review by SMEs                              │
└─────────────────────┬───────────────────────────────────┘
                      │ New labeled data
                      ▼
┌─────────────────────────────────────────────────────────┐
│  Model Development (Inner Loop)                         │
│  ├─ Update training set with new annotations            │
│  ├─ Retrain model (not automated)                       │
│  └─ Evaluate on test set + new edge cases               │
└─────────────────────┬───────────────────────────────────┘
                      │ Quality gates pass?
                      ▼
            [Staging → Production]
```

**When to use:**

- Computer vision (image classification, object detection)
- NLP tasks (text classification, NER)
- Automated retraining is not desirable (resource-intensive, requires human review)

## Decision guidance

### When to implement automated vs. manual retraining?

| Factor | Automated Retraining | Manual Retraining |
|--------|----------------------|-------------------|
| **Data volume** | High (new data daily) | Low (weekly/monthly) |
| **Model stability** | High (proven architecture) | Low (experimental) |
| **Cost tolerance** | High (compute budget available) | Low (cost-sensitive) |
| **Regulatory** | Low risk (non-critical) | High risk (health, finance) |
| **Expertise** | Available (MLOps team) | Limited (manual review required) |

**Rule of thumb:**

- **Classical ML (tabular):** Automate if data volume exceeds 1000 new rows/day
- **GenAI (LLM):** Manual iteration (prompt refinement) more often than retraining
- **CV/NLP:** Hybrid (automated monitoring → manual annotation → triggered retraining)

### When to use LLM judges vs. human evaluation?

| Scenario | LLM Judges | Human Evaluation |
|----------|------------|------------------|
| **Factual correctness** | ✅ (with expected_facts) | ✅ (gold standard) |
| **Safety (toxicity, bias)** | ✅ (high recall) | ✅ (final validation) |
| **Style/tone compliance** | ✅ (guidelines judge) | ✅ (subjective quality) |
| **Edge cases** | ⚠️ (may miss nuance) | ✅ (domain expertise) |
| **Volume** | ✅ (scale to 100% traffic) | ❌ (sample 1-10%) |
| **Cost** | Medium (LLM inference) | High (SME time) |

**Best practice:**

1. Start with LLM judges for bulk evaluation (development + production monitoring)
2. Sample 10-20% of low-scoring traces for human review
3. Use human feedback to tune the LLM judges (few-shot examples)

## Integration with the Microsoft stack

### Azure Machine Learning (Classical ML)

**Feedback loop components:**

| Component | Azure service | Purpose |
|-----------|---------------|---------|
| **Data collection** | Inference tables (managed endpoints) | Capture production inputs/outputs |
| **Monitoring** | Model Monitor (Azure ML) | Data drift, prediction drift, performance |
| **Alerting** | Azure Monitor Alerts | Email/webhook on threshold breach |
| **Retraining** | Azure ML Pipelines | Triggered retraining workflow |
| **A/B testing** | Staging endpoints | Champion vs challenger validation |
| **Deployment** | Managed Online Endpoints | Blue-green deployment |

**Code example: Alert notification on data drift**

```python
from azure.ai.ml.entities import AlertNotification, MonitorDefinition

alert_notification = AlertNotification(
    emails=["ml-team@example.com", "data-science-lead@example.com"]
)

# spark_compute, monitoring_target, and data_drift_signal are defined
# as in the model monitoring setup example
monitor_definition = MonitorDefinition(
    compute=spark_compute,
    monitoring_target=monitoring_target,
    monitoring_signals={"data_drift": data_drift_signal},
    alert_notification=alert_notification,  # Sends email when drift is detected
)
```

### Azure AI Foundry (GenAI)

**Feedback loop components:**

| Component | Azure service | Purpose |
|-----------|---------------|---------|
| **Production tracing** | MLflow Tracing (Databricks) | Span-level telemetry |
| **User feedback** | Review App | Thumbs up/down, textual feedback |
| **LLM judges** | Agent Evaluation | Automated quality scoring |
| **Monitoring dashboard** | Azure AI Foundry Observability | Quality trends, latency, errors |
| **Eval datasets** | MLflow Datasets (Unity Catalog) | Versioned test sets |
| **Red teaming** | AI Red Teaming Agent | Adversarial testing for safety |

**Code example: Production monitoring setup (GenAI)**

```python
from azure.ai.ml import Input, MLClient
from azure.ai.ml.constants import MonitorTargetTasks
from azure.ai.ml.entities import (
    AlertNotification,
    MonitorSchedule,
    CronTrigger,
    MonitorDefinition,
    ServerlessSparkCompute,
    MonitoringTarget,
    GenerationSafetyQualitySignal,
    GenerationSafetyQualityMonitoringMetricThreshold,
    LlmData,
    BaselineDataRange,
)

ml_client = MLClient(...)
# sub_id, rg, workspace, aoai_connection, endpoint_name, and deployment_name
# are placeholders for your own resource names

# Define quality thresholds (70% passing rate)
quality_thresholds = GenerationSafetyQualityMonitoringMetricThreshold(
    groundedness={"aggregated_groundedness_pass_rate": 0.7},
    relevance={"aggregated_relevance_pass_rate": 0.7},
    coherence={"aggregated_coherence_pass_rate": 0.7},
    fluency={"aggregated_fluency_pass_rate": 0.7},
)

# Reference production data (app traces)
data_window = BaselineDataRange(lookback_window_size="P7D", lookback_window_offset="P0D")
production_data = LlmData(
    data_column_names={
        "prompt_column": "question",
        "completion_column": "answer",
        "context_column": "context",
    },
    input_data=Input(type="uri_folder", path="endpoint-deployment-app_traces:1"),
    data_window=data_window,
)

# Create quality signal
gsq_signal = GenerationSafetyQualitySignal(
    connection_id=f"/subscriptions/{sub_id}/resourceGroups/{rg}/providers/Microsoft.MachineLearningServices/workspaces/{workspace}/connections/{aoai_connection}",
    metric_thresholds=quality_thresholds,
    production_data=[production_data],
    sampling_rate=1.0,  # Evaluate 100% of traffic
)

# Schedule daily evaluation
monitor_definition = MonitorDefinition(
    compute=ServerlessSparkCompute(instance_type="standard_e4s_v3", runtime_version="3.3"),
    monitoring_target=MonitoringTarget(
        ml_task=MonitorTargetTasks.QUESTION_ANSWERING,
        endpoint_deployment_id=f"azureml:{endpoint_name}:{deployment_name}",
    ),
    monitoring_signals={"quality_signal": gsq_signal},
    alert_notification=AlertNotification(emails=["genai-team@example.com"]),
)

trigger = CronTrigger(expression="15 10 * * *")  # Daily at 10:15 AM

model_monitor = MonitorSchedule(
    name="chatbot_quality_monitor",
    trigger=trigger,
    create_monitor=monitor_definition,
)

ml_client.schedules.begin_create_or_update(model_monitor)
```

### Power Platform AI (Citizen Developer Scenario)

**Feedback loop components:**

| Component | Power Platform service | Purpose |
|-----------|------------------------|---------|
| **Automated feedback collection** | Power Automate | Route low-confidence predictions to human review |
| **Storage** | Dataverse / SharePoint | Store feedback data |
| **Model improvement** | AI Builder Feedback Loop | Automatically add reviewed samples to training set |
| **Retraining** | AI Builder | Manual/scheduled retraining |

**Example workflow (Power Automate):**

1. **Trigger:** AI Builder prediction (e.g., document processing)
2. **Condition:** If confidence score < 0.7
3. **Action:** Save file + prediction output to AI Builder feedback loop storage
4. **Notification:** Send email to the reviewer

**Result:** Reviewed documents are automatically available in the "Feedback loop" data source when the model is retrained.

## Public sector (Norway)

### Regulatory requirements

**EU AI Act + Norwegian implementation:**

- **High-risk AI:** Continuous monitoring and logging are mandatory (Article 12 on record-keeping and Article 72 on post-market monitoring in Regulation (EU) 2024/1689; post-market monitoring was Article 61 in the 2021 draft)
- **Traceability:** Automatic logs of inputs, outputs, and decisions
- **Human oversight:** HITL review for critical decisions (Article 14)
- **Retesting:** Periodic evaluation against the original test set + new edge cases

**Implementation in the Microsoft stack:**

```python
# Compliant logging example (GDPR + AI Act)
import hashlib

import mlflow

# user_query, model_output, and trace_id come from the serving code
confidence_score = 0.85  # example value; taken from the model output in practice

# Log a stable SHA-256 digest instead of the raw query. Python's built-in hash()
# is salted per process, so its values cannot be matched across runs.
# (AI Act Article 12: record-keeping)
input_hash = hashlib.sha256(user_query.encode("utf-8")).hexdigest()
mlflow.log_param("input_hash", input_hash)
mlflow.log_metric("confidence_score", confidence_score)
# mlflow.log_text takes the text first, then the artifact file name
mlflow.log_text("Retrieved relevant documents from internal KB", "rationale.txt")

# Human review trigger (AI Act Article 14: human oversight)
if confidence_score < 0.7:
    send_to_human_review(trace_id, user_query, model_output)  # application-specific helper
```

### Sustainability (green AI)

**Retraining frequency vs. CO₂ footprint:**

| Strategy | CO₂ impact | When to use |
|----------|------------|-------------|
| **Daily retraining** | HIGH | Financial markets, real-time fraud detection |
| **Weekly retraining** | MEDIUM | Customer support chatbots |
| **Threshold-based** | LOW | Retrain only when accuracy < 90% |
| **Manual trigger** | VERY LOW | Static domain (image classification) |

**Azure support:**

- **Carbon-aware deployment:** Deploy to low-carbon regions (Sweden Central, Norway East)
- **Model decay detection:** Avoid unnecessary retraining via threshold-based triggers
- **Efficient inference:** Azure ML Managed Online Endpoints with auto-scaling

### Data handling (privacy)

**GDPR compliance in feedback loops:**

- **Right to explanation (Article 22):** Trace logging must include model reasoning
- **Right to be forgotten (Article 17):** It must be possible to delete user feedback data
- **Data minimization (Article 5):** Log only the necessary fields (not the full user profile)

**Implementation:**

```python
# Pseudonymization (GDPR-compliant)
import hashlib

import mlflow

# user_id comes from the application; only its digest is logged.
# For low-entropy IDs, prefer a keyed hash (hmac with a secret key)
# so the digest cannot be reversed by brute force.
user_id_hash = hashlib.sha256(user_id.encode()).hexdigest()
mlflow.log_param("user_id_hash", user_id_hash)  # Logged
# The original user_id is NOT stored in MLflow
```

## Cost and licensing

### Compute costs (retraining)

**Azure Machine Learning:**

| Scenario | Compute Type | Estimated cost (NOK/month) | Confidence |
|----------|--------------|----------------------------|------------|
| **Daily retraining (tabular ML)** | Standard_DS3_v2 (4 vCPU) | ~15 000 - 25 000 | HIGH |
| **Weekly retraining (CV)** | GPU (NC6s_v3) | ~8 000 - 12 000 | HIGH |
| **Threshold-based (GenAI)** | Minimal (only when triggered) | ~2 000 - 5 000 | MEDIUM |

**Databricks (GenAI Evaluation):**

| Scenario | Compute Type | Estimated cost (NOK/month) | Confidence |
|----------|--------------|----------------------------|------------|
| **Daily LLM judge evaluation (10k traces)** | Serverless Spark (standard_e4s_v3) | ~10 000 - 15 000 | MEDIUM |
| **Human review (Review App)** | Minimal (UI hosting) | ~500 - 1 000 | HIGH |

### Storage costs

**Inference tables + eval datasets:**

- **Azure Storage (Delta Lake):** ~0.50 NOK/GB/month
- **MLflow Tracking:** ~1-2 NOK per experiment run (metadata)

**Estimate:** 10 000 daily inferences → ~5 GB/month → ~2.50 NOK/month storage

### Licenses

**Microsoft Fabric + Azure ML:**

- **Azure ML Enterprise:** Included in the subscription, per-use compute pricing
- **Databricks (Unity Catalog):** Premium tier (~$2-3 per DBU)

**Power Platform:**

| License | AI Builder Credits/month | Feedback Loop Support |
|---------|--------------------------|-----------------------|
| **Per User** | 500 | ✅ |
| **Per App** | Not included | ❌ (requires Per User) |
| **AI Builder add-on** | Custom (purchased separately) | ✅ |

## For the architect (Cosmo)

### When to recommend automated feedback loops?

**✅ Yes, recommend:**

- Production model with > 1000 daily inferences
- Clear performance metrics (accuracy, F1, RMSE)
- Regulatory compliance requirements (AI Act, ISO 27001)
- Business-critical application (customer-facing, revenue impact)

**⚠️ Consider carefully:**

- Proof of concept or pilot (manual evaluation is sufficient)
- Low inference volume (< 100/day)
- Static domain (data rarely changes)
- Limited MLOps resources (prioritize automation later)

### Recommended questions for the customer

1. **Volume:** How many inferences per day are expected in production?
2. **Criticality:** What is the consequence of wrong predictions? (customer impact, revenue loss)
3. **Data dynamics:** How often does the input data change? (daily, weekly, seasonal)
4. **Expertise:** Does the team have MLOps competence, or is this their first AI project?
5. **Budget:** What is an acceptable monthly cost for monitoring + retraining?
6. **Regulatory:** Does the AI Act / GDPR high-risk classification apply?
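
The recommendation criteria in this section can be sketched as a small triage helper. This is an illustrative sketch only: the `Workload` fields, the `cautions` and `triage` functions, and the rule for combining signals (any caution wins; otherwise at least two positive signals are needed) are assumptions made for the example and not part of any Microsoft tooling. The volume thresholds (more than 1000 and fewer than 100 inferences per day) are the ones listed above.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Workload:
    """Hypothetical summary of a customer's answers to the questions above."""
    daily_inferences: int
    has_clear_metrics: bool
    regulated: bool
    business_critical: bool
    is_pilot: bool = False
    static_domain: bool = False
    has_mlops_team: bool = True


def cautions(w: Workload) -> List[str]:
    """Collect the 'consider carefully' signals that apply to this workload."""
    found = []
    if w.is_pilot:
        found.append("pilot or proof of concept")
    if w.daily_inferences < 100:
        found.append("low inference volume")
    if w.static_domain:
        found.append("static domain")
    if not w.has_mlops_team:
        found.append("limited MLOps resources")
    return found


def triage(w: Workload) -> str:
    """Return 'recommend' or 'consider carefully' (assumed combination rule)."""
    if cautions(w):
        return "consider carefully"
    positives = [
        w.daily_inferences > 1000,
        w.has_clear_metrics,
        w.regulated,
        w.business_critical,
    ]
    # Assumption: at least two positive signals make a clear recommendation
    return "recommend" if sum(positives) >= 2 else "consider carefully"


fraud_model = Workload(daily_inferences=50_000, has_clear_metrics=True,
                       regulated=True, business_critical=True)
pilot_bot = Workload(daily_inferences=40, has_clear_metrics=False,
                     regulated=False, business_critical=False, is_pilot=True)
print(triage(fraud_model), "|", triage(pilot_bot))
```

A helper like this is only a conversation aid: the list returned by `cautions` gives the architect concrete points to raise with the customer, while the final call stays a human judgment.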

### Red flags (anti-patterns)

❌ **"We retrain every night without checking whether it is necessary"**
→ Suggestion: Threshold-based retraining (saves compute + CO₂)

❌ **"We have no monitoring, but we deploy new models every week"**
→ Suggestion: Implement baseline monitoring before increasing deployment frequency

❌ **"Users complain about poor quality, but we have no feedback mechanism"**
→ Suggestion: Start with a simple thumbs up/down in the UI and log it to Application Insights

❌ **"We evaluate only on the original test set, never on production data"**
→ Suggestion: Export a sample of the inference tables to the eval dataset (to catch drift)

### Success metrics for feedback loops

| Metric | Target | Measured as |
|--------|--------|-------------|
| **Mean time to detect (MTTD)** | < 24 hours | Time from quality degradation to alert |
| **Retraining cycle time** | < 7 days | Time from drift detection to new model in production |
| **User feedback rate** | > 5% | % of inferences where the user gives feedback |
| **False positive rate (monitoring)** | < 10% | % of alerts that require no action |
| **Quality improvement per iteration** | > 5% | Accuracy/F1 gain per retraining cycle |

## Sources and verification

**Primary sources (Microsoft Learn):**

1. [MLflow for GenAI Apps and Agents - Continuous Improvement Cycle](https://learn.microsoft.com/en-us/azure/databricks/mlflow3/genai/overview/) (Verified MCP 2026-04: updated 10-step cycle; new: Trace UI for pattern identification, evaluation harness, version/prompt management tracking)
2. [Machine Learning Operations v2 - Monitoring & Feedback](https://learn.microsoft.com/en-us/azure/architecture/ai-ml/guide/machine-learning-operations-v2)
3. [Generative AI App Developer Workflow - Production Monitoring](https://learn.microsoft.com/en-us/azure/databricks/generative-ai/tutorials/ai-cookbook/genai-developer-workflow)
4. [Azure AI Foundry - Observability in Generative AI](https://learn.microsoft.com/en-us/azure/ai-foundry/concepts/observability)
5. [MLOps and GenAIOps for AI Workloads - Model Maintenance](https://learn.microsoft.com/en-us/azure/well-architected/ai/mlops-genaiops#model-maintenance)
6. [AI Builder - Continuously Improve Your Model (Feedback Loop)](https://learn.microsoft.com/en-us/ai-builder/feedback-loop)

**Code samples:**

- MLflow feedback logging: [Azure Databricks - Agent Framework](https://learn.microsoft.com/en-us/azure/databricks/generative-ai/agent-framework/non-conversational-agents#log-user-feedback)
- Model monitoring setup: [Azure ML - Monitor Model Performance](https://learn.microsoft.com/en-us/azure/machine-learning/how-to-monitor-model-performance?view=azureml-api-2) (Verified MCP 2026-04: supports data quality, data drift, prediction drift, feature attribution drift, and custom signals; integrates with Azure Event Grid for alerting)
- GenAI evaluation: [MLflow 3.x - Evaluate App](https://learn.microsoft.com/en-us/azure/databricks/mlflow3/genai/eval-monitor/evaluate-app) (Verified MCP 2026-04: tutorial covers RAG email app evaluation; new scorers: RetrievalGroundedness, Guidelines, RelevanceToQuery, Safety; version comparison with mlflow.genai.evaluate())

**Date of last verification:** 2026-04-10
**MCP calls:** 6 (microsoft_docs_search: 3, microsoft_docs_fetch: 3, microsoft_code_sample_search: 2)

---

## For Cosmo

This document covers the full feedback loop cycle for both classical ML and GenAI. Key points to emphasize in consultations:

1. **Not one-size-fits-all:** Automated retraining does not suit everyone (see the decision guidance)
2. **Start simple:** Thumbs up/down + basic monitoring before building a complex MLOps pipeline
3. **GenAI ≠ Classical ML:** GenAI requires LLM judges + human review, not just accuracy metrics
4. **Compliance:** The AI Act requires continuous monitoring for high-risk systems (not optional)
5. **Cost:** Threshold-based retraining can save 50-70% of compute compared with daily retraining

Use the architecture patterns to visualize the solution for the customer.
Point out that MLflow Tracing + Agent Evaluation provides "free" observability (built into Databricks).

### MLflow 3 Evaluation & Feedback Loop (Verified MCP 2026-04)

MLflow 3 introduces a unified evaluation-monitoring lifecycle for GenAI feedback loops:

**Iterative workflow**:

1. **Trace** production requests (MLflow Tracing: end-to-end observability)
2. **Evaluate** against scorers during development (`mlflow.genai.evaluate()`)
3. **Monitor** production with the same scorers (consistent quality measurement)
4. **Gather human feedback** via the Review App (expert annotations)
5. **Improve** prompts/models based on evaluation datasets

**Built-in LLM judges (scorers)**:

- `RetrievalGroundedness`: checks whether the response is grounded in retrieved data
- `RelevanceToQuery`: checks whether the response addresses the user request
- `Safety`: checks for harmful/inappropriate content
- `Guidelines(name, guidelines)`: custom policy/tone/style checks
- `Correctness`: factual correctness with expected_facts

**Azure ML Model Monitoring signals**:

- Data quality: null values, out-of-range values, type mismatches
- Data drift: statistical distribution changes between training and production data
- Prediction drift: distribution shift in model outputs
- Feature attribution drift: changes in feature importance
- Custom signals: user-defined metrics via custom scripts
- Integrates with **Azure Event Grid** for alerting on threshold breaches

**Evaluation dataset workflow (new 2026-04)**:

1. Search production traces → select problematic + high-quality examples
2. Save to a versioned eval dataset in Unity Catalog (`mlflow.genai.datasets.create_dataset()`)
3. Run the evaluation harness with `mlflow.genai.evaluate(data=eval_dataset, predict_fn=..., scorers=...)`
4. Compare runs in the UI (`Evaluation runs` view) or the SDK (`mlflow.search_runs`)
5. Identify per-metric regressions before promoting new versions

**Continuous improvement cycle**: Production traces → MLflow evaluation datasets → Scorer alignment → Prompt/model update → A/B test → Production rollout
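
Step 5 of the evaluation dataset workflow (identifying per-metric regressions before promotion) can be sketched as a small gate in plain Python. This is a minimal sketch under stated assumptions: `find_regressions`, `should_promote`, the `tolerance` parameter, and the sample scores are hypothetical and not part of MLflow. It also assumes every metric is "higher is better" and that the input dictionaries have the same shape as `results.metrics` (for example a `correctness/mean` key).

```python
from typing import Dict, Optional, Tuple


def find_regressions(
    baseline: Dict[str, float],
    candidate: Dict[str, float],
    tolerance: float = 0.01,
) -> Dict[str, Tuple[float, Optional[float]]]:
    """Return metrics where the candidate is worse than the baseline.

    A metric counts as regressed when it is missing from the candidate or
    when it drops by more than `tolerance` (assumes higher is better).
    """
    regressions = {}
    for name, base_value in baseline.items():
        new_value = candidate.get(name)
        if new_value is None or new_value < base_value - tolerance:
            regressions[name] = (base_value, new_value)
    return regressions


def should_promote(
    baseline: Dict[str, float],
    candidate: Dict[str, float],
    tolerance: float = 0.01,
) -> bool:
    """Promote only when no metric regressed beyond the tolerance."""
    return not find_regressions(baseline, candidate, tolerance)


# Illustrative scores only; real values would come from two evaluation runs
v1 = {"correctness/mean": 0.82, "relevance_to_query/mean": 0.90, "safety/mean": 1.00}
v2 = {"correctness/mean": 0.88, "relevance_to_query/mean": 0.85, "safety/mean": 1.00}

regressions = find_regressions(v1, v2)
for metric, (old, new) in regressions.items():
    print(f"Regression in {metric}: {old} -> {new}")
print("Promote v2:", should_promote(v1, v2))
```

Checking each metric separately, instead of comparing one averaged score, keeps an improvement in one dimension from hiding a drop in another, which is the point of reviewing regressions per metric.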