# Inferencing Optimization and Caching **Kategori:** MLOps & GenAIOps **Dato:** 2026-04 **Forfattet av:** Cosmo Skyberg, Senior Microsoft AI Solution Architect **Verified:** MCP 2026-04 ## Introduksjon Inferencing optimization og caching representerer kritiske teknikker for å maksimere ytelse og minimere kostnader når AI-modeller skal serve prediksjoner i produksjon. Mens model training handler om å oppnå høy accuracy, handler inferencing om å levere disse prediksjonene raskt, pålitelig og kostnadseffektivt til brukere og systemer. **Hva er inferencing?** Inferencing (eller model scoring) er prosessen med å bruke en trent modell til å generere prediksjoner på produksjonsdata. Dette skjer kontinuerlig etter at modellen er deployet, og kan involvere alt fra enkeltforespørsler (online inference) til batch-prosessering av store datasett. **Hvorfor er optimalisering kritisk?** Selv veltrente modeller kan feile i produksjon hvis de ikke er optimalisert for inferencing. Dårlig inferencing-ytelse manifesterer seg som høy latency, lav throughput, høye infrastrukturkostnader og dårlig brukeropplevelse. I Microsoft-økosystemet er dette spesielt relevant for Azure Machine Learning, Azure AI Foundry, og embedded scenarios som Azure SQL Edge og Windows ML. **Tre pilarer for inferencing optimization:** 1. **Model optimization** — konvertering til effektive formater (ONNX), quantization, pruning 2. **Compute optimization** — riktig hardware-akselerasjon (CPU vs GPU vs NPU), autoscaling, resource tuning 3. **Caching strategies** — multi-layer caching for å unngå redundant compute Denne referansen dekker alle tre områdene med fokus på Microsoft-verktøy og best practices for offentlig sektor. --- ## Kjernekomponenter ### 1. ONNX Runtime — High-Performance Inference Engine **ONNX (Open Neural Network Exchange)** er en åpen standard for å representere machine learning-modeller på tvers av frameworks. ONNX Runtime er Microsofts høyytelsesmotor for å kjøre disse modellene i produksjon. **Nøkkelfunksjoner:** - **Cross-platform:** Linux, Windows, macOS, cloud og edge - **Cross-framework:** Støtter modeller fra TensorFlow, PyTorch, scikit-learn, Keras, MXNet, MATLAB - **Hardware acceleration:** Integrerer med TensorRT (NVIDIA GPUs), OpenVINO (Intel), DirectML (Windows) - **Production-proven:** Brukes av Bing, Office, Azure AI — Microsoft-tjenester rapporterer gjennomsnittlig 2x ytelsesgevinst på CPU **Når bruke ONNX Runtime:** - Du trenger å deploy samme modell på flere plattformer (cloud + edge) - Du vil unngå vendor lock-in til et spesifikt framework - Du trenger maksimal inferencing-ytelse på CPU eller spesialisert hardware - Du skal deploy modeller i Windows ML, Azure SQL Edge, eller ML.NET **Python-eksempel — ONNX Runtime inference:** ```python import onnxruntime # Opprett inference session session = onnxruntime.InferenceSession("model.onnx") # Hent input/output metadata first_input_name = session.get_inputs()[0].name first_output_name = session.get_outputs()[0].name # Kjør inferencing results = session.run( ["output1", "output2"], {"input1": input_data} ) ``` **Installation:** ```bash pip install onnxruntime # CPU build pip install onnxruntime-gpu # GPU build ``` **[Confidence: HIGH]** — ONNX Runtime er mature, veldokumentert, og aktivt utviklet av Microsoft. --- ### 2. Model Optimization Techniques #### A. Model Conversion to ONNX Konvertering fra native framework til ONNX lar deg dra nytte av ONNX Runtime's optimaliseringer. **Konvertering fra PyTorch:** ```python import torch.onnx # Sett modell i inference mode model.eval() # Dummy input for shape tracing dummy_input = torch.randn(1, 3, 224, 224, requires_grad=True) # Eksporter til ONNX torch.onnx.export( model, dummy_input, "model.onnx", export_params=True, opset_version=11, do_constant_folding=True, # Optimization input_names=['input'], output_names=['output'], dynamic_axes={'input': {0: 'batch_size'}, 'output': {0: 'batch_size'}} ) ``` **Frameworks med ONNX-støtte:** - TensorFlow, PyTorch, scikit-learn, Keras, Chainer, MXNet, MATLAB - AutoML-modeller fra Azure Machine Learning (image classification, object detection) #### B. Batch Inference Optimization For AutoML-modeller (spesielt vision) kan du generere batch-optimaliserte ONNX-modeller: ```python # Object detection batch model parameters inputs = { 'model_name': 'fasterrcnn_resnet34_fpn', 'batch_size': 8, 'height_onnx': 600, 'width_onnx': 800, 'job_name': job_name, 'task_type': 'image-object-detection', 'min_size': 600, 'max_size': 1333, 'box_score_thresh': 0.3, 'box_nms_thresh': 0.5, 'box_detections_per_img': 100 } ``` **[Confidence: HIGH]** — Batch inference støttes godt i Azure ML for both training og deployment. --- ### 3. Multi-Layer Caching Strategies Caching er en av de mest effektive måtene å redusere inferencing-kostnader og latency, spesielt for generative AI-workloads. #### A. Prompt Caching (Azure OpenAI / AI Foundry) **Hva er prompt caching?** I stedet for å reprosessere samme input-tokens om og om igjen, beholder tjenesten en midlertidig cache av prosesserte token-computations. **Krav for å utnytte prompt caching:** - Minimum 1 024 tokens i lengde - De første 1 024 tokens må være identiske - Cache hits rapporteres som `cached_tokens` i response **Støttede modeller:** - Alle Azure OpenAI-modeller GPT-4o eller nyere - Gjelder chat-completion, completion, responses, real-time operations **Pricing:** - Standard deployment: rabatt på input token pricing - Provisioned deployment: opptil 100% rabatt på input tokens **Cache-lifecycle:** - Caches cleares innen 24 timer - Ikke delt mellom Azure subscriptions **Response-eksempel med cache hit:** ```json { "usage": { "completion_tokens": 1518, "prompt_tokens": 1566, "total_tokens": 3084, "prompt_tokens_details": { "cached_tokens": 1408 } } } ``` **Optimalisering:** - Strukturer requests slik at repetitivt innhold ligger i starten av messages array - Bruk `prompt_cache_key` parameter for å påvirke routing og forbedre cache hit rates - Vær obs på at >15 requests/min med samme prefix kan overflow og redusere effektivitet **[Confidence: HIGH]** — Prompt caching er production-ready og automatisk enabled for støttede modeller. #### B. Application-Layer Caching **Multi-layer caching approach** for AI-applikasjoner: 1. **Result and answer caching** — Gjenbruk responses for identiske eller semantisk like queries 2. **Retrieval and grounding snippet caching** — Cache hyppig hentede knowledge fragments 3. **Model output caching** — Cache intermediate outputs som kan gjenbrukes **Cache key components (kritisk for sikkerhet):** - Tenant eller user identity - Policy context - Model version - Prompt version **TTL policies:** - Sett expiration basert på data freshness requirements - Kortere TTL for sensitive data - Lengre TTL for static catalog data **Invalidation hooks:** - Data updates - Model changes - Prompt modifications **Security considerations:** - **ALDRI cache user-private content** uten proper scoping - Caching fungerer best for data som gjelder på tvers av flere brukere - Eksempel på farlig caching: "How many hours of paid time off do I have left?" — kun gyldig for én bruker **[Confidence: MEDIUM-HIGH]** — Pattern er godt dokumentert, men krever nøye implementering for å unngå security leaks. #### C. Databricks Disk Caching For batch inference på Databricks kan du bruke disk cache for å forbedre I/O performance: ```python spark.conf.set("spark.databricks.io.cache.enabled", "true") spark.conf.set("spark.databricks.io.cache.maxDiskUsage", "50g") spark.conf.set("spark.databricks.io.cache.maxMetaDataCache", "1g") spark.conf.set("spark.databricks.io.cache.compression.enabled", "false") ``` **Best practice:** - Velg cache-accelerated worker instance types - Vær obs på at cache går tapt ved autoscaling (worker decommission) --- ### 4. Compute Resource Optimization #### A. CPU vs GPU Selection **CPU inference:** - Generelle ML-modeller (scikit-learn, XGBoost) - Small to medium deep learning models - Cost-sensitive scenarios - ONNX Runtime gir 2x speedup på CPU for mange workloads **GPU inference:** - Deep learning models (transformers, CNNs) - High-throughput batch processing - Latency-kritiske online inference - Computer vision, NLP-modeller **NPU (Neural Processing Unit):** - Edge deployment scenarios (Windows ML) - Power-efficient inference på mobile/IoT devices **ONNX Runtime execution provider selection:** ```python import onnxruntime as ort # Automatisk select EP basert på MAX_EFFICIENCY policy (prioriterer NPU > CPU) options = ort.SessionOptions() options.set_provider_selection_policy(ort.OrtExecutionProviderDevicePolicy.MAX_EFFICIENCY) session = ort.InferenceSession(model_path, sess_options=options) ``` #### B. Autoscaling for Inference Endpoints **Azure Machine Learning — Managed Online Endpoints:** Autoscaling basert på Azure Monitor metrics (CPU, requests per second, latency). **Azure Kubernetes Service (AKS) — azureml-fe router:** ```yaml # deployment.yaml scale_setting: type: target_utilization min_instances: 3 max_instances: 15 target_utilization_percentage: 70 polling_interval: 10 ``` **Utilization formula:** ``` utilization_percentage = (busy_replicas + queued_requests) / total_replicas ``` - Scale up: eager and fast (når utilization > 70%) - Scale down: conservative (~20x slower enn scale up) **Performance characteristics:** - azureml-fe kan håndtere 5K requests/second med <3ms average latency, 15ms p99 - For >10K RPS: øk `azureml-fe` pods eller vCPU/memory limits **[Confidence: HIGH]** — Autoscaling er production-proven i Azure ML. --- ### 5. Batch vs Online Inference Optimization #### A. Batch Inference Best Practices **Når bruke batch:** - Large datasets i filer (ikke krever low latency) - Scheduled scoring (daily/weekly) - Cost-sensitive scenarios (batch er billigere enn online) **Azure Machine Learning Batch Endpoints:** ```python from azure.ai.ml.entities import BatchEndpoint endpoint = BatchEndpoint( name=endpoint_name, description="Batch inference for predictions" ) ws_client.batch_endpoints.begin_create_or_update(endpoint) ``` **Parallel processing optimization:** ```python from azure.ai.ml import parallel_run_function file_batch_inference = parallel_run_function( name="batch_score", inputs=dict(job_data_path=Input(type=AssetTypes.MLTABLE)), outputs=dict(job_output_path=Output(type=AssetTypes.MLTABLE)), input_data="${{inputs.job_data_path}}", instance_count=2, max_concurrency_per_instance=1, mini_batch_size="1", task=RunFunction( code="./src", entry_script="batch_inference.py", environment="azureml://registries/azureml/environments/sklearn-1.5/labels/latest" ) ) ``` **Databricks batch inference tips:** - Bruk Spark Pandas UDFs for å scale inference across cluster - Separer preprocessing fra inference for optimal hardware selection (CPU for ETL, GPU for inference) - Bruk Delta Lake tables for data som leses flere ganger #### B. Online Inference Best Practices **Når bruke online:** - Real-time user-facing applications - Low-latency requirements (<100ms) - Single or small-batch predictions **Azure AI Foundry Serverless API:** - PaaS, minimal operational burden - Best for foundation models (Azure OpenAI) **Azure Machine Learning Managed Online Endpoints:** - Custom models med full kontroll - Autoscaling, blue/green deployment - Integration med Application Insights for monitoring --- ### 6. Azure OpenAI Batch API for Cost-Efficient Inference For foundation models som ikke krever real-time response: **Batch API benefits:** - 50% lavere cost enn standard API - 24-hour completion window - Støtte for chat completions, embeddings, completions **Batch job creation:** ```python from openai import OpenAI from azure.identity import DefaultAzureCredential, get_bearer_token_provider token_provider = get_bearer_token_provider( DefaultAzureCredential(), "https://cognitiveservices.azure.com/.default" ) client = OpenAI( base_url="https://YOUR-RESOURCE.openai.azure.com/openai/v1/", api_key=token_provider ) batch_response = client.batches.create( input_file_id=None, endpoint="/chat/completions", completion_window="24h", extra_body={ "input_blob": "https://storage.blob.core.windows.net/batch-input/test.jsonl", "output_folder": { "url": "https://storage.blob.core.windows.net/batch-output" } } ) ``` **[Confidence: HIGH]** — Batch API er production-ready for non-latency-sensitive workloads. --- ## Arkitekturmønstre ### Pattern 1: Multi-Layer Caching Architecture ``` ┌─────────────────────────────────────────────────────────────┐ │ Client Layer │ └─────────────────────────────────────────────────────────────┘ │ ▼ ┌─────────────────────────────────────────────────────────────┐ │ AI Gateway (APIM) │ │ • Authentication, rate limiting, token caps │ │ • Result cache (Redis) — Level 1 │ └─────────────────────────────────────────────────────────────┘ │ ▼ ┌─────────────────────────────────────────────────────────────┐ │ Intelligence Layer (Orchestrator) │ │ • Prompt cache (Azure OpenAI) — Level 2 │ │ • Model routing, agent coordination │ └─────────────────────────────────────────────────────────────┘ │ ▼ ┌─────────────────────────────────────────────────────────────┐ │ Knowledge Layer │ │ • Grounding snippet cache (Cosmos DB) — Level 3 │ │ • Azure AI Search, SQL, Graph │ └─────────────────────────────────────────────────────────────┘ │ ▼ ┌─────────────────────────────────────────────────────────────┐ │ Inferencing Layer │ │ • Model output cache — Level 4 │ │ • ONNX Runtime, Azure ML endpoints │ └─────────────────────────────────────────────────────────────┘ ``` **Cache key strategy per layer:** - Level 1 (Result): `hash(user_id + query + model_version + prompt_version)` - Level 2 (Prompt): automatisk basert på første 1024 tokens + `prompt_cache_key` - Level 3 (Grounding): `hash(query_embedding + user_groups + data_timestamp)` - Level 4 (Model output): `hash(input_features + model_version)` --- ### Pattern 2: ONNX-Based Cross-Platform Deployment ``` ┌─────────────────────────────────────────────────────────────┐ │ Training (Azure ML) │ │ PyTorch / TensorFlow / scikit-learn │ └─────────────────────────────────────────────────────────────┘ │ ▼ ONNX Export ┌─────────────────────────────────────────────────────────────┐ │ ONNX Model Registry │ │ • Model versioning, metadata, governance │ └─────────────────────────────────────────────────────────────┘ │ ┌────────────┴────────────┐ ▼ ▼ ┌──────────────────────────┐ ┌──────────────────────────┐ │ Cloud Inference │ │ Edge Inference │ │ • Azure ML Endpoints │ │ • Azure SQL Edge │ │ • AKS + ONNX Runtime │ │ • Windows ML │ │ • GPU acceleration │ │ • IoT Edge │ │ (TensorRT) │ │ • NPU acceleration │ └──────────────────────────┘ └──────────────────────────┘ ``` **Fordeler:** - Train once, deploy everywhere - Framework-agnostic - Consistent performance optimization - Hardware acceleration på tvers av plattformer --- ### Pattern 3: Autoscaling Inference Architecture ``` ┌─────────────────────────────────────────────────────────────┐ │ Load Balancer │ │ (Azure Front Door / App Gateway) │ └─────────────────────────────────────────────────────────────┘ │ ▼ ┌─────────────────────────────────────────────────────────────┐ │ azureml-fe (Inference Router) │ │ • Smart routing, autoscaling coordination │ │ • 3 instances (HA), 5K RPS capacity │ └─────────────────────────────────────────────────────────────┘ │ ┌────────────┴────────────┐ ▼ ▼ ┌──────────────────────────┐ ┌──────────────────────────┐ │ Model Pod Replicas │ │ Model Pod Replicas │ │ (min: 3, max: 15) │ │ (min: 3, max: 15) │ │ • ONNX Runtime │ │ • ONNX Runtime │ │ • CPU or GPU │ │ • CPU or GPU │ └──────────────────────────┘ └──────────────────────────┘ │ ▼ ┌─────────────────────────────────────────────────────────────┐ │ Azure Monitor / App Insights │ │ • Metrics: latency, throughput, utilization │ │ • Autoscaling triggers │ └─────────────────────────────────────────────────────────────┘ ``` **Scaling logic:** ``` utilization = (busy_replicas + queued_requests) / total_replicas if utilization > 70%: scale_up() if utilization < 50%: scale_down() # conservative ``` --- ## Beslutningsveiledning ### 1. Velge Inferencing Platform | Scenario | Anbefalt Platform | Rationale | |----------|-------------------|-----------| | **Foundation models** (GPT-4o, embeddings) | Azure OpenAI / AI Foundry Serverless | PaaS, automatisk scaling, prompt caching | | **Custom ML models** (scikit-learn, XGBoost) | Azure ML Managed Endpoints | Full kontroll, autoscaling, ONNX-support | | **High-throughput batch** | Azure ML Batch Endpoints / Databricks | Cost-efficient, parallelization | | **Edge deployment** | ONNX Runtime + Windows ML / IoT Edge | Cross-platform, hardware acceleration | | **Real-time inference** (<50ms) | Azure ML Online Endpoints (GPU) | Low latency, high throughput | | **SQL-integrated inference** | Azure SQL Edge (ONNX) | Native scoring, no external API calls | **[Confidence: HIGH]** — Basert på Microsoft's offisielle deployment guidance. --- ### 2. Velge Compute for Inference | Model Type | Recommended Compute | Rationale | |------------|---------------------|-----------| | **Small tabular models** | CPU (Standard_DS3_v2) | Cost-efficient, sufficient performance | | **Deep learning vision** | GPU (Standard_NC6s_v3) | Parallel processing, low latency | | **Large language models** | GPU (Standard_NC24s_v3 eller PTU) | High throughput, batch support | | **Batch scoring** | CPU clusters (autoscale 0-N) | Cost optimization, scale to zero | | **Edge scenarios** | NPU (Windows devices) | Power-efficient, local inference | **Testing strategy:** 1. Start med CPU baseline 2. Test GPU for latency-kritiske workloads 3. Sammenlign cost vs performance 4. Dokumenter resultatene som baseline for re-evaluation **[Confidence: HIGH]** — Standard industry practice i Azure ML. --- ### 3. Velge Caching Strategy | Use Case | Caching Layer | TTL | Cache Key Components | |----------|---------------|-----|---------------------| | **Chatbot FAQ** | Result cache (Redis) | 24h | `query_hash + model_version` | | **Product catalog search** | Grounding cache (Cosmos DB) | 1h | `query_embedding + catalog_version` | | **RAG knowledge retrieval** | Snippet cache (Cosmos DB) | 6h | `query + user_groups + doc_timestamp` | | **GPT-4o prompts** | Prompt cache (automatic) | 24h | Første 1024 tokens (automatic) | | **Batch predictions** | Model output cache | N/A | Not recommended (one-time use) | **Security checklist:** - [ ] Cache keys include user/tenant identity for private data? - [ ] TTL aligns with data freshness requirements? - [ ] Invalidation hooks implemented for data/model updates? - [ ] No user-private content cached cross-user? **[Confidence: MEDIUM-HIGH]** — Pattern er godt dokumentert, men må tilpasses per use case. --- ### 4. Online vs Batch Inference Decision Tree ``` Start: Har du real-time latency krav (<1s)? │ ├─ YES → Online Inference │ │ │ ├─ Throughput <100 RPS? → Managed Online Endpoint (CPU) │ ├─ Throughput >100 RPS? → Managed Online Endpoint (GPU) + autoscaling │ └─ Need 99.9% SLA? → Multi-region deployment │ └─ NO → Batch Inference │ ├─ Data size <1GB? → Azure ML Batch Endpoint ├─ Data size >1GB? → Databricks Batch (Spark) └─ Foundation model? → Azure OpenAI Batch API (50% discount) ``` **[Confidence: HIGH]** — Klar beslutningslogikk basert på Microsoft docs. --- ## Integrasjon med Microsoft-stakken ### Azure Machine Learning **Deployment options:** 1. **Managed Online Endpoints** — Real-time inference, autoscaling, monitoring 2. **Batch Endpoints** — Scheduled/on-demand batch scoring 3. **Kubernetes Endpoints** — Deploy to AKS, on-prem, eller edge Kubernetes **ONNX integration:** - Export modeller direkte fra AutoML (image classification, object detection) - Deploy ONNX models via MLflow eller custom scoring script - Automatic optimization via ONNX Runtime execution providers **Monitoring:** - Application Insights for latency, throughput, errors - Model performance monitoring for drift detection - Cost tracking per deployment --- ### Azure AI Foundry **Serverless API:** - Deploy foundation models uten å administrere infrastructure - Automatisk prompt caching for GPT-4o-modeller - Pay-per-token pricing **Model Catalog:** - Pretrained models fra Hugging Face, Meta, Mistral - One-click deployment to serverless endpoints - ONNX-modeller for cross-platform scenarios **Global Standard Deployments:** - Cost savings vs standard deployments - Custom model weights kan midlertidig lagres utenfor resource geography (vær obs på compliance) --- ### Azure OpenAI **Deployment types:** - **Standard** — Pay-per-token, regional data residency - **Provisioned Throughput (PTU)** — Reserved capacity, up to 100% discount on cached input tokens - **Global Standard** — Cost savings, global routing - **Developer Tier** — No hourly fee, no SLA (for testing) **Batch API:** - 50% cost reduction for non-real-time workloads - 24-hour completion window - Azure Blob Storage integration --- ### Windows ML **Edge inference scenarios:** - Deploy ONNX models directly i Windows apps - NPU acceleration via Windows AI runtime - Execution provider discovery og registration: ```python import winui3.microsoft.windows.ai.machinelearning as winml catalog = winml.ExecutionProviderCatalog.get_default() providers = catalog.find_all_providers() for provider in providers: provider.ensure_ready_async().get() if provider.library_path: ort.register_execution_provider_library(provider.name, provider.library_path) ``` --- ### Azure SQL Edge **Native ONNX scoring:** - Deploy ONNX models directly i SQL Edge - `PREDICT` T-SQL function for inference - No external API calls, low-latency scoring - Ideal for IoT/edge scenarios med connectivity constraints --- ### Databricks **Batch inference optimization:** - Spark Pandas UDFs for distributed inference - Delta Lake integration for data caching - GPU clusters for deep learning models **Disk cache configuration:** ```python spark.conf.set("spark.databricks.io.cache.enabled", "true") spark.conf.set("spark.databricks.io.cache.maxDiskUsage", "50g") ``` --- ## Offentlig sektor (Norge) ### Compliance og Data Residency **Prompt caching compliance:** - Azure OpenAI prompt caches er **ikke delt mellom subscriptions** — OK for multi-tenant scenarios innad i én subscription - Cache lifetime: 24 timer — vurder om dette er akseptabelt for sensitive data - Vær obs på at cached tokens **ikke påvirker output content** — kun performance/cost **Global Standard deployments:** - Custom model weights kan **midlertidig lagres utenfor region** — vurder mot Schrems II og data residency-krav - For offentlig sektor: foretrekk **Standard deployments** (regional data residency) over Global Standard **ONNX edge deployment:** - For edge scenarios (Azure SQL Edge, Windows ML) — data forlater **ikke device** hvis modell er embedded - Ideelt for kommuner/sykehus med connectivity constraints eller privacy-krav --- ### Cost Optimization for Offentlig Sektor **Batch API for budsjett-beskrankede prosjekter:** - 50% lavere cost enn real-time API - Egnet for daglige rapporter, batch-analyser, data enrichment **Prompt caching for cost reduction:** - Standard deployment: rabatt på input tokens - Provisioned deployment: opptil 100% rabatt på cached tokens - Eksempel: Knowledge base Q&A med repetitiv grounding context — store savings **Autoscaling for variabel demand:** - Sett `min_instances: 0` for ikke-kritiske workloads (scale to zero when idle) - Bruk `target_utilization_percentage: 70` for å balansere cost vs responsiveness **TCO-vurdering:** - Online inference: høyere cost, men nødvendig for brukervendte apps - Batch inference: lavere cost, egnet for interne analyser/rapporter - Edge inference: ingen inference API cost, men krever on-prem hardware --- ### Sikkerhet og Personvern **Cache security best practices:** - **ALDRI cache personidentifiserbare data** (fødselsnummer, helseopplysninger) uten kryptering og user-scoped keys - Implementer `cache_key = hash(user_id + query + model_version)` for user-private content - Bruk kort TTL (5-15 min) for sensitive queries **Authorization-aware retrieval:** - Pass Microsoft Entra group claims til knowledge layer - Grounding services må enforces ACL-based filtering - Eksempel: RAG-system for saksdokumenter — kun returner dokumenter bruker har tilgang til **Audit logging:** - Log alle cache hits/misses for compliance - Track hvilke brukere har accesset cached results - Integrer med Azure Monitor for SIEM-forwarding **[Confidence: MEDIUM-HIGH]** — Security patterns er godt dokumentert, men krever nøye implementering. --- ## Kostnad og lisensiering ### Azure Machine Learning Pricing **Compute costs:** - **Managed Online Endpoints:** Pay for VM uptime (even if idle) + inference requests - **Batch Endpoints:** Pay only for compute time during job execution - **Autoscaling:** Kan redusere cost ved å scale to zero (min_instances: 0) **Estimat (Standard_DS3_v2, 2 vCPU, 14GB RAM):** - ~$0.192/hour per instance - Med autoscaling (avg 5 instances, 8h/day): ~$230/måned - Batch (4h/dag): ~$92/måned **Cost optimization tips:** - Bruk Reserved Instances for predictable workloads (opptil 72% discount) - Leverage Spot VMs for non-critical batch jobs (opptil 90% discount) - Monitor idle instances og adjust min_instances --- ### Azure OpenAI Pricing **Standard deployment:** - Pay-per-token (input + output) - **Prompt caching discount:** reduced rate for cached input tokens (varies by model) - Eksempel (GPT-4o): $5/1M input tokens, $15/1M output tokens — cached input tokens $2.50/1M (estimated) **Provisioned Throughput (PTU):** - Fixed monthly cost basert på reserved capacity - **Up to 100% discount on cached input tokens** - Egnet for high-volume, predictable workloads **Batch API:** - **50% lavere cost** enn standard API - Eksempel: $2.50/1M tokens (vs $5/1M for real-time) **Cost estimation example:** - RAG chatbot: 1M requests/måned, avg 2000 tokens/request (1500 prompt, 500 completion) - Med prompt caching (70% cache hit rate): **$10,500/måned** (vs $18,000 uten caching) --- ### Lisensiering **ONNX Runtime:** - **MIT License** — free for commercial use - No licensing cost for deployment **Azure Services:** - Azure ML, Azure OpenAI, AI Foundry: **pay-per-use** (no upfront license fees) - Windows ML: inkludert i Windows (no additional license) **Power Platform AI:** - AI Builder capacity: $500/måned for 1M AI Builder service credits - Custom models (ONNX): **ingen ekstra cost** utover AI Builder capacity **[Confidence: HIGH]** — Pricing er transparent og godt dokumentert på azure.com. --- ## For arkitekten (Cosmo) ### Typiske Spørsmål fra Kunder **Q: "Hvorfor er inferencing så tregt sammenlignet med training?"** A: Misforståelse! Training og inferencing har ulike optimaliseringsmål. Training fokuserer på accuracy (kan ta timer/dager), mens inferencing må levere prediksjoner i sanntid (<100ms). Løsning: ONNX-konvertering, GPU-akselerasjon, caching, batch inference for ikke-latency-kritiske scenarios. **Q: "Vi har deployet en modell, men Azure ML-costs eksploderer. Hva gjør vi?"** A: Sjekk følgende: 1. Er `min_instances` satt til >0 for idle endpoints? → Sett til 0 eller sllett endpoint 2. Bruker dere GPU for enkel ML-modell? → Bytt til CPU 3. Har dere implementert caching? → Implementer result cache (Redis) for repetitive queries 4. Er autoscaling konfiguert? → Sett target_utilization til 70% og max_instances til realistisk verdi **Q: "Kan vi bruke samme modell i Azure ML, Power Platform og edge devices?"** A: Ja, med ONNX! Konverter modell til ONNX, deploy til: - Azure ML Managed Endpoints (cloud) - AI Builder custom models (Power Platform) - Azure SQL Edge (edge database) - Windows ML (client apps) **Q: "Hvordan balanserer vi cost vs performance?"** A: Følg denne prioriteringen: 1. **Implementer caching først** — største ROI for generative AI workloads 2. **Velg riktig compute** — CPU for de fleste ML-modeller, GPU kun for deep learning 3. **Batch vs online** — bruk batch hvor mulig (50% lavere cost) 4. **Autoscaling** — scale to zero for ikke-kritiske workloads 5. **Reserved capacity** — for predictable workloads (opptil 72% discount) --- ### Anti-Patterns å Unngå ❌ **Deploying GPU instances for simple ML models** - Scikit-learn, XGBoost kjører fint på CPU - GPU gir minimal speedup, men 3-5x høyere cost ❌ **No caching for repetitive queries** - Eksempel: chatbot med FAQ — samme spørsmål stilles om og om igjen - Løsning: Redis cache med 1-hour TTL ❌ **Ignoring autoscaling (min_instances = max_instances)** - Fastlåst antall instances betyr du betaler for idle capacity - Løsning: Sett min_instances til 0-1, max_instances til realistic peak ❌ **Using online inference for batch workloads** - Daglige rapporter kjørt via online API → unødvendig dyrt - Løsning: Azure ML Batch Endpoint eller Azure OpenAI Batch API ❌ **Not converting to ONNX for cross-platform deployment** - Deploying PyTorch modell direkte til edge → store dependencies, treg inferencing - Løsning: Konverter til ONNX, deploy via Windows ML/IoT Edge --- ### Troubleshooting Guide **Problem: High latency (>500ms) for simple predictions** Diagnostikk: 1. Sjekk `Application Insights` → identifiser bottleneck (network, model, preprocessing) 2. Profiler modell med `azureml.core.Model.profile()` → se CPU/memory usage 3. Sjekk om modell er ONNX-konvertert → hvis ikke, konverter for speedup **Problem: Autoscaling ikke fungerer** Diagnostikk: 1. Sjekk at `azureml-fe` ikke konkurrerer med Kubernetes HPA → disable HPA 2. Verify `scale_settings` i deployment YAML 3. Monitor `utilization_percentage` metric → skal trigger ved 70% **Problem: Cache hit rate lav (<20%)** Diagnostikk: 1. Prompt caching: Er første 1024 tokens identiske? → restructure prompts 2. Result cache: Er `cache_key` for granular? → reduser til færre dimensjoner 3. TTL for kort? → øk TTL for static data **Problem: Out-of-memory errors på inference endpoint** Diagnostikk: 1. Sjekk batch size → reduser for å unngå OOM 2. Upgrade VM SKU → mer memory (Standard_DS3_v2 → Standard_DS4_v2) 3. Vurder model quantization → reduser model size --- ### Decision Framework: Når Bruke Hva **Scenario: Real-time chatbot (consumer-facing)** - **Platform:** Azure OpenAI (Standard deployment) - **Caching:** Prompt caching (automatic) + Result cache (Redis, 1h TTL) - **Compute:** Serverless (automatic scaling) - **Monitoring:** Application Insights for latency/errors **Scenario: Batch document classification (internal)** - **Platform:** Azure ML Batch Endpoint - **Caching:** N/A (one-time processing) - **Compute:** CPU cluster (Standard_DS3_v2, autoscale 0-10) - **Monitoring:** Job logs for throughput/errors **Scenario: Edge inference på IoT devices** - **Platform:** Azure IoT Edge + ONNX Runtime - **Caching:** Local model cache (embedded i device) - **Compute:** NPU (hvis tilgjengelig) eller CPU - **Monitoring:** IoT Hub telemetry **Scenario: RAG system for kunnskapsdatabase** - **Platform:** Azure AI Foundry + Azure AI Search - **Caching:** Grounding snippet cache (Cosmos DB, 6h TTL) + Prompt cache - **Compute:** Serverless (Azure OpenAI) - **Monitoring:** Cache hit rate, latency, token usage --- ## Kilder og verifisering ### Microsoft Learn Dokumentasjon 1. **ONNX and Azure Machine Learning** https://learn.microsoft.com/en-us/azure/machine-learning/concept-onnx?view=azureml-api-2 *Verifisert: 2026-02-04* — Komplett guide til ONNX Runtime, model conversion, deployment 2. **Prompt Caching (Azure OpenAI)** https://learn.microsoft.com/en-us/azure/ai-foundry/openai/how-to/prompt-caching?view=foundry-classic *Verifisert: 2026-02-04* — Official docs for prompt caching, supported models, pricing 3. **Application Design for AI Workloads on Azure** https://learn.microsoft.com/en-us/azure/well-architected/ai/application-design *Verifisert: 2026-02-04* — Multi-layer caching strategies, security best practices 4. **Azure Machine Learning Inference Router** https://learn.microsoft.com/en-us/azure/machine-learning/how-to-kubernetes-inference-routing-azureml-fe?view=azureml-api-2 *Verifisert: 2026-02-04* — Autoscaling, performance characteristics 5. **Best Practices for Deep Learning on Azure Databricks** https://learn.microsoft.com/en-us/azure/databricks/machine-learning/train-model/dl-best-practices *Verifisert: 2026-02-04* — Batch inference optimization, Spark Pandas UDFs 6. **Make Predictions with ONNX (AutoML)** https://learn.microsoft.com/en-us/azure/machine-learning/how-to-inference-onnx-automl-image-models?view=azureml-api-2 *Verifisert: 2026-02-04* — ONNX inference for computer vision models 7. **Sustainable AI Design for Workloads on Azure** https://learn.microsoft.com/en-us/azure/well-architected/sustainability/sustainable-ai-design *Verifisert: 2026-02-04* — Model caching for carbon reduction 8. **Azure Machine Learning Architecture Best Practices** https://learn.microsoft.com/en-us/azure/well-architected/service-guides/azure-machine-learning *Verifisert: 2026-02-04* — Performance efficiency, cost optimization ### Code Samples (MCP microsoft-learn) - ONNX Runtime inference session creation (Python) - Batch inference with Azure ML SDK - Prompt caching response parsing - Autoscaling configuration (YAML) - Databricks disk cache configuration **Total MCP-kall:** 7 (docs search) + 3 (docs fetch) + 2 (code samples) = **12** **Kilder totalt:** 8 Microsoft Learn-artikler + 15+ kodeeksempler --- ## Oppsummering for Cosmo **Key Takeaways:** 1. **ONNX Runtime er game-changer** for cross-platform deployment og performance optimization (2x speedup på CPU) 2. **Prompt caching** (Azure OpenAI) gir opptil 100% discount på cached input tokens — kritisk for cost optimization 3. **Multi-layer caching** (result → prompt → grounding → model output) er obligatorisk for production AI apps 4. **Batch inference** er 50% billigere enn online, men kun egnet for ikke-latency-kritiske workloads 5. **Autoscaling** må konfigureres riktig (min_instances: 0, target_utilization: 70%) for å unngå waste **Anbefalinger til kunde:** - Start med **CPU + ONNX Runtime** for ML-modeller (unless deep learning) - Implementer **prompt caching** for generative AI workloads (automatisk i Azure OpenAI) - Bruk **Azure ML Batch Endpoints** for rapporter/analyser - Deploy **ONNX models til edge** (Azure SQL Edge, Windows ML) for low-latency/privacy-kritiske scenarios - Monitor **cache hit rate** og **autoscaling metrics** kontinuerlig **Confidence nivå: HIGH** — Denne referansen er basert på 12 MCP-kall til offisiell Microsoft-dokumentasjon og kodeeksempler. ### ONNX Inferencing Optimization for Computer Vision (Azure ML AutoML 2026) — Verified (MCP 2026-04) ONNX (Open Neural Network Exchange) enables cross-framework interoperability and inference optimization: **Supported AutoML computer vision tasks**: - Image classification (binary and multi-class) - Object detection - Instance segmentation **ONNX inference workflow**: 1. Download ONNX model files from AutoML training run 2. Understand model inputs/outputs (image format requirements) 3. Preprocess data to required input format 4. Run inference with ONNX Runtime Python API (`onnxruntime`) 5. Post-process predictions (bounding boxes for detection, masks for segmentation) **Python ONNX Runtime**: ```python import onnxruntime as rt sess = rt.InferenceSession("model.onnx") # Works across languages: Python, C++, C#, Java, JavaScript ``` **Cross-platform benefits**: - Deploy on any platform without framework dependencies - Reduced inference latency vs Python framework - Edge deployment: Azure IoT Edge, on-premises - Language flexibility post-export **SDK**: `azure-ai-ml v2 (current)` — use AutoML image tasks to generate ONNX models automatically