Inferencing Optimization and Caching

Category: MLOps & GenAIOps | Date: 2026-02-04 | Author: Cosmo Skyberg, Senior Microsoft AI Solution Architect

Verified: MCP 2026-04

Introduction

Inferencing optimization and caching are critical techniques for maximizing performance and minimizing cost when AI models serve predictions in production. While model training is about achieving high accuracy, inferencing is about delivering those predictions quickly, reliably, and cost-effectively to users and systems.

What is inferencing? Inferencing (or model scoring) is the process of using a trained model to generate predictions on production data. It happens continuously after the model is deployed and can involve anything from single requests (online inference) to batch processing of large datasets.

Why is optimization critical? Even well-trained models can fail in production if they are not optimized for inferencing. Poor inferencing performance manifests as high latency, low throughput, high infrastructure costs, and a poor user experience. In the Microsoft ecosystem this is especially relevant for Azure Machine Learning, Azure AI Foundry, and embedded scenarios such as Azure SQL Edge and Windows ML.

Three pillars of inferencing optimization:

  1. Model optimization — conversion to efficient formats (ONNX), quantization, pruning
  2. Compute optimization — the right hardware acceleration (CPU vs GPU vs NPU), autoscaling, resource tuning
  3. Caching strategies — multi-layer caching to avoid redundant compute

This reference covers all three areas, focusing on Microsoft tooling and best practices for the public sector.


Core Components

1. ONNX Runtime — High-Performance Inference Engine

ONNX (Open Neural Network Exchange) is an open standard for representing machine learning models across frameworks. ONNX Runtime is Microsoft's high-performance engine for running those models in production.

Key features:

  • Cross-platform: Linux, Windows, macOS, cloud, and edge
  • Cross-framework: Supports models from TensorFlow, PyTorch, scikit-learn, Keras, MXNet, MATLAB
  • Hardware acceleration: Integrates with TensorRT (NVIDIA GPUs), OpenVINO (Intel), DirectML (Windows)
  • Production-proven: Used by Bing, Office, Azure AI — Microsoft services report an average 2x performance gain on CPU

When to use ONNX Runtime:

  • You need to deploy the same model on multiple platforms (cloud + edge)
  • You want to avoid vendor lock-in to a specific framework
  • You need maximum inferencing performance on CPU or specialized hardware
  • You are deploying models in Windows ML, Azure SQL Edge, or ML.NET

Python example — ONNX Runtime inference:

import onnxruntime

# Create the inference session
session = onnxruntime.InferenceSession("model.onnx")

# Get input/output metadata
first_input_name = session.get_inputs()[0].name
first_output_name = session.get_outputs()[0].name

# Run inferencing (input_data is a numpy array matching the model's input shape)
results = session.run(
    [first_output_name],
    {first_input_name: input_data}
)

Installation:

pip install onnxruntime       # CPU build
pip install onnxruntime-gpu   # GPU build

[Confidence: HIGH] — ONNX Runtime is mature, well documented, and actively developed by Microsoft.


2. Model Optimization Techniques

A. Model Conversion to ONNX

Converting from a native framework to ONNX lets you take advantage of ONNX Runtime's optimizations.

Converting from PyTorch:

import torch.onnx

# Set the model to inference mode
model.eval()

# Dummy input for shape tracing
dummy_input = torch.randn(1, 3, 224, 224, requires_grad=True)

# Export to ONNX
torch.onnx.export(
    model,
    dummy_input,
    "model.onnx",
    export_params=True,
    opset_version=11,
    do_constant_folding=True,  # Optimization
    input_names=['input'],
    output_names=['output'],
    dynamic_axes={'input': {0: 'batch_size'}, 'output': {0: 'batch_size'}}
)
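
A common follow-up step is to validate the export before shipping the model — a minimal sketch, assuming the onnx package is installed (pip install onnx) and that model and dummy_input from the snippet above are still in scope:

import numpy as np
import onnx
import onnxruntime

# Verify that the exported graph is structurally valid
onnx_model = onnx.load("model.onnx")
onnx.checker.check_model(onnx_model)

# Compare ONNX Runtime output with the original PyTorch output
session = onnxruntime.InferenceSession("model.onnx")
ort_output = session.run(None, {"input": dummy_input.detach().numpy()})[0]
torch_output = model(dummy_input).detach().numpy()
np.testing.assert_allclose(torch_output, ort_output, rtol=1e-3, atol=1e-5)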

Frameworks with ONNX support:

  • TensorFlow, PyTorch, scikit-learn, Keras, Chainer, MXNet, MATLAB
  • AutoML models from Azure Machine Learning (image classification, object detection)

B. Batch Inference Optimization

For AutoML models (especially vision), you can generate batch-optimized ONNX models:

# Object detection batch model parameters
inputs = {
    'model_name': 'fasterrcnn_resnet34_fpn',
    'batch_size': 8,
    'height_onnx': 600,
    'width_onnx': 800,
    'job_name': job_name,
    'task_type': 'image-object-detection',
    'min_size': 600,
    'max_size': 1333,
    'box_score_thresh': 0.3,
    'box_nms_thresh': 0.5,
    'box_detections_per_img': 100
}

[Confidence: HIGH] — Batch inference is well supported in Azure ML for both training and deployment.


3. Multi-Layer Caching Strategies

Caching is one of the most effective ways to reduce inferencing cost and latency, especially for generative AI workloads.

A. Prompt Caching (Azure OpenAI / AI Foundry)

What is prompt caching? Instead of reprocessing the same input tokens over and over, the service keeps a temporary cache of processed token computations.

Requirements to benefit from prompt caching:

  • Prompts must be at least 1,024 tokens long
  • The first 1,024 tokens must be identical
  • Cache hits are reported as cached_tokens in the response

Supported models:

  • All Azure OpenAI models GPT-4o and newer
  • Applies to chat completion, completion, responses, and real-time operations

Pricing:

  • Standard deployment: discount on input token pricing
  • Provisioned deployment: up to 100% discount on cached input tokens

Cache lifecycle:

  • Caches are cleared within 24 hours
  • Not shared between Azure subscriptions

Example response with a cache hit:

{
  "usage": {
    "completion_tokens": 1518,
    "prompt_tokens": 1566,
    "total_tokens": 3084,
    "prompt_tokens_details": {
      "cached_tokens": 1408
    }
  }
}

Optimization:

  • Structure requests so that repetitive content sits at the start of the messages array (see the sketch below)
  • Use the prompt_cache_key parameter to influence routing and improve cache hit rates
  • Be aware that exceeding ~15 requests/min with the same prefix can cause cache overflow and reduce effectiveness
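
A minimal sketch of cache-friendly request structuring, assuming an Azure OpenAI deployment named gpt-4o and illustrative placeholder strings; prompt_cache_key is passed via extra_body here because direct keyword support depends on the openai SDK version:

from openai import AzureOpenAI

client = AzureOpenAI(
    azure_endpoint="https://YOUR-RESOURCE.openai.azure.com",
    api_key="YOUR-KEY",           # or use Entra ID via azure_ad_token_provider
    api_version="2024-10-21",
)

static_instructions = "..."       # long, identical system prompt (>= 1,024 tokens to enable caching)
shared_context = "..."            # grounding context reused across requests
user_question = "How do I reset my password?"

response = client.chat.completions.create(
    model="gpt-4o",               # deployment name
    messages=[
        {"role": "system", "content": static_instructions},                             # identical prefix first
        {"role": "user", "content": f"{shared_context}\n\nQuestion: {user_question}"},  # variable part last
    ],
    extra_body={"prompt_cache_key": "faq-bot-v1"},  # keep similar requests on the same cache node
)

details = response.usage.prompt_tokens_details
print("cached_tokens:", details.cached_tokens if details else 0)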

[Confidence: HIGH] — Prompt caching is production-ready and automatically enabled for supported models.

B. Application-Layer Caching

Multi-layer caching approach for AI applications:

  1. Result and answer caching — Reuse responses for identical or semantically similar queries
  2. Retrieval and grounding snippet caching — Cache frequently retrieved knowledge fragments
  3. Model output caching — Cache intermediate outputs that can be reused

Cache key components (critical for security):

  • Tenant or user identity
  • Policy context
  • Model version
  • Prompt version

TTL policies:

  • Set expiration based on data freshness requirements
  • Shorter TTL for sensitive data
  • Longer TTL for static catalog data

Invalidation hooks:

  • Data updates
  • Model changes
  • Prompt modifications

Security considerations:

  • NEVER cache user-private content without proper scoping (a key-scoping sketch follows this list)
  • Caching works best for data that applies across multiple users
  • Example of dangerous caching: "How many hours of paid time off do I have left?" — only valid for a single user
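
A minimal sketch of a user-scoped result cache (Level 1), assuming Azure Cache for Redis and the redis-py client; all names are illustrative:

import hashlib
import json
import redis

r = redis.Redis(host="my-cache.redis.cache.windows.net", port=6380, ssl=True, password="YOUR-KEY")

MODEL_VERSION = "gpt-4o-2024-11-20"
PROMPT_VERSION = "v3"

def cache_key(tenant_id: str, user_id: str, query: str) -> str:
    # Scope the key by tenant/user identity plus model and prompt version,
    # so user-private answers are never served to other users
    raw = "|".join([tenant_id, user_id, MODEL_VERSION, PROMPT_VERSION, query.strip().lower()])
    return "result:" + hashlib.sha256(raw.encode("utf-8")).hexdigest()

def get_or_compute(tenant_id, user_id, query, compute_fn, ttl_seconds=900):
    key = cache_key(tenant_id, user_id, query)
    hit = r.get(key)
    if hit is not None:
        return json.loads(hit)
    result = compute_fn(query)                      # call the model / orchestrator on a cache miss
    r.setex(key, ttl_seconds, json.dumps(result))   # short TTL (15 min) for sensitive data
    return result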

[Confidence: MEDIUM-HIGH] — The pattern is well documented but requires careful implementation to avoid security leaks.

C. Databricks Disk Caching

For batch inference on Databricks, you can use the disk cache to improve I/O performance:

spark.conf.set("spark.databricks.io.cache.enabled", "true")
spark.conf.set("spark.databricks.io.cache.maxDiskUsage", "50g")
spark.conf.set("spark.databricks.io.cache.maxMetaDataCache", "1g")
spark.conf.set("spark.databricks.io.cache.compression.enabled", "false")

Best practice:

  • Choose cache-accelerated worker instance types
  • Be aware that the cache is lost during autoscaling (worker decommissioning)

4. Compute Resource Optimization

A. CPU vs GPU Selection

CPU inference:

  • Generelle ML-modeller (scikit-learn, XGBoost)
  • Small to medium deep learning models
  • Cost-sensitive scenarios
  • ONNX Runtime provides a 2x speedup on CPU for many workloads

GPU inference:

  • Deep learning models (transformers, CNNs)
  • High-throughput batch processing
  • Latency-critical online inference
  • Computer vision and NLP models

NPU (Neural Processing Unit):

  • Edge deployment scenarios (Windows ML)
  • Power-efficient inference on mobile/IoT devices

ONNX Runtime execution provider selection:

import onnxruntime as ort

# Automatically select an execution provider based on the MAX_EFFICIENCY policy (prefers NPU over CPU)
options = ort.SessionOptions()
options.set_provider_selection_policy(ort.OrtExecutionProviderDevicePolicy.MAX_EFFICIENCY)

session = ort.InferenceSession(model_path, sess_options=options)

B. Autoscaling for Inference Endpoints

Azure Machine Learning — Managed Online Endpoints:

Autoscaling based on Azure Monitor metrics (CPU, requests per second, latency).

Azure Kubernetes Service (AKS) — azureml-fe router:

# deployment.yaml
scale_setting:
  type: target_utilization
  min_instances: 3
  max_instances: 15
  target_utilization_percentage: 70
  polling_interval: 10

Utilization formula:

utilization_percentage = (busy_replicas + queued_requests) / total_replicas

  • Scale up: eager and fast (when utilization > 70%)
  • Scale down: conservative (~20x slower than scale up)

Performance characteristics:

  • azureml-fe can handle 5K requests/second with <3ms average latency, 15ms p99
  • For >10K RPS: increase the number of azureml-fe pods or their vCPU/memory limits

[Confidence: HIGH] — Autoscaling is production-proven in Azure ML.


5. Batch vs Online Inference Optimization

A. Batch Inference Best Practices

When to use batch:

  • Large datasets in files (no low-latency requirement)
  • Scheduled scoring (daily/weekly)
  • Cost-sensitive scenarios (batch is cheaper than online)

Azure Machine Learning Batch Endpoints:

from azure.ai.ml.entities import BatchEndpoint

endpoint = BatchEndpoint(
    name=endpoint_name,
    description="Batch inference for predictions"
)

ws_client.batch_endpoints.begin_create_or_update(endpoint)
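
Once a deployment exists on the endpoint, a scoring job can be submitted like this — a sketch only; the input path is illustrative and the invoke keyword names vary slightly between azure-ai-ml versions:

from azure.ai.ml import Input
from azure.ai.ml.constants import AssetTypes

# ws_client is the same MLClient used above
job = ws_client.batch_endpoints.invoke(
    endpoint_name=endpoint_name,
    input=Input(type=AssetTypes.URI_FOLDER,
                path="azureml://datastores/workspaceblobstore/paths/batch-input/"),
)
ws_client.jobs.stream(job.name)   # follow the scoring job until it completes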

Parallel processing optimization:

from azure.ai.ml import Input, Output
from azure.ai.ml.constants import AssetTypes
from azure.ai.ml.parallel import parallel_run_function, RunFunction

file_batch_inference = parallel_run_function(
    name="batch_score",
    inputs=dict(job_data_path=Input(type=AssetTypes.MLTABLE)),
    outputs=dict(job_output_path=Output(type=AssetTypes.MLTABLE)),
    input_data="${{inputs.job_data_path}}",
    instance_count=2,
    max_concurrency_per_instance=1,
    mini_batch_size="1",
    task=RunFunction(
        code="./src",
        entry_script="batch_inference.py",
        environment="azureml://registries/azureml/environments/sklearn-1.5/labels/latest"
    )
)

Databricks batch inference tips:

  • Use Spark Pandas UDFs to scale inference across the cluster (see the sketch below)
  • Separate preprocessing from inference for optimal hardware selection (CPU for ETL, GPU for inference)
  • Use Delta Lake tables for data that is read multiple times
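
A minimal sketch of the Pandas UDF pattern, assuming a Spark DataFrame df with an array column feature_vector and an ONNX model stored on DBFS; the iterator form loads the model once per task instead of once per Arrow batch:

from typing import Iterator

import numpy as np
import pandas as pd
import onnxruntime
from pyspark.sql.functions import pandas_udf

@pandas_udf("double")
def predict_udf(batches: Iterator[pd.Series]) -> Iterator[pd.Series]:
    # Load the ONNX session once per executor task, then score each Arrow batch
    session = onnxruntime.InferenceSession("/dbfs/models/model.onnx")
    input_name = session.get_inputs()[0].name
    for features in batches:
        x = np.stack(features.to_list()).astype("float32")
        preds = session.run(None, {input_name: x})[0]
        yield pd.Series(preds.ravel().astype("float64"))

scored = df.withColumn("prediction", predict_udf("feature_vector"))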

B. Online Inference Best Practices

When to use online:

  • Real-time user-facing applications
  • Low-latency requirements (<100ms)
  • Single or small-batch predictions

Azure AI Foundry Serverless API:

  • PaaS, minimal operational burden
  • Best for foundation models (Azure OpenAI)

Azure Machine Learning Managed Online Endpoints:

  • Custom models with full control
  • Autoscaling, blue/green deployment
  • Integration with Application Insights for monitoring

6. Azure OpenAI Batch API for Cost-Efficient Inference

For foundation models that do not require a real-time response:

Batch API benefits:

  • 50% lower cost than the standard API
  • 24-hour completion window
  • Support for chat completions, embeddings, and completions

Batch job creation:

from openai import OpenAI
from azure.identity import DefaultAzureCredential, get_bearer_token_provider

token_provider = get_bearer_token_provider(
    DefaultAzureCredential(),
    "https://cognitiveservices.azure.com/.default"
)

client = OpenAI(
    base_url="https://YOUR-RESOURCE.openai.azure.com/openai/v1/",
    api_key=token_provider
)

batch_response = client.batches.create(
    input_file_id=None,
    endpoint="/chat/completions",
    completion_window="24h",
    extra_body={
        "input_blob": "https://storage.blob.core.windows.net/batch-input/test.jsonl",
        "output_folder": {
            "url": "https://storage.blob.core.windows.net/batch-output"
        }
    }
)
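
The job is asynchronous, so a typical follow-up is to poll its status — a sketch; the status values follow the OpenAI Batch API, and with the blob-based pattern above the results land in the configured output_folder container:

import time

batch_id = batch_response.id
while True:
    batch_job = client.batches.retrieve(batch_id)
    if batch_job.status in ("completed", "failed", "expired", "cancelled"):
        break
    time.sleep(60)   # the completion window is 24 hours, so poll sparingly

print(batch_job.status, batch_job.request_counts)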

[Confidence: HIGH] — The Batch API is production-ready for non-latency-sensitive workloads.


Architecture Patterns

Pattern 1: Multi-Layer Caching Architecture

┌─────────────────────────────────────────────────────────────┐
│                       Client Layer                          │
└─────────────────────────────────────────────────────────────┘
                              │
                              ▼
┌─────────────────────────────────────────────────────────────┐
│                    AI Gateway (APIM)                        │
│  • Authentication, rate limiting, token caps                │
│  • Result cache (Redis) — Level 1                           │
└─────────────────────────────────────────────────────────────┘
                              │
                              ▼
┌─────────────────────────────────────────────────────────────┐
│               Intelligence Layer (Orchestrator)             │
│  • Prompt cache (Azure OpenAI) — Level 2                    │
│  • Model routing, agent coordination                        │
└─────────────────────────────────────────────────────────────┘
                              │
                              ▼
┌─────────────────────────────────────────────────────────────┐
│                    Knowledge Layer                          │
│  • Grounding snippet cache (Cosmos DB) — Level 3           │
│  • Azure AI Search, SQL, Graph                              │
└─────────────────────────────────────────────────────────────┘
                              │
                              ▼
┌─────────────────────────────────────────────────────────────┐
│                   Inferencing Layer                         │
│  • Model output cache — Level 4                             │
│  • ONNX Runtime, Azure ML endpoints                         │
└─────────────────────────────────────────────────────────────┘

Cache key strategy per layer:

  • Level 1 (Result): hash(user_id + query + model_version + prompt_version)
  • Level 2 (Prompt): automatic, based on the first 1,024 tokens + prompt_cache_key
  • Level 3 (Grounding): hash(query_embedding + user_groups + data_timestamp)
  • Level 4 (Model output): hash(input_features + model_version)

Pattern 2: ONNX-Based Cross-Platform Deployment

┌─────────────────────────────────────────────────────────────┐
│                   Training (Azure ML)                       │
│  PyTorch / TensorFlow / scikit-learn                        │
└─────────────────────────────────────────────────────────────┘
                              │
                              ▼ ONNX Export
┌─────────────────────────────────────────────────────────────┐
│                   ONNX Model Registry                       │
│  • Model versioning, metadata, governance                   │
└─────────────────────────────────────────────────────────────┘
                              │
                 ┌────────────┴────────────┐
                 ▼                         ▼
┌──────────────────────────┐   ┌──────────────────────────┐
│   Cloud Inference        │   │   Edge Inference         │
│  • Azure ML Endpoints    │   │  • Azure SQL Edge        │
│  • AKS + ONNX Runtime    │   │  • Windows ML            │
│  • GPU acceleration      │   │  • IoT Edge              │
│    (TensorRT)            │   │  • NPU acceleration      │
└──────────────────────────┘   └──────────────────────────┘

Benefits:

  • Train once, deploy everywhere
  • Framework-agnostic
  • Consistent performance optimization
  • Hardware acceleration across platforms

Pattern 3: Autoscaling Inference Architecture

┌─────────────────────────────────────────────────────────────┐
│                    Load Balancer                            │
│  (Azure Front Door / App Gateway)                           │
└─────────────────────────────────────────────────────────────┘
                              │
                              ▼
┌─────────────────────────────────────────────────────────────┐
│            azureml-fe (Inference Router)                    │
│  • Smart routing, autoscaling coordination                  │
│  • 3 instances (HA), 5K RPS capacity                        │
└─────────────────────────────────────────────────────────────┘
                              │
                 ┌────────────┴────────────┐
                 ▼                         ▼
┌──────────────────────────┐   ┌──────────────────────────┐
│  Model Pod Replicas      │   │  Model Pod Replicas      │
│  (min: 3, max: 15)       │   │  (min: 3, max: 15)       │
│  • ONNX Runtime          │   │  • ONNX Runtime          │
│  • CPU or GPU            │   │  • CPU or GPU            │
└──────────────────────────┘   └──────────────────────────┘
                              │
                              ▼
┌─────────────────────────────────────────────────────────────┐
│              Azure Monitor / App Insights                   │
│  • Metrics: latency, throughput, utilization                │
│  • Autoscaling triggers                                     │
└─────────────────────────────────────────────────────────────┘

Scaling logic:

utilization = (busy_replicas + queued_requests) / total_replicas
if utilization > 70%: scale_up()
if utilization < 50%: scale_down()  # conservative

Decision Guidance

1. Choosing an Inferencing Platform

Scenario | Recommended Platform | Rationale
Foundation models (GPT-4o, embeddings) | Azure OpenAI / AI Foundry Serverless | PaaS, automatic scaling, prompt caching
Custom ML models (scikit-learn, XGBoost) | Azure ML Managed Endpoints | Full control, autoscaling, ONNX support
High-throughput batch | Azure ML Batch Endpoints / Databricks | Cost-efficient, parallelization
Edge deployment | ONNX Runtime + Windows ML / IoT Edge | Cross-platform, hardware acceleration
Real-time inference (<50ms) | Azure ML Online Endpoints (GPU) | Low latency, high throughput
SQL-integrated inference | Azure SQL Edge (ONNX) | Native scoring, no external API calls

[Confidence: HIGH] — Based on Microsoft's official deployment guidance.


2. Choosing Compute for Inference

Model Type | Recommended Compute | Rationale
Small tabular models | CPU (Standard_DS3_v2) | Cost-efficient, sufficient performance
Deep learning vision | GPU (Standard_NC6s_v3) | Parallel processing, low latency
Large language models | GPU (Standard_NC24s_v3 or PTU) | High throughput, batch support
Batch scoring | CPU clusters (autoscale 0-N) | Cost optimization, scale to zero
Edge scenarios | NPU (Windows devices) | Power-efficient, local inference

Testing strategy:

  1. Start with a CPU baseline
  2. Test GPU for latency-critical workloads
  3. Compare cost vs performance
  4. Document the results as a baseline for re-evaluation

[Confidence: HIGH] — Standard industry practice in Azure ML.


3. Choosing a Caching Strategy

Use Case | Caching Layer | TTL | Cache Key Components
Chatbot FAQ | Result cache (Redis) | 24h | query_hash + model_version
Product catalog search | Grounding cache (Cosmos DB) | 1h | query_embedding + catalog_version
RAG knowledge retrieval | Snippet cache (Cosmos DB) | 6h | query + user_groups + doc_timestamp
GPT-4o prompts | Prompt cache (automatic) | 24h | First 1,024 tokens (automatic)
Batch predictions | Model output cache | N/A | Not recommended (one-time use)

Security checklist:

  • Cache keys include user/tenant identity for private data?
  • TTL aligns with data freshness requirements?
  • Invalidation hooks implemented for data/model updates?
  • No user-private content cached cross-user?

[Confidence: MEDIUM-HIGH] — The pattern is well documented but must be adapted per use case.


4. Online vs Batch Inference Decision Tree

Start: Do you have a real-time latency requirement (<1s)?
  │
  ├─ YES → Online Inference
  │         │
  │         ├─ Throughput <100 RPS? → Managed Online Endpoint (CPU)
  │         ├─ Throughput >100 RPS? → Managed Online Endpoint (GPU) + autoscaling
  │         └─ Need 99.9% SLA? → Multi-region deployment
  │
  └─ NO → Batch Inference
            │
            ├─ Data size <1GB? → Azure ML Batch Endpoint
            ├─ Data size >1GB? → Databricks Batch (Spark)
            └─ Foundation model? → Azure OpenAI Batch API (50% discount)

[Confidence: HIGH] — Clear decision logic based on Microsoft docs.


Integration with the Microsoft Stack

Azure Machine Learning

Deployment options:

  1. Managed Online Endpoints — Real-time inference, autoscaling, monitoring
  2. Batch Endpoints — Scheduled/on-demand batch scoring
  3. Kubernetes Endpoints — Deploy to AKS, on-prem, or edge Kubernetes

ONNX integration:

  • Export models directly from AutoML (image classification, object detection)
  • Deploy ONNX models via MLflow or a custom scoring script
  • Automatic optimization via ONNX Runtime execution providers

Monitoring:

  • Application Insights for latency, throughput, errors
  • Model performance monitoring for drift detection
  • Cost tracking per deployment

Azure AI Foundry

Serverless API:

  • Deploy foundation models without managing infrastructure
  • Automatic prompt caching for GPT-4o models
  • Pay-per-token pricing

Model Catalog:

  • Pretrained models from Hugging Face, Meta, Mistral
  • One-click deployment to serverless endpoints
  • ONNX models for cross-platform scenarios

Global Standard Deployments:

  • Cost savings vs standard deployments
  • Custom model weights may be temporarily stored outside the resource geography (be mindful of compliance)

Azure OpenAI

Deployment types:

  • Standard — Pay-per-token, regional data residency
  • Provisioned Throughput (PTU) — Reserved capacity, up to 100% discount on cached input tokens
  • Global Standard — Cost savings, global routing
  • Developer Tier — No hourly fee, no SLA (for testing)

Batch API:

  • 50% cost reduction for non-real-time workloads
  • 24-hour completion window
  • Azure Blob Storage integration

Windows ML

Edge inference scenarios:

  • Deploy ONNX models directly in Windows apps
  • NPU acceleration via Windows AI runtime
  • Execution provider discovery and registration:
import winui3.microsoft.windows.ai.machinelearning as winml

catalog = winml.ExecutionProviderCatalog.get_default()
providers = catalog.find_all_providers()

for provider in providers:
    provider.ensure_ready_async().get()
    if provider.library_path:
        ort.register_execution_provider_library(provider.name, provider.library_path)

Azure SQL Edge

Native ONNX scoring:

  • Deploy ONNX models directly in SQL Edge
  • PREDICT T-SQL function for inference
  • No external API calls, low-latency scoring
  • Ideal for IoT/edge scenarios with connectivity constraints

Databricks

Batch inference optimization:

  • Spark Pandas UDFs for distributed inference
  • Delta Lake integration for data caching
  • GPU clusters for deep learning models

Disk cache configuration:

spark.conf.set("spark.databricks.io.cache.enabled", "true")
spark.conf.set("spark.databricks.io.cache.maxDiskUsage", "50g")

Public Sector (Norway)

Compliance and Data Residency

Prompt caching compliance:

  • Azure OpenAI prompt caches are not shared between subscriptions — OK for multi-tenant scenarios within a single subscription
  • Cache lifetime: 24 hours — assess whether this is acceptable for sensitive data
  • Note that cached tokens do not affect output content — only performance/cost

Global Standard deployments:

  • Custom model weights may be temporarily stored outside the region — assess against Schrems II and data residency requirements
  • For the public sector: prefer Standard deployments (regional data residency) over Global Standard

ONNX edge deployment:

  • For edge scenarios (Azure SQL Edge, Windows ML) — data does not leave the device when the model is embedded
  • Ideal for municipalities/hospitals with connectivity constraints or privacy requirements

Cost Optimization for the Public Sector

Batch API for budget-constrained projects:

  • 50% lower cost than the real-time API
  • Suited for daily reports, batch analyses, data enrichment

Prompt caching for cost reduction:

  • Standard deployment: discount on input tokens
  • Provisioned deployment: up to 100% discount on cached tokens
  • Example: knowledge base Q&A with repetitive grounding context — large savings

Autoscaling for variable demand:

  • Set min_instances: 0 for non-critical workloads (scale to zero when idle)
  • Use target_utilization_percentage: 70 to balance cost vs responsiveness

TCO assessment:

  • Online inference: higher cost, but necessary for user-facing apps
  • Batch inference: lower cost, suited for internal analyses/reports
  • Edge inference: no inference API cost, but requires on-prem hardware

Security and Privacy

Cache security best practices:

  • NEVER cache personally identifiable data (national identity numbers, health information) without encryption and user-scoped keys
  • Implement cache_key = hash(user_id + query + model_version) for user-private content
  • Use a short TTL (5-15 min) for sensitive queries

Authorization-aware retrieval:

  • Pass Microsoft Entra group claims to the knowledge layer
  • Grounding services must enforce ACL-based filtering (see the sketch below)
  • Example: a RAG system for case documents — only return documents the user is authorized to access
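
A minimal sketch of security trimming in Azure AI Search, assuming an index with a group_ids collection field that stores the Entra group IDs allowed to read each document; field names and claim extraction are illustrative:

from azure.identity import DefaultAzureCredential
from azure.search.documents import SearchClient

search_client = SearchClient(
    endpoint="https://YOUR-SEARCH.search.windows.net",
    index_name="case-documents",
    credential=DefaultAzureCredential(),
)

def retrieve_for_user(query: str, user_group_ids: list[str]):
    # search.in() keeps only documents whose group_ids overlap the caller's Entra groups
    group_filter = "group_ids/any(g: search.in(g, '{}', ','))".format(",".join(user_group_ids))
    return list(search_client.search(search_text=query, filter=group_filter, top=5))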

Audit logging:

  • Log all cache hits/misses for compliance
  • Track which users have accessed cached results
  • Integrate with Azure Monitor for SIEM forwarding

[Confidence: MEDIUM-HIGH] — The security patterns are well documented but require careful implementation.


Cost and Licensing

Azure Machine Learning Pricing

Compute costs:

  • Managed Online Endpoints: Pay for VM uptime (even if idle) + inference requests
  • Batch Endpoints: Pay only for compute time during job execution
  • Autoscaling: Can reduce cost by scaling to zero (min_instances: 0)

Estimate (Standard_DS3_v2, 2 vCPU, 14GB RAM):

  • ~$0.192/hour per instance
  • With autoscaling (avg 5 instances, 8h/day): ~$230/month
  • Batch (4h/day): ~$92/month

Cost optimization tips:

  • Use Reserved Instances for predictable workloads (up to 72% discount)
  • Leverage Spot VMs for non-critical batch jobs (up to 90% discount)
  • Monitor idle instances and adjust min_instances

Azure OpenAI Pricing

Standard deployment:

  • Pay-per-token (input + output)
  • Prompt caching discount: reduced rate for cached input tokens (varies by model)
  • Example (GPT-4o): $5/1M input tokens, $15/1M output tokens — cached input tokens $2.50/1M (estimated)

Provisioned Throughput (PTU):

  • Fixed monthly cost based on reserved capacity
  • Up to 100% discount on cached input tokens
  • Suited for high-volume, predictable workloads

Batch API:

  • 50% lower cost than the standard API
  • Example: $2.50/1M tokens (vs $5/1M for real-time)

Cost estimation example (illustrative, using the GPT-4o rates above — see the sketch below):

  • RAG chatbot: 1M requests/month, avg 2,000 tokens/request (1,500 prompt, 500 completion)
  • Without caching: ~$15,000/month ($7,500 input + $7,500 output)
  • With a 70% prompt cache hit rate on a ~1,024-token shared prefix (50% discount on cached input): roughly $13,200/month — lower still on provisioned deployments, where cached input tokens can be discounted up to 100%
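
A back-of-the-envelope check of the figures above (a sketch; the per-token prices are the illustrative GPT-4o rates quoted earlier):

PRICE_INPUT, PRICE_CACHED_INPUT, PRICE_OUTPUT = 5.00, 2.50, 15.00   # USD per 1M tokens

requests_per_month = 1_000_000
prompt_tokens, completion_tokens = 1_500, 500
cached_prefix_tokens = 1_024   # shared prefix eligible for prompt caching

def monthly_cost(cache_hit_rate: float) -> float:
    cached = requests_per_month * cache_hit_rate * cached_prefix_tokens
    uncached = requests_per_month * prompt_tokens - cached
    output = requests_per_month * completion_tokens
    return (uncached * PRICE_INPUT + cached * PRICE_CACHED_INPUT + output * PRICE_OUTPUT) / 1_000_000

print(f"No caching:     ${monthly_cost(0.0):,.0f}/month")   # ~ $15,000
print(f"70% cache hits: ${monthly_cost(0.7):,.0f}/month")   # ~ $13,200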

Licensing

ONNX Runtime:

  • MIT License — free for commercial use
  • No licensing cost for deployment

Azure Services:

  • Azure ML, Azure OpenAI, AI Foundry: pay-per-use (no upfront license fees)
  • Windows ML: included in Windows (no additional license)

Power Platform AI:

  • AI Builder capacity: $500/month for 1M AI Builder service credits
  • Custom models (ONNX): no additional cost beyond AI Builder capacity

[Confidence: HIGH] — Pricing is transparent and well documented on azure.com.


For the Architect (Cosmo)

Typical Customer Questions

Q: "Hvorfor er inferencing så tregt sammenlignet med training?"

A: Misforståelse! Training og inferencing har ulike optimaliseringsmål. Training fokuserer på accuracy (kan ta timer/dager), mens inferencing må levere prediksjoner i sanntid (<100ms). Løsning: ONNX-konvertering, GPU-akselerasjon, caching, batch inference for ikke-latency-kritiske scenarios.

Q: "Vi har deployet en modell, men Azure ML-costs eksploderer. Hva gjør vi?"

A: Sjekk følgende:

  1. Er min_instances satt til >0 for idle endpoints? → Sett til 0 eller sllett endpoint
  2. Bruker dere GPU for enkel ML-modell? → Bytt til CPU
  3. Har dere implementert caching? → Implementer result cache (Redis) for repetitive queries
  4. Er autoscaling konfiguert? → Sett target_utilization til 70% og max_instances til realistisk verdi

Q: "Kan vi bruke samme modell i Azure ML, Power Platform og edge devices?"

A: Ja, med ONNX! Konverter modell til ONNX, deploy til:

  • Azure ML Managed Endpoints (cloud)
  • AI Builder custom models (Power Platform)
  • Azure SQL Edge (edge database)
  • Windows ML (client apps)

Q: "Hvordan balanserer vi cost vs performance?"

A: Følg denne prioriteringen:

  1. Implementer caching først — største ROI for generative AI workloads
  2. Velg riktig compute — CPU for de fleste ML-modeller, GPU kun for deep learning
  3. Batch vs online — bruk batch hvor mulig (50% lavere cost)
  4. Autoscaling — scale to zero for ikke-kritiske workloads
  5. Reserved capacity — for predictable workloads (opptil 72% discount)

Anti-Patterns to Avoid

Deploying GPU instances for simple ML models

  • Scikit-learn and XGBoost models run fine on CPU
  • GPU gives minimal speedup but 3-5x higher cost

No caching for repetitive queries

  • Example: a chatbot with FAQs — the same questions are asked over and over
  • Solution: Redis cache with a 1-hour TTL

Ignoring autoscaling (min_instances = max_instances)

  • A fixed number of instances means you pay for idle capacity
  • Solution: Set min_instances to 0-1 and max_instances to the realistic peak

Using online inference for batch workloads

  • Daily reports run through the online API → unnecessarily expensive
  • Solution: Azure ML Batch Endpoint or the Azure OpenAI Batch API

Not converting to ONNX for cross-platform deployment

  • Deploying a PyTorch model directly to the edge → large dependencies, slow inferencing
  • Solution: Convert to ONNX and deploy via Windows ML/IoT Edge

Troubleshooting Guide

Problem: High latency (>500ms) for simple predictions

Diagnostics:

  1. Check Application Insights → identify the bottleneck (network, model, preprocessing)
  2. Profile the model with azureml.core.Model.profile() → inspect CPU/memory usage
  3. Check whether the model is ONNX-converted → if not, convert it for a speedup

Problem: Autoscaling does not work

Diagnostics:

  1. Check that azureml-fe is not competing with the Kubernetes HPA → disable HPA
  2. Verify scale_settings in the deployment YAML
  3. Monitor the utilization_percentage metric → it should trigger at 70%

Problem: Low cache hit rate (<20%)

Diagnostics:

  1. Prompt caching: Are the first 1,024 tokens identical? → restructure prompts
  2. Result cache: Is the cache_key too granular? → reduce it to fewer dimensions
  3. TTL too short? → increase the TTL for static data

Problem: Out-of-memory errors on the inference endpoint

Diagnostics:

  1. Check the batch size → reduce it to avoid OOM
  2. Upgrade the VM SKU → more memory (Standard_DS3_v2 → Standard_DS4_v2)
  3. Consider model quantization → reduce the model size

Decision Framework: When to Use What

Scenario: Real-time chatbot (consumer-facing)

  • Platform: Azure OpenAI (Standard deployment)
  • Caching: Prompt caching (automatic) + Result cache (Redis, 1h TTL)
  • Compute: Serverless (automatic scaling)
  • Monitoring: Application Insights for latency/errors

Scenario: Batch document classification (internal)

  • Platform: Azure ML Batch Endpoint
  • Caching: N/A (one-time processing)
  • Compute: CPU cluster (Standard_DS3_v2, autoscale 0-10)
  • Monitoring: Job logs for throughput/errors

Scenario: Edge inference on IoT devices

  • Platform: Azure IoT Edge + ONNX Runtime
  • Caching: Local model cache (embedded on the device)
  • Compute: NPU (if available) or CPU
  • Monitoring: IoT Hub telemetry

Scenario: RAG system for a knowledge base

  • Platform: Azure AI Foundry + Azure AI Search
  • Caching: Grounding snippet cache (Cosmos DB, 6h TTL) + Prompt cache
  • Compute: Serverless (Azure OpenAI)
  • Monitoring: Cache hit rate, latency, token usage

Sources and Verification

Microsoft Learn Documentation

  1. ONNX and Azure Machine Learning https://learn.microsoft.com/en-us/azure/machine-learning/concept-onnx?view=azureml-api-2 Verified: 2026-02-04 — Complete guide to ONNX Runtime, model conversion, and deployment

  2. Prompt Caching (Azure OpenAI) https://learn.microsoft.com/en-us/azure/ai-foundry/openai/how-to/prompt-caching?view=foundry-classic Verified: 2026-02-04 — Official docs for prompt caching, supported models, pricing

  3. Application Design for AI Workloads on Azure https://learn.microsoft.com/en-us/azure/well-architected/ai/application-design Verified: 2026-02-04 — Multi-layer caching strategies, security best practices

  4. Azure Machine Learning Inference Router https://learn.microsoft.com/en-us/azure/machine-learning/how-to-kubernetes-inference-routing-azureml-fe?view=azureml-api-2 Verified: 2026-02-04 — Autoscaling, performance characteristics

  5. Best Practices for Deep Learning on Azure Databricks https://learn.microsoft.com/en-us/azure/databricks/machine-learning/train-model/dl-best-practices Verified: 2026-02-04 — Batch inference optimization, Spark Pandas UDFs

  6. Make Predictions with ONNX (AutoML) https://learn.microsoft.com/en-us/azure/machine-learning/how-to-inference-onnx-automl-image-models?view=azureml-api-2 Verified: 2026-02-04 — ONNX inference for computer vision models

  7. Sustainable AI Design for Workloads on Azure https://learn.microsoft.com/en-us/azure/well-architected/sustainability/sustainable-ai-design Verified: 2026-02-04 — Model caching for carbon reduction

  8. Azure Machine Learning Architecture Best Practices https://learn.microsoft.com/en-us/azure/well-architected/service-guides/azure-machine-learning Verified: 2026-02-04 — Performance efficiency, cost optimization

Code Samples (MCP microsoft-learn)

  • ONNX Runtime inference session creation (Python)
  • Batch inference with Azure ML SDK
  • Prompt caching response parsing
  • Autoscaling configuration (YAML)
  • Databricks disk cache configuration

Total MCP calls: 7 (docs search) + 3 (docs fetch) + 2 (code samples) = 12

Sources total: 8 Microsoft Learn articles + 15+ code examples


Summary for Cosmo

Key Takeaways:

  1. ONNX Runtime is a game-changer for cross-platform deployment and performance optimization (2x speedup on CPU)
  2. Prompt caching (Azure OpenAI) gives up to 100% discount on cached input tokens — critical for cost optimization
  3. Multi-layer caching (result → prompt → grounding → model output) is essential for production AI apps
  4. Batch inference is 50% cheaper than online, but only suited for non-latency-critical workloads
  5. Autoscaling must be configured correctly (min_instances: 0, target_utilization: 70%) to avoid waste

Recommendations to the customer:

  • Start with CPU + ONNX Runtime for ML models (unless deep learning)
  • Implement prompt caching for generative AI workloads (automatic in Azure OpenAI)
  • Use Azure ML Batch Endpoints for reports/analyses
  • Deploy ONNX models to the edge (Azure SQL Edge, Windows ML) for low-latency/privacy-critical scenarios
  • Monitor cache hit rate and autoscaling metrics continuously

Confidence level: HIGH — This reference is based on 12 MCP calls to official Microsoft documentation and code samples.

ONNX Inferencing Optimization for Computer Vision (Azure ML AutoML 2026)

ONNX (Open Neural Network Exchange) enables cross-framework interoperability and inference optimization:

Supported AutoML computer vision tasks:

  • Image classification (binary and multi-class)
  • Object detection
  • Instance segmentation

ONNX inference workflow:

  1. Download ONNX model files from AutoML training run
  2. Understand model inputs/outputs (image format requirements)
  3. Preprocess data to required input format
  4. Run inference with ONNX Runtime Python API (onnxruntime)
  5. Post-process predictions (bounding boxes for detection, masks for segmentation)

Python ONNX Runtime:

import onnxruntime as rt
sess = rt.InferenceSession("model.onnx")
# Works across languages: Python, C++, C#, Java, JavaScript
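
A minimal sketch of the preprocess → inference → post-process flow for an image classification model; the input size, normalization, and NCHW layout follow common AutoML defaults and should be confirmed against the downloaded model's metadata:

import numpy as np
import onnxruntime as rt
from PIL import Image

session = rt.InferenceSession("model.onnx")
input_name = session.get_inputs()[0].name

def preprocess(path: str, size: int = 224) -> np.ndarray:
    img = Image.open(path).convert("RGB").resize((size, size))
    x = np.asarray(img, dtype=np.float32) / 255.0
    x = (x - [0.485, 0.456, 0.406]) / [0.229, 0.224, 0.225]      # ImageNet normalization
    return x.transpose(2, 0, 1)[np.newaxis].astype(np.float32)   # NCHW batch of 1

logits = session.run(None, {input_name: preprocess("sample.jpg")})[0]
predicted_class = int(np.argmax(logits, axis=1)[0])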

Cross-platform benefits:

  • Deploy on any platform without framework dependencies
  • Reduced inference latency vs Python framework
  • Edge deployment: Azure IoT Edge, on-premises
  • Language flexibility post-export

SDK: azure-ai-ml v2 (current) — use AutoML image tasks to generate ONNX models automatically