Inferencing Optimization and Caching

Category: MLOps & GenAIOps | Date: 2026-02-04 | Author: Cosmo Skyberg, Senior Microsoft AI Solution Architect

Verified: MCP 2026-04

Introduction

Inferencing optimization and caching are critical techniques for maximizing performance and minimizing cost when AI models serve predictions in production. While model training is about achieving high accuracy, inferencing is about delivering those predictions quickly, reliably, and cost-effectively to users and systems.

What is inferencing? Inferencing (or model scoring) is the process of using a trained model to generate predictions on production data. It happens continuously after the model is deployed and can involve anything from single requests (online inference) to batch processing of large datasets.

Why is optimization critical? Even well-trained models can fail in production if they are not optimized for inferencing. Poor inferencing performance manifests as high latency, low throughput, high infrastructure costs, and a poor user experience. In the Microsoft ecosystem this is especially relevant for Azure Machine Learning, Azure AI Foundry, and embedded scenarios such as Azure SQL Edge and Windows ML.

Three pillars of inferencing optimization:

  1. Model optimization — conversion to efficient formats (ONNX), quantization, pruning
  2. Compute optimization — the right hardware acceleration (CPU vs GPU vs NPU), autoscaling, resource tuning
  3. Caching strategies — multi-layer caching to avoid redundant compute

This reference covers all three areas, focusing on Microsoft tooling and best practices for the public sector.


Core Components

1. ONNX Runtime — High-Performance Inference Engine

ONNX (Open Neural Network Exchange) is an open standard for representing machine learning models across frameworks. ONNX Runtime is Microsoft's high-performance engine for running those models in production.

Key features:

  • Cross-platform: Linux, Windows, macOS, cloud, and edge
  • Cross-framework: Supports models from TensorFlow, PyTorch, scikit-learn, Keras, MXNet, MATLAB
  • Hardware acceleration: Integrates with TensorRT (NVIDIA GPUs), OpenVINO (Intel), DirectML (Windows)
  • Production-proven: Used by Bing, Office, Azure AI — Microsoft services report an average 2x performance gain on CPU

When to use ONNX Runtime:

  • You need to deploy the same model on multiple platforms (cloud + edge)
  • You want to avoid vendor lock-in to a specific framework
  • You need maximum inferencing performance on CPU or specialized hardware
  • You are deploying models in Windows ML, Azure SQL Edge, or ML.NET

Python example — ONNX Runtime inference:

import onnxruntime

# Create the inference session
session = onnxruntime.InferenceSession("model.onnx")

# Get input/output metadata
first_input_name = session.get_inputs()[0].name
first_output_name = session.get_outputs()[0].name

# Run inferencing (input_data is a numpy array matching the model's input shape)
results = session.run(
    [first_output_name],
    {first_input_name: input_data}
)

Installation:

pip install onnxruntime       # CPU build
pip install onnxruntime-gpu   # GPU build

[Confidence: HIGH] — ONNX Runtime is mature, well documented, and actively developed by Microsoft.


2. Model Optimization Techniques

A. Model Conversion to ONNX

Converting from a native framework to ONNX lets you take advantage of ONNX Runtime's optimizations.

Converting from PyTorch:

import torch.onnx

# Set the model to inference mode
model.eval()

# Dummy input for shape tracing
dummy_input = torch.randn(1, 3, 224, 224, requires_grad=True)

# Export to ONNX
torch.onnx.export(
    model,
    dummy_input,
    "model.onnx",
    export_params=True,
    opset_version=11,
    do_constant_folding=True,  # Optimization
    input_names=['input'],
    output_names=['output'],
    dynamic_axes={'input': {0: 'batch_size'}, 'output': {0: 'batch_size'}}
)
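
A common follow-up step is to validate the export before shipping the model — a minimal sketch, assuming the onnx package is installed (pip install onnx) and that model and dummy_input from the snippet above are still in scope:

import numpy as np
import onnx
import onnxruntime

# Verify that the exported graph is structurally valid
onnx_model = onnx.load("model.onnx")
onnx.checker.check_model(onnx_model)

# Compare ONNX Runtime output with the original PyTorch output
session = onnxruntime.InferenceSession("model.onnx")
ort_output = session.run(None, {"input": dummy_input.detach().numpy()})[0]
torch_output = model(dummy_input).detach().numpy()
np.testing.assert_allclose(torch_output, ort_output, rtol=1e-3, atol=1e-5)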

Frameworks with ONNX support:

  • TensorFlow, PyTorch, scikit-learn, Keras, Chainer, MXNet, MATLAB
  • AutoML models from Azure Machine Learning (image classification, object detection)

B. Batch Inference Optimization

For AutoML models (especially vision), you can generate batch-optimized ONNX models:

# Object detection batch model parameters
inputs = {
    'model_name': 'fasterrcnn_resnet34_fpn',
    'batch_size': 8,
    'height_onnx': 600,
    'width_onnx': 800,
    'job_name': job_name,
    'task_type': 'image-object-detection',
    'min_size': 600,
    'max_size': 1333,
    'box_score_thresh': 0.3,
    'box_nms_thresh': 0.5,
    'box_detections_per_img': 100
}

[Confidence: HIGH] — Batch inference is well supported in Azure ML for both training and deployment.


3. Multi-Layer Caching Strategies

Caching is one of the most effective ways to reduce inferencing cost and latency, especially for generative AI workloads.

A. Prompt Caching (Azure OpenAI / AI Foundry)

What is prompt caching? Instead of reprocessing the same input tokens over and over, the service keeps a temporary cache of processed token computations.

Requirements to benefit from prompt caching:

  • Prompts must be at least 1,024 tokens long
  • The first 1,024 tokens must be identical
  • Cache hits are reported as cached_tokens in the response

Supported models:

  • All Azure OpenAI models GPT-4o and newer
  • Applies to chat completion, completion, responses, and real-time operations

Pricing:

  • Standard deployment: discount on input token pricing
  • Provisioned deployment: up to 100% discount on cached input tokens

Cache lifecycle:

  • Caches are cleared within 24 hours
  • Not shared between Azure subscriptions

Example response with a cache hit:

{
  "usage": {
    "completion_tokens": 1518,
    "prompt_tokens": 1566,
    "total_tokens": 3084,
    "prompt_tokens_details": {
      "cached_tokens": 1408
    }
  }
}

Optimization:

  • Structure requests so that repetitive content sits at the start of the messages array (see the sketch below)
  • Use the prompt_cache_key parameter to influence routing and improve cache hit rates
  • Be aware that exceeding ~15 requests/min with the same prefix can cause cache overflow and reduce effectiveness
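
A minimal sketch of cache-friendly request structuring, assuming an Azure OpenAI deployment named gpt-4o and illustrative placeholder strings; prompt_cache_key is passed via extra_body here because direct keyword support depends on the openai SDK version:

from openai import AzureOpenAI

client = AzureOpenAI(
    azure_endpoint="https://YOUR-RESOURCE.openai.azure.com",
    api_key="YOUR-KEY",           # or use Entra ID via azure_ad_token_provider
    api_version="2024-10-21",
)

static_instructions = "..."       # long, identical system prompt (>= 1,024 tokens to enable caching)
shared_context = "..."            # grounding context reused across requests
user_question = "How do I reset my password?"

response = client.chat.completions.create(
    model="gpt-4o",               # deployment name
    messages=[
        {"role": "system", "content": static_instructions},                             # identical prefix first
        {"role": "user", "content": f"{shared_context}\n\nQuestion: {user_question}"},  # variable part last
    ],
    extra_body={"prompt_cache_key": "faq-bot-v1"},  # keep similar requests on the same cache node
)

details = response.usage.prompt_tokens_details
print("cached_tokens:", details.cached_tokens if details else 0)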

[Confidence: HIGH] — Prompt caching is production-ready and automatically enabled for supported models.

B. Application-Layer Caching

Multi-layer caching approach for AI applications:

  1. Result and answer caching — Reuse responses for identical or semantically similar queries
  2. Retrieval and grounding snippet caching — Cache frequently retrieved knowledge fragments
  3. Model output caching — Cache intermediate outputs that can be reused

Cache key components (critical for security):

  • Tenant or user identity
  • Policy context
  • Model version
  • Prompt version

TTL policies:

  • Set expiration based on data freshness requirements
  • Shorter TTL for sensitive data
  • Longer TTL for static catalog data

Invalidation hooks:

  • Data updates
  • Model changes
  • Prompt modifications

Security considerations:

  • NEVER cache user-private content without proper scoping (a key-scoping sketch follows this list)
  • Caching works best for data that applies across multiple users
  • Example of dangerous caching: "How many hours of paid time off do I have left?" — only valid for a single user
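
A minimal sketch of a user-scoped result cache (Level 1), assuming Azure Cache for Redis and the redis-py client; all names are illustrative:

import hashlib
import json
import redis

r = redis.Redis(host="my-cache.redis.cache.windows.net", port=6380, ssl=True, password="YOUR-KEY")

MODEL_VERSION = "gpt-4o-2024-11-20"
PROMPT_VERSION = "v3"

def cache_key(tenant_id: str, user_id: str, query: str) -> str:
    # Scope the key by tenant/user identity plus model and prompt version,
    # so user-private answers are never served to other users
    raw = "|".join([tenant_id, user_id, MODEL_VERSION, PROMPT_VERSION, query.strip().lower()])
    return "result:" + hashlib.sha256(raw.encode("utf-8")).hexdigest()

def get_or_compute(tenant_id, user_id, query, compute_fn, ttl_seconds=900):
    key = cache_key(tenant_id, user_id, query)
    hit = r.get(key)
    if hit is not None:
        return json.loads(hit)
    result = compute_fn(query)                      # call the model / orchestrator on a cache miss
    r.setex(key, ttl_seconds, json.dumps(result))   # short TTL (15 min) for sensitive data
    return result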

[Confidence: MEDIUM-HIGH] — The pattern is well documented but requires careful implementation to avoid security leaks.

C. Databricks Disk Caching

For batch inference on Databricks, you can use the disk cache to improve I/O performance:

spark.conf.set("spark.databricks.io.cache.enabled", "true")
spark.conf.set("spark.databricks.io.cache.maxDiskUsage", "50g")
spark.conf.set("spark.databricks.io.cache.maxMetaDataCache", "1g")
spark.conf.set("spark.databricks.io.cache.compression.enabled", "false")

Best practice:

  • Choose cache-accelerated worker instance types
  • Be aware that the cache is lost during autoscaling (worker decommissioning)

4. Compute Resource Optimization

A. CPU vs GPU Selection

CPU inference:

  • Generelle ML-modeller (scikit-learn, XGBoost)
  • Small to medium deep learning models
  • Cost-sensitive scenarios
  • ONNX Runtime provides a 2x speedup on CPU for many workloads

GPU inference:

  • Deep learning models (transformers, CNNs)
  • High-throughput batch processing
  • Latency-critical online inference
  • Computer vision and NLP models

NPU (Neural Processing Unit):

  • Edge deployment scenarios (Windows ML)
  • Power-efficient inference on mobile/IoT devices

ONNX Runtime execution provider selection:

import onnxruntime as ort

# Automatically select an execution provider based on the MAX_EFFICIENCY policy (prefers NPU over CPU)
options = ort.SessionOptions()
options.set_provider_selection_policy(ort.OrtExecutionProviderDevicePolicy.MAX_EFFICIENCY)

session = ort.InferenceSession(model_path, sess_options=options)

B. Autoscaling for Inference Endpoints

Azure Machine Learning — Managed Online Endpoints:

Autoscaling based on Azure Monitor metrics (CPU, requests per second, latency).

Azure Kubernetes Service (AKS) — azureml-fe router:

# deployment.yaml
scale_setting:
  type: target_utilization
  min_instances: 3
  max_instances: 15
  target_utilization_percentage: 70
  polling_interval: 10

Utilization formula:

utilization_percentage = (busy_replicas + queued_requests) / total_replicas

  • Scale up: eager and fast (when utilization > 70%)
  • Scale down: conservative (~20x slower than scale up)

Performance characteristics:

  • azureml-fe can handle 5K requests/second with <3ms average latency, 15ms p99
  • For >10K RPS: increase the number of azureml-fe pods or their vCPU/memory limits

[Confidence: HIGH] — Autoscaling is production-proven in Azure ML.


5. Batch vs Online Inference Optimization

A. Batch Inference Best Practices

When to use batch:

  • Large datasets in files (no low-latency requirement)
  • Scheduled scoring (daily/weekly)
  • Cost-sensitive scenarios (batch is cheaper than online)

Azure Machine Learning Batch Endpoints:

from azure.ai.ml.entities import BatchEndpoint

endpoint = BatchEndpoint(
    name=endpoint_name,
    description="Batch inference for predictions"
)

ws_client.batch_endpoints.begin_create_or_update(endpoint)
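
Once a deployment exists on the endpoint, a scoring job can be submitted like this — a sketch only; the input path is illustrative and the invoke keyword names vary slightly between azure-ai-ml versions:

from azure.ai.ml import Input
from azure.ai.ml.constants import AssetTypes

# ws_client is the same MLClient used above
job = ws_client.batch_endpoints.invoke(
    endpoint_name=endpoint_name,
    input=Input(type=AssetTypes.URI_FOLDER,
                path="azureml://datastores/workspaceblobstore/paths/batch-input/"),
)
ws_client.jobs.stream(job.name)   # follow the scoring job until it completes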

Parallel processing optimization:

from azure.ai.ml import Input, Output
from azure.ai.ml.constants import AssetTypes
from azure.ai.ml.parallel import parallel_run_function, RunFunction

file_batch_inference = parallel_run_function(
    name="batch_score",
    inputs=dict(job_data_path=Input(type=AssetTypes.MLTABLE)),
    outputs=dict(job_output_path=Output(type=AssetTypes.MLTABLE)),
    input_data="${{inputs.job_data_path}}",
    instance_count=2,
    max_concurrency_per_instance=1,
    mini_batch_size="1",
    task=RunFunction(
        code="./src",
        entry_script="batch_inference.py",
        environment="azureml://registries/azureml/environments/sklearn-1.5/labels/latest"
    )
)

Databricks batch inference tips:

  • Use Spark Pandas UDFs to scale inference across the cluster (see the sketch below)
  • Separate preprocessing from inference for optimal hardware selection (CPU for ETL, GPU for inference)
  • Use Delta Lake tables for data that is read multiple times
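
A minimal sketch of the Pandas UDF pattern, assuming a Spark DataFrame df with an array column feature_vector and an ONNX model stored on DBFS; the iterator form loads the model once per task instead of once per Arrow batch:

from typing import Iterator

import numpy as np
import pandas as pd
import onnxruntime
from pyspark.sql.functions import pandas_udf

@pandas_udf("double")
def predict_udf(batches: Iterator[pd.Series]) -> Iterator[pd.Series]:
    # Load the ONNX session once per executor task, then score each Arrow batch
    session = onnxruntime.InferenceSession("/dbfs/models/model.onnx")
    input_name = session.get_inputs()[0].name
    for features in batches:
        x = np.stack(features.to_list()).astype("float32")
        preds = session.run(None, {input_name: x})[0]
        yield pd.Series(preds.ravel().astype("float64"))

scored = df.withColumn("prediction", predict_udf("feature_vector"))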

B. Online Inference Best Practices

When to use online:

  • Real-time user-facing applications
  • Low-latency requirements (<100ms)
  • Single or small-batch predictions

Azure AI Foundry Serverless API:

  • PaaS, minimal operational burden
  • Best for foundation models (Azure OpenAI)

Azure Machine Learning Managed Online Endpoints:

  • Custom models with full control
  • Autoscaling, blue/green deployment
  • Integration with Application Insights for monitoring

6. Azure OpenAI Batch API for Cost-Efficient Inference

For foundation models that do not require a real-time response:

Batch API benefits:

  • 50% lower cost than the standard API
  • 24-hour completion window
  • Support for chat completions, embeddings, and completions

Batch job creation:

from openai import OpenAI
from azure.identity import DefaultAzureCredential, get_bearer_token_provider

token_provider = get_bearer_token_provider(
    DefaultAzureCredential(),
    "https://cognitiveservices.azure.com/.default"
)

client = OpenAI(
    base_url="https://YOUR-RESOURCE.openai.azure.com/openai/v1/",
    api_key=token_provider
)

batch_response = client.batches.create(
    input_file_id=None,
    endpoint="/chat/completions",
    completion_window="24h",
    extra_body={
        "input_blob": "https://storage.blob.core.windows.net/batch-input/test.jsonl",
        "output_folder": {
            "url": "https://storage.blob.core.windows.net/batch-output"
        }
    }
)
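
The job is asynchronous, so a typical follow-up is to poll its status — a sketch; the status values follow the OpenAI Batch API, and with the blob-based pattern above the results land in the configured output_folder container:

import time

batch_id = batch_response.id
while True:
    batch_job = client.batches.retrieve(batch_id)
    if batch_job.status in ("completed", "failed", "expired", "cancelled"):
        break
    time.sleep(60)   # the completion window is 24 hours, so poll sparingly

print(batch_job.status, batch_job.request_counts)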

[Confidence: HIGH] — The Batch API is production-ready for non-latency-sensitive workloads.


Architecture Patterns

Pattern 1: Multi-Layer Caching Architecture

┌─────────────────────────────────────────────────────────────┐
│                       Client Layer                          │
└─────────────────────────────────────────────────────────────┘
                              │
                              ▼
┌─────────────────────────────────────────────────────────────┐
│                    AI Gateway (APIM)                        │
│  • Authentication, rate limiting, token caps                │
│  • Result cache (Redis) — Level 1                           │
└─────────────────────────────────────────────────────────────┘
                              │
                              ▼
┌─────────────────────────────────────────────────────────────┐
│               Intelligence Layer (Orchestrator)             │
│  • Prompt cache (Azure OpenAI) — Level 2                    │
│  • Model routing, agent coordination                        │
└─────────────────────────────────────────────────────────────┘
                              │
                              ▼
┌─────────────────────────────────────────────────────────────┐
│                    Knowledge Layer                          │
│  • Grounding snippet cache (Cosmos DB) — Level 3           │
│  • Azure AI Search, SQL, Graph                              │
└─────────────────────────────────────────────────────────────┘
                              │
                              ▼
┌─────────────────────────────────────────────────────────────┐
│                   Inferencing Layer                         │
│  • Model output cache — Level 4                             │
│  • ONNX Runtime, Azure ML endpoints                         │
└─────────────────────────────────────────────────────────────┘

Cache key strategy per layer:

  • Level 1 (Result): hash(user_id + query + model_version + prompt_version)
  • Level 2 (Prompt): automatic, based on the first 1,024 tokens + prompt_cache_key
  • Level 3 (Grounding): hash(query_embedding + user_groups + data_timestamp)
  • Level 4 (Model output): hash(input_features + model_version)

Pattern 2: ONNX-Based Cross-Platform Deployment

┌─────────────────────────────────────────────────────────────┐
│                   Training (Azure ML)                       │
│  PyTorch / TensorFlow / scikit-learn                        │
└─────────────────────────────────────────────────────────────┘
                              │
                              ▼ ONNX Export
┌─────────────────────────────────────────────────────────────┐
│                   ONNX Model Registry                       │
│  • Model versioning, metadata, governance                   │
└─────────────────────────────────────────────────────────────┘
                              │
                 ┌────────────┴────────────┐
                 ▼                         ▼
┌──────────────────────────┐   ┌──────────────────────────┐
│   Cloud Inference        │   │   Edge Inference         │
│  • Azure ML Endpoints    │   │  • Azure SQL Edge        │
│  • AKS + ONNX Runtime    │   │  • Windows ML            │
│  • GPU acceleration      │   │  • IoT Edge              │
│    (TensorRT)            │   │  • NPU acceleration      │
└──────────────────────────┘   └──────────────────────────┘

Benefits:

  • Train once, deploy everywhere
  • Framework-agnostic
  • Consistent performance optimization
  • Hardware acceleration across platforms

Pattern 3: Autoscaling Inference Architecture

┌─────────────────────────────────────────────────────────────┐
│                    Load Balancer                            │
│  (Azure Front Door / App Gateway)                           │
└─────────────────────────────────────────────────────────────┘
                              │
                              ▼
┌─────────────────────────────────────────────────────────────┐
│            azureml-fe (Inference Router)                    │
│  • Smart routing, autoscaling coordination                  │
│  • 3 instances (HA), 5K RPS capacity                        │
└─────────────────────────────────────────────────────────────┘
                              │
                 ┌────────────┴────────────┐
                 ▼                         ▼
┌──────────────────────────┐   ┌──────────────────────────┐
│  Model Pod Replicas      │   │  Model Pod Replicas      │
│  (min: 3, max: 15)       │   │  (min: 3, max: 15)       │
│  • ONNX Runtime          │   │  • ONNX Runtime          │
│  • CPU or GPU            │   │  • CPU or GPU            │
└──────────────────────────┘   └──────────────────────────┘
                              │
                              ▼
┌─────────────────────────────────────────────────────────────┐
│              Azure Monitor / App Insights                   │
│  • Metrics: latency, throughput, utilization                │
│  • Autoscaling triggers                                     │
└─────────────────────────────────────────────────────────────┘

Scaling logic:

utilization = (busy_replicas + queued_requests) / total_replicas
if utilization > 70%: scale_up()
if utilization < 50%: scale_down()  # conservative

Decision Guidance

1. Choosing an Inferencing Platform

Scenario | Recommended Platform | Rationale
Foundation models (GPT-4o, embeddings) | Azure OpenAI / AI Foundry Serverless | PaaS, automatic scaling, prompt caching
Custom ML models (scikit-learn, XGBoost) | Azure ML Managed Endpoints | Full control, autoscaling, ONNX support
High-throughput batch | Azure ML Batch Endpoints / Databricks | Cost-efficient, parallelization
Edge deployment | ONNX Runtime + Windows ML / IoT Edge | Cross-platform, hardware acceleration
Real-time inference (<50ms) | Azure ML Online Endpoints (GPU) | Low latency, high throughput
SQL-integrated inference | Azure SQL Edge (ONNX) | Native scoring, no external API calls

[Confidence: HIGH] — Based on Microsoft's official deployment guidance.


2. Choosing Compute for Inference

Model Type | Recommended Compute | Rationale
Small tabular models | CPU (Standard_DS3_v2) | Cost-efficient, sufficient performance
Deep learning vision | GPU (Standard_NC6s_v3) | Parallel processing, low latency
Large language models | GPU (Standard_NC24s_v3 or PTU) | High throughput, batch support
Batch scoring | CPU clusters (autoscale 0-N) | Cost optimization, scale to zero
Edge scenarios | NPU (Windows devices) | Power-efficient, local inference

Testing strategy:

  1. Start with a CPU baseline
  2. Test GPU for latency-critical workloads
  3. Compare cost vs performance
  4. Document the results as a baseline for re-evaluation

[Confidence: HIGH] — Standard industry practice in Azure ML.


3. Choosing a Caching Strategy

Use Case | Caching Layer | TTL | Cache Key Components
Chatbot FAQ | Result cache (Redis) | 24h | query_hash + model_version
Product catalog search | Grounding cache (Cosmos DB) | 1h | query_embedding + catalog_version
RAG knowledge retrieval | Snippet cache (Cosmos DB) | 6h | query + user_groups + doc_timestamp
GPT-4o prompts | Prompt cache (automatic) | 24h | First 1,024 tokens (automatic)
Batch predictions | Model output cache | N/A | Not recommended (one-time use)

Security checklist:

  • Cache keys include user/tenant identity for private data?
  • TTL aligns with data freshness requirements?
  • Invalidation hooks implemented for data/model updates?
  • No user-private content cached cross-user?

[Confidence: MEDIUM-HIGH] — The pattern is well documented but must be adapted per use case.


4. Online vs Batch Inference Decision Tree

Start: Do you have a real-time latency requirement (<1s)?
  │
  ├─ YES → Online Inference
  │         │
  │         ├─ Throughput <100 RPS? → Managed Online Endpoint (CPU)
  │         ├─ Throughput >100 RPS? → Managed Online Endpoint (GPU) + autoscaling
  │         └─ Need 99.9% SLA? → Multi-region deployment
  │
  └─ NO → Batch Inference
            │
            ├─ Data size <1GB? → Azure ML Batch Endpoint
            ├─ Data size >1GB? → Databricks Batch (Spark)
            └─ Foundation model? → Azure OpenAI Batch API (50% discount)

[Confidence: HIGH] — Clear decision logic based on Microsoft docs.


Integration with the Microsoft Stack

Azure Machine Learning

Deployment options:

  1. Managed Online Endpoints — Real-time inference, autoscaling, monitoring
  2. Batch Endpoints — Scheduled/on-demand batch scoring
  3. Kubernetes Endpoints — Deploy to AKS, on-prem, or edge Kubernetes

ONNX integration:

  • Export models directly from AutoML (image classification, object detection)
  • Deploy ONNX models via MLflow or a custom scoring script
  • Automatic optimization via ONNX Runtime execution providers

Monitoring:

  • Application Insights for latency, throughput, errors
  • Model performance monitoring for drift detection
  • Cost tracking per deployment

Azure AI Foundry

Serverless API:

  • Deploy foundation models without managing infrastructure
  • Automatic prompt caching for GPT-4o models
  • Pay-per-token pricing

Model Catalog:

  • Pretrained models from Hugging Face, Meta, Mistral
  • One-click deployment to serverless endpoints
  • ONNX models for cross-platform scenarios

Global Standard Deployments:

  • Cost savings vs standard deployments
  • Custom model weights may be temporarily stored outside the resource geography (be mindful of compliance)

Azure OpenAI

Deployment types:

  • Standard — Pay-per-token, regional data residency
  • Provisioned Throughput (PTU) — Reserved capacity, up to 100% discount on cached input tokens
  • Global Standard — Cost savings, global routing
  • Developer Tier — No hourly fee, no SLA (for testing)

Batch API:

  • 50% cost reduction for non-real-time workloads
  • 24-hour completion window
  • Azure Blob Storage integration

Windows ML

Edge inference scenarios:

  • Deploy ONNX models directly in Windows apps
  • NPU acceleration via Windows AI runtime
  • Execution provider discovery and registration:
import winui3.microsoft.windows.ai.machinelearning as winml

catalog = winml.ExecutionProviderCatalog.get_default()
providers = catalog.find_all_providers()

for provider in providers:
    provider.ensure_ready_async().get()
    if provider.library_path:
        ort.register_execution_provider_library(provider.name, provider.library_path)

Azure SQL Edge

Native ONNX scoring:

  • Deploy ONNX models directly in SQL Edge
  • PREDICT T-SQL function for inference
  • No external API calls, low-latency scoring
  • Ideal for IoT/edge scenarios with connectivity constraints

Databricks

Batch inference optimization:

  • Spark Pandas UDFs for distributed inference
  • Delta Lake integration for data caching
  • GPU clusters for deep learning models

Disk cache configuration:

spark.conf.set("spark.databricks.io.cache.enabled", "true")
spark.conf.set("spark.databricks.io.cache.maxDiskUsage", "50g")

Public Sector (Norway)

Compliance and Data Residency

Prompt caching compliance:

  • Azure OpenAI prompt caches are not shared between subscriptions — OK for multi-tenant scenarios within a single subscription
  • Cache lifetime: 24 hours — assess whether this is acceptable for sensitive data
  • Note that cached tokens do not affect output content — only performance/cost

Global Standard deployments:

  • Custom model weights may be temporarily stored outside the region — assess against Schrems II and data residency requirements
  • For the public sector: prefer Standard deployments (regional data residency) over Global Standard

ONNX edge deployment:

  • For edge scenarios (Azure SQL Edge, Windows ML) — data does not leave the device when the model is embedded
  • Ideal for municipalities/hospitals with connectivity constraints or privacy requirements

Cost Optimization for the Public Sector

Batch API for budget-constrained projects:

  • 50% lower cost than the real-time API
  • Suited for daily reports, batch analyses, data enrichment

Prompt caching for cost reduction:

  • Standard deployment: discount on input tokens
  • Provisioned deployment: up to 100% discount on cached tokens
  • Example: knowledge base Q&A with repetitive grounding context — large savings

Autoscaling for variable demand:

  • Set min_instances: 0 for non-critical workloads (scale to zero when idle)
  • Use target_utilization_percentage: 70 to balance cost vs responsiveness

TCO assessment:

  • Online inference: higher cost, but necessary for user-facing apps
  • Batch inference: lower cost, suited for internal analyses/reports
  • Edge inference: no inference API cost, but requires on-prem hardware

Security and Privacy

Cache security best practices:

  • NEVER cache personally identifiable data (national identity numbers, health information) without encryption and user-scoped keys
  • Implement cache_key = hash(user_id + query + model_version) for user-private content
  • Use a short TTL (5-15 min) for sensitive queries

Authorization-aware retrieval:

  • Pass Microsoft Entra group claims to the knowledge layer
  • Grounding services must enforce ACL-based filtering (see the sketch below)
  • Example: a RAG system for case documents — only return documents the user is authorized to access
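
A minimal sketch of security trimming in Azure AI Search, assuming an index with a group_ids collection field that stores the Entra group IDs allowed to read each document; field names and claim extraction are illustrative:

from azure.identity import DefaultAzureCredential
from azure.search.documents import SearchClient

search_client = SearchClient(
    endpoint="https://YOUR-SEARCH.search.windows.net",
    index_name="case-documents",
    credential=DefaultAzureCredential(),
)

def retrieve_for_user(query: str, user_group_ids: list[str]):
    # search.in() keeps only documents whose group_ids overlap the caller's Entra groups
    group_filter = "group_ids/any(g: search.in(g, '{}', ','))".format(",".join(user_group_ids))
    return list(search_client.search(search_text=query, filter=group_filter, top=5))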

Audit logging:

  • Log all cache hits/misses for compliance
  • Track which users have accessed cached results
  • Integrate with Azure Monitor for SIEM forwarding

[Confidence: MEDIUM-HIGH] — The security patterns are well documented but require careful implementation.


Cost and Licensing

Azure Machine Learning Pricing

Compute costs:

  • Managed Online Endpoints: Pay for VM uptime (even if idle) + inference requests
  • Batch Endpoints: Pay only for compute time during job execution
  • Autoscaling: Can reduce cost by scaling to zero (min_instances: 0)

Estimate (Standard_DS3_v2, 2 vCPU, 14GB RAM):

  • ~$0.192/hour per instance
  • With autoscaling (avg 5 instances, 8h/day): ~$230/month
  • Batch (4h/day): ~$92/month

Cost optimization tips:

  • Use Reserved Instances for predictable workloads (up to 72% discount)
  • Leverage Spot VMs for non-critical batch jobs (up to 90% discount)
  • Monitor idle instances and adjust min_instances

Azure OpenAI Pricing

Standard deployment:

  • Pay-per-token (input + output)
  • Prompt caching discount: reduced rate for cached input tokens (varies by model)
  • Example (GPT-4o): $5/1M input tokens, $15/1M output tokens — cached input tokens $2.50/1M (estimated)

Provisioned Throughput (PTU):

  • Fixed monthly cost based on reserved capacity
  • Up to 100% discount on cached input tokens
  • Suited for high-volume, predictable workloads

Batch API:

  • 50% lower cost than the standard API
  • Example: $2.50/1M tokens (vs $5/1M for real-time)

Cost estimation example (illustrative, using the GPT-4o rates above — see the sketch below):

  • RAG chatbot: 1M requests/month, avg 2,000 tokens/request (1,500 prompt, 500 completion)
  • Without caching: ~$15,000/month ($7,500 input + $7,500 output)
  • With a 70% prompt cache hit rate on a ~1,024-token shared prefix (50% discount on cached input): roughly $13,200/month — lower still on provisioned deployments, where cached input tokens can be discounted up to 100%
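
A back-of-the-envelope check of the figures above (a sketch; the per-token prices are the illustrative GPT-4o rates quoted earlier):

PRICE_INPUT, PRICE_CACHED_INPUT, PRICE_OUTPUT = 5.00, 2.50, 15.00   # USD per 1M tokens

requests_per_month = 1_000_000
prompt_tokens, completion_tokens = 1_500, 500
cached_prefix_tokens = 1_024   # shared prefix eligible for prompt caching

def monthly_cost(cache_hit_rate: float) -> float:
    cached = requests_per_month * cache_hit_rate * cached_prefix_tokens
    uncached = requests_per_month * prompt_tokens - cached
    output = requests_per_month * completion_tokens
    return (uncached * PRICE_INPUT + cached * PRICE_CACHED_INPUT + output * PRICE_OUTPUT) / 1_000_000

print(f"No caching:     ${monthly_cost(0.0):,.0f}/month")   # ~ $15,000
print(f"70% cache hits: ${monthly_cost(0.7):,.0f}/month")   # ~ $13,200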

Licensing

ONNX Runtime:

  • MIT License — free for commercial use
  • No licensing cost for deployment

Azure Services:

  • Azure ML, Azure OpenAI, AI Foundry: pay-per-use (no upfront license fees)
  • Windows ML: included in Windows (no additional license)

Power Platform AI:

  • AI Builder capacity: $500/month for 1M AI Builder service credits
  • Custom models (ONNX): no additional cost beyond AI Builder capacity

[Confidence: HIGH] — Pricing is transparent and well documented on azure.com.


For the Architect (Cosmo)

Typical Customer Questions

Q: "Hvorfor er inferencing så tregt sammenlignet med training?"

A: Misforståelse! Training og inferencing har ulike optimaliseringsmål. Training fokuserer på accuracy (kan ta timer/dager), mens inferencing må levere prediksjoner i sanntid (<100ms). Løsning: ONNX-konvertering, GPU-akselerasjon, caching, batch inference for ikke-latency-kritiske scenarios.

Q: "Vi har deployet en modell, men Azure ML-costs eksploderer. Hva gjør vi?"

A: Sjekk følgende:

  1. Er min_instances satt til >0 for idle endpoints? → Sett til 0 eller sllett endpoint
  2. Bruker dere GPU for enkel ML-modell? → Bytt til CPU
  3. Har dere implementert caching? → Implementer result cache (Redis) for repetitive queries
  4. Er autoscaling konfiguert? → Sett target_utilization til 70% og max_instances til realistisk verdi

Q: "Kan vi bruke samme modell i Azure ML, Power Platform og edge devices?"

A: Ja, med ONNX! Konverter modell til ONNX, deploy til:

  • Azure ML Managed Endpoints (cloud)
  • AI Builder custom models (Power Platform)
  • Azure SQL Edge (edge database)
  • Windows ML (client apps)

Q: "Hvordan balanserer vi cost vs performance?"

A: Følg denne prioriteringen:

  1. Implementer caching først — største ROI for generative AI workloads
  2. Velg riktig compute — CPU for de fleste ML-modeller, GPU kun for deep learning
  3. Batch vs online — bruk batch hvor mulig (50% lavere cost)
  4. Autoscaling — scale to zero for ikke-kritiske workloads
  5. Reserved capacity — for predictable workloads (opptil 72% discount)

Anti-Patterns to Avoid

Deploying GPU instances for simple ML models

  • Scikit-learn and XGBoost models run fine on CPU
  • GPU gives minimal speedup but 3-5x higher cost

No caching for repetitive queries

  • Example: a chatbot with FAQs — the same questions are asked over and over
  • Solution: Redis cache with a 1-hour TTL

Ignoring autoscaling (min_instances = max_instances)

  • A fixed number of instances means you pay for idle capacity
  • Solution: Set min_instances to 0-1 and max_instances to the realistic peak

Using online inference for batch workloads

  • Daily reports run through the online API → unnecessarily expensive
  • Solution: Azure ML Batch Endpoint or the Azure OpenAI Batch API

Not converting to ONNX for cross-platform deployment

  • Deploying a PyTorch model directly to the edge → large dependencies, slow inferencing
  • Solution: Convert to ONNX and deploy via Windows ML/IoT Edge

Troubleshooting Guide

Problem: High latency (>500ms) for simple predictions

Diagnostics:

  1. Check Application Insights → identify the bottleneck (network, model, preprocessing)
  2. Profile the model with azureml.core.Model.profile() → inspect CPU/memory usage
  3. Check whether the model is ONNX-converted → if not, convert it for a speedup

Problem: Autoscaling does not work

Diagnostics:

  1. Check that azureml-fe is not competing with the Kubernetes HPA → disable HPA
  2. Verify scale_settings in the deployment YAML
  3. Monitor the utilization_percentage metric → it should trigger at 70%

Problem: Low cache hit rate (<20%)

Diagnostics:

  1. Prompt caching: Are the first 1,024 tokens identical? → restructure prompts
  2. Result cache: Is the cache_key too granular? → reduce it to fewer dimensions
  3. TTL too short? → increase the TTL for static data

Problem: Out-of-memory errors on the inference endpoint

Diagnostics:

  1. Check the batch size → reduce it to avoid OOM
  2. Upgrade the VM SKU → more memory (Standard_DS3_v2 → Standard_DS4_v2)
  3. Consider model quantization → reduce the model size

Decision Framework: When to Use What

Scenario: Real-time chatbot (consumer-facing)

  • Platform: Azure OpenAI (Standard deployment)
  • Caching: Prompt caching (automatic) + Result cache (Redis, 1h TTL)
  • Compute: Serverless (automatic scaling)
  • Monitoring: Application Insights for latency/errors

Scenario: Batch document classification (internal)

  • Platform: Azure ML Batch Endpoint
  • Caching: N/A (one-time processing)
  • Compute: CPU cluster (Standard_DS3_v2, autoscale 0-10)
  • Monitoring: Job logs for throughput/errors

Scenario: Edge inference on IoT devices

  • Platform: Azure IoT Edge + ONNX Runtime
  • Caching: Local model cache (embedded on the device)
  • Compute: NPU (if available) or CPU
  • Monitoring: IoT Hub telemetry

Scenario: RAG system for a knowledge base

  • Platform: Azure AI Foundry + Azure AI Search
  • Caching: Grounding snippet cache (Cosmos DB, 6h TTL) + Prompt cache
  • Compute: Serverless (Azure OpenAI)
  • Monitoring: Cache hit rate, latency, token usage

Sources and Verification

Microsoft Learn Documentation

  1. ONNX and Azure Machine Learning https://learn.microsoft.com/en-us/azure/machine-learning/concept-onnx?view=azureml-api-2 Verified: 2026-02-04 — Complete guide to ONNX Runtime, model conversion, and deployment

  2. Prompt Caching (Azure OpenAI) https://learn.microsoft.com/en-us/azure/ai-foundry/openai/how-to/prompt-caching?view=foundry-classic Verified: 2026-02-04 — Official docs for prompt caching, supported models, pricing

  3. Application Design for AI Workloads on Azure https://learn.microsoft.com/en-us/azure/well-architected/ai/application-design Verified: 2026-02-04 — Multi-layer caching strategies, security best practices

  4. Azure Machine Learning Inference Router https://learn.microsoft.com/en-us/azure/machine-learning/how-to-kubernetes-inference-routing-azureml-fe?view=azureml-api-2 Verified: 2026-02-04 — Autoscaling, performance characteristics

  5. Best Practices for Deep Learning on Azure Databricks https://learn.microsoft.com/en-us/azure/databricks/machine-learning/train-model/dl-best-practices Verified: 2026-02-04 — Batch inference optimization, Spark Pandas UDFs

  6. Make Predictions with ONNX (AutoML) https://learn.microsoft.com/en-us/azure/machine-learning/how-to-inference-onnx-automl-image-models?view=azureml-api-2 Verified: 2026-02-04 — ONNX inference for computer vision models

  7. Sustainable AI Design for Workloads on Azure https://learn.microsoft.com/en-us/azure/well-architected/sustainability/sustainable-ai-design Verified: 2026-02-04 — Model caching for carbon reduction

  8. Azure Machine Learning Architecture Best Practices https://learn.microsoft.com/en-us/azure/well-architected/service-guides/azure-machine-learning Verified: 2026-02-04 — Performance efficiency, cost optimization

Code Samples (MCP microsoft-learn)

  • ONNX Runtime inference session creation (Python)
  • Batch inference with Azure ML SDK
  • Prompt caching response parsing
  • Autoscaling configuration (YAML)
  • Databricks disk cache configuration

Total MCP calls: 7 (docs search) + 3 (docs fetch) + 2 (code samples) = 12

Sources total: 8 Microsoft Learn articles + 15+ code examples


Summary for Cosmo

Key Takeaways:

  1. ONNX Runtime is a game-changer for cross-platform deployment and performance optimization (2x speedup on CPU)
  2. Prompt caching (Azure OpenAI) gives up to 100% discount on cached input tokens — critical for cost optimization
  3. Multi-layer caching (result → prompt → grounding → model output) is essential for production AI apps
  4. Batch inference is 50% cheaper than online, but only suited for non-latency-critical workloads
  5. Autoscaling must be configured correctly (min_instances: 0, target_utilization: 70%) to avoid waste

Recommendations to the customer:

  • Start with CPU + ONNX Runtime for ML models (unless deep learning)
  • Implement prompt caching for generative AI workloads (automatic in Azure OpenAI)
  • Use Azure ML Batch Endpoints for reports/analyses
  • Deploy ONNX models to the edge (Azure SQL Edge, Windows ML) for low-latency/privacy-critical scenarios
  • Monitor cache hit rate and autoscaling metrics continuously

Confidence level: HIGH — This reference is based on 12 MCP calls to official Microsoft documentation and code samples.

ONNX Inferencing Optimization for Computer Vision (Azure ML AutoML 2026)

ONNX (Open Neural Network Exchange) enables cross-framework interoperability and inference optimization:

Supported AutoML computer vision tasks:

  • Image classification (binary and multi-class)
  • Object detection
  • Instance segmentation

ONNX inference workflow:

  1. Download ONNX model files from AutoML training run
  2. Understand model inputs/outputs (image format requirements)
  3. Preprocess data to required input format
  4. Run inference with ONNX Runtime Python API (onnxruntime)
  5. Post-process predictions (bounding boxes for detection, masks for segmentation)

Python ONNX Runtime:

import onnxruntime as rt
sess = rt.InferenceSession("model.onnx")
# Works across languages: Python, C++, C#, Java, JavaScript
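
A minimal sketch of the preprocess → inference → post-process flow for an image classification model; the input size, normalization, and NCHW layout follow common AutoML defaults and should be confirmed against the downloaded model's metadata:

import numpy as np
import onnxruntime as rt
from PIL import Image

session = rt.InferenceSession("model.onnx")
input_name = session.get_inputs()[0].name

def preprocess(path: str, size: int = 224) -> np.ndarray:
    img = Image.open(path).convert("RGB").resize((size, size))
    x = np.asarray(img, dtype=np.float32) / 255.0
    x = (x - [0.485, 0.456, 0.406]) / [0.229, 0.224, 0.225]      # ImageNet normalization
    return x.transpose(2, 0, 1)[np.newaxis].astype(np.float32)   # NCHW batch of 1

logits = session.run(None, {input_name: preprocess("sample.jpg")})[0]
predicted_class = int(np.argmax(logits, axis=1)[0])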

Cross-platform benefits:

  • Deploy on any platform without framework dependencies
  • Reduced inference latency vs Python framework
  • Edge deployment: Azure IoT Edge, on-premises
  • Language flexibility post-export

SDK: azure-ai-ml v2 (current) — use AutoML image tasks to generate ONNX models automatically