Kjell Tore Guttormsen ff6a50d14f docs(architect): weekly KB update — 106 files refreshed (2026-04)

Updates across all 5 skills: ms-ai-advisor, ms-ai-engineering,
ms-ai-governance, ms-ai-security, ms-ai-infrastructure.

Key changes:
- Language Services (Custom Text Classification, Text Analytics, QnA):
  retirement warning 2029-03-31, migration guides to Foundry/GPT-4o
- Agentic Retrieval: 50M free reasoning tokens/month (Public Preview)
- Computer Use: Claude Sonnet 4.5 (preview) + OpenAI CUA models
- Agent Registry: Risks column (M365 E7), user-shared/org-published types
- Declarative agents: schema v1.5 → v1.6, Store validation requirements
- MLflow 3: 13 built-in LLM judges, production monitoring, Genie Code
- AG-UI HITL: ApprovalRequiredAIFunction (C#) + @tool(approval_mode) (Python)
- Entra ID Ignite 2025: Agent ID Admin/Developer RBAC roles, Conditional Access
- Security Copilot: 400 SCU/month per 1000 M365 E5 licenses, auto-provisioned
- Fast Transcription API: phrase lists, 14-language multi-lingual transcription
- Azure Monitor Workbooks: Bicep support, RBAC specifics
- Power Platform Copilot: data residency (Norway/Europe → EU DB, Bing → USA)
- RAG security-rbac: 4-approach table (GA + 3 preview access control methods)
- IaC MLOps: Well-Architected OE:05 principles, Bicep/Terraform patterns
- Translator: image file batch translation Preview (JPEG/PNG/BMP/WebP)

All 106 files: Last updated 2026-04 | Verified: MCP 2026-04

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

2026-04-10 09:13:24 +02:00


RAG Caching and Performance Optimization

Last updated: 2026-04 | Verified: MCP 2026-04 | Status: GA | Category: RAG Architecture & Semantic Search

Introduction

Caching is a critical optimization strategy for RAG (Retrieval-Augmented Generation) applications and can dramatically reduce both latency and cost. In typical RAG scenarios, calls to LLMs are often the most expensive and time-consuming operation, especially when large amounts of context data and chat history are sent with every request. A well-designed caching strategy can reduce the number of LLM invocations by up to 90% in high-traffic scenarios with repetitive or semantically similar queries.

The multi-layer caching approach covers several levels of the RAG architecture: result caching (complete LLM responses), retrieval caching (knowledge fragments from vector search), embedding caching (precomputed vector representations), and semantic caching (semantically similar prompts). Each layer addresses a different aspect of performance and cost optimization.

The Microsoft stack offers several services optimized for AI workloads: Azure Cache for Redis (traditional and semantic caching), Azure Cosmos DB (semantic cache with vector search), Azure AI Search (built-in caching of search results), and Azure API Management (semantic caching for LLM APIs). The choice of solution depends on cache type, scale requirements, and compliance requirements.

Core components

Multi-layer caching strategy

Cache Layer	Purpose	Typical Hit Rate	Latency Impact
Result caching	Cache complete LLM responses for identical or semantically similar queries	30-60% (high-traffic)	-80% to -95%
Retrieval caching	Cache knowledge fragments from vector search	40-70%	-50% to -70%
Embedding caching	Cache precomputed embeddings	60-90%	-30% to -50%
Model output caching	Cache intermediate model outputs	20-40%	-40% to -60%

Verified (Microsoft Learn - Application design for AI workloads)

Cache Key Components

Effective cache keys must include:

Tenant/User identity: for multi-tenant security
Policy context: RBAC and data access policies
Model version: avoids stale responses after model updates
Prompt version: tracks prompt engineering changes
Context window: chat history for contextual relevance

Verified (Microsoft Learn - Multi-layer caching strategies)
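The key components listed above can be combined into one deterministic key. The following is a minimal sketch using only the Python standard library. The function name `build_cache_key`, the `rag:` prefix, and the field layout are illustrative assumptions, not a format prescribed by Microsoft Learn.

```python
import hashlib
import json


def build_cache_key(tenant_id, user_id, policy_context, model_version,
                    prompt_version, chat_history, prompt):
    """Combine identity, policy, versions and context into one cache key.

    All parameter names are hypothetical; adapt them to your own model.
    """
    # Hash the conversation so the key stays short and does not expose
    # prompt text in the cache keyspace.
    context_blob = json.dumps(
        {"history": chat_history, "prompt": prompt},
        sort_keys=True, ensure_ascii=False,
    )
    context_hash = hashlib.sha256(context_blob.encode("utf-8")).hexdigest()[:32]

    # Sort the policy entries so the same set of roles always hashes the same.
    policy_blob = json.dumps(sorted(policy_context))
    policy_hash = hashlib.sha256(policy_blob.encode("utf-8")).hexdigest()[:16]

    return ":".join([
        "rag",
        f"tenant={tenant_id}",
        f"user={user_id}",
        f"policy={policy_hash}",
        f"model={model_version}",
        f"prompt_v={prompt_version}",
        f"ctx={context_hash}",
    ])
```

Because the model and prompt versions are part of the key, deploying a new model or prompt template naturally stops old entries from matching, without an explicit purge.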

Time-to-Live (TTL) Policies

Data Type	Recommended TTL	Rationale
Static content (documentation, policies)	24-72 hours	Changes rarely
Dynamic content (dashboard data)	5-30 minutes	Moderate freshness requirements
User-specific queries	1-5 minutes	Privacy and freshness
Search results	15-60 minutes	Balance between cost and freshness

Baseline (Industry best practices)
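One way to apply the TTL table in code is a small lookup that picks a value inside each recommended range. This is a sketch under assumptions: the category names and the `freshness_bias` knob are hypothetical, and only the ranges come from the table.

```python
# Recommended TTL ranges from the table above, in seconds (min, max).
TTL_SECONDS = {
    "static": (24 * 3600, 72 * 3600),
    "dynamic": (5 * 60, 30 * 60),
    "user_specific": (1 * 60, 5 * 60),
    "search_results": (15 * 60, 60 * 60),
}


def ttl_for(data_type, freshness_bias=0.5):
    """Return a TTL in seconds within the recommended range.

    freshness_bias = 0.0 picks the longest TTL (cheapest),
    freshness_bias = 1.0 picks the shortest TTL (freshest).
    """
    low, high = TTL_SECONDS[data_type]
    bias = min(max(freshness_bias, 0.0), 1.0)  # clamp to [0, 1]
    return int(high - bias * (high - low))
```

For example, `ttl_for("user_specific", 1.0)` yields 60 seconds, the strictest value in the 1-5 minute range.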

Cache Invalidation Triggers

Data updates: webhook-triggered invalidation when source data changes
Model changes: invalidate on model deployment or retraining
Prompt modifications: clear the cache when prompt templates change
Manual purge: admin-triggered for compliance or testing

Verified (Microsoft Learn - Caching strategies)
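The triggers above can all be handled with the same mechanism. The sketch below uses generation counters: each namespace has a counter that is part of the effective key, and bumping it makes older entries unreachable. This is one possible technique, not one the sources prescribe; `VersionedCache`, `handle_event`, and the event names are hypothetical.

```python
class VersionedCache:
    """In-memory stand-in for a cache with namespace-level invalidation."""

    def __init__(self):
        self._store = {}
        self._generation = {}  # namespace -> current generation number

    def _key(self, namespace, key):
        return (namespace, self._generation.get(namespace, 0), key)

    def get(self, namespace, key):
        return self._store.get(self._key(namespace, key))

    def set(self, namespace, key, value):
        self._store[self._key(namespace, key)] = value

    def invalidate(self, namespace):
        # Old entries become unreachable; in a real cache they would
        # then age out through their TTL.
        self._generation[namespace] = self._generation.get(namespace, 0) + 1


# Event names are assumptions mirroring the four triggers listed above.
KNOWN_EVENTS = {"data_update", "model_change", "prompt_change", "manual_purge"}


def handle_event(cache, event_type, namespace):
    """Entry point a webhook or deployment hook could call."""
    if event_type not in KNOWN_EVENTS:
        raise ValueError(f"unknown invalidation event: {event_type}")
    cache.invalidate(namespace)
```

A webhook receiver for source data changes would call `handle_event(cache, "data_update", "docs")`, after which lookups in the `docs` namespace miss until repopulated.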

Architecture patterns

1. Semantic Caching (recommended for RAG)

Description: Uses vector similarity search over cached prompts to return responses for semantically similar queries, even when the text is not identical.

How it works:

The incoming prompt is vectorized with an embedding model
A vector search runs against the cached prompt vectors
Items with a similarity score above the threshold are returned
On a cache miss, the LLM generates a response, which is cached together with the vectorized prompt
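The four steps above can be sketched in plain Python. This is a minimal in-memory illustration, not the implementation of any Azure service: `toy_embed` is a hypothetical stand-in for a real embedding model, `call_llm` is a placeholder callable, and the 0.9 cosine-similarity threshold is an arbitrary example value (here a higher value means stricter matching).

```python
import math


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def toy_embed(text):
    """Bag-of-letters vector; only a stand-in for a real embedding model."""
    counts = [0.0] * 26
    for ch in text.lower():
        if "a" <= ch <= "z":
            counts[ord(ch) - ord("a")] += 1.0
    return counts


class SemanticCache:
    def __init__(self, embed, threshold=0.9):
        self.embed = embed
        self.threshold = threshold
        self.entries = []  # list of (prompt_vector, response)

    def lookup(self, prompt):
        vector = self.embed(prompt)            # step 1: vectorize the prompt
        best_score, best_response = 0.0, None
        for cached_vector, response in self.entries:  # step 2: vector search
            score = cosine_similarity(vector, cached_vector)
            if score > best_score:
                best_score, best_response = score, response
        # step 3: return only matches above the threshold
        return best_response if best_score > self.threshold else None

    def store(self, prompt, response):
        self.entries.append((self.embed(prompt), response))


def answer(cache, prompt, call_llm):
    """Return (response, was_cache_hit)."""
    cached = cache.lookup(prompt)
    if cached is not None:
        return cached, True
    response = call_llm(prompt)                # step 4: miss, so call the LLM
    cache.store(prompt, response)
    return response, False
```

A production cache would replace the linear scan with a vector index (for example in Cosmos DB or Redis with RediSearch) and scope entries by user or tenant.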

Advantages:

Higher cache hit rate than traditional key-value caching (30-60% vs 10-20%)
Handles variation in user input (synonyms, paraphrasing)
Drastically reduces LLM token consumption

Disadvantages:

Requires an embedding model (extra latency of roughly 50-100 ms)
More complex implementation
Requires a vector-capable cache (Cosmos DB, Redis with RediSearch)

Context Window Requirement: A semantic cache MUST operate within a context window. Without chat history, the cache can return contextually incorrect responses.

Example: A user asks "What is the largest lake in North America?" (cached: "Lake Superior"), then "What is the second largest?" Without a context window, the cache could return the wrong answer to another user who asks the same follow-up question in a different context.

Verified (Microsoft Learn - Semantic cache introduction)
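To make the context window requirement concrete, the text that gets embedded can include recent chat turns as well as the latest prompt. The helper below is a sketch; the function name, the message format (`role` and `content` keys), and the default of 10 messages are assumptions.

```python
def build_cache_context(chat_history, prompt, max_messages=10):
    """Build the text to embed: recent chat turns plus the latest prompt.

    chat_history is assumed to be a list of {"role": ..., "content": ...} dicts.
    """
    # Guard against max_messages == 0, because history[-0:] would return everything.
    recent = chat_history[-max_messages:] if max_messages > 0 else []
    lines = [f"{m['role']}: {m['content']}" for m in recent]
    lines.append(f"user: {prompt}")
    return "\n".join(lines)
```

With this in place, "What is the second largest?" asked after a question about lakes produces a different vector than the same question asked after a question about cities, so the two conversations no longer share a cache entry.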

2. Multi-tier Result Caching

Description: Combines an in-memory cache (Redis) with a persistent cache (Cosmos DB) to balance speed and durability.

Architecture:

User Query → L1: Redis (in-memory, <5ms) → L2: Cosmos DB (persistent, <50ms) → LLM (fallback, >2s)
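The lookup order in the flow above can be sketched with two dictionary-backed tiers. This is an illustration only: `TTLCache` is a hypothetical stand-in for Redis (L1) and Cosmos DB (L2), and `call_llm` is a placeholder for the model call.

```python
import time


class TTLCache:
    """Dict-backed stand-in for one cache tier with per-entry expiry."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock
        self.data = {}

    def get(self, key):
        item = self.data.get(key)
        if item is None:
            return None
        value, expires_at = item
        if self.clock() >= expires_at:
            del self.data[key]  # lazily evict expired entries
            return None
        return value

    def set(self, key, value):
        self.data[key] = (value, self.clock() + self.ttl)


def tiered_get(key, l1, l2, call_llm):
    """Return (value, source) where source is 'L1', 'L2' or 'LLM'."""
    value = l1.get(key)
    if value is not None:
        return value, "L1"
    value = l2.get(key)
    if value is not None:
        l1.set(key, value)  # promote warm data back into the hot tier
        return value, "L2"
    value = call_llm(key)   # slowest path: both tiers missed
    l2.set(key, value)
    l1.set(key, value)
    return value, "LLM"
```

Giving L1 a shorter TTL than L2 keeps the hot tier small while the persistent tier still shields the LLM from repeated calls.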

Advantages:

Sub-5ms response time for hot data
Data durability on cache failures
Cost-effective (Redis for hot data, Cosmos DB for warm data)

Disadvantages:

Increased complexity in cache management
Potential for stale data across tiers
Higher infrastructure cost

Baseline (Common enterprise pattern)

3. Retrieval Snippet Caching

Description: Caches frequently retrieved knowledge fragments from Azure AI Search or vector databases to avoid repeated database queries.

Implementation:

Cache top-K search results per query pattern
Key: hash(query + filters + top-K)
TTL: 15-60 minutes (depending on data freshness requirements)
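The key formula hash(query + filters + top-K) can be made deterministic as follows. This is a sketch: the normalization rules (lowercasing, whitespace collapsing) and the `retrieval:` prefix are assumptions, and a stable digest is used because Python's built-in `hash()` differs between processes for strings.

```python
import hashlib
import json


def retrieval_cache_key(query, filters, top_k):
    """Deterministic key for cached top-K search results."""
    normalized = {
        # Collapse case and whitespace so trivial variations share an entry.
        "q": " ".join(query.lower().split()),
        "filters": filters,
        "top_k": top_k,
    }
    # sort_keys makes the key independent of filter ordering.
    payload = json.dumps(normalized, sort_keys=True, ensure_ascii=False)
    return "retrieval:" + hashlib.sha256(payload.encode("utf-8")).hexdigest()
```

If the filters encode access control (for example security groups), they must stay in the key so that one user's grounding data is never served to another.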

Advantages:

Reduces Azure AI Search query costs (50-70% reduction)
Lower latency for grounding data retrieval
Less load on the vector index

Disadvantages:

Stale grounding data if source documents are updated
Cache size can grow quickly with many unique queries

Verified (Microsoft Learn - Retrieval caching)

Decision guidance

When to use which caching strategy

Scenario	Recommended Strategy	Rationale
Chatbot with repetitive FAQs	Semantic caching (Redis + RediSearch)	High hit rate, semantic matching
Document Q&A with many unique queries	Retrieval snippet caching	Cost-effective, focused on grounding data
Real-time dashboard with AI insights	Multi-tier caching (Redis L1 + Cosmos L2)	Speed + durability
Compliance-sensitive applications	User-scoped semantic caching	Privacy protection, audit trail

Baseline (Architecture decision framework)

Common mistakes to avoid

Anti-pattern	Problem	Solution
Caching user-private data globally	Privacy violation, data leakage	Scope cache keys by user/tenant identity
No TTL policy	Runaway cache growth, stale data	Implement TTL based on data sensitivity
Similarity threshold too high (>0.8)	Low cache hit rate	Start at 0.15-0.3 and tune based on metrics
Caching without a context window	Contextually incorrect responses	Vectorize chat history + latest prompt
No invalidation strategy	Stale responses after data updates	Implement webhook-based invalidation

Verified (Microsoft Learn - Caching risks)

Red flags

Cache hit rate < 20% after tuning → Reconsider the cache strategy
Cache size grows >10GB/day → Implement aggressive TTL or pruning
Latency increases after caching → Check embedding model overhead
User complaints about stale data → Reduce TTL or implement invalidation

Baseline (Performance monitoring thresholds)
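The measurable red flags above can be turned into a simple health check. The sketch below assumes you already collect hit/miss counters, daily cache growth, and p95 latency before and after enabling the cache; the function name and message texts are hypothetical, while the thresholds come from the list.

```python
def cache_red_flags(hits, misses, growth_gb_per_day, p95_before_ms, p95_after_ms):
    """Return (hit_rate, flags) using the thresholds from the red-flag list."""
    total = hits + misses
    hit_rate = hits / total if total else 0.0
    flags = []
    if hit_rate < 0.20:
        flags.append("hit rate below 20%: reconsider the cache strategy")
    if growth_gb_per_day > 10:
        flags.append("cache grows more than 10 GB/day: tighten TTL or prune")
    if p95_after_ms > p95_before_ms:
        flags.append("latency increased after caching: check embedding overhead")
    return hit_rate, flags
```

User complaints about stale data are not covered here because they are not a numeric metric; they would come from support channels rather than telemetry.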

Integration with the Microsoft stack

Azure Cache for Redis

Use Cases: Traditional result caching, high-throughput scenarios

Tiers:

Premium tier — 99.9% SLA, up to 120GB per shard
Enterprise tier — 99.99% SLA, active-active geo-replication, Flash storage support
Enterprise Flash tier — Up to 13TB cache size, 20% RAM + 80% NVMe Flash

Workloads suited for Flash tier:

Read-heavy (high read/write ratio)
Hot/cold access patterns (frequently accessed subset)
Large values (keys in RAM, values in Flash)

Not suited for Flash tier:

Write-heavy workloads
Uniform data access patterns
Long key names with small values

Configuration for AI workloads:

import redis
from redis_entraid.cred_provider import create_from_default_azure_credential

# Authenticate with Microsoft Entra ID instead of access keys
credential_provider = create_from_default_azure_credential(
    ("https://redis.azure.com/.default",),
)

r = redis.Redis(
    host="<redis-host>.redis.cache.windows.net",
    port=6380,  # TLS port for Azure Cache for Redis; Azure Managed Redis listens on 10000
    ssl=True,
    decode_responses=True,
    credential_provider=credential_provider
)

# Set a TTL on the cached item
r.setex("cache_key", 3600, "cached_value")  # 1 hour TTL

Verified (Microsoft Learn - Azure Managed Redis architecture, code samples)

Azure API Management - Semantic Caching

Use Case: Semantic caching for LLM APIs (Azure OpenAI, Model Inference API)

Prerequisites:

Azure Managed Redis with the RediSearch module enabled
Embeddings API deployment (for vectorization)
Chat Completion API deployment (for user requests)

Policy Configuration:

Inbound (cache lookup):

<azure-openai-semantic-cache-lookup
    score-threshold="0.15"
    embeddings-backend-id="embeddings-backend"
    embeddings-backend-auth="system-assigned"
    ignore-system-messages="true"
    max-message-count="10">
    <vary-by>@(context.Subscription.Id)</vary-by>
</azure-openai-semantic-cache-lookup>
<rate-limit calls="10" renewal-period="60" />

Outbound (cache store):

<azure-openai-semantic-cache-store duration="60" />

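Score Threshold Tuning:

In APIM, score-threshold ranges from 0.0 to 1.0, and lower values require higher semantic similarity for a match. It therefore behaves like a distance limit rather than a similarity floor:

0.1-0.2 → Strict matching, lower hit rate, higher relevance
0.3-0.5 → Balanced, medium hit rate, good relevance
0.6-0.8 → Liberal matching, higher hit rate, somewhat lower relevance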

Verified (Microsoft Learn - Enable semantic caching for LLM APIs)

Azure Cosmos DB for NoSQL

Use Case: Semantic cache with built-in vector search, persistent storage

Implementation Pattern:

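from azure.cosmos import CosmosClient
from openai import AzureOpenAI  # used to create prompt_vector (embedding call not shown)

# Setup Cosmos DB vector store
# cosmos_uri, cosmos_key, cosmos_database_name and cosmos_container_name
# are expected to come from application configuration
cosmos_client = CosmosClient(url=cosmos_uri, credential=cosmos_key)
database = cosmos_client.get_database_client(cosmos_database_name)
container = database.get_container_client(cosmos_container_name)

# Query semantic cache
def query_cache(prompt_vector, similarity_threshold=0.15, top_k=5):
    # TOP is passed as a parameter instead of being interpolated into the query text.
    # ORDER BY VectorDistance returns the most similar items first.
    query = """
    SELECT TOP @topK c.id, c.prompt, c.completion,
           VectorDistance(c.promptVector, @promptVector) AS similarity
    FROM c
    WHERE VectorDistance(c.promptVector, @promptVector) > @threshold
    ORDER BY VectorDistance(c.promptVector, @promptVector)
    """

    items = list(container.query_items(
        query=query,
        parameters=[
            {"name": "@topK", "value": top_k},
            {"name": "@promptVector", "value": prompt_vector},
            {"name": "@threshold", "value": similarity_threshold}
        ],
        # The cache is not queried by partition key, so allow fan-out
        enable_cross_partition_query=True
    ))
    return items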

Advantages:

Globally distributed, multi-region writes
Automatic indexing of vectors
99.999% SLA with multi-region setup
Built-in TTL support

Verified (Microsoft Learn - Semantic cache with Cosmos DB, code samples)

Azure AI Search - Built-in Caching

Automatic Caching Behavior: Azure AI Search automatically caches content after the first query to speed up subsequent searches.

Optimization Tips:

Reduce index size → faster caching, smaller memory footprint
Selective field attribution → index only the fields you need
Avoid over-attribution (filterable, sortable, facetable) → reduces storage 4x

Performance Factors:

Smaller indexes → more content in cache → lower query latency
Higher tiers (S2, S3) → more memory → larger cache capacity
Partitions → parallel processing for slow queries

Verified (Microsoft Learn - Azure AI Search performance tips)

Public sector (Norway)

GDPR and Privacy

Cache Key Scoping (MANDATORY):

Never cache user-private content without proper scoping by user identity
Implement tenant/user isolation in cache keys
Audit trail for cached personal data

Data Minimization:

Cache only the minimum data needed to answer the query
TTL on personal data must not exceed what the purpose limitation allows
Automatic deletion on user request (GDPR Article 17)

Example - GDPR-compliant cache key:

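import hashlib
# The built-in hash() is randomized per process for strings, so use a stable digest
cache_key = f"user:{user_id}:tenant:{tenant_id}:query_hash:{hashlib.sha256(prompt.encode('utf-8')).hexdigest()}"
# TTL: 1 hour (minimal for chat session)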

Baseline (GDPR compliance patterns)

Compliance requirements

Requirement	Implementation
Data portability (GDPR Art. 20)	Export cached user data on request
Right to erasure (GDPR Art. 17)	Implement cache purge by user_id
Legal basis for processing	Document legitimate interest for caching
Reporting to Datatilsynet	Audit log for cache access/invalidation

Baseline (Norwegian public sector compliance)
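The export, purge, and audit requirements in the table can be sketched against a user-scoped keyspace. The class below is an in-memory illustration under assumptions: `UserScopedCache` and its method names are hypothetical, and the key layout follows the `user:{user_id}:tenant:{tenant_id}:query_hash:{...}` pattern from the example key.

```python
class UserScopedCache:
    """In-memory stand-in showing per-user export, purge and audit logging."""

    def __init__(self):
        self._data = {}
        self.audit_log = []  # (action, user_id, detail) tuples

    @staticmethod
    def _prefix(user_id):
        # The trailing colon keeps user 4 from matching user 42.
        return f"user:{user_id}:"

    def set(self, user_id, tenant_id, query_hash, value):
        key = f"{self._prefix(user_id)}tenant:{tenant_id}:query_hash:{query_hash}"
        self._data[key] = value
        self.audit_log.append(("set", user_id, key))
        return key

    def export_user(self, user_id):
        """Data portability: return everything cached for one user."""
        prefix = self._prefix(user_id)
        self.audit_log.append(("export", user_id, None))
        return {k: v for k, v in self._data.items() if k.startswith(prefix)}

    def purge_user(self, user_id):
        """Right to erasure: delete every entry for one user."""
        prefix = self._prefix(user_id)
        doomed = [k for k in self._data if k.startswith(prefix)]
        for key in doomed:
            del self._data[key]
        self.audit_log.append(("purge", user_id, len(doomed)))
        return len(doomed)
```

Against a real Redis instance the same prefix would typically be matched with an incremental `SCAN` using a `MATCH` pattern rather than by iterating a dictionary.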

Security

Encryption:

At rest: Azure Cache for Redis (Premium/Enterprise) — automatic encryption
In transit: TLS 1.2+ mandatory for all cache connections
Key management: Azure Key Vault for cache access keys

Access Control:

Microsoft Entra ID authentication for Redis (preview)
Role-based access control (RBAC) for cache management
Network isolation via Private Endpoints

Verified (Microsoft Learn - Redis security)

Cost and licensing

Azure Cache for Redis Pricing (Norway East - 2026)

Tier	Size	Capacity	Monthly cost (NOK)	Best For
Basic C0	250 MB	N/A (no SLA)	~400	Dev/Test
Standard C1	1 GB	2 replicas, 99.9% SLA	~1,200	Small production
Premium P1	6 GB	Clustering, geo-replication	~7,000	Enterprise
Enterprise E10	12 GB	Active-active, 99.99% SLA	~25,000	Mission-critical
Enterprise Flash F300	345 GB	20% RAM + 80% Flash	~60,000	Large-scale AI

Cost Optimization Tips:

Start with Premium P1 for production RAG (best price/performance)
Scale out vs scale up: add replicas before moving to a higher tier
Use Flash tier for large caches (>100GB): 5x lower cost per GB vs Enterprise
Monitor cache hit rate: a hit rate below 20% indicates an ineffective caching strategy
Implement TTL aggressively: a smaller cache allows a lower tier

Verified (Microsoft Learn - Plan and manage costs)

Azure Cosmos DB Pricing

Request Units (RU/s) for Semantic Cache:

Vector query (1KB): ~10-50 RU
Write (cache store): ~5-10 RU
Storage: ~2.5 NOK/GB/month

Cost Example (10,000 queries/day):

10,000 queries × 30 RU avg = 300,000 RU/day ≈ 3.5 RU/s avg
Provisioned: 100 RU/s (for burst) = ~600 NOK/month
Storage (10GB cache): ~25 NOK/month
Total: ~625 NOK/month

Baseline (Cosmos DB pricing calculator estimates)
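The arithmetic in the cost example can be reproduced with a few lines. The function names are hypothetical, and the prices are the document's own estimates rather than quoted Azure list prices.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def average_ru_per_second(queries_per_day, avg_ru_per_query):
    """Average throughput if the daily load were spread evenly."""
    return queries_per_day * avg_ru_per_query / SECONDS_PER_DAY


def monthly_total_nok(throughput_nok, storage_gb, storage_nok_per_gb):
    """Provisioned throughput cost plus storage cost per month."""
    return throughput_nok + storage_gb * storage_nok_per_gb


print(round(average_ru_per_second(10_000, 30), 1))  # 3.5
print(monthly_total_nok(600, 10, 2.5))              # 625.0
```

The gap between the 3.5 RU/s average and the 100 RU/s provisioned is headroom for bursts; real traffic is rarely spread evenly over the day.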

TCO comparison

Scenario	Without Caching	With Semantic Caching (Redis)	Savings
100K LLM queries/day (GPT-4)	~450,000 NOK/month	~150,000 NOK/month + 7,000 (Redis)	65%
10K queries/day (GPT-3.5)	~45,000 NOK/month	~15,000 NOK/month + 7,000 (Redis)	51%

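Assumptions: avg 2000 tokens/query and a cache hit rate of roughly 67%. The LLM cost falls to one third in both rows; a 50% hit rate alone would only halve it.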

Baseline (TCO analysis based on Azure pricing)
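The savings column can be checked with a short calculation. This is a sketch using the table's own figures; `savings_percent` is a hypothetical helper name.

```python
def savings_percent(cost_without, llm_cost_with, cache_cost):
    """Percentage saved once the cache's own cost is included."""
    total_with = llm_cost_with + cache_cost
    return round(100 * (cost_without - total_with) / cost_without)


print(savings_percent(450_000, 150_000, 7_000))  # 65
print(savings_percent(45_000, 15_000, 7_000))    # 51
```

The fixed Redis cost weighs more at low volume, which is why the 10K queries/day scenario saves a smaller share than the 100K scenario.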

For the architect (Cosmo)

Questions to ask the customer

Traffic pattern: How many LLM queries per day or hour do you expect? What is peak vs average load?
Query similarity: Are there many repetitive or semantically similar questions? (Indicates semantic cache ROI)
Data freshness: How often does the underlying data change? What staleness window is acceptable?
Privacy requirements: Do you handle personal data? Is user-scoped caching needed?
Compliance: Which regulatory frameworks apply (GDPR, Schrems II, Datatilsynet)?
Budget: What is the total budget for LLM + caching infrastructure?
Latency SLA: What is the maximum acceptable response time (p50, p95, p99)?
Global reach: Is multi-region caching needed for latency or compliance?

Pitfalls to avoid

Pitfall	Impact	Mitigation
Caching without a context window	Contextually incorrect responses → user frustration	Vectorize chat history + prompt
Global caching of personal data	GDPR violation, potential fines	User-scoped keys, TTL enforcement
Similarity threshold too high	Low hit rate, ineffective caching	Start low (0.15), tune upward
No invalidation strategy	Stale data → incorrect LLM responses	Webhook-based invalidation
Undersized cache tier	High eviction rate, low hit rate	Monitor evictions, scale proactively
Ignoring embedding overhead	Latency increase vs direct LLM call	Batch embeddings, use async patterns

Recommendations per maturity level

Level 1 - Pilot (0-6 months of RAG experience):

Start with Azure API Management semantic caching (managed, low-complexity)
Use case: FAQ chatbot with <1000 queries/day
Tier: Standard Redis (C1) for learning at low cost
Monitoring: Basic hit rate metrics in APIM

Level 2 - Production (6-18 months):

Implement multi-layer caching (Redis L1 + Cosmos DB L2)
Use case: Customer support RAG with 10K-100K queries/day
Tier: Premium Redis (P1) + Cosmos DB autoscale
Monitoring: Application Insights with custom metrics (hit rate, latency, cost per query)

Level 3 - Enterprise (18+ months):

Hybrid semantic + retrieval caching with advanced invalidation
Use case: Multi-tenant SaaS RAG platform, 100K+ queries/day
Tier: Enterprise Redis (E10) + global Cosmos DB
Monitoring: Full observability stack (Grafana, custom dashboards, alerting)

Baseline (Maturity model for AI implementations)

Sources and verification

Microsoft Learn Documentation

1. Application design for AI workloads on Azure - Multi-layer caching strategies https://learn.microsoft.com/en-us/azure/well-architected/ai/application-design#implement-multi-layer-caching-strategies Confidence: Verified (2026-02)
2. Introduction to semantic cache - Semantic caching concepts, context window requirements https://learn.microsoft.com/en-us/azure/cosmos-db/gen-ai/semantic-cache Confidence: Verified (2026-02)
3. Enable semantic caching for LLM APIs in Azure API Management - APIM semantic cache implementation https://learn.microsoft.com/en-us/azure/api-management/azure-openai-enable-semantic-caching Confidence: Verified (2026-02)
4. Tips for better performance in Azure AI Search - Index caching, performance optimization https://learn.microsoft.com/en-us/azure/search/search-performance-tips Confidence: Verified (2026-02)
5. Azure Managed Redis architecture - Flash tier workloads, caching strategies https://learn.microsoft.com/en-us/azure/redis/architecture#flash-optimized-tier Confidence: Verified (2026-02)
6. Plan and manage costs of an Azure AI Search service - Cost optimization, enrichment caching https://learn.microsoft.com/en-us/azure/search/search-sku-manage-costs#minimize-costs Confidence: Verified (2026-02)
7. Data platform considerations for mission-critical workloads - Azure Cache for Redis enterprise patterns https://learn.microsoft.com/en-us/azure/well-architected/mission-critical/mission-critical-data-platform#caching-for-hot-tier-data Confidence: Verified (2026-02)

Code Samples

8. RAG implementation with Azure AI Search - Python RAG cache patterns https://learn.microsoft.com/en-us/azure/search/retrieval-augmented-generation-overview#content-retrieval-in-azure-ai-search Confidence: Verified (code sample)
9. Azure Cache for Redis with Python - Redis connection and caching code https://learn.microsoft.com/en-us/azure/redis/python-get-started#code-to-connect-to-a-redis-cache Confidence: Verified (code sample)

Confidence Levels per Section

Section	Confidence	Source
Multi-layer caching strategy	Verified	Microsoft Learn docs (1)
Semantic caching pattern	Verified	Microsoft Learn docs (2, 3)
Azure Cache for Redis configuration	Verified	Microsoft Learn docs (5, 7), code samples (9)
Azure API Management policies	Verified	Microsoft Learn docs (3)
Azure AI Search caching	Verified	Microsoft Learn docs (4, 6)
Cost estimates	Baseline	Azure pricing calculator (2026-02)
GDPR compliance patterns	Baseline	Industry best practices
Maturity model recommendations	Baseline	Architecture consulting experience

Total number of sources: 9 unique Microsoft Learn URLs | MCP calls: 7 (4 docs_search + 2 docs_fetch + 1 code_sample_search) | Last verified: 2026-02-03

Azure Managed Redis: Architecture (updated 2026-04)

Azure Managed Redis (based on Redis Enterprise) is recommended for AI workloads over Azure Cache for Redis (community edition):

Property	Azure Cache for Redis	Azure Managed Redis
Threading	Single-threaded	Multi-threaded (Redis Enterprise)
Architecture	Primary + replica (2 nodes)	Multiple shards per node, distributed primaries
Performance	Limited by a single thread	Near-linear scaling with vCPUs
Clustering	Optional	Always enabled (OSS, Enterprise, or Non-Clustered policy)
Active geo-replication	No	Yes

Cluster policies:

OSS policy: recommended for most workloads. Clients connect directly to the shards, giving the lowest latency and best throughput
Enterprise policy: single endpoint and backward compatible, but the single-node proxy can become a bottleneck. Required for RediSearch
Non-Clustered: only for ≤25 GB, intended for migration from non-sharded environments

Flash Optimized tier: 20% RAM + 80% NVMe Flash. Best suited to read-heavy workloads with a subset of hot keys.
