RAG Caching and Performance Optimization
Last updated: 2026-04 | Verified: MCP 2026-04 | Status: GA | Category: RAG Architecture & Semantic Search
Introduction
Caching is a critical optimization strategy for RAG (Retrieval-Augmented Generation) applications and can sharply reduce both latency and cost. In typical RAG scenarios, calls to the LLM are often the most expensive and time-consuming operation, especially when large amounts of context data and chat history are sent with every request. A well-designed caching strategy can reduce the number of LLM invocations by up to 90% in high-traffic scenarios with repetitive or semantically similar queries.
The multi-layer caching approach covers several levels of the RAG architecture: result caching (complete LLM responses), retrieval caching (knowledge fragments from vector search), embedding caching (precomputed vector representations), and semantic caching (semantically similar prompts). Each layer addresses a different aspect of performance and cost optimization.
The Microsoft stack offers several services suited to AI workloads: Azure Cache for Redis (traditional and semantic caching), Azure Cosmos DB (semantic cache with vector search), Azure AI Search (built-in caching of search results), and Azure API Management (semantic caching for LLM APIs). The choice depends on cache type, scale requirements, and compliance requirements.
Core components
Multi-layer caching strategy
| Cache Layer | Purpose | Typical Hit Rate | Latency Impact |
|---|---|---|---|
| Result caching | Cache complete LLM responses for identical or semantically similar queries | 30-60% (high-traffic) | -80% to -95% |
| Retrieval caching | Cache knowledge fragments from vector search | 40-70% | -50% to -70% |
| Embedding caching | Cache precomputed embeddings | 60-90% | -30% to -50% |
| Model output caching | Cache intermediate model outputs | 20-40% | -40% to -60% |
Verified (Microsoft Learn - Application design for AI workloads)
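The embedding caching layer in the table above can be sketched as a memoizing wrapper around the embedding call. This is a minimal in-memory sketch under stated assumptions, not a production design: `EmbeddingCache`, `embed`, and `model_name` are hypothetical names, and the dictionary stands in for a shared store such as Redis.

```python
import hashlib


class EmbeddingCache:
    """Memoize embeddings by (model name, text) so repeated text is embedded once."""

    def __init__(self, embed, model_name):
        self.embed = embed            # hypothetical callable: str -> list[float]
        self.model_name = model_name  # part of the key, so a model change misses
        self._store = {}              # stand-in for a shared cache such as Redis
        self.hits = 0
        self.misses = 0

    def _key(self, text):
        raw = f"{self.model_name}\n{text}".encode("utf-8")
        return hashlib.sha256(raw).hexdigest()

    def get(self, text):
        key = self._key(text)
        if key in self._store:
            self.hits += 1
        else:
            self.misses += 1
            self._store[key] = self.embed(text)
        return self._store[key]
```

Tracking `hits` and `misses` on the wrapper makes it easy to compare the observed hit rate with the typical range in the table.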
Cache Key Components
Effective cache keys must include:
- Tenant/User identity: for multi-tenant security
- Policy context: RBAC and data access policies
- Model version: avoids stale responses after model updates
- Prompt version: tracks prompt engineering changes
- Context window: chat history for contextual relevance
Verified (Microsoft Learn - Multi-layer caching strategies)
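One way to combine the key components listed above into a single deterministic key is sketched below. The function name, the parameter names, and the `rag:` prefix are illustrative assumptions, not part of any Microsoft API.

```python
import hashlib
import json


def build_cache_key(tenant_id, user_id, policy_scope, model_version,
                    prompt_version, messages):
    """Return a deterministic key covering identity, policy, versions and context."""
    payload = {
        "tenant": tenant_id,
        "user": user_id,
        "policy": sorted(policy_scope),  # role order must not change the key
        "model": model_version,
        "prompt": prompt_version,
        "context": messages,             # chat history plus the latest prompt
    }
    # sort_keys keeps the JSON, and therefore the digest, stable across runs.
    canonical = json.dumps(payload, sort_keys=True, ensure_ascii=False)
    digest = hashlib.sha256(canonical.encode("utf-8")).hexdigest()
    # A readable prefix allows per-tenant or per-user purges by key pattern.
    return f"rag:{tenant_id}:{user_id}:{digest}"
```

Hashing the full payload keeps the key short while still changing whenever any component changes, for example after a model or prompt-template update.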
Time-to-Live (TTL) Policies
| Data Type | Recommended TTL | Rationale |
|---|---|---|
| Static content (documentation, policies) | 24-72 hours | Rarely changes |
| Dynamic content (dashboard data) | 5-30 minutes | Moderate freshness requirements |
| User-specific queries | 1-5 minutes | Privacy and freshness |
| Search results | 15-60 minutes | Balance between cost and freshness |
Baseline (Industry best practices)
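The TTL table above can be carried into code as a small lookup. This is a sketch only: the dictionary keys, the `ttl_seconds` helper, and the choice to fall back to the shortest range for unknown data types are assumptions made for illustration.

```python
# (min_seconds, max_seconds) per data type, copied from the TTL table.
TTL_RANGES = {
    "static": (24 * 3600, 72 * 3600),
    "dynamic": (5 * 60, 30 * 60),
    "user_specific": (1 * 60, 5 * 60),
    "search_results": (15 * 60, 60 * 60),
}


def ttl_seconds(data_type, prefer_fresh=True):
    """Pick the low end of the range for freshness, the high end for cost."""
    # Unknown types get the most conservative (shortest) range.
    low, high = TTL_RANGES.get(data_type, TTL_RANGES["user_specific"])
    return low if prefer_fresh else high
```

The returned value can be passed directly as the expiry argument of a cache write, such as the seconds argument of a Redis `SETEX`.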
Cache Invalidation Triggers
- Data updates: webhook-triggered invalidation when source data changes
- Model changes: invalidate on model deployment or retraining
- Prompt modifications: clear the cache when prompt templates change
- Manual purge: admin-triggered, for compliance or testing
Verified (Microsoft Learn - Caching strategies)
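The triggers above all reduce to "drop every entry that depends on X". One possible way to support that is to tag each entry with the things it depends on; the sketch below uses an in-memory dictionary, and `TaggedCache` and the tag strings are hypothetical names rather than an existing library.

```python
from collections import defaultdict


class TaggedCache:
    """Cache whose entries carry tags (source document, model, prompt version)."""

    def __init__(self):
        self._entries = {}
        self._keys_by_tag = defaultdict(set)

    def put(self, key, value, tags):
        self._entries[key] = value
        for tag in tags:
            self._keys_by_tag[tag].add(key)

    def get(self, key):
        return self._entries.get(key)

    def invalidate_tag(self, tag):
        """Remove every entry carrying the tag; return how many were removed."""
        removed = 0
        for key in self._keys_by_tag.pop(tag, set()):
            if key in self._entries:
                del self._entries[key]
                removed += 1
        return removed
```

A webhook handler for a changed document would call `invalidate_tag("doc:<id>")`, while a model deployment would call `invalidate_tag("model:<version>")`.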
Architecture patterns
1. Semantic Caching (recommended for RAG)
Description: Uses vector similarity search over cached prompts to return responses for semantically similar queries, even when the text is not identical.
How it works:
- The incoming prompt is vectorized with an embedding model
- A vector search runs against the cached prompt vectors
- Items with a similarity score > threshold are returned
- On a cache miss: the LLM generates a response, which is cached together with the vectorized prompt
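The steps above can be sketched with an in-memory list in place of a vector store. The sketch assumes cosine similarity (higher means more similar) and a linear scan; `SemanticCache`, `answer`, `embed`, and `call_llm` are hypothetical names, and a real deployment would delegate the search to Redis or Cosmos DB.

```python
import math


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class SemanticCache:
    def __init__(self, embed, threshold=0.9):
        self.embed = embed          # hypothetical callable: str -> list[float]
        self.threshold = threshold  # cosine similarity needed for a hit
        self.entries = []           # (prompt_vector, response) pairs

    def lookup(self, prompt):
        vector = self.embed(prompt)
        best_score, best_response = 0.0, None
        for cached_vector, response in self.entries:
            score = cosine_similarity(vector, cached_vector)
            if score > best_score:
                best_score, best_response = score, response
        if best_response is not None and best_score >= self.threshold:
            return best_response
        return None

    def store(self, prompt, response):
        self.entries.append((self.embed(prompt), response))


def answer(prompt, cache, call_llm):
    """Return (response, was_cache_hit); call the LLM only on a miss."""
    cached = cache.lookup(prompt)
    if cached is not None:
        return cached, True
    response = call_llm(prompt)
    cache.store(prompt, response)
    return response, False
```

The sketch embeds the prompt twice on a miss (once for lookup, once for store); reusing the vector would avoid that extra embedding call.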
Advantages:
- Higher cache hit rate than traditional key-value caching (30-60% vs 10-20%)
- Handles variation in user input (synonyms, paraphrasing)
- Reduces LLM token consumption, because cache hits skip the LLM call entirely
Disadvantages:
- Requires an embedding model (extra latency ~50-100ms)
- More complex implementation
- Requires a vector-capable cache (Cosmos DB, Redis with RediSearch)
Context Window Requirement: A semantic cache must operate within a context window, meaning the recent chat history is vectorized together with the latest prompt. Without chat history, the cache can return contextually incorrect responses.
Example: A user asks "What is the largest lake in North America?" (cached: "Lake Superior") and then "What is the second largest?". Without a context window, the cached answer to that follow-up could be returned to another user who asks the same follow-up question in a different context, where it would be wrong.
Verified (Microsoft Learn - Semantic cache introduction)
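The context window requirement amounts to embedding the recent conversation rather than the last prompt alone. A minimal sketch of building that text follows; `cache_text`, the message dictionary shape, and the `max_messages` default are assumptions for illustration.

```python
def cache_text(history, prompt, max_messages=4):
    """Build the text to embed: the last few chat turns plus the new prompt."""
    # history[-0:] would return the whole list, so handle zero explicitly.
    recent = history[-max_messages:] if max_messages > 0 else []
    lines = [f"{m['role']}: {m['content']}" for m in recent]
    lines.append(f"user: {prompt}")
    return "\n".join(lines)
```

With this helper, "What is the second largest?" asked after a question about lakes yields a different vector than the same follow-up asked after a question about cities, so the two conversations do not share a cache entry.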
2. Multi-tier Result Caching
Description: Combines an in-memory cache (Redis) with a persistent cache (Cosmos DB) to balance speed and durability.
Architecture:
User Query → L1: Redis (in-memory, <5ms) → L2: Cosmos DB (persistent, <50ms) → LLM (fallback, >2s)
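The lookup order in the diagram can be sketched as follows. Plain dictionaries stand in for Redis (L1) and Cosmos DB (L2), and `TwoTierCache` and `generate` are hypothetical names; TTLs and error handling are left out to keep the flow visible.

```python
class TwoTierCache:
    """Check L1, then L2, then fall back to the LLM; fill both tiers on the way back."""

    def __init__(self, l1, l2):
        self.l1 = l1  # stand-in for the in-memory tier (Redis)
        self.l2 = l2  # stand-in for the persistent tier (Cosmos DB)

    def get_or_generate(self, key, generate):
        if key in self.l1:
            return self.l1[key], "l1"
        if key in self.l2:
            self.l1[key] = self.l2[key]  # promote warm data into the hot tier
            return self.l1[key], "l2"
        value = generate()               # fallback: the expensive LLM call
        self.l2[key] = value
        self.l1[key] = value
        return value, "llm"
```

Returning the tier that served the request makes it straightforward to report per-tier hit rates.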
Advantages:
- Sub-5ms response time for hot data
- Data durability when the in-memory cache fails
- Cost-effective (Redis for hot data, Cosmos DB for warm data)
Disadvantages:
- Increased complexity in cache management
- Potential for stale data across tiers
- Higher infrastructure cost than a single cache tier
Baseline (Common enterprise pattern)
3. Retrieval Snippet Caching
Description: Cache frequently retrieved knowledge fragments from Azure AI Search or vector databases to avoid repeated database queries.
Implementation:
- Cache top-K search results per query pattern
- Key: hash(query + filters + top-K)
- TTL: 15-60 minutes (depending on data freshness requirements)
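The key formula above, hash(query + filters + top-K), could look like this in practice. The normalization step (lowercasing and collapsing whitespace) and the `retrieval:` prefix are assumptions added for illustration; whether case-insensitive matching is acceptable depends on the search configuration.

```python
import hashlib
import json


def retrieval_cache_key(query, filters, top_k):
    """Hash the normalized query, the filters and top-K into one cache key."""
    normalized = " ".join(query.lower().split())
    payload = json.dumps(
        {"q": normalized, "f": filters, "k": top_k},
        sort_keys=True,  # filter order must not produce a different key
    )
    return "retrieval:" + hashlib.sha256(payload.encode("utf-8")).hexdigest()
```

Including `top_k` in the key prevents a cached top-3 result from being served to a request that asked for the top 10.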
Advantages:
- Reduces Azure AI Search query costs (50-70% reduction)
- Lower latency for grounding data retrieval
- Less load on the vector index
Disadvantages:
- Stale grounding data if source documents are updated
- Cache size can grow quickly with many unique queries
Verified (Microsoft Learn - Retrieval caching)
Decision guidance
When to use which caching strategy
| Scenario | Recommended Strategy | Rationale |
|---|---|---|
| Chatbot with repetitive FAQs | Semantic caching (Redis + RediSearch) | High hit rate, semantic matching |
| Document Q&A with many unique queries | Retrieval snippet caching | Cost-effective, focuses on grounding data |
| Real-time dashboard with AI insights | Multi-tier caching (Redis L1 + Cosmos L2) | Speed + durability |
| Compliance-sensitive applications | User-scoped semantic caching | Privacy protection, audit trail |
Baseline (Architecture decision framework)
Common mistakes to avoid
| Anti-pattern | Problem | Solution |
|---|---|---|
| Caching user-private data globally | Privacy violation, data leakage | Scope cache keys by user/tenant identity |
| No TTL policy | Runaway cache growth, stale data | Implement TTLs based on data sensitivity |
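| Mis-tuned similarity threshold | Too strict: low cache hit rate; too loose: irrelevant cached answers | For APIM `score-threshold` (lower values are stricter), start around 0.15-0.3 and tune based on metrics |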
| Caching without a context window | Contextually incorrect responses | Vectorize chat history + latest prompt |
| No invalidation strategy | Stale responses after data updates | Implement webhook-based invalidation |
Verified (Microsoft Learn - Caching risks)
Red flags
- Cache hit rate < 20% after tuning → Reconsider the cache strategy
- Cache size grows > 10GB/day → Implement aggressive TTLs or pruning
- Latency increases after caching → Check embedding model overhead
- User complaints about stale data → Reduce TTL or implement invalidation
Baseline (Performance monitoring thresholds)
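The three measurable red flags above can be turned into a simple check for a monitoring job. This is a sketch: `cache_red_flags` and the flag names are hypothetical, and the thresholds are the ones listed above, not values from any Azure service.

```python
def cache_red_flags(hit_rate, growth_gb_per_day, latency_ms_before, latency_ms_after):
    """Return the names of the red flags triggered by the supplied metrics."""
    flags = []
    if hit_rate < 0.20:
        flags.append("low_hit_rate")        # reconsider the cache strategy
    if growth_gb_per_day > 10:
        flags.append("runaway_growth")      # tighten TTLs or prune
    if latency_ms_after > latency_ms_before:
        flags.append("latency_regression")  # check embedding overhead
    return flags
```

User complaints about stale data are not a metric and would need a separate feedback channel.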
Integration with the Microsoft stack
Azure Cache for Redis
Use Cases: Traditional result caching, high-throughput scenarios
Tiers:
- Premium tier — 99.9% SLA, up to 120GB per shard
- Enterprise tier — 99.99% SLA, active-active geo-replication, Flash storage support
- Enterprise Flash tier — Up to 13TB cache size, 20% RAM + 80% NVMe Flash
Workloads suited for Flash tier:
- Read-heavy (high read/write ratio)
- Hot/cold access patterns (frequently accessed subset)
- Large values (keys in RAM, values in Flash)
Not suited for Flash tier:
- Write-heavy workloads
- Uniform data access patterns
- Long key names with small values
Configuration for AI workloads:
import redis
from redis_entraid.cred_provider import create_from_default_azure_credential

# Authenticate with Microsoft Entra ID instead of access keys
credential_provider = create_from_default_azure_credential(
    ("https://redis.azure.com/.default",),
)

r = redis.Redis(
    host="<redis-host>.redis.cache.windows.net",
    port=6380,  # TLS port for Azure Cache for Redis; Azure Managed Redis endpoints use 10000
    ssl=True,
    decode_responses=True,
    credential_provider=credential_provider,
)

# Set a TTL on the cached item
r.setex("cache_key", 3600, "cached_value")  # 1 hour TTL
Verified (Microsoft Learn - Azure Managed Redis architecture, code samples)
Azure API Management - Semantic Caching
Use Case: Semantic caching for LLM APIs (Azure OpenAI, Model Inference API)
Prerequisites:
- Azure Managed Redis with the RediSearch module enabled
- Embeddings API deployment (for vectorization)
- Chat Completion API deployment (for user requests)
Policy Configuration:
Inbound (cache lookup):
<azure-openai-semantic-cache-lookup
    score-threshold="0.15"
    embeddings-backend-id="embeddings-backend"
    embeddings-backend-auth="system-assigned"
    ignore-system-messages="true"
    max-message-count="10">
    <vary-by>@(context.Subscription.Id)</vary-by>
</azure-openai-semantic-cache-lookup>
<rate-limit calls="10" renewal-period="60" />
Outbound (cache store):
<azure-openai-semantic-cache-store duration="60" />
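Score Threshold Tuning (for the APIM `score-threshold` attribute, lower values require closer semantic similarity):
- 0.1-0.2 → Strict matching, lower hit rate, higher relevance
- 0.3-0.5 → Balanced, medium hit rate, good relevance
- 0.6-0.8 → Liberal matching, high hit rate, somewhat lower relevance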
Verified (Microsoft Learn - Enable semantic caching for LLM APIs)
Azure Cosmos DB for NoSQL
Use Case: Semantic cache with built-in vector search, persistent storage
Implementation Pattern:
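from azure.cosmos import CosmosClient

# Placeholders: replace with your own account values
cosmos_uri = "<cosmos-account-uri>"
cosmos_key = "<cosmos-account-key>"
cosmos_database_name = "<database-name>"
cosmos_container_name = "<container-name>"

# Set up the Cosmos DB vector store
cosmos_client = CosmosClient(url=cosmos_uri, credential=cosmos_key)
database = cosmos_client.get_database_client(cosmos_database_name)
container = database.get_container_client(cosmos_container_name)


# Query the semantic cache
def query_cache(prompt_vector, similarity_threshold=0.15, top_k=5):
    """Return cached completions whose prompt vector is similar enough."""
    # TOP is interpolated, so force it to an int; vector and threshold are parameters.
    # ORDER BY VectorDistance returns the most similar items first, so no direction is given.
    query = f"""
        SELECT TOP {int(top_k)} c.id, c.prompt, c.completion,
            VectorDistance(c.promptVector, @promptVector) AS similarity
        FROM c
        WHERE VectorDistance(c.promptVector, @promptVector) > @threshold
        ORDER BY VectorDistance(c.promptVector, @promptVector)
    """
    items = list(container.query_items(
        query=query,
        parameters=[
            {"name": "@promptVector", "value": prompt_vector},
            {"name": "@threshold", "value": similarity_threshold},
        ],
        enable_cross_partition_query=True,  # the cache is searched across partitions
    ))
    return items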
Advantages:
- Globally distributed, multi-region writes
- Automatic indexing of vectors
- 99.999% SLA with a multi-region setup
- Built-in TTL support
Verified (Microsoft Learn - Semantic cache with Cosmos DB, code samples)
Azure AI Search - Built-in Caching
Automatic Caching Behavior: Azure AI Search automatically caches content after the first query, which speeds up subsequent searches.
Optimization Tips:
- Reduce index size → faster caching, smaller memory footprint
- Selective field attribution → index only the fields you need
- Avoid over-attribution (filterable, sortable, facetable) → can reduce storage by up to 4x
Performance Factors:
- Smaller indexes → more content in cache → lower query latency
- Higher tiers (S2, S3) → more memory → larger cache capacity
- Partitions → parallel processing for slow queries
Verified (Microsoft Learn - Azure AI Search performance tips)
Public sector (Norway)
GDPR and Privacy
Cache Key Scoping (MANDATORY):
- Never cache user-private content without proper scoping by user identity
- Implement tenant/user isolation in cache keys
- Keep an audit trail for cached personal data
Data Minimization:
- Cache only the minimum data needed to answer the query
- The TTL on personal data must not exceed the purpose limitation
- Automatic deletion on user request (GDPR Article 17)
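Example - GDPR-compliant cache key:
import hashlib
prompt_hash = hashlib.sha256(prompt.encode("utf-8")).hexdigest()  # stable across processes, unlike built-in hash()
cache_key = f"user:{user_id}:tenant:{tenant_id}:query_hash:{prompt_hash}"
# TTL: 1 hour (minimal for chat session)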
Baseline (GDPR compliance patterns)
Compliance requirements
| Requirement | Implementation |
|---|---|
| Data portability (GDPR Art. 20) | Export cached user data on request |
| Right to erasure (GDPR Art. 17) | Implement cache purge by user_id |
| Legal basis for processing | Document the legitimate interest for caching |
| Reporting to Datatilsynet (Norwegian Data Protection Authority) | Audit log for cache access/invalidation |
Baseline (Norwegian public sector compliance)
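A purge by user_id follows directly from user-scoped keys. The sketch below assumes keys that start with `user:{user_id}:` and uses a dictionary as the cache; `purge_user` is a hypothetical helper. Against Redis the same idea would typically iterate matching keys with a pattern scan (for example redis-py's `scan_iter(match=...)`) instead of a dictionary.

```python
def purge_user(cache, user_id):
    """Delete every cached entry scoped to one user; return the number removed."""
    # The trailing colon stops user 1 from matching user 12.
    prefix = f"user:{user_id}:"
    doomed = [key for key in cache if key.startswith(prefix)]
    for key in doomed:
        del cache[key]
    return len(doomed)
```

Logging the returned count together with the user identifier gives the audit entry called for in the table above.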
Security
Encryption:
- At rest: Azure Cache for Redis (Premium/Enterprise) — automatic encryption
- In transit: TLS 1.2+ mandatory for all cache connections
- Key management: Azure Key Vault for cache access keys
Access Control:
- Microsoft Entra ID authentication for Redis
- Role-based access control (RBAC) for cache management
- Network isolation via Private Endpoints
Verified (Microsoft Learn - Redis security)
Cost and licensing
Azure Cache for Redis Pricing (Norway East - 2026)
| Tier | Size | Capacity | Monthly cost (NOK) | Best For |
|---|---|---|---|---|
| Basic C0 | 250 MB | N/A (no SLA) | ~400 | Dev/Test |
| Standard C1 | 1 GB | 2 replicas, 99.9% SLA | ~1,200 | Small production |
| Premium P1 | 6 GB | Clustering, geo-replication | ~7,000 | Enterprise |
| Enterprise E10 | 12 GB | Active-active, 99.99% SLA | ~25,000 | Mission-critical |
| Enterprise Flash F300 | 345 GB | 20% RAM + 80% Flash | ~60,000 | Large-scale AI |
Cost Optimization Tips:
- Start with Premium P1 for production RAG (best price/performance)
- Scale out vs scale up: add replicas before moving to a higher tier
- Use the Flash tier for large caches (>100GB): 5x lower cost per GB vs Enterprise
- Monitor cache hit rate: a hit rate < 20% indicates an ineffective caching strategy
- Implement TTL aggressively: a smaller cache fits a lower tier
Verified (Microsoft Learn - Plan and manage costs)
Azure Cosmos DB Pricing
Request Units (RU/s) for Semantic Cache:
- Vector query (1KB): ~10-50 RU
- Write (cache store): ~5-10 RU
- Storage: ~2.5 NOK/GB/month
Cost Example (10,000 queries/day):
- 10,000 queries × 30 RU avg = 300,000 RU/day ≈ 3.5 RU/s avg
- Provisioned: 100 RU/s (for burst) = ~600 NOK/month
- Storage (10GB cache): ~25 NOK/month
- Total: ~625 NOK/month
Baseline (Cosmos DB pricing calculator estimates)
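The arithmetic in the cost example can be reproduced with two small helpers. These are a sketch for re-running the estimate with other inputs: the function names are hypothetical, and the NOK figures are the estimates quoted above, not live prices.

```python
SECONDS_PER_DAY = 86_400


def average_ru_per_second(queries_per_day, ru_per_query):
    """Average request-unit rate implied by a daily query volume."""
    return queries_per_day * ru_per_query / SECONDS_PER_DAY


def monthly_total_nok(provisioned_nok, storage_gb, storage_nok_per_gb=2.5):
    """Provisioned throughput cost plus storage cost per month."""
    return provisioned_nok + storage_gb * storage_nok_per_gb
```

For the example above, `average_ru_per_second(10_000, 30)` is about 3.5 RU/s and `monthly_total_nok(600, 10)` gives 625 NOK. The gap between the average and the provisioned rate is headroom for bursts.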
TCO comparison
| Scenario | Without Caching | With Semantic Caching (Redis) | Savings |
|---|---|---|---|
| 100K LLM queries/day (GPT-4) | ~450,000 NOK/month | ~150,000 NOK/month + 7,000 (Redis) | 65% |
| 10K queries/day (GPT-3.5) | ~45,000 NOK/month | ~15,000 NOK/month + 7,000 (Redis) | 51% |
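Assumptions: avg 2000 tokens/query; LLM spend with caching falls to one third of the uncached figure, which corresponds to a cache hit rate of roughly 67% (a 50% hit rate would leave half the LLM spend)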
Baseline (TCO analysis based on Azure pricing)
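The savings column can be checked with a few lines of arithmetic. This is a sketch only: `savings_percent` and `implied_llm_reduction` are hypothetical helper names, and the inputs are the table's own NOK estimates.

```python
def savings_percent(baseline_nok, llm_with_cache_nok, cache_infra_nok):
    """Percentage saved once the cache infrastructure cost is added back."""
    total_with_cache = llm_with_cache_nok + cache_infra_nok
    return round(100 * (1 - total_with_cache / baseline_nok))


def implied_llm_reduction(baseline_nok, llm_with_cache_nok):
    """Share of LLM spend avoided, which approximates the assumed hit rate."""
    return round(1 - llm_with_cache_nok / baseline_nok, 2)
```

For the first row, `savings_percent(450_000, 150_000, 7_000)` gives 65 and `implied_llm_reduction(450_000, 150_000)` gives 0.67. The fixed Redis cost weighs more heavily at low volume, which is why the second row saves only 51%.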
For the architect (Cosmo)
Questions to ask the customer
- Traffic pattern: How many LLM queries per day/hour do you expect? What is peak vs average load?
- Query similarity: Are there many repetitive or semantically similar questions? (Indicates semantic cache ROI)
- Data freshness: How often does the underlying data change? What staleness window is acceptable?
- Privacy requirements: Do you handle personal data? Is user-scoped caching needed?
- Compliance: Which regulatory frameworks apply (GDPR, Schrems II, Datatilsynet)?
- Budget: What is the total budget for LLM + caching infrastructure?
- Latency SLA: What is the maximum acceptable response time (p50, p95, p99)?
- Global reach: Is multi-region caching needed for latency or compliance?
Pitfalls to avoid
| Pitfall | Impact | Mitigation |
|---|---|---|
| Caching without a context window | Contextually incorrect responses → user frustration | Vectorize chat history + prompt |
| Global caching of personal data | GDPR violation, potential fines | User-scoped keys, TTL enforcement |
| Mis-tuned similarity threshold | Low hit rate or irrelevant hits, caching ineffective | Start low (0.15 for APIM `score-threshold`, where lower is stricter), then tune up |
| Ingen invalidation strategy | Stale data → incorrect LLM responses | Webhook-based invalidation |
| Undersized cache tier | High eviction rate, low hit rate | Monitor evictions, scale proactively |
| Ignoring embedding overhead | Latency increase vs direct LLM call | Batch embeddings, use async patterns |
Recommendations by maturity level
Level 1 - Pilot (0-6 months of RAG experience):
- Start with Azure API Management semantic caching (managed, low complexity)
- Use case: FAQ chatbot with <1000 queries/day
- Tier: Standard Redis (C1) for learning at low cost; APIM semantic caching itself needs a RediSearch-capable cache (see the prerequisites above)
- Monitoring: Basic hit rate metrics in APIM
Level 2 - Production (6-18 months):
- Implement multi-layer caching (Redis L1 + Cosmos DB L2)
- Use case: Customer support RAG with 10K-100K queries/day
- Tier: Premium Redis (P1) + Cosmos DB autoscale
- Monitoring: Application Insights with custom metrics (hit rate, latency, cost per query)
Level 3 - Enterprise (18+ months):
- Hybrid semantic + retrieval caching with advanced invalidation
- Use case: Multi-tenant SaaS RAG platform, 100K+ queries/day
- Tier: Enterprise Redis (E10) + global Cosmos DB
- Monitoring: Full observability stack (Grafana, custom dashboards, alerting)
Baseline (Maturity model for AI implementations)
Sources and verification
Microsoft Learn Documentation
1. Application design for AI workloads on Azure - Multi-layer caching strategies. https://learn.microsoft.com/en-us/azure/well-architected/ai/application-design#implement-multi-layer-caching-strategies Confidence: Verified (2026-02)
2. Introduction to semantic cache - Semantic caching concepts, context window requirements. https://learn.microsoft.com/en-us/azure/cosmos-db/gen-ai/semantic-cache Confidence: Verified (2026-02)
3. Enable semantic caching for LLM APIs in Azure API Management - APIM semantic cache implementation. https://learn.microsoft.com/en-us/azure/api-management/azure-openai-enable-semantic-caching Confidence: Verified (2026-02)
4. Tips for better performance in Azure AI Search - Index caching, performance optimization. https://learn.microsoft.com/en-us/azure/search/search-performance-tips Confidence: Verified (2026-02)
5. Azure Managed Redis architecture - Flash tier workloads, caching strategies. https://learn.microsoft.com/en-us/azure/redis/architecture#flash-optimized-tier Confidence: Verified (2026-02)
6. Plan and manage costs of an Azure AI Search service - Cost optimization, enrichment caching. https://learn.microsoft.com/en-us/azure/search/search-sku-manage-costs#minimize-costs Confidence: Verified (2026-02)
7. Data platform considerations for mission-critical workloads - Azure Cache for Redis enterprise patterns. https://learn.microsoft.com/en-us/azure/well-architected/mission-critical/mission-critical-data-platform#caching-for-hot-tier-data Confidence: Verified (2026-02)
Code Samples
8. RAG implementation with Azure AI Search - Python RAG cache patterns. https://learn.microsoft.com/en-us/azure/search/retrieval-augmented-generation-overview#content-retrieval-in-azure-ai-search Confidence: Verified (code sample)
9. Azure Cache for Redis with Python - Redis connection and caching code. https://learn.microsoft.com/en-us/azure/redis/python-get-started#code-to-connect-to-a-redis-cache Confidence: Verified (code sample)
Confidence Levels per Section
| Section | Confidence | Source |
|---|---|---|
| Multi-layer caching strategy | Verified | Microsoft Learn docs (1) |
| Semantic caching pattern | Verified | Microsoft Learn docs (2, 3) |
| Azure Cache for Redis configuration | Verified | Microsoft Learn docs (5, 7), code samples (9) |
| Azure API Management policies | Verified | Microsoft Learn docs (3) |
| Azure AI Search caching | Verified | Microsoft Learn docs (4, 6) |
| Cost estimates | Baseline | Azure pricing calculator (2026-02) |
| GDPR compliance patterns | Baseline | Industry best practices |
| Maturity model recommendations | Baseline | Architecture consulting experience |
Total number of sources: 9 unique Microsoft Learn URLs | MCP calls: 7 (4 docs_search + 2 docs_fetch + 1 code_sample_search) | Last verified: 2026-02-03
Azure Managed Redis: Architecture (updated 2026-04)
Azure Managed Redis (based on Redis Enterprise) is recommended for AI workloads over Azure Cache for Redis (community edition):
| Property | Azure Cache for Redis | Azure Managed Redis |
|---|---|---|
| Threading | Single-threaded | Multi-threaded (Redis Enterprise) |
| Architecture | Primary + replica (2 nodes) | Multiple shards per node, distributed primaries |
| Performance | Limited by a single thread | Near-linear scaling with vCPUs |
| Clustering | Optional | Always enabled (OSS, Enterprise, or Non-Clustered policy) |
| Active geo-replication | No | Yes |
Cluster policies:
- OSS policy: recommended for most workloads. The client connects directly to the shards, giving the lowest latency and best throughput
- Enterprise policy: single endpoint, backward compatible, but the single-node proxy can become a bottleneck. Required for RediSearch
- Non-Clustered: only for caches ≤25 GB, intended for migration from non-sharded environments
Flash Optimized tier: 20% RAM + 80% NVMe Flash. Best suited to read-heavy workloads with a subset of hot keys.