ktg-plugin-marketplace/plugins/ms-ai-architect/skills/ms-ai-engineering/references/rag-architecture/rag-caching-optimization.md
Kjell Tore Guttormsen ff6a50d14f docs(architect): weekly KB update — 106 files refreshed (2026-04)
Updates across all 5 skills: ms-ai-advisor, ms-ai-engineering,
ms-ai-governance, ms-ai-security, ms-ai-infrastructure.

Key changes:
- Language Services (Custom Text Classification, Text Analytics, QnA):
  retirement warning 2029-03-31, migration guides to Foundry/GPT-4o
- Agentic Retrieval: 50M free reasoning tokens/month (Public Preview)
- Computer Use: Claude Sonnet 4.5 (preview) + OpenAI CUA models
- Agent Registry: Risks column (M365 E7), user-shared/org-published types
- Declarative agents: schema v1.5 → v1.6, Store validation requirements
- MLflow 3: 13 built-in LLM judges, production monitoring, Genie Code
- AG-UI HITL: ApprovalRequiredAIFunction (C#) + @tool(approval_mode) (Python)
- Entra ID Ignite 2025: Agent ID Admin/Developer RBAC roles, Conditional Access
- Security Copilot: 400 SCU/month per 1000 M365 E5 licenses, auto-provisioned
- Fast Transcription API: phrase lists, 14-language multi-lingual transcription
- Azure Monitor Workbooks: Bicep support, RBAC specifics
- Power Platform Copilot: data residency (Norway/Europe → EU DB, Bing → USA)
- RAG security-rbac: 4-approach table (GA + 3 preview access control methods)
- IaC MLOps: Well-Architected OE:05 principles, Bicep/Terraform patterns
- Translator: image file batch translation Preview (JPEG/PNG/BMP/WebP)

All 106 files: Last updated 2026-04 | Verified: MCP 2026-04

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-10 09:13:24 +02:00

22 KiB
Raw Blame History

RAG Caching and Performance Optimization

Last updated: 2026-04 | Verified: MCP 2026-04 Status: GA Category: RAG Architecture & Semantic Search


Introduksjon

Caching er en kritisk optimaliseringsstrategi for RAG-applikasjoner (Retrieval-Augmented Generation) som kan dramatisk redusere både latency og kostnader. I typiske RAG-scenarier er kall til LLM-modeller ofte den mest kostbare og tidkrevende operasjonen, spesielt når store mengder kontekstdata og chat history sendes med hver request. En godt designet caching-strategi kan redusere antall LLM-invocations med opptil 90% for high-traffic scenarier med repeterende eller semantisk like queries.

Multi-layer caching-tilnærmingen dekker flere nivåer i RAG-arkitekturen: result caching (hele LLM-responser), retrieval caching (knowledge fragments fra vektorsøk), embedding caching (forhåndsberegnede vektorrepresentasjoner), og semantic caching (semantisk like prompts). Hver av disse lagene adresserer ulike aspekter av ytelse og kostnadsoptimalisering.

Microsoft-stakken tilbyr flere tjenester optimalisert for AI-workloads: Azure Cache for Redis (traditional og semantic caching), Azure Cosmos DB (semantic cache med vektorsøk), Azure AI Search (built-in caching av search results), og Azure API Management (semantic caching for LLM APIs). Valget av løsning avhenger av cache-type, scale-requirements, og compliance-krav.

Kjernekomponenter

Multi-layer caching-strategi

Cache Layer Formål Typisk Hit Rate Latency Impact
Result caching Cache hele LLM-responser for identiske/semantisk like queries 30-60% (high-traffic) -80% til -95%
Retrieval caching Cache knowledge fragments fra vector search 40-70% -50% til -70%
Embedding caching Cache forhåndsberegnede embeddings 60-90% -30% til -50%
Model output caching Cache intermediate model outputs 20-40% -40% til -60%

Verified (Microsoft Learn - Application design for AI workloads)

Cache Key Components

Effektive cache keys må inkludere:

  • Tenant/User identity — For multi-tenant security
  • Policy context — RBAC og data access policies
  • Model version — Unngå stale responses ved model updates
  • Prompt version — Track prompt engineering changes
  • Context window — Chat history for contextual relevance

Verified (Microsoft Learn - Multi-layer caching strategies)

Time-to-Live (TTL) Policies

Data Type Anbefalt TTL Begrunnelse
Static content (dokumentasjon, policies) 24-72 timer Sjelden endring
Dynamic content (dashboard data) 5-30 minutter Moderate freshness-krav
User-specific queries 1-5 minutter Privacy og freshness
Search results 15-60 minutter Balanse mellom cost og freshness

Baseline (Industry best practices)

Cache Invalidation Triggers

  • Data updates — Webhook-triggered invalidation ved source data changes
  • Model changes — Invalidate ved model deployment/retraining
  • Prompt modifications — Clear cache ved prompt template changes
  • Manual purge — Admin-triggered for compliance eller testing

Verified (Microsoft Learn - Caching strategies)

Arkitekturmønstre

1. Semantic Caching (anbefalt for RAG)

Beskrivelse: Bruker vector similarity search på cached prompts for å returnere responses til semantisk like queries, selv om teksten ikke er identisk.

Hvordan det fungerer:

  1. Incoming prompt vektoriseres med embedding model
  2. Vector search kjøres mot cached prompt vectors
  3. Items med similarity score > threshold returneres
  4. Ved cache miss: LLM genererer response, som caches med vectorized prompt

Fordeler:

  • Høyere cache hit rate enn traditional key-value caching (30-60% vs 10-20%)
  • Håndterer variasjon i user input (synonyms, paraphrasing)
  • Reduserer LLM token consumption drastisk

Ulemper:

  • Krever embedding model (extra latency ~50-100ms)
  • Mer kompleks implementation
  • Krever vector-capable cache (Cosmos DB, Redis med RediSearch)

Context Window Requirement: Semantic cache MÅ operere innenfor context window. Uten chat history kan cache returnere contextually incorrect responses.

Eksempel: User spør "What is the largest lake in North America?" (cached: "Lake Superior"), deretter "What is the second largest?" Uten context window ville cache kunne returnere feil svar til en annen user som spør samme oppfølgingsspørsmål i en annen kontekst.

Verified (Microsoft Learn - Semantic cache introduction)

2. Multi-tier Result Caching

Beskrivelse: Kombinerer in-memory cache (Redis) med persistent cache (Cosmos DB) for optimal balance mellom speed og durability.

Arkitektur:

User Query → L1: Redis (in-memory, <5ms) → L2: Cosmos DB (persistent, <50ms) → LLM (fallback, >2s)

Fordeler:

  • Sub-5ms response time for hot data
  • Data durability ved cache failures
  • Cost-effective (Redis for hot, Cosmos for warm data)

Ulemper:

  • Økt complexity i cache management
  • Potential for stale data across tiers
  • Høyere infrastructure cost

Baseline (Common enterprise pattern)

3. Retrieval Snippet Caching

Beskrivelse: Cache frequently retrieved knowledge fragments fra Azure AI Search eller vector databases for å unngå repeated database queries.

Implementation:

  • Cache top-K search results per query pattern
  • Key: hash(query + filters + top-K)
  • TTL: 15-60 minutter (avhengig av data freshness-krav)

Fordeler:

  • Reduserer Azure AI Search query costs (50-70% reduction)
  • Lavere latency for grounding data retrieval
  • Mindre load på vector index

Ulemper:

  • Stale grounding data hvis source documents oppdateres
  • Cache size kan vokse raskt med mange unique queries

Verified (Microsoft Learn - Retrieval caching)

Beslutningsveiledning

Når bruke hvilken caching-strategi

Scenario Anbefalt Strategi Rationale
Chatbot med repeterende FAQs Semantic caching (Redis + RediSearch) Høy hit rate, semantisk matching
Document Q&A med mange unique queries Retrieval snippet caching Kostnad-effektiv, fokus på grounding data
Real-time dashboard med AI insights Multi-tier caching (Redis L1 + Cosmos L2) Speed + durability
Compliance-sensitive applikasjoner User-scoped semantic caching Privacy protection, audit trail

Baseline (Architecture decision framework)

Vanlige feil å unngå

Anti-pattern Problem Løsning
Caching user-private data globally Privacy violation, data leakage Scope cache keys by user/tenant identity
Ingen TTL policy Runaway cache growth, stale data Implement TTL basert på data sensitivity
For høy similarity threshold (>0.8) Lav cache hit rate Start med 0.15-0.3, tune basert på metrics
Caching uten context window Contextually incorrect responses Vectorize chat history + latest prompt
Ingen invalidation strategy Stale responses ved data updates Implement webhook-based invalidation

Verified (Microsoft Learn - Caching risks)

Røde flagg

  • Cache hit rate < 20% etter tuning → Revurder cache strategy
  • Cache size vokser >10GB/dag → Implementer aggressive TTL eller pruning
  • Latency øker etter caching → Sjekk embedding model overhead
  • Brukerklager på stale data → Reduser TTL eller implementer invalidation

Baseline (Performance monitoring thresholds)

Integrasjon med Microsoft-stakken

Azure Cache for Redis

Use Cases: Traditional result caching, high-throughput scenarios

Tiers:

  • Premium tier — 99.9% SLA, up to 120GB per shard
  • Enterprise tier — 99.99% SLA, active-active geo-replication, Flash storage support
  • Enterprise Flash tier — Up to 13TB cache size, 20% RAM + 80% NVMe Flash

Workloads suited for Flash tier:

  • Read-heavy (high read/write ratio)
  • Hot/cold access patterns (frequently accessed subset)
  • Large values (keys in RAM, values in Flash)

Not suited for Flash tier:

  • Write-heavy workloads
  • Uniform data access patterns
  • Long key names with small values

Configuration for AI workloads:

import redis
from azure.identity import DefaultAzureCredential
from redis_entraid.cred_provider import create_from_default_azure_credential

credential_provider = create_from_default_azure_credential(
    ("https://redis.azure.com/.default",),
)

r = redis.Redis(
    host="<redis-host>.redis.cache.windows.net",
    port=10000,
    ssl=True,
    decode_responses=True,
    credential_provider=credential_provider
)

# Set TTL på cached item
r.setex("cache_key", 3600, "cached_value")  # 1 hour TTL

Verified (Microsoft Learn - Azure Managed Redis architecture, code samples)

Azure API Management - Semantic Caching

Use Case: Semantic caching for LLM APIs (Azure OpenAI, Model Inference API)

Prerequisites:

  • Azure Managed Redis med RediSearch module enabled
  • Embeddings API deployment (for vectorization)
  • Chat Completion API deployment (for user requests)

Policy Configuration:

Inbound (cache lookup):

<azure-openai-semantic-cache-lookup
    score-threshold="0.15"
    embeddings-backend-id="embeddings-backend"
    embeddings-backend-auth="system-assigned"
    ignore-system-messages="true"
    max-message-count="10">
    <vary-by>@(context.Subscription.Id)</vary-by>
</azure-openai-semantic-cache-lookup>
<rate-limit calls="10" renewal-period="60" />

Outbound (cache store):

<azure-openai-semantic-cache-store duration="60" />

Score Threshold Tuning:

  • 0.1-0.2 → Liberal matching, høy hit rate, noe lavere relevance
  • 0.3-0.5 → Balanced, medium hit rate, god relevance
  • 0.6-0.8 → Strict matching, lav hit rate, høy relevance

Verified (Microsoft Learn - Enable semantic caching for LLM APIs)

Azure Cosmos DB for NoSQL

Use Case: Semantic cache med built-in vector search, persistent storage

Implementation Pattern:

from azure.cosmos import CosmosClient
from openai import AzureOpenAI

# Setup Cosmos DB vector store
cosmos_client = CosmosClient(url=cosmos_uri, credential=cosmos_key)
database = cosmos_client.get_database_client(cosmos_database_name)
container = database.get_container_client(cosmos_container_name)

# Query semantic cache
def query_cache(prompt_vector, similarity_threshold=0.15, top_k=5):
    query = f"""
    SELECT TOP {top_k} c.id, c.prompt, c.completion,
           VectorDistance(c.promptVector, @promptVector) AS similarity
    FROM c
    WHERE VectorDistance(c.promptVector, @promptVector) > @threshold
    ORDER BY VectorDistance(c.promptVector, @promptVector) DESC
    """

    items = list(container.query_items(
        query=query,
        parameters=[
            {"name": "@promptVector", "value": prompt_vector},
            {"name": "@threshold", "value": similarity_threshold}
        ]
    ))
    return items

Fordeler:

  • Globally distributed, multi-region writes
  • Automatic indexing av vectors
  • 99.999% SLA med multi-region setup
  • Built-in TTL support

Verified (Microsoft Learn - Semantic cache with Cosmos DB, code samples)

Azure AI Search - Built-in Caching

Automatic Caching Behavior: Azure AI Search cacher automatisk content etter første query for raskere subsequent searches.

Optimization Tips:

  • Reduser index size → raskere caching, mindre memory footprint
  • Selective field attribution → kun indexer nødvendige fields
  • Unngå over-attribution (filterable, sortable, facetable) → reduserer storage 4x

Performance Factors:

  • Smaller indexes → mer content i cache → lavere query latency
  • Higher tiers (S2, S3) → mer memory → større cache capacity
  • Partitions → parallel processing for slow queries

Verified (Microsoft Learn - Azure AI Search performance tips)

Offentlig sektor (Norge)

GDPR og Privacy

Cache Key Scoping (OBLIGATORISK):

  • Aldri cache user-private content uten proper scoping by user identity
  • Implementer tenant/user isolation i cache keys
  • Audit trail for cached persondata

Data Minimization:

  • Cache kun minimum nødvendig data for å svare på query
  • TTL på persondata skal ikke overstige formåls-begrensningen
  • Automatisk sletting ved user request (GDPR Article 17)

Eksempel - GDPR-compliant cache key:

cache_key = f"user:{user_id}:tenant:{tenant_id}:query_hash:{hash(prompt)}"
# TTL: 1 hour (minimal for chat session)

Baseline (GDPR compliance patterns)

Compliance-krav

Krav Implementation
Dataportabilitet (GDPR Art. 20) Export cached user data on request
Rett til sletting (GDPR Art. 17) Implement cache purge by user_id
Behandlingsgrunnlag Dokumenter legitimate interest for caching
Datatilsynet rapportering Audit log for cache access/invalidation

Baseline (Norwegian public sector compliance)

Sikkerhet

Encryption:

  • At rest: Azure Cache for Redis (Premium/Enterprise) — automatic encryption
  • In transit: TLS 1.2+ mandatory for all cache connections
  • Key management: Azure Key Vault for cache access keys

Access Control:

  • Microsoft Entra ID authentication for Redis (preview)
  • Role-based access control (RBAC) for cache management
  • Network isolation via Private Endpoints

Verified (Microsoft Learn - Redis security)

Kostnad og lisensiering

Azure Cache for Redis Pricing (Norway East - 2026)

Tier Size Kapasitet Månedskostnad (NOK) Best For
Basic C0 250 MB N/A (no SLA) ~400 Dev/Test
Standard C1 1 GB 2 replicas, 99.9% SLA ~1,200 Small production
Premium P1 6 GB Clustering, geo-replication ~7,000 Enterprise
Enterprise E10 12 GB Active-active, 99.99% SLA ~25,000 Mission-critical
Enterprise Flash F300 345 GB 20% RAM + 80% Flash ~60,000 Large-scale AI

Cost Optimization Tips:

  1. Start with Premium P1 for production RAG (best price/performance)
  2. Scale out vs scale up — Add replicas før du går til høyere tier
  3. Use Flash tier for large caches (>100GB) — 5x lavere cost per GB vs Enterprise
  4. Monitor cache hit rate — <20% hit rate betyr ineffektiv caching strategy
  5. Implement TTL aggressively — Reduser cache size, lavere tier

Verified (Microsoft Learn - Plan and manage costs)

Azure Cosmos DB Pricing

Request Units (RU/s) for Semantic Cache:

  • Vector query (1KB): ~10-50 RU
  • Write (cache store): ~5-10 RU
  • Storage: ~2.5 NOK/GB/måned

Cost Example (10,000 queries/day):

  • 10,000 queries × 30 RU avg = 300,000 RU/day = 3.5 RU/s avg
  • Provisioned: 100 RU/s (for burst) = ~600 NOK/måned
  • Storage (10GB cache): ~25 NOK/måned
  • Total: ~625 NOK/måned

Baseline (Cosmos DB pricing calculator estimates)

TCO Sammenligning

Scenario Without Caching With Semantic Caching (Redis) Savings
100K LLM queries/day (GPT-4) ~450,000 NOK/måned ~150,000 NOK/måned + 7,000 (Redis) 65%
10K queries/day (GPT-3.5) ~45,000 NOK/måned ~15,000 NOK/måned + 7,000 (Redis) 51%

Assumptions: 50% cache hit rate, avg 2000 tokens/query

Baseline (TCO analysis based on Azure pricing)

For arkitekten (Cosmo)

Spørsmål å stille kunden

  1. Traffic pattern: Hvor mange LLM queries per dag/time forventer dere? Hva er peak vs avg load?
  2. Query similarity: Er det mange repeterende eller semantisk like spørsmål? (Indikerer semantic cache ROI)
  3. Data freshness: Hvor ofte endres underlying data? Hva er akseptabelt staleness-vindu?
  4. Privacy requirements: Håndterer dere persondata? Trengs user-scoped caching?
  5. Compliance: Hvilke regulatory frameworks gjelder (GDPR, Schrems II, Datatilsynet)?
  6. Budget: Hva er totalt budsjett for LLM + caching infrastructure?
  7. Latency SLA: Hva er maks akseptabel response time (p50, p95, p99)?
  8. Global reach: Trengs multi-region caching for latency eller compliance?

Fallgruver å unngå

Fallgruve Impact Mitigering
Caching uten context window Contextually incorrect responses → user frustration Vectorize chat history + prompt
Global caching av persondata GDPR violation, potential bøter User-scoped keys, TTL enforcement
For høy similarity threshold Lav hit rate, caching ineffective Start lavt (0.15), tune opp
Ingen invalidation strategy Stale data → incorrect LLM responses Webhook-based invalidation
Undersized cache tier High eviction rate, lav hit rate Monitor evictions, scale proaktivt
Ignoring embedding overhead Latency increase vs direct LLM call Batch embeddings, use async patterns

Anbefalinger per modenhetsnivå

Level 1 - Pilot (0-6 måneder RAG erfaring):

  • Start med Azure API Management semantic caching (managed, low-complexity)
  • Use case: FAQ chatbot med <1000 queries/dag
  • Tier: Standard Redis (C1) for læring, lav cost
  • Monitoring: Basic hit rate metrics i APIM

Level 2 - Production (6-18 måneder):

  • Implementer multi-layer caching (Redis L1 + Cosmos DB L2)
  • Use case: Customer support RAG med 10K-100K queries/dag
  • Tier: Premium Redis (P1) + Cosmos DB autoscale
  • Monitoring: Application Insights med custom metrics (hit rate, latency, cost per query)

Level 3 - Enterprise (18+ måneder):

  • Hybrid semantic + retrieval caching med advanced invalidation
  • Use case: Multi-tenant SaaS RAG platform, 100K+ queries/dag
  • Tier: Enterprise Redis (E10) + global Cosmos DB
  • Monitoring: Full observability stack (Grafana, custom dashboards, alerting)

Baseline (Maturity model for AI implementations)

Kilder og verifisering

Microsoft Learn Documentation

  1. Application design for AI workloads on Azure - Multi-layer caching strategies https://learn.microsoft.com/en-us/azure/well-architected/ai/application-design#implement-multi-layer-caching-strategies Confidence: Verified (2026-02)

  2. Introduction to semantic cache - Semantic caching concepts, context window requirements https://learn.microsoft.com/en-us/azure/cosmos-db/gen-ai/semantic-cache Confidence: Verified (2026-02)

  3. Enable semantic caching for LLM APIs in Azure API Management - APIM semantic cache implementation https://learn.microsoft.com/en-us/azure/api-management/azure-openai-enable-semantic-caching Confidence: Verified (2026-02)

  4. Tips for better performance in Azure AI Search - Index caching, performance optimization https://learn.microsoft.com/en-us/azure/search/search-performance-tips Confidence: Verified (2026-02)

  5. Azure Managed Redis architecture - Flash tier workloads, caching strategies https://learn.microsoft.com/en-us/azure/redis/architecture#flash-optimized-tier Confidence: Verified (2026-02)

  6. Plan and manage costs of an Azure AI Search service - Cost optimization, enrichment caching https://learn.microsoft.com/en-us/azure/search/search-sku-manage-costs#minimize-costs Confidence: Verified (2026-02)

  7. Data platform considerations for mission-critical workloads - Azure Cache for Redis enterprise patterns https://learn.microsoft.com/en-us/azure/well-architected/mission-critical/mission-critical-data-platform#caching-for-hot-tier-data Confidence: Verified (2026-02)

Code Samples

  1. RAG implementation with Azure AI Search - Python RAG cache patterns https://learn.microsoft.com/en-us/azure/search/retrieval-augmented-generation-overview#content-retrieval-in-azure-ai-search Confidence: Verified (code sample)

  2. Azure Cache for Redis with Python - Redis connection and caching code https://learn.microsoft.com/en-us/azure/redis/python-get-started#code-to-connect-to-a-redis-cache Confidence: Verified (code sample)

Confidence Levels per Section

Seksjon Confidence Source
Multi-layer caching strategy Verified Microsoft Learn docs (1)
Semantic caching pattern Verified Microsoft Learn docs (2, 3)
Azure Cache for Redis configuration Verified Microsoft Learn docs (5, 7), code samples (9)
Azure API Management policies Verified Microsoft Learn docs (3)
Azure AI Search caching Verified Microsoft Learn docs (4, 6)
Cost estimates Baseline Azure pricing calculator (2026-02)
GDPR compliance patterns Baseline Industry best practices
Maturity model recommendations Baseline Architecture consulting experience

Totalt antall kilder: 9 unike Microsoft Learn URLer MCP calls: 6 (4 docs_search + 2 docs_fetch + 1 code_sample_search) Sist verifisert: 2026-02-03

Azure Managed Redis — Arkitektur (oppdatert 2026-04)

Azure Managed Redis (basert på Redis Enterprise) er anbefalt for AI-workloads vs. Azure Cache for Redis (community edition):

Egenskap Azure Cache for Redis Azure Managed Redis
Threading Single-threaded Multi-threaded (Redis Enterprise)
Arkitektur Primary + replica (2 nodes) Multiple shards per node, distributed primaries
Performance Begrenset av single thread Nær-lineær skalering med vCPUs
Clustering Valgfritt Alltid aktivert (OSS, Enterprise, eller Non-Clustered policy)
Active geo-replication Nei Ja

Cluster policies:

  • OSS policy — anbefalt for de fleste. Klienten kobles direkte til shards, laveste latency, best throughput
  • Enterprise policy — enkelt endpoint, bakoverkompatibelt, men enkelt-node proxy kan bli bottleneck. Påkrevd for RediSearch
  • Non-Clustered — kun ≤25 GB, for migrering fra ikke-shardede miljøer

Flash Optimized tier: 20% RAM + 80% NVMe Flash. Optimal for read-heavy workloads med subset av hot keys.