# RAG Caching and Performance Optimization

**Last updated:** 2026-04 | Verified: MCP 2026-04
**Status:** GA
**Category:** RAG Architecture & Semantic Search

---

## Introduction

Caching is a critical optimization strategy for RAG (Retrieval-Augmented Generation) applications and can dramatically reduce both latency and cost. In typical RAG scenarios, calls to LLMs are often the most expensive and time-consuming operation, especially when large amounts of context data and chat history are sent with every request. A well-designed caching strategy can reduce the number of LLM invocations by up to 90% in high-traffic scenarios with repetitive or semantically similar queries.

The multi-layer caching approach covers several levels of the RAG architecture: result caching (complete LLM responses), retrieval caching (knowledge fragments from vector search), embedding caching (precomputed vector representations), and semantic caching (semantically similar prompts). Each layer addresses a different aspect of performance and cost optimization.

The Microsoft stack offers several services optimized for AI workloads: Azure Cache for Redis (traditional and semantic caching), Azure Cosmos DB (semantic cache with vector search), Azure AI Search (built-in caching of search results), and Azure API Management (semantic caching for LLM APIs). The right choice depends on cache type, scale requirements, and compliance requirements.
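
As a rough illustration of how these layers stack, the sketch below checks a result cache first, then an embedding cache, then a retrieval cache, and only calls the injected embedding, search, and LLM functions on a miss. The class name `LayeredRagCache` and the `embed`, `retrieve`, and `generate` callables are hypothetical stand-ins. Plain dictionaries replace a real cache service, and TTLs and tenant scoping are left out for brevity.

```python
from typing import Callable, Dict, List, Tuple

Vector = Tuple[float, ...]


class LayeredRagCache:
    """Toy multi-layer cache: result -> embedding -> retrieval (all in memory)."""

    def __init__(
        self,
        embed: Callable[[str], Vector],
        retrieve: Callable[[Vector], List[str]],
        generate: Callable[[str, List[str]], str],
    ) -> None:
        self._embed = embed
        self._retrieve = retrieve
        self._generate = generate
        self.results: Dict[str, str] = {}
        self.embeddings: Dict[str, Vector] = {}
        self.fragments: Dict[Vector, List[str]] = {}

    def answer(self, query: str) -> str:
        key = " ".join(query.lower().split())  # normalise case and whitespace
        if key in self.results:  # layer 1: complete LLM response
            return self.results[key]
        vector = self.embeddings.get(key)  # layer 2: precomputed embedding
        if vector is None:
            vector = self._embed(key)
            self.embeddings[key] = vector
        fragments = self.fragments.get(vector)  # layer 3: retrieved fragments
        if fragments is None:
            fragments = self._retrieve(vector)
            self.fragments[vector] = fragments
        response = self._generate(query, fragments)  # most expensive step
        self.results[key] = response
        return response

    def invalidate_results(self) -> None:
        """Drop cached responses (e.g. after a prompt change) but keep lower layers."""
        self.results.clear()
```

Keeping the layers separate means that invalidating responses, for example after a prompt template change, does not force embeddings or retrieved fragments to be recomputed.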
## Core Components

### Multi-layer caching strategy

| Cache Layer | Purpose | Typical Hit Rate | Latency Impact |
|------------|--------|-----------------|----------------|
| **Result caching** | Cache complete LLM responses for identical or semantically similar queries | 30-60% (high-traffic) | -80% to -95% |
| **Retrieval caching** | Cache knowledge fragments from vector search | 40-70% | -50% to -70% |
| **Embedding caching** | Cache precomputed embeddings | 60-90% | -30% to -50% |
| **Model output caching** | Cache intermediate model outputs | 20-40% | -40% to -60% |

**Verified** (Microsoft Learn - Application design for AI workloads)

### Cache Key Components

Effective cache keys must include:

- **Tenant/User identity**: for multi-tenant security
- **Policy context**: RBAC and data access policies
- **Model version**: avoids stale responses after model updates
- **Prompt version**: tracks prompt engineering changes
- **Context window**: chat history for contextual relevance

**Verified** (Microsoft Learn - Multi-layer caching strategies)

### Time-to-Live (TTL) Policies

| Data Type | Recommended TTL | Rationale |
|-----------|--------------|-------------|
| Static content (documentation, policies) | 24-72 hours | Rarely changes |
| Dynamic content (dashboard data) | 5-30 minutes | Moderate freshness requirements |
| User-specific queries | 1-5 minutes | Privacy and freshness |
| Search results | 15-60 minutes | Balance between cost and freshness |

**Baseline** (Industry best practices)

### Cache Invalidation Triggers

- **Data updates**: webhook-triggered invalidation when source data changes
- **Model changes**: invalidate on model deployment or retraining
- **Prompt modifications**: clear the cache when prompt templates change
- **Manual purge**: admin-triggered for compliance or testing

**Verified** (Microsoft Learn - Caching strategies)

## Architecture Patterns

### 1. Semantic Caching (recommended for RAG)

**Description:** Uses vector similarity search over cached prompts to return responses for semantically similar queries, even when the text is not identical.

**How it works:**

1. The incoming prompt is vectorized with an embedding model
2. A vector search runs against the cached prompt vectors
3. Items with a similarity score > threshold are returned
4. On a cache miss, the LLM generates a response, which is cached together with the vectorized prompt

**Advantages:**

- Higher cache hit rate than traditional key-value caching (30-60% vs 10-20%)
- Handles variation in user input (synonyms, paraphrasing)
- Drastically reduces LLM token consumption

**Disadvantages:**

- Requires an embedding model (extra latency of ~50-100ms)
- More complex implementation
- Requires a vector-capable cache (Cosmos DB, Redis with RediSearch)

**Context Window Requirement:** A semantic cache MUST operate within a context window. Without chat history, the cache can return contextually incorrect responses.

**Example:** A user asks "What is the largest lake in North America?" (cached: "Lake Superior") and then "What is the second largest?" Without a context window, the cache could return the wrong answer to another user who asks the same follow-up question in a different context.

**Verified** (Microsoft Learn - Semantic cache introduction)

### 2. Multi-tier Result Caching

**Description:** Combines an in-memory cache (Redis) with a persistent cache (Cosmos DB) for a good balance between speed and durability.

**Architecture:**

```
User Query
  → L1: Redis (in-memory, <5ms)
  → L2: Cosmos DB (persistent, <50ms)
  → LLM (fallback, >2s)
```

**Advantages:**

- Sub-5ms response time for hot data
- Data durability when a cache fails
- Cost-effective (Redis for hot data, Cosmos DB for warm data)

**Disadvantages:**

- Increased complexity in cache management
- Potential for stale data across tiers
- Higher infrastructure cost

**Baseline** (Common enterprise pattern)

### 3. Retrieval Snippet Caching

**Description:** Caches frequently retrieved knowledge fragments from Azure AI Search or vector databases to avoid repeated database queries.

**Implementation:**

- Cache top-K search results per query pattern
- Key: hash(query + filters + top-K)
- TTL: 15-60 minutes (depending on data freshness requirements)

**Advantages:**

- Reduces Azure AI Search query costs (50-70% reduction)
- Lower latency for grounding data retrieval
- Less load on the vector index

**Disadvantages:**

- Stale grounding data if source documents are updated
- Cache size can grow quickly with many unique queries

**Verified** (Microsoft Learn - Retrieval caching)

## Decision Guidance

### When to use which caching strategy

| Scenario | Recommended Strategy | Rationale |
|----------|-------------------|-----------|
| Chatbot with repetitive FAQs | Semantic caching (Redis + RediSearch) | High hit rate, semantic matching |
| Document Q&A with many unique queries | Retrieval snippet caching | Cost-effective, focuses on grounding data |
| Real-time dashboard with AI insights | Multi-tier caching (Redis L1 + Cosmos L2) | Speed + durability |
| Compliance-sensitive applications | User-scoped semantic caching | Privacy protection, audit trail |

**Baseline** (Architecture decision framework)

### Common mistakes to avoid

| Anti-pattern | Problem | Solution |
|-------------|---------|---------|
| **Caching user-private data globally** | Privacy violation, data leakage | Scope cache keys by user/tenant identity |
| **No TTL policy** | Runaway cache growth, stale data | Implement TTLs based on data sensitivity |
| **Similarity threshold too high (>0.8)** | Low cache hit rate | Start with 0.15-0.3, tune based on metrics |
| **Caching without a context window** | Contextually incorrect responses | Vectorize chat history + latest prompt |
| **No invalidation strategy** | Stale responses after data updates | Implement webhook-based invalidation |

**Verified** (Microsoft Learn - Caching risks)

### Red flags

- Cache hit rate < 20% after tuning → Reconsider the cache strategy
- Cache size grows >10GB/day → Implement aggressive TTLs or pruning
- Latency increases after caching is introduced → Check embedding model overhead
- Users complain about stale data → Reduce TTL or implement invalidation

**Baseline** (Performance monitoring thresholds)

## Integration with the Microsoft Stack

### Azure Cache for Redis

**Use Cases:** Traditional result caching, high-throughput scenarios

**Tiers:**

- **Premium tier**: 99.9% SLA, up to 120GB per shard
- **Enterprise tier**: 99.99% SLA, active-active geo-replication, Flash storage support
- **Enterprise Flash tier**: up to 13TB cache size, 20% RAM + 80% NVMe Flash

**Workloads suited for Flash tier:**

- Read-heavy (high read/write ratio)
- Hot/cold access patterns (frequently accessed subset)
- Large values (keys in RAM, values in Flash)

**Not suited for Flash tier:**

- Write-heavy workloads
- Uniform data access patterns
- Long key names with small values

**Configuration for AI workloads:**

```python
import redis
from redis_entraid.cred_provider import create_from_default_azure_credential

# Authenticate with Microsoft Entra ID instead of access keys
credential_provider = create_from_default_azure_credential(
    ("https://redis.azure.com/.default",),
)

r = redis.Redis(
    host="<your-cache-name>.redis.cache.windows.net",  # placeholder host name
    port=10000,  # Azure Managed Redis port; Azure Cache for Redis uses 6380 for TLS
    ssl=True,
    decode_responses=True,
    credential_provider=credential_provider,
)

# Store a cached item with a TTL
r.setex("cache_key", 3600, "cached_value")  # 1 hour TTL
```

**Verified** (Microsoft Learn - Azure Managed Redis architecture, code samples)

### Azure API Management - Semantic Caching

**Use Case:** Semantic caching for LLM APIs (Azure OpenAI, Model Inference API)

**Prerequisites:**

- Azure Managed Redis with the **RediSearch module** enabled
- Embeddings API deployment (for vectorization)
- Chat Completion API deployment (for user requests)

**Policy Configuration:**

Inbound (cache lookup):

```xml
<!-- Attribute values are illustrative placeholders; adjust to your deployment -->
<azure-openai-semantic-cache-lookup
    score-threshold="0.15"
    embeddings-backend-id="embeddings-backend"
    embeddings-backend-auth="system-assigned">
    <vary-by>@(context.Subscription.Id)</vary-by>
</azure-openai-semantic-cache-lookup>
```

Outbound (cache store):

```xml
<!-- duration is the cache lifetime in seconds; the value is illustrative -->
<azure-openai-semantic-cache-store duration="60" />
```
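
Conceptually, the inbound lookup and outbound store steps amount to a nearest-neighbour search over cached prompt vectors, partitioned by a scope such as the subscription ID. The following is a minimal in-memory sketch of that idea, not APIM's implementation: `SemanticCache`, `min_similarity`, and the caller-supplied `embed` function are hypothetical names, and a linear scan stands in for a vector index. `min_similarity` here is a plain cosine-similarity cutoff and is not necessarily on the same scale as the policy's score threshold.

```python
import math
from typing import Callable, Dict, List, Optional, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class SemanticCache:
    """Scope-partitioned semantic cache kept in process memory."""

    def __init__(self, embed: Callable[[str], Sequence[float]],
                 min_similarity: float = 0.9) -> None:
        self._embed = embed
        self._min_similarity = min_similarity
        # scope (e.g. subscription ID) -> list of (prompt vector, completion)
        self._entries: Dict[str, List[Tuple[Sequence[float], str]]] = {}

    def lookup(self, scope: str, prompt: str) -> Optional[str]:
        """Return the closest cached completion in this scope, or None on a miss."""
        vector = self._embed(prompt)
        best_score, best_completion = 0.0, None
        for cached_vector, completion in self._entries.get(scope, []):
            score = cosine_similarity(vector, cached_vector)
            if score > best_score:
                best_score, best_completion = score, completion
        return best_completion if best_score >= self._min_similarity else None

    def store(self, scope: str, prompt: str, completion: str) -> None:
        """Cache a completion together with its vectorized prompt."""
        self._entries.setdefault(scope, []).append((self._embed(prompt), completion))
```

A caller would try `lookup` before invoking the model and call `store` with the model's response on a miss. Scoping entries per subscription keeps one tenant's cached answers from being served to another.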
**Score Threshold Tuning:**

In the APIM policy, `score-threshold` takes a value between 0.0 and 1.0, and lower values require a closer semantic match. This is the reverse of a similarity cutoff such as the `VectorDistance(...) > @threshold` filter in the Cosmos DB example below, where higher values are stricter.

- 0.1-0.2 → Strict matching, lower hit rate, higher relevance
- 0.3-0.5 → Balanced, medium hit rate, good relevance
- 0.6-0.8 → Liberal matching, higher hit rate, somewhat lower relevance

**Verified** (Microsoft Learn - Enable semantic caching for LLM APIs)

### Azure Cosmos DB for NoSQL

**Use Case:** Semantic cache with built-in vector search, persistent storage

**Implementation Pattern:**

```python
from azure.cosmos import CosmosClient
from openai import AzureOpenAI

# Set up the Cosmos DB vector store.
# cosmos_uri, cosmos_key, cosmos_database_name and cosmos_container_name
# are assumed to come from application configuration.
cosmos_client = CosmosClient(url=cosmos_uri, credential=cosmos_key)
database = cosmos_client.get_database_client(cosmos_database_name)
container = database.get_container_client(cosmos_container_name)


# Query the semantic cache
def query_cache(prompt_vector, similarity_threshold=0.15, top_k=5):
    query = f"""
        SELECT TOP {int(top_k)} c.id, c.prompt, c.completion,
               VectorDistance(c.promptVector, @promptVector) AS similarity
        FROM c
        WHERE VectorDistance(c.promptVector, @promptVector) > @threshold
        ORDER BY VectorDistance(c.promptVector, @promptVector) DESC
    """
    items = list(container.query_items(
        query=query,
        parameters=[
            {"name": "@promptVector", "value": prompt_vector},
            {"name": "@threshold", "value": similarity_threshold},
        ],
        # Required unless the query is pinned to a single partition key
        enable_cross_partition_query=True,
    ))
    return items
```

**Advantages:**

- Globally distributed, multi-region writes
- Automatic indexing of vectors
- 99.999% SLA with a multi-region setup
- Built-in TTL support

**Verified** (Microsoft Learn - Semantic cache with Cosmos DB, code samples)

### Azure AI Search - Built-in Caching

**Automatic Caching Behavior:** Azure AI Search automatically caches content after the first query, which speeds up subsequent searches.
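
This service-side caching is separate from the application-side retrieval snippet caching pattern covered earlier. A minimal sketch of that pattern's key, `hash(query + filters + top-K)`, combined with a TTL, might look like the following. `RetrievalCache`, `retrieval_cache_key`, and the injected `search` callable are hypothetical names, a dictionary with expiry timestamps stands in for Redis, and filter values are assumed to be JSON-serializable.

```python
import hashlib
import json
import time
from typing import Any, Callable, Dict, List, Optional, Tuple

SearchFn = Callable[[str, Optional[Dict[str, Any]], int], List[str]]


def retrieval_cache_key(query: str, filters: Optional[Dict[str, Any]], top_k: int) -> str:
    """Build a stable key from the normalised query, the filters and top-K."""
    payload = json.dumps(
        {"q": " ".join(query.lower().split()), "f": filters or {}, "k": top_k},
        sort_keys=True,  # filter order must not change the key
    )
    return "retrieval:" + hashlib.sha256(payload.encode("utf-8")).hexdigest()


class RetrievalCache:
    """Caches top-K search results with a TTL (in-memory stand-in for Redis)."""

    def __init__(self, search: SearchFn, ttl_seconds: float = 900.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._search = search
        self._ttl = ttl_seconds
        self._clock = clock
        # key -> (expiry time, cached results)
        self._store: Dict[str, Tuple[float, List[str]]] = {}

    def get_or_search(self, query: str, filters: Optional[Dict[str, Any]],
                      top_k: int) -> List[str]:
        key = retrieval_cache_key(query, filters, top_k)
        now = self._clock()
        entry = self._store.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]  # fresh hit: skip the search service
        results = self._search(query, filters, top_k)
        self._store[key] = (now + self._ttl, results)
        return results
```

The default of 900 seconds corresponds to the lower end of the 15-60 minute TTL range suggested for search results. Hashing a canonical JSON payload keeps keys stable across processes, which Python's built-in `hash()` does not guarantee for strings.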
**Optimization Tips:**

- Reduce index size → faster caching, smaller memory footprint
- Selective field attribution → index only the fields you need
- Avoid over-attribution (filterable, sortable, facetable) → reduces storage by up to 4x

**Performance Factors:**

- Smaller indexes → more content in cache → lower query latency
- Higher tiers (S2, S3) → more memory → larger cache capacity
- Partitions → parallel processing for slow queries

**Verified** (Microsoft Learn - Azure AI Search performance tips)

## Public Sector (Norway)

### GDPR and Privacy

**Cache Key Scoping (MANDATORY):**

- Never cache user-private content without proper scoping by user identity
- Implement tenant/user isolation in cache keys
- Keep an audit trail for cached personal data

**Data Minimization:**

- Cache only the minimum data needed to answer the query
- The TTL for personal data must not exceed what the purpose limitation allows
- Delete automatically on user request (GDPR Article 17)

**Example - GDPR-compliant cache key:**

```python
import hashlib
# hash() is salted per process; a SHA-256 digest keeps keys stable across workers
cache_key = f"user:{user_id}:tenant:{tenant_id}:query_hash:{hashlib.sha256(prompt.encode()).hexdigest()}"
# TTL: 1 hour (minimal for chat session)
```

**Baseline** (GDPR compliance patterns)

### Compliance Requirements

| Requirement | Implementation |
|------|----------------|
| **Data portability (GDPR Art. 20)** | Export cached user data on request |
| **Right to erasure (GDPR Art. 17)** | Implement cache purge by user_id |
| **Legal basis for processing** | Document the legitimate interest for caching |
| **Reporting to Datatilsynet** | Audit log for cache access/invalidation |

**Baseline** (Norwegian public sector compliance)

### Security

**Encryption:**

- **At rest:** Azure Cache for Redis (Premium/Enterprise), automatic encryption
- **In transit:** TLS 1.2+ mandatory for all cache connections
- **Key management:** Azure Key Vault for cache access keys

**Access Control:**

- Microsoft Entra ID authentication for Redis (preview)
- Role-based access control (RBAC) for cache management
- Network isolation via Private Endpoints

**Verified** (Microsoft Learn - Redis security)

## Cost and Licensing

### Azure Cache for Redis Pricing (Norway East - 2026)

| Tier | Size | Capabilities | Monthly cost (NOK) | Best For |
|------|------|-----------|---------------------|----------|
| Basic C0 | 250 MB | N/A (no SLA) | ~400 | Dev/Test |
| Standard C1 | 1 GB | 2 replicas, 99.9% SLA | ~1,200 | Small production |
| Premium P1 | 6 GB | Clustering, geo-replication | ~7,000 | Enterprise |
| Enterprise E10 | 12 GB | Active-active, 99.99% SLA | ~25,000 | Mission-critical |
| Enterprise Flash F300 | 345 GB | 20% RAM + 80% Flash | ~60,000 | Large-scale AI |

**Cost Optimization Tips:**

1. **Start with Premium P1** for production RAG (best price/performance)
2. **Scale out vs scale up**: add replicas before moving to a higher tier
3. **Use Flash tier for large caches** (>100GB): 5x lower cost per GB vs Enterprise
4. **Monitor cache hit rate**: a hit rate below 20% indicates an ineffective caching strategy
5. **Implement TTL aggressively**: a smaller cache can run on a lower tier

**Verified** (Microsoft Learn - Plan and manage costs)

### Azure Cosmos DB Pricing

**Request Units (RU/s) for Semantic Cache:**

- Vector query (1KB): ~10-50 RU
- Write (cache store): ~5-10 RU
- Storage: ~2.5 NOK/GB/month

**Cost Example (10,000 queries/day):**

- 10,000 queries × 30 RU avg = 300,000 RU/day ≈ 3.5 RU/s avg
- Provisioned: 100 RU/s (for burst) = ~600 NOK/month
- Storage (10GB cache): ~25 NOK/month
- **Total: ~625 NOK/month**

**Baseline** (Cosmos DB pricing calculator estimates)

### TCO Comparison

| Scenario | Without Caching | With Semantic Caching (Redis) | Savings |
|----------|-----------------|-------------------------------|---------|
| 100K LLM queries/day (GPT-4) | ~450,000 NOK/month | ~150,000 NOK/month + 7,000 (Redis) | 65% |
| 10K queries/day (GPT-3.5) | ~45,000 NOK/month | ~15,000 NOK/month + 7,000 (Redis) | 51% |

**Assumptions:** 50% cache hit rate, avg 2000 tokens/query

**Baseline** (TCO analysis based on Azure pricing)

## For the Architect (Cosmo)

### Questions to ask the customer

1. **Traffic pattern:** How many LLM queries per day/hour do you expect? What is peak vs average load?
2. **Query similarity:** Are there many repetitive or semantically similar questions? (Indicates semantic cache ROI)
3. **Data freshness:** How often does the underlying data change? What staleness window is acceptable?
4. **Privacy requirements:** Do you handle personal data? Is user-scoped caching needed?
5. **Compliance:** Which regulatory frameworks apply (GDPR, Schrems II, Datatilsynet)?
6. **Budget:** What is the total budget for LLM + caching infrastructure?
7. **Latency SLA:** What is the maximum acceptable response time (p50, p95, p99)?
8. **Global reach:** Is multi-region caching needed for latency or compliance?
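
The traffic and budget answers feed a rough back-of-the-envelope model. The helper below reproduces the arithmetic used in the Cosmos DB cost example and the TCO table above. The function names are hypothetical, and the inputs are this document's own estimates rather than current Azure prices.

```python
SECONDS_PER_DAY = 86_400


def average_ru_per_second(queries_per_day: int, ru_per_query: float) -> float:
    """Average Cosmos DB request-unit consumption for a given daily query volume."""
    return queries_per_day * ru_per_query / SECONDS_PER_DAY


def savings_fraction(cost_without_cache: float, llm_cost_with_cache: float,
                     cache_cost: float) -> float:
    """Share of the uncached monthly cost saved once cache infrastructure is paid for."""
    return 1 - (llm_cost_with_cache + cache_cost) / cost_without_cache


# Figures taken from the cost sections (NOK per month):
# average_ru_per_second(10_000, 30)           -> about 3.5 RU/s
# savings_fraction(450_000, 150_000, 7_000)   -> about 0.65
# savings_fraction(45_000, 15_000, 7_000)     -> about 0.51
```

Because the cache cost is fixed, it weighs more heavily at low volume, which is why the 10K queries/day scenario saves a smaller share than the 100K scenario.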
### Pitfalls to avoid

| Pitfall | Impact | Mitigation |
|-----------|--------|------------|
| **Caching without a context window** | Contextually incorrect responses → user frustration | Vectorize chat history + prompt |
| **Global caching of personal data** | GDPR violation, potential fines | User-scoped keys, TTL enforcement |
| **Similarity threshold too high** | Low hit rate, ineffective caching | Start low (0.15), tune upward |
| **No invalidation strategy** | Stale data → incorrect LLM responses | Webhook-based invalidation |
| **Undersized cache tier** | High eviction rate, low hit rate | Monitor evictions, scale proactively |
| **Ignoring embedding overhead** | Latency increase vs direct LLM call | Batch embeddings, use async patterns |

### Recommendations by maturity level

**Level 1 - Pilot (0-6 months of RAG experience):**

- Start with **Azure API Management semantic caching** (managed, low complexity)
- Use case: FAQ chatbot with <1000 queries/day
- Tier: Standard Redis (C1) for learning at low cost
- Monitoring: Basic hit rate metrics in APIM

**Level 2 - Production (6-18 months):**

- Implement **multi-layer caching** (Redis L1 + Cosmos DB L2)
- Use case: Customer support RAG with 10K-100K queries/day
- Tier: Premium Redis (P1) + Cosmos DB autoscale
- Monitoring: Application Insights with custom metrics (hit rate, latency, cost per query)

**Level 3 - Enterprise (18+ months):**

- **Hybrid semantic + retrieval caching** with advanced invalidation
- Use case: Multi-tenant SaaS RAG platform, 100K+ queries/day
- Tier: Enterprise Redis (E10) + global Cosmos DB
- Monitoring: Full observability stack (Grafana, custom dashboards, alerting)

**Baseline** (Maturity model for AI implementations)

## Sources and Verification

### Microsoft Learn Documentation

1. **Application design for AI workloads on Azure** - Multi-layer caching strategies
   https://learn.microsoft.com/en-us/azure/well-architected/ai/application-design#implement-multi-layer-caching-strategies
   *Confidence: Verified (2026-02)*
2. **Introduction to semantic cache** - Semantic caching concepts, context window requirements
   https://learn.microsoft.com/en-us/azure/cosmos-db/gen-ai/semantic-cache
   *Confidence: Verified (2026-02)*
3. **Enable semantic caching for LLM APIs in Azure API Management** - APIM semantic cache implementation
   https://learn.microsoft.com/en-us/azure/api-management/azure-openai-enable-semantic-caching
   *Confidence: Verified (2026-02)*
4. **Tips for better performance in Azure AI Search** - Index caching, performance optimization
   https://learn.microsoft.com/en-us/azure/search/search-performance-tips
   *Confidence: Verified (2026-02)*
5. **Azure Managed Redis architecture** - Flash tier workloads, caching strategies
   https://learn.microsoft.com/en-us/azure/redis/architecture#flash-optimized-tier
   *Confidence: Verified (2026-02)*
6. **Plan and manage costs of an Azure AI Search service** - Cost optimization, enrichment caching
   https://learn.microsoft.com/en-us/azure/search/search-sku-manage-costs#minimize-costs
   *Confidence: Verified (2026-02)*
7. **Data platform considerations for mission-critical workloads** - Azure Cache for Redis enterprise patterns
   https://learn.microsoft.com/en-us/azure/well-architected/mission-critical/mission-critical-data-platform#caching-for-hot-tier-data
   *Confidence: Verified (2026-02)*

### Code Samples

8. **RAG implementation with Azure AI Search** - Python RAG cache patterns
   https://learn.microsoft.com/en-us/azure/search/retrieval-augmented-generation-overview#content-retrieval-in-azure-ai-search
   *Confidence: Verified (code sample)*
9. **Azure Cache for Redis with Python** - Redis connection and caching code
   https://learn.microsoft.com/en-us/azure/redis/python-get-started#code-to-connect-to-a-redis-cache
   *Confidence: Verified (code sample)*

### Confidence Levels per Section

| Section | Confidence | Source |
|---------|-----------|--------|
| Multi-layer caching strategy | **Verified** | Microsoft Learn docs (1) |
| Semantic caching pattern | **Verified** | Microsoft Learn docs (2, 3) |
| Azure Cache for Redis configuration | **Verified** | Microsoft Learn docs (5, 7), code samples (9) |
| Azure API Management policies | **Verified** | Microsoft Learn docs (3) |
| Azure AI Search caching | **Verified** | Microsoft Learn docs (4, 6) |
| Cost estimates | **Baseline** | Azure pricing calculator (2026-02) |
| GDPR compliance patterns | **Baseline** | Industry best practices |
| Maturity model recommendations | **Baseline** | Architecture consulting experience |

---

**Total number of sources:** 9 unique Microsoft Learn URLs
**MCP calls:** 6 (4 docs_search + 2 docs_fetch + 1 code_sample_search)
**Last verified:** 2026-02-03

### Azure Managed Redis: Architecture (updated 2026-04)

Azure Managed Redis (based on Redis Enterprise) is recommended for AI workloads over Azure Cache for Redis (community edition):

| Property | Azure Cache for Redis | Azure Managed Redis |
|---------|----------------------|---------------------|
| Threading | Single-threaded | Multi-threaded (Redis Enterprise) |
| Architecture | Primary + replica (2 nodes) | Multiple shards per node, distributed primaries |
| Performance | Limited by the single thread | Near-linear scaling with vCPUs |
| Clustering | Optional | Always enabled (OSS, Enterprise, or Non-Clustered policy) |
| Active geo-replication | No | Yes |

**Cluster policies:**

- **OSS policy**: recommended for most workloads. The client connects directly to the shards, giving the lowest latency and best throughput
- **Enterprise policy**: single endpoint and backward compatible, but the single-node proxy can become a bottleneck. Required for RediSearch
- **Non-Clustered**: only for ≤25 GB, intended for migration from non-sharded environments

**Flash Optimized tier:** 20% RAM + 80% NVMe Flash. Best suited to read-heavy workloads with a subset of hot keys.