# RAG Caching and Performance Optimization

**Last updated:** 2026-04 | Verified: MCP 2026-04
**Status:** GA
**Category:** RAG Architecture & Semantic Search

---

## Introduction

Caching is a critical optimization strategy for RAG (Retrieval-Augmented Generation) applications and can dramatically reduce both latency and cost. In typical RAG scenarios, calls to LLMs are often the most expensive and time-consuming operation, especially when large amounts of context data and chat history are sent with every request. A well-designed caching strategy can reduce the number of LLM invocations by up to 90% in high-traffic scenarios with repetitive or semantically similar queries.

The multi-layer caching approach covers several levels of the RAG architecture: result caching (complete LLM responses), retrieval caching (knowledge fragments from vector search), embedding caching (precomputed vector representations), and semantic caching (semantically similar prompts). Each layer addresses a different aspect of performance and cost optimization.

The Microsoft stack offers several services suited to AI workloads: Azure Cache for Redis (traditional and semantic caching), Azure Cosmos DB (semantic cache with vector search), Azure AI Search (built-in caching of search results), and Azure API Management (semantic caching for LLM APIs). The right choice depends on cache type, scale requirements, and compliance requirements.

## Core components

### Multi-layer caching strategy

| Cache Layer | Purpose | Typical Hit Rate | Latency Impact |
|------------|--------|-----------------|----------------|
| **Result caching** | Cache complete LLM responses for identical or semantically similar queries | 30-60% (high-traffic) | -80% to -95% |
| **Retrieval caching** | Cache knowledge fragments from vector search | 40-70% | -50% to -70% |
| **Embedding caching** | Cache precomputed embeddings | 60-90% | -30% to -50% |
| **Model output caching** | Cache intermediate model outputs | 20-40% | -40% to -60% |

**Verified** (Microsoft Learn - Application design for AI workloads)

### Cache Key Components

Effective cache keys must include:
- **Tenant/User identity**: for multi-tenant security
- **Policy context**: RBAC and data access policies
- **Model version**: avoids stale responses after model updates
- **Prompt version**: tracks prompt engineering changes
- **Context window**: chat history for contextual relevance

**Verified** (Microsoft Learn - Multi-layer caching strategies)
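
The key components above can be combined into a single deterministic string. The following is a minimal sketch, not a prescribed format: the function name `build_cache_key`, the field labels, and the choice of SHA-256 for the context digest are all assumptions made for illustration.

```python
import hashlib
import json


def build_cache_key(tenant_id, user_id, policy_scope, model_version,
                    prompt_version, chat_history, prompt):
    """Compose a deterministic cache key from the listed components (hypothetical helper)."""
    # Hash the conversational context so the key stays short and carries no raw text.
    context = json.dumps({"history": chat_history, "prompt": prompt},
                         sort_keys=True, ensure_ascii=False)
    context_digest = hashlib.sha256(context.encode("utf-8")).hexdigest()
    return ":".join([
        f"tenant={tenant_id}",
        f"user={user_id}",
        f"policy={policy_scope}",
        f"model={model_version}",
        f"prompt_v={prompt_version}",
        f"ctx={context_digest}",
    ])
```

Because every component is part of the key, changing the model version or prompt version naturally produces new keys, so old entries simply stop being hit and expire through their TTL.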

### Time-to-Live (TTL) Policies

| Data Type | Recommended TTL | Rationale |
|-----------|--------------|-------------|
| Static content (documentation, policies) | 24-72 hours | Rarely changes |
| Dynamic content (dashboard data) | 5-30 minutes | Moderate freshness requirements |
| User-specific queries | 1-5 minutes | Privacy and freshness |
| Search results | 15-60 minutes | Balance between cost and freshness |

**Baseline** (Industry best practices)
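
One way to apply the table is a lookup that maps a data type to a TTL in seconds. This is an illustrative sketch: the dictionary name, the data type labels, and the `conservative` switch (which picks the lower bound of each range) are assumptions, not part of any Azure API.

```python
# TTL ranges from the table above, expressed in seconds as (low, high).
TTL_RANGES_SECONDS = {
    "static": (24 * 3600, 72 * 3600),
    "dynamic": (5 * 60, 30 * 60),
    "user_specific": (1 * 60, 5 * 60),
    "search_results": (15 * 60, 60 * 60),
}


def ttl_for(data_type, conservative=True):
    """Return a TTL in seconds; the conservative choice favors freshness."""
    low, high = TTL_RANGES_SECONDS[data_type]
    return low if conservative else high
```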

### Cache Invalidation Triggers

- **Data updates**: webhook-triggered invalidation when source data changes
- **Model changes**: invalidate on model deployment or retraining
- **Prompt modifications**: clear the cache when prompt templates change
- **Manual purge**: admin-triggered, for compliance or testing

**Verified** (Microsoft Learn - Caching strategies)
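
Trigger-based invalidation is easier when each cached entry records what it depends on. The sketch below uses an in-memory, tag-based cache as a stand-in for a real cache service; the class `TaggedCache` and the tag naming (`doc:42`, `model:v1`) are hypothetical.

```python
from collections import defaultdict


class TaggedCache:
    """In-memory cache where entries are tagged with their dependencies."""

    def __init__(self):
        self._values = {}
        self._keys_by_tag = defaultdict(set)

    def put(self, key, value, tags):
        self._values[key] = value
        for tag in tags:
            self._keys_by_tag[tag].add(key)

    def get(self, key):
        return self._values.get(key)

    def invalidate_tag(self, tag):
        """Drop every entry that depends on the tag; return how many were removed."""
        removed = 0
        for key in self._keys_by_tag.pop(tag, set()):
            if key in self._values:
                del self._values[key]
                removed += 1
        return removed
```

A webhook handler for a source document update would call `invalidate_tag("doc:<id>")`, while a model deployment would invalidate the corresponding model tag.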

## Architecture patterns

### 1. Semantic Caching (recommended for RAG)

**Description:** Uses vector similarity search over cached prompts to return responses for semantically similar queries, even when the text is not identical.

**How it works:**
1. The incoming prompt is vectorized with an embedding model
2. A vector search runs against the cached prompt vectors
3. Items with a similarity score above the threshold are returned
4. On a cache miss, the LLM generates a response, which is cached together with the vectorized prompt
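
The four steps can be sketched with an in-memory cache and cosine similarity. This is a simplified illustration, not a production design: `SemanticCache`, `answer`, the `embed` and `call_llm` callables, and the default threshold of 0.9 are all assumptions, and a real deployment would use a vector index instead of a linear scan.

```python
import math


class SemanticCache:
    """Toy semantic cache: linear scan over cached prompt vectors."""

    def __init__(self, embed, threshold=0.9):
        self._embed = embed          # callable: text -> list of floats (assumed)
        self._threshold = threshold  # cosine similarity, higher means stricter
        self._entries = []           # list of (vector, response) pairs

    @staticmethod
    def _cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
        return dot / norm if norm else 0.0

    def lookup(self, prompt):
        """Steps 1-3: vectorize, search, return the best match above the threshold."""
        vector = self._embed(prompt)
        best_score, best_response = 0.0, None
        for cached_vector, response in self._entries:
            score = self._cosine(vector, cached_vector)
            if score > best_score:
                best_score, best_response = score, response
        return best_response if best_score > self._threshold else None

    def store(self, prompt, response):
        self._entries.append((self._embed(prompt), response))


def answer(prompt, cache, call_llm):
    """Step 4: on a miss, call the LLM and cache the response with its vector."""
    cached = cache.lookup(prompt)
    if cached is not None:
        return cached
    response = call_llm(prompt)
    cache.store(prompt, response)
    return response
```

Note that this sketch embeds only the latest prompt; to respect the context window requirement, the text passed to `embed` would need to include the relevant chat history as well.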

**Advantages:**
- Higher cache hit rate than traditional key-value caching (30-60% vs 10-20%)
- Handles variation in user input (synonyms, paraphrasing)
- Drastically reduces LLM token consumption

**Disadvantages:**
- Requires an embedding model (extra latency of roughly 50-100 ms)
- More complex implementation
- Requires a vector-capable cache (Cosmos DB, Redis with RediSearch)

**Context Window Requirement:**
A semantic cache MUST operate within a context window. Without chat history, the cache can return contextually incorrect responses.

**Example:** A user asks "What is the largest lake in North America?" (cached: "Lake Superior") and then "What is the second largest?" Without a context window, the cache could return the wrong answer to another user who asks the same follow-up question in a different context.

**Verified** (Microsoft Learn - Semantic cache introduction)

### 2. Multi-tier Result Caching

**Description:** Combines an in-memory cache (Redis) with a persistent cache (Cosmos DB) to balance speed and durability.

**Architecture:**
```
User Query → L1: Redis (in-memory, <5ms) → L2: Cosmos DB (persistent, <50ms) → LLM (fallback, >2s)
```
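
The lookup order in the diagram can be sketched as a read-through function. Plain dictionaries stand in for Redis (L1) and Cosmos DB (L2) here, and `tiered_get` and `generate` are hypothetical names; a real implementation would add TTLs and error handling per tier.

```python
def tiered_get(key, l1, l2, generate):
    """Look up key in L1, then L2, then fall back to generating the value.

    Returns (value, source) where source is "l1", "l2", or "llm".
    """
    if key in l1:
        return l1[key], "l1"
    if key in l2:
        l1[key] = l2[key]  # promote the warm entry to the hot tier
        return l2[key], "l2"
    value = generate(key)  # slow path, for example an LLM call
    l2[key] = value        # persist first so the entry survives an L1 failure
    l1[key] = value
    return value, "llm"
```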

**Advantages:**
- Sub-5ms response time for hot data
- Data durability when the cache fails
- Cost-effective (Redis for hot, Cosmos for warm data)

**Disadvantages:**
- Increased complexity in cache management
- Potential for stale data across tiers
- Higher infrastructure cost

**Baseline** (Common enterprise pattern)

### 3. Retrieval Snippet Caching

**Description:** Cache frequently retrieved knowledge fragments from Azure AI Search or vector databases to avoid repeated database queries.

**Implementation:**
- Cache top-K search results per query pattern
- Key: hash(query + filters + top-K)
- TTL: 15-60 minutes (depending on data freshness requirements)
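
A possible shape for the `hash(query + filters + top-K)` key is shown below. The function name, the whitespace and case normalization, and the `retrieval:` prefix are assumptions; the point is that the key must be stable across processes and insensitive to filter ordering.

```python
import hashlib
import json


def retrieval_cache_key(query, filters, top_k):
    """Build a stable key for cached top-K search results (hypothetical helper)."""
    normalized = {
        "query": " ".join(query.lower().split()),  # collapse case and whitespace
        "filters": filters,
        "top_k": top_k,
    }
    # sort_keys makes the serialization independent of dict insertion order.
    payload = json.dumps(normalized, sort_keys=True, ensure_ascii=False)
    return "retrieval:" + hashlib.sha256(payload.encode("utf-8")).hexdigest()
```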

**Advantages:**
- Reduces Azure AI Search query costs (50-70% reduction)
- Lower latency for grounding data retrieval
- Less load on the vector index

**Disadvantages:**
- Stale grounding data if source documents are updated
- Cache size can grow quickly with many unique queries

**Verified** (Microsoft Learn - Retrieval caching)

## Decision guidance

### When to use which caching strategy

| Scenario | Recommended Strategy | Rationale |
|----------|-------------------|-----------|
| Chatbot with repetitive FAQs | Semantic caching (Redis + RediSearch) | High hit rate, semantic matching |
| Document Q&A with many unique queries | Retrieval snippet caching | Cost-effective, focus on grounding data |
| Real-time dashboard with AI insights | Multi-tier caching (Redis L1 + Cosmos L2) | Speed + durability |
| Compliance-sensitive applications | User-scoped semantic caching | Privacy protection, audit trail |

**Baseline** (Architecture decision framework)

### Common mistakes to avoid

| Anti-pattern | Problem | Solution |
|-------------|---------|---------|
| **Caching user-private data globally** | Privacy violation, data leakage | Scope cache keys by user/tenant identity |
| **No TTL policy** | Runaway cache growth, stale data | Implement TTL based on data sensitivity |
| **Mis-tuned similarity threshold** | Too strict gives a low cache hit rate; too loose returns wrong cached answers | For the APIM `score-threshold` (smaller values require a closer match), start at 0.15-0.3 and tune based on metrics |
| **Caching without a context window** | Contextually incorrect responses | Vectorize chat history + latest prompt |
| **No invalidation strategy** | Stale responses after data updates | Implement webhook-based invalidation |

**Verified** (Microsoft Learn - Caching risks)

### Red flags

- Cache hit rate < 20% after tuning → Reconsider the cache strategy
- Cache size grows >10GB/day → Implement aggressive TTL or pruning
- Latency increases after caching → Check embedding model overhead
- User complaints about stale data → Reduce TTL or implement invalidation

**Baseline** (Performance monitoring thresholds)
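
The measurable thresholds above can be evaluated from basic counters. This is a rough sketch with hypothetical names (`red_flags`, the flag labels, and the choice of p95 latency as the comparison metric); user complaints about stale data are not covered because they are not a metric.

```python
def red_flags(hits, misses, cache_growth_gb_per_day, p95_ms_before, p95_ms_after):
    """Return (hit_rate, flags) for the measurable red flags listed above."""
    flags = []
    total = hits + misses
    hit_rate = hits / total if total else 0.0
    if hit_rate < 0.20:
        flags.append("low_hit_rate")
    if cache_growth_gb_per_day > 10:
        flags.append("runaway_growth")
    if p95_ms_after > p95_ms_before:
        flags.append("latency_regression")
    return hit_rate, flags
```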

## Integration with the Microsoft stack

### Azure Cache for Redis

**Use Cases:** Traditional result caching, high-throughput scenarios

**Tiers:**
- **Premium tier** — 99.9% SLA, up to 120GB per shard
- **Enterprise tier** — 99.99% SLA, active-active geo-replication, Flash storage support
- **Enterprise Flash tier** — Up to 13TB cache size, 20% RAM + 80% NVMe Flash

**Workloads suited for Flash tier:**
- Read-heavy (high read/write ratio)
- Hot/cold access patterns (frequently accessed subset)
- Large values (keys in RAM, values in Flash)

**Not suited for Flash tier:**
- Write-heavy workloads
- Uniform data access patterns
- Long key names with small values

**Configuration for AI workloads:**
```python
import redis
from redis_entraid.cred_provider import create_from_default_azure_credential

# Authenticate with Microsoft Entra ID instead of access keys.
credential_provider = create_from_default_azure_credential(
    ("https://redis.azure.com/.default",),
)

r = redis.Redis(
    # Azure Cache for Redis endpoint (TLS port 6380). Azure Managed Redis
    # instead uses <name>.<region>.redis.azure.net on port 10000.
    host="<redis-host>.redis.cache.windows.net",
    port=6380,
    ssl=True,
    decode_responses=True,
    credential_provider=credential_provider,
)

# Store a cached item with a TTL
r.setex("cache_key", 3600, "cached_value")  # 1 hour TTL
```

**Verified** (Microsoft Learn - Azure Managed Redis architecture, code samples)
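
For result caching, the connection is typically wrapped in a cache-aside helper. The sketch below assumes a client exposing redis-py's `get` and `setex` methods; `get_or_generate` and the `generate` callable are hypothetical names, and serialization of non-string values is left out.

```python
def get_or_generate(client, key, generate, ttl_seconds=3600):
    """Cache-aside lookup: return the cached value or compute and store it."""
    cached = client.get(key)
    if cached is not None:
        return cached
    value = generate()                      # for example, the LLM call
    client.setex(key, ttl_seconds, value)   # store with an expiry
    return value
```

Passing the work as a callable keeps the slow path lazy: `generate` only runs on a cache miss.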

### Azure API Management - Semantic Caching

**Use Case:** Semantic caching for LLM APIs (Azure OpenAI, Model Inference API)

**Prerequisites:**
- Azure Managed Redis with the **RediSearch module** enabled
- Embeddings API deployment (for vectorization)
- Chat Completion API deployment (for user requests)

**Policy Configuration:**

Inbound (cache lookup):
```xml
<azure-openai-semantic-cache-lookup
    score-threshold="0.15"
    embeddings-backend-id="embeddings-backend"
    embeddings-backend-auth="system-assigned"
    ignore-system-messages="true"
    max-message-count="10">
    <vary-by>@(context.Subscription.Id)</vary-by>
</azure-openai-semantic-cache-lookup>
<rate-limit calls="10" renewal-period="60" />
```

Outbound (cache store):
```xml
<azure-openai-semantic-cache-store duration="60" />
```

**Score Threshold Tuning** (for this policy, smaller values require a closer semantic match):
- 0.1-0.2 → Strict matching, lower hit rate, higher relevance
- 0.3-0.5 → Balanced, medium hit rate, good relevance
- 0.6-0.8 → Liberal matching, high hit rate, somewhat lower relevance

**Verified** (Microsoft Learn - Enable semantic caching for LLM APIs)

### Azure Cosmos DB for NoSQL

**Use Case:** Semantic cache with built-in vector search, persistent storage

**Implementation Pattern:**

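```python
from azure.cosmos import CosmosClient

# Placeholders: replace with your own account and resource names.
cosmos_uri = "https://<account>.documents.azure.com:443/"
cosmos_key = "<cosmos-key>"
cosmos_database_name = "<database>"
cosmos_container_name = "<container>"

# Set up the Cosmos DB vector store
cosmos_client = CosmosClient(url=cosmos_uri, credential=cosmos_key)
database = cosmos_client.get_database_client(cosmos_database_name)
container = database.get_container_client(cosmos_container_name)


def query_cache(prompt_vector, similarity_threshold=0.95, top_k=5):
    """Return cached items whose prompt vector is similar to prompt_vector.

    Assumes the container's vector policy uses the cosine distance function,
    so higher VectorDistance scores mean more similar prompts. The default
    threshold is a starting point to tune, not a recommended value.
    """
    # ORDER BY VectorDistance returns the most similar items first.
    query = """
    SELECT TOP @topK c.id, c.prompt, c.completion,
           VectorDistance(c.promptVector, @promptVector) AS similarity
    FROM c
    WHERE VectorDistance(c.promptVector, @promptVector) > @threshold
    ORDER BY VectorDistance(c.promptVector, @promptVector)
    """
    return list(container.query_items(
        query=query,
        parameters=[
            {"name": "@topK", "value": top_k},
            {"name": "@promptVector", "value": prompt_vector},
            {"name": "@threshold", "value": similarity_threshold},
        ],
        enable_cross_partition_query=True,
    ))
```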

**Advantages:**
- Globally distributed, multi-region writes
- Automatic indexing of vectors
- 99.999% SLA with a multi-region setup
- Built-in TTL support

**Verified** (Microsoft Learn - Semantic cache with Cosmos DB, code samples)
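
The write side of the cache, used after a miss, can be sketched as follows. It assumes a container object with the SDK's `upsert_item(body=...)` method, a container that has time-to-live enabled so the per-item `ttl` property takes effect, and the same `promptVector` field as the query above; `store_in_cache` is a hypothetical helper name.

```python
import uuid


def store_in_cache(container, prompt, prompt_vector, completion, ttl_seconds=3600):
    """Write a prompt/completion pair to the semantic cache with a per-item TTL."""
    item = {
        "id": str(uuid.uuid4()),
        "prompt": prompt,
        "promptVector": prompt_vector,
        "completion": completion,
        "ttl": ttl_seconds,  # seconds; only honored if TTL is enabled on the container
    }
    container.upsert_item(body=item)
    return item["id"]
```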

### Azure AI Search - Built-in Caching

**Automatic Caching Behavior:**
Azure AI Search automatically caches content after the first query, which makes subsequent searches faster.

**Optimization Tips:**
- Reduce index size → faster caching, smaller memory footprint
- Selective field attribution → index only the fields you need
- Avoid over-attribution (filterable, sortable, facetable) → reduces storage 4x

**Performance Factors:**
- Smaller indexes → more content in cache → lower query latency
- Higher tiers (S2, S3) → more memory → larger cache capacity
- Partitions → parallel processing for slow queries

**Verified** (Microsoft Learn - Azure AI Search performance tips)

## Public sector (Norway)

### GDPR and Privacy

**Cache Key Scoping (MANDATORY):**
- Never cache user-private content without proper scoping by user identity
- Implement tenant/user isolation in cache keys
- Keep an audit trail for cached personal data

**Data Minimization:**
- Cache only the minimum data needed to answer the query
- TTL for personal data must not exceed what the purpose limitation allows
- Automatic deletion on user request (GDPR Article 17)

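**Example: GDPR-compliant cache key:**
```python
import hashlib
# hash() is salted per process, so use a stable digest for shared cache keys.
prompt_hash = hashlib.sha256(prompt.encode("utf-8")).hexdigest()
cache_key = f"user:{user_id}:tenant:{tenant_id}:query_hash:{prompt_hash}"
# TTL: 1 hour (minimal for chat session)
```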

**Baseline** (GDPR compliance patterns)

### Compliance requirements

| Requirement | Implementation |
|------|----------------|
| **Data portability (GDPR Art. 20)** | Export cached user data on request |
| **Right to erasure (GDPR Art. 17)** | Implement cache purge by user_id |
| **Legal basis for processing** | Document legitimate interest for caching |
| **Reporting to Datatilsynet** | Audit log for cache access/invalidation |

**Baseline** (Norwegian public sector compliance)
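
A purge by `user_id` can be sketched with redis-py's `scan_iter` and `delete`, assuming keys follow a `user:{user_id}:...` prefix. The function name and batch size are illustrative; on a clustered Redis, multi-key deletes may need to be issued per key or per hash slot, and a `user_id` containing glob characters would need escaping.

```python
def purge_user_cache(client, user_id, batch_size=500):
    """Delete all cache entries scoped to one user; return the number removed."""
    pattern = f"user:{user_id}:*"
    deleted = 0
    batch = []
    # SCAN iterates incrementally, avoiding the blocking behavior of KEYS.
    for key in client.scan_iter(match=pattern, count=batch_size):
        batch.append(key)
        if len(batch) >= batch_size:
            deleted += client.delete(*batch)
            batch = []
    if batch:
        deleted += client.delete(*batch)
    return deleted
```

The returned count can be written to the audit log as evidence that the erasure request was carried out.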

### Security

**Encryption:**
- **At rest:** Azure Cache for Redis (Premium/Enterprise) — automatic encryption
- **In transit:** TLS 1.2+ mandatory for all cache connections
- **Key management:** Azure Key Vault for cache access keys

**Access Control:**
- Microsoft Entra ID authentication for Redis
- Role-based access control (RBAC) for cache management
- Network isolation via Private Endpoints

**Verified** (Microsoft Learn - Redis security)

## Cost and licensing

### Azure Cache for Redis Pricing (Norway East - 2026)

| Tier | Size | Capabilities | Monthly cost (NOK) | Best For |
|------|------|-----------|---------------------|----------|
| Basic C0 | 250 MB | N/A (no SLA) | ~400 | Dev/Test |
| Standard C1 | 1 GB | 2 nodes (primary + replica), 99.9% SLA | ~1,200 | Small production |
| Premium P1 | 6 GB | Clustering, geo-replication | ~7,000 | Enterprise |
| Enterprise E10 | 12 GB | Active-active, 99.99% SLA | ~25,000 | Mission-critical |
| Enterprise Flash F300 | 345 GB | 20% RAM + 80% Flash | ~60,000 | Large-scale AI |

**Cost Optimization Tips:**
1. **Start with Premium P1** for production RAG (best price/performance)
2. **Scale out vs scale up**: add replicas before moving to a higher tier
3. **Use Flash tier for large caches** (>100GB): 5x lower cost per GB vs Enterprise
4. **Monitor cache hit rate**: a hit rate below 20% indicates an ineffective caching strategy
5. **Implement TTL aggressively**: a smaller cache fits a lower tier

**Verified** (Microsoft Learn - Plan and manage costs)

### Azure Cosmos DB Pricing

**Request Units (RU/s) for Semantic Cache:**
- Vector query (1KB): ~10-50 RU
- Write (cache store): ~5-10 RU
- Storage: ~2.5 NOK/GB/month

**Cost Example (10,000 queries/day):**
- 10,000 queries × 30 RU avg = 300,000 RU/day = 3.5 RU/s avg
- Provisioned: 100 RU/s (for burst) = ~600 NOK/month
- Storage (10GB cache): ~25 NOK/month
- **Total: ~625 NOK/month**

**Baseline** (Cosmos DB pricing calculator estimates)
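
The arithmetic in the cost example can be reproduced with two small functions. The function names are hypothetical, and the 600 NOK throughput figure is taken directly from the example rather than computed from a price list.

```python
SECONDS_PER_DAY = 86_400


def average_ru_per_second(queries_per_day, ru_per_query):
    """Average request unit consumption, spread evenly over the day."""
    return queries_per_day * ru_per_query / SECONDS_PER_DAY


def monthly_total_nok(throughput_nok, storage_gb, storage_nok_per_gb):
    """Monthly cost as provisioned throughput plus storage."""
    return throughput_nok + storage_gb * storage_nok_per_gb


avg_ru = average_ru_per_second(10_000, 30)   # inputs from the example above
total = monthly_total_nok(600, 10, 2.5)      # throughput figure from the example
```

The average hides peaks, which is why the example provisions well above the average rate.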

### TCO comparison

| Scenario | Without Caching | With Semantic Caching (Redis) | Savings |
|----------|-----------------|-------------------------------|---------|
| 100K LLM queries/day (GPT-4) | ~450,000 NOK/month | ~225,000 NOK/month + 7,000 (Redis) | 48% |
| 10K queries/day (GPT-3.5) | ~45,000 NOK/month | ~22,500 NOK/month + 7,000 (Redis) | 34% |

**Assumptions:** 50% cache hit rate, avg 2000 tokens/query

**Baseline** (TCO analysis based on Azure pricing)
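
A comparison of this kind can be modeled with a simple calculation. The sketch assumes that LLM spend scales linearly with the share of queries that miss the cache and that the cache adds a flat monthly cost; it ignores embedding and lookup costs, and the function names are hypothetical.

```python
def monthly_cost_with_cache(baseline_llm_cost, hit_rate, cache_cost):
    """LLM cost for cache misses plus the flat cost of the cache itself."""
    return baseline_llm_cost * (1 - hit_rate) + cache_cost


def savings_fraction(baseline_llm_cost, hit_rate, cache_cost):
    """Share of the baseline cost saved by adding the cache."""
    with_cache = monthly_cost_with_cache(baseline_llm_cost, hit_rate, cache_cost)
    return (baseline_llm_cost - with_cache) / baseline_llm_cost
```

The flat cache cost matters more at low volume: the same hit rate yields a smaller relative saving when the baseline LLM spend is small.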

## For the architect (Cosmo)

### Questions to ask the customer

1. **Traffic pattern:** How many LLM queries per day or hour do you expect? What is peak vs average load?
2. **Query similarity:** Are there many repetitive or semantically similar questions? (Indicates semantic cache ROI)
3. **Data freshness:** How often does the underlying data change? What staleness window is acceptable?
4. **Privacy requirements:** Do you handle personal data? Is user-scoped caching needed?
5. **Compliance:** Which regulatory frameworks apply (GDPR, Schrems II, Datatilsynet)?
6. **Budget:** What is the total budget for LLM + caching infrastructure?
7. **Latency SLA:** What is the maximum acceptable response time (p50, p95, p99)?
8. **Global reach:** Is multi-region caching needed for latency or compliance?

### Pitfalls to avoid

| Pitfall | Impact | Mitigation |
|-----------|--------|------------|
| **Caching without a context window** | Contextually incorrect responses → user frustration | Vectorize chat history + prompt |
| **Global caching of personal data** | GDPR violation, potential fines | User-scoped keys, TTL enforcement |
| **Mis-tuned similarity threshold** | Low hit rate or wrong cached answers | For the APIM `score-threshold`, start at 0.15 and tune on metrics |
| **No invalidation strategy** | Stale data → incorrect LLM responses | Webhook-based invalidation |
| **Undersized cache tier** | High eviction rate, low hit rate | Monitor evictions, scale proactively |
| **Ignoring embedding overhead** | Latency increase vs direct LLM call | Batch embeddings, use async patterns |

### Recommendations per maturity level

**Level 1 - Pilot (0-6 months of RAG experience):**
- Start with **Azure API Management semantic caching** (managed, low-complexity)
- Use case: FAQ chatbot with <1000 queries/day
- Tier: Standard Redis (C1) for learning at low cost
- Monitoring: Basic hit rate metrics in APIM

**Level 2 - Production (6-18 months):**
- Implement **multi-layer caching** (Redis L1 + Cosmos DB L2)
- Use case: Customer support RAG with 10K-100K queries/day
- Tier: Premium Redis (P1) + Cosmos DB autoscale
- Monitoring: Application Insights with custom metrics (hit rate, latency, cost per query)

**Level 3 - Enterprise (18+ months):**
- **Hybrid semantic + retrieval caching** with advanced invalidation
- Use case: Multi-tenant SaaS RAG platform, 100K+ queries/day
- Tier: Enterprise Redis (E10) + global Cosmos DB
- Monitoring: Full observability stack (Grafana, custom dashboards, alerting)

**Baseline** (Maturity model for AI implementations)

## Sources and verification

### Microsoft Learn Documentation

1. **Application design for AI workloads on Azure** - Multi-layer caching strategies
   https://learn.microsoft.com/en-us/azure/well-architected/ai/application-design#implement-multi-layer-caching-strategies
   *Confidence: Verified (2026-02)*

2. **Introduction to semantic cache** - Semantic caching concepts, context window requirements
   https://learn.microsoft.com/en-us/azure/cosmos-db/gen-ai/semantic-cache
   *Confidence: Verified (2026-02)*

3. **Enable semantic caching for LLM APIs in Azure API Management** - APIM semantic cache implementation
   https://learn.microsoft.com/en-us/azure/api-management/azure-openai-enable-semantic-caching
   *Confidence: Verified (2026-02)*

4. **Tips for better performance in Azure AI Search** - Index caching, performance optimization
   https://learn.microsoft.com/en-us/azure/search/search-performance-tips
   *Confidence: Verified (2026-02)*

5. **Azure Managed Redis architecture** - Flash tier workloads, caching strategies
   https://learn.microsoft.com/en-us/azure/redis/architecture#flash-optimized-tier
   *Confidence: Verified (2026-02)*

6. **Plan and manage costs of an Azure AI Search service** - Cost optimization, enrichment caching
   https://learn.microsoft.com/en-us/azure/search/search-sku-manage-costs#minimize-costs
   *Confidence: Verified (2026-02)*

7. **Data platform considerations for mission-critical workloads** - Azure Cache for Redis enterprise patterns
   https://learn.microsoft.com/en-us/azure/well-architected/mission-critical/mission-critical-data-platform#caching-for-hot-tier-data
   *Confidence: Verified (2026-02)*

### Code Samples

8. **RAG implementation with Azure AI Search** - Python RAG cache patterns
   https://learn.microsoft.com/en-us/azure/search/retrieval-augmented-generation-overview#content-retrieval-in-azure-ai-search
   *Confidence: Verified (code sample)*

9. **Azure Cache for Redis with Python** - Redis connection and caching code
   https://learn.microsoft.com/en-us/azure/redis/python-get-started#code-to-connect-to-a-redis-cache
   *Confidence: Verified (code sample)*

### Confidence Levels per Section

| Section | Confidence | Source |
|---------|-----------|--------|
| Multi-layer caching strategy | **Verified** | Microsoft Learn docs (1) |
| Semantic caching pattern | **Verified** | Microsoft Learn docs (2, 3) |
| Azure Cache for Redis configuration | **Verified** | Microsoft Learn docs (5, 7), code samples (9) |
| Azure API Management policies | **Verified** | Microsoft Learn docs (3) |
| Azure AI Search caching | **Verified** | Microsoft Learn docs (4, 6) |
| Cost estimates | **Baseline** | Azure pricing calculator (2026-02) |
| GDPR compliance patterns | **Baseline** | Industry best practices |
| Maturity model recommendations | **Baseline** | Architecture consulting experience |

---

**Total number of sources:** 9 unique Microsoft Learn URLs
**MCP calls:** 7 (4 docs_search + 2 docs_fetch + 1 code_sample_search)
**Last verified:** 2026-02-03


### Azure Managed Redis: Architecture (updated 2026-04)

Azure Managed Redis (based on Redis Enterprise) is recommended for AI workloads over Azure Cache for Redis (community edition):

| Property | Azure Cache for Redis | Azure Managed Redis |
|---------|----------------------|---------------------|
| Threading | Single-threaded | Multi-threaded (Redis Enterprise) |
| Architecture | Primary + replica (2 nodes) | Multiple shards per node, distributed primaries |
| Performance | Limited by the single thread | Near-linear scaling with vCPUs |
| Clustering | Optional | Always enabled (OSS, Enterprise, or Non-Clustered policy) |
| Active geo-replication | Enterprise tiers only | Yes |

**Cluster policies:**
- **OSS policy**: recommended for most workloads. The client connects directly to the shards, giving the lowest latency and best throughput
- **Enterprise policy**: single endpoint and backward compatible, but the single-node proxy can become a bottleneck. Required for RediSearch
- **Non-Clustered**: only for caches ≤25 GB, intended for migration from non-sharded environments

**Flash Optimized tier:** 20% RAM + 80% NVMe Flash. Best suited to read-heavy workloads where a subset of keys is hot.