LettuceCache
Context-aware semantic cache for LLMs. Stops redundant API calls without false hits - because the same question means different things in different conversations.
The Problem
Traditional caching matches on exact text. Semantic caching matches on meaning. Both fail for context-dependent queries:
User A - "What is the cancellation policy?" (hotel booking)
User B - "What is the cancellation policy?" (gym membership)
A naive semantic cache serves User B the hotel answer - a false hit worse than a miss.
| Approach | Exact match | Semantic match | Context-aware |
|---|---|---|---|
| Traditional KV cache | ✓ | ✗ | ✗ |
| Semantic cache (embedding only) | ✗ | ✓ | ✗ |
| LettuceCache | ✓ | ✓ | ✓ |
Key Numbers
| Metric | Value |
|---|---|
| L1 cache latency (Redis exact match) | 1–3 ms |
| L2 embedding latency (Python sidecar, CPU) | 20–50 ms |
| L2 FAISS search latency | 1–3 ms |
| End-to-end L1 hit | 1–3 ms |
| End-to-end L2 hit | 25–60 ms |
| LLM call baseline | 500–2000 ms |
| Validation threshold | 0.85 (configurable) |
| Embedding model | all-MiniLM-L6-v2 (384 dims) |
| TurboQuant compression | 7.8× (1536B → 196B, d=384) |
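Given the latencies above, expected per-query latency is a weighted average over the three tiers. A small sketch (the 30%/20% hit-rate split is purely illustrative, not a measured figure):

```python
def expected_latency_ms(l1_rate, l2_rate, l1_ms=2, l2_ms=45, llm_ms=1000):
    """Weighted average of the latency tiers; rates are fractions of traffic."""
    miss_rate = 1.0 - l1_rate - l2_rate
    return l1_rate * l1_ms + l2_rate * l2_ms + miss_rate * llm_ms

# e.g. 30% L1 hits, 20% L2 hits, 50% LLM fallbacks
avg = expected_latency_ms(0.3, 0.2)
```

Even modest hit rates shift the average well below the raw LLM baseline.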
Architecture
Every query goes through two cache checks before hitting the LLM. All reads are non-blocking; all writes are async.
Services
C++ Orchestrator
Port 8080 · cpp-httplib · all components wired at construction · shared_mutex on FAISS reads
Python Sidecar
Port 8001 · FastAPI · all-MiniLM-L6-v2 · 384-dim L2-normalised · circuit breaker on C++ side
Redis L1
Port 6379 · SETEX 3600s · RDB persistence · key: lc:l1:{sha256}
FAISS L2
In-process · IVF+PQ index · metadata sidecar (.meta.json) survives restarts
All components are owned via unique_ptr and constructed in dependency order. Nothing else uses new - all cross-component communication is via references.
- TurboQuantizer is created first (if enabled) - FaissVectorStore and ValidationService both receive a raw pointer to it.
- IntelligentAdmissionPolicy receives a reference to FaissVectorStore (for novelty search).
- QueryOrchestrator receives a reference to the policy to record cache hits for adaptive threshold updates.
- Destruction order: svr_ → builder_ → everything else (FAISS persists in its destructor).
Thread safety:
- FaissVectorStore: std::shared_mutex - concurrent search() calls take the shared lock; add()/remove()/persist() take the exclusive lock.
- CacheBuilderWorker: single background thread with a std::condition_variable queue. enqueue() is the only method called from the hot path - it acquires a brief mutex, pushes, and notifies.
- EmbeddingClient: std::mutex curl_mutex_ around the persistent CURL handle.
- RedisCacheAdapter: single redisContext* with no mutex - open bug. Concurrent requests race; a connection pool or per-call mutex is needed.
- IntelligentAdmissionPolicy: two separate mutexes - freq_mutex_ and domain_mutex_ - to avoid contention between frequency updates and domain stats reads.
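The CacheBuilderWorker pattern - a single background thread draining a condition-variable queue, with a cheap non-blocking enqueue on the hot path - can be sketched in Python (names and structure are illustrative, not the C++ API):

```python
import threading
from collections import deque

class CacheBuilderWorker:
    """Background worker: enqueue() is cheap; processing happens off the hot path."""
    def __init__(self, process_fn):
        self.process_fn = process_fn
        self.queue = deque()
        self.cv = threading.Condition()
        self.stopping = False
        self.thread = threading.Thread(target=self._run, daemon=True)
        self.thread.start()

    def enqueue(self, entry):
        # Hot path: brief lock, push, notify - never blocks on processing.
        with self.cv:
            self.queue.append(entry)
            self.cv.notify()

    def _run(self):
        while True:
            with self.cv:
                while not self.queue and not self.stopping:
                    self.cv.wait()
                if self.stopping and not self.queue:
                    return
                entry = self.queue.popleft()
            # Processing (admission check, templatize, write) runs outside the lock.
            self.process_fn(entry)

    def stop(self):
        # Drains remaining entries, then joins the thread.
        with self.cv:
            self.stopping = True
            self.cv.notify()
        self.thread.join()
```

The key property mirrors the C++ design: the caller's cost is one brief mutex acquisition, regardless of how slow processing is.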
How It Works
Every request flows through a deterministic pipeline. Here is the exact sequence:
Client sends POST /query
Send query, context[] (prior turns), domain, and optionally user_id / correlation_id.
ContextBuilder computes signature CPU only
Pulls the first 3 meaningful words from the query as intent. Hashes user_id to a 16-char token (never stored raw). Sorts context turns. Combines into SHA-256(intent:domain:scope:sorted_context).
L1 Redis lookup 1–3ms
Looks up lc:l1:{sig_hash}. Hit returns instantly at confidence 1.0. Miss moves to the embedding step.
Embed query via Python sidecar 20–50ms (CPU)
Calls :8001/embed, gets back 384 floats (L2-normalised). If the sidecar is down, a circuit breaker kicks in and falls through to the LLM instead of hanging.
FAISS search + ValidationService 1–3ms
Returns top-5 candidates. Each scored: 0.60×cosine + 0.25×ctx_match + 0.15×domain_match. First candidate with score ≥ 0.85 wins. On hit: backfills L1 with the rendered response (slot placeholders filled).
LLM fallback + async cache build 500ms+
Calls OpenAI gpt-4o-mini. Response returned to client immediately. Entry enqueued to CacheBuilderWorker (background thread) - admission check, templatization, FAISS+Redis write. Client never waits for this.
Privacy by design. user_id is hashed to a 16-char token and never stored raw. Query text is never persisted - only the 384-dim embedding and a templatized response with high-entropy tokens (UUIDs, dates, numbers, proper nouns) replaced by {{SLOT_N}} placeholders.
Writes never block reads. The CacheBuilderWorker runs on a dedicated background thread with a std::condition_variable queue. The orchestrator calls enqueue() and returns the HTTP response immediately.
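The read path above can be condensed into a sketch. Stub callables stand in for Redis, FAISS search, and the LLM; function names and the simplified signature hash are illustrative, not the C++ implementation:

```python
import hashlib

def compute_signature(intent, domain, scope, context):
    # Context turns are sorted so turn order never changes the hash.
    canon = ":".join([intent, domain, scope] + sorted(context))
    return hashlib.sha256(canon.encode()).hexdigest()

def handle_query(query, context, domain, l1, l2_search, call_llm, enqueue_build,
                 threshold=0.85):
    sig = compute_signature(query.lower(), domain, "anon", context)
    # L1: exact signature match (~1-3 ms), confidence 1.0
    if sig in l1:
        return {"answer": l1[sig], "cache_hit": True, "confidence": 1.0}
    # L2: embed + FAISS + validation (~25-60 ms)
    for candidate, score in l2_search(query, context, domain):
        if score >= threshold:
            l1[sig] = candidate  # backfill L1
            return {"answer": candidate, "cache_hit": True, "confidence": score}
    # Fallback: LLM (~500-2000 ms); cache build is async - client never waits
    answer = call_llm(query)
    enqueue_build(sig, query, answer)
    return {"answer": answer, "cache_hit": False, "confidence": 0.0}
```

Note the asymmetry: reads are synchronous but cheap, while the expensive cache build is only ever enqueued.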
Step 1 - intent extraction (ContextBuilder.cpp:extractIntent()): lowercase → strip non-alphanumeric → skip 34 stopwords → take first 3 tokens → join with _.
Step 2 - user anonymisation: SHA-256("user:" + user_id)[:16] - never stores raw identity.
Step 3 - context canonicalisation: std::sort(context_turns) - reversed turn order produces the same hash.
Step 4 - signature: SHA-256(intent:domain:scope:sorted_context), stored as context_signature on each FAISS entry.
Tokenise response → for each token, classify:
- isNumeric: regex ^-?\d+(\.\d+)?%?$
- looksLikeDate: regex for YYYY-MM-DD / DD/MM/YYYY patterns
- looksLikeUUID: 8-4-4-4-12 hex pattern
- looksLikeProperNoun: starts uppercase, rest lowercase, length > 3, not in common-word blocklist
Matched tokens are replaced with {{SLOT_N}}; slot values are stored in Redis at lc:slots:{entry_id}.
On L2 serve (QueryOrchestrator.cpp): read lc:slots:{entry_id} → parse JSON array → Templatizer::render(template, slot_values) fills the placeholders back in. Falls back to the raw template if slots have expired.
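The classify-then-slot cycle can be sketched with simple regexes. This is a simplified illustration - the real classifier is C++ and its proper-noun blocklist is larger than the stand-in here:

```python
import re

NUMERIC = re.compile(r"^-?\d+(\.\d+)?%?$")
DATE = re.compile(r"^(\d{4}-\d{2}-\d{2}|\d{2}/\d{2}/\d{4})$")
UUID = re.compile(r"^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$", re.I)
COMMON = {"The", "This", "That", "Your"}  # illustrative blocklist, not the real one

def looks_like_proper_noun(tok):
    return tok[:1].isupper() and tok[1:].islower() and len(tok) > 3 and tok not in COMMON

def templatize(response):
    """Replace high-entropy tokens with {{SLOT_N}}; return (template, slot_values)."""
    out, slots = [], []
    for tok in response.split():
        if NUMERIC.match(tok) or DATE.match(tok) or UUID.match(tok) or looks_like_proper_noun(tok):
            out.append("{{SLOT_%d}}" % len(slots))
            slots.append(tok)
        else:
            out.append(tok)
    return " ".join(out), slots

def render(template, slots):
    # Serve-time path: fill placeholders back in from the stored slot values.
    for i, v in enumerate(slots):
        template = template.replace("{{SLOT_%d}}" % i, v)
    return template
```

Round-tripping templatize → render reproduces the original response, which is exactly what makes a single cached template reusable across surface variations.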
Validation Scoring
For each FAISS candidate, a composite score is computed. First one at or above 0.85 gets returned as a cache hit.
Worked examples (context and domain terms only - the cosine term depends on the embeddings):
- Same context, same domain ("How do I cancel?" → "Can I cancel?", a paraphrase): ctx 0.25×1.0 = 0.250, domain 0.15×1.0 = 0.150.
- Different context (hotel vs gym), near-perfect text match: ctx 0.25×0.0 = 0.000, domain 0.15×0.0 = 0.000.
- Related query, same context ("Cancel order" → "Refund process"): ctx 0.25×1.0 = 0.250, domain 0.15×1.0 = 0.150.
TurboQuant-corrected scoring. With ENABLE_TURBO_QUANT=1, the cosine term uses the unbiased inner-product estimator: E[TQ_ip(y, encode(x))] = ⟨y, x⟩. This is critical - with 1-bit MSE quantization alone, every cosine score would be scaled by 2/π ≈ 0.64, making 0.85 unreachable. TurboQuant_prod eliminates this bias entirely.
Two paths for cosine similarity:
- TurboQuant path: if tq_ != nullptr and candidate.tq_codes is non-empty → calls tq_->inner_product(query_embedding, tq_codes), which returns an unbiased estimate: E[result] = ⟨y,x⟩.
- Fast dot-product path: embeddings are L2-normalised by the Python sidecar, so cosine = dot product. The original sqrt(norm_a) * sqrt(norm_b) computation always evaluated to sqrt(1)×sqrt(1) = 1, so it was eliminated.
Context Signatures
This is what stops false hits. Two identical queries with different conversation histories get different signatures and never share a cache entry.
Extract Intent
"What is the cancellation policy for my booking?" → strip stopwords → cancellation_policy_booking
First 3 non-stopword tokens, lowercase, joined by underscore
Anonymise User
SHA-256("user:u_12345")[:16] → 3a7f2c91b0e4d852
Deterministic 16-char hex token - user_id is never stored raw
Canonicalize Context
["bot: hello", "user: hi"] → sorted → ["bot: hello", "user: hi"]|
Sort ensures the hash is the same regardless of turn order
Compute SHA-256
SHA-256("cancellation_policy_booking:ecommerce:3a7f2c91b0e4d852:...")
= a3f2c1d8e9f0b2c3... - this is the Redis L1 key suffix
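The four steps can be sketched end to end. The stopword list is abbreviated and the code is illustrative, not the C++ implementation:

```python
import hashlib
import re

# Abbreviated stand-in for the 34-word stopword list
STOPWORDS = {"what", "is", "the", "for", "my", "a", "an", "of", "to"}

def extract_intent(query):
    # Lowercase, strip non-alphanumeric, drop stopwords, keep first 3 tokens.
    tokens = re.sub(r"[^a-z0-9 ]", " ", query.lower()).split()
    return "_".join([t for t in tokens if t not in STOPWORDS][:3])

def anonymise(user_id):
    # Deterministic 16-char token; the raw user_id is never stored.
    return hashlib.sha256(("user:" + user_id).encode()).hexdigest()[:16]

def signature(query, domain, user_id, context):
    # Sorted context turns make the hash order-invariant.
    canon = ":".join([extract_intent(query), domain, anonymise(user_id)] + sorted(context))
    return hashlib.sha256(canon.encode()).hexdigest()
```

Two identical conversations with their turns in different order hash to the same signature, so they share one cache entry.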
Why Context Mismatch Always Misses
| Scenario | Cosine | Ctx sig | Domain | Score | Result |
|---|---|---|---|---|---|
| Same query, same context, same domain | 0.97 | 1.0 | 1.0 | 0.97 | HIT |
| Same query, different context | 0.97 | 0.0 | 1.0 | 0.73 | MISS |
| Same query, same context, different domain | 0.97 | 0.0 | 0.0 | 0.58 | MISS |
| Very similar query, same context | 0.93 | 1.0 | 1.0 | 0.96 | HIT |
Context mismatch caps score at 0.75. The context term (weight 0.25) contributes a maximum of 0.25 on match. On mismatch: max score = 0.60×1.0 + 0.0 + 0.15×1.0 = 0.75 - safely below the 0.85 threshold. The cache cannot serve a false hit across context boundaries regardless of embedding similarity.
["user: hi", "bot: hello"] and ["bot: hello", "user: hi"] produced different SHA-256 hashes → same conversation, two cache entries.Fix (ContextBuilder.cpp):
std::find over a 34-element vector for stopword lookup was also replaced with an std::unordered_set for O(1) lookup.
Admission Control
Not every LLM response is worth caching. AdmissionController prevents one-off queries from polluting the FAISS index and keeps cached entries generalisable.
Rules
| Rule | Default | Purpose |
|---|---|---|
| min_frequency | 2 | Same signature must be seen ≥ N times in the window |
| window_seconds | 300 (5 min) | Rolling time window - counter resets after inactivity |
| max_response_bytes | 32 768 | Responses > 32 KB are too specific to reuse |
Templatizer runs after admission. When admitted, Templatizer replaces high-entropy tokens (UUIDs, dates, numbers, proper nouns) with {{SLOT_N}} placeholders. At serve time, slot values are restored from Redis (lc:slots:{entry_id}). This makes cached responses reusable across minor surface variations.
Pipeline in CacheBuilderWorker::processEntry():
- Size gate (AdmissionController) - fast reject oversized/empty
- ResponseQualityFilter - hard reject conversational, session-bound, refusals, dynamic content
- CVF decision (IntelligentAdmissionPolicy)
TurboQuant Integration
TurboQuant (arXiv:2504.19874, Zandieh et al. 2025) is a data-oblivious vector quantizer that achieves near-optimal distortion with zero training time and provably unbiased inner-product estimation - the two properties LettuceCache needs most.
1536B → 196B (d=384)
data-oblivious · online
E[TQ_ip] = ⟨y,x⟩
Why LettuceCache Needs This
❌ Without TurboQuant (plain MSE)
MSE quantizers introduce a multiplicative bias of 2/π ≈ 0.637 on inner-product estimates. At 1-bit, every cosine similarity is scaled down: a true score of 0.95 reads as ≈0.61, so the 0.85 threshold is never reached.
✓ With TurboQuant_prod
The two-stage algorithm (MSE + QJL residual) produces an unbiased estimator, so the threshold remains exactly as configured.
Algorithm Walkthrough
TurboQuant_prod encodes a d-dimensional vector in two stages, each addressing a different objective: an MSE stage that minimises reconstruction distortion, then a QJL stage that encodes the residual so the inner-product estimate is unbiased.
Code Layout (d=384, 4-bit total)
Unbiased Inner Product Estimation
At inference time, the inner product between a full-precision query y and a stored compressed vector is computed asymmetrically - the query stays full-precision.
Proof sketch (Theorem 2, arXiv:2504.19874)
Since x = x̂_mse + r (decomposition by definition of the residual), the MSE stage contributes ⟨y, x̂_mse⟩ exactly and the QJL stage contributes an unbiased estimate of ⟨y, r⟩; summing gives E[TQ_ip(y, encode(x))] = ⟨y, x̂_mse⟩ + ⟨y, r⟩ = ⟨y, x⟩.
Variance: σ²(score) = 0.60² × π/(2d) ≈ 0.00073 for d=384 → σ ≈ 0.027. This is comparable to the inherent noise in the embedding model itself.
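TurboQuant's actual two-stage construction is specified in the paper; as a toy illustration of the property LettuceCache relies on - an estimator whose expectation equals the exact inner product - here is a dithered (stochastic) 1-bit quantizer that is unbiased by construction. This is not TurboQuant's algorithm, just a Monte Carlo demonstration of unbiased 1-bit inner-product estimation:

```python
import random

def stochastic_1bit(x, rng):
    """Unbiased 1-bit code: q_i in {-a, +a} chosen so that E[q_i] = x_i (needs |x_i| <= a)."""
    a = max(abs(v) for v in x)
    return [a if rng.random() < (1 + v / a) / 2 else -a for v in x]

def dot(u, v):
    return sum(p * q for p, q in zip(u, v))

rng = random.Random(42)
d = 32
x = [rng.gauss(0, 1) for _ in range(d)]
y = [rng.gauss(0, 1) for _ in range(d)]
true_ip = dot(x, y)

# Averaging the asymmetric estimate <y, q> over many fresh encodings of x
# converges to the exact <y, x> - no multiplicative bias to correct for.
trials = 20000
est = sum(dot(y, stochastic_1bit(x, rng)) for _ in range(trials)) / trials
```

A deterministic sign quantizer, by contrast, carries a systematic multiplicative bias that no amount of averaging removes - which is why a fixed 0.85 threshold only makes sense with an unbiased estimator.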
TurboQuant vs Traditional Product Quantization (PQ)
| Property | FAISS IVF+PQ | TurboQuant_prod |
|---|---|---|
| Training required | ✗ Yes - k-means on all vectors | ✓ None - data-oblivious |
| Online indexing | ✗ Must retrain as distribution shifts | ✓ Encode any vector instantly |
| Inner product bias | ✗ Uncharacterised analytically | ✓ Zero bias (proved) |
| Distortion bound | Data-dependent, no guarantee | Within 2.7× of Shannon lower bound |
| Threshold validity | ✗ Score drift at low bits | ✓ 0.85 threshold holds at 4-bit |
| Cold start quality | ✗ Poor until codebooks trained | ✓ Full quality from first vector |
Enable: ENABLE_TURBO_QUANT=1 ./build/lettucecache
TurboQuantizer is wired into FaissVectorStore::add() (auto-encodes every entry) and ValidationService::score() (uses unbiased inner product when tq_codes present). Seeds are deterministic (rotation=42, QJL=137) ensuring reproducible compression across restarts.
Bug fixed (TurboQuantizer.cpp): measured MSE of 0.233 for d=384 against the theoretical bound of 0.133.
API Reference
The orchestrator exposes a REST API on port 8080 (configurable via HTTP_PORT).
POST /query - main entry point. Checks L1 → L2 → LLM. Returns the answer with cache metadata.
Request Body
{
"query": "What is the return policy?",
"context": ["I bought a jacket last week", "It doesn't fit"],
"user_id": "u_123",
"session_id": "sess_abc",
"domain": "ecommerce",
"correlation_id": "req_xyz"
}
| Field | Type | Description |
|---|---|---|
| queryrequired | string | The user's question |
| context | string[] | Prior conversation turns. Sorted canonically before hashing - order doesn't matter. |
| user_id | string | Hashed to 16-char scope token. Never stored raw. |
| session_id | string | Passed through to logs for distributed tracing. Not used in cache logic. |
| domain | string | Domain tag (e.g. ecommerce, healthcare). Defaults to "general". |
| correlation_id | string | Echoed in structured logs. |
Response
{
"answer": "You can return items within 30 days of purchase.",
"cache_hit": true,
"confidence": 0.93,
"cache_entry_id": "lc:l1:a3f2c1d8...",
"latency_ms": 45
}
| Field | Type | Description |
|---|---|---|
| answer | string | The response text |
| cache_hit | boolean | true if served from L1 or L2 cache |
| confidence | float | Validation score 0.0-1.0. 1.0 on L1 hit (no scoring needed). 0.0 on LLM fallback. |
| cache_entry_id | string | FAISS entry ID on L2 hit; L1 key on L1 hit; empty on miss. |
| latency_ms | integer | Total server-side processing time |
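A minimal Python client for POST /query might look like this. The payload fields mirror the request table above; urllib keeps it dependency-free, and the function names are illustrative:

```python
import json
from urllib import request

def build_query_payload(query, context=None, user_id=None, domain="general",
                        correlation_id=None):
    """Assemble the /query request body; optional fields are omitted when unset."""
    payload = {"query": query, "domain": domain}
    if context:
        payload["context"] = list(context)
    if user_id:
        payload["user_id"] = user_id
    if correlation_id:
        payload["correlation_id"] = correlation_id
    return payload

def post_query(base_url, payload, timeout=10):
    req = request.Request(
        base_url.rstrip("/") + "/query",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with request.urlopen(req, timeout=timeout) as resp:
        return json.loads(resp.read())
```

Usage: `post_query("http://localhost:8080", build_query_payload("What is the return policy?", domain="ecommerce"))` returns the response object with answer, cache_hit, confidence, and latency_ms.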
GET /health - dependency health check. Returns 200 when all services are healthy, 503 when degraded. Used as a Kubernetes readiness probe.
// 200 OK - healthy
{
"status": "ok",
"redis": true,
"embedding_sidecar": true,
"faiss_entries": 1042,
"queue_depth": 3
}
// 503 Service Unavailable - degraded
{
"status": "degraded",
"redis": false,
"embedding_sidecar": true,
"faiss_entries": 1042,
"queue_depth": 0
}
GET /stats - lightweight stats snapshot. No dependency checks - always fast.
{
"faiss_entries": 1042,
"queue_depth": 3
}
DELETE /cache/{key} - evicts a specific entry from FAISS and Redis L1. Use when a cached response is stale or incorrect.
curl -X DELETE http://localhost:8080/cache/a3f2c1d8e9f0b2c3
// 200 OK
{ "deleted": true, "key": "a3f2c1d8e9f0b2c3" }
// 404 Not Found
{ "deleted": false, "key": "a3f2c1d8e9f0b2c3" }
The Python sidecar runs on port 8001 (internal; not exposed to clients directly).
POST /embed
// Request
{ "text": "What is the return policy?" }
// Response
{
"embedding": [0.023, -0.147, 0.891, "...384 floats..."],
"model": "all-MiniLM-L6-v2",
"dimension": 384
}
// Request (up to 256 texts)
{ "texts": ["query one", "query two"] }
// Response
{
"embeddings": [[...], [...]],
"model": "all-MiniLM-L6-v2",
"dimension": 384,
"count": 2
}
Configuration
All runtime configuration is via environment variables. No config files needed.
Environment Variables
| Variable | Default | Description |
|---|---|---|
| REDIS_HOST | localhost | Redis server hostname |
| REDIS_PORT | 6379 | Redis server port |
| EMBED_URL | http://localhost:8001 | Python sidecar base URL |
| EMBED_DIM | 384 | Embedding dimension - must match model |
| OPENAI_API_KEY | empty | OpenAI API key. Empty → stub response mode (useful for testing) |
| LLM_MODEL | gpt-4o-mini | OpenAI model name (e.g. gpt-4o, gpt-3.5-turbo) |
| HTTP_PORT | 8080 | Orchestrator HTTP listen port |
| FAISS_INDEX_PATH | ./faiss.index | Path for FAISS binary + .meta.json sidecar |
| ENABLE_TURBO_QUANT | unset | Set to 1 to enable TurboQuantizer (7.8× compression, unbiased cosine) |
Source-Code-Only Tuning
| Parameter | Location | Notes |
|---|---|---|
| Validation threshold (0.85) | src/api/HttpServer.cpp:47 | Lower = more hits; higher = fewer false positives |
| Scoring weights (0.60/0.25/0.15) | src/validation/ValidationService.h | Must sum to 1.0 |
| Admission min_frequency (2) | src/api/HttpServer.cpp:48 | Lower = more cache entries; higher = fewer one-offs |
| Admission window (300s) | src/api/HttpServer.cpp:48 | Rolling frequency window |
| FAISS NLIST/NPROBE/M_PQ | src/cache/FaissVectorStore.h | NLIST=100, NPROBE=10, M_PQ=8, NBITS=8 |
| TurboQuant seeds | TurboQuantizer constructor | rotation_seed=42, qjl_seed=137 (hardcoded) |
| L1 TTL (3600s) | src/builder/CacheBuilderWorker.h | Redis key TTL in seconds |
Quick Start
Get LettuceCache running locally in under 5 minutes using Docker Compose.
Clone the repository
git clone git@github.com:Ciphercrypt/LettuceCache.git
cd LettuceCache
Set your API key
export OPENAI_API_KEY=sk-...
# Or leave blank to use stub mode (returns "[LLM not configured] {query}")
Start all services
docker compose up
# Starts: Redis 7 (:6379), Python sidecar (:8001), C++ orchestrator (:8080)
# Wait for: orchestrator | [INFO] HttpServer listening on port 8080
Verify health
curl http://localhost:8080/health
{ "status": "ok", "redis": true, "embedding_sidecar": true, "faiss_entries": 0 }
First and second queries - LLM called (admission requires 2 appearances)
curl -s -X POST http://localhost:8080/query \
-H 'Content-Type: application/json' \
-d '{"query":"What is the capital of France?","domain":"geography"}' | jq .
{ "answer": "The capital of France is Paris.", "cache_hit": false, "latency_ms": 712 }
Third query - served from cache (~15× faster)
# Third call hits L2 cache (~45ms); subsequent identical calls hit L1 at ~2ms (~350× faster)
curl -s -X POST http://localhost:8080/query \
-H 'Content-Type: application/json' \
-d '{"query":"What is the capital of France?","domain":"geography"}' | jq .
{ "answer": "The capital of France is Paris.", "cache_hit": true, "confidence": 0.94, "latency_ms": 45 }
Enable TurboQuant for compressed storage
ENABLE_TURBO_QUANT=1 LLM_MODEL=gpt-4o-mini \
OPENAI_API_KEY=$OPENAI_API_KEY \
FAISS_INDEX_PATH=./my.faiss ./build/lettucecache
Reduces stored embedding size from 1536 bytes to 196 bytes (d=384) with unbiased cosine scoring.
Run the test suite:
# Unit tests (no external deps required)
cmake -B build && cmake --build build --target unit_tests
cd build && ctest --output-on-failure
# Full happy-path integration test (starts stack automatically)
bash tests/happy_path.sh
# Live demo with real LLM
export OPENAI_API_KEY=sk-...
bash tests/live_demo.sh