LettuceCache

Context-aware semantic cache for LLMs. Stops redundant API calls without false hits - because the same question means different things in different conversations.

⚡ 1–3ms L1 hit 🎯 25–60ms L2 hit 🚀 ~15× faster than LLM 🔒 Context-isolated
lettucecache demo
# First call - LLM is invoked
$ curl -X POST :8080/query -d '{"query":"What is the return policy?","domain":"ecommerce"}'
{ "cache_hit": false, "latency_ms": 843, "answer": "Items can be returned within 30 days..." }
# Second call - served from cache instantly
$ curl -X POST :8080/query -d '{"query":"What is the return policy?","domain":"ecommerce"}'
{ "cache_hit": true, "confidence": 0.94, "latency_ms": 47, "answer": "Items can be returned within 30 days..." }
# Same query, different context - correctly misses
$ curl -X POST :8080/query -d '{"query":"What is the return policy?","domain":"fitness"}'
{ "cache_hit": false, "latency_ms": 761, "answer": "Our gym membership cancellation policy..." }

The Problem

Traditional caching matches on exact text. Semantic caching matches on meaning. Both fail for context-dependent queries:

User A - "What is the cancellation policy?" (hotel booking)
User B - "What is the cancellation policy?" (gym membership)

A naive semantic cache serves User B the hotel answer - a false hit worse than a miss.

| Approach | Exact match | Semantic match | Context-aware |
|---|---|---|---|
| Traditional KV cache | ✓ | ✗ | ✗ |
| Semantic cache (embedding only) | ✓ | ✓ | ✗ |
| LettuceCache | ✓ | ✓ | ✓ |

Key Numbers

| Metric | Value |
|---|---|
| L1 cache latency (Redis exact match) | 1–3 ms |
| L2 embedding latency (Python sidecar, CPU) | 20–50 ms |
| L2 FAISS search latency | 1–3 ms |
| End-to-end L1 hit | 1–3 ms |
| End-to-end L2 hit | 25–60 ms |
| LLM call baseline | 500–2000 ms |
| Validation threshold | 0.85 (configurable) |
| Embedding model | all-MiniLM-L6-v2 (384 dims) |
| TurboQuant compression | 6.3× (1536 B → 244 B, d=384) |

Architecture

Every query goes through two cache checks before hitting the LLM. All reads are non-blocking; all writes are async.

⚡ Hot path
  1. 📥 POST /query — HTTP :8080
  2. 🔑 ContextBuilder — SHA-256 signature
  3. L1 · Redis — exact hash · 1–3 ms → ✓ hit returns, ✗ miss continues
  4. 🧠 EmbeddingClient — 384-dim · Python :8001
  5. 🔍 L2 · FAISS — ANN top-5 · 1–3 ms
  6. ⚖️ ValidationService — score ≥ 0.85 → ✓ hit returns, ✗ miss continues
  7. 🤖 LLM — gpt-4o-mini · 500 ms+

⟳ Async write path (never blocks the response)
  1. 📬 CacheBuilderWorker — background thread
  2. 🚦 AdmissionController — freq ≥ 2 · window 300 s
  3. ✂️ Templatizer — strip PII → {{SLOT_N}}
  4. 💾 FAISS + Redis write — L2 + L1 · TTL 3600 s

Services

  • 🔧 C++ Orchestrator — port 8080 · cpp-httplib · all components wired at construction · shared_mutex on FAISS reads
  • 🐍 Python Sidecar — port 8001 · FastAPI · all-MiniLM-L6-v2 · 384-dim L2-normalised · circuit breaker on the C++ side
  • Redis L1 — port 6379 · SETEX 3600 s · RDB persistence · key: lc:l1:{sha256}
  • 🔍 FAISS L2 — in-process · IVF+PQ index · metadata sidecar (.meta.json) survives restarts

Composition root: src/api/HttpServer.cpp owns every component as a unique_ptr and constructs them in dependency order. Nothing else uses new - all cross-component communication is via references.
  • TurboQuantizer is created first (if enabled) - FaissVectorStore and ValidationService both receive a raw pointer to it.
  • IntelligentAdmissionPolicy receives a reference to FaissVectorStore (for novelty search).
  • QueryOrchestrator receives a reference to the policy to record cache hits for adaptive threshold updates.
  • Destruction order: svr_ → builder_ → everything else, in reverse construction order (FAISS persists in its destructor).
HttpServer owns: tq_ → redis_ → faiss_ → embedder_ → llm_ → validator_ → admission_ → policy_ → quality_filter_ → templatizer_ → builder_ → orchestrator_ → svr_
  • FaissVectorStore: std::shared_mutex - concurrent search() calls (shared lock), exclusive add()/remove()/persist().
  • CacheBuilderWorker: single background thread with std::condition_variable queue. enqueue() is the only method called from the hot path - it acquires a brief mutex, pushes, and notifies.
  • EmbeddingClient: std::mutex curl_mutex_ around the persistent CURL handle.
  • RedisCacheAdapter: single redisContext* with no mutex - open bug. Concurrent requests race; a connection pool or per-call mutex is needed.
  • IntelligentAdmissionPolicy: two separate mutexes - freq_mutex_ and domain_mutex_ - to avoid contention between frequency updates and domain stats reads.

How It Works

Every request flows through a deterministic pipeline. Here is the exact sequence:

1. Client sends POST /query
   Send query, context[] (prior turns), domain, and optionally user_id / correlation_id.

2. ContextBuilder computes signature (CPU only)
   Pulls the first 3 meaningful words from the query as intent. Hashes user_id to a 16-char token (never stored raw). Sorts context turns. Combines into SHA-256(intent:domain:scope:sorted_context).

3. L1 Redis lookup (1–3 ms)
   Looks up lc:l1:{sig_hash}. Hit returns instantly at confidence 1.0. Miss moves to the embedding step.

4. Embed query via Python sidecar (20–50 ms, CPU)
   Calls :8001/embed, gets back 384 floats (L2-normalised). If the sidecar is down, a circuit breaker kicks in and falls through to the LLM instead of hanging.

5. FAISS search + ValidationService (1–3 ms)
   Returns top-5 candidates. Each scored: 0.60×cosine + 0.25×ctx_match + 0.15×domain_match. First candidate with score ≥ 0.85 wins. On hit: backfills L1 with the rendered response (slot placeholders filled).

6. LLM fallback + async cache build (500 ms+)
   Calls OpenAI gpt-4o-mini. Response returned to client immediately. Entry enqueued to CacheBuilderWorker (background thread) - admission check, templatization, FAISS+Redis write. Client never waits for this.

🔒 Privacy by design. user_id is hashed to a 16-char token and never stored raw. Query text is never persisted - only the 384-dim embedding and a templatized response with high-entropy tokens (UUIDs, dates, numbers, proper nouns) replaced by {{SLOT_N}} placeholders.

Writes never block reads. The CacheBuilderWorker runs on a dedicated background thread with a std::condition_variable queue. The orchestrator calls enqueue() and returns the HTTP response immediately.
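The pattern can be sketched in Python using a thread-safe queue (a simplified stand-in for the C++ condition-variable queue; the class and field names here are hypothetical, not the real CacheBuilderWorker API):

```python
import queue
import threading

class CacheBuilderSketch:
    """Background worker: enqueue() returns immediately; one thread drains the queue."""
    def __init__(self):
        self.q = queue.Queue()                 # thread-safe, condition-variable based
        self.processed = []
        threading.Thread(target=self._run, daemon=True).start()

    def enqueue(self, entry):                  # the only call made from the hot path
        self.q.put(entry)                      # brief internal lock, then notify

    def _run(self):
        while True:
            entry = self.q.get()               # blocks until work arrives
            self.processed.append(entry)       # admission/templatize/write would go here
            self.q.task_done()

w = CacheBuilderSketch()
w.enqueue({"sig": "abc", "response": "hello"})
w.q.join()                                     # wait here only to demonstrate completion
assert w.processed[0]["sig"] == "abc"
```

The hot path only ever pays for `enqueue()`; everything expensive happens on the worker thread.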

Input: query text, context[] array, domain string, user_id string

Step 1 - intent extraction (ContextBuilder.cpp:extractIntent()):
Lowercase → strip non-alphanumeric → skip 34 stopwords → take first 3 tokens → join with _.
  "What is machine learning?" → "machine_learning"
  "How do I reset my password?" → "reset_password"
Step 2 - user anonymisation: SHA-256("user:" + user_id)[:16] - never stores raw identity.

Step 3 - context canonicalisation: std::sort(context_turns) - reversed order produces same hash.

Step 4 - signature:
SHA-256( intent + ":" + domain + ":" + user_scope + ":" + sorted_context_joined )
This 64-char hex string is the Redis L1 key suffix and the context_signature on each FAISS entry.
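The four signature steps can be sketched in Python (a hypothetical port of ContextBuilder: the stopword set is an illustrative subset of the real 34-word list, and the `|` join separator for context turns is an assumption):

```python
import hashlib

STOPWORDS = {"what", "is", "the", "a", "an", "how", "do", "i", "my", "for"}  # illustrative subset

def extract_intent(query: str, max_tokens: int = 3) -> str:
    # lowercase → strip non-alphanumeric → skip stopwords → first 3 tokens → join with _
    tokens = ["".join(c for c in t if c.isalnum()) for t in query.lower().split()]
    meaningful = [t for t in tokens if t and t not in STOPWORDS]
    return "_".join(meaningful[:max_tokens])

def user_scope(user_id: str) -> str:
    # deterministic 16-char token; raw user_id is never stored
    return hashlib.sha256(f"user:{user_id}".encode()).hexdigest()[:16]

def signature(query: str, domain: str, user_id: str, context: list) -> str:
    canonical = "|".join(sorted(context))      # sorted → turn order no longer matters
    payload = f"{extract_intent(query)}:{domain}:{user_scope(user_id)}:{canonical}"
    return hashlib.sha256(payload.encode()).hexdigest()  # 64-char hex L1 key suffix

# Reversed context turns produce the identical signature:
a = signature("What is the return policy?", "ecommerce", "u_123", ["user: hi", "bot: hello"])
b = signature("What is the return policy?", "ecommerce", "u_123", ["bot: hello", "user: hi"])
assert a == b and len(a) == 64
```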
On cache write (Templatizer::templatize()):
Tokenise response → for each token, classify:
  • isNumeric: regex ^-?\d+(\.\d+)?%?$
  • looksLikeDate: regex for YYYY-MM-DD / DD/MM/YYYY patterns
  • looksLikeUUID: 8-4-4-4-12 hex pattern
  • looksLikeProperNoun: starts uppercase, rest lowercase, length > 3, not in common-word blocklist
High-entropy tokens → replaced with {{SLOT_N}}. Slot values stored in Redis at lc:slots:{entry_id}.

On L2 serve (QueryOrchestrator.cpp):
Read lc:slots:{entry_id} → parse JSON array → Templatizer::render(template, slot_values) → fills placeholders back in. Falls back to raw template if slots have expired.
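A minimal Python sketch of the templatize/render round trip, using the token classes described above (function names, tokenisation, and the common-word blocklist are illustrative, not the real Templatizer API):

```python
import re

NUMERIC = re.compile(r"^-?\d+(\.\d+)?%?$")
DATE    = re.compile(r"^\d{4}-\d{2}-\d{2}$|^\d{2}/\d{2}/\d{4}$")
UUID_RE = re.compile(r"^[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-"
                     r"[0-9a-fA-F]{4}-[0-9a-fA-F]{12}$")
COMMON  = {"The", "This", "Your"}  # illustrative proper-noun blocklist

def high_entropy(tok: str) -> bool:
    proper = tok[:1].isupper() and tok[1:].islower() and len(tok) > 3 and tok not in COMMON
    return bool(NUMERIC.match(tok) or DATE.match(tok) or UUID_RE.match(tok) or proper)

def templatize(response: str):
    template, slots = [], []
    for tok in response.split():
        if high_entropy(tok):
            template.append(f"{{{{SLOT_{len(slots)}}}}}")  # emit {{SLOT_N}}
            slots.append(tok)                              # slot value → lc:slots:{entry_id}
        else:
            template.append(tok)
    return " ".join(template), slots

def render(template: str, slots):
    for i, v in enumerate(slots):
        template = template.replace(f"{{{{SLOT_{i}}}}}", v)
    return template

tpl, slots = templatize("Your order 12345 ships on 2024-06-01 via Fedex")
assert render(tpl, slots) == "Your order 12345 ships on 2024-06-01 via Fedex"
```

The number, the date, and the proper noun get slotted; everything else stays, so the template is reusable across surface variations.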

Validation Scoring

For each FAISS candidate, a composite score is computed. First one at or above 0.85 gets returned as a cache hit.

score = 0.60 × cosine_similarity + 0.25 × context_match + 0.15 × domain_match

Worked examples (threshold 0.85):

| Scenario | Breakdown | Score | Result |
|---|---|---|---|
| "How do I cancel?" → "Can I cancel?" (paraphrase) · same context, same domain | 0.60×0.96 + 0.25×1.0 + 0.15×1.0 | 0.976 | ✓ HIT |
| Near-perfect text match, wrong context (hotel vs gym) | 0.60×0.99 + 0.25×0.0 + 0.15×0.0 | 0.594 | ✗ MISS |
| "Cancel order" → "Refund process" · related query, same context | 0.60×0.78 + 0.25×1.0 + 0.15×1.0 | 0.868 | ✓ HIT |

TurboQuant-corrected scoring. With ENABLE_TURBO_QUANT=1, the cosine term uses the unbiased inner-product estimator: E[TQ_ip(y, encode(x))] = ⟨y, x⟩. This is critical - with a plain MSE quantizer at 1 bit, every cosine score would be scaled by 2/π ≈ 0.64, making 0.85 unreachable. TurboQuant_prod eliminates this bias entirely.

src/validation/ValidationService.cpp

Two paths for cosine similarity:
  • TurboQuant path: if tq_ != nullptr and candidate.tq_codes is non-empty → calls tq_->inner_product(query_embedding, tq_codes). Returns unbiased estimate E[result] = ⟨y,x⟩.
  • Fast dot-product path: embeddings are L2-normalised by the Python sidecar, so cosine = dot product. The original sqrt(norm_a) * sqrt(norm_b) computation was always computing sqrt(1)×sqrt(1) = 1 - eliminated.
Composite score:
score = 0.60 × cosine + 0.25 × (sig_hash match ? 1 : 0) + 0.15 × (domain match ? 1 : 0)
A context mismatch alone caps the maximum achievable score at 0.75 - safely below the 0.85 threshold, making false hits across conversation contexts structurally impossible.
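The cap is easy to verify with a few lines of Python (a sketch of the scoring rule, not the actual ValidationService code):

```python
def composite_score(cosine: float, ctx_match: bool, domain_match: bool) -> float:
    """Weights from ValidationService: 0.60 cosine + 0.25 context + 0.15 domain."""
    return 0.60 * cosine + 0.25 * ctx_match + 0.15 * domain_match

THRESHOLD = 0.85

# Perfect embedding match but wrong context: capped at 0.60 + 0.15 = 0.75 < 0.85
assert abs(composite_score(1.0, False, True) - 0.75) < 1e-9
assert composite_score(1.0, False, True) < THRESHOLD

# A paraphrase in the same context clears the bar: 0.576 + 0.25 + 0.15 = 0.976
assert composite_score(0.96, True, True) >= THRESHOLD
```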

Context Signatures

This is what stops false hits. Two identical queries with different conversation histories get different signatures and never share a cache entry.

1. Extract Intent
   "What is the cancellation policy for my booking?" → strip stopwords → cancellation_policy_booking
   First 3 non-stopword tokens, lowercase, joined by underscore.

2. Anonymise User
   SHA-256("user:u_12345")[:16] → 3a7f2c91b0e4d852
   Deterministic 16-char hex token - user_id is never stored raw.

3. Canonicalize Context
   ["user: hi", "bot: hello"] → sorted → ["bot: hello", "user: hi"]
   Sorting ensures the hash is the same regardless of turn order.

4. Compute SHA-256
   SHA-256("cancellation_policy_booking:ecommerce:3a7f2c91b0e4d852:...")
   = a3f2c1d8e9f0b2c3... - this is the Redis L1 key suffix.

Why Context Mismatch Always Misses

| Scenario | Cosine | Ctx sig | Domain | Score | Result |
|---|---|---|---|---|---|
| Same query, same context, same domain | 0.97 | 1.0 | 1.0 | 0.98 | HIT |
| Same query, different context | 0.97 | 0.0 | 1.0 | 0.73 | MISS |
| Same query, same context, different domain | 0.97 | 0.0 | 0.0 | 0.58 | MISS |
| Very similar query, same context | 0.93 | 1.0 | 1.0 | 0.96 | HIT |

(Domain is part of the signature, so a domain change also flips the ctx-sig term to 0.)
⚠️ Context mismatch caps score at 0.75. The context term (weight 0.25) contributes a maximum of 0.25 on match. On mismatch: max score = 0.60×1.0 + 0.0 + 0.15×1.0 = 0.75 - safely below the 0.85 threshold. The cache cannot serve a false hit across context boundaries regardless of embedding similarity.

Old behaviour (bug): ["user: hi", "bot: hello"] and ["bot: hello", "user: hi"] produced different SHA-256 hashes → same conversation, two cache entries.

Fix (ContextBuilder.cpp):
std::sort(context_turns.begin(), context_turns.end()); // lexicographic canonical order
signature = SHA-256(intent:domain:scope:canonical_joined)
Sorted turns → same hash regardless of order. The original std::find over a 34-element vector for stopword lookup was also replaced with an std::unordered_set for O(1) lookup.

Admission Control

Not every LLM response is worth caching. AdmissionController prevents one-off queries from polluting the FAISS index and keeps cached entries generalisable.

Rules

| Rule | Default | Purpose |
|---|---|---|
| min_frequency | 2 | Same signature must be seen ≥ N times in the window |
| window_seconds | 300 (5 min) | Rolling time window - counter resets after inactivity |
| max_response_bytes | 32 768 | Responses > 32 KB are too specific to reuse |

Frequency Counting Timeline

Request
Count
Action
Request #1
count=1
recordQuery() ✗ Rejected
Request #2
count=2
shouldAdmit() = true → writes to FAISS + Redis ✓ Admitted
Request #3+
count=3+
L1/L2 cache serves it - CacheBuilderWorker deduplicates ✓ Served
After 5 min idle
reset
evictExpired() - counter resets to 0
Request #4
count=1
Window expired - starts fresh ✗ Rejected
🗜 Templatizer runs after admission. When admitted, Templatizer replaces high-entropy tokens (UUIDs, dates, numbers, proper nouns) with {{SLOT_N}} placeholders. At serve time, slot values are restored from Redis (lc:slots:{entry_id}). This makes cached responses reusable across minor surface variations.
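The frequency-window rule can be reproduced with a small Python sketch (a hypothetical class mirroring AdmissionController's behaviour; the real implementation is C++):

```python
import time

class AdmissionSketch:
    """Admit a signature once seen >= min_frequency times within window_seconds."""
    def __init__(self, min_frequency=2, window_seconds=300):
        self.min_frequency = min_frequency
        self.window_seconds = window_seconds
        self.counts = {}                       # signature -> (count, last_seen)

    def record_query(self, sig, now=None):
        now = time.time() if now is None else now
        count, last = self.counts.get(sig, (0, now))
        if now - last > self.window_seconds:
            count = 0                          # window expired: start fresh
        self.counts[sig] = (count + 1, now)

    def should_admit(self, sig):
        return self.counts.get(sig, (0, 0))[0] >= self.min_frequency

ac = AdmissionSketch()
ac.record_query("sig_a", now=0)                # request #1: count=1 → rejected
assert not ac.should_admit("sig_a")
ac.record_query("sig_a", now=10)               # request #2: count=2 → admitted
assert ac.should_admit("sig_a")
ac.record_query("sig_a", now=400)              # 390 s idle > 300 s window → reset to 1
assert not ac.should_admit("sig_a")
```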

File: src/builder/IntelligentAdmissionPolicy.cpp

Pipeline in CacheBuilderWorker::processEntry():
  1. Size gate (AdmissionController) - fast reject oversized/empty
  2. ResponseQualityFilter - hard reject conversational, session-bound, refusals, dynamic content
  3. CVF decision (IntelligentAdmissionPolicy)
CVF signals:
  • frequency = 1 − exp(−Σᵢ exp(−λ·age_i)), with λ = ln2/120 s (half-life)
  • cost = sigmoid((tokens × tier − 150) / 350), where tier: gpt-4o=3.0 · gpt-4o-mini=1.0 · gpt-3.5=0.4
  • novelty = 1 − max_cosine(new_emb, faiss_top1); cosine > 0.94 → hard-reject (near-duplicate)
  • adaptive_threshold = base(0.42) ± 0.08 × domain_hit_rate_signal
Novel insight (literature gap): No existing LLM caching paper unifies all four signals. LettuceCache fills this gap - identified as open research in the survey (resources/Intelligent Cache Admission…).
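The signals translate directly from the formulas above (a Python sketch with illustrative inputs; the real implementation lives in IntelligentAdmissionPolicy.cpp):

```python
import math

HALF_LIFE_S = 120.0
LAM = math.log(2) / HALF_LIFE_S              # decay rate: ln2 / 120 s

def frequency_signal(ages_s):
    """1 - exp(-sum of exponentially decayed hits); recent hits count more."""
    return 1.0 - math.exp(-sum(math.exp(-LAM * a) for a in ages_s))

def cost_signal(tokens, tier):
    """tier: gpt-4o=3.0, gpt-4o-mini=1.0, gpt-3.5=0.4"""
    x = (tokens * tier - 150.0) / 350.0
    return 1.0 / (1.0 + math.exp(-x))        # sigmoid

def novelty_signal(max_cosine_to_index):
    if max_cosine_to_index > 0.94:           # near-duplicate → hard reject
        return None
    return 1.0 - max_cosine_to_index

# Two hits just now outweigh one stale hit from 10 minutes ago:
assert frequency_signal([0, 0]) > frequency_signal([600])
# A long gpt-4o answer is costlier to regenerate than a short gpt-3.5 one:
assert cost_signal(500, 3.0) > cost_signal(50, 0.4)
assert novelty_signal(0.99) is None
```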

TurboQuant Integration

TurboQuant (arXiv:2504.19874, Zandieh et al. 2025) is a data-oblivious vector quantizer that achieves near-optimal distortion with zero training time and provably unbiased inner-product estimation - the two properties LettuceCache needs most.

6.3× compression — 1536 B → 244 B (d=384)
0 training time — data-oblivious · online
≈0 bias — E[TQ_ip] = ⟨y,x⟩

Why LettuceCache Needs This

❌ Without TurboQuant (plain MSE)

MSE quantizers introduce a multiplicative bias of 2/π ≈ 0.637 on inner-product estimates. At 1-bit, every cosine similarity is scaled down:

threshold = 0.85
effective = 0.85 / (2/π) ≈ 1.335
→ impossible to reach ✗

✓ With TurboQuant_prod

The two-stage algorithm (MSE + QJL residual) produces an unbiased estimator - threshold remains exactly as configured:

E[⟨y, Q⁻¹(Q(x))⟩] = ⟨y, x⟩
σ(score) ≈ 0.027 at d=384
→ threshold holds at 0.85 ✓

Algorithm Walkthrough

TurboQuant_prod encodes a d-dimensional vector in two stages, each addressing a different objective:

x — Input Vector
   d-dimensional embedding from all-MiniLM-L6-v2. L2-normalised (‖x‖₂ = 1).

1a — Store Norm + Normalize
   Store ‖x‖₂ as float32 (4 bytes). Compute the unit vector:
   norm = ‖x‖₂ → store as float32
   x̂ = x / norm
1b — Randomized Hadamard Transform (RHT)
   Apply a random sign flip, then the Walsh-Hadamard transform. This rotates x̂ so that each coordinate follows N(0, 1/padded_dim), independent of the input. Runs in O(d log d).
   D = diag(±1), drawn from seed=42
   y = (1/√n) · WHT · D · x̂_padded
   After rotation, the coordinate distribution is known analytically - this is what enables optimal codebooks without data.
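Steps 1a-1b can be sketched in numpy (an illustrative re-implementation, not the production TurboQuantizer; the sign vector is drawn from seed 42 as in the text, and the padding rule is next power of two):

```python
import numpy as np

def wht(v):
    """Iterative Walsh-Hadamard transform, O(n log n); n must be a power of 2."""
    v = v.copy()
    n, h = len(v), 1
    while h < n:
        for i in range(0, n, h * 2):
            a = v[i:i + h].copy()
            b = v[i + h:i + 2 * h].copy()
            v[i:i + h] = a + b                 # butterfly: sums
            v[i + h:i + 2 * h] = a - b         # butterfly: differences
        h *= 2
    return v

def rht(x_hat, seed=42):
    n = 1 << (len(x_hat) - 1).bit_length()     # pad 384 → 512
    padded = np.zeros(n)
    padded[:len(x_hat)] = x_hat
    signs = np.random.default_rng(seed).choice([-1.0, 1.0], size=n)  # D = diag(±1)
    return wht(signs * padded) / np.sqrt(n)    # orthonormal rotation

rng = np.random.default_rng(0)
x = rng.normal(size=384)
x_hat = x / np.linalg.norm(x)
y = rht(x_hat)
# The rotation is norm-preserving, so inner products survive it exactly
assert np.isclose(np.linalg.norm(y), 1.0)
```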
1c — Lloyd-Max Scalar Quantization (3 bits)
   Scale each coordinate by √n (→ N(0,1)), then binary-search the precomputed codebook to find the nearest centroid. Pack the index as 3 bits.
   3-bit codebook for N(0,1) (used in the MSE stage) - 8 centroids:
   −2.152, −1.344, −0.756, −0.245, +0.245, +0.756, +1.344, +2.152
   4-bit codebook for N(0,1) - 16 centroids (shown for comparison):
   −2.733, −2.069, −1.618, −1.256, −0.942, −0.657, −0.388, −0.128, +0.128, +0.388, +0.657, +0.942, +1.256, +1.618, +2.069, +2.733
   MSE distortion bound (Theorem 1): D_mse ≤ √(23π) · 4⁻³ ≈ 0.133 for any unit vector. Empirically ~0.03 at d=768.
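Nearest-centroid quantization against the 3-bit codebook reduces to a binary search over the centroid midpoints, sketched here in numpy (illustrative, not the production code; centroids copied from the table above):

```python
import numpy as np

# 3-bit Lloyd-Max centroids for N(0,1), as listed above
CODEBOOK = np.array([-2.152, -1.344, -0.756, -0.245, 0.245, 0.756, 1.344, 2.152])
MIDPOINTS = (CODEBOOK[:-1] + CODEBOOK[1:]) / 2   # decision boundaries between centroids

def quantize(coords):
    """Index of the nearest centroid for each coordinate (one 3-bit code each)."""
    return np.searchsorted(MIDPOINTS, coords)    # binary search per coordinate

def decode(indices):
    return CODEBOOK[indices]

rng = np.random.default_rng(1)
z = rng.normal(size=512)                         # post-RHT coordinates scaled to N(0,1)
codes = quantize(z)
mse = np.mean((z - decode(codes)) ** 2)          # empirically ≈ 0.035 per coordinate
assert codes.min() >= 0 and codes.max() <= 7     # fits in 3 bits
assert mse < 0.133                               # well inside the quoted bound
```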
2 — QJL Residual Correction (1 bit)
   Decode the MSE reconstruction x̂_mse, compute the residual r = x − x̂_mse, then apply the Quantized Johnson-Lindenstrauss transform:
   r = x − decode(mse_codes)
   S ∈ ℝᵈˣᵈ, i.i.d. N(0,1), seed=137
   sign_bits = sign(S · r)   // d bits total
   The QJL transform provides an unbiased 1-bit correction for the inner-product error left by the MSE stage. This is the key insight from the paper.
📦 Encoded Output — compact code ready for storage in FAISS metadata and Redis.

Code Layout (d=384, 4-bit total)

| Component | Size |
|---|---|
| float32 norm | 4 B |
| MSE indices · 3-bit × padded_dim (512) | 192 B |
| QJL sign bits · 1-bit × dim (384) | 48 B |
| Total | 244 B (vs 1536 B FP32 → 6.3× compression) |
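The byte counts follow directly from the bit widths; a quick arithmetic check (Python):

```python
def code_bytes(dim=384, mse_bits=3):
    padded = 1 << (dim - 1).bit_length()       # pad to next power of 2: 384 → 512
    norm_b = 4                                 # float32 norm
    mse_b = (padded * mse_bits + 7) // 8       # 512 × 3 bits = 192 B
    qjl_b = (dim + 7) // 8                     # 384 × 1 bit  = 48 B
    return norm_b + mse_b + qjl_b

assert code_bytes() == 244                     # 4 + 192 + 48
assert round(384 * 4 / code_bytes(), 1) == 6.3 # vs 1536 B FP32
```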

Unbiased Inner Product Estimation

At inference time, the inner product between a full-precision query y and a stored compressed vector is computed asymmetrically (query stays full-precision):

⟨y, x̂_prod⟩ = ⟨y, x̂_mse⟩ + √(π/2)/d · (S·y)ᵀ · sign_bits
MSE reconstruction term + QJL bias correction term

Proof sketch (Theorem 2, arXiv:2504.19874)

Since x = x̂_mse + r (decomposition by definition of residual):

E[⟨y, x̂_prod⟩] = ⟨y, x̂_mse⟩ + E[QJL correction]
               = ⟨y, x̂_mse⟩ + ⟨y, r⟩        // QJL is unbiased by Lemma 4
               = ⟨y, x̂_mse + r⟩
               = ⟨y, x⟩ ✓

Variance: σ²(score) = 0.60² × π/(2d) ≈ 0.00073 for d=384 → σ ≈ 0.027. This is comparable to the inherent noise in the embedding model itself.
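The unbiasedness of the QJL correction can be checked by Monte Carlo (a numpy sketch with a toy residual; note the explicit ‖r‖ factor, which this sketch assumes is stored alongside the sign bits):

```python
import numpy as np

rng = np.random.default_rng(7)
d = 384
r = rng.normal(size=d) * 0.05                  # toy residual, small as after the MSE stage
y = rng.normal(size=d)
y /= np.linalg.norm(y)                         # full-precision unit query

true_ip = y @ r
trials = 1000
estimates = np.empty(trials)
for t in range(trials):
    S = rng.normal(size=(d, d))                # fresh i.i.d. N(0,1) projection
    sign_bits = np.sign(S @ r)                 # the d stored bits
    # 1-bit estimate of <y, r>: sqrt(pi/2)/d · (S·y)ᵀ · sign_bits, scaled by ||r||
    estimates[t] = np.sqrt(np.pi / 2) * np.linalg.norm(r) / d * (S @ y) @ sign_bits

# The mean over many fresh projections converges to the true inner product
assert abs(estimates.mean() - true_ip) < 0.015
```

Averaged over projections the estimator recovers ⟨y, r⟩ exactly, which is what makes the composite ⟨y, x̂_mse⟩ + correction unbiased for ⟨y, x⟩.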

TurboQuant vs Traditional Product Quantization (PQ)

| Property | FAISS IVF+PQ | TurboQuant_prod |
|---|---|---|
| Training required | ✗ Yes - k-means on all vectors | ✓ None - data-oblivious |
| Online indexing | ✗ Must retrain as distribution shifts | ✓ Encode any vector instantly |
| Inner product bias | ✗ Uncharacterised analytically | ✓ Zero bias (proved) |
| Distortion bound | Data-dependent, no guarantee | Within 2.7× of Shannon lower bound |
| Threshold validity | ✗ Score drift at low bits | ✓ 0.85 threshold holds at 4-bit |
| Cold start quality | ✗ Poor until codebooks trained | ✓ Full quality from first vector |
🔧

Enable: ENABLE_TURBO_QUANT=1 ./build/lettucecache
TurboQuantizer is wired into FaissVectorStore::add() (auto-encodes every entry) and ValidationService::score() (uses unbiased inner product when tq_codes present). Seeds are deterministic (rotation=42, QJL=137) ensuring reproducible compression across restarts.

The bug (initial implementation): Walsh-Hadamard Transform on a 384-dim vector pads to 512 (next power of 2). After WHT, signal energy distributes across all 512 output coordinates. The original code quantized only the first 384 - discarding energy in the remaining 128.

Symptom: MSE of 0.233 for d=384 vs theoretical bound of 0.133.

Fix (TurboQuantizer.cpp):
// Before (bug): loop j in 0..dim_        → only 384 coords quantized
// After  (fix): loop j in 0..padded_dim_ → all 512 coords quantized
mse_bytes = (padded_dim_ * mse_bits + 7) / 8   // 192 bytes, not 144
sqrt_n = sqrt(padded_dim_)                      // √512, not √384
After fix: MSE < 0.10 for d=384 (well within theoretical bound). Code size: 244 bytes/vector (4 norm + 192 MSE + 48 QJL).

API Reference

The orchestrator exposes a REST API on port 8080 (configurable via HTTP_PORT).

POST /query

Main entry point. Checks L1 → L2 → LLM. Returns the answer with cache metadata.

Request Body

{
  "query": "What is the return policy?",
  "context": ["I bought a jacket last week", "It doesn't fit"],
  "user_id": "u_123",
  "session_id": "sess_abc",
  "domain": "ecommerce",
  "correlation_id": "req_xyz"
}
| Field | Type | Description |
|---|---|---|
| query (required) | string | The user's question |
| context | string[] | Prior conversation turns. Sorted canonically before hashing - order doesn't matter. |
| user_id | string | Hashed to 16-char scope token. Never stored raw. |
| session_id | string | Passed through to logs for distributed tracing. Not used in cache logic. |
| domain | string | Domain tag (e.g. ecommerce, healthcare). Defaults to "general". |
| correlation_id | string | Echoed in structured logs. |

Response

{
  "answer": "You can return items within 30 days of purchase.",
  "cache_hit": true,
  "confidence": 0.93,
  "cache_entry_id": "lc:l1:a3f2c1d8...",
  "latency_ms": 45
}
| Field | Type | Description |
|---|---|---|
| answer | string | The response text |
| cache_hit | boolean | true if served from L1 or L2 cache |
| confidence | float | Validation score 0.0–1.0. 1.0 on L1 hit (no scoring needed). 0.0 on LLM fallback. |
| cache_entry_id | string | FAISS entry ID on L2 hit; L1 key on L1 hit; empty on miss. |
| latency_ms | integer | Total server-side processing time |
GET /health

Dependency health check. Returns 200 when all services healthy, 503 when degraded. Used as Kubernetes readiness probe.

// 200 OK - healthy
{
  "status": "ok",
  "redis": true,
  "embedding_sidecar": true,
  "faiss_entries": 1042,
  "queue_depth": 3
}

// 503 Service Unavailable - degraded
{
  "status": "degraded",
  "redis": false,
  "embedding_sidecar": true,
  "faiss_entries": 1042,
  "queue_depth": 0
}
GET /stats

Lightweight stats snapshot. No dependency checks - always fast.

{
  "faiss_entries": 1042,
  "queue_depth": 3
}
DELETE /cache/{key}

Evicts a specific entry from FAISS and Redis L1. Use when a cached response is stale or incorrect.

curl -X DELETE http://localhost:8080/cache/a3f2c1d8e9f0b2c3
// 200 OK
{ "deleted": true, "key": "a3f2c1d8e9f0b2c3" }

// 404 Not Found
{ "deleted": false, "key": "a3f2c1d8e9f0b2c3" }

The Python sidecar runs on port 8001 (internal; not exposed to clients directly).

POST /embed
// Request
{ "text": "What is the return policy?" }

// Response
{
  "embedding": [0.023, -0.147, 0.891, "...384 floats..."],
  "model": "all-MiniLM-L6-v2",
  "dimension": 384
}
POST /embed_batch
// Request (up to 256 texts)
{ "texts": ["query one", "query two"] }

// Response
{
  "embeddings": [[...], [...]],
  "model": "all-MiniLM-L6-v2",
  "dimension": 384,
  "count": 2
}

Configuration

All runtime configuration is via environment variables. No config files needed.

Environment Variables

| Variable | Default | Description |
|---|---|---|
| REDIS_HOST | localhost | Redis server hostname |
| REDIS_PORT | 6379 | Redis server port |
| EMBED_URL | http://localhost:8001 | Python sidecar base URL |
| EMBED_DIM | 384 | Embedding dimension - must match model |
| OPENAI_API_KEY | empty | OpenAI API key. Empty → stub response mode (useful for testing) |
| LLM_MODEL | gpt-4o-mini | OpenAI model name (e.g. gpt-4o, gpt-3.5-turbo) |
| HTTP_PORT | 8080 | Orchestrator HTTP listen port |
| FAISS_INDEX_PATH | ./faiss.index | Path for FAISS binary + .meta.json sidecar |
| ENABLE_TURBO_QUANT | unset | Set to 1 to enable TurboQuantizer (6.3× compression, unbiased cosine) |

Source-Code-Only Tuning

| Parameter | Location | Notes |
|---|---|---|
| Validation threshold (0.85) | src/api/HttpServer.cpp:47 | Lower = more hits; higher = fewer false positives |
| Scoring weights (0.60/0.25/0.15) | src/validation/ValidationService.h | Must sum to 1.0 |
| Admission min_frequency (2) | src/api/HttpServer.cpp:48 | Lower = more cache entries; higher = fewer one-offs |
| Admission window (300 s) | src/api/HttpServer.cpp:48 | Rolling frequency window |
| FAISS NLIST/NPROBE/M_PQ | src/cache/FaissVectorStore.h | NLIST=100, NPROBE=10, M_PQ=8, NBITS=8 |
| TurboQuant seeds | TurboQuantizer constructor | rotation_seed=42, qjl_seed=137 (hardcoded) |
| L1 TTL (3600 s) | src/builder/CacheBuilderWorker.h | Redis key TTL in seconds |

Quick Start

Get LettuceCache running locally in under 5 minutes using Docker Compose.

Clone the repository

git clone git@github.com:Ciphercrypt/LettuceCache.git
cd LettuceCache

Set your API key

export OPENAI_API_KEY=sk-...
# Or leave blank to use stub mode (returns "[LLM not configured] {query}")

Start all services

docker compose up
# Starts: Redis 7 (:6379), Python sidecar (:8001), C++ orchestrator (:8080)
# Wait for: orchestrator | [INFO] HttpServer listening on port 8080

Verify health

curl http://localhost:8080/health
{ "status": "ok", "redis": true, "embedding_sidecar": true, "faiss_entries": 0 }

First and second queries - run the same request twice; the LLM is invoked both times (admission requires 2 appearances)

curl -s -X POST http://localhost:8080/query \
  -H 'Content-Type: application/json' \
  -d '{"query":"What is the capital of France?","domain":"geography"}' | jq .
{ "answer": "The capital of France is Paris.", "cache_hit": false, "latency_ms": 712 }

Third query - served from cache (~15× faster)

# Third call hits L2 cache (~45ms); subsequent identical calls hit L1 at ~2ms (~350× faster)
curl -s -X POST http://localhost:8080/query \
  -H 'Content-Type: application/json' \
  -d '{"query":"What is the capital of France?","domain":"geography"}' | jq .
{ "answer": "The capital of France is Paris.", "cache_hit": true, "confidence": 0.94, "latency_ms": 45 }

Enable TurboQuant for compressed storage

ENABLE_TURBO_QUANT=1 LLM_MODEL=gpt-4o-mini \
  OPENAI_API_KEY=$OPENAI_API_KEY \
  FAISS_INDEX_PATH=./my.faiss ./build/lettucecache

Reduces stored embedding size from 1536 bytes to 244 bytes (d=384) with unbiased cosine scoring.

🧪

Run the test suite:

# Unit tests (no external deps required)
cmake -B build && cmake --build build --target unit_tests
cd build && ctest --output-on-failure

# Full happy-path integration test (starts stack automatically)
bash tests/happy_path.sh

# Live demo with real LLM
export OPENAI_API_KEY=sk-...
bash tests/live_demo.sh