LettuceCache
Context-aware semantic cache for LLMs. Stops redundant API calls without false hits - because the same question means different things in different conversations.
The Problem
Traditional caching matches on exact text. Semantic caching matches on meaning. Both fail for context-dependent queries:
User A - "What is the cancellation policy?" (hotel booking)
User B - "What is the cancellation policy?" (gym membership)
A naive semantic cache serves User B the hotel answer - a false hit worse than a miss.
| Approach | Exact match | Semantic match | Context-aware |
|---|---|---|---|
| Traditional KV cache | ✓ | ✗ | ✗ |
| Semantic cache (embedding only) | ✗ | ✓ | ✗ |
| LettuceCache | ✓ | ✓ | ✓ |
Key Numbers
| Metric | Value |
|---|---|
| L1 cache latency (Redis exact match) | 1–3 ms |
| L2 embedding latency (Python sidecar, CPU) | 20–50 ms |
| L2 FAISS search latency | 1–3 ms |
| End-to-end L1 hit | 1–3 ms |
| End-to-end L2 hit | 25–60 ms |
| LLM call baseline | 500–2000 ms |
| Validation threshold | 0.85 (configurable) |
| Embedding model | all-MiniLM-L6-v2 (384 dims) |
| TurboQuant compression | 7.8× (1536B → 196B, d=384) |
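Given the latencies above, expected per-query latency is a weighted average over the three tiers. A small sketch (the 30%/20% hit-rate split is purely illustrative, not a measured figure):

```python
def expected_latency_ms(l1_rate, l2_rate, l1_ms=2, l2_ms=45, llm_ms=1000):
    """Weighted average of the latency tiers; rates are fractions of traffic."""
    miss_rate = 1.0 - l1_rate - l2_rate
    return l1_rate * l1_ms + l2_rate * l2_ms + miss_rate * llm_ms

# e.g. 30% L1 hits, 20% L2 hits, 50% LLM fallbacks
avg = expected_latency_ms(0.3, 0.2)
```

Even modest hit rates shift the average well below the raw LLM baseline.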
Architecture
Every query goes through two cache checks before hitting the LLM. All reads are non-blocking; all writes are async.
Services
C++ Orchestrator
Port 8080 · cpp-httplib · all components wired at construction · shared_mutex on FAISS reads
Python Sidecar
Port 8001 · FastAPI · all-MiniLM-L6-v2 · 384-dim L2-normalised · circuit breaker on C++ side
Redis L1
Port 6379 · SETEX 3600s · RDB persistence · key: lc:l1:{sha256}
FAISS L2
In-process · IVF+PQ index · metadata sidecar (.meta.json) survives restarts
All components are owned via unique_ptr and constructed in dependency order. Nothing else uses new - all cross-component communication is via references.
- TurboQuantizer is created first (if enabled) - FaissVectorStore and ValidationService both receive a raw pointer to it.
- IntelligentAdmissionPolicy receives a reference to FaissVectorStore (for novelty search).
- QueryOrchestrator receives a reference to the policy to record cache hits for adaptive threshold updates.
- Destruction order: svr_ → builder_ → everything else (FAISS persists in its destructor).
Thread safety:
- FaissVectorStore: std::shared_mutex - concurrent search() calls take the shared lock; add()/remove()/persist() take the exclusive lock.
- CacheBuilderWorker: single background thread with a std::condition_variable queue. enqueue() is the only method called from the hot path - it acquires a brief mutex, pushes, and notifies.
- EmbeddingClient: std::mutex curl_mutex_ around the persistent CURL handle.
- RedisCacheAdapter: single redisContext* with no mutex - open bug. Concurrent requests race; a connection pool or per-call mutex is needed.
- IntelligentAdmissionPolicy: two separate mutexes - freq_mutex_ and domain_mutex_ - to avoid contention between frequency updates and domain stats reads.
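The CacheBuilderWorker pattern - a single background thread draining a condition-variable queue, with a cheap non-blocking enqueue on the hot path - can be sketched in Python (names and structure are illustrative, not the C++ API):

```python
import threading
from collections import deque

class CacheBuilderWorker:
    """Background worker: enqueue() is cheap; processing happens off the hot path."""
    def __init__(self, process_fn):
        self.process_fn = process_fn
        self.queue = deque()
        self.cv = threading.Condition()
        self.stopping = False
        self.thread = threading.Thread(target=self._run, daemon=True)
        self.thread.start()

    def enqueue(self, entry):
        # Hot path: brief lock, push, notify - never blocks on processing.
        with self.cv:
            self.queue.append(entry)
            self.cv.notify()

    def _run(self):
        while True:
            with self.cv:
                while not self.queue and not self.stopping:
                    self.cv.wait()
                if self.stopping and not self.queue:
                    return
                entry = self.queue.popleft()
            # Processing (admission check, templatize, write) runs outside the lock.
            self.process_fn(entry)

    def stop(self):
        # Drains remaining entries, then joins the thread.
        with self.cv:
            self.stopping = True
            self.cv.notify()
        self.thread.join()
```

The key property mirrors the C++ design: the caller's cost is one brief mutex acquisition, regardless of how slow processing is.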
How It Works
Every request flows through a deterministic pipeline. Here is the exact sequence:
Client sends POST /query
Send query, context[] (prior turns), domain, and optionally user_id / correlation_id.
ContextBuilder computes signature CPU only
Pulls the first 3 meaningful words from the query as intent. Hashes user_id to a 16-char token (never stored raw). Sorts context turns. Combines into SHA-256(intent:domain:scope:sorted_context).
L1 Redis lookup 1–3ms
Looks up lc:l1:{sig_hash}. Hit returns instantly at confidence 1.0. Miss moves to the embedding step.
Embed query via Python sidecar 20–50ms (CPU)
Calls :8001/embed, gets back 384 floats (L2-normalised). If the sidecar is down, a circuit breaker kicks in and falls through to the LLM instead of hanging.
FAISS search + ValidationService 1–3ms
Returns top-5 candidates. Each scored: 0.60×cosine + 0.25×ctx_match + 0.15×domain_match. First candidate with score ≥ 0.85 wins. On hit: backfills L1 with the rendered response (slot placeholders filled).
LLM fallback + async cache build 500ms+
Calls OpenAI gpt-4o-mini. Response returned to client immediately. Entry enqueued to CacheBuilderWorker (background thread) - admission check, templatization, FAISS+Redis write. Client never waits for this.
Privacy by design. user_id is hashed to a 16-char token and never stored raw. Query text is never persisted - only the 384-dim embedding and a templatized response with high-entropy tokens (UUIDs, dates, numbers, proper nouns) replaced by {{SLOT_N}} placeholders.
Writes never block reads. The CacheBuilderWorker runs on a dedicated background thread with a std::condition_variable queue. The orchestrator calls enqueue() and returns the HTTP response immediately.
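The read path above can be condensed into a sketch. Stub callables stand in for Redis, FAISS search, and the LLM; function names and the simplified signature hash are illustrative, not the C++ implementation:

```python
import hashlib

def compute_signature(intent, domain, scope, context):
    # Context turns are sorted so turn order never changes the hash.
    canon = ":".join([intent, domain, scope] + sorted(context))
    return hashlib.sha256(canon.encode()).hexdigest()

def handle_query(query, context, domain, l1, l2_search, call_llm, enqueue_build,
                 threshold=0.85):
    sig = compute_signature(query.lower(), domain, "anon", context)
    # L1: exact signature match (~1-3 ms), confidence 1.0
    if sig in l1:
        return {"answer": l1[sig], "cache_hit": True, "confidence": 1.0}
    # L2: embed + FAISS + validation (~25-60 ms)
    for candidate, score in l2_search(query, context, domain):
        if score >= threshold:
            l1[sig] = candidate  # backfill L1
            return {"answer": candidate, "cache_hit": True, "confidence": score}
    # Fallback: LLM (~500-2000 ms); cache build is async - client never waits
    answer = call_llm(query)
    enqueue_build(sig, query, answer)
    return {"answer": answer, "cache_hit": False, "confidence": 0.0}
```

Note the asymmetry: reads are synchronous but cheap, while the expensive cache build is only ever enqueued.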
Step 1 - intent extraction (ContextBuilder.cpp:extractIntent()): lowercase → strip non-alphanumeric → skip 34 stopwords → take first 3 tokens → join with _.
Step 2 - user anonymisation: SHA-256("user:" + user_id)[:16] - never stores raw identity.
Step 3 - context canonicalisation: std::sort(context_turns) - reversed turn order produces the same hash.
Step 4 - signature: SHA-256(intent:domain:scope:sorted_context), stored as context_signature on each FAISS entry.
Tokenise response → for each token, classify:
- isNumeric: regex ^-?\d+(\.\d+)?%?$
- looksLikeDate: regex for YYYY-MM-DD / DD/MM/YYYY patterns
- looksLikeUUID: 8-4-4-4-12 hex pattern
- looksLikeProperNoun: starts uppercase, rest lowercase, length > 3, not in common-word blocklist
Matched tokens are replaced with {{SLOT_N}}; slot values are stored in Redis at lc:slots:{entry_id}.
On L2 serve (QueryOrchestrator.cpp): read lc:slots:{entry_id} → parse JSON array → Templatizer::render(template, slot_values) fills the placeholders back in. Falls back to the raw template if slots have expired.
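The classify-then-slot cycle can be sketched with simple regexes. This is a simplified illustration - the real classifier is C++ and its proper-noun blocklist is larger than the stand-in here:

```python
import re

NUMERIC = re.compile(r"^-?\d+(\.\d+)?%?$")
DATE = re.compile(r"^(\d{4}-\d{2}-\d{2}|\d{2}/\d{2}/\d{4})$")
UUID = re.compile(r"^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$", re.I)
COMMON = {"The", "This", "That", "Your"}  # illustrative blocklist, not the real one

def looks_like_proper_noun(tok):
    return tok[:1].isupper() and tok[1:].islower() and len(tok) > 3 and tok not in COMMON

def templatize(response):
    """Replace high-entropy tokens with {{SLOT_N}}; return (template, slot_values)."""
    out, slots = [], []
    for tok in response.split():
        if NUMERIC.match(tok) or DATE.match(tok) or UUID.match(tok) or looks_like_proper_noun(tok):
            out.append("{{SLOT_%d}}" % len(slots))
            slots.append(tok)
        else:
            out.append(tok)
    return " ".join(out), slots

def render(template, slots):
    # Serve-time path: fill placeholders back in from the stored slot values.
    for i, v in enumerate(slots):
        template = template.replace("{{SLOT_%d}}" % i, v)
    return template
```

Round-tripping templatize → render reproduces the original response, which is exactly what makes a single cached template reusable across surface variations.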
Validation Scoring
For each FAISS candidate, a composite score is computed. First one at or above 0.85 gets returned as a cache hit.
Worked examples (context and domain terms only - the cosine term depends on the embeddings):
- Same context, same domain ("How do I cancel?" → "Can I cancel?", a paraphrase): ctx 0.25×1.0 = 0.250, domain 0.15×1.0 = 0.150.
- Different context (hotel vs gym), near-perfect text match: ctx 0.25×0.0 = 0.000, domain 0.15×0.0 = 0.000.
- Related query, same context ("Cancel order" → "Refund process"): ctx 0.25×1.0 = 0.250, domain 0.15×1.0 = 0.150.
TurboQuant-corrected scoring. With ENABLE_TURBO_QUANT=1, the cosine term uses the unbiased inner-product estimator: E[TQ_ip(y, encode(x))] = ⟨y, x⟩. This is critical - with 1-bit MSE quantization alone, every cosine score would be scaled by 2/π ≈ 0.64, making 0.85 unreachable. TurboQuant_prod eliminates this bias entirely.
Two paths for cosine similarity:
- TurboQuant path: if tq_ != nullptr and candidate.tq_codes is non-empty → calls tq_->inner_product(query_embedding, tq_codes), which returns an unbiased estimate: E[result] = ⟨y,x⟩.
- Fast dot-product path: embeddings are L2-normalised by the Python sidecar, so cosine = dot product. The original sqrt(norm_a) * sqrt(norm_b) computation always evaluated to sqrt(1)×sqrt(1) = 1, so it was eliminated.
Context Signatures
This is what stops false hits. Two identical queries with different conversation histories get different signatures and never share a cache entry.
Extract Intent
"What is the cancellation policy for my booking?" → strip stopwords → cancellation_policy_booking
First 3 non-stopword tokens, lowercase, joined by underscore
Anonymise User
SHA-256("user:u_12345")[:16] → 3a7f2c91b0e4d852
Deterministic 16-char hex token - user_id is never stored raw
Canonicalize Context
["bot: hello", "user: hi"] → sorted → ["bot: hello", "user: hi"]|
Sort ensures the hash is the same regardless of turn order
Compute SHA-256
SHA-256("cancellation_policy_booking:ecommerce:3a7f2c91b0e4d852:...")
= a3f2c1d8e9f0b2c3... - this is the Redis L1 key suffix
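The four steps can be sketched end to end. The stopword list is abbreviated and the code is illustrative, not the C++ implementation:

```python
import hashlib
import re

# Abbreviated stand-in for the 34-word stopword list
STOPWORDS = {"what", "is", "the", "for", "my", "a", "an", "of", "to"}

def extract_intent(query):
    # Lowercase, strip non-alphanumeric, drop stopwords, keep first 3 tokens.
    tokens = re.sub(r"[^a-z0-9 ]", " ", query.lower()).split()
    return "_".join([t for t in tokens if t not in STOPWORDS][:3])

def anonymise(user_id):
    # Deterministic 16-char token; the raw user_id is never stored.
    return hashlib.sha256(("user:" + user_id).encode()).hexdigest()[:16]

def signature(query, domain, user_id, context):
    # Sorted context turns make the hash order-invariant.
    canon = ":".join([extract_intent(query), domain, anonymise(user_id)] + sorted(context))
    return hashlib.sha256(canon.encode()).hexdigest()
```

Two identical conversations with their turns in different order hash to the same signature, so they share one cache entry.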
Why Context Mismatch Always Misses
| Scenario | Cosine | Ctx sig | Domain | Score | Result |
|---|---|---|---|---|---|
| Same query, same context, same domain | 0.97 | 1.0 | 1.0 | 0.97 | HIT |
| Same query, different context | 0.97 | 0.0 | 1.0 | 0.73 | MISS |
| Same query, same context, different domain | 0.97 | 0.0 | 0.0 | 0.58 | MISS |
| Very similar query, same context | 0.93 | 1.0 | 1.0 | 0.96 | HIT |
Context mismatch caps score at 0.75. The context term (weight 0.25) contributes a maximum of 0.25 on match. On mismatch: max score = 0.60×1.0 + 0.0 + 0.15×1.0 = 0.75 - safely below the 0.85 threshold. The cache cannot serve a false hit across context boundaries regardless of embedding similarity.
["user: hi", "bot: hello"] and ["bot: hello", "user: hi"] produced different SHA-256 hashes → same conversation, two cache entries.Fix (ContextBuilder.cpp):
std::find over a 34-element vector for stopword lookup was also replaced with an std::unordered_set for O(1) lookup.
Admission Control
Not every LLM response is worth caching. AdmissionController prevents one-off queries from polluting the FAISS index and keeps cached entries generalisable.
Rules
| Rule | Default | Purpose |
|---|---|---|
| min_frequency | 2 | Same signature must be seen ≥ N times in the window |
| window_seconds | 300 (5 min) | Rolling time window - counter resets after inactivity |
| max_response_bytes | 32 768 | Responses > 32 KB are too specific to reuse |
Templatizer runs after admission. When admitted, Templatizer replaces high-entropy tokens (UUIDs, dates, numbers, proper nouns) with {{SLOT_N}} placeholders. At serve time, slot values are restored from Redis (lc:slots:{entry_id}). This makes cached responses reusable across minor surface variations.
Pipeline in CacheBuilderWorker::processEntry():
- Size gate (AdmissionController) - fast reject oversized/empty
- ResponseQualityFilter - hard reject conversational, session-bound, refusals, dynamic content
- CVF decision (IntelligentAdmissionPolicy)
TurboQuant Integration
TurboQuant (arXiv:2504.19874, Zandieh et al. 2025) is a data-oblivious vector quantizer that achieves near-optimal distortion with zero training time and provably unbiased inner-product estimation - the two properties LettuceCache needs most.
1536B → 196B (d=384)
data-oblivious · online
E[TQ_ip] = ⟨y,x⟩
Why LettuceCache Needs This
❌ Without TurboQuant (plain MSE)
MSE quantizers introduce a multiplicative bias of 2/π ≈ 0.637 on inner-product estimates. At 1-bit, every cosine similarity is scaled down: a true score of 0.95 reads as ≈0.61, so the 0.85 threshold is never reached.
✓ With TurboQuant_prod
The two-stage algorithm (MSE + QJL residual) produces an unbiased estimator, so the threshold remains exactly as configured.
Algorithm Walkthrough
TurboQuant_prod encodes a d-dimensional vector in two stages, each addressing a different objective: an MSE stage that minimises reconstruction distortion, then a QJL stage that encodes the residual so the inner-product estimate is unbiased.
Code Layout (d=384, 4-bit total)
Unbiased Inner Product Estimation
At inference time, the inner product between a full-precision query y and a stored compressed vector is computed asymmetrically - the query stays full-precision.
Proof sketch (Theorem 2, arXiv:2504.19874)
Since x = x̂_mse + r (decomposition by definition of the residual), the MSE stage contributes ⟨y, x̂_mse⟩ exactly and the QJL stage contributes an unbiased estimate of ⟨y, r⟩; summing gives E[TQ_ip(y, encode(x))] = ⟨y, x̂_mse⟩ + ⟨y, r⟩ = ⟨y, x⟩.
Variance: σ²(score) = 0.60² × π/(2d) ≈ 0.00073 for d=384 → σ ≈ 0.027. This is comparable to the inherent noise in the embedding model itself.
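TurboQuant's actual two-stage construction is specified in the paper; as a toy illustration of the property LettuceCache relies on - an estimator whose expectation equals the exact inner product - here is a dithered (stochastic) 1-bit quantizer that is unbiased by construction. This is not TurboQuant's algorithm, just a Monte Carlo demonstration of unbiased 1-bit inner-product estimation:

```python
import random

def stochastic_1bit(x, rng):
    """Unbiased 1-bit code: q_i in {-a, +a} chosen so that E[q_i] = x_i (needs |x_i| <= a)."""
    a = max(abs(v) for v in x)
    return [a if rng.random() < (1 + v / a) / 2 else -a for v in x]

def dot(u, v):
    return sum(p * q for p, q in zip(u, v))

rng = random.Random(42)
d = 32
x = [rng.gauss(0, 1) for _ in range(d)]
y = [rng.gauss(0, 1) for _ in range(d)]
true_ip = dot(x, y)

# Averaging the asymmetric estimate <y, q> over many fresh encodings of x
# converges to the exact <y, x> - no multiplicative bias to correct for.
trials = 20000
est = sum(dot(y, stochastic_1bit(x, rng)) for _ in range(trials)) / trials
```

A deterministic sign quantizer, by contrast, carries a systematic multiplicative bias that no amount of averaging removes - which is why a fixed 0.85 threshold only makes sense with an unbiased estimator.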
TurboQuant vs Traditional Product Quantization (PQ)
| Property | FAISS IVF+PQ | TurboQuant_prod |
|---|---|---|
| Training required | ✗ Yes - k-means on all vectors | ✓ None - data-oblivious |
| Online indexing | ✗ Must retrain as distribution shifts | ✓ Encode any vector instantly |
| Inner product bias | ✗ Uncharacterised analytically | ✓ Zero bias (proved) |
| Distortion bound | Data-dependent, no guarantee | Within 2.7× of Shannon lower bound |
| Threshold validity | ✗ Score drift at low bits | ✓ 0.85 threshold holds at 4-bit |
| Cold start quality | ✗ Poor until codebooks trained | ✓ Full quality from first vector |
Enable: ENABLE_TURBO_QUANT=1 ./build/lettucecache
TurboQuantizer is wired into FaissVectorStore::add() (auto-encodes every entry) and ValidationService::score() (uses unbiased inner product when tq_codes present). Seeds are deterministic (rotation=42, QJL=137) ensuring reproducible compression across restarts.
Bug fixed (TurboQuantizer.cpp): measured MSE of 0.233 for d=384 against the theoretical bound of 0.133.
API Reference
The orchestrator exposes a REST API on port 8080 (configurable via HTTP_PORT).
POST /query - main entry point. Checks L1 → L2 → LLM. Returns the answer with cache metadata.
Request Body
{
"query": "What is the return policy?",
"context": ["I bought a jacket last week", "It doesn't fit"],
"user_id": "u_123",
"session_id": "sess_abc",
"domain": "ecommerce",
"correlation_id": "req_xyz"
}
| Field | Type | Description |
|---|---|---|
| queryrequired | string | The user's question |
| context | string[] | Prior conversation turns. Sorted canonically before hashing - order doesn't matter. |
| user_id | string | Hashed to 16-char scope token. Never stored raw. |
| session_id | string | Passed through to logs for distributed tracing. Not used in cache logic. |
| domain | string | Domain tag (e.g. ecommerce, healthcare). Defaults to "general". |
| correlation_id | string | Echoed in structured logs. |
Response
{
"answer": "You can return items within 30 days of purchase.",
"cache_hit": true,
"confidence": 0.93,
"cache_entry_id": "lc:l1:a3f2c1d8...",
"latency_ms": 45
}
| Field | Type | Description |
|---|---|---|
| answer | string | The response text |
| cache_hit | boolean | true if served from L1 or L2 cache |
| confidence | float | Validation score 0.0-1.0. 1.0 on L1 hit (no scoring needed). 0.0 on LLM fallback. |
| cache_entry_id | string | FAISS entry ID on L2 hit; L1 key on L1 hit; empty on miss. |
| latency_ms | integer | Total server-side processing time |
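A minimal Python client for POST /query might look like this. The payload fields mirror the request table above; urllib keeps it dependency-free, and the function names are illustrative:

```python
import json
from urllib import request

def build_query_payload(query, context=None, user_id=None, domain="general",
                        correlation_id=None):
    """Assemble the /query request body; optional fields are omitted when unset."""
    payload = {"query": query, "domain": domain}
    if context:
        payload["context"] = list(context)
    if user_id:
        payload["user_id"] = user_id
    if correlation_id:
        payload["correlation_id"] = correlation_id
    return payload

def post_query(base_url, payload, timeout=10):
    req = request.Request(
        base_url.rstrip("/") + "/query",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with request.urlopen(req, timeout=timeout) as resp:
        return json.loads(resp.read())
```

Usage: `post_query("http://localhost:8080", build_query_payload("What is the return policy?", domain="ecommerce"))` returns the response object with answer, cache_hit, confidence, and latency_ms.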
GET /health - dependency health check. Returns 200 when all services are healthy, 503 when degraded. Used as a Kubernetes readiness probe.
// 200 OK - healthy
{
"status": "ok",
"redis": true,
"embedding_sidecar": true,
"faiss_entries": 1042,
"queue_depth": 3
}
// 503 Service Unavailable - degraded
{
"status": "degraded",
"redis": false,
"embedding_sidecar": true,
"faiss_entries": 1042,
"queue_depth": 0
}
GET /stats - lightweight stats snapshot. No dependency checks - always fast.
{
"faiss_entries": 1042,
"queue_depth": 3
}
DELETE /cache/{key} - evicts a specific entry from FAISS and Redis L1. Use when a cached response is stale or incorrect.
curl -X DELETE http://localhost:8080/cache/a3f2c1d8e9f0b2c3
// 200 OK
{ "deleted": true, "key": "a3f2c1d8e9f0b2c3" }
// 404 Not Found
{ "deleted": false, "key": "a3f2c1d8e9f0b2c3" }
The Python sidecar runs on port 8001 (internal; not exposed to clients directly).
POST /embed
// Request
{ "text": "What is the return policy?" }
// Response
{
"embedding": [0.023, -0.147, 0.891, "...384 floats..."],
"model": "all-MiniLM-L6-v2",
"dimension": 384
}
// Request (up to 256 texts)
{ "texts": ["query one", "query two"] }
// Response
{
"embeddings": [[...], [...]],
"model": "all-MiniLM-L6-v2",
"dimension": 384,
"count": 2
}
Configuration
All runtime configuration is via environment variables. No config files needed.
Environment Variables
| Variable | Default | Description |
|---|---|---|
| REDIS_HOST | localhost | Redis server hostname |
| REDIS_PORT | 6379 | Redis server port |
| EMBED_URL | http://localhost:8001 | Python sidecar base URL |
| EMBED_DIM | 384 | Embedding dimension - must match model |
| OPENAI_API_KEY | empty | OpenAI API key. Empty → stub response mode (useful for testing) |
| LLM_MODEL | gpt-4o-mini | OpenAI model name (e.g. gpt-4o, gpt-3.5-turbo) |
| HTTP_PORT | 8080 | Orchestrator HTTP listen port |
| FAISS_INDEX_PATH | ./faiss.index | Path for FAISS binary + .meta.json sidecar |
| ENABLE_TURBO_QUANT | unset | Set to 1 to enable TurboQuantizer (7.8× compression, unbiased cosine) |
Source-Code-Only Tuning
| Parameter | Location | Notes |
|---|---|---|
| Validation threshold (0.85) | src/api/HttpServer.cpp:47 | Lower = more hits; higher = fewer false positives |
| Scoring weights (0.60/0.25/0.15) | src/validation/ValidationService.h | Must sum to 1.0 |
| Admission min_frequency (2) | src/api/HttpServer.cpp:48 | Lower = more cache entries; higher = fewer one-offs |
| Admission window (300s) | src/api/HttpServer.cpp:48 | Rolling frequency window |
| FAISS NLIST/NPROBE/M_PQ | src/cache/FaissVectorStore.h | NLIST=100, NPROBE=10, M_PQ=8, NBITS=8 |
| TurboQuant seeds | TurboQuantizer constructor | rotation_seed=42, qjl_seed=137 (hardcoded) |
| L1 TTL (3600s) | src/builder/CacheBuilderWorker.h | Redis key TTL in seconds |
Quick Start
Get LettuceCache running locally in under 5 minutes using Docker Compose.
Clone the repository
git clone git@github.com:Ciphercrypt/LettuceCache.git
cd LettuceCache
Set your API key
export OPENAI_API_KEY=sk-...
# Or leave blank to use stub mode (returns "[LLM not configured] {query}")
Start all services
docker compose up
# Starts: Redis 7 (:6379), Python sidecar (:8001), C++ orchestrator (:8080)
# Wait for: orchestrator | [INFO] HttpServer listening on port 8080
Verify health
curl http://localhost:8080/health
{ "status": "ok", "redis": true, "embedding_sidecar": true, "faiss_entries": 0 }
First and second queries - LLM called (admission requires 2 appearances)
curl -s -X POST http://localhost:8080/query \
-H 'Content-Type: application/json' \
-d '{"query":"What is the capital of France?","domain":"geography"}' | jq .
{ "answer": "The capital of France is Paris.", "cache_hit": false, "latency_ms": 712 }
Third query - served from cache (~15× faster)
# Third call hits L2 cache (~45ms); subsequent identical calls hit L1 at ~2ms (~350× faster)
curl -s -X POST http://localhost:8080/query \
-H 'Content-Type: application/json' \
-d '{"query":"What is the capital of France?","domain":"geography"}' | jq .
{ "answer": "The capital of France is Paris.", "cache_hit": true, "confidence": 0.94, "latency_ms": 45 }
Enable TurboQuant for compressed storage
ENABLE_TURBO_QUANT=1 LLM_MODEL=gpt-4o-mini \
OPENAI_API_KEY=$OPENAI_API_KEY \
FAISS_INDEX_PATH=./my.faiss ./build/lettucecache
Reduces stored embedding size from 1536 bytes to 196 bytes (d=384) with unbiased cosine scoring.
Run the test suite:
# Unit tests (no external deps required)
cmake -B build && cmake --build build --target unit_tests
cd build && ctest --output-on-failure
# Full happy-path integration test (starts stack automatically)
bash tests/happy_path.sh
# Live demo with real LLM
export OPENAI_API_KEY=sk-...
bash tests/live_demo.sh