Caching
Caches sit between the pipeline and the recogniser. They are opt-in: build a pipeline without one and every region hits the engine.
The shortest path
The cache= argument on any recognizer factory accepts a string, a Cache, or None:

    from scriva.recognize import openai

    openai(model="gpt-4o", cache=".scriva_cache")             # FileSystemCache
    openai(model="gpt-4o", cache=Cache.layered(".scriva"))    # exact + semantic
    openai(model="gpt-4o", cache=Cache.redis("redis://..."))  # shared workers
    openai(model="gpt-4o", cache=None)                        # no cache (default)

When you pass a string or path, scriva creates a FileSystemCache at that location. When you want both tiers, Cache.layered(path) is the canonical default.
Two tiers
Exact cache
Keyed by a deterministic hash of (crop_bytes, recognizer.name, recognizer.version, prompt.hash, model). A hit returns the previous Recognition verbatim; bit-identical input always hits.

    Cache.fs(".scriva_cache")  # FileSystemCache shorthand

Cheap, fast, lossless. Always safe.
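A minimal sketch of how a key like this could be derived, assuming SHA-256 and length-prefixed fields. The function exact_key and its field encoding are illustrative only, not scriva's actual implementation.

```python
import hashlib


def exact_key(crop_bytes: bytes, recognizer_name: str, recognizer_version: str,
              prompt_hash: str, model: str) -> str:
    """Hypothetical key function: hash the crop plus every field that affects output."""
    h = hashlib.sha256()
    parts = (
        crop_bytes,
        recognizer_name.encode(),
        recognizer_version.encode(),
        prompt_hash.encode(),
        model.encode(),
    )
    for part in parts:
        # Length-prefix each field so ("ab", "c") and ("a", "bc") hash differently.
        h.update(len(part).to_bytes(8, "big"))
        h.update(part)
    return h.hexdigest()


crop = b"fake crop bytes"
k1 = exact_key(crop, "openai", "1", "p-abc", "gpt-4o")
k2 = exact_key(crop, "openai", "1", "p-abc", "gpt-4o")
k3 = exact_key(crop, "openai", "2", "p-abc", "gpt-4o")

print(k1 == k2)  # True: bit-identical input always produces the same key
print(k1 == k3)  # False: a version bump produces a different key
```

Because every field feeds the digest, changing any one of them is enough to make old entries unreachable.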
Semantic cache
Keyed by an embedding of the crop. A hit returns a previous result when the cosine similarity to a cached embedding exceeds threshold (default 0.99, which is very high because OCR is unforgiving and “64” vs “65” cells can look 0.97-similar).

    from scriva.embedders import OpenAIEmbedder

    Cache.vector(
        embedder=OpenAIEmbedder(model="text-embedding-3-small"),
        threshold=0.99,
        path=".scriva_cache/vec",
    )

Pays for itself on documents with repeating elements (column headers, boilerplate, common labels).
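To make the threshold concrete, here is a small pure-Python sketch of a cosine-similarity lookup. The helper names (cosine, semantic_lookup) and the toy two-dimensional vectors are assumptions for illustration; a real semantic tier would use embedder output and a vectorised index.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def semantic_lookup(query, entries, threshold=0.99):
    """entries is a list of (embedding, cached_result) pairs.

    Returns (result, score) for the best entry whose similarity exceeds
    the threshold, or None when nothing is close enough.
    """
    best = None
    for embedding, result in entries:
        score = cosine(query, embedding)
        if score > threshold and (best is None or score > best[1]):
            best = (result, score)
    return best


entries = [([1.0, 0.0], "64")]

near_duplicate = semantic_lookup([1.0, 0.01], entries)  # similarity just under 1.0
look_alike = semantic_lookup([1.0, 0.25], entries)      # similarity about 0.97

print(near_duplicate is not None)  # True
print(look_alike)                  # None
```

The second query is the kind of 0.97-similar look-alike the default threshold is meant to reject.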
Layered cache
Combine the two. Exact is checked first; semantic only on miss:

    Cache.layered(
        ".scriva_cache",
        semantic=Cache.vector(embedder=OpenAIEmbedder(...)),
    )

Cache.layered(path) with no semantic= is just an exact FileSystemCache, but the form survives later upgrades without changing the call site.
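The lookup order can be sketched with two in-memory tiers. DictCache and Layered are hypothetical stand-ins, and the semantic tier here is a plain dict keyed by string, purely to show the ordering; a real semantic tier compares embeddings.

```python
import asyncio
from typing import Optional


class DictCache:
    """Stand-in tier: an in-memory dict that counts how often it is consulted."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.store = {}
        self.lookups = 0

    async def get(self, key: str) -> Optional[str]:
        self.lookups += 1
        return self.store.get(key)

    async def put(self, key: str, value: str) -> None:
        self.store[key] = value


class Layered:
    """Hypothetical layering: exact first, semantic only on an exact miss."""

    def __init__(self, exact, semantic=None) -> None:
        self.exact = exact
        self.semantic = semantic

    async def get(self, key: str):
        hit = await self.exact.get(key)
        if hit is not None:
            return ("exact", hit)
        if self.semantic is not None:
            hit = await self.semantic.get(key)
            if hit is not None:
                return ("semantic", hit)
        return None


async def demo():
    exact, semantic = DictCache("exact"), DictCache("semantic")
    cache = Layered(exact, semantic)
    await exact.put("k1", "Total")
    await semantic.put("k2", "Date")
    first = await cache.get("k1")   # exact hit; semantic tier is not consulted
    second = await cache.get("k2")  # exact miss; falls through to semantic
    third = await cache.get("k3")   # miss in both tiers
    return first, second, third, semantic.lookups


result = asyncio.run(demo())
print(result)  # (('exact', 'Total'), ('semantic', 'Date'), None, 2)
```

The semantic tier is consulted only twice across three lookups, which is the point of checking the cheap tier first.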
Where cache lives
cache= is an argument on the recognizer factory; it is also accepted as .recognize(stage, cache=...) in the fluent builder. The cache is not a global: different recognisers in a fallback chain can keep their hits separate.
Protocol
    class Cache(Engine, Protocol):
        name: str

        async def get(self, key: CacheKey) -> CacheHit | None: ...
        async def put(self, key: CacheKey, recognition: Recognition) -> None: ...

CacheHit carries both the recognition and the provenance (which tier hit, similarity score, original key). The pipeline records this in events and on the recognition.cache slot so a UI can render hit/miss.
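As an illustration of the get/put shape, here is a minimal in-memory cache. CacheKey, Recognition, and CacheHit below are simplified stand-ins whose field names are guesses based on the description above, and whatever the Engine base requires is omitted; scriva's real types will differ.

```python
import asyncio
from dataclasses import dataclass
from typing import Dict, Optional


# Stand-ins for scriva's own types; field names are assumptions.
@dataclass(frozen=True)
class CacheKey:
    digest: str


@dataclass
class Recognition:
    text: str


@dataclass
class CacheHit:
    recognition: Recognition
    tier: str               # which tier produced the hit
    similarity: float       # 1.0 for an exact match
    original_key: CacheKey  # key under which the entry was stored


class MemoryCache:
    """Hypothetical exact cache backed by a dict."""

    name = "memory"

    def __init__(self) -> None:
        self._entries: Dict[CacheKey, Recognition] = {}

    async def get(self, key: CacheKey) -> Optional[CacheHit]:
        recognition = self._entries.get(key)
        if recognition is None:
            return None
        return CacheHit(recognition, tier="exact", similarity=1.0, original_key=key)

    async def put(self, key: CacheKey, recognition: Recognition) -> None:
        self._entries[key] = recognition


async def demo():
    cache = MemoryCache()
    key = CacheKey("abc123")
    before = await cache.get(key)  # nothing stored yet
    await cache.put(key, Recognition(text="Invoice No."))
    after = await cache.get(key)   # now a hit with provenance attached
    return before, after


before, after = asyncio.run(demo())
print(before)                  # None
print(after.recognition.text)  # Invoice No.
```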
Invalidation
There is no clever invalidation. The cache key includes the recognizer name, version, and prompt hash; bumping any of those is the invalidation.
Two patterns:
- Prompt change: set a new Prompt or bump its version. Old entries remain on disk but never match.
- Forced refresh: pass cache_policy="bypass" to a single pipeline(doc) call to skip lookups for that run only. Writes still happen, so subsequent runs benefit.
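Both patterns can be sketched against a plain dict. The recognize wrapper, the tuple keys, and fake_engine are hypothetical; only the "bypass" value comes from the text above, and the real pipeline's internals are not shown here.

```python
def recognize(key, cache, engine, cache_policy=None):
    """Hypothetical lookup wrapper: cache is a dict, engine is a callable."""
    if cache_policy != "bypass" and key in cache:
        return cache[key], "hit"
    text = engine(key)
    cache[key] = text  # writes still happen under "bypass"
    return text, "miss"


engine_calls = []


def fake_engine(key):
    """Stand-in recognizer that records every call it receives."""
    engine_calls.append(key)
    return "recognized:" + key[0]


key_v1 = ("crop-a", "prompt-v1")
key_v2 = ("crop-a", "prompt-v2")  # same crop, bumped prompt version
cache = {}

statuses = [
    recognize(key_v1, cache, fake_engine)[1],                         # first run
    recognize(key_v1, cache, fake_engine)[1],                         # cached
    recognize(key_v1, cache, fake_engine, cache_policy="bypass")[1],  # forced refresh
    recognize(key_v1, cache, fake_engine)[1],                         # benefits from the refresh
    recognize(key_v2, cache, fake_engine)[1],                         # new key after the bump
]

print(statuses)           # ['miss', 'hit', 'miss', 'hit', 'miss']
print(len(engine_calls))  # 3
print(key_v1 in cache)    # True: the old entry remains but no longer matches new keys
```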
Sizing
- Cache.fs(...) stores one small JSON per entry under path/. Cleanup is the caller’s job; most users let it grow forever and accept the inode cost.
- Cache.vector(...) stores embeddings.npy + responses.json. Memory is O(entries × dim × 4 bytes); 100k entries × 1024 dim ≈ 400 MB.
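The memory estimate is simple arithmetic, and cleanup can be a few lines of standard library code. The sketch below assumes a flat directory of *.json entries and pruning by modification time; both helper names are made up, and the actual on-disk layout may differ.

```python
import os
import tempfile
import time
from pathlib import Path


def vector_index_bytes(entries: int, dim: int, bytes_per_value: int = 4) -> int:
    """Size of a dense float32 embedding matrix: entries x dim x 4 bytes."""
    return entries * dim * bytes_per_value


def prune_older_than(cache_dir: str, max_age_days: float) -> int:
    """Delete *.json files not modified within max_age_days; return the count removed."""
    cutoff = time.time() - max_age_days * 86400
    removed = 0
    for entry in Path(cache_dir).glob("*.json"):
        if entry.stat().st_mtime < cutoff:
            entry.unlink()
            removed += 1
    return removed


print(vector_index_bytes(100_000, 1024))  # 409600000, i.e. roughly 400 MB

# Demonstrate pruning in a throwaway directory.
with tempfile.TemporaryDirectory() as d:
    old, new = Path(d, "old.json"), Path(d, "new.json")
    old.write_text("{}")
    new.write_text("{}")
    stale = time.time() - 40 * 86400
    os.utime(old, (stale, stale))  # backdate one entry by 40 days
    removed = prune_older_than(d, max_age_days=30)
    remaining = sorted(p.name for p in Path(d).iterdir())

print(removed, remaining)  # 1 ['new.json']
```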
What not to cache
- Outputs flagged Capability.HANDWRITING. Handwriting recognisers tend to be context-sensitive; caching them produces brittle results.
- Anything where the recognised text is itself the cache key. Caches consult the image, not the text.
Custom backends
Implement the protocol. Two patterns are common:
- Redis exact cache: for shared workers. Five lines on top of redis.asyncio.
- pgvector semantic cache: for systems that already run Postgres. Replaces the in-memory NumPy index.
Both are trivial to write against the protocol; Cache.redis(...) ships in
core for the first case.
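For a sense of scale, here is a rough sketch of the Redis pattern. RedisExactCache is hypothetical and is not the Cache.redis(...) that ships in core; it takes any client exposing async get/set, which is the shape of the redis.asyncio client. Keys are plain strings and recognitions plain dicts here, standing in for CacheKey and Recognition, and FakeClient exists only so the example runs without a server.

```python
import asyncio
import json
from typing import Optional


class RedisExactCache:
    """Hypothetical exact cache over any client with async get/set methods.

    In production the client could come from redis.asyncio.from_url(url).
    """

    name = "redis-exact"

    def __init__(self, client, prefix: str = "scriva:") -> None:
        self._client = client
        self._prefix = prefix  # namespace keys so workers can share one database

    async def get(self, key: str) -> Optional[dict]:
        raw = await self._client.get(self._prefix + key)
        return None if raw is None else json.loads(raw)

    async def put(self, key: str, recognition: dict) -> None:
        await self._client.set(self._prefix + key, json.dumps(recognition))


class FakeClient:
    """In-memory stand-in for a Redis client, used only for this demo."""

    def __init__(self) -> None:
        self.data = {}

    async def get(self, key):
        return self.data.get(key)

    async def set(self, key, value):
        self.data[key] = value


async def demo():
    cache = RedisExactCache(FakeClient())
    miss = await cache.get("abc123")
    await cache.put("abc123", {"text": "Subtotal"})
    hit = await cache.get("abc123")
    return miss, hit


miss, hit = asyncio.run(demo())
print(miss)  # None
print(hit)   # {'text': 'Subtotal'}
```

Taking the client as a constructor argument keeps the cache testable and leaves connection handling to the caller.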