Caching

Caches sit between the pipeline and the recognizer. They are opt-in: build a pipeline without one and every region hits the engine.

The shortest path

The cache= argument on any recognizer factory accepts a string or path, a Cache, or None:

from scriva.recognize import openai

openai(model="gpt-4o", cache=".scriva_cache")                  # FileSystemCache
openai(model="gpt-4o", cache=Cache.layered(".scriva"))         # exact tier; add semantic= for both
openai(model="gpt-4o", cache=Cache.redis("redis://..."))       # shared workers
openai(model="gpt-4o", cache=None)                             # no cache (default)

When you pass a string or path, scriva creates a FileSystemCache at that location. When you want both tiers, Cache.layered(path, semantic=...) is the canonical form.

Two tiers

Exact cache

Keyed by a deterministic hash of (crop_bytes, recognizer.name, recognizer.version, prompt.hash, model). A hit returns the previous Recognition verbatim. Once an entry exists, a bit-identical crop with the same recognizer, prompt, and model always hits.

Cache.fs(".scriva_cache")                  # FileSystemCache shorthand

Cheap, fast, lossless. Always safe.
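A minimal sketch of how such a key could be derived. The use of SHA-256, the field order, and the length-prefixing are assumptions made for illustration; the helper name exact_key is hypothetical and is not scriva's API.

```python
import hashlib

def exact_key(crop_bytes: bytes, name: str, version: str,
              prompt_hash: str, model: str) -> str:
    """Hash the five key components into one hex digest (illustrative only)."""
    h = hashlib.sha256()
    parts = (crop_bytes, name.encode(), version.encode(),
             prompt_hash.encode(), model.encode())
    for part in parts:
        # Length-prefix each field so ("ab", "c") and ("a", "bc") cannot collide.
        h.update(len(part).to_bytes(8, "big"))
        h.update(part)
    return h.hexdigest()

crop = b"\x89PNG..."  # placeholder bytes standing in for a real crop
k1 = exact_key(crop, "openai", "1", "abc123", "gpt-4o")
k2 = exact_key(crop, "openai", "1", "abc123", "gpt-4o")  # same inputs
k3 = exact_key(crop, "openai", "2", "abc123", "gpt-4o")  # version bumped
```

Here k1 and k2 are equal, while k3 differs because one component changed. That difference is the whole invalidation story for the exact tier.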

Semantic cache

Keyed by an embedding of the crop. A hit returns a previous result when the cosine similarity to a cached embedding exceeds threshold. The default is 0.99, which is deliberately high: OCR is unforgiving, and crops of cells reading “64” and “65” can score around 0.97.

from scriva.embedders import OpenAIEmbedder

Cache.vector(
    embedder=OpenAIEmbedder(model="text-embedding-3-small"),
    threshold=0.99,
    path=".scriva_cache/vec",
)

Pays for itself on documents with repeating elements (column headers, boilerplate, common labels).
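The threshold check can be sketched in a few lines of plain Python. This is a toy linear scan over two-dimensional vectors, not the library's index; the function names and the (embedding, response) entry layout are assumptions for illustration.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def semantic_lookup(query, entries, threshold=0.99):
    """Return (response, similarity) for the best entry above threshold, else None."""
    best = None
    for embedding, response in entries:
        score = cosine(query, embedding)
        # "Exceeds" is read strictly here: a score equal to threshold is a miss.
        if score > threshold and (best is None or score > best[1]):
            best = (response, score)
    return best

entries = [([1.0, 0.0], "64"), ([0.0, 1.0], "Total")]
hit = semantic_lookup([1.0, 0.001], entries)   # nearly identical to the first entry
miss = semantic_lookup([1.0, 0.3], entries)    # similar, but below 0.99
```

The second query scores roughly 0.96 against its nearest entry, so it falls through to the engine. That is the behaviour a high default threshold is meant to give.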

Layered cache

Combine the two. Exact is checked first; semantic only on miss:

Cache.layered(
    ".scriva_cache",
    semantic=Cache.vector(embedder=OpenAIEmbedder(...)),
)

Cache.layered(path) with no semantic= is just an exact FileSystemCache, but starting from this form means that adding a semantic tier later is a one-argument change at the call site.
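The lookup order can be illustrated with two toy tiers. Everything below is a stand-in: DictTier and Layered are hypothetical classes, the semantic tier is keyed by the same string rather than an embedding, and writing to both tiers on put is an assumption.

```python
import asyncio

class DictTier:
    """Toy tier: a dict behind async get/put, counting lookups."""
    def __init__(self, name):
        self.name = name
        self.store = {}
        self.lookups = 0

    async def get(self, key):
        self.lookups += 1
        return self.store.get(key)

    async def put(self, key, value):
        self.store[key] = value

class Layered:
    """Check the exact tier first; consult the semantic tier only on a miss."""
    def __init__(self, exact, semantic=None):
        self.exact = exact
        self.semantic = semantic

    async def get(self, key):
        hit = await self.exact.get(key)
        if hit is not None or self.semantic is None:
            return hit
        return await self.semantic.get(key)

    async def put(self, key, value):
        await self.exact.put(key, value)
        if self.semantic is not None:
            await self.semantic.put(key, value)

async def demo():
    exact, semantic = DictTier("exact"), DictTier("semantic")
    cache = Layered(exact, semantic)
    await cache.put("k", "Invoice")
    first = await cache.get("k")       # exact hit; semantic tier not consulted
    second = await cache.get("other")  # exact miss; semantic tier consulted
    return first, second, semantic.lookups

first, second, semantic_lookups = asyncio.run(demo())
```

After the two lookups the semantic tier has been queried exactly once, which is the cost saving the ordering is there to provide.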

Where cache lives

cache= is an argument on the recognizer factory; it is also accepted as .recognize(stage, cache=...) in the fluent builder. The cache is not global, so different recognizers in a fallback chain can keep their hits separate.

Protocol

class Cache(Engine, Protocol):
    name: str
    async def get(self, key: CacheKey) -> CacheHit | None: ...
    async def put(self, key: CacheKey, recognition: Recognition) -> None: ...

CacheHit carries both the recognition and the provenance (which tier hit, similarity score, original key). The pipeline records this in events and on the recognition.cache slot so a UI can render hit/miss.
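A minimal in-memory implementation of the get/put shape, to show what a hit with provenance might look like. CacheKey, Recognition, and CacheHit are stand-in dataclasses defined here so the sketch runs on its own; their field names are hypothetical and may not match scriva's real types.

```python
import asyncio
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class CacheKey:        # stand-in type, hashable so it can key a dict
    digest: str

@dataclass
class Recognition:     # stand-in type
    text: str

@dataclass
class CacheHit:        # stand-in type; field names are assumptions
    recognition: Recognition
    tier: str
    similarity: float
    original_key: CacheKey

class MemoryCache:
    """Exact, process-local cache satisfying the get/put signatures."""
    name = "memory"

    def __init__(self) -> None:
        self._entries: dict[CacheKey, Recognition] = {}

    async def get(self, key: CacheKey) -> Optional[CacheHit]:
        recognition = self._entries.get(key)
        if recognition is None:
            return None
        # An exact hit is reported with similarity 1.0 and the key it matched.
        return CacheHit(recognition, tier="exact", similarity=1.0, original_key=key)

    async def put(self, key: CacheKey, recognition: Recognition) -> None:
        self._entries[key] = recognition

async def demo():
    cache = MemoryCache()
    key = CacheKey("abc")
    before = await cache.get(key)
    await cache.put(key, Recognition("Total"))
    after = await cache.get(key)
    return before, after

before, after = asyncio.run(demo())
```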

Invalidation

There is no clever invalidation. The cache key includes the recognizer name, recognizer version, prompt hash, and model, so changing any of those is the invalidation.

Two patterns:

Prompt change — set a new Prompt or bump its version. Old entries remain on disk but never match.
Forced refresh — pass cache_policy="bypass" to a single pipeline(doc) call to skip lookups for that run only. Writes still happen, so subsequent runs benefit.
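The bypass semantics (skip the lookup, still write) can be sketched with a plain dict. The recognize helper and the "default" policy name are hypothetical; only cache_policy="bypass" comes from the text above.

```python
def recognize(store: dict, key: str, engine, cache_policy: str = "default") -> str:
    """Look up key unless bypassing; always write the fresh result back."""
    if cache_policy != "bypass" and key in store:
        return store[key]
    result = engine()
    store[key] = result  # the write happens even under bypass
    return result

calls = []

def engine() -> str:
    """Fake engine that returns a new value on every call."""
    calls.append(1)
    return f"result-{len(calls)}"

store = {}
a = recognize(store, "k", engine)                         # miss: engine runs
b = recognize(store, "k", engine)                         # hit: engine skipped
c = recognize(store, "k", engine, cache_policy="bypass")  # lookup skipped, entry overwritten
d = recognize(store, "k", engine)                         # hit on the refreshed entry
```

The engine runs twice in total: once for the initial miss and once for the bypassed call. The final lookup then sees the refreshed value.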

Sizing

Cache.fs(...) stores one small JSON per entry under path/. Cleanup is the caller’s job — most users let it grow forever and accept the inode cost.
Cache.vector(...) stores embeddings.npy + responses.json. Memory is O(entries × dim × 4 bytes) for 4-byte floats; 100k entries × 1024 dim ≈ 410 MB.
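The sizing estimate as arithmetic, assuming 4-byte floats and counting only the embedding matrix (not responses.json). The helper name is hypothetical.

```python
def vector_index_bytes(entries: int, dim: int, bytes_per_float: int = 4) -> int:
    """Size of a dense embedding matrix in bytes."""
    return entries * dim * bytes_per_float

size = vector_index_bytes(100_000, 1024)
size_mb = size / 1e6       # decimal megabytes: 409.6
size_mib = size / 2**20    # binary mebibytes: 390.625
```

Embedding dimension scales the figure linearly, so a 1536-dimensional embedder at the same entry count needs about 614 MB.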

What not to cache

Outputs flagged Capability.HANDWRITING. Handwriting recognizers tend to be context-sensitive; caching them produces brittle results.
Anything where the recognized text is itself the cache key. Caches consult the image, not the text.

Custom backends

Implement the protocol. Two patterns are common:

Redis exact cache — for shared workers. Five lines on top of redis.asyncio.
pgvector semantic cache — for systems that already run Postgres. Replaces the in-memory NumPy index.

Both are trivial to write against the protocol; Cache.redis(...) ships in core for the first case.
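A rough sketch of the Redis exact-cache pattern. To stay runnable without a server, the class takes any client exposing async get and set; a client from redis.asyncio (for example one built with Redis.from_url) has that shape. The class name, the "scriva:" key prefix, the plain string keys, and the JSON-encoded dict payload are all assumptions, and a real backend would return a CacheHit rather than a bare dict.

```python
import asyncio
import json

class RedisExactCache:
    """Exact cache over any client with async get(key) and set(key, value)."""
    name = "redis-exact"

    def __init__(self, client, prefix: str = "scriva:") -> None:
        self.client = client
        self.prefix = prefix  # hypothetical namespace for cache keys

    async def get(self, key: str):
        raw = await self.client.get(self.prefix + key)
        # json.loads accepts str or bytes, so either client decoding mode works.
        return None if raw is None else json.loads(raw)

    async def put(self, key: str, recognition: dict) -> None:
        await self.client.set(self.prefix + key, json.dumps(recognition))

class FakeClient:
    """In-memory stand-in for a Redis client, used only for this demo."""
    def __init__(self) -> None:
        self.data = {}

    async def get(self, key):
        return self.data.get(key)

    async def set(self, key, value):
        self.data[key] = value

async def demo():
    cache = RedisExactCache(FakeClient())
    miss = await cache.get("abc")
    await cache.put("abc", {"text": "Total"})
    hit = await cache.get("abc")
    return miss, hit

miss, hit = asyncio.run(demo())
```

Injecting the client keeps the cache testable and leaves connection handling (pooling, timeouts, TLS) to the caller.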