Caching

Caches sit between the pipeline and the recognizer. They are opt-in: build a pipeline without one and every region hits the engine.

The cache= argument on any recognizer factory accepts a string or path, a Cache, or None:

from scriva.recognize import openai
openai(model="gpt-4o", cache=".scriva_cache") # FileSystemCache
openai(model="gpt-4o", cache=Cache.layered(".scriva")) # exact + semantic
openai(model="gpt-4o", cache=Cache.redis("redis://...")) # shared workers
openai(model="gpt-4o", cache=None) # no cache (default)

When you pass a string or path, scriva creates a FileSystemCache at that location. When you want both tiers, Cache.layered(path) is the canonical default.

The exact cache is keyed by a deterministic hash of (crop_bytes, recognizer.name, recognizer.version, prompt.hash, model). A hit returns the previous Recognition verbatim; bit-identical input always hits.
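As a rough illustration of how such a key could be derived, the sketch below hashes the five fields with SHA-256. The helper name `exact_key`, the choice of SHA-256, and the length-prefix framing are assumptions made for this example, not scriva's actual implementation.

```python
import hashlib


def exact_key(crop_bytes: bytes, name: str, version: str,
              prompt_hash: str, model: str) -> str:
    """Hypothetical helper: deterministic key over the five fields."""
    h = hashlib.sha256()
    parts = (crop_bytes, name.encode(), version.encode(),
             prompt_hash.encode(), model.encode())
    for part in parts:
        # Length-prefix each field so ("ab", "c") and ("a", "bc")
        # cannot produce the same byte stream.
        h.update(len(part).to_bytes(8, "big"))
        h.update(part)
    return h.hexdigest()


crop = b"fake crop bytes"  # placeholder, not a real image
k1 = exact_key(crop, "openai", "1", "p-abc", "gpt-4o")
k2 = exact_key(crop, "openai", "1", "p-abc", "gpt-4o")
k3 = exact_key(crop, "openai", "2", "p-abc", "gpt-4o")  # version bumped
```

Identical inputs produce the same key (`k1 == k2`), while changing any single field, here the version, produces a different one (`k3`).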

Cache.fs(".scriva_cache") # FileSystemCache shorthand

Cheap, fast, lossless. Always safe.

The semantic cache is keyed by an embedding of the crop. A hit returns a previous result when the cosine similarity to a cached embedding exceeds threshold (default 0.99; deliberately high, because OCR is unforgiving and cells reading “64” and “65” can score 0.97 similarity).

from scriva.embedders import OpenAIEmbedder
Cache.vector(
    embedder=OpenAIEmbedder(model="text-embedding-3-small"),
    threshold=0.99,
    path=".scriva_cache/vec",
)

Pays for itself on documents with repeating elements (column headers, boilerplate, common labels).
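The threshold rule described above can be sketched in plain Python. This is a minimal illustration of a cosine-similarity lookup, not scriva's index; `cosine`, `semantic_lookup`, and the toy three-dimensional embeddings are all assumptions for the example.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    if na == 0 or nb == 0:
        return 0.0  # treat a zero vector as dissimilar to everything
    return dot / (na * nb)


def semantic_lookup(query, entries, threshold=0.99):
    """Return (result, score) for the best entry above threshold, else None."""
    best = None
    for embedding, result in entries:
        score = cosine(query, embedding)
        if score > threshold and (best is None or score > best[1]):
            best = (result, score)
    return best


# Toy cache: two stored embeddings with their recognised text.
entries = [([1.0, 0.0, 0.0], "Total"), ([0.0, 1.0, 0.0], "Date")]
near = semantic_lookup([0.999, 0.01, 0.0], entries)  # almost identical crop
far = semantic_lookup([0.9, 0.3, 0.0], entries)      # similarity about 0.95
```

The second query is fairly close to the "Total" entry but still falls below 0.99, so it misses and would go to the engine.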

A layered cache combines the two. The exact tier is checked first; the semantic tier is consulted only on a miss:

Cache.layered(
    ".scriva_cache",
    semantic=Cache.vector(embedder=OpenAIEmbedder(...)),
)

Cache.layered(path) with no semantic= is just an exact FileSystemCache — but the form survives later upgrades without changing the call site.
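The lookup order can be sketched as follows. This is an assumed, simplified model in which the exact tier is a dict and the semantic tier is an optional callable; `layered_get` and `fake_semantic` are hypothetical names, not part of scriva.

```python
from typing import Callable, Optional, Tuple


def layered_get(
    key: str,
    crop: bytes,
    exact: dict,
    semantic: Optional[Callable[[bytes], Optional[str]]] = None,
) -> Tuple[Optional[str], str]:
    """Check the exact tier first; consult the semantic tier only on a miss."""
    if key in exact:
        return exact[key], "exact"
    if semantic is not None:
        hit = semantic(crop)
        if hit is not None:
            return hit, "semantic"
    return None, "miss"


calls = []  # records every crop the semantic tier was asked about


def fake_semantic(crop: bytes) -> Optional[str]:
    calls.append(crop)
    return "Invoice" if crop == b"similar" else None


exact = {"k1": "Total"}
r1 = layered_get("k1", b"same", exact, fake_semantic)     # exact hit
r2 = layered_get("k2", b"similar", exact, fake_semantic)  # semantic hit
r3 = layered_get("k3", b"other", exact)                   # no semantic tier
```

After these three calls, `calls` holds only the crop from the second lookup: the exact hit never reached the semantic tier, which is the point of the ordering.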

cache= is an argument on the recognizer factory; it is also accepted as .recognize(stage, cache=...) in the fluent builder. The cache is not global: different recognizers in a fallback chain can keep their hits separate.

class Cache(Engine, Protocol):
    name: str
    async def get(self, key: CacheKey) -> CacheHit | None: ...
    async def put(self, key: CacheKey, recognition: Recognition) -> None: ...

CacheHit carries both the recognition and the provenance (which tier hit, similarity score, original key). The pipeline records this in events and on the recognition.cache slot so a UI can render hit/miss.
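To make the shape of the protocol concrete, here is a minimal in-memory sketch with the same `get`/`put` surface. The `CacheHit` dataclass below is a stand-in whose field names are assumptions based on the description above (tier, similarity, original key); keys and recognitions are plain strings rather than scriva's `CacheKey` and `Recognition`.

```python
import asyncio
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class CacheHit:
    # Stand-in for scriva's CacheHit; these field names are assumptions.
    recognition: str
    tier: str
    similarity: float
    key: str


class MemoryCache:
    """Dict-backed cache exposing the same async get/put surface."""

    name = "memory"

    def __init__(self) -> None:
        self._store: Dict[str, str] = {}

    async def get(self, key: str) -> Optional[CacheHit]:
        value = self._store.get(key)
        if value is None:
            return None
        # An exact match is reported with similarity 1.0.
        return CacheHit(recognition=value, tier="exact",
                        similarity=1.0, key=key)

    async def put(self, key: str, recognition: str) -> None:
        self._store[key] = recognition


async def demo():
    cache = MemoryCache()
    miss = await cache.get("k1")       # nothing stored yet
    await cache.put("k1", "Total: 64")
    hit = await cache.get("k1")
    return miss, hit


miss, hit = asyncio.run(demo())
```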

There is no clever invalidation. The cache key includes the recognizer name, version, and prompt hash — bumping any of those is the invalidation.

Two patterns:

  • Prompt change: set a new Prompt or bump its version. Old entries remain on disk but never match.
  • Forced refresh: pass cache_policy="bypass" to a single pipeline(doc) call to skip lookups for that run only. Writes still happen, so subsequent runs benefit.

Storage:

  • Cache.fs(...) stores one small JSON file per entry under path/. Cleanup is the caller’s job; most users let it grow forever and accept the inode cost.
  • Cache.vector(...) stores embeddings.npy + responses.json. Memory is O(entries × dim × 4 bytes); 100k entries × 1024 dim ≈ 400 MB.

What not to cache:

  • Outputs flagged Capability.HANDWRITING. Handwriting recognizers tend to be context-sensitive; caching them produces brittle results.
  • Anything where the recognized text is itself the cache key. Caches consult the image, not the text.
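The memory figure quoted for Cache.vector(...) can be checked with quick arithmetic. The sketch below assumes 4-byte float32 values, as the formula implies, and counts only the embedding matrix, not responses.json; `vector_index_bytes` is a hypothetical helper.

```python
def vector_index_bytes(entries: int, dim: int, bytes_per_value: int = 4) -> int:
    """Size of a dense entries x dim embedding matrix in bytes."""
    return entries * dim * bytes_per_value


size = vector_index_bytes(100_000, 1024)
size_mb = size / 1e6      # decimal megabytes
size_mib = size / 2**20   # binary mebibytes
print(f"{size} bytes = {size_mb:.1f} MB = {size_mib:.1f} MiB")
```

That works out to 409.6 MB (about 390.6 MiB), consistent with the ≈ 400 MB quoted. A higher-dimensional embedder scales the figure linearly.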

To add a custom cache backend, implement the protocol. Two patterns are common:

  • Redis exact cache — for shared workers. Five lines on top of redis.asyncio.
  • pgvector semantic cache — for systems that already run Postgres. Replaces the in-memory NumPy index.

Both are trivial to write against the protocol; Cache.redis(...) ships in core for the first case.
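For a sense of scale, a Redis-backed exact cache might look roughly like the sketch below. It is not the Cache.redis(...) implementation that ships in core: the class name, key prefix, and JSON serialisation are assumptions, keys and recognitions are plain strings and dicts, and `get` returns the stored payload directly rather than a CacheHit. The client is injected so the example runs against an in-memory fake instead of a live server.

```python
import asyncio
import json
from typing import Any, Optional


class RedisExactCache:
    """Exact cache over any client exposing async get/set."""

    name = "redis-exact"

    def __init__(self, client: Any, prefix: str = "scriva:") -> None:
        self._client = client
        self._prefix = prefix  # hypothetical namespace for cache keys

    async def get(self, key: str) -> Optional[dict]:
        raw = await self._client.get(self._prefix + key)
        # json.loads accepts both str and bytes, so raw bytes are fine.
        return None if raw is None else json.loads(raw)

    async def put(self, key: str, recognition: dict) -> None:
        await self._client.set(self._prefix + key, json.dumps(recognition))


class FakeClient:
    """In-memory stand-in so the sketch runs without a Redis server."""

    def __init__(self) -> None:
        self.data = {}

    async def get(self, k):
        return self.data.get(k)

    async def set(self, k, v):
        self.data[k] = v


async def demo():
    cache = RedisExactCache(FakeClient())
    before = await cache.get("k1")
    await cache.put("k1", {"text": "Total: 64"})
    after = await cache.get("k1")
    return before, after


before, after = asyncio.run(demo())
```

Against a real server, the injected client would be something like `redis.asyncio.Redis.from_url(...)`, which exposes the same awaitable `get` and `set` calls.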