Architecture

┌──────────────────────────────────────────────────────────────────┐
│ domains/     forms.py  pid.py  agentic.py                        │ ← opinionated, ready-to-run pipelines
├──────────────────────────────────────────────────────────────────┤
│ recognize/  detect/  classify/  postprocess/  export/            │ ← concrete adapters (factories)
│ recognizers/  detectors/  ...                                    │ ← same adapters as classes
│ cache/  prompts/                                                 │
├──────────────────────────────────────────────────────────────────┤
│ pipeline.py  types.py  events.py  errors.py                      │ ← core protocols + orchestrator
│ __init__.py: read(), Pipeline, decorators                        │ ← top-level surface
└──────────────────────────────────────────────────────────────────┘

Each upper layer depends only on the layer below it. The core layer has no dependency on any OCR engine, image library, or storage backend — those live in the adapter layer behind optional extras.
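
The dependency rule can be pictured with a protocol in the core and a lazily imported engine in the adapter. This is a minimal sketch, not scriva's code: Recognizer, EchoRecognizer, make_engine_recognizer, read_text and the module some_ocr_engine are all hypothetical names chosen for illustration.

```python
from typing import Protocol


class Recognizer(Protocol):
    """Core-layer protocol: no engine or image library is imported here."""

    def recognize(self, image: bytes) -> str: ...


class EchoRecognizer:
    """Stand-in adapter that needs no third-party package."""

    def recognize(self, image: bytes) -> str:
        return image.decode("utf-8", errors="replace")


def make_engine_recognizer() -> Recognizer:
    """Adapter factory: the heavy import happens only when the adapter is built."""
    try:
        import some_ocr_engine  # hypothetical optional dependency
    except ImportError as exc:
        raise RuntimeError("install the matching optional extra to use this adapter") from exc
    return some_ocr_engine.Recognizer()


def read_text(recognizer: Recognizer, image: bytes) -> str:
    """Core code depends only on the protocol, never on a concrete adapter."""
    return recognizer.recognize(image)


print(read_text(EchoRecognizer(), b"hello"))
```

Because the import sits inside the factory, importing the core never fails when an extra is missing; only building that one adapter does.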

                      ┌────────────────┐
    Document.load ──► │    Document    │
                      └───────┬────────┘
                              │ pages()
                      ┌────────────────┐
                      │    Page (n)    │
                      └───────┬────────┘
          ┌───────────────────┴────────────────────┐
          ▼                                        │
  ┌───────────────┐                                │
  │ Preprocessor  │ deskew, denoise, normalise     │
  └───────┬───────┘                                │
          ▼                                        │
  ┌───────────────┐                                │
  │LayoutDetector │ → Layout(regions, relations)   │
  └───────┬───────┘                                │
          ▼                                        │
  ┌───────────────┐                                │
  │  Classifier   │ → marks roles (blank, header)  │
  └───────┬───────┘                                │
          ▼                                        │
  ┌───────────────────┐                            │
  │ RegionPreprocessor│ per-cell binarise, slice,  │
  │                   │ pad (sets crop_override)   │
  └───────┬───────────┘                            │
          ▼                                        │
  ┌───────────────┐                                │
  │  Recognizer   │ ←── Cache (exact + semantic)   │
  │               │ ←── RecognitionHint (few-shot  │
  │               │     from SampleStore)          │
  └───────┬───────┘                                │
          ▼                                        │
  ┌───────────────┐                                │
  │PostProcessor* │ chain: dict, split, score…     │
  └───────┬───────┘                                │
          ▼                                        │
  ┌───────────────┐                                │
  │  PageResult   │                                │
  └───────┬───────┘                                │
          └─────────────────► aggregated ─────────►│
                                           ┌────────────────┐
                                           │ DocumentResult │ ← knows how to .to_excel(), .to_json(), …
                                           └───────┬────────┘
                                           ┌────────────────┐
                                           │   Exporter*    │ ← optional in-run writes
                                           └────────────────┘

Every arrow is a Stage. Every box is replaceable. The two *-marked rows are chains: you can stack as many post-processors and exporters as you like.
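
A chain of stages can be sketched as objects that share one async run method and are applied in order. The Context fields, the Strip and Upper stages, and run_chain below are hypothetical stand-ins, not scriva's types; only the run(ctx) -> ctx shape is taken from this page.

```python
import asyncio
from dataclasses import dataclass, field
from typing import Protocol


@dataclass
class Context:
    """Hypothetical stand-in for the state passed from stage to stage."""
    text: str = ""
    trace: list[str] = field(default_factory=list)


class Stage(Protocol):
    async def run(self, ctx: Context) -> Context: ...


class Strip:
    async def run(self, ctx: Context) -> Context:
        ctx.text = ctx.text.strip()
        ctx.trace.append("strip")
        return ctx


class Upper:
    async def run(self, ctx: Context) -> Context:
        ctx.text = ctx.text.upper()
        ctx.trace.append("upper")
        return ctx


async def run_chain(stages: list[Stage], ctx: Context) -> Context:
    """Run stages in order; each receives the context the previous one returned."""
    for stage in stages:
        ctx = await stage.run(ctx)
    return ctx


result = asyncio.run(run_chain([Strip(), Upper()], Context(text="  invoice no. 7 ")))
print(result.text, result.trace)
```

Stacking another post-processor is then a matter of appending one more object to the list.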

  • The pipeline is internally async. pipeline(doc) is a sync entry point that wraps the coroutine; await pipeline.aio(doc) is the async one.
  • Calling pipeline(doc) from within a running event loop raises an error whose message points you to pipeline.aio. We don’t try to be clever.
  • Recognizers expose max_concurrency; the library enforces it with an asyncio.Semaphore. You never call the recogniser directly — you submit regions and the recogniser batches.
  • Per-page work runs sequentially by default; pass page_concurrency=N to parallelise across pages. Per-region work inside a recogniser is always parallel up to its budget.
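
The sync/async split and the concurrency budget described above can be sketched in a few lines. MiniPipeline is a hypothetical toy, not scriva's Pipeline: it assumes the sync entry point detects a running loop with asyncio.get_running_loop() and raises RuntimeError, and that the budget is an asyncio.Semaphore created inside the coroutine.

```python
import asyncio


class MiniPipeline:
    """Hypothetical sketch; not scriva's Pipeline."""

    def __init__(self, max_concurrency: int = 2) -> None:
        self.max_concurrency = max_concurrency
        self.peak = 0  # highest number of regions in flight at once
        self._active = 0

    async def _recognize(self, sem: asyncio.Semaphore, region: str) -> str:
        async with sem:  # at most max_concurrency regions pass this point
            self._active += 1
            self.peak = max(self.peak, self._active)
            await asyncio.sleep(0.01)  # stands in for the engine call
            self._active -= 1
            return region.upper()

    async def aio(self, regions: list[str]) -> list[str]:
        sem = asyncio.Semaphore(self.max_concurrency)
        return await asyncio.gather(*(self._recognize(sem, r) for r in regions))

    def __call__(self, regions: list[str]) -> list[str]:
        try:
            asyncio.get_running_loop()
        except RuntimeError:
            # No loop is running, so blocking on the coroutine is safe.
            return asyncio.run(self.aio(regions))
        raise RuntimeError("event loop already running; use 'await pipeline.aio(...)'")


pipeline = MiniPipeline(max_concurrency=2)
print(pipeline(["a", "b", "c", "d", "e"]), "peak in flight:", pipeline.peak)
```

All five regions are submitted at once, yet the semaphore keeps no more than two engine calls in flight.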

Context.cancelled is an asyncio.Event. Long-running stages must check it at yield points; the standard adapters do. The pipeline exposes pipeline.cancel() and pipeline.cancelled for callers.

A custom stage that iterates regions should make the check explicit:

import asyncio

async def run(self, ctx: Context) -> Context:
    for region in ctx.layout.regions:
        # Checkpoint: stop before starting the next region once cancel is requested.
        if ctx.cancelled.is_set():
            raise asyncio.CancelledError
        await self._process(region)
    return ctx

A stage that ignores ctx.cancelled will simply run to completion after pipeline.cancel() is called — it will not deadlock, but it will not honour the cancel either. Wrap any blocking subprocess or HTTP call in asyncio.wait_for(..., timeout=...) if you cannot insert your own checkpoint.
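
For a blocking call with no room for a checkpoint, the timeout wrapper might look like the sketch below. slow_engine_call and call_with_deadline are hypothetical names; the sketch assumes the blocking work is pushed to a worker thread with asyncio.to_thread (Python 3.9+).

```python
import asyncio
import time


def slow_engine_call(seconds: float) -> str:
    """Stands in for a blocking subprocess or HTTP call (hypothetical)."""
    time.sleep(seconds)
    return "done"


async def call_with_deadline(seconds: float, timeout: float) -> str:
    """Run the blocking call in a worker thread and stop waiting after `timeout`."""
    try:
        return await asyncio.wait_for(
            asyncio.to_thread(slow_engine_call, seconds), timeout=timeout
        )
    except asyncio.TimeoutError:
        # The worker thread itself is not interrupted; only the await is abandoned.
        return "timed out"


print(asyncio.run(call_with_deadline(0.01, timeout=1.0)))
print(asyncio.run(call_with_deadline(0.3, timeout=0.05)))
```

The timeout bounds how long the stage waits, not how long the underlying work runs, so a real subprocess still needs its own kill path.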

Three error classes, all under scriva.errors:

  • ConfigurationError — raised at construction time. Wrong wiring, missing capabilities, malformed config. Never raised mid-run.
  • EngineError — raised by an adapter when its upstream (an HTTP API, a binary on $PATH, a model file) failed. Carries engine.name and the original exception.
  • RecognitionError — raised when a region was attempted but no recogniser could produce text. The pipeline’s error_policy decides whether this aborts the page or attaches an empty Recognition with error set.

Stages never raise raw exceptions to the caller — they wrap, attach context, and emit a structured error event before re-raising.
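
The wrap, emit, re-raise pattern can be sketched as follows. This is an illustration under assumptions, not scriva.errors itself: the shared base class ScrivaError, the constructor signature of EngineError, the helper call_engine, and the plain-dict event are all hypothetical.

```python
class ScrivaError(Exception):
    """Hypothetical shared base; the text names only the three classes below."""


class ConfigurationError(ScrivaError):
    """Wrong wiring or malformed config, detected at construction time."""


class EngineError(ScrivaError):
    """An adapter's upstream failed; keeps the engine name and the original exception."""

    def __init__(self, engine_name: str, original: Exception) -> None:
        super().__init__(f"{engine_name}: {original}")
        self.engine_name = engine_name
        self.original = original


class RecognitionError(ScrivaError):
    """A region was attempted but produced no text."""


def call_engine(engine_name, fn, events):
    """Wrap any raw exception, record an error event, then raise the wrapped error."""
    try:
        return fn()
    except Exception as exc:
        wrapped = EngineError(engine_name, exc)
        events.append({
            "stage": "recognize",
            "kind": "error",
            "payload": {"error": type(wrapped).__name__, "message": str(wrapped)},
        })
        raise wrapped from exc


def failing_engine():
    raise TimeoutError("upstream did not answer")


events = []
try:
    call_engine("demo-engine", failing_engine, events)
except EngineError as err:
    print(err.engine_name, "|", err, "|", events[0]["kind"])
```

Callers catch one library-specific type, and the original exception stays reachable through both the attribute and the exception chain.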

Events are typed:

class Event(BaseModel):
    stage: str  # "recognize"
    kind: Literal["started", "progress", "finished", "error", "cache_hit"]
    timestamp: datetime
    payload: dict[str, Any]  # stage-specific; see table below

Three subscription paths, one canonical recommendation each:

  • Callback: pipeline(doc, on_event=callback). Best default.
  • Async iterator: async for item in pipeline.events(doc). Yields events and the final result through one channel.
  • SSE: scriva.events.to_sse(pipeline.events(doc)) for web servers.

Every standard adapter emits events shaped for the SSE form so a UI can render a progress bar without further translation.
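
To show what the SSE path involves, here is a minimal sketch of framing an async event stream as Server-Sent Events. It is not the implementation of scriva.events.to_sse: fake_events stands in for pipeline.events(doc), events are plain dicts rather than Event models, and the "stage.kind" event name is an assumption.

```python
import asyncio
import json
from typing import Any, AsyncIterator


async def fake_events() -> AsyncIterator[dict[str, Any]]:
    """Hypothetical stand-in for pipeline.events(doc); yields plain dicts."""
    yield {"stage": "recognize", "kind": "started", "payload": {"total": 2}}
    yield {"stage": "recognize", "kind": "finished", "payload": {"done": 2, "total": 2}}


async def to_sse(events: AsyncIterator[dict[str, Any]]) -> AsyncIterator[str]:
    """Frame each event as one SSE message: an event line, a data line, a blank line."""
    async for event in events:
        name = f"{event['stage']}.{event['kind']}"
        yield f"event: {name}\ndata: {json.dumps(event['payload'])}\n\n"


async def collect() -> list[str]:
    return [frame async for frame in to_sse(fake_events())]


frames = asyncio.run(collect())
print("".join(frames), end="")
```

A web framework's streaming response can consume the generator directly, writing each frame as it arrives.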

The payload schema for every built-in stage is stable and versioned alongside the JSON exporter schema. Custom stages may emit any extra keys; UI code should ignore unknown keys.

| stage             | kind      | payload keys |
| ----------------- | --------- | ------------ |
| preprocess        | finished  | ms, plus stage-specific (e.g. angle from orientation) |
| detect            | finished  | rows, cols, cells, ms |
| classify          | finished  | blank, merged, ambiguous, ms |
| region_preprocess | finished  | transformed, sliced_in, sliced_out, ms |
| recognize         | started   | total |
| recognize         | progress  | done, total, cache_hits, region_id, cell (grid only) |
| recognize         | progress  | region_id, partial_text, tokens (streaming recognizers only) |
| recognize         | cache_hit | region_id, tier ("exact" or "semantic"), similarity |
| recognize         | finished  | done, total, cache_hits, ms |
| post_process      | finished  | stage, plus stage-specific (e.g. corrected from dictionary) |
| export            | finished  | path, format, bytes |
| (any)             | error     | error, message, traceback, region_id (when applicable) |
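
A consumer that follows the "ignore unknown keys" rule could look like the sketch below. ProgressTracker is a hypothetical callback, and it assumes events arrive as plain dicts with stage, kind and payload entries rather than as Event models.

```python
from typing import Any


class ProgressTracker:
    """Callback-style consumer that reads only the payload keys it knows."""

    def __init__(self) -> None:
        self.done = 0
        self.total = 0
        self.cache_hits = 0
        self.errors: list[str] = []

    def __call__(self, event: dict[str, Any]) -> None:
        payload = event.get("payload", {})
        kind = event.get("kind")
        if kind == "error":
            self.errors.append(payload.get("message", ""))
        elif event.get("stage") == "recognize" and kind in ("started", "progress", "finished"):
            # Streaming progress events carry no done/total, so keep the old values.
            self.done = payload.get("done", self.done)
            self.total = payload.get("total", self.total)
            self.cache_hits = payload.get("cache_hits", self.cache_hits)

    @property
    def fraction(self) -> float:
        return self.done / self.total if self.total else 0.0


tracker = ProgressTracker()
tracker({"stage": "recognize", "kind": "started", "payload": {"total": 4}})
tracker({"stage": "recognize", "kind": "progress",
         "payload": {"done": 1, "total": 4, "cache_hits": 0, "custom_key": "ignored"}})
print(f"{tracker.fraction:.0%}")
```

Reading with payload.get and a fallback means extra keys from custom stages, or missing keys on streaming events, never break the progress bar.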

The cell payload on a recognize/progress event is present when the region has region.grid set. Its shape is:

{
    "row": int,
    "col": int,
    "rowspan": int,
    "colspan": int,
    "text": str | None,
    "is_blank": bool,
    "confidence": float | None,
    "cache": {"tier": "exact" | "semantic", "similarity": float} | None,
}

This is the same shape recognition.to_event_dict() returns, so a UI that consumes one can consume the other.
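
For UI code that wants static checking, the cell shape can be written down as a TypedDict. The class names CacheInfo and CellPayload and the describe helper are hypothetical; only the keys and value types come from the shape above.

```python
from typing import Literal, Optional, TypedDict


class CacheInfo(TypedDict):
    tier: Literal["exact", "semantic"]
    similarity: float


class CellPayload(TypedDict):
    row: int
    col: int
    rowspan: int
    colspan: int
    text: Optional[str]
    is_blank: bool
    confidence: Optional[float]
    cache: Optional[CacheInfo]


def describe(cell: CellPayload) -> str:
    """Render one cell the way a progress UI might."""
    where = f"r{cell['row']}c{cell['col']}"
    if cell["is_blank"]:
        return f"{where}: (blank)"
    source = cell["cache"]["tier"] if cell["cache"] else "engine"
    return f"{where}: {cell['text']!r} via {source}"


cell: CellPayload = {
    "row": 0, "col": 1, "rowspan": 1, "colspan": 1,
    "text": "Total", "is_blank": False, "confidence": 0.97,
    "cache": {"tier": "exact", "similarity": 1.0},
}
print(describe(cell))
```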

The cache sits inside the recogniser stage as a wrapper: the recogniser looks up each region before calling its engine, records hits/misses in the event stream, and writes back on miss. Cache adapters never reach into engine adapters; engine adapters never know whether they are cached. The hit provenance shows up on recognition.cache.
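
The wrapper arrangement can be sketched with an exact-match tier only. CachedRecognizer is a hypothetical illustration, not scriva's cache: it assumes a SHA-256 of the region bytes as the key, a plain dict as the store, and records only hits in its event list; the semantic tier is left out.

```python
import hashlib
from typing import Callable, Optional


class CachedRecognizer:
    """Hypothetical wrapper: exact-match cache in front of an engine callable."""

    def __init__(self, engine: Callable[[bytes], str], store: Optional[dict] = None) -> None:
        self._engine = engine  # the engine knows nothing about caching
        self._store = store if store is not None else {}
        self.events: list[dict] = []

    def recognize(self, region_id: str, image: bytes) -> dict:
        key = hashlib.sha256(image).hexdigest()
        if key in self._store:
            hit = {"tier": "exact", "similarity": 1.0}
            self.events.append({"stage": "recognize", "kind": "cache_hit",
                                "payload": {"region_id": region_id, **hit}})
            return {"text": self._store[key], "cache": hit}
        text = self._engine(image)  # miss: call the engine once
        self._store[key] = text     # write back for the next lookup
        return {"text": text, "cache": None}


engine_calls = []


def demo_engine(image: bytes) -> str:
    engine_calls.append(image)
    return image.decode().upper()


recognizer = CachedRecognizer(demo_engine)
print(recognizer.recognize("r1", b"abc"))
print(recognizer.recognize("r2", b"abc"))
```

The engine callable and the store never reference each other; swapping either one leaves the other untouched.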

The core deliberately leaves the following out:

  • It does not render PDFs. PDF rendering is a Document plugin behind the pdf extra.
  • It does not call any HTTP API. Every API call lives in an adapter.
  • It does not assume a storage backend. Cache, SampleStore, Exporter, and Document loading are all swappable.
  • It does not impose a database. Persistence of jobs/results is out of scope; if you need it, your application owns it and uses scriva for the pure OCR transform.

This is what makes the same library suitable for a CLI, a worker, a notebook, or a web server.