Architecture
Layering
    ┌──────────────────────────────────────────────────────────────────┐
    │ domains/  forms.py  pid.py  agentic.py                           │ ← opinionated, ready-to-run pipelines
    ├──────────────────────────────────────────────────────────────────┤
    │ recognize/  detect/  classify/  postprocess/  export/            │ ← concrete adapters (factories)
    │ recognizers/  detectors/  ...                                    │ ← same adapters as classes
    │ cache/  prompts/                                                 │
    ├──────────────────────────────────────────────────────────────────┤
    │ pipeline.py  types.py  events.py  errors.py                      │ ← core protocols + orchestrator
    │ __init__.py: read(), Pipeline, decorators                        │ ← top-level surface
    └──────────────────────────────────────────────────────────────────┘

Each upper layer depends only on the layer below it. The core layer has no dependency on any OCR engine, image library, or storage backend; those live in the adapter layer behind optional extras.
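The dependency direction can be illustrated with a minimal sketch. The `Recognizer` protocol shape, `EchoRecognizer`, and `run_core` below are hypothetical names chosen for illustration, not scriva's actual definitions; the point is only that core code can be written against a protocol without importing any engine.

```python
from typing import Protocol


class Recognizer(Protocol):
    """Core-layer contract (hypothetical shape): no engine imports here."""

    def recognize(self, crop: bytes) -> str: ...


class EchoRecognizer:
    """Adapter-layer class (hypothetical); a real one would wrap an OCR engine."""

    def recognize(self, crop: bytes) -> str:
        return crop.decode("utf-8")


def run_core(recognizer: Recognizer, crops: list[bytes]) -> list[str]:
    # The orchestrator sees only the protocol, never the concrete adapter.
    return [recognizer.recognize(crop) for crop in crops]
```

Because `EchoRecognizer` satisfies the protocol structurally, it needs no import from the core module, which is one way to keep heavy engines behind optional extras.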
The flow of a single run
                            ┌────────────────┐
     Document.load ──►  │    Document    │
                            └───────┬────────┘
                                    │ pages()
                                    ▼
                            ┌────────────────┐
                            │    Page (n)    │
                            └───────┬────────┘
                                    │
                ┌───────────────────┴────────────────────┐
                ▼                                        │
        ┌───────────────┐                                │
        │ Preprocessor  │ deskew, denoise, normalise     │
        └───────┬───────┘                                │
                ▼                                        │
        ┌───────────────┐                                │
        │LayoutDetector │ → Layout(regions, relations)   │
        └───────┬───────┘                                │
                ▼                                        │
        ┌───────────────┐                                │
        │  Classifier   │ → marks roles (blank, header)  │
        └───────┬───────┘                                │
                ▼                                        │
        ┌───────────────────┐                            │
        │ RegionPreprocessor│ per-cell binarise, slice,  │
        │                   │ pad; sets crop_override    │
        └───────┬───────────┘                            │
                ▼                                        │
        ┌───────────────┐                                │
        │  Recognizer   │ ←── Cache (exact + semantic)   │
        │               │ ←── RecognitionHint (few-shot  │
        │               │     from SampleStore)          │
        └───────┬───────┘                                │
                ▼                                        │
        ┌───────────────┐                                │
        │PostProcessor* │ chain: dict, split, score…     │
        └───────┬───────┘                                │
                ▼                                        │
        ┌───────────────┐                                │
        │  PageResult   │                                │
        └───────┬───────┘                                │
                │                                        │
                └─────────────────► aggregated ─────────►│
                                                         ▼
                                                 ┌────────────────┐
                                                 │ DocumentResult │ ← knows how to .to_excel(), .to_json(), …
                                                 └───────┬────────┘
                                                         ▼
                                                 ┌────────────────┐
                                                 │   Exporter*    │ ← optional in-run writes
                                                 └────────────────┘

Every arrow is a Stage. Every box is replaceable. The two *-marked rows are chains: you can stack as many post-processors and exporters as you like.
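What "chain" means for the *-marked rows can be sketched as plain function composition. The `PostProcessor` alias and `run_chain` helper are hypothetical and operate on bare strings; the library's real post-processors presumably work on richer result objects.

```python
from typing import Callable, Iterable

# Hypothetical simplification: a post-processor maps text to text.
PostProcessor = Callable[[str], str]


def run_chain(text: str, chain: Iterable[PostProcessor]) -> str:
    # Each step receives the output of the previous one, in order.
    for step in chain:
        text = step(text)
    return text


cleaned = run_chain("  Wor1d ", [str.strip, lambda t: t.replace("1", "l")])
```

Order matters in such a chain: a dictionary correction placed before a splitter sees different input than one placed after it.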
Concurrency model
- The pipeline is internally async. `pipeline(doc)` is a sync entry point that wraps the coroutine; `await pipeline.aio(doc)` is the async one.
- Calling `pipeline(doc)` from within a running event loop raises with a pointer to `aio`. We don't try to be clever.
- Recognizers expose `max_concurrency`; the library enforces it with an `asyncio.Semaphore`. You never call the recogniser directly; you submit regions and the recogniser batches.
- Per-page work runs sequentially by default; pass `page_concurrency=N` to parallelise across pages. Per-region work inside a recogniser is always parallel up to its budget.
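The rules above can be sketched with a toy class. `MiniPipeline` is a hypothetical stand-in, not scriva's implementation: it shows a sync entry point that refuses to run inside an event loop, and a semaphore that caps in-flight region work at `max_concurrency`.

```python
import asyncio


class MiniPipeline:
    """Hypothetical stand-in for a pipeline; illustrates the pattern only."""

    def __init__(self, max_concurrency: int = 2) -> None:
        self.max_concurrency = max_concurrency
        self.peak = 0  # highest number of regions in flight at once
        self._active = 0

    async def _recognize(self, region: str, sem: asyncio.Semaphore) -> str:
        async with sem:  # at most max_concurrency bodies run concurrently
            self._active += 1
            self.peak = max(self.peak, self._active)
            await asyncio.sleep(0.01)  # stands in for an engine call
            self._active -= 1
            return region.upper()

    async def aio(self, regions: list[str]) -> list[str]:
        sem = asyncio.Semaphore(self.max_concurrency)
        return list(await asyncio.gather(*(self._recognize(r, sem) for r in regions)))

    def __call__(self, regions: list[str]) -> list[str]:
        try:
            asyncio.get_running_loop()
        except RuntimeError:
            # No loop is running, so it is safe to start one.
            return asyncio.run(self.aio(regions))
        raise RuntimeError("event loop already running; use `await pipeline.aio(...)`")
```

Raising instead of nesting loops keeps the failure visible at the call site, which matches the "we don't try to be clever" stance.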
Cancellation
`Context.cancel` is an `asyncio.Event`. Long-running stages must check it at yield points; the standard adapters do. The pipeline exposes `pipeline.cancel()` and `pipeline.cancelled` for callers.
A custom stage that iterates regions should make the check explicit:
    async def run(self, ctx: Context) -> Context:
        for region in ctx.layout.regions:
            # Checkpoint: stop before starting the next region.
            if ctx.cancelled.is_set():
                raise asyncio.CancelledError
            await self._process(region)
        return ctx

A stage that ignores `ctx.cancelled` will simply run to completion after `pipeline.cancel()` is called; it will not deadlock, but it will not honour the cancel either. Wrap any long-running subprocess or HTTP call in `asyncio.wait_for(..., timeout=...)` if you cannot insert your own checkpoint. A call that blocks the thread must first be made awaitable, for example with `asyncio.to_thread`, because `asyncio.wait_for` only accepts awaitables.
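A minimal sketch of the timeout fallback, using only the standard library. `slow_engine_call` and `bounded_call` are hypothetical names; the blocking function stands in for a subprocess or HTTP client that offers no checkpoint of its own.

```python
import asyncio
import time
from typing import Optional


def slow_engine_call(seconds: float) -> str:
    # Hypothetical blocking call (think: subprocess or synchronous HTTP client).
    time.sleep(seconds)
    return "text"


async def bounded_call(seconds: float, timeout: float) -> Optional[str]:
    try:
        # to_thread makes the blocking call awaitable; wait_for bounds the wait.
        return await asyncio.wait_for(
            asyncio.to_thread(slow_engine_call, seconds), timeout=timeout
        )
    except asyncio.TimeoutError:
        return None
```

Note that the timeout only stops the waiting: the worker thread itself keeps running until the blocking call returns, so this bounds latency rather than reclaiming the work.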
Errors
Three error classes, all under `scriva.errors`:

- `ConfigurationError`: raised at construction time. Wrong wiring, missing capabilities, malformed config. Never raised mid-run.
- `EngineError`: raised by an adapter when its upstream (an HTTP API, a binary on `$PATH`, a model file) failed. Carries `engine.name` and the original exception.
- `RecognitionError`: raised when a region was attempted but no recogniser could produce text. The pipeline's `error_policy` decides whether this aborts the page or attaches an empty `Recognition` with `error` set.
Stages never raise raw exceptions to the caller — they wrap, attach context, and emit a structured error event before re-raising.
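The wrap, attach context, emit, re-raise sequence might look roughly like this. The `EngineError` class here is a local stand-in with a hypothetical constructor, and `call_engine` and `emit` are illustrative names; the real classes in `scriva.errors` may carry different fields.

```python
from typing import Any, Callable


class EngineError(Exception):
    """Local stand-in; the real scriva.errors.EngineError may differ."""

    def __init__(self, engine_name: str, original: BaseException) -> None:
        super().__init__(f"{engine_name}: {original}")
        self.engine_name = engine_name
        self.original = original


def call_engine(
    engine_name: str,
    fn: Callable[[], Any],
    emit: Callable[[dict], None],
) -> Any:
    try:
        return fn()
    except Exception as exc:
        # Emit a structured error event first, then raise the wrapped error.
        emit({
            "stage": "recognize",
            "kind": "error",
            "payload": {"error": type(exc).__name__, "message": str(exc)},
        })
        raise EngineError(engine_name, exc) from exc
```

Using `raise ... from exc` keeps the original traceback reachable through `__cause__`, so the wrapper adds context without hiding the root failure.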
Observability
Events are typed:

    from datetime import datetime
    from typing import Any, Literal

    from pydantic import BaseModel  # assumption: BaseModel is pydantic's

    class Event(BaseModel):
        stage: str  # "recognize"
        kind: Literal["started", "progress", "finished", "error", "cache_hit"]
        timestamp: datetime
        payload: dict[str, Any]  # stage-specific; see table below

Three subscription paths, one canonical recommendation each:

- Callback: `pipeline(doc, on_event=callback)`. Best default.
- Async iterator: `async for item in pipeline.events(doc):` yields events and the final result through one channel.
- SSE: `scriva.events.to_sse(pipeline.events(doc))` for web servers.
Every standard adapter emits events shaped for the SSE form so a UI can render a progress bar without further translation.
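For readers unfamiliar with the SSE wire format, here is a rough sketch of how one event could be framed. This is not `scriva.events.to_sse`; `sse_frame` is a hypothetical helper, it takes a plain dict rather than an `Event`, and the `stage.kind` event name is an assumption made for the example.

```python
import json


def sse_frame(event: dict) -> str:
    """Encode one event dict as a Server-Sent Events frame (illustrative only)."""
    data = json.dumps(event["payload"], separators=(",", ":"))
    # An SSE frame is a set of "field: value" lines ended by a blank line.
    return f"event: {event['stage']}.{event['kind']}\ndata: {data}\n\n"


frame = sse_frame(
    {"stage": "recognize", "kind": "progress", "payload": {"done": 1, "total": 4}}
)
```

Serialising the payload onto a single `data:` line keeps the frame valid, since a raw newline inside the value would otherwise start a new field.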
Standard payloads
The payload schema for every built-in stage is stable and versioned alongside the JSON exporter schema. Custom stages may emit any extra keys; UI code should ignore unknown keys.
| stage | kind | payload keys |
|---|---|---|
| `preprocess` | `finished` | `ms`, plus stage-specific (e.g. `angle` from orientation) |
| `detect` | `finished` | `rows`, `cols`, `cells`, `ms` |
| `classify` | `finished` | `blank`, `merged`, `ambiguous`, `ms` |
| `region_preprocess` | `finished` | `transformed`, `sliced_in`, `sliced_out`, `ms` |
| `recognize` | `started` | `total` |
| `recognize` | `progress` | `done`, `total`, `cache_hits`, `region_id`, `cell` (grid only) |
| `recognize` | `progress` | `region_id`, `partial_text`, `tokens` (streaming recognizers only) |
| `recognize` | `cache_hit` | `region_id`, `tier` (`"exact"` or `"semantic"`), `similarity` |
| `recognize` | `finished` | `done`, `total`, `cache_hits`, `ms` |
| `post_process` | `finished` | `stage`, plus stage-specific (e.g. `corrected` from dictionary) |
| `export` | `finished` | `path`, `format`, `bytes` |
| (any) | `error` | `error`, `message`, `traceback`, `region_id` (when applicable) |
The `cell` payload on a `recognize`/`progress` event is present when the region has `region.grid` set. Its shape is:

    {
      "row": int,
      "col": int,
      "rowspan": int,
      "colspan": int,
      "text": str | None,
      "is_blank": bool,
      "confidence": float | None,
      "cache": {"tier": "exact" | "semantic", "similarity": float} | None,
    }

This is the same shape `recognition.to_event_dict()` returns, so a UI that consumes one can consume the other.
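A consumer of these payloads could be sketched as a small callback object. `ProgressTracker` is a hypothetical name, and the sketch assumes events arrive as plain dicts with `stage`, `kind`, and `payload` keys rather than as `Event` instances. It reads only the keys it needs and ignores everything else, including the streaming variant of `progress` that carries no `done`.

```python
from typing import Any, Optional


class ProgressTracker:
    """Hypothetical on_event callback that tracks recognize progress."""

    def __init__(self) -> None:
        self.done = 0
        self.total = 0
        self.cells: dict[tuple[int, int], Optional[str]] = {}

    def __call__(self, event: dict[str, Any]) -> None:
        if event.get("stage") != "recognize":
            return
        payload = event.get("payload", {})
        kind = event.get("kind")
        if kind == "started":
            self.total = payload.get("total", 0)
        elif kind == "progress" and "done" in payload:
            self.done = payload["done"]
            cell = payload.get("cell")  # present for grid regions only
            if cell is not None:
                self.cells[(cell["row"], cell["col"])] = cell.get("text")

    @property
    def fraction(self) -> float:
        return self.done / self.total if self.total else 0.0
```

Keying cells by `(row, col)` lets a UI fill a table incrementally as events arrive, without waiting for the final result.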
Caching, in one sentence
The cache sits inside the recogniser stage as a wrapper: the recogniser looks up each region before calling its engine, records hits/misses in the event stream, and writes back on miss. Cache adapters never reach into engine adapters; engine adapters never know whether they are cached. The hit provenance shows up on `recognition.cache`.
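The wrapper arrangement can be sketched as follows. `CachedRecognizer` is a hypothetical class covering only an exact-match tier keyed on the raw crop bytes; the real cache, its key scheme, and its semantic tier are not shown here.

```python
from typing import Callable


class CachedRecognizer:
    """Hypothetical wrapper: exact-match cache in front of an engine call."""

    def __init__(
        self,
        engine: Callable[[bytes], str],
        emit: Callable[[dict], None],
    ) -> None:
        self._engine = engine
        self._emit = emit
        self._store: dict[bytes, str] = {}

    def recognize(self, region_id: str, crop: bytes) -> str:
        if crop in self._store:
            # Hit: report it on the event stream and skip the engine.
            self._emit({
                "stage": "recognize",
                "kind": "cache_hit",
                "payload": {"region_id": region_id, "tier": "exact"},
            })
            return self._store[crop]
        text = self._engine(crop)  # the engine is unaware of the cache
        self._store[crop] = text  # write back on miss
        return text
```

The engine is passed in as an opaque callable, so neither side needs to know the other's type, which mirrors the separation described above.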
What the core layer does not do
- It does not render PDFs. PDF rendering is a `Document` plugin behind the `pdf` extra.
- It does not call any HTTP API. Every API call lives in an adapter.
- It does not assume a storage backend. `Cache`, `SampleStore`, `Exporter`, and `Document` loading are all swappable.
- It does not impose a database. Persistence of jobs/results is out of scope; if you need it, your application owns it and uses scriva for the pure OCR transform.
This is what makes the same library suitable for a CLI, a worker, a notebook, or a web server.