Concepts

scriva is built on nine abstractions. Everything else — every adapter, every domain pack, every exporter — is a concrete implementation of one of them. Internalise this page and the rest of the library reads itself.

A Document is the input. It is not “a PDF” or “an image” — it is an opaque handle to one or more rendered pages, with optional metadata (filename, source URI, MIME, capture rotation, DPI).

doc = Document.load("scan.pdf") # multi-page
doc = Document.load("photo.jpg") # single page
doc = Document.from_bytes(png_bytes) # in-memory

Pipelines accept a path, bytes, or a Document directly — pipeline("x.pdf") loads internally. Pages are lazy: doc.pages() yields one Page at a time (a rendered image plus page-level metadata). Recognisers see one page at a time.
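
To illustrate what lazy page iteration means in practice, here is a minimal sketch using a generator. `LazyDocument` and this `Page` are hypothetical stand-ins written for the example, not scriva's own classes; the point is only that a page is produced when the consumer asks for it.

```python
from dataclasses import dataclass
from typing import Iterator, List


@dataclass
class Page:
    """Stand-in for a rendered page (hypothetical, not scriva's Page)."""
    index: int
    image: bytes


class LazyDocument:
    """Stand-in document that records which pages were actually rendered."""

    def __init__(self, n_pages: int) -> None:
        self.n_pages = n_pages
        self.rendered: List[int] = []

    def pages(self) -> Iterator[Page]:
        for i in range(self.n_pages):
            # Rendering happens here, only when the consumer pulls the next page.
            self.rendered.append(i)
            yield Page(index=i, image=b"")


doc = LazyDocument(3)
pages = doc.pages()
first = next(pages)  # only page 0 has been rendered at this point
```

Because nothing is rendered ahead of the consumer, a multi-hundred-page scan does not need to sit in memory at once.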

A Region is a bounded area on a page that the rest of the pipeline can talk about. Regions are intentionally general:

  • A grid cell in a form
  • A polygon around a P&ID symbol
  • A free-form bounding box around a paragraph
  • An entire page

A Region carries geometry (bbox or polygon), an optional role (header, data, blank, etc.), an optional typed grid field for grid cells, and an open attrs bag for engine-specific hints.

region.bbox # (x, y, w, h) — origin top-left
region.polygon # list[(x, y)] | None
region.role # "data" | "header" | "blank" | "merged" | str
region.grid # GridCell(row=2, col=3, rowspan=1, colspan=1) | None
region.merge_group_id # str | None — shared by regions in one logical merge
region.parent_region_id # str | None — set by slicers; the region this was split from
region.crop_override # PageCrop | None — set by RegionPreprocessors; the recognizer reads this in preference to page+bbox
region.attrs # open dict for engine-specific extension

The common grid metadata used to live as attrs["row"] / attrs["col"]; it was promoted to a typed grid field because two-thirds of stages need it. Regions can be nested (a row contains cells; a section contains paragraphs) and merged (rowspan/colspan, or “this polygon and that polygon are one logical region”). The library never assumes a rectangular grid; the grid case is just region.grid is not None.
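
The shape described above can be sketched with plain dataclasses. This is an illustrative model that assumes only the fields listed on this page; the real `Region` and `GridCell` may differ, and `is_grid_cell` is a hypothetical helper.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Optional, Tuple


@dataclass(frozen=True)
class GridCell:
    row: int
    col: int
    rowspan: int = 1
    colspan: int = 1


@dataclass
class Region:
    bbox: Tuple[float, float, float, float]  # (x, y, w, h), origin top-left
    role: Optional[str] = None
    grid: Optional[GridCell] = None           # set only for grid cells
    attrs: Dict[str, Any] = field(default_factory=dict)


def is_grid_cell(region: Region) -> bool:
    # The grid case is nothing more than "the typed grid field is present".
    return region.grid is not None


cell = Region(bbox=(10, 20, 100, 30), role="data", grid=GridCell(row=2, col=3))
paragraph = Region(bbox=(0, 200, 500, 80), role="data")
```

A stage that only cares about tables can filter with `is_grid_cell` and ignore everything else, without assuming the whole page is a grid.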

A Layout is the set of regions on a page plus the relationships between them. It is the output of a LayoutDetector and the input to a Recognizer.

A layout is a graph, not a list: regions know their parent, their siblings, their merge-group, and (for grids) their row/column index. Iterating a layout in reading order is a single call — layout.in_reading_order().
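
A rough sketch of the graph idea, under stated assumptions: `Node` and `MiniLayout` are hypothetical, and the ordering rule (grid cells by row then column, free-form leaves top-to-bottom then left-to-right) is an illustrative choice rather than the library's documented algorithm.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass
class Node:
    id: str
    bbox: Tuple[float, float, float, float]   # (x, y, w, h), origin top-left
    parent_id: Optional[str] = None
    grid: Optional[Tuple[int, int]] = None    # (row, col) for grid cells


class MiniLayout:
    def __init__(self, nodes: List[Node]) -> None:
        self.nodes: Dict[str, Node] = {n.id: n for n in nodes}

    def children(self, parent_id: str) -> List[Node]:
        return [n for n in self.nodes.values() if n.parent_id == parent_id]

    def in_reading_order(self) -> List[Node]:
        # Only leaves are returned: a region that has children is a container.
        parents = {n.parent_id for n in self.nodes.values() if n.parent_id}
        leaves = [n for n in self.nodes.values() if n.id not in parents]

        def key(n: Node):
            if n.grid is not None:
                return (0, n.grid[0], n.grid[1])   # row-major for grid cells
            x, y, _w, _h = n.bbox
            return (1, y, x)                       # top-to-bottom, left-to-right

        return sorted(leaves, key=key)


layout = MiniLayout([
    Node("table", (0, 0, 200, 100)),
    Node("c11", (100, 50, 100, 50), parent_id="table", grid=(1, 1)),
    Node("c00", (0, 0, 100, 50), parent_id="table", grid=(0, 0)),
    Node("c01", (100, 0, 100, 50), parent_id="table", grid=(0, 1)),
])
order = [n.id for n in layout.in_reading_order()]
```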

A Recognition is the recognised content for one region. It carries:

  • text — the recognised string (may be None if the region is blank)
  • confidence — a float in [0.0, 1.0] (may be None if no scorer ran)
  • language — ISO 639-1 if detected
  • source — which recogniser produced it
  • cache — CacheProvenance(tier, similarity, key) if it came from a cache
  • alternatives — optional n-best list
  • error — populated when an error_policy other than "abort" swallowed an exception
  • attrs — open bag for engine-specific output (token boxes, raw JSON, etc.)

Recognition is immutable. Use .with_text(...), .with_confidence(...), or .replace(**kwargs) to derive a new one.
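
The derive-instead-of-mutate pattern can be approximated with a frozen dataclass. This is a sketch with a reduced field set, not the library's implementation; only the method names `with_text`, `with_confidence`, and `replace` come from the text above.

```python
import dataclasses
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Recognition:
    text: Optional[str]
    confidence: Optional[float] = None
    source: str = "unknown"

    def replace(self, **kwargs) -> "Recognition":
        # dataclasses.replace builds a new instance; self is left untouched.
        return dataclasses.replace(self, **kwargs)

    def with_text(self, text: Optional[str]) -> "Recognition":
        return self.replace(text=text)

    def with_confidence(self, confidence: Optional[float]) -> "Recognition":
        return self.replace(confidence=confidence)


original = Recognition(text="1O0", confidence=0.62, source="demo-engine")
fixed = original.with_text("100")

try:
    original.text = "100"          # direct mutation is rejected
    mutation_blocked = False
except dataclasses.FrozenInstanceError:
    mutation_blocked = True
```

Immutability means a post-processor can hand back a corrected value while earlier stages, caches, and event logs keep the value they originally saw.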

A PageResult bundles a page’s Layout and the Recognition for each region. A DocumentResult is a sequence of PageResults plus document-level metadata.

DocumentResult knows how to serialise itself — exporters in the pipeline are for in-run writes; these methods are for ad-hoc inspection:

result.render() # str — reading-order text
result.to_json() # dict; schema in exporters.md
result.to_dataframe() # pandas; one row per region
result.to_excel("out.xlsx") # writes file
result.to_markdown() # str
result.show() # matplotlib visualisation (optional extra)

A Stage is a Protocol:

class Stage(Protocol):
    name: str
    capabilities: frozenset[Capability]

    async def run(self, ctx: Context) -> Context: ...

The name defaults to a kebab-cased version of the class name; override it when you have two stages of the same kind in one pipeline.
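
For a feel of what a kebab-cased default looks like, here is one possible conversion. The exact rule scriva applies (for example, how it treats acronyms) is not specified on this page, so `kebab_case` is a hypothetical helper and its acronym handling is an assumption.

```python
import re


def kebab_case(class_name: str) -> str:
    # Hyphen between a lowercase letter or digit and the capital that follows it.
    s = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", "-", class_name)
    # Hyphen before the last capital of an acronym run, e.g. "VLM|Recognizer".
    s = re.sub(r"(?<=[A-Z])(?=[A-Z][a-z])", "-", s)
    return s.lower()


default_name = kebab_case("LayoutDetector")
acronym_name = kebab_case("VLMRecognizer")
```

With a rule like this, two `VLMRecognizer` instances in one pipeline would collide on the same default name, which is the case where an explicit override is needed.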

A Pipeline is an ordered list of stages plus a Context that flows through them. The Context is a typed, mutable bag — ctx.document, ctx.layout, ctx.recognitions, ctx.events, ctx.cancel. Stages read what they need and write what they produce.

A pipeline is callable:

result = pipeline(doc) # sync; blocks until done
result = await pipeline.aio(doc) # async coroutine
async for page in pipeline.stream(doc): # streaming per page
    ...
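
The relationship between the sync call and the async entry point can be sketched as follows. `MiniPipeline`, `Context`, and the two toy stages are hypothetical; the sketch assumes the sync form simply drives the coroutine to completion, which may not match how scriva does it internally.

```python
import asyncio
from dataclasses import dataclass, field
from typing import List


@dataclass
class Context:
    document: str
    events: List[str] = field(default_factory=list)


class Upper:
    name = "upper"

    async def run(self, ctx: Context) -> Context:
        ctx.document = ctx.document.upper()
        ctx.events.append(self.name)
        return ctx


class Exclaim:
    name = "exclaim"

    async def run(self, ctx: Context) -> Context:
        ctx.document += "!"
        ctx.events.append(self.name)
        return ctx


class MiniPipeline:
    def __init__(self, *stages) -> None:
        self.stages = list(stages)

    async def aio(self, document: str) -> Context:
        ctx = Context(document=document)
        for stage in self.stages:       # stages run strictly in order
            ctx = await stage.run(ctx)
        return ctx

    def __call__(self, document: str) -> Context:
        # Sync form: block until the coroutine finishes.
        return asyncio.run(self.aio(document))


result = MiniPipeline(Upper(), Exclaim())("scan")
```

Note that `asyncio.run` cannot be called from inside a running event loop, so in async code the `aio` form is the one to await.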

Seven kinds of stage are blessed by the standard pipeline because they correspond to the natural OCR phases:

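| Kind               | Reads         | Writes                                   |
|--------------------|---------------|------------------------------------------|
| Preprocessor       | page          | page                                     |
| LayoutDetector     | page          | layout                                   |
| RegionClassifier   | page, layout  | layout (role tags)                       |
| RegionPreprocessor | page, layout  | layout (crop overrides, sliced regions)  |
| Recognizer         | page, layout  | recognitions                             |
| PostProcessor      | recognitions  | recognitions                             |
| Exporter           | result        | (side effect: file)                      |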

RegionPreprocessor is the per-region image-manipulation slot — per-cell binarisation, slicing a tall region into rows, padding header crops differently from data crops. See preprocessors.md › Region preprocessors.

The positional Pipeline(A(), B(), C(), …) constructor slots each stage into the correct phase by sniffing its Protocol, so out-of-order arguments still produce the correct chain. For advanced use, you can register your own stage kinds via Pipeline.builder().stage(...).
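
One way to picture the slotting behaviour is a stable sort over a fixed phase order. In this sketch a plain `kind` attribute stands in for protocol sniffing, and `PHASE_ORDER`, `FakeStage`, and `slot_stages` are all hypothetical names made up for the illustration.

```python
PHASE_ORDER = [
    "preprocessor",
    "layout_detector",
    "region_classifier",
    "region_preprocessor",
    "recognizer",
    "post_processor",
    "exporter",
]


class FakeStage:
    def __init__(self, name: str, kind: str) -> None:
        self.name = name
        self.kind = kind   # stand-in for "which Protocol does this stage satisfy"


def slot_stages(*stages: FakeStage) -> list:
    rank = {kind: i for i, kind in enumerate(PHASE_ORDER)}
    # sorted() is stable, so stages of the same kind keep their argument order.
    return sorted(stages, key=lambda s: rank[s.kind])


chain = slot_stages(
    FakeStage("excel", "exporter"),
    FakeStage("dictionary", "post_processor"),
    FakeStage("vlm", "recognizer"),
    FakeStage("rule-splitter", "post_processor"),
    FakeStage("deskew", "preprocessor"),
    FakeStage("grid", "layout_detector"),
)
chain_names = [s.name for s in chain]
```

The stable sort matters for post-processors: several of them share a phase, and their relative order as written is preserved.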

A Cache is a Protocol with two complementary tiers:

  • Exact cache — keyed by a deterministic hash of the input bytes plus the recogniser configuration. A cache hit returns the previous Recognition verbatim.
  • Semantic cache — keyed by an embedding of the input region. A hit returns the previous result if the cosine similarity exceeds a configurable threshold (default 0.99 — high, because OCR is unforgiving).
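
The two lookups can be sketched in a few lines. This assumes SHA-256 over the bytes plus a canonical JSON dump of the configuration for the exact key, and plain cosine similarity for the semantic tier; the library's actual hashing scheme and embedding model are not specified here, and `exact_key`, `cosine`, and `semantic_hit` are hypothetical helpers.

```python
import hashlib
import json
import math
from typing import Sequence


def exact_key(image_bytes: bytes, config: dict) -> str:
    h = hashlib.sha256()
    h.update(image_bytes)
    # sort_keys makes the key independent of dict insertion order.
    h.update(json.dumps(config, sort_keys=True).encode("utf-8"))
    return h.hexdigest()


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def semantic_hit(query: Sequence[float], cached: Sequence[float],
                 threshold: float = 0.99) -> bool:
    # A hit requires the similarity to exceed the threshold.
    return cosine(query, cached) > threshold


same_key = exact_key(b"crop", {"model": "m", "lang": "en"}) == \
    exact_key(b"crop", {"lang": "en", "model": "m"})
near_duplicate = semantic_hit([1.0, 0.0], [1.0, 0.01])
different_crop = semantic_hit([1.0, 0.0], [1.0, 0.5])
```

Changing any recogniser setting changes the exact key, so a reconfigured engine never reads stale results from the exact tier.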

Caches accept string shorthand: cache=".scriva_cache" becomes a FileSystemCache. Cache.layered(path) gives you both tiers with sensible defaults. Cache is always plumbed through the recogniser as a wrapper, never hidden inside an engine adapter, so you can see hit/miss in the event stream and in recognition.cache.

A SampleStore is labelled-crop persistence. It is to labelled crops what Cache is to recognised crops: an opt-in protocol whose adapters back onto filesystem, sqlite, pgvector, or S3.

class SampleStore(Engine, Protocol):
    async def put(self, sample: Sample) -> SampleId: ...
    async def get(self, id: SampleId) -> Sample | None: ...
    async def find(self, *, where=None, near=None, limit=50) -> Sequence[Sample]: ...
    async def remove(self, id: SampleId) -> None: ...

Each Sample carries a crop, the primary recognizer’s output, an optional human label, an optional embedding, and a source pointer. Four things in the library use sample stores: the samples exporter, the embedding classifier’s training endpoint, few-shot retrieval via RecognitionHint.from_store(...), and supervised correction via postprocess.dictionary.from_samples(...). The last two read the same labelled crops on opposite sides of the recognizer — see samples.md › Two roads from a SampleStore.
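
As a sketch of what a minimal adapter might look like, here is a dictionary-backed store with the same four methods. It makes several assumptions not confirmed by this page: `where` is treated as a predicate callable, `near` is ignored, ids are random hex strings, and this `Sample` carries only three of the fields listed above.

```python
import asyncio
import uuid
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class Sample:
    crop: bytes
    predicted: str
    label: Optional[str] = None


class InMemorySampleStore:
    name = "memory"
    version = "0"

    def __init__(self) -> None:
        self._items: Dict[str, Sample] = {}

    async def put(self, sample: Sample) -> str:
        sample_id = uuid.uuid4().hex
        self._items[sample_id] = sample
        return sample_id

    async def get(self, id: str) -> Optional[Sample]:
        return self._items.get(id)

    async def find(self, *, where: Optional[Callable[[Sample], bool]] = None,
                   near=None, limit: int = 50) -> List[Sample]:
        # `near` (embedding search) is not implemented in this sketch.
        matches = [s for s in self._items.values() if where is None or where(s)]
        return matches[:limit]

    async def remove(self, id: str) -> None:
        self._items.pop(id, None)


async def demo():
    store = InMemorySampleStore()
    first = await store.put(Sample(b"\x00", predicted="1O0", label="100"))
    await store.put(Sample(b"\x01", predicted="abc"))
    labelled = await store.find(where=lambda s: s.label is not None)
    await store.remove(first)
    return labelled, await store.get(first)


labelled, after_remove = asyncio.run(demo())
```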

Detectors, recognisers, post-processors, scorers, and exporters all conform to a single root Engine protocol that declares three attributes:

class Engine(Protocol):
    name: str
    version: str
    capabilities: frozenset[Capability]

capabilities is what lets a pipeline assemble itself or warn you when a recogniser cannot, say, return token-level boxes. It is also what the capability negotiation at build time uses — passing a recogniser without Capability.LANGUAGE_DETECTION while also configuring a language-aware post-processor is a build-time error, not a runtime surprise.
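
Build-time negotiation boils down to a set difference between what downstream stages require and what upstream engines provide. The sketch below is illustrative only: `BuildError`, `check_capabilities`, and the `TOKEN_BOXES` member are hypothetical; only `Capability.LANGUAGE_DETECTION` is named on this page.

```python
from enum import Enum, auto
from typing import FrozenSet


class Capability(Enum):
    LANGUAGE_DETECTION = auto()
    TOKEN_BOXES = auto()          # hypothetical member for the example


class BuildError(Exception):
    """Raised while assembling a pipeline, before any page is processed."""


def check_capabilities(provided: FrozenSet[Capability],
                       required: FrozenSet[Capability]) -> None:
    missing = required - provided
    if missing:
        names = ", ".join(sorted(c.name for c in missing))
        raise BuildError(f"recogniser lacks required capabilities: {names}")


recogniser_caps = frozenset({Capability.TOKEN_BOXES})
postprocessor_needs = frozenset({Capability.LANGUAGE_DETECTION})

try:
    check_capabilities(recogniser_caps, postprocessor_needs)
    build_message = ""
except BuildError as exc:
    build_message = str(exc)
```

Failing at assembly time means the mismatch surfaces before any document is loaded or any engine call is paid for.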


Concretely, one full run of a grid-form OCR looks like this:

Document ──► Preprocessor                    (deskew, denoise)
         ──► LayoutDetector                  (grid → Layout of cell Regions)
         ──► RegionClassifier                (mark blank cells, expand merged cells)
         ──► RegionPreprocessor              (pad data cells, binarise, slice overflow)
         ──► Recognizer                      (VLM call per non-blank cell, cache-aware,
                                              hints=RecognitionHint.from_store(samples) for few-shot)
         ──► PostProcessor: dictionary       (from yaml + from_samples)
         ──► PostProcessor: rule_splitter    (internal ruled lines)
         ──► PostProcessor: confidence_score (round-trip rendering match)
         ──► Exporter                        (excel writes .xlsx with merges and colours)

Every arrow is a Stage. Every box is something you can swap, mock, or extend. That is the whole library.