Concepts

scriva is built on nine abstractions. Everything else — every adapter, every domain pack, every exporter — is a concrete implementation of one of them. Internalise this page and the rest of the library reads itself.

1. `Document`

A Document is the input. It is not “a PDF” or “an image” — it is an opaque handle to one or more rendered pages, with optional metadata (filename, source URI, MIME, capture rotation, DPI).

doc = Document.load("scan.pdf")        # multi-page
doc = Document.load("photo.jpg")       # single page
doc = Document.from_bytes(png_bytes)   # in-memory

Pipelines accept a path, bytes, or a Document directly — pipeline("x.pdf") loads internally. Pages are lazy: doc.pages() yields one Page at a time (a rendered image plus page-level metadata). Recognisers see one page at a time.
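
To make "pages are lazy" concrete, here is a minimal sketch using a plain generator. `LazyDocument` and `Page` below are hypothetical stand-ins written for this illustration, not scriva's classes; the only behaviour borrowed from the text above is that `pages()` yields one page at a time.

```python
from dataclasses import dataclass, field
from typing import Iterator


@dataclass
class Page:
    index: int
    image: bytes  # stand-in for a rendered image


@dataclass
class LazyDocument:
    """Hypothetical stand-in for Document; not scriva's implementation."""
    raw_pages: list[bytes]
    rendered: list[int] = field(default_factory=list)  # records which pages were rendered

    def pages(self) -> Iterator[Page]:
        for i, raw in enumerate(self.raw_pages):
            self.rendered.append(i)  # "rendering" happens only when the page is requested
            yield Page(index=i, image=raw)


doc = LazyDocument(raw_pages=[b"p0", b"p1", b"p2"])
pages = doc.pages()
first = next(pages)  # only page 0 has been rendered at this point
```

Because nothing is rendered until it is requested, a consumer that stops after the first page never pays for the rest.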

2. `Region`

A Region is a bounded area on a page that the rest of the pipeline can talk about. Regions are intentionally general:

A grid cell in a form
A polygon around a P&ID symbol
A free-form bounding box around a paragraph
An entire page

A Region carries geometry (bbox or polygon), an optional role (header, data, blank, etc.), an optional typed grid field for grid cells, and an open attrs bag for engine-specific hints.

region.bbox            # (x, y, w, h) — origin top-left
region.polygon         # list[(x, y)] | None
region.role            # "data" | "header" | "blank" | "merged" | str
region.grid            # GridCell(row=2, col=3, rowspan=1, colspan=1) | None
region.merge_group_id  # str | None — shared by regions in one logical merge
region.parent_region_id # str | None — set by slicers; the region this was split from
region.crop_override   # PageCrop | None — set by RegionPreprocessors; the recognizer reads this in preference to page+bbox
region.attrs           # open dict for engine-specific extension

The common grid metadata used to live as attrs["row"] / attrs["col"]; it was promoted to a typed grid field because two-thirds of stages need it. Regions can be nested (a row contains cells; a section contains paragraphs) and merged (rowspan/colspan, or “this polygon and that polygon are one logical region”). The library never assumes a rectangular grid; the grid case is just region.grid is not None.
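
The "grid case is just `region.grid is not None`" idea can be sketched with two small dataclasses. The field names mirror the listing above, but the classes themselves are simplified, hypothetical stand-ins; scriva's real `Region` carries more fields.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass(frozen=True)
class GridCell:
    row: int
    col: int
    rowspan: int = 1
    colspan: int = 1


@dataclass
class Region:
    """Simplified stand-in: geometry, role, typed grid field, open attrs bag."""
    bbox: tuple[int, int, int, int]  # (x, y, w, h), origin top-left
    role: Optional[str] = None
    grid: Optional[GridCell] = None
    attrs: dict = field(default_factory=dict)


def grid_cells(regions):
    """Keep only the regions that belong to a grid."""
    return [r for r in regions if r.grid is not None]


regions = [
    Region(bbox=(0, 0, 100, 20), role="header", grid=GridCell(row=0, col=0, colspan=2)),
    Region(bbox=(0, 20, 50, 20), role="data", grid=GridCell(row=1, col=0)),
    Region(bbox=(0, 60, 100, 80), role="data"),  # free-form paragraph, no grid
]
cells = grid_cells(regions)
```

A stage that only understands grids can filter this way and ignore everything else, while a stage that works on arbitrary regions never has to look at `grid` at all.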

3. `Layout`

A Layout is the set of regions on a page plus the relationships between them. It is the output of a LayoutDetector and the input to a Recognizer.

A layout is a graph, not a list: regions know their parent, their siblings, their merge-group, and (for grids) their row/column index. Iterating a layout in reading order is a single call — layout.in_reading_order().
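
As a rough illustration of a layout that knows parents as well as order, the sketch below stores nodes by id and sorts them top-to-bottom, then left-to-right. That ordering rule is an assumption made for this example; this page does not say how `layout.in_reading_order()` actually orders regions. `Node` and `ToyLayout` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    id: str
    bbox: tuple[int, int, int, int]  # (x, y, w, h), origin top-left
    parent_id: Optional[str] = None


class ToyLayout:
    """Hypothetical stand-in for Layout: nodes plus parent links."""

    def __init__(self, nodes):
        self.nodes = {n.id: n for n in nodes}

    def children(self, parent_id):
        return [n for n in self.nodes.values() if n.parent_id == parent_id]

    def in_reading_order(self):
        # Assumed rule: sort by y, then by x. sorted() is stable, so a parent
        # inserted before a child with the same origin stays ahead of it.
        return sorted(self.nodes.values(), key=lambda n: (n.bbox[1], n.bbox[0]))


layout = ToyLayout([
    Node("row", (0, 0, 100, 20)),
    Node("c2", (50, 0, 50, 20), parent_id="row"),
    Node("para", (0, 40, 100, 30)),
    Node("c1", (0, 0, 50, 20), parent_id="row"),
])
order = [n.id for n in layout.in_reading_order()]
```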

4. `Recognition`

A Recognition is the recognised content for one region. It carries:

text — the recognised string (may be None if the region is blank)
confidence — [0.0, 1.0] (may be None if no scorer ran)
language — ISO 639-1 if detected
source — which recogniser produced it
cache — CacheProvenance(tier, similarity, key) if it came from a cache
alternatives — optional n-best list
error — populated when an error_policy other than "abort" swallowed an exception
attrs — open bag for engine-specific output (token boxes, raw JSON, etc.)

Recognition is immutable. Use .with_text(...), .with_confidence(...), or .replace(**kwargs) to derive a new one.
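
The derive-instead-of-mutate pattern can be sketched with a frozen dataclass and `dataclasses.replace`. `ToyRecognition` is a hypothetical stand-in with only three of the fields listed above; whether scriva builds `Recognition` this way is not stated here.

```python
from dataclasses import FrozenInstanceError, dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class ToyRecognition:
    text: Optional[str]
    confidence: Optional[float] = None
    source: str = "unknown"

    def with_text(self, text):
        # Returns a new instance; the original is left untouched.
        return replace(self, text=text)

    def with_confidence(self, confidence):
        return replace(self, confidence=confidence)


original = ToyRecognition(text="1O0", confidence=0.71, source="toy-engine")
corrected = original.with_text("100").with_confidence(0.93)

try:
    original.text = "100"  # direct assignment is rejected on a frozen dataclass
    mutated = True
except FrozenInstanceError:
    mutated = False
```

Immutability means a post-processor can hand on a corrected value while the original recognition stays available for auditing or comparison.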

5. `PageResult` & `DocumentResult`

A PageResult bundles a page’s Layout and the Recognition for each region. A DocumentResult is a sequence of PageResults plus document-level metadata.

DocumentResult knows how to serialise itself. Exporters in the pipeline handle in-run writes; the methods below are for ad-hoc inspection:

result.render()              # str — reading-order text
result.to_json()             # dict; schema in exporters.md
result.to_dataframe()        # pandas; one row per region
result.to_excel("out.xlsx")  # writes file
result.to_markdown()         # str
result.show()                # matplotlib visualisation (optional extra)

6. `Stage` and `Pipeline`

A Stage is a Protocol:

class Stage(Protocol):
    name: str
    capabilities: frozenset[Capability]
    async def run(self, ctx: Context) -> Context: ...

The name defaults to a kebab-cased version of the class name; override it when a pipeline contains two stages of the same kind.

A Pipeline is an ordered list of stages plus a Context that flows through them. The Context is a typed, mutable bag — ctx.document, ctx.layout, ctx.recognitions, ctx.events, ctx.cancel. Stages read what they need and write what they produce.

A pipeline is callable:

result = pipeline(doc)              # sync; blocks until done
result = await pipeline.aio(doc)    # async coroutine
async for page in pipeline.stream(doc):  # streaming per page
    ...
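
The relationship between stages, the context, and the sync and async entry points can be sketched in a few lines. Everything below (`ToyContext`, `ToyPipeline`, and the two stages) is a hypothetical stand-in; it shows the general shape of "each stage reads and writes a shared context", not scriva's implementation.

```python
import asyncio
from dataclasses import dataclass, field


@dataclass
class ToyContext:
    document: str
    layout: list = field(default_factory=list)
    recognitions: dict = field(default_factory=dict)


class SplitLines:
    name = "split-lines"

    async def run(self, ctx):
        ctx.layout = ctx.document.splitlines()  # writes layout
        return ctx


class Uppercase:
    name = "uppercase"

    async def run(self, ctx):
        # Reads layout, writes recognitions.
        ctx.recognitions = {i: line.upper() for i, line in enumerate(ctx.layout)}
        return ctx


class ToyPipeline:
    def __init__(self, *stages):
        self.stages = stages

    async def aio(self, document):
        ctx = ToyContext(document=document)
        for stage in self.stages:
            ctx = await stage.run(ctx)
        return ctx

    def __call__(self, document):
        # Sync entry point: drive the coroutine to completion.
        return asyncio.run(self.aio(document))


pipeline = ToyPipeline(SplitLines(), Uppercase())
result = pipeline("total\n42")
```

Note that `asyncio.run` cannot be called from inside an already running event loop, so in async code the coroutine entry point is the one to await.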

Seven kinds of stage are blessed by the standard pipeline because they correspond to the natural OCR phases:

Kind	Reads	Writes
`Preprocessor`	`page`	`page`
`LayoutDetector`	`page`	`layout`
`RegionClassifier`	`page`, `layout`	`layout` (role tags)
`RegionPreprocessor`	`page`, `layout`	`layout` (crop overrides, sliced regions)
`Recognizer`	`page`, `layout`	`recognitions`
`PostProcessor`	`recognitions`	`recognitions`
`Exporter`	`result`	(side effect: file)

RegionPreprocessor is the per-region image-manipulation slot — per-cell binarisation, slicing a tall region into rows, padding header crops differently from data crops. See preprocessors.md › Region preprocessors.

The positional Pipeline(A(), B(), C(), …) constructor slots each stage into the correct phase by sniffing its Protocol, so out-of-order arguments still produce the correct chain. For advanced use, you can register your own stage kinds via Pipeline.builder().stage(...).
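
One way to slot stages by protocol is `typing.runtime_checkable` plus a stable sort. The sketch below assumes each kind exposes a differently named method (`preprocess`, `detect`, `recognize`); that is an assumption made so that `isinstance` can tell the kinds apart, and the protocols here are hypothetical stand-ins that only reuse scriva's names.

```python
from typing import Protocol, runtime_checkable


@runtime_checkable
class Preprocessor(Protocol):
    def preprocess(self, page): ...


@runtime_checkable
class LayoutDetector(Protocol):
    def detect(self, page): ...


@runtime_checkable
class Recognizer(Protocol):
    def recognize(self, page, layout): ...


PHASE_ORDER = [Preprocessor, LayoutDetector, Recognizer]


def phase_of(stage) -> int:
    """Return the index of the first phase whose protocol the stage satisfies."""
    for index, protocol in enumerate(PHASE_ORDER):
        if isinstance(stage, protocol):
            return index
    raise TypeError(f"{type(stage).__name__} matches no known stage kind")


def slot(*stages):
    # sorted() is stable, so stages of the same kind keep their given order.
    return sorted(stages, key=phase_of)


class Deskew:
    def preprocess(self, page):
        return page


class GridDetector:
    def detect(self, page):
        return []


class ToyOCR:
    def recognize(self, page, layout):
        return {}


chain = slot(ToyOCR(), Deskew(), GridDetector())
names = [type(s).__name__ for s in chain]
```

Runtime protocol checks only look at whether the named methods exist, not at their signatures, so a real implementation would likely need stricter validation.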

7. `Cache`

A Cache is a Protocol with two complementary tiers:

Exact cache — keyed by a deterministic hash of the input bytes plus the recogniser configuration. A cache hit returns the previous Recognition verbatim.
Semantic cache — keyed by an embedding of the input region. A hit returns the previous result if the cosine similarity exceeds a configurable threshold (default 0.99 — high, because OCR is unforgiving).

Caches accept string shorthand: cache=".scriva_cache" becomes a FileSystemCache. Cache.layered(path) gives you both tiers with sensible defaults. Cache is always plumbed through the recogniser as a wrapper, never hidden inside an engine adapter, so you can see hit/miss in the event stream and in recognition.cache.
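
The two tiers can be sketched independently of any storage backend. The choices below (SHA-256, JSON with sorted keys as the canonical form of the configuration, a linear scan over stored embeddings) are assumptions for illustration; only the 0.99 default threshold and the "bytes plus recogniser configuration" key come from the text above.

```python
import hashlib
import json
import math


def exact_key(image_bytes: bytes, config: dict) -> str:
    """Deterministic key: input bytes plus canonicalised recogniser config."""
    digest = hashlib.sha256()
    digest.update(image_bytes)
    digest.update(b"\0")  # separator between the two parts
    digest.update(json.dumps(config, sort_keys=True).encode("utf-8"))
    return digest.hexdigest()


def cosine(a, b) -> float:
    """Cosine similarity of two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def semantic_lookup(query, entries, threshold=0.99):
    """Return the most similar stored value above the threshold, else None."""
    best_value, best_score = None, threshold
    for embedding, value in entries:
        score = cosine(query, embedding)
        if score > best_score:
            best_value, best_score = value, score
    return best_value


same = exact_key(b"crop", {"model": "m", "lang": "en"}) == exact_key(b"crop", {"lang": "en", "model": "m"})
near_hit = semantic_lookup([1.0, 0.0], [([1.0, 0.001], "cached text")])
miss = semantic_lookup([1.0, 0.0], [([0.9, 0.5], "unrelated")])
```

Sorting the configuration keys matters: without it, two dictionaries with the same settings in a different order would hash differently and the exact tier would miss.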

8. `SampleStore`

A SampleStore persists labelled crops. It is to labelled crops what Cache is to recognised crops: an opt-in protocol whose adapters back onto filesystem, sqlite, pgvector, or S3.

class SampleStore(Engine, Protocol):
    async def put(self, sample: Sample) -> SampleId: ...
    async def get(self, id: SampleId) -> Sample | None: ...
    async def find(self, *, where=None, near=None, limit=50) -> Sequence[Sample]: ...
    async def remove(self, id: SampleId) -> None: ...

Each Sample carries a crop, the primary recogniser’s output, an optional human label, an optional embedding, and a source pointer. Four things in the library use sample stores: the samples exporter, the embedding classifier’s training endpoint, few-shot retrieval via RecognitionHint.from_store(...), and supervised correction via postprocess.dictionary.from_samples(...). The last two read the same labelled crops on opposite sides of the recogniser: few-shot retrieval feeds them in as hints before recognition, and dictionary correction applies them afterwards. See samples.md › Two roads from a SampleStore.
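
For tests or quick experiments, a dictionary-backed store is enough to satisfy the four methods. The sketch below is hypothetical: `ToySample` keeps only three fields, ids are plain integers, `where` is assumed to be a predicate function, and the `near` (similarity) argument is left out entirely.

```python
import asyncio
from dataclasses import dataclass
from typing import Callable, Optional, Sequence


@dataclass
class ToySample:
    crop: bytes
    output: str                  # what the recogniser produced
    label: Optional[str] = None  # optional human correction


class InMemorySampleStore:
    name = "in-memory-samples"

    def __init__(self):
        self._items: dict[int, ToySample] = {}
        self._next_id = 0

    async def put(self, sample: ToySample) -> int:
        sample_id = self._next_id
        self._items[sample_id] = sample
        self._next_id += 1
        return sample_id

    async def get(self, id: int) -> Optional[ToySample]:
        return self._items.get(id)

    async def find(
        self,
        *,
        where: Optional[Callable[[ToySample], bool]] = None,
        limit: int = 50,
    ) -> Sequence[ToySample]:
        matches = [s for s in self._items.values() if where is None or where(s)]
        return matches[:limit]

    async def remove(self, id: int) -> None:
        self._items.pop(id, None)  # removing a missing id is a no-op


async def demo():
    store = InMemorySampleStore()
    first = await store.put(ToySample(b"a", output="1O0", label="100"))
    await store.put(ToySample(b"b", output="abc"))
    labelled = await store.find(where=lambda s: s.label is not None)
    await store.remove(first)
    return labelled, await store.get(first)


labelled, after_removal = asyncio.run(demo())
```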

9. `Engine` (the unifying protocol)

Detectors, recognisers, post-processors, scorers, and exporters all conform to a single root Engine protocol with three attributes:

class Engine(Protocol):
    name: str
    version: str
    capabilities: frozenset[Capability]

capabilities is what lets a pipeline assemble itself or warn you when a recogniser cannot, say, return token-level boxes. It is also what the capability negotiation at build time uses — passing a recogniser without Capability.LANGUAGE_DETECTION while also configuring a language-aware post-processor is a build-time error, not a runtime surprise.
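
A build-time check of this kind can be sketched as a single pass over the stages. The `requires` attribute, the `BuildError` exception, and the two capability members below are hypothetical names introduced for this example; this page does not say how scriva declares what a stage needs.

```python
from enum import Enum, auto


class Capability(Enum):
    LANGUAGE_DETECTION = auto()
    TOKEN_BOXES = auto()


class BuildError(Exception):
    pass


class ToyRecognizer:
    name = "toy-recognizer"
    version = "0.0"
    capabilities = frozenset({Capability.TOKEN_BOXES})


class LanguageAwareFixer:
    name = "language-aware-fixer"
    version = "0.0"
    capabilities = frozenset()
    requires = frozenset({Capability.LANGUAGE_DETECTION})  # hypothetical attribute


def check_capabilities(stages):
    """Raise BuildError if a stage needs a capability no earlier stage provides."""
    provided = set()
    for stage in stages:
        missing = getattr(stage, "requires", frozenset()) - provided
        if missing:
            names = ", ".join(sorted(c.name for c in missing))
            raise BuildError(f"{stage.name} needs {names}, which no earlier stage provides")
        provided |= stage.capabilities


try:
    check_capabilities([ToyRecognizer(), LanguageAwareFixer()])
    error = None
except BuildError as exc:
    error = str(exc)
```

Running the check when the pipeline is assembled, before any document is loaded, is what turns a missing capability into a configuration error instead of a failure partway through a batch.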

Putting it together

Concretely, one full run of a grid-form OCR looks like this:

Document ──► Preprocessor (deskew, denoise)
         ──► LayoutDetector (grid → Layout of cell Regions)
         ──► RegionClassifier (mark blank cells, expand merged cells)
         ──► RegionPreprocessor (pad data cells, binarise, slice overflow)
         ──► Recognizer (VLM call per non-blank cell, cache-aware,
                         hints=RecognitionHint.from_store(samples) for few-shot)
         ──► PostProcessor: dictionary (from yaml + from_samples)
         ──► PostProcessor: rule_splitter (internal ruled lines)
         ──► PostProcessor: confidence_score (round-trip rendering match)
         ──► Exporter (excel writes .xlsx with merges and colours)

Every arrow is a Stage. Every box is something you can swap, mock, or extend. That is the whole library.
