Concepts
scriva is built on nine abstractions. Everything else (every adapter,
every domain pack, every exporter) is a concrete implementation of one of
them. Internalise this page and the rest of the library reads itself.
1. Document
A Document is the input. It is not “a PDF” or “an image”; it is an
opaque handle to one or more rendered pages, with optional metadata
(filename, source URI, MIME, capture rotation, DPI).

```python
doc = Document.load("scan.pdf")        # multi-page
doc = Document.load("photo.jpg")       # single page
doc = Document.from_bytes(png_bytes)   # in-memory
```

Pipelines accept a path, bytes, or a Document directly; pipeline("x.pdf")
loads internally. Pages are lazy: doc.pages() yields one Page at a time
(a rendered image plus page-level metadata). Recognisers see one page at a
time.
2. Region
A Region is a bounded area on a page that the rest of the pipeline can
talk about. Regions are intentionally general:
- A grid cell in a form
- A polygon around a P&ID symbol
- A free-form bounding box around a paragraph
- An entire page
A Region carries geometry (bbox or polygon), an optional role (header,
data, blank, etc.), an optional typed grid field for grid cells, and an
open attrs bag for engine-specific hints.

```python
region.bbox              # (x, y, w, h); origin top-left
region.polygon           # list[(x, y)] | None
region.role              # "data" | "header" | "blank" | "merged" | str
region.grid              # GridCell(row=2, col=3, rowspan=1, colspan=1) | None
region.merge_group_id    # str | None; shared by regions in one logical merge
region.parent_region_id  # str | None; set by slicers; the region this was split from
region.crop_override     # PageCrop | None; set by RegionPreprocessors; the recogniser reads this in preference to page+bbox
region.attrs             # open dict for engine-specific extension
```

The common grid metadata used to live as attrs["row"] / attrs["col"]; it
was promoted to a typed grid field because two-thirds of stages need it.
Regions can be nested (a row contains cells; a section contains
paragraphs) and merged (rowspan/colspan, or “this polygon and that
polygon are one logical region”). The library never assumes a rectangular
grid; the grid case is just region.grid is not None.
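
As a rough illustration of the "grid is just one case" idea, here is a minimal sketch. The dataclasses are simplified stand-ins written for this example, not scriva's actual definitions; only the field names are taken from the listing above, and grid_cells is a hypothetical helper.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class GridCell:
    # Stand-in for the typed grid field (assumed shape).
    row: int
    col: int
    rowspan: int = 1
    colspan: int = 1


@dataclass
class Region:
    # Stand-in for a Region; the real class carries more fields.
    bbox: Tuple[float, float, float, float]  # (x, y, w, h), origin top-left
    role: Optional[str] = None
    grid: Optional[GridCell] = None
    attrs: dict = field(default_factory=dict)


def grid_cells(regions: List[Region]) -> List[Region]:
    """Hypothetical helper: keep only the regions that belong to a grid."""
    return [r for r in regions if r.grid is not None]


regions = [
    Region(bbox=(0, 0, 100, 20), role="header", grid=GridCell(row=0, col=0)),
    Region(bbox=(0, 40, 300, 120), role="data"),  # free-form paragraph, no grid
]
cells = grid_cells(regions)
```

Code that only cares about tables can filter on grid; everything else can treat both regions identically through bbox and role.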
3. Layout
A Layout is the set of regions on a page plus the relationships between
them. It is the output of a LayoutDetector and the input to a Recognizer.
A layout is a graph, not a list: regions know their parent, their siblings,
their merge-group, and (for grids) their row/column index. Iterating a layout
in reading order is a single call — layout.in_reading_order().
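
How reading order is computed is not specified here. The sketch below shows one naive heuristic (top-to-bottom, then left-to-right within a line) purely to make the idea concrete. The function, its line_tolerance parameter, and the bare-tuple boxes are assumptions for illustration, not the library's algorithm.

```python
from typing import List, Tuple

BBox = Tuple[float, float, float, float]  # (x, y, w, h), origin top-left


def in_reading_order(bboxes: List[BBox], line_tolerance: float = 5.0) -> List[BBox]:
    """Order boxes top-to-bottom, then left-to-right within each line.

    Boxes whose top edge lies within line_tolerance of the first box
    of the current line are treated as part of that line.
    """
    lines: List[List[BBox]] = []
    for box in sorted(bboxes, key=lambda b: b[1]):
        if lines and abs(box[1] - lines[-1][0][1]) < line_tolerance:
            lines[-1].append(box)
        else:
            lines.append([box])
    return [box for line in lines for box in sorted(line, key=lambda b: b[0])]


boxes = [(200, 11, 50, 10), (10, 10, 50, 10), (10, 40, 50, 10)]
ordered = in_reading_order(boxes)
```

A real layout graph can do better than this, since parent and merge-group links let it keep nested regions together instead of relying on geometry alone.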
4. Recognition
A Recognition is the recognised content for one region. It carries:
- text: the recognised string (may be None if the region is blank)
- confidence: a score in [0.0, 1.0] (may be None if no scorer ran)
- language: ISO 639-1 if detected
- source: which recogniser produced it
- cache: CacheProvenance(tier, similarity, key) if it came from a cache
- alternatives: optional n-best list
- error: populated when an error_policy other than "abort" swallowed an exception
- attrs: open bag for engine-specific output (token boxes, raw JSON, etc.)
Recognition is immutable. Use .with_text(...), .with_confidence(...),
or .replace(**kwargs) to derive a new one.
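
The immutable-with-derivation pattern can be sketched with a frozen dataclass. This is a stand-in with only three of the fields listed above, assuming the methods behave like dataclasses.replace; it is not the library's implementation.

```python
import dataclasses
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Recognition:
    # Simplified stand-in: the real class carries more fields.
    text: Optional[str] = None
    confidence: Optional[float] = None
    source: Optional[str] = None

    def with_text(self, text: Optional[str]) -> "Recognition":
        return dataclasses.replace(self, text=text)

    def with_confidence(self, confidence: Optional[float]) -> "Recognition":
        return dataclasses.replace(self, confidence=confidence)

    def replace(self, **kwargs) -> "Recognition":
        return dataclasses.replace(self, **kwargs)


raw = Recognition(text="l23", confidence=0.62, source="vlm")
# Each call returns a new object; raw itself is never modified.
fixed = raw.with_text("123").with_confidence(0.95)
```

Because every derivation returns a fresh object, a post-processor can correct text without disturbing the value that an earlier stage or a cache still holds.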
5. PageResult & DocumentResult
A PageResult bundles a page’s Layout and the Recognition for each
region. A DocumentResult is a sequence of PageResults plus document-level
metadata.
DocumentResult knows how to serialise itself — exporters in the pipeline
are for in-run writes; these methods are for ad-hoc inspection:

```python
result.render()               # str; reading-order text
result.to_json()              # dict; schema in exporters.md
result.to_dataframe()         # pandas; one row per region
result.to_excel("out.xlsx")   # writes file
result.to_markdown()          # str
result.show()                 # matplotlib visualisation (optional extra)
```

6. Stage and Pipeline
A Stage is a Protocol:

```python
class Stage(Protocol):
    name: str
    capabilities: frozenset[Capability]

    async def run(self, ctx: Context) -> Context: ...
```

The name defaults to a kebab-cased version of the class name; override it
when you have two of the same kind in one pipeline.
A Pipeline is an ordered list of stages plus a Context that flows through
them. The Context is a typed, mutable bag — ctx.document, ctx.layout,
ctx.recognitions, ctx.events, ctx.cancel. Stages read what they need and
write what they produce.
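
To make the "Context flows through stages" idea concrete, here is a toy sketch. The Context below has only three of the fields named above, the two stages are invented for the example, and run_pipeline is a hypothetical stand-in for whatever the real Pipeline does internally.

```python
import asyncio
from dataclasses import dataclass, field
from typing import List, Optional, Protocol


@dataclass
class Context:
    # Toy context: the document is plain text rather than rendered pages.
    document: str
    layout: Optional[List[str]] = None
    recognitions: List[str] = field(default_factory=list)


class Stage(Protocol):
    name: str

    async def run(self, ctx: Context) -> Context: ...


class SplitLines:
    """Invented layout stage: reads ctx.document, writes ctx.layout."""
    name = "split-lines"

    async def run(self, ctx: Context) -> Context:
        ctx.layout = ctx.document.splitlines()
        return ctx


class UppercaseRecognizer:
    """Invented recogniser stage: reads ctx.layout, writes ctx.recognitions."""
    name = "uppercase-recognizer"

    async def run(self, ctx: Context) -> Context:
        ctx.recognitions = [line.upper() for line in ctx.layout or []]
        return ctx


async def run_pipeline(stages: List[Stage], ctx: Context) -> Context:
    # Each stage receives the context produced by the previous one.
    for stage in stages:
        ctx = await stage.run(ctx)
    return ctx


final = asyncio.run(run_pipeline([SplitLines(), UppercaseRecognizer()], Context("ab\ncd")))
```

Neither stage inherits from Stage; matching the protocol's shape is enough, which is what makes stages easy to mock in tests.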
A pipeline is callable:

```python
result = pipeline(doc)                    # sync; blocks until done
result = await pipeline.aio(doc)          # async coroutine
async for page in pipeline.stream(doc):   # streaming per page
    ...
```

Seven kinds of stage are blessed by the standard pipeline because they correspond to the natural OCR phases:
| Kind | Reads | Writes |
|---|---|---|
| Preprocessor | page | page |
| LayoutDetector | page | layout |
| RegionClassifier | page, layout | layout (role tags) |
| RegionPreprocessor | page, layout | layout (crop overrides, sliced regions) |
| Recognizer | page, layout | recognitions |
| PostProcessor | recognitions | recognitions |
| Exporter | result | (side effect: file) |
RegionPreprocessor is the per-region image-manipulation slot — per-cell
binarisation, slicing a tall region into rows, padding header crops
differently from data crops. See preprocessors.md › Region
preprocessors.
The positional Pipeline(A(), B(), C(), …) constructor slots each stage into
the correct phase by sniffing its Protocol. Out-of-order arguments still
produce the correct chain. You can register your own stage kinds via
Pipeline.builder().stage(...) for power use.
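
The slotting behaviour can be approximated with a stable sort over a fixed phase order. In this sketch each stage declares an explicit kind attribute, which is a simplification: the text above says the real constructor infers the kind from the Protocol a stage satisfies. The stage classes and the slot helper are hypothetical.

```python
from typing import List

# Phase order taken from the table of stage kinds above.
PHASE_ORDER = [
    "Preprocessor",
    "LayoutDetector",
    "RegionClassifier",
    "RegionPreprocessor",
    "Recognizer",
    "PostProcessor",
    "Exporter",
]


class Deskew:
    kind = "Preprocessor"


class GridDetector:
    kind = "LayoutDetector"


class VlmRecognizer:
    kind = "Recognizer"


class Dictionary:
    kind = "PostProcessor"


class RuleSplitter:
    kind = "PostProcessor"


def slot(stages: List[object]) -> List[object]:
    """Sort stages into phase order.

    sorted() is stable, so stages of the same kind keep the order
    in which the caller passed them.
    """
    return sorted(stages, key=lambda s: PHASE_ORDER.index(s.kind))


# Deliberately out of order.
chain = slot([Dictionary(), VlmRecognizer(), RuleSplitter(), GridDetector(), Deskew()])
names = [type(s).__name__ for s in chain]
```

Stability matters for the PostProcessor phase, where several stages of the same kind run in the order the caller wrote them.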
7. Cache
A Cache is a Protocol with two complementary tiers:
- Exact cache: keyed by a deterministic hash of the input bytes plus the
  recogniser configuration. A cache hit returns the previous Recognition
  verbatim.
- Semantic cache: keyed by an embedding of the input region. A hit returns
  the previous result if the cosine similarity exceeds a configurable
  threshold (default 0.99; high, because OCR is unforgiving).
Caches accept string shorthand: cache=".scriva_cache" becomes a
FileSystemCache. Cache.layered(path) gives you both tiers with sensible
defaults. Cache is always plumbed through the recogniser as a wrapper,
never hidden inside an engine adapter, so you can see hit/miss in the event
stream and in recognition.cache.
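
The two tiers can be sketched in a few lines of standard-library Python. Everything here is illustrative: the SHA-256 key, the JSON-serialised configuration, the in-memory storage, and the LayeredCache class are assumptions, not the behaviour of FileSystemCache or Cache.layered.

```python
import hashlib
import json
import math
from typing import Dict, List, Optional, Sequence, Tuple


def exact_key(crop: bytes, config: dict) -> str:
    """Deterministic key: hash of the input bytes plus the sorted config."""
    digest = hashlib.sha256()
    digest.update(crop)
    digest.update(json.dumps(config, sort_keys=True).encode("utf-8"))
    return digest.hexdigest()


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class LayeredCache:
    """Toy two-tier cache: exact lookup first, then nearest embedding."""

    def __init__(self, threshold: float = 0.99) -> None:
        self.threshold = threshold
        self._exact: Dict[str, str] = {}
        self._semantic: List[Tuple[List[float], str]] = []

    def put(self, key: str, embedding: Sequence[float], text: str) -> None:
        self._exact[key] = text
        self._semantic.append((list(embedding), text))

    def get(self, key: str, embedding: Sequence[float]) -> Optional[Tuple[str, str]]:
        """Return (tier, text) on a hit, or None on a miss."""
        if key in self._exact:
            return ("exact", self._exact[key])
        best = max(self._semantic, key=lambda e: cosine(e[0], embedding), default=None)
        if best is not None and cosine(best[0], embedding) > self.threshold:
            return ("semantic", best[1])
        return None


cache = LayeredCache()
key = exact_key(b"crop-bytes", {"model": "m", "lang": "en"})
cache.put(key, [1.0, 0.0], "42")
```

Returning the tier alongside the text mirrors the point made above: a hit should stay visible to the caller rather than being hidden inside an adapter.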
8. SampleStore
A SampleStore is labelled-crop persistence. It is to labelled crops
what Cache is to recognised crops: an opt-in protocol whose adapters
back onto filesystem, sqlite, pgvector, or S3.

```python
class SampleStore(Engine, Protocol):
    async def put(self, sample: Sample) -> SampleId: ...
    async def get(self, id: SampleId) -> Sample | None: ...
    async def find(self, *, where=None, near=None, limit=50) -> Sequence[Sample]: ...
    async def remove(self, id: SampleId) -> None: ...
```

Each Sample carries a crop, the primary recogniser’s output, an optional
human label, an optional embedding, and a source pointer. Four things in
the library use sample stores: the samples exporter, the embedding
classifier’s training endpoint, few-shot retrieval via
RecognitionHint.from_store(...), and supervised correction via
postprocess.dictionary.from_samples(...). The last two read the same
labelled crops on opposite sides of the recogniser; see samples.md ›
Two roads from a SampleStore.
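
For tests or experiments, a store with this shape can be backed by a plain dictionary. The sketch below is a hypothetical in-memory adapter: the Sample fields are reduced to three, ids are simple strings, where is assumed to be a predicate function, and the near argument is omitted because it would need embeddings.

```python
import asyncio
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class Sample:
    # Reduced stand-in for a Sample.
    crop: bytes
    predicted: str
    label: Optional[str] = None


class InMemorySampleStore:
    """Hypothetical dict-backed store with the same four operations."""

    def __init__(self) -> None:
        self._items: Dict[str, Sample] = {}
        self._next = 0

    async def put(self, sample: Sample) -> str:
        sample_id = f"s{self._next}"
        self._next += 1
        self._items[sample_id] = sample
        return sample_id

    async def get(self, id: str) -> Optional[Sample]:
        return self._items.get(id)

    async def find(
        self, *, where: Optional[Callable[[Sample], bool]] = None, limit: int = 50
    ) -> List[Sample]:
        matches = [s for s in self._items.values() if where is None or where(s)]
        return matches[:limit]

    async def remove(self, id: str) -> None:
        self._items.pop(id, None)


async def demo():
    store = InMemorySampleStore()
    first = await store.put(Sample(b"\x00", "l23", label="123"))
    await store.put(Sample(b"\x01", "abc"))
    labelled = await store.find(where=lambda s: s.label is not None)
    await store.remove(first)
    return labelled, await store.get(first)


labelled, gone = asyncio.run(demo())
```

Filtering on label is not None is the kind of query both few-shot retrieval and dictionary correction would plausibly start from, since both need human-confirmed text.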
9. Engine (the unifying protocol)
Detectors, recognisers, post-processors, scorers, and exporters all conform to
a single root Engine protocol with three attributes:

```python
class Engine(Protocol):
    name: str
    version: str
    capabilities: frozenset[Capability]
```

capabilities is what lets a pipeline assemble itself or warn you when a
recogniser cannot, say, return token-level boxes. It is also what the
capability negotiation at build time uses — passing a recogniser without
Capability.LANGUAGE_DETECTION while also configuring a language-aware
post-processor is a build-time error, not a runtime surprise.
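
A build-time check of this kind reduces to a set difference. The sketch below assumes each consuming stage exposes a requires set, which is a hypothetical attribute invented for the example, as are BuildError, check_capabilities, and the two stage classes; only Capability.LANGUAGE_DETECTION is named in the text above.

```python
from enum import Enum, auto
from typing import Iterable


class Capability(Enum):
    LANGUAGE_DETECTION = auto()
    TOKEN_BOXES = auto()


class BuildError(Exception):
    """Raised while assembling a pipeline, before any page is processed."""


class PlainRecognizer:
    name = "plain"
    version = "0.1"
    capabilities = frozenset()  # offers nothing beyond plain text


class LanguageAwarePostProcessor:
    name = "language-fixer"
    version = "0.1"
    capabilities = frozenset()
    requires = frozenset({Capability.LANGUAGE_DETECTION})  # hypothetical attribute


def check_capabilities(recogniser, consumers: Iterable[object]) -> None:
    """Fail fast if any consumer needs something the recogniser lacks."""
    for consumer in consumers:
        missing = consumer.requires - recogniser.capabilities
        if missing:
            wanted = ", ".join(sorted(c.name for c in missing))
            raise BuildError(
                f"{consumer.name} needs {wanted}, which {recogniser.name} does not provide"
            )


message = ""
try:
    check_capabilities(PlainRecognizer(), [LanguageAwarePostProcessor()])
except BuildError as exc:
    message = str(exc)
```

Running the check while the pipeline is being built means the mismatch surfaces before any document is loaded or any model call is paid for.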
Putting it together
Concretely, one full run of a grid-form OCR looks like this:

```
Document
  ──► Preprocessor        (deskew, denoise)
  ──► LayoutDetector      (grid → Layout of cell Regions)
  ──► RegionClassifier    (mark blank cells, expand merged cells)
  ──► RegionPreprocessor  (pad data cells, binarise, slice overflow)
  ──► Recognizer          (VLM call per non-blank cell, cache-aware,
                           hints=RecognitionHint.from_store(samples) for few-shot)
  ──► PostProcessor: dictionary        (from yaml + from_samples)
  ──► PostProcessor: rule_splitter     (internal ruled lines)
  ──► PostProcessor: confidence_score  (round-trip rendering match)
  ──► Exporter            (excel writes .xlsx with merges and colours)
```

Every arrow is a Stage. Every box is something you can swap, mock, or
extend. That is the whole library.