Domain packs
A domain pack is a pre-built pipeline for a specific document family.
Each one is a function that returns a ready-to-call Pipeline — you can
read the source for any of them, copy it, and adapt it.
```python
import scriva

pipeline = scriva.domains.forms.tabular(language="ja")
result = pipeline("scan.pdf")
```
Domain packs are examples, not abstractions. They live in
scriva/domains/ and depend only on adapters already in the library.
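To make the "function that returns a ready-to-call pipeline" idea concrete, here is a minimal, library-free sketch. `ToyPipeline` and `toy_pack` are hypothetical stand-ins invented for illustration; they are not scriva's classes and say nothing about how scriva implements `Pipeline`.

```python
from typing import Any, Callable

Stage = Callable[[Any], Any]


class ToyPipeline:
    """Stand-in pipeline: runs each stage on the previous stage's output."""

    def __init__(self, *stages: Stage) -> None:
        self.stages = list(stages)

    def __call__(self, value: Any) -> Any:
        for stage in self.stages:
            value = stage(value)
        return value


def toy_pack(*, uppercase: bool = False) -> ToyPipeline:
    """A toy 'domain pack': keyword options in, configured pipeline out."""
    stages: list[Stage] = [str.strip, str.split]
    if uppercase:
        # Options only select or configure existing stages.
        stages.insert(1, str.upper)
    return ToyPipeline(*stages)


pipeline = toy_pack(uppercase=True)
print(pipeline("  mill sheet 42  "))  # ['MILL', 'SHEET', '42']
```

The pack itself holds no state and defines no new stage; it only decides which existing callables go into the chain and in what order.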
domains.forms
Tabular forms with visible rules (scanned spreadsheets, mill sheets, inspection reports).
```python
forms.tabular(
    *,
    language: str = "en",
    model: str = "gpt-4o",
    cache: Cache | str = ".scriva_cache",
    excel_out: Path | str | None = None,
    json_out: Path | str | None = None,
) -> Pipeline
```
Composes:
```
deskew
  ──► detect.fallback(morphological_grid, hough_grid)
  ──► classify.hybrid (rule + embedding)
  ──► recognize.openai (Prompt.ocr localised)
        cache: layered (filesystem + vector)
  ──► postprocess.whitespace
  ──► postprocess.blank_suppress
  ──► postprocess.dictionary (if dict.yaml exists)
  ──► postprocess.rule_splitter
  ──► postprocess.confidence_score.rendering
  ──► export.excel + export.json_
```
The result has the same shape as the output of the original ocr-agent, but every
piece is replaceable from the call site:
```python
pipeline = scriva.domains.forms.tabular(language="ja")
pipeline.replace("recognize", scriva.recognize.anthropic(model="claude-opus-4-7"))
result = pipeline("scan.pdf")
```
If you need to insert a stage into a domain pack mid-chain, ask for the underlying builder:
```python
builder = scriva.domains.forms.tabular.builder(language="ja")  # not built yet
builder.insert_after("recognize", scriva.postprocess.language_detector())
pipeline = builder.build()
```
domains.pid
Piping & instrumentation diagrams. Different layout, different recogniser prompt, different output schema.
```python
pid.diagram(
    *,
    box_detector: LayoutDetector | None = None,
    model: str = "gpt-4o",
    json_out: Path | str,
) -> Pipeline
```
Composes:
```
detect.whole_page (or your box detector)
  ──► recognize.openai (Prompt.structured(schema=PidSymbol))
  ──► postprocess.whitespace
  ──► export.json_
```
The output is a list of {type, text, bbox} objects, not a grid. The
exporter exists because P&ID consumers want a stable structured dump, not
human-readable Excel.
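As a rough picture of that flat output, the sketch below serialises a list of {type, text, bbox} records with the standard library. `SymbolRecord` is a hypothetical stand-in (the real schema is `PidSymbol`, whose definition is not shown here), the bbox layout of (x0, y0, x1, y1) pixels is an assumption, and the two sample records are made up.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class SymbolRecord:
    """Hypothetical stand-in for one exported P&ID object."""

    type: str
    text: str
    bbox: tuple[int, int, int, int]  # assumed (x0, y0, x1, y1) in pixels


def dump_symbols(symbols: list[SymbolRecord]) -> str:
    """Serialise a flat list of symbols; there is no row/column structure."""
    return json.dumps([asdict(s) for s in symbols], indent=2, sort_keys=True)


# Made-up records, for illustration only.
symbols = [
    SymbolRecord(type="valve", text="V-101", bbox=(120, 340, 168, 372)),
    SymbolRecord(type="instrument", text="PT-7", bbox=(410, 95, 452, 131)),
]
print(dump_symbols(symbols))
```

Sorting the keys keeps the dump byte-stable across runs, which is one simple way to get the "stable structured dump" a downstream consumer can diff.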
domains.annotation
The annotation pack runs a primary recognizer over every region, routes
the k most uncertain regions to an oracle, and persists both versions to a
SampleStore for downstream training. It is the reference workflow for
active learning on top of scriva.
```python
annotation.review(
    *,
    primary: Recognizer,
    oracle: Recognizer,
    store: SampleStore,
    k: int = 20,  # how many uncertain regions to route to the oracle
    json_out: Path | str,
) -> Pipeline
```
Composes:
```
detect.morphological_grid
  ──► classify.rule_based
  ──► recognize.uncertainty_first(primary, oracle, k=k, store=store)
  ──► postprocess.whitespace
  ──► postprocess.confidence_score.rendering
  ──► export.json_ + export.samples(store)
```
A typical loop is three calls:
```python
import scriva
from scriva import samples
from scriva.recognize import openai, anthropic

store = samples.fs(".scriva_samples")

# 1. Run the annotation pipeline over a batch
pipeline = scriva.domains.annotation.review(
    primary=openai(model="gpt-4o", cache=".scriva_cache"),
    oracle=anthropic(model="claude-opus-4-7"),
    store=store,
    k=20,
    json_out="annotations.json",
)
pipeline("scan.pdf")

# 2. (Optional) human review: mark good samples in the store

# 3. Train a head on the labelled samples
clf = scriva.classify.embedding.train(
    samples=store,
    where=lambda s: s.label is not None,
    embedder=scriva.embedders.OpenAIEmbedder(model="text-embedding-3-small"),
)
clf.save("models/cells.joblib")
```
The pack does not bundle a UI; annotation review is the application's
job. See samples.put / .with_label for the
machine-readable side of the loop.
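The routing step at the heart of this pack (read everything with the primary, re-read the k least confident regions with the oracle, keep both answers) can be sketched without the library. Everything below is hypothetical: `Reading`, `Sample`, and this `uncertainty_first` function are illustrative stand-ins, the confidence scale of 0.0 to 1.0 is an assumption, and the returned list only plays the role that a SampleStore would play in scriva.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Reading:
    region_id: int
    text: str
    confidence: float  # assumed scale: 0.0 (unsure) to 1.0 (certain)


@dataclass
class Sample:
    region_id: int
    primary: str
    oracle: Optional[str] = None  # filled in only for routed regions


def uncertainty_first(
    regions: list[int],
    primary: Callable[[int], Reading],
    oracle: Callable[[int], str],
    k: int,
) -> list[Sample]:
    """Read every region with `primary`; re-read the k least confident with `oracle`."""
    readings = [primary(region) for region in regions]
    # Lowest confidence first; the slice takes at most k regions.
    by_confidence = sorted(readings, key=lambda reading: reading.confidence)
    routed = {reading.region_id for reading in by_confidence[:k]}
    return [
        Sample(
            region_id=reading.region_id,
            primary=reading.text,
            oracle=oracle(reading.region_id) if reading.region_id in routed else None,
        )
        for reading in readings
    ]


# Fake recognizers with made-up confidences, for illustration only.
fake_conf = {0: 0.99, 1: 0.41, 2: 0.87, 3: 0.12}
samples = uncertainty_first(
    regions=[0, 1, 2, 3],
    primary=lambda r: Reading(r, f"primary-{r}", fake_conf[r]),
    oracle=lambda r: f"oracle-{r}",
    k=2,
)
for sample in samples:
    print(sample)
```

With k=2, only regions 3 and 1 reach the oracle, so the oracle's cost scales with k rather than with the number of regions on the page.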
domains.agentic
Free-form documents where the schema is not known up front. Two-phase:
- Schema discovery — repeated VLM calls produce a candidate field list.
- Row OCR + extraction — OCR the document, then prompt the VLM with the discovered schema to extract field→value pairs.
```python
agentic.extract(
    *,
    schema: type[BaseModel] | None = None,  # if None, discover
    rounds: int = 3,
    model: str = "gpt-4o",
    json_out: Path | str,
) -> Pipeline
```
The output is one row per discovered field with field, value,
source_region, and note.
When you already have a pydantic.BaseModel, pass it as schema and skip
the discovery phase entirely.
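The control flow described above (use the supplied schema if there is one, otherwise run several discovery rounds and merge the candidates) might look roughly like the sketch below. This is not scriva's implementation: `discover_fields`, `resolve_fields`, and the `propose` callback are hypothetical, and the merge rule (a case-insensitive union in first-seen order) is an assumption, since the page does not say how candidate lists from different rounds are combined.

```python
from typing import Callable, Optional, Sequence


def discover_fields(
    propose: Callable[[int], Sequence[str]],
    rounds: int = 3,
) -> list[str]:
    """Call `propose` once per round and merge candidates in first-seen order."""
    seen: dict[str, None] = {}  # dict keeps insertion order and drops duplicates
    for round_index in range(rounds):
        for name in propose(round_index):
            seen.setdefault(name.strip().lower(), None)
    return list(seen)


def resolve_fields(
    schema_fields: Optional[Sequence[str]],
    propose: Callable[[int], Sequence[str]],
    rounds: int = 3,
) -> list[str]:
    """Use the caller's schema if given; otherwise fall back to discovery."""
    if schema_fields is not None:
        return list(schema_fields)  # discovery is skipped entirely
    return discover_fields(propose, rounds)


# Canned proposals standing in for three VLM calls (made-up data).
canned = [["Invoice No", "Date"], ["date", "Total"], ["invoice no", "Vendor"]]
print(resolve_fields(None, lambda i: canned[i]))
print(resolve_fields(["id", "amount"], lambda i: canned[i]))
```

The first call prints the merged candidate list; the second returns the supplied fields unchanged and never invokes `propose`.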
Writing your own domain pack
A domain pack is a function that returns a Pipeline. Keep them flat and
read-once:
```python
import scriva
from scriva.detect import whole_page
from scriva.recognize import openai
from scriva.postprocess import whitespace
from scriva.export import json_
from scriva.prompts import Prompt

# Invoice is assumed to be your own pydantic.BaseModel describing the output schema.


def invoice(*, model: str = "gpt-4o", json_out: str) -> scriva.Pipeline:
    return scriva.Pipeline(
        whole_page(),
        openai(model=model, prompt=Prompt.structured(schema=Invoice)),
        whitespace(),
        json_(json_out),
    )
```
Domain packs do not extend the library; they configure it. Anything they need that is not in core should land as a generic adapter first, then be used from the domain pack.
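One way to see why "configure, not extend" keeps packs editable from the call site: if a pack is nothing more than an ordered list of named stages, then by-name edits like the replace and insert_after calls shown for domains.forms are simple list operations. The sketch below is a toy model under that assumption. `NamedStages` is hypothetical and is not scriva's Pipeline or builder; in particular, its `insert_after` takes an explicit name for the new stage, which the scriva call shown earlier does not.

```python
from typing import Any, Callable

Stage = Callable[[Any], Any]


class NamedStages:
    """Toy ordered stage list addressed by name (not scriva's Pipeline)."""

    def __init__(self, stages: list[tuple[str, Stage]]) -> None:
        self._stages = list(stages)

    def _index(self, name: str) -> int:
        for i, (stage_name, _) in enumerate(self._stages):
            if stage_name == name:
                return i
        raise KeyError(f"no stage named {name!r}")

    def replace(self, name: str, stage: Stage) -> None:
        """Swap the stage registered under `name`, keeping its position."""
        self._stages[self._index(name)] = (name, stage)

    def insert_after(self, name: str, new_name: str, stage: Stage) -> None:
        """Insert a new named stage directly after an existing one."""
        self._stages.insert(self._index(name) + 1, (new_name, stage))

    def names(self) -> list[str]:
        return [stage_name for stage_name, _ in self._stages]

    def __call__(self, value: Any) -> Any:
        for _, stage in self._stages:
            value = stage(value)
        return value


pack = NamedStages([("recognize", str.lower), ("whitespace", str.strip)])
pack.replace("recognize", str.upper)
pack.insert_after("recognize", "dedupe_spaces", lambda s: " ".join(s.split()))
print(pack.names())             # ['recognize', 'dedupe_spaces', 'whitespace']
print(pack("  heat   no 7  "))  # HEAT NO 7
```

A pack that subclassed the pipeline or hid stages inside custom methods would not be editable this way, which is a practical argument for landing new behaviour as a generic adapter first.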