
Domain packs

A domain pack is a pre-built pipeline for a specific document family. Each one is a function that returns a ready-to-call Pipeline — you can read the source for any of them, copy it, and adapt it.

import scriva
pipeline = scriva.domains.forms.tabular(language="ja")
result = pipeline("scan.pdf")

Domain packs are examples, not abstractions. They live in scriva/domains/ and depend only on adapters already in the library.

forms.tabular handles tabular forms with visible rules, that is, ruled grid lines (scanned spreadsheets, mill sheets, inspection reports).

forms.tabular(
    *,
    language: str = "en",
    model: str = "gpt-4o",
    cache: Cache | str = ".scriva_cache",
    excel_out: Path | str | None = None,
    json_out: Path | str | None = None,
) -> Pipeline

Composes:

deskew
──► detect.fallback(morphological_grid, hough_grid)
──► classify.hybrid (rule + embedding)
──► recognize.openai (Prompt.ocr localised)
      cache: layered (filesystem + vector)
──► postprocess.whitespace
──► postprocess.blank_suppress
──► postprocess.dictionary (if dict.yaml exists)
──► postprocess.rule_splitter
──► postprocess.confidence_score.rendering
──► export.excel + export.json_
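The detect.fallback step in the chain above can be pictured as "try the first detector, and fall through to the next if it finds nothing". The sketch below is a minimal illustration of that idea only. The `fallback` function, the `Region` tuple layout, and both stub detectors are hypothetical stand-ins, and the rule "an empty result triggers the next detector" is an assumption, not something this page specifies about scriva.

```python
from typing import Callable, Sequence

# Assumption: a region is (x, y, width, height); a detector maps a page to regions.
Region = tuple[int, int, int, int]
Detector = Callable[[object], Sequence[Region]]


def fallback(*detectors: Detector) -> Detector:
    """Return a detector that tries each given detector in order.

    The first non-empty result wins. If every detector comes back empty,
    the combined detector returns an empty list.
    """
    def detect(page: object) -> list[Region]:
        for detector in detectors:
            regions = list(detector(page))
            if regions:
                return regions
        return []

    return detect


# Hypothetical stand-ins for morphological_grid and hough_grid.
def morphological_stub(page: object) -> list[Region]:
    return []  # pretend the ruled lines were too faint to form a grid


def hough_stub(page: object) -> list[Region]:
    return [(0, 0, 100, 20), (0, 20, 100, 20)]


detect = fallback(morphological_stub, hough_stub)
regions = detect("page-1")  # the second detector's regions are used
```

Ordering matters in this sketch: the detector listed first is preferred whenever it finds anything at all.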

The result has the same shape as the one the original ocr-agent produced, but every stage is replaceable from the call site:

pipeline = scriva.domains.forms.tabular(language="ja")
pipeline.replace("recognize", scriva.recognize.anthropic(model="claude-opus-4-7"))
result = pipeline("scan.pdf")
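To make the idea of replacing a stage by name concrete, here is a toy model of a pipeline as an ordered list of named stages. `ToyPipeline` is a hypothetical stand-in written for this illustration; it is not scriva's Pipeline, and the stage names and lambdas are made up.

```python
from typing import Any, Callable

Stage = Callable[[Any], Any]


class ToyPipeline:
    """Hypothetical stand-in for a pipeline: an ordered list of named stages."""

    def __init__(self, *stages: tuple[str, Stage]) -> None:
        self._stages = list(stages)

    def replace(self, name: str, stage: Stage) -> None:
        """Swap the stage registered under `name`, keeping its position."""
        for i, (stage_name, _) in enumerate(self._stages):
            if stage_name == name:
                self._stages[i] = (name, stage)
                return
        raise KeyError(f"no stage named {name!r}")

    def __call__(self, value: Any) -> Any:
        # Feed the output of each stage into the next one.
        for _, stage in self._stages:
            value = stage(value)
        return value


pipeline = ToyPipeline(
    ("detect", lambda path: [f"{path}#cell0", f"{path}#cell1"]),
    ("recognize", lambda cells: [f"model-a:{c}" for c in cells]),
    ("postprocess", lambda texts: [t.strip() for t in texts]),
)
before = pipeline("scan.pdf")

# Only the recognizer changes; detection and post-processing stay in place.
pipeline.replace("recognize", lambda cells: [f"model-b:{c}" for c in cells])
after = pipeline("scan.pdf")
```

Because the replacement keeps the stage's position, the surrounding stages do not need to know that the recognizer changed.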

If you need to insert a stage into a domain pack mid-chain, ask for the underlying builder:

builder = scriva.domains.forms.tabular.builder(language="ja") # not built yet
builder.insert_after("recognize", scriva.postprocess.language_detector())
pipeline = builder.build()
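The builder pattern above separates an editable stage list from the finished, callable pipeline. A rough sketch of that separation follows. `ToyBuilder` is hypothetical, and unlike the call shown above its `insert_after` takes an explicit name for the new stage, which is an assumption made to keep the toy self-contained.

```python
from typing import Any, Callable

Stage = Callable[[Any], Any]


class ToyBuilder:
    """Hypothetical stand-in for a pipeline builder: editable until build()."""

    def __init__(self, stages: list[tuple[str, Stage]]) -> None:
        self._stages = list(stages)

    def insert_after(self, anchor: str, name: str, stage: Stage) -> "ToyBuilder":
        """Insert `stage` directly after the stage called `anchor`."""
        names = [n for n, _ in self._stages]
        if anchor not in names:
            raise KeyError(f"no stage named {anchor!r}")
        self._stages.insert(names.index(anchor) + 1, (name, stage))
        return self

    def build(self) -> Stage:
        # Freeze a copy so later edits to the builder do not leak into the pipeline.
        stages = tuple(self._stages)

        def run(value: Any) -> Any:
            for _, stage in stages:
                value = stage(value)
            return value

        return run


builder = ToyBuilder([
    ("recognize", lambda cells: [c.upper() for c in cells]),
    ("export", lambda texts: ", ".join(texts)),
])
builder.insert_after("recognize", "tag_language", lambda texts: [f"{t}[en]" for t in texts])
pipeline = builder.build()
output = pipeline(["a", "b"])
```

Freezing the stage list in `build()` is one way to keep "not built yet" and "built" distinct: edits made to the builder afterwards affect only pipelines built later.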

pid.diagram handles piping & instrumentation diagrams (P&IDs). Compared with forms.tabular, it uses a different layout, a different recogniser prompt, and a different output schema.

pid.diagram(
    *,
    box_detector: LayoutDetector | None = None,
    model: str = "gpt-4o",
    json_out: Path | str,
) -> Pipeline

Composes:

detect.whole_page (or your box detector)
──► recognize.openai (Prompt.structured(schema=PidSymbol))
──► postprocess.whitespace
──► export.json_

The output is a list of {type, text, bbox} objects, not a grid. The exporter exists because P&ID consumers want a stable structured dump, not human-readable Excel.
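As an illustration of the flat {type, text, bbox} shape, the sketch below serialises two made-up symbols to JSON. The `Symbol` dataclass is a hypothetical stand-in for PidSymbol; the bbox layout (x, y, width, height) and the sample values are assumptions chosen for the example, not output from a real run.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class Symbol:
    """Hypothetical stand-in for PidSymbol; field names follow the text above."""
    type: str
    text: str
    bbox: tuple[int, int, int, int]  # assumed (x, y, width, height) in pixels


# Illustrative values only.
symbols = [
    Symbol(type="valve", text="V-101", bbox=(120, 340, 48, 48)),
    Symbol(type="instrument", text="PT-204", bbox=(410, 95, 60, 60)),
]

# A flat list of objects with no row or column structure.
payload = json.dumps([asdict(s) for s in symbols], indent=2)
```

A consumer can read such a dump back with `json.loads(payload)` and filter on `type` without knowing anything about page layout.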

The annotation pack runs a primary recognizer over every region, routes the most-uncertain k to an oracle, and persists both versions to a SampleStore for downstream training. It is the reference workflow for active learning on top of scriva.

annotation.review(
    *,
    primary: Recognizer,
    oracle: Recognizer,
    store: SampleStore,
    k: int = 20,  # how many uncertain regions to route to the oracle
    json_out: Path | str,
) -> Pipeline

Composes:

detect.morphological_grid
──► classify.rule_based
──► recognize.uncertainty_first(primary, oracle, k=k, store=store)
──► postprocess.whitespace
──► postprocess.confidence_score.rendering
──► export.json_ + export.samples(store)
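The routing done by recognize.uncertainty_first can be sketched in a few lines: read everything with the primary recognizer, re-read the k least confident regions with the oracle, and keep both readings. Everything in the block below is a hypothetical stand-in (the `Reading` type, the plain-list store, the stub recognizers), and two points are assumptions rather than documented behaviour: that confidence is a float where higher means more certain, and that the oracle's reading replaces the primary's in the returned result.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Reading:
    region: str
    text: str
    confidence: float  # assumption: higher means more certain


Recognizer = Callable[[str], Reading]


def uncertainty_first(
    regions: list[str],
    primary: Recognizer,
    oracle: Recognizer,
    k: int,
    store: list[tuple[Reading, Reading]],
) -> list[Reading]:
    """Read every region with `primary`, then re-read the k least confident with `oracle`."""
    readings = [primary(region) for region in regions]
    # Indices sorted from least to most confident.
    by_confidence = sorted(range(len(readings)), key=lambda i: readings[i].confidence)
    for i in by_confidence[:k]:
        second_opinion = oracle(readings[i].region)
        store.append((readings[i], second_opinion))  # persist both versions
        readings[i] = second_opinion  # assumption: the oracle's reading wins
    return readings


# Stub recognizers with fixed, made-up confidences.
scores = {"r0": 0.95, "r1": 0.40, "r2": 0.80, "r3": 0.55}


def primary(region: str) -> Reading:
    return Reading(region, f"primary:{region}", scores[region])


def oracle(region: str) -> Reading:
    return Reading(region, f"oracle:{region}", 1.0)


store: list[tuple[Reading, Reading]] = []
final = uncertainty_first(list(scores), primary, oracle, k=2, store=store)
```

With k=2, only r1 and r3 (the two lowest confidences) reach the oracle, so the oracle's cost scales with k and not with the number of regions.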

A typical loop is three calls:

import scriva
from scriva import samples
from scriva.recognize import openai, anthropic

store = samples.fs(".scriva_samples")

# 1. Run the annotation pipeline over a batch
pipeline = scriva.domains.annotation.review(
    primary=openai(model="gpt-4o", cache=".scriva_cache"),
    oracle=anthropic(model="claude-opus-4-7"),
    store=store,
    k=20,
    json_out="annotations.json",
)
pipeline("scan.pdf")

# 2. (Optional) human review: mark good samples in the store

# 3. Train a head on the labelled samples
clf = scriva.classify.embedding.train(
    samples=store,
    where=lambda s: s.label is not None,
    embedder=scriva.embedders.OpenAIEmbedder(model="text-embedding-3-small"),
)
clf.save("models/cells.joblib")

The pack does not bundle a UI — annotation review is the application’s job. See samples.put / .with_label for the machine-readable side of the loop.

agentic.extract handles free-form documents where the schema is not known up front. It runs in two phases:

  1. Schema discovery: repeated VLM calls produce a candidate field list.
  2. Row OCR + extraction: OCR the document, then prompt the VLM with the discovered schema to extract field→value pairs.

agentic.extract(
    *,
    schema: type[BaseModel] | None = None,  # if None, discover
    rounds: int = 3,
    model: str = "gpt-4o",
    json_out: Path | str,
) -> Pipeline

The output is one row per discovered field with field, value, source_region, and note.

When you already have a pydantic.BaseModel, pass it as schema and skip the discovery phase entirely.
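The two phases can be sketched with stubbed model calls. The block below is an illustration under stated assumptions, not the pack's implementation: `discover_schema`, `extract`, `propose`, and `lookup` are hypothetical names, the stub data is made up, and merging rounds by majority vote is one plausible way to turn repeated calls into a single field list (this page does not say how scriva combines them).

```python
from collections import Counter
from typing import Callable, Optional


def discover_schema(propose: Callable[[int], list[str]], rounds: int = 3) -> list[str]:
    """Phase 1: call `propose` once per round and keep fields seen in a majority of rounds."""
    votes = Counter()
    for round_index in range(rounds):
        votes.update(set(propose(round_index)))  # one vote per field per round
    return sorted(name for name, count in votes.items() if count * 2 > rounds)


def extract(
    fields: list[str],
    lookup: Callable[[str], tuple[Optional[str], Optional[str]]],
) -> list[dict]:
    """Phase 2: build one row per field with field, value, source_region, and note."""
    rows = []
    for field in fields:
        value, region = lookup(field)
        rows.append({
            "field": field,
            "value": value,
            "source_region": region,
            "note": "" if value is not None else "not found",
        })
    return rows


# Stub standing in for three VLM rounds of schema proposals.
proposals = [
    ["invoice_no", "date", "total"],
    ["invoice_no", "total", "vendor"],
    ["invoice_no", "date"],
]
# Stub standing in for OCR plus extraction: field -> (value, source region).
ocr_hits = {"invoice_no": ("A-17", "p1:r3"), "date": ("2024-05-01", "p1:r4")}

fields = discover_schema(lambda i: proposals[i], rounds=3)
rows = extract(fields, lambda name: ocr_hits.get(name, (None, None)))
```

When a schema is supplied up front, only the second function would be needed in this sketch, with the field list taken from the model's declared fields.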

To write your own, remember that a domain pack is just a function that returns a Pipeline. Keep it flat, so that it can be read top to bottom in one pass:

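# scriva/domains/invoices.py
from pydantic import BaseModel

import scriva
from scriva.detect import whole_page
from scriva.recognize import openai
from scriva.postprocess import whitespace
from scriva.export import json_
from scriva.prompts import Prompt


class Invoice(BaseModel):
    # Illustrative fields only; assumes Prompt.structured accepts a pydantic model.
    number: str
    total: float


def invoice(*, model: str = "gpt-4o", json_out: str) -> scriva.Pipeline:
    # One flat chain: detect, recognize, clean up, export.
    return scriva.Pipeline(
        whole_page(),
        openai(model=model, prompt=Prompt.structured(schema=Invoice)),
        whitespace(),
        json_(json_out),
    )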

Domain packs do not extend the library; they configure it. Anything they need that is not in core should land as a generic adapter first, then be used from the domain pack.