Domain packs

A domain pack is a pre-built pipeline for a specific document family. Each one is a function that returns a ready-to-call Pipeline, so you can read the source for any of them, copy it, and adapt it.

import scriva
pipeline = scriva.domains.forms.tabular(language="ja")
result = pipeline("scan.pdf")

Domain packs are examples, not abstractions. They live in scriva/domains/ and depend only on adapters already in the library.
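The "function that returns a Pipeline" pattern can be sketched without the library. The Pipeline class below is a hypothetical stand-in, not scriva's implementation, and shouting_pack and its stages are invented for illustration only:

```python
from typing import Any, Callable

Stage = Callable[[Any], Any]


class Pipeline:
    """Hypothetical stand-in: runs each stage on the previous stage's output."""

    def __init__(self, *stages: Stage) -> None:
        self.stages = list(stages)

    def __call__(self, value: Any) -> Any:
        for stage in self.stages:
            value = stage(value)
        return value


def shouting_pack(*, suffix: str = "!") -> Pipeline:
    """A toy 'domain pack': keyword-only options in, configured Pipeline out."""
    return Pipeline(
        str.strip,                   # stand-in for a detection stage
        str.upper,                   # stand-in for a recognition stage
        lambda text: text + suffix,  # stand-in for a post-processing stage
    )


pipeline = shouting_pack(suffix="?")
result = pipeline("  hello ")
print(result)  # HELLO?
```

The pack itself holds no logic beyond choosing and configuring stages, which is what makes it easy to copy and adapt.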

`domains.forms`

Tabular forms with visible rules (scanned spreadsheets, mill sheets, inspection reports).

forms.tabular(
    *,
    language: str = "en",
    model: str = "gpt-4o",
    cache: Cache | str = ".scriva_cache",
    excel_out: Path | str | None = None,
    json_out: Path | str | None = None,
) -> Pipeline

Composes:

deskew
  ──► detect.fallback(morphological_grid, hough_grid)
  ──► classify.hybrid (rule + embedding)
  ──► recognize.openai (Prompt.ocr localised)
       cache: layered (filesystem + vector)
  ──► postprocess.whitespace
  ──► postprocess.blank_suppress
  ──► postprocess.dictionary (if dict.yaml exists)
  ──► postprocess.rule_splitter
  ──► postprocess.confidence_score.rendering
  ──► export.excel + export.json_

The result has the same shape as the output of the original ocr-agent, but every stage is replaceable from the call site:

pipeline = scriva.domains.forms.tabular(language="ja")
pipeline.replace("recognize", scriva.recognize.anthropic(model="claude-opus-4-7"))
result = pipeline("scan.pdf")

If you need to insert a stage into a domain pack mid-chain, ask for the underlying builder:

builder = scriva.domains.forms.tabular.builder(language="ja")  # not built yet
builder.insert_after("recognize", scriva.postprocess.language_detector())
pipeline = builder.build()
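One way to picture replace and insert_after is as edits on an ordered list of named stages. The PipelineBuilder below is a hypothetical stand-in written for this page, not scriva's builder; in particular, it takes an explicit name for the inserted stage, whereas the calls above suggest scriva stages carry their own names:

```python
from typing import Any, Callable

Stage = Callable[[Any], Any]


class PipelineBuilder:
    """Hypothetical stand-in: an ordered list of (name, stage) pairs."""

    def __init__(self, stages: list[tuple[str, Stage]]) -> None:
        self._stages = list(stages)

    def _index(self, name: str) -> int:
        for i, (stage_name, _) in enumerate(self._stages):
            if stage_name == name:
                return i
        raise KeyError(f"no stage named {name!r}")

    def names(self) -> list[str]:
        return [name for name, _ in self._stages]

    def replace(self, name: str, stage: Stage) -> "PipelineBuilder":
        # Swap the callable but keep the slot's name and position.
        self._stages[self._index(name)] = (name, stage)
        return self

    def insert_after(self, name: str, new_name: str, stage: Stage) -> "PipelineBuilder":
        self._stages.insert(self._index(name) + 1, (new_name, stage))
        return self

    def build(self) -> Stage:
        stages = [stage for _, stage in self._stages]

        def run(value: Any) -> Any:
            for stage in stages:
                value = stage(value)
            return value
        return run


builder = PipelineBuilder([("recognize", str.upper), ("export", lambda s: s + ".")])
builder.insert_after("recognize", "trim", str.strip)
builder.replace("recognize", str.lower)
stage_names = builder.names()
output = builder.build()("  Scan ")
print(stage_names, output)  # ['recognize', 'trim', 'export'] scan.
```

Addressing stages by name rather than by index keeps call-site edits valid when a pack later gains or loses a stage elsewhere in the chain.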

`domains.pid`

Piping and instrumentation diagrams (P&IDs). Different layout, different recognizer prompt, different output schema.

pid.diagram(
    *,
    box_detector: LayoutDetector | None = None,
    model: str = "gpt-4o",
    json_out: Path | str,
) -> Pipeline

Composes:

detect.whole_page (or your box detector)
  ──► recognize.openai (Prompt.structured(schema=PidSymbol))
  ──► postprocess.whitespace
  ──► export.json_

The output is a list of {type, text, bbox} objects, not a grid. The exporter exists because P&ID consumers want a stable structured dump, not human-readable Excel.
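As an illustration of the record shape, the sketch below models the three keys with a dataclass and serialises them deterministically. PidSymbol here is a stand-in (the real schema type is not shown on this page), the (x, y, width, height) bbox layout is an assumption, and sorting by position is just one possible reading of "stable":

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class PidSymbol:
    """Illustrative stand-in; only the three key names come from the text above."""
    type: str
    text: str
    bbox: tuple[int, int, int, int]  # assumed (x, y, width, height)


def dump_symbols(symbols: list[PidSymbol]) -> str:
    # Sort by position (top to bottom, then left to right) and fix the key
    # order so that repeated runs produce byte-identical output.
    ordered = sorted(symbols, key=lambda s: (s.bbox[1], s.bbox[0]))
    return json.dumps([asdict(s) for s in ordered], sort_keys=True, indent=2)


symbols = [
    PidSymbol(type="valve", text="V-101", bbox=(240, 80, 32, 32)),
    PidSymbol(type="pump", text="P-7", bbox=(40, 80, 48, 48)),
]
dumped = dump_symbols(symbols)
print(dumped)
```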

`domains.annotation`

The annotation pack runs a primary recognizer over every region, routes the most-uncertain k to an oracle, and persists both versions to a SampleStore for downstream training. It is the reference workflow for active learning on top of scriva.

annotation.review(
    *,
    primary: Recognizer,
    oracle: Recognizer,
    store: SampleStore,
    k: int = 20,                              # how many uncertain regions to route to the oracle
    json_out: Path | str,
) -> Pipeline

Composes:

detect.morphological_grid
  ──► classify.rule_based
  ──► recognize.uncertainty_first(primary, oracle, k=k, store=store)
  ──► postprocess.whitespace
  ──► postprocess.confidence_score.rendering
  ──► export.json_ + export.samples(store)
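The routing performed by recognize.uncertainty_first can be sketched in a few lines. This is a simplified model rather than the library's code: it assumes each reading carries a confidence in [0, 1], that the oracle's reading replaces the primary's in the final result, and it uses a plain list where scriva uses a SampleStore:

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Reading:
    text: str
    confidence: float  # assumed to lie in [0, 1]; higher means more certain


Recognizer = Callable[[str], Reading]


def uncertainty_first(regions, primary: Recognizer, oracle: Recognizer, k: int, store: list):
    """Run `primary` everywhere, then re-read the k least confident regions with `oracle`."""
    readings = {region: primary(region) for region in regions}
    # Least confident first; ties keep the original region order (sorted is stable).
    uncertain = sorted(readings, key=lambda r: readings[r].confidence)[:k]
    for region in uncertain:
        second = oracle(region)
        # Persist both versions so a later training step can compare them.
        store.append({"region": region, "primary": readings[region], "oracle": second})
        readings[region] = second
    return readings


# Toy recognizers: the primary is unsure about region "b".
def primary(region: str) -> Reading:
    return Reading(text=region.upper(), confidence=0.4 if region == "b" else 0.9)


def oracle(region: str) -> Reading:
    return Reading(text=f"{region.upper()}*", confidence=0.99)


store: list = []
final = uncertainty_first(["a", "b", "c"], primary, oracle, k=1, store=store)
print(final["b"].text, len(store))  # B* 1
```

Capping the oracle at k regions bounds the cost of the more expensive recognizer per document, whatever the page size.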

A typical loop has three steps:

import scriva
from scriva import samples
from scriva.recognize import openai, anthropic

store = samples.fs(".scriva_samples")

# 1. Run the annotation pipeline over a batch
pipeline = scriva.domains.annotation.review(
    primary=openai(model="gpt-4o", cache=".scriva_cache"),
    oracle=anthropic(model="claude-opus-4-7"),
    store=store,
    k=20,
    json_out="annotations.json",
)
pipeline("scan.pdf")

# 2. (Optional) human review — mark good samples in the store

# 3. Train a head on the labelled samples
clf = scriva.classify.embedding.train(
    samples=store,
    where=lambda s: s.label is not None,
    embedder=scriva.embedders.OpenAIEmbedder(model="text-embedding-3-small"),
)
clf.save("models/cells.joblib")

The pack does not bundle a UI — annotation review is the application’s job. See samples.put / .with_label for the machine-readable side of the loop.

`domains.agentic`

Free-form documents where the schema is not known up front. Two-phase:

1. Schema discovery: repeated VLM calls produce a candidate field list.
2. Row OCR + extraction: OCR the document, then prompt the VLM with the discovered schema to extract field→value pairs.

agentic.extract(
    *,
    schema: type[BaseModel] | None = None,  # if None, discover
    rounds: int = 3,
    model: str = "gpt-4o",
    json_out: Path | str,
) -> Pipeline

The output is one row per discovered field with field, value, source_region, and note.

When you already have a pydantic.BaseModel, pass it as schema and skip the discovery phase entirely.
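The two phases can be sketched with stubs in place of the VLM and OCR calls. Everything below is hypothetical: the majority vote used to merge discovery rounds and the naive "name: value" matcher are assumptions made for this sketch, not a description of how agentic.extract works internally. Only the row keys (field, value, source_region, note) come from the text above.

```python
from collections import Counter
from typing import Callable


def discover_fields(propose: Callable[[int], list[str]], rounds: int = 3) -> list[str]:
    """Phase 1: call `propose` once per round and keep fields seen in most rounds.

    The majority vote is an assumption made for this sketch.
    """
    counts: Counter[str] = Counter()
    for round_index in range(rounds):
        counts.update(set(propose(round_index)))
    return sorted(name for name, seen in counts.items() if seen * 2 > rounds)


def extract_rows(fields: list[str], lines: dict[str, str]) -> list[dict]:
    """Phase 2: one row per field, using a naive 'name: value' match as the extractor."""
    rows = []
    for field in fields:
        region = next(
            (r for r, text in lines.items() if text.lower().startswith(field + ":")),
            None,
        )
        value = lines[region].split(":", 1)[1].strip() if region else None
        rows.append({"field": field, "value": value, "source_region": region,
                     "note": "" if region else "not found"})
    return rows


# Stub model output for three discovery rounds, and stub OCR lines keyed by region id.
proposals = [["vendor", "total"], ["vendor", "total", "po"], ["total", "vendor"]]
ocr_lines = {"r1": "Vendor: Acme", "r2": "Total: 42.00"}

fields = discover_fields(lambda i: proposals[i], rounds=3)
rows = extract_rows(fields, ocr_lines)
print(fields)  # ['total', 'vendor']
```

When the field list is already known, the equivalent of passing schema is to skip discover_fields and hand the known names straight to extract_rows.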

Writing your own domain pack

A domain pack is a function that returns a Pipeline. Keep them flat and readable in one pass:

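import scriva
from pydantic           import BaseModel
from scriva.detect      import whole_page
from scriva.recognize   import openai
from scriva.postprocess import whitespace
from scriva.export      import json_
from scriva.prompts     import Prompt

class Invoice(BaseModel):
    # Illustrative schema (assumed to be a pydantic model, as in agentic.extract);
    # replace these placeholder fields with the ones your documents carry.
    number: str
    total: float

def invoice(*, model: str = "gpt-4o", json_out: str) -> scriva.Pipeline:
    # Whole page in, structured Invoice fields out, written to json_out.
    return scriva.Pipeline(
        whole_page(),
        openai(model=model, prompt=Prompt.structured(schema=Invoice)),
        whitespace(),
        json_(json_out),
    )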

Domain packs do not extend the library; they configure it. Anything they need that is not in core should land as a generic adapter first, then be used from the domain pack.