Recognizers

A recognizer turns a Region into a Recognition. This is where the actual OCR happens — by calling a vision-language model, a classical OCR binary, or whatever you bring.

Protocol

class Recognizer(Engine, Protocol):
    name: str
    max_concurrency: int

    async def recognize(
        self,
        page: Page,
        region: Region,
        *,
        hint: RecognitionHint | None = None,
    ) -> Recognition: ...

You implement recognize. The pipeline:

Iterates the layout’s recognisable regions (skips role == "blank").
Bounds parallelism with asyncio.Semaphore(max_concurrency).
Wraps each call with the configured cache (if any).
Emits progress events with done/total/cache_hits.

You never write the parallelism, the cache, or the events. The pipeline does.
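The division of labour above can be sketched with the standard library alone. This is an illustration, not scriva's implementation: FakeRegion, Progress, and run_recognizer are hypothetical names, and the cache is a plain dict keyed by region id where a real cache would more plausibly key on the crop contents.

```python
import asyncio
from dataclasses import dataclass


@dataclass
class FakeRegion:
    """Hypothetical stand-in for a Region; only the fields this sketch needs."""
    id: str
    role: str = "text"


@dataclass
class Progress:
    done: int = 0
    total: int = 0
    cache_hits: int = 0


async def run_recognizer(recognize, regions, *, max_concurrency, cache=None, on_progress=None):
    """Recognize every non-blank region with bounded parallelism."""
    todo = [r for r in regions if r.role != "blank"]   # blank regions are skipped
    sem = asyncio.Semaphore(max_concurrency)
    progress = Progress(total=len(todo))

    async def one(region):
        if cache is not None and region.id in cache:
            result = cache[region.id]
            progress.cache_hits += 1
        else:
            async with sem:                            # at most max_concurrency calls in flight
                result = await recognize(region)
            if cache is not None:
                cache[region.id] = result
        progress.done += 1
        if on_progress is not None:
            on_progress(progress)                      # done / total / cache_hits
        return region.id, result

    return dict(await asyncio.gather(*(one(r) for r in todo)))
```

The recognizer passed in only has to answer one region at a time; everything else lives in the wrapper, which is the point of the contract.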

Built-in recognizers

Factories live under scriva.recognize (lowercase functions); the corresponding classes live under scriva.recognizers and can be subclassed:

Factory	Class	Engine	Extra	Capabilities
`recognize.openai(...)`	`OpenAIVisionRecognizer`	OpenAI Vision (`gpt-4o`, etc.)	`openai`	`JSON_OUTPUT`, `LANGUAGE_DETECTION`
`recognize.anthropic(...)`	`AnthropicVisionRecognizer`	Anthropic Vision (`claude-*`)	`anthropic`	`JSON_OUTPUT`, `LANGUAGE_DETECTION`
`recognize.bedrock(...)`	`BedrockVisionRecognizer`	AWS Bedrock (Qwen-VL, Claude)	`bedrock`	`JSON_OUTPUT`
`recognize.tesseract(...)`	`TesseractRecognizer`	tesseract-ocr CLI	`tesseract`	`TOKEN_BOXES`, `LANGUAGE_DETECTION`
`recognize.echo(...)`	`EchoRecognizer`	returns a fixed string	core	—
`recognize.fallback(*chain)`	`FallbackRecognizer`	try each in order on failure	core	union of children
`recognize.uncertainty_first(primary, oracle, *, by=…, k=…)`	`UncertaintyFirstRecognizer`	route k most-uncertain to an oracle	core	union of children
`recognize.consensus(*recogs)`	`ConsensusRecognizer`	parallel vote across recognizers	core	intersection of children

recognize.echo exists so you can build a complete test pipeline without any network or binary dependency.
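The "try each in order on failure" behaviour listed for recognize.fallback is easy to exercise offline with an echo-style stand-in. The sketch below is an assumption about the general shape, not the library's code: recognize_with_fallback and AllRecognizersFailed are hypothetical names, and recognizers are modelled as plain async callables.

```python
import asyncio


class AllRecognizersFailed(Exception):
    """Raised when every recognizer in the chain raised."""


async def recognize_with_fallback(chain, region):
    """Try each async recognizer callable in order; return the first success."""
    errors = []
    for recognize in chain:
        try:
            return await recognize(region)
        except Exception as exc:   # a real implementation would likely narrow this
            errors.append(exc)
    raise AllRecognizersFailed(errors)


async def flaky(region):
    raise TimeoutError("backend unavailable")


async def echo(region):
    return "FIXED TEXT"            # fixed string, no network or binary needed
```

Calling asyncio.run(recognize_with_fallback([flaky, echo], "r1")) falls through the failing recognizer and returns the echo's string.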

`uncertainty_first`

Two-stage dispatch. Every region runs through primary; the k most-uncertain (or all those exceeding a threshold) are then re-recognised by oracle and the oracle’s output wins:

from scriva.recognize import openai, anthropic, uncertainty_first

recognize = uncertainty_first(
    primary=openai(model="gpt-4o", cache=".scriva_cache"),
    oracle=anthropic(model="claude-opus-4-7"),
    by=lambda r: 0.5 - abs((r.confidence or 0.5) - 0.5),  # default: closeness to 0.5
    k=20,                                             # top-k uncertain
    # threshold=0.3,                                  # alternative to k
    store=None,                                       # optional SampleStore
)

Pass store= to persist both versions as a Sample: the primary recognition in .recognition, the oracle text in .label. The wrapper does the routing; samples does the persistence. This is the foundation of the annotation workflow in domains.annotation.

by is any Callable[[Recognition], float] returning a “priority” where higher means more interesting. The default (0.5 - |confidence - 0.5|) scores mid-range confidence highest and treats both near-0 and near-1 as decisive; a missing confidence counts as 0.5, the most uncertain value. Pass your own for application-specific routing, e.g. lambda r: 1.0 - (r.confidence or 0) for “low-confidence only”.
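The selection step can be sketched as a sort on the priority. This is a minimal illustration using the "low-confidence only" scorer from the text; FakeRecognition and select_for_oracle are hypothetical names, and treating k and threshold as mutually exclusive is an assumption drawn from the "alternative to k" comment.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FakeRecognition:
    """Hypothetical stand-in for a Recognition."""
    region_id: str
    text: str
    confidence: Optional[float] = None


def low_confidence(r):
    """Priority scorer: higher means more interesting (here, less confident)."""
    return 1.0 - (r.confidence or 0)


def select_for_oracle(recognitions, by=low_confidence, k=None, threshold=None):
    """Return the recognitions to re-run: top-k by priority, or all above threshold."""
    if (k is None) == (threshold is None):
        raise ValueError("pass exactly one of k or threshold")
    ranked = sorted(recognitions, key=by, reverse=True)
    if k is not None:
        return ranked[:k]
    return [r for r in ranked if by(r) > threshold]
```

Only the selected regions incur a second model call, which is what keeps the oracle's cost bounded by k.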

`consensus`

Calls multiple recognizers in parallel and resolves disagreements:

from scriva.recognize import consensus, openai, anthropic, bedrock

recognize = consensus(
    openai(model="gpt-4o"),
    anthropic(model="claude-opus-4-7"),
    bedrock(model="qwen.qwen3-vl-235b-a22b"),
    agreement=0.5,                                   # float: min fraction that must agree
    tiebreaker=None,                                 # Recognizer | None
    on_disagreement="confidence",                    # or "majority", "tiebreaker"
)

The capability set is the intersection of the children — a consensus recognizer only declares what every member supports. The on_disagreement strategies:

"confidence" — highest-confidence wins.
"majority" — most common normalised text wins; ties broken by confidence.
"tiebreaker" — call tiebreaker with hint.text set to the majority answer; the tiebreaker’s output is final.

Cost scales linearly with the number of children. Use it sparingly — pair with cache= on each child so re-runs are free.
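The "confidence" and "majority" strategies can be sketched as follows. Several details here are assumptions rather than documented behaviour: the normalisation (whitespace collapse plus casefold), the rule that a group meeting the agreement fraction wins outright, and the names Vote and resolve. The "tiebreaker" strategy needs a further model call and is left out.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass
class Vote:
    """Hypothetical stand-in for one child's Recognition."""
    text: str
    confidence: Optional[float] = None


def normalise(text):
    # Assumed normalisation: collapse whitespace and ignore case.
    return " ".join(text.split()).casefold()


def _conf(vote):
    return vote.confidence or 0.0


def resolve(votes, agreement=0.5, on_disagreement="confidence"):
    """Pick a winner from the children's votes."""
    if not votes:
        raise ValueError("no votes to resolve")
    groups = defaultdict(list)
    for v in votes:
        groups[normalise(v.text)].append(v)
    # Largest group first; ties between groups broken by their best confidence.
    ranked = sorted(groups.values(),
                    key=lambda g: (len(g), max(map(_conf, g))),
                    reverse=True)
    top = ranked[0]
    if len(top) / len(votes) >= agreement or on_disagreement == "majority":
        return max(top, key=_conf)
    if on_disagreement == "confidence":
        return max(votes, key=_conf)
    raise NotImplementedError("tiebreaker needs a further recognizer call")
```

With three children, an agreement of 0.5 is met as soon as two of them return the same normalised text.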

VLM recognizers, in detail

All VLM-based recognizers share the same options:

Option	Type	Default	Notes
`model`	`str`	required	e.g. `"gpt-4o"`, `"claude-opus-4-7"`
`max_concurrency`	`int`	`8`	Per-recognizer semaphore
`prompt`	`Prompt`	`Prompt.ocr()`	A `Prompt` (see prompts)
`temperature`	`float`	`0.0`	Keeps output as reproducible as the provider allows; raise only deliberately
`max_tokens`	`int`	`500`	Cap per region
`image_detail`	`str`	`"high"`	Provider-specific; passthrough
`crop_padding_px`	`int`	`4`	Pad each region before cropping to give the VLM context
`request_timeout_s`	`float`	`60.0`	Per-call timeout
`retry`	`Retry`	`Retry.default()`	Exponential backoff for 429/5xx
`cache`	`Cache | str | None`	`None`	Wraps the recognizer; see caching.md
`stream`	`bool`	`False`	Emit `recognize/progress` events with partial text as tokens arrive

Crop padding matters: a region cropped exactly to its bbox often loses the top of capital letters or the bottom of descenders. The default 4 px is right for most scans at 300 DPI; raise it for low-DPI inputs.
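As an illustration of what the padding does, the sketch below grows a bounding box by a fixed number of pixels and clamps it to the page. The (left, top, right, bottom) tuple layout and the name padded_bbox are assumptions made for the example; the library's own bbox type may differ.

```python
def padded_bbox(bbox, padding, page_width, page_height):
    """Expand (left, top, right, bottom) by `padding` pixels, clamped to the page."""
    left, top, right, bottom = bbox
    return (
        max(0, left - padding),
        max(0, top - padding),
        min(page_width, right + padding),
        min(page_height, bottom + padding),
    )
```

Clamping matters for regions that touch the page edge, where naive padding would produce negative or out-of-range coordinates.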

Streaming

stream=True opts the recognizer into incremental output. The final Recognition shape is unchanged; only the event stream differs. A streaming recognizer emits additional recognize/progress events with {region_id, partial_text, tokens} payloads as tokens arrive, followed by a final finished event carrying the assembled Recognition. This is useful for long text-block regions when a UI wants to render a typewriter effect, and pure noise for cell-by-cell OCR. Off by default.

cache= is the simplest way to enable caching:

from scriva import Cache                               # assumed import path; see caching.md
from scriva.recognize import openai
openai(model="gpt-4o", cache=".scriva_cache")          # FileSystemCache
openai(model="gpt-4o", cache=Cache.layered(".scriva")) # exact + semantic

Prompts

Recognizers do not bake in a prompt. Each accepts a Prompt object:

from scriva.prompts import Prompt

ocr_en = Prompt.from_template("""
Read every character in this image. If the image is blank, return exactly the
string "BLANK". Return only the recognised text. No commentary.
""")

ocr_ja  = Prompt.ocr(locale="ja")           # built-in localised variants
extract = Prompt.structured(schema=Invoice) # function-calling / structured output
code    = Prompt.layout_aware()             # preserves whitespace

Three built-in prompt families, each a function call:

Prompt.ocr(locale=...) — plain text recognition. Locales en, ja, zh, ko, de, fr, es.
Prompt.structured(schema=...) — structured output bound to a Pydantic model.
Prompt.layout_aware() — preserves internal whitespace and line breaks for code/tables.

`RecognitionHint`

A RecognitionHint is an optional signal passed to recognize(...) that biases the engine toward a particular answer. Hints are how the pipeline plumbs “we already think this region says X” through to a second-pass recognizer:

class RecognitionHint(BaseModel):
    text: str | None = None              # previous best-guess text
    language: str | None = None          # ISO 639-1
    schema: type[BaseModel] | None = None  # for structured prompts
    examples: list[FewShotExample] = []  # few-shot exemplars (see below)
    attrs: dict[str, Any] = {}           # engine-specific extension

class FewShotExample(BaseModel):
    image: bytes | Path                  # PNG bytes or a path the recognizer can read
    text: str                            # the correct recognition for this image
    note: str | None = None              # optional caption spliced into the prompt

    @classmethod
    def from_sample(cls, sample: Sample) -> "FewShotExample": ...

Built-in VLM recognizers fold the hint into the prompt automatically when the prompt is a hint-aware variant:

from scriva.prompts import Prompt

Prompt.ocr()                # ignores hints
Prompt.ocr_with_hint()      # includes hint.text in the message
Prompt.ocr_few_shot()       # includes hint.examples ahead of the target crop
Prompt.structured(schema=…) # uses hint.schema when present
Prompt.structured_few_shot(schema=…) # both

Constructors:

RecognitionHint(text="ENCLOSURE PLATE")           # explicit
RecognitionHint.from_recognition(prev)            # copy text/language from a prior Recognition
RecognitionHint.from_result(prev_result)          # mapping {region_id: hint} from a whole DocumentResult
RecognitionHint.from_store(store, near=region, k=5)  # populate examples from a SampleStore by nearest neighbour

The from_store(...) constructor is the path for retrieval-augmented OCR: it looks up the k nearest labelled samples and packages them as FewShotExamples. The store must declare Capability.NEAREST_NEIGHBOUR — see samples.md.

Pair this with postprocess.dictionary.from_samples(store) in the same pipeline. The two read the same labelled crops but apply them on opposite sides of recognize: few-shot exemplars steer the model before it answers; the derived dictionary corrects known errors after. See samples.md › Two roads from a SampleStore for the joint recipe.

The hints= kwarg on a recognizer factory accepts:

a single RecognitionHint (applied to every region),
a dict[RegionId, RecognitionHint] (per-region overrides), or
a Callable[[Region], RecognitionHint] (computed per region — the right shape for retrieval-augmented OCR via RecognitionHint.from_store).
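Resolving those three shapes to a single per-region hint might look like the sketch below. It is an illustration only: FakeRegion, FakeHint, and hint_for are hypothetical names, the dict is assumed to be keyed by region id, and regions missing from the dict are assumed to get no hint.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class FakeRegion:
    """Hypothetical stand-in for a Region."""
    id: str


@dataclass
class FakeHint:
    """Hypothetical stand-in for a RecognitionHint."""
    text: Optional[str] = None


def hint_for(hints, region):
    """Resolve the hint for one region from any of the three accepted shapes."""
    if hints is None:
        return None
    if callable(hints):
        return hints(region)            # computed per region
    if isinstance(hints, dict):
        return hints.get(region.id)     # per-region override, else no hint
    return hints                        # one hint applied to every region
```

The callable form is the only one that can depend on the region's content, which is why it suits retrieval-based hints.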

See Pipeline › Confidence-driven re-OCR for the canonical refinement pattern, and the few-shot recipe below for retrieval-augmented OCR.

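import scriva
from scriva.recognize import openai
from scriva.prompts   import Prompt
from scriva import samples, RecognitionHint

# morphological_grid and excel (the first and last stages below) are assumed
# to be imported elsewhere.
store = samples.fs(".scriva_samples")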
pipeline = scriva.Pipeline(
    morphological_grid(),
    openai(
        model="gpt-4o",
        prompt=Prompt.ocr_few_shot(),
        hints=lambda region: RecognitionHint.from_store(store, near=region, k=5),
    ),
    excel("out.xlsx"),
)

Custom recognizers receive the hint through the hint keyword argument:

async def recognize(self, page, region, *, hint=None):
    if hint and hint.text:
        prompt = self._prompt_with_hint(hint.text)
    else:
        prompt = self._prompt
    ...

Hints are advisory. A recognizer is free to ignore them, for example when the cache already holds an exact match for the crop.

Capability negotiation

class Capability(StrEnum):
    JSON_OUTPUT = "json_output"
    LANGUAGE_DETECTION = "language_detection"
    LAYOUT_PRESERVING = "layout_preserving"
    TOKEN_BOXES = "token_boxes"
    HANDWRITING = "handwriting"
    GRID = "grid"
    POLYGON = "polygon"
    SELF_CONFIDENCE = "self_confidence"

A recognizer declares its capabilities; downstream stages declare their requirements. Construction reports an actionable error before you load anything:

ConfigurationError: post-processor `rule_splitter` requires
Capability.LAYOUT_PRESERVING from the recognizer; openai provides
{JSON_OUTPUT, LANGUAGE_DETECTION}. Set prompt=Prompt.layout_aware() or
choose a different recognizer.
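The check behind that error is a set comparison, and the "union of children" and "intersection of children" entries in the built-in table are set operations over the same values. The sketch below assumes this shape; check_requirements, union_of, and intersection_of are hypothetical helpers, and the enum is redeclared locally (trimmed to four members) so the block runs on its own.

```python
from enum import Enum


class Capability(str, Enum):
    JSON_OUTPUT = "json_output"
    LANGUAGE_DETECTION = "language_detection"
    LAYOUT_PRESERVING = "layout_preserving"
    TOKEN_BOXES = "token_boxes"


class ConfigurationError(Exception):
    """Raised at construction time, before any page is loaded."""


def check_requirements(stage_name, required, recognizer_name, provided):
    """Fail fast if the recognizer lacks something a downstream stage needs."""
    missing = frozenset(required) - frozenset(provided)
    if missing:
        names = ", ".join(sorted(c.name for c in missing))
        raise ConfigurationError(
            f"`{stage_name}` requires {names} from the recognizer; "
            f"{recognizer_name} does not provide it"
        )


def union_of(*children):
    """Capabilities of a wrapper that needs only one child to support a feature."""
    return frozenset().union(*children)


def intersection_of(first, *rest):
    """Capabilities of a wrapper that needs every child to support a feature."""
    return frozenset(first).intersection(*rest)
```

Because the check only inspects declared sets, it can run when the pipeline is constructed rather than on the first page.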

Writing your own recognizer

Subclass:

from scriva import Recognizer, Recognition, Capability

class MyRecognizer(Recognizer):
    name = "my-recognizer"          # required by the protocol; used as Recognition.source below
    capabilities = frozenset({Capability.JSON_OUTPUT})
    max_concurrency = 4

    async def recognize(self, page, region, *, hint=None):
        crop = page.crop(region.bbox, padding=4)
        text = await self._call_my_model(crop)
        return Recognition(text=text, source=self.name)

…or decorate a function (for stateless cases):

from scriva import recognizer, Recognition, Capability

@recognizer(capabilities=frozenset({Capability.JSON_OUTPUT}), max_concurrency=4)
async def my_recognizer(page, region, *, hint=None):
    crop = page.crop(region.bbox, padding=4)
    return Recognition(text=await call_my_model(crop), source="my-recognizer")

You do not implement caching, batching, retry, or progress reporting. The pipeline owns all of that.

The concurrency contract

max_concurrency is enforced by the pipeline’s semaphore around your recognize(...) call. If your implementation fans out internally — multiple subprocess calls per region, an unbounded asyncio.gather, or a helper that spawns its own tasks — the semaphore will not see those, and you can over-subscribe the backend.

Two safe patterns:

One network call per recognize() invocation. The semaphore’s budget is honoured by construction.

A second semaphore inside the recognizer. Useful when the call genuinely needs to fan out:

import asyncio

class MyRecognizer(Recognizer):
    max_concurrency = 8

    def __init__(self):
        # Fan-out budget for inner calls; shared by all in-flight regions.
        self._inner = asyncio.Semaphore(2)

    async def recognize(self, page, region, *, hint=None):
        async with self._inner:
            ...

If the inner call offers no point at which cancellation can take effect, wrap it in asyncio.wait_for(...) so a cancelled run does not hang on a blocked subprocess.
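A minimal sketch of the asyncio.wait_for pattern for a subprocess-backed recognizer follows. The helper name run_with_timeout and the choice to return None on timeout are assumptions for the example; the asyncio calls themselves are standard library.

```python
import asyncio
import sys


async def run_with_timeout(argv, timeout_s):
    """Run a subprocess and return its stdout, or None if it exceeds the timeout."""
    proc = await asyncio.create_subprocess_exec(
        *argv, stdout=asyncio.subprocess.PIPE
    )
    try:
        out, _ = await asyncio.wait_for(proc.communicate(), timeout=timeout_s)
        return out.decode()
    except asyncio.TimeoutError:
        proc.kill()          # wait_for cancels the await, not the child process
        await proc.wait()    # reap it so no zombie is left behind
        return None
```

The explicit kill is the important part: cancelling the awaited coroutine does not stop the child process by itself.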

Cost notes

VLM recognizers are by far the dominant cost of a typical run. Two levers:

Cache aggressively. Most forms have repeated headers across pages — semantic cache hit rates of 30–60% are common. See caching.md.
Skip blank regions. Configure a RegionClassifier upstream; blank regions are never sent to the recognizer.