
Recognizers

A recognizer turns a Region into a Recognition. This is where the actual OCR happens — by calling a vision-language model, a classical OCR binary, or whatever you bring.

class Recognizer(Engine, Protocol):
    name: str
    max_concurrency: int

    async def recognize(
        self,
        page: Page,
        region: Region,
        *,
        hint: RecognitionHint | None = None,
    ) -> Recognition: ...

You implement recognize. The pipeline:

  • Iterates the layout’s recognisable regions (skips role == "blank").
  • Bounds parallelism with asyncio.Semaphore(max_concurrency).
  • Wraps each call with the configured cache (if any).
  • Emits progress events with done/total/cache_hits.

You never write the parallelism, the cache, or the events. The pipeline does.
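To make that division of labour concrete, here is a minimal sketch of the loop a pipeline of this kind might run. It is not scriva's implementation: `run_regions`, `Progress`, the dict-based regions, and the dict cache are all hypothetical stand-ins, chosen only to show the semaphore, the cache wrapper, and the progress counters working together.

```python
import asyncio
from dataclasses import dataclass


@dataclass
class Progress:
    """Hypothetical progress payload mirroring done/total/cache_hits."""
    done: int = 0
    total: int = 0
    cache_hits: int = 0


async def run_regions(recognize, regions, *, max_concurrency, cache=None, on_progress=None):
    """Recognise every non-blank region with bounded parallelism.

    `recognize` is any `async def (region) -> str`; `cache` is a plain dict
    keyed by region id. Both are stand-ins, not scriva types.
    """
    todo = [r for r in regions if r["role"] != "blank"]  # blank regions are skipped
    sem = asyncio.Semaphore(max_concurrency)
    progress = Progress(total=len(todo))
    results = {}

    async def one(region):
        key = region["id"]
        if cache is not None and key in cache:
            results[key] = cache[key]          # cache hit: no recognizer call
            progress.cache_hits += 1
        else:
            async with sem:                    # at most max_concurrency calls in flight
                results[key] = await recognize(region)
            if cache is not None:
                cache[key] = results[key]
        progress.done += 1
        if on_progress is not None:
            on_progress(progress)

    await asyncio.gather(*(one(r) for r in todo))
    return results, progress
```

The recognizer passed in only ever sees one region at a time, which is why a custom `recognize` needs no concurrency logic of its own.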

Factories live under scriva.recognize (lowercase); the subclassable classes live under scriva.recognizers:

| Factory | Class | Engine | Extra | Capabilities |
| --- | --- | --- | --- | --- |
| recognize.openai(...) | OpenAIVisionRecognizer | OpenAI Vision (gpt-4o, etc.) | openai | JSON_OUTPUT, LANGUAGE_DETECTION |
| recognize.anthropic(...) | AnthropicVisionRecognizer | Anthropic Vision (claude-*) | anthropic | JSON_OUTPUT, LANGUAGE_DETECTION |
| recognize.bedrock(...) | BedrockVisionRecognizer | AWS Bedrock (Qwen-VL, Claude) | bedrock | JSON_OUTPUT |
| recognize.tesseract(...) | TesseractRecognizer | tesseract-ocr CLI | tesseract | TOKEN_BOXES, LANGUAGE_DETECTION |
| recognize.echo(...) | EchoRecognizer | returns a fixed string | core | |
| recognize.fallback(*chain) | FallbackRecognizer | try each in order on failure | core | union of children |
| recognize.uncertainty_first(primary, oracle, *, by=…, k=…) | UncertaintyFirstRecognizer | route k most-uncertain to an oracle | core | union of children |
| recognize.consensus(*recogs) | ConsensusRecognizer | parallel vote across recognizers | core | intersection of children |

recognize.echo exists so you can build a complete test pipeline without any network or binary dependency.
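As a rough illustration of why a fixed-string recognizer is handy in tests, the sketch below pairs one with a try-each-in-order fallback. These are plain async functions written for this example, not the scriva factories; `echo`, `failing`, and `fallback` here are hypothetical stand-ins that only mimic the behaviour described in the table.

```python
import asyncio


def echo(text):
    """Stand-in for an echo recognizer: always returns the same string."""
    async def recognize(region):
        return text
    return recognize


def failing(message):
    """Stand-in for a recognizer whose backend is unavailable."""
    async def recognize(region):
        raise RuntimeError(message)
    return recognize


def fallback(*chain):
    """Try each recognizer in order; return the first result that does not raise."""
    async def recognize(region):
        errors = []
        for child in chain:
            try:
                return await child(region)
            except Exception as exc:  # remember the failure and move on to the next child
                errors.append(exc)
        raise RuntimeError(f"all {len(chain)} recognizers failed: {errors!r}")
    return recognize


# No network and no OCR binary: the failing child is skipped, echo answers.
recognize = fallback(failing("backend down"), echo("HELLO"))
print(asyncio.run(recognize("region-1")))
```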

Two-stage dispatch. Every region runs through primary; the k most-uncertain (or all those exceeding a threshold) are then re-recognised by oracle and the oracle’s output wins:

from scriva.recognize import openai, anthropic, uncertainty_first

recognize = uncertainty_first(
    primary=openai(model="gpt-4o", cache=".scriva_cache"),
    oracle=anthropic(model="claude-opus-4-7"),
    by=lambda r: abs((r.confidence or 0.5) - 0.5),  # default: distance from 0.5
    k=20,                                           # top-k uncertain
    # threshold=0.3,                                # alternative to k
    store=None,                                     # optional SampleStore
)

Pass store= to persist both versions as a Sample — primary recognition in .recognition, oracle text in .label. The wrapper does the routing; samples does the persistence. This is the foundation of the annotation workflow in domains.annotation.

by is any Callable[[Recognition], float] that returns a priority, where higher means more interesting. The default, |confidence - 0.5|, scores a recognition by its distance from the midpoint, so confidences near 0 and near 1 both score higher than mid-range ones. Pass your own for application-specific routing, e.g. lambda r: 1.0 - (r.confidence or 0) to route low-confidence regions only.
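Because the choice of priority function decides which regions reach the oracle, it can help to rank a handful of recognitions by hand before committing to one. The sketch below assumes the wrapper simply takes the k highest-priority items; `Rec` and `top_k` are hypothetical stand-ins, and the two priority functions copy the expressions quoted above.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Rec:
    """Stand-in for a Recognition: just a region id and a confidence."""
    region_id: str
    confidence: Optional[float]


def distance_from_midpoint(r: Rec) -> float:
    # Same expression as the documented default; a missing confidence counts as 0.5.
    return abs((r.confidence or 0.5) - 0.5)


def low_confidence_only(r: Rec) -> float:
    # Same expression as the documented alternative; a missing confidence counts as 0.
    return 1.0 - (r.confidence or 0)


def top_k(recs: list[Rec], by: Callable[[Rec], float], k: int) -> list[str]:
    """Return the ids of the k highest-priority recognitions (assumed selection rule)."""
    ranked = sorted(recs, key=by, reverse=True)
    return [r.region_id for r in ranked[:k]]


recs = [Rec("a", 0.97), Rec("b", 0.52), Rec("c", 0.10), Rec("d", None)]
print(top_k(recs, distance_from_midpoint, k=2))  # ['a', 'c']
print(top_k(recs, low_confidence_only, k=2))     # ['d', 'c']
```

The two functions select different regions from the same input, so it is worth checking the ranking against a few known-bad recognitions from your own data.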

consensus calls multiple recognizers in parallel and resolves disagreements:

from scriva.recognize import consensus, openai, anthropic, bedrock

recognize = consensus(
    openai(model="gpt-4o"),
    anthropic(model="claude-opus-4-7"),
    bedrock(model="qwen.qwen3-vl-235b-a22b"),
    agreement=0.5,                 # float; min fraction that must agree
    tiebreaker=None,               # Recognizer | None
    on_disagreement="confidence",  # "confidence", "majority", or "tiebreaker"
)

The capability set is the intersection of the children — a consensus recognizer only declares what every member supports. The on_disagreement strategies:

  • "confidence" — highest-confidence wins.
  • "majority" — most common normalised text wins; ties broken by confidence.
  • "tiebreaker" — call tiebreaker with hint.text set to the majority answer; the tiebreaker’s output is final.

Cost scales linearly with the number of children. Use it sparingly — pair with cache= on each child so re-runs are free.
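The "confidence" and "majority" strategies can be sketched in a few lines. This is an illustration under stated assumptions rather than the library's code: `Vote` is a hypothetical stand-in for a child's Recognition, and the normalisation rule (collapse whitespace, ignore case) is a guess at what "normalised text" might mean.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Vote:
    """Stand-in for one child's Recognition."""
    source: str
    text: str
    confidence: float


def normalise(text: str) -> str:
    # Assumed normalisation: collapse whitespace and ignore case.
    return " ".join(text.split()).casefold()


def by_confidence(votes: list[Vote]) -> Vote:
    """'confidence': the single highest-confidence vote wins."""
    return max(votes, key=lambda v: v.confidence)


def by_majority(votes: list[Vote]) -> Vote:
    """'majority': most common normalised text wins; ties broken by confidence."""
    groups: dict[str, list[Vote]] = defaultdict(list)
    for v in votes:
        groups[normalise(v.text)].append(v)
    # Rank groups by size first, then by the best confidence inside the group.
    best = max(groups.values(), key=lambda g: (len(g), max(v.confidence for v in g)))
    return by_confidence(best)


def agreement_fraction(votes: list[Vote]) -> float:
    """Fraction of votes that share the most common normalised text."""
    counts: dict[str, int] = defaultdict(int)
    for v in votes:
        counts[normalise(v.text)] += 1
    return max(counts.values()) / len(votes)


votes = [
    Vote("openai", "Total  Due", 0.80),
    Vote("anthropic", "total due", 0.70),
    Vote("bedrock", "Total Duo", 0.95),
]
print(by_confidence(votes).source)  # bedrock
print(by_majority(votes).source)    # openai
```

Here the two strategies disagree: the lone high-confidence outlier wins under "confidence", while the two matching readings win under "majority".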

All VLM-based recognizers share the same options:

| Option | Type | Default | Notes |
| --- | --- | --- | --- |
| model | str | required | e.g. "gpt-4o", "claude-opus-4-7" |
| max_concurrency | int | 8 | Per-recognizer semaphore |
| prompt | Prompt | Prompt.ocr() | A Prompt (see prompts) |
| temperature | float | 0.0 | OCR is deterministic; do not raise unless you mean it |
| max_tokens | int | 500 | Cap per region |
| image_detail | str | "high" | Provider-specific; passthrough |
| crop_padding_px | int | 4 | Pad each region before cropping to give the VLM context |
| request_timeout_s | float | 60.0 | Per-call timeout |
| retry | Retry | Retry.default() | Exponential backoff for 429/5xx |
| cache | Cache \| str \| None | None | Wraps the recognizer; see caching.md |
| stream | bool | False | Emit recognize/progress events with partial text as tokens arrive |

Crop padding matters: a region cropped exactly to its bbox often loses the top of capital letters or the bottom of descenders. The default 4 px is right for most scans at 300 DPI; raise it for low-DPI inputs.
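The padding itself is simple box arithmetic. The helper below is a hypothetical sketch, not a scriva function: it grows a `(left, top, right, bottom)` box by a fixed number of pixels and clamps the result to the page, which is an assumption about how regions near the page edge should be handled.

```python
def pad_bbox(bbox, padding_px, page_width, page_height):
    """Grow (left, top, right, bottom) by padding_px on every side, clamped to the page."""
    left, top, right, bottom = bbox
    return (
        max(0, left - padding_px),
        max(0, top - padding_px),
        min(page_width, right + padding_px),
        min(page_height, bottom + padding_px),
    )


# A region 2 px from the left edge and touching the right edge of a 52 px wide page.
print(pad_bbox((2, 10, 50, 30), 4, page_width=52, page_height=100))  # (0, 6, 52, 34)
```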

stream=True opts the recognizer into incremental output. The final Recognition shape is unchanged; what changes is the event stream — a streaming recognizer emits additional recognize/progress events with {region_id, partial_text, tokens} payloads as tokens arrive, then a final finished carrying the assembled Recognition. Useful for long text-block regions when a UI wants to render a typewriter effect; pure noise for cell-by-cell OCR. Off by default.

cache= is the simplest way to enable caching:

from scriva.recognize import openai
openai(model="gpt-4o", cache=".scriva_cache") # FileSystemCache
openai(model="gpt-4o", cache=Cache.layered(".scriva")) # exact + semantic

Recognizers do not bake in a prompt. Each accepts a Prompt object:

from scriva.prompts import Prompt
ocr_en = Prompt.from_template("""
Read every character in this image. If the image is blank, return exactly the
string "BLANK". Return only the recognised text. No commentary.
""")
ocr_ja = Prompt.ocr(locale="ja") # built-in localised variants
extract = Prompt.structured(schema=Invoice) # function-calling / structured output
code = Prompt.layout_aware() # preserves whitespace

Three built-in prompt families, each a function call:

  • Prompt.ocr(locale=...) — plain text recognition. Locales en, ja, zh, ko, de, fr, es.
  • Prompt.structured(schema=...) — structured output bound to a Pydantic model.
  • Prompt.layout_aware() — preserves internal whitespace and line breaks for code/tables.

Register your own with Prompt.register("name", template, locale="en").

A RecognitionHint is an optional signal passed to recognize(...) that biases the engine toward a particular answer. Hints are how the pipeline plumbs “we already think this region says X” through to a second-pass recognizer:

class RecognitionHint(BaseModel):
    text: str | None = None                 # previous best-guess text
    language: str | None = None             # ISO 639-1
    schema: type[BaseModel] | None = None   # for structured prompts
    examples: list[FewShotExample] = []     # few-shot exemplars (see below)
    attrs: dict[str, Any] = {}              # engine-specific extension

class FewShotExample(BaseModel):
    image: bytes | Path                     # PNG bytes or a path the recognizer can read
    text: str                               # the correct recognition for this image
    note: str | None = None                 # optional caption spliced into the prompt

    @classmethod
    def from_sample(cls, sample: Sample) -> "FewShotExample": ...

Built-in VLM recognizers fold the hint into the prompt automatically when the prompt is a hint-aware variant:

from scriva.prompts import Prompt
Prompt.ocr() # ignores hints
Prompt.ocr_with_hint() # includes hint.text in the message
Prompt.ocr_few_shot() # includes hint.examples ahead of the target crop
Prompt.structured(schema=...) # uses hint.schema when present
Prompt.structured_few_shot(schema=...) # both

Constructors:

RecognitionHint(text="ENCLOSURE PLATE") # explicit
RecognitionHint.from_recognition(prev) # copy text/language from a prior Recognition
RecognitionHint.from_result(prev_result) # mapping {region_id: hint} from a whole DocumentResult
RecognitionHint.from_store(store, near=region, k=5) # populate examples from a SampleStore by nearest neighbour

The from_store(...) constructor is the path for retrieval-augmented OCR: it looks up the k nearest labelled samples and packages them as FewShotExamples. The store must declare Capability.NEAREST_NEIGHBOUR — see samples.md.

Pair this with postprocess.dictionary.from_samples(store) in the same pipeline. The two read the same labelled crops but apply them on opposite sides of recognize: few-shot exemplars steer the model before it answers; the derived dictionary corrects known errors after. See samples.md › Two roads from a SampleStore for the joint recipe.

The hints= kwarg on a recognizer factory accepts:

  • a single RecognitionHint (applied to every region),
  • a dict[RegionId, RecognitionHint] (per-region overrides), or
  • a Callable[[Region], RecognitionHint] (computed per region — the right shape for retrieval-augmented OCR via RecognitionHint.from_store).
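One way to picture the three accepted shapes is as a small resolver that turns whatever was passed into the hint for a single region. The sketch below is an assumption about how such a dispatch could look, not scriva's code: `Hint` stands in for RecognitionHint, and plain string ids stand in for both Region and RegionId to keep the example self-contained.

```python
from collections.abc import Callable, Mapping
from typing import Optional, Union


class Hint:
    """Stand-in for RecognitionHint; only carries text."""

    def __init__(self, text: Optional[str] = None):
        self.text = text


HintSpec = Union[Hint, Mapping[str, Hint], Callable[[str], Hint], None]


def resolve_hint(spec: HintSpec, region_id: str) -> Optional[Hint]:
    """Turn any of the three accepted shapes into the hint for one region."""
    if spec is None:
        return None
    if isinstance(spec, Hint):
        return spec                 # one hint applied to every region
    if isinstance(spec, Mapping):
        return spec.get(region_id)  # per-region override; None when absent
    if callable(spec):
        return spec(region_id)      # computed per region
    raise TypeError(f"unsupported hints value: {type(spec).__name__}")
```

The callable branch is where a retrieval step would run, since it is the only shape evaluated separately for each region.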

See Pipeline › Confidence-driven re-OCR for the canonical refinement pattern, and the few-shot recipe below for retrieval-augmented OCR.

import scriva
from scriva.recognize import openai
from scriva.prompts import Prompt
from scriva import samples, RecognitionHint

store = samples.fs(".scriva_samples")
pipeline = scriva.Pipeline(
    morphological_grid(),
    openai(
        model="gpt-4o",
        prompt=Prompt.ocr_few_shot(),
        hints=lambda region: RecognitionHint.from_store(store, near=region, k=5),
    ),
    excel("out.xlsx"),
)

Custom recognizers receive the hint through the keyword:

async def recognize(self, page, region, *, hint=None):
    if hint and hint.text:
        prompt = self._prompt_with_hint(hint.text)
    else:
        prompt = self._prompt
    ...

Hints are advisory. A recognizer is free to ignore them — for example, when the cached crop already hits exactly.

class Capability(StrEnum):
    JSON_OUTPUT = "json_output"
    LANGUAGE_DETECTION = "language_detection"
    LAYOUT_PRESERVING = "layout_preserving"
    TOKEN_BOXES = "token_boxes"
    HANDWRITING = "handwriting"
    GRID = "grid"
    POLYGON = "polygon"
    SELF_CONFIDENCE = "self_confidence"

A recognizer declares its capabilities; downstream stages declare their requirements. Construction reports an actionable error before you load anything:

ConfigurationError: post-processor `rule_splitter` requires
Capability.LAYOUT_PRESERVING from the recognizer; openai provides
{JSON_OUTPUT, LANGUAGE_DETECTION}. Set prompt=Prompt.layout_aware() or
choose a different recognizer.
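A check of this kind amounts to a set difference performed at construction time. The sketch below is a hypothetical reconstruction, not the library's validator: `check_capabilities` and its arguments are invented for illustration, and the enum is trimmed to three members.

```python
from enum import Enum


class Capability(str, Enum):
    JSON_OUTPUT = "json_output"
    LANGUAGE_DETECTION = "language_detection"
    LAYOUT_PRESERVING = "layout_preserving"


class ConfigurationError(Exception):
    """Raised while wiring the pipeline, before any page is loaded."""


def check_capabilities(stage, required, recognizer, provided):
    """Raise if the recognizer lacks any capability the downstream stage requires."""
    missing = frozenset(required) - frozenset(provided)
    if missing:
        names = ", ".join(sorted(c.name for c in missing))
        have = ", ".join(sorted(c.name for c in provided))
        raise ConfigurationError(
            f"post-processor `{stage}` requires {names} from the recognizer; "
            f"{recognizer} provides {{{have}}}."
        )


try:
    check_capabilities(
        "rule_splitter",
        {Capability.LAYOUT_PRESERVING},
        "openai",
        {Capability.JSON_OUTPUT, Capability.LANGUAGE_DETECTION},
    )
except ConfigurationError as exc:
    print(exc)
```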

Subclass:

from scriva import Recognizer, Recognition, Capability

class MyRecognizer(Recognizer):
    capabilities = frozenset({Capability.JSON_OUTPUT})
    max_concurrency = 4

    async def recognize(self, page, region, *, hint=None):
        crop = page.crop(region.bbox, padding=4)
        text = await self._call_my_model(crop)
        return Recognition(text=text, source=self.name)

…or decorate a function (for stateless cases):

from scriva import recognizer, Recognition, Capability

@recognizer(capabilities=frozenset({Capability.JSON_OUTPUT}), max_concurrency=4)
async def my_recognizer(page, region, *, hint=None):
    crop = page.crop(region.bbox, padding=4)
    return Recognition(text=await call_my_model(crop), source="my-recognizer")

You do not implement caching, batching, retry, or progress reporting. The pipeline owns all of that.

max_concurrency is enforced by the pipeline’s semaphore around your recognize(...) call. If your implementation fans out internally — multiple subprocess calls per region, an unbounded asyncio.gather, or a helper that spawns its own tasks — the semaphore will not see those, and you can over-subscribe the backend.

Two safe patterns:

  • One network call per recognize() invocation. The semaphore’s budget is honoured by construction.

  • A second semaphore inside the recognizer. Useful when the call genuinely needs to fan out:

    class MyRecognizer(Recognizer):
        max_concurrency = 8

        def __init__(self):
            self._inner = asyncio.Semaphore(2)  # per-region fanout budget

        async def recognize(self, page, region, *, hint=None):
            async with self._inner:
                ...

If you cannot insert a checkpoint, wrap the inner call in asyncio.wait_for(...) so a cancelled run does not hang on a blocked subprocess.
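Putting the inner semaphore and the timeout together might look like the sketch below. `FanOutRecognizer` is a hypothetical class with a simulated backend, not a scriva Recognizer subclass; the point is only that every backend call passes through the inner semaphore and is wrapped in `asyncio.wait_for`. Note that a semaphore stored on the instance is shared by all regions in flight, so it caps total backend calls rather than calls per region.

```python
import asyncio


class FanOutRecognizer:
    """Stand-in recognizer that makes several backend calls per region."""

    max_concurrency = 8  # would be enforced by the caller, not by this class

    def __init__(self, inner_limit=2, timeout_s=1.0):
        self._inner = asyncio.Semaphore(inner_limit)  # shared across all regions
        self._timeout_s = timeout_s
        self.in_flight = 0
        self.peak = 0

    async def _backend(self, part):
        # Simulated backend call; records how many calls overlap.
        self.in_flight += 1
        self.peak = max(self.peak, self.in_flight)
        try:
            await asyncio.sleep(0.01)
            return part.upper()
        finally:
            self.in_flight -= 1

    async def _bounded_call(self, part):
        async with self._inner:
            # The timeout covers only the call itself, not time spent queueing.
            return await asyncio.wait_for(self._backend(part), self._timeout_s)

    async def recognize(self, parts):
        # Fan out over the parts of one region; the inner semaphore bounds the fan-out.
        texts = await asyncio.gather(*(self._bounded_call(p) for p in parts))
        return " ".join(texts)


async def demo():
    rec = FanOutRecognizer(inner_limit=2)
    out = await asyncio.gather(rec.recognize(["a", "b", "c"]), rec.recognize(["d", "e"]))
    return rec.peak, out
```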

VLM recognizers are by far the dominant cost of a typical run. Two levers:

  1. Cache aggressively. Most forms have repeated headers across pages — semantic cache hit rates of 30–60% are common. See caching.md.
  2. Skip blank regions. Configure a RegionClassifier upstream; blank regions are never sent to the recognizer.