Recognizers
A recognizer turns a Region into a Recognition. This is where the
actual OCR happens — by calling a vision-language model, a classical OCR
binary, or whatever you bring.
Protocol
```python
class Recognizer(Engine, Protocol):
    name: str
    max_concurrency: int

    async def recognize(
        self,
        page: Page,
        region: Region,
        *,
        hint: RecognitionHint | None = None,
    ) -> Recognition: ...
```

You implement `recognize`. The pipeline:

- Iterates the layout’s recognisable regions (skips `role == "blank"`).
- Bounds parallelism with `asyncio.Semaphore(max_concurrency)`.
- Wraps each call with the configured cache (if any).
- Emits `progress` events with `done`/`total`/`cache_hits`.
You never write the parallelism, the cache, or the events. The pipeline does.
Built-in recognizers
Factories under scriva.recognize (lowercase) and classes under
scriva.recognizers (subclassable):
| Factory | Class | Engine | Extra | Capabilities |
|---|---|---|---|---|
| `recognize.openai(...)` | `OpenAIVisionRecognizer` | OpenAI Vision (gpt-4o, etc.) | openai | JSON_OUTPUT, LANGUAGE_DETECTION |
| `recognize.anthropic(...)` | `AnthropicVisionRecognizer` | Anthropic Vision (claude-*) | anthropic | JSON_OUTPUT, LANGUAGE_DETECTION |
| `recognize.bedrock(...)` | `BedrockVisionRecognizer` | AWS Bedrock (Qwen-VL, Claude) | bedrock | JSON_OUTPUT |
| `recognize.tesseract(...)` | `TesseractRecognizer` | tesseract-ocr CLI | tesseract | TOKEN_BOXES, LANGUAGE_DETECTION |
| `recognize.echo(...)` | `EchoRecognizer` | returns a fixed string | core | none |
| `recognize.fallback(*chain)` | `FallbackRecognizer` | try each in order on failure | core | union of children |
| `recognize.uncertainty_first(primary, oracle, *, by=…, k=…)` | `UncertaintyFirstRecognizer` | route k most-uncertain to an oracle | core | union of children |
| `recognize.consensus(*recogs)` | `ConsensusRecognizer` | parallel vote across recognizers | core | intersection of children |
recognize.echo exists so you can build a complete test pipeline without any
network or binary dependency.
uncertainty_first
Two-stage dispatch. Every region runs through primary; the k
most-uncertain (or all those exceeding a threshold) are then
re-recognised by oracle and the oracle’s output wins:
```python
from scriva.recognize import openai, anthropic, uncertainty_first

recognize = uncertainty_first(
    primary=openai(model="gpt-4o", cache=".scriva_cache"),
    oracle=anthropic(model="claude-opus-4-7"),
    by=lambda r: abs((r.confidence or 0.5) - 0.5),  # default: distance from 0.5
    k=20,               # top-k uncertain
    # threshold=0.3,    # alternative to k
    store=None,         # optional SampleStore
)
```

Pass `store=` to persist both versions as a Sample: primary recognition
in `.recognition`, oracle text in `.label`. The wrapper does the routing;
samples does the persistence. This is the foundation of
the annotation workflow in domains.annotation.
by is any Callable[[Recognition], float] returning a “priority”
where higher means more interesting. The default (|confidence - 0.5|)
treats both near-0 and near-1 as more decisive than uncertain mid-range.
Pass your own for application-specific routing — e.g. lambda r: 1.0 - (r.confidence or 0) for “low-confidence only”.
consensus
Calls multiple recognizers in parallel and resolves disagreements:

```python
from scriva.recognize import consensus, openai, anthropic, bedrock

recognize = consensus(
    openai(model="gpt-4o"),
    anthropic(model="claude-opus-4-7"),
    bedrock(model="qwen.qwen3-vl-235b-a22b"),
    agreement=0.5,                 # float: min fraction that must agree
    tiebreaker=None,               # Recognizer | None
    on_disagreement="confidence",  # "confidence" | "majority" | "tiebreaker"
)
```

The capability set is the intersection of the children: a consensus
recognizer only declares what every member supports. The
`on_disagreement` strategies:

- `"confidence"`: highest-confidence wins.
- `"majority"`: most common normalised text wins; ties broken by confidence.
- `"tiebreaker"`: call `tiebreaker` with `hint.text` set to the majority answer; the tiebreaker’s output is final.
Cost scales linearly with the number of children. Use it sparingly — pair
with cache= on each child so re-runs are free.
VLM recognizers, in detail
All VLM-based recognizers share the same options:

| Option | Type | Default | Notes |
|---|---|---|---|
| `model` | `str` | required | e.g. `"gpt-4o"`, `"claude-opus-4-7"` |
| `max_concurrency` | `int` | `8` | Per-recognizer semaphore |
| `prompt` | `Prompt` | `Prompt.ocr()` | A Prompt (see prompts) |
| `temperature` | `float` | `0.0` | OCR is deterministic; do not raise unless you mean it |
| `max_tokens` | `int` | `500` | Cap per region |
| `image_detail` | `str` | `"high"` | Provider-specific; passthrough |
| `crop_padding_px` | `int` | `4` | Pad each region before cropping to give the VLM context |
| `request_timeout_s` | `float` | `60.0` | Per-call timeout |
| `retry` | `Retry` | `Retry.default()` | Exponential backoff for 429/5xx |
| `cache` | `Cache \| str \| None` | `None` | Wraps the recognizer; see caching.md |
| `stream` | `bool` | `False` | Emit recognize/progress events with partial text as tokens arrive |
Crop padding matters: a region cropped exactly to its bbox often loses the
top of capital letters or the bottom of descenders. The default 4 px is
right for most scans at 300 DPI; raise it for low-DPI inputs.
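Padding is simple arithmetic on the bounding box, with clamping so the crop never leaves the page. A minimal sketch, assuming an `(x0, y0, x1, y1)` pixel box and a `(width, height)` page size; `pad_bbox` is a hypothetical helper, not part of scriva.

```python
def pad_bbox(
    bbox: tuple[int, int, int, int],
    page_size: tuple[int, int],
    padding: int = 4,
) -> tuple[int, int, int, int]:
    """Grow (x0, y0, x1, y1) by `padding` pixels on every side, clamped to the page."""
    x0, y0, x1, y1 = bbox
    width, height = page_size
    return (
        max(0, x0 - padding),
        max(0, y0 - padding),
        min(width, x1 + padding),
        min(height, y1 + padding),
    )
```

A region touching the page edge simply gets less padding on that side rather than an out-of-bounds crop.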
Streaming
stream=True opts the recognizer into incremental output. The final
Recognition shape is unchanged; what changes is the event stream — a
streaming recognizer emits additional recognize/progress events with
{region_id, partial_text, tokens} payloads as tokens arrive, then a
final finished carrying the assembled Recognition. Useful for long
text-block regions when a UI wants to render a typewriter effect; pure
noise for cell-by-cell OCR. Off by default.
cache= is the simplest way to enable caching:
```python
from scriva.recognize import openai

openai(model="gpt-4o", cache=".scriva_cache")           # FileSystemCache
openai(model="gpt-4o", cache=Cache.layered(".scriva"))  # exact + semantic
```

Prompts

Recognizers do not bake in a prompt. Each accepts a Prompt object:

```python
from scriva.prompts import Prompt

ocr_en = Prompt.from_template("""Read every character in this image. If the image is blank, return exactly the
string "BLANK". Return only the recognised text. No commentary.""")

ocr_ja = Prompt.ocr(locale="ja")             # built-in localised variants
extract = Prompt.structured(schema=Invoice)  # function-calling / structured output
code = Prompt.layout_aware()                 # preserves whitespace
```

Three built-in prompt families, each a function call:

- `Prompt.ocr(locale=...)`: plain text recognition. Locales `en`, `ja`, `zh`, `ko`, `de`, `fr`, `es`.
- `Prompt.structured(schema=...)`: structured output bound to a Pydantic model.
- `Prompt.layout_aware()`: preserves internal whitespace and line breaks for code/tables.
Register your own with Prompt.register("name", template, locale="en").
RecognitionHint
A RecognitionHint is an optional signal passed to recognize(...)
that biases the engine toward a particular answer. Hints are how the
pipeline plumbs “we already think this region says X” through to a
second-pass recognizer:
```python
class RecognitionHint(BaseModel):
    text: str | None = None                # previous best-guess text
    language: str | None = None            # ISO 639-1
    schema: type[BaseModel] | None = None  # for structured prompts
    examples: list[FewShotExample] = []    # few-shot exemplars (see below)
    attrs: dict[str, Any] = {}             # engine-specific extension


class FewShotExample(BaseModel):
    image: bytes | Path      # PNG bytes or a path the recognizer can read
    text: str                # the correct recognition for this image
    note: str | None = None  # optional caption spliced into the prompt

    @classmethod
    def from_sample(cls, sample: Sample) -> "FewShotExample": ...
```

Built-in VLM recognizers fold the hint into the prompt automatically when the prompt is a hint-aware variant:

```python
from scriva.prompts import Prompt

Prompt.ocr()                          # ignores hints
Prompt.ocr_with_hint()                # includes hint.text in the message
Prompt.ocr_few_shot()                 # includes hint.examples ahead of the target crop
Prompt.structured(schema=…)           # uses hint.schema when present
Prompt.structured_few_shot(schema=…)  # both
```

Constructors:

```python
RecognitionHint(text="ENCLOSURE PLATE")              # explicit
RecognitionHint.from_recognition(prev)               # copy text/language from a prior Recognition
RecognitionHint.from_result(prev_result)             # mapping {region_id: hint} from a whole DocumentResult
RecognitionHint.from_store(store, near=region, k=5)  # populate examples from a SampleStore by nearest neighbour
```

The `from_store(...)` constructor is the path for retrieval-augmented OCR:
it looks up the k nearest labelled samples and packages them as
FewShotExamples. The store must declare Capability.NEAREST_NEIGHBOUR
— see samples.md.
Pair this with
postprocess.dictionary.from_samples(store)
in the same pipeline. The two read the same labelled crops but apply
them on opposite sides of recognize: few-shot exemplars steer the
model before it answers; the derived dictionary corrects known
errors after. See samples.md › Two roads from a
SampleStore for the joint
recipe.
The hints= kwarg on a recognizer factory accepts:
- a single `RecognitionHint` (applied to every region),
- a `dict[RegionId, RecognitionHint]` (per-region overrides), or
- a `Callable[[Region], RecognitionHint]` (computed per region; the right shape for retrieval-augmented OCR via `RecognitionHint.from_store`).
See Pipeline › Confidence-driven re-OCR for the canonical refinement pattern, and the few-shot recipe below for retrieval-augmented OCR.
```python
import scriva
from scriva.recognize import openai
from scriva.prompts import Prompt
from scriva import samples, RecognitionHint

# morphological_grid and excel are assumed to be imported already.
store = samples.fs(".scriva_samples")
pipeline = scriva.Pipeline(
    morphological_grid(),
    openai(
        model="gpt-4o",
        prompt=Prompt.ocr_few_shot(),
        hints=lambda region: RecognitionHint.from_store(store, near=region, k=5),
    ),
    excel("out.xlsx"),
)
```

Custom recognizers receive the hint through the keyword:

```python
async def recognize(self, page, region, *, hint=None):
    if hint and hint.text:
        prompt = self._prompt_with_hint(hint.text)
    else:
        prompt = self._prompt
    ...
```

Hints are advisory. A recognizer is free to ignore them, for example when the cached crop already hits exactly.
Capability negotiation
```python
class Capability(StrEnum):
    JSON_OUTPUT = "json_output"
    LANGUAGE_DETECTION = "language_detection"
    LAYOUT_PRESERVING = "layout_preserving"
    TOKEN_BOXES = "token_boxes"
    HANDWRITING = "handwriting"
    GRID = "grid"
    POLYGON = "polygon"
    SELF_CONFIDENCE = "self_confidence"
```

A recognizer declares its capabilities; downstream stages declare their requirements. Construction reports an actionable error before you load anything:

```text
ConfigurationError: post-processor `rule_splitter` requires
Capability.LAYOUT_PRESERVING from the recognizer; openai provides
{JSON_OUTPUT, LANGUAGE_DETECTION}. Set prompt=Prompt.layout_aware() or
choose a different recognizer.
```

Writing your own recognizer

Subclass:
```python
from scriva import Recognizer, Recognition, Capability


class MyRecognizer(Recognizer):
    name = "my-recognizer"  # required by the protocol; used as `source` below
    capabilities = frozenset({Capability.JSON_OUTPUT})
    max_concurrency = 4

    async def recognize(self, page, region, *, hint=None):
        crop = page.crop(region.bbox, padding=4)
        text = await self._call_my_model(crop)
        return Recognition(text=text, source=self.name)
```

…or decorate a function (for stateless cases):

```python
from scriva import recognizer, Recognition, Capability


@recognizer(capabilities=frozenset({Capability.JSON_OUTPUT}), max_concurrency=4)
async def my_recognizer(page, region, *, hint=None):
    crop = page.crop(region.bbox, padding=4)
    return Recognition(text=await call_my_model(crop), source="my-recognizer")
```

You do not implement caching, batching, retry, or progress reporting. The pipeline owns all of that.
The concurrency contract
max_concurrency is enforced by the pipeline’s semaphore around your
recognize(...) call. If your implementation fans out internally —
multiple subprocess calls per region, an unbounded asyncio.gather, or a
helper that spawns its own tasks — the semaphore will not see those, and
you can over-subscribe the backend.
Two safe patterns:
- One network call per `recognize()` invocation. The semaphore’s budget is honoured by construction.
- A second semaphore inside the recognizer. Useful when the call genuinely needs to fan out:

  ```python
  import asyncio


  class MyRecognizer(Recognizer):
      max_concurrency = 8

      def __init__(self):
          self._inner = asyncio.Semaphore(2)  # per-region fanout budget

      async def recognize(self, page, region, *, hint=None):
          async with self._inner:
              ...
  ```
If you cannot insert a checkpoint, wrap the inner call in
asyncio.wait_for(...) so a cancelled run does not hang on a blocked
subprocess.
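Both ideas in this section, an inner budget for fan-out and a timeout around the inner call, fit in a few lines. The sketch below is a self-contained illustration rather than a scriva recognizer: `FanoutRecognizer`, its `pieces` argument, and the `peak` counter are hypothetical, and `asyncio.sleep` stands in for the real backend call.

```python
import asyncio


class FanoutRecognizer:
    """Hypothetical recognizer that makes several backend calls per region."""

    def __init__(self, fanout_limit: int = 2, timeout_s: float = 1.0):
        self._inner = asyncio.Semaphore(fanout_limit)  # budget for sub-calls
        self._timeout_s = timeout_s
        self.in_flight = 0
        self.peak = 0  # highest number of simultaneous sub-calls observed

    async def _call_backend(self, piece: str) -> str:
        async with self._inner:  # every sub-call takes a slot
            self.in_flight += 1
            self.peak = max(self.peak, self.in_flight)
            try:
                await asyncio.sleep(0.01)  # stands in for network or subprocess I/O
                return piece.upper()
            finally:
                self.in_flight -= 1

    async def recognize(self, pieces: list[str]) -> str:
        # Fan out, but bound each sub-call with a timeout so a stuck backend
        # cannot hold the whole run open.
        calls = [asyncio.wait_for(self._call_backend(p), self._timeout_s) for p in pieces]
        return " ".join(await asyncio.gather(*calls))
```

Note that in this arrangement the timeout also covers time spent waiting for a slot, so it should be sized for the queue as well as the call itself.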
Cost notes
VLM recognizers are by far the dominant cost of a typical run. Two levers:

- Cache aggressively. Most forms have repeated headers across pages; semantic cache hit rates of 30–60% are common. See caching.md.
- Skip blank regions. Configure a `RegionClassifier` upstream; blank regions are never sent to the recognizer.