Post-processors
A post-processor refines `Recognition` objects after the recogniser has
run. Post-processors form a chain: each one reads the current state,
modifies or replaces recognitions, and passes the result to the next.
Protocol

    class PostProcessor(Engine, Protocol):
        name: str

        async def process(
            self,
            page: Page,
            layout: Layout,
            recognitions: dict[RegionId, Recognition],
        ) -> dict[RegionId, Recognition]: ...

Post-processors are pure with respect to the page: they read `page` for
context (cropping, rendering), but they only mutate `recognitions`.
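
To make the chaining contract concrete, here is a minimal sketch that uses stand-in types rather than scriva's real `Page`, `Layout`, and `Recognition`. The names `run_chain`, `Upper`, and `DropEmpty` are hypothetical, and this page does not show how the real pipeline drives its stages, so treat the loop as an assumption.

```python
import asyncio
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class Recognition:
    """Stand-in for scriva's Recognition (assumption: only text is modelled)."""
    text: Optional[str]


class Upper:
    """Hypothetical stage: upper-cases every non-empty recognition."""
    name = "upper"

    async def process(self, page, layout, recognitions):
        # Build a new dict; the input mapping is left untouched.
        return {
            rid: replace(r, text=r.text.upper()) if r.text else r
            for rid, r in recognitions.items()
        }


class DropEmpty:
    """Hypothetical stage: removes recognitions without text."""
    name = "drop_empty"

    async def process(self, page, layout, recognitions):
        return {rid: r for rid, r in recognitions.items() if r.text}


async def run_chain(processors, page, layout, recognitions):
    """Feed each stage's output into the next, in the order given."""
    for processor in processors:
        recognitions = await processor.process(page, layout, recognitions)
    return recognitions


initial = {"r1": Recognition("enclosure"), "r2": Recognition(None)}
# page and layout are not used by these stand-in stages, so None is passed.
result = asyncio.run(run_chain([Upper(), DropEmpty()], None, None, initial))
```

Because each stage returns a fresh mapping, `initial` is unchanged after the run and the same input can be fed through the chain again.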
Built-in post-processors
Factories under `scriva.postprocess` (lowercase) and classes under
`scriva.postprocessors` (subclassable):
| Factory | Class | What it does |
|---|---|---|
| `postprocess.dictionary(...)` | `DictionaryCorrector` | Fixes known OCR errors using a supervised dictionary + fuzzy matching |
| `postprocess.vocabulary(...)` | `Vocabulary` | Snaps low-confidence outputs onto a domain vocabulary |
| `postprocess.rule_splitter(...)` | `RuleSplitter` | Detects internal ruled lines (│, ─) in output and splits the region |
| `postprocess.language_detector()` | `LanguageDetector` | Sets `Recognition.language` from text |
| `postprocess.whitespace()` | `WhitespaceNormalizer` | Collapses whitespace, normalises newlines, strips trailing dashes |
| `postprocess.blank_suppress()` | `BlankSuppressor` | Demotes “noise” outputs (“None”, “-”, “N/A”) to `text=None` |
| `postprocess.confidence_score(...)` | `ConfidenceScorer` | Attaches a confidence score (see below) |
| `postprocess.filter(pred)` | `Filter` | Drops recognitions for which `pred(recognition)` is False |
| `postprocess.pipe(*processors)` | `Pipe` | Bundles a chain so it can be inserted as one stage |
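
The table describes the two simplest stages only in one line each. The sketch below shows roughly what "collapse whitespace, normalise newlines, strip trailing dashes" and "demote noise to `None`" could mean on plain strings. The function names, the exact rules, and the case-insensitive noise set are assumptions for illustration, not scriva's implementation.

```python
import re

# Assumption: the noise tokens from the table, matched case-insensitively.
NOISE = {"none", "-", "n/a"}


def normalise_whitespace(text: str) -> str:
    text = text.replace("\r\n", "\n").replace("\r", "\n")  # normalise newlines
    text = re.sub(r"[ \t]+", " ", text)                    # collapse spaces and tabs
    text = "\n".join(line.strip() for line in text.split("\n"))
    return text.strip().rstrip("-").rstrip()               # strip trailing dashes


def suppress_blank(text):
    # Empty strings and known noise tokens are demoted to None.
    if text is None or not text.strip() or text.strip().lower() in NOISE:
        return None
    return text


cleaned = normalise_whitespace("  SERIAL \t NO.\r\n  A-12 -- ")
```

Running the whitespace step before the blank step matters here: an output such as `"-"` is reduced to an empty string first and is then demoted by `suppress_blank`.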
dictionary
A dictionary is the workhorse of the user-environment accuracy loop:
you keep a YAML (or a SampleStore) of project-specific corrections,
and every run pulls it in.
Two dictionary types:

- Supervised: `{wrong: correct}` pairs. Four passes: exact, substring (longest-first greedy), full-string fuzzy (`difflib`, threshold 0.7), token-level fuzzy.
- Unsupervised: a vocabulary of known-good terms. Used for validation only; does not rewrite.
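
The four supervised passes can be sketched with the standard library's `difflib`. This is an illustration under assumptions: the function name `correct` is hypothetical, and the sketch assumes each pass runs only when the previous one changed nothing, which this page does not state.

```python
import difflib


def correct(text: str, pairs: dict, threshold: float = 0.7) -> str:
    # Pass 1: exact match on the whole string.
    if text in pairs:
        return pairs[text]

    # Pass 2: substring replacement, longest wrong form first (greedy).
    replaced = text
    for wrong in sorted(pairs, key=len, reverse=True):
        replaced = replaced.replace(wrong, pairs[wrong])
    if replaced != text:
        return replaced

    # Pass 3: fuzzy match of the full string against the known wrong forms.
    close = difflib.get_close_matches(text, list(pairs), n=1, cutoff=threshold)
    if close:
        return pairs[close[0]]

    # Pass 4: fuzzy match token by token (whitespace is collapsed on rejoin).
    tokens = []
    for token in text.split():
        close = difflib.get_close_matches(token, list(pairs), n=1, cutoff=threshold)
        tokens.append(pairs[close[0]] if close else token)
    return " ".join(tokens)


pairs = {"ENCL0SURE": "ENCLOSURE", "PANE1": "PANEL"}
```

With these pairs, `"MAIN ENCL0SURE DOOR"` is fixed by the substring pass, while `"MAIN ENCL0SURF DOOR"` falls through to the token-level pass, where `ENCL0SURF` is close enough to a known wrong form to be rewritten.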
Dictionaries can be built from YAML, in-memory pairs, CSV, or explicit arguments:

    postprocess.dictionary.from_yaml("corrections.yaml")
    postprocess.dictionary.from_pairs({"ENCL0SURE": "ENCLOSURE"})
    postprocess.dictionary.from_csv("vendors.csv", from_col="raw", to_col="canonical")
    postprocess.dictionary(supervised=..., unsupervised=..., fuzzy_threshold=0.7)

dictionary.from_samples: derive a dictionary from your labelled store
When you maintain a `SampleStore` of corrections (every
time a human edits a recognised cell, the corrected label is saved
alongside the original recognition.text), scriva can build a
supervised dictionary from it automatically:

    from scriva import samples
    from scriva.postprocess import dictionary

    store = samples.fs(".scriva_samples")

    dictionary.from_samples(
        store,
        # keyword-only options, shown with their defaults:
        where=None,            # Callable[[Sample], bool] | None: filter samples
        min_observations=2,    # only adopt pairs seen >= N times
        fuzzy_threshold=0.7,
        refresh="startup",     # "startup", "every_run", or a float (seconds between refreshes)
    )

A pair (wrong → right) becomes a candidate whenever a sample's
`recognition.text` differs from its human label, and it is adopted once it
has been seen at least `min_observations` times. This floor protects
against one-off typos in the labels.
This is the same store that
RecognitionHint.from_store(...)
reads for few-shot examples — see samples.md › Two roads from a
SampleStore. A labelled store
improves accuracy on the input side (few-shot exemplars guide the
recogniser) and on the output side (corrections rewrite known
errors). Build the store once; both paths benefit.
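
The derivation of supervised pairs from a labelled store, including the `min_observations` floor, can be sketched as a simple count. The `Sample` dataclass and its two fields are stand-ins, and `derive_pairs` is a hypothetical helper; scriva's real `Sample` type and its handling of conflicting labels are not described on this page.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Sample:
    """Stand-in for a stored sample (assumption: two string fields)."""
    recognised: str  # what the recogniser produced
    label: str       # what the human corrected it to


def derive_pairs(samples, min_observations=2, where=None):
    # Count each (wrong, right) disagreement among the selected samples.
    counts = Counter(
        (s.recognised, s.label)
        for s in samples
        if (where is None or where(s)) and s.recognised != s.label
    )
    # Keep only disagreements seen often enough to be trusted.
    return {
        wrong: right
        for (wrong, right), n in counts.items()
        if n >= min_observations
    }


labelled = [
    Sample("ENCL0SURE", "ENCLOSURE"),
    Sample("ENCL0SURE", "ENCLOSURE"),
    Sample("PANEL", "PANEL"),   # agreement: not a correction
    Sample("PUMP", "PUMPP"),    # one-off label typo: below the floor
]
derived = derive_pairs(labelled)
```

In this sketch the single `PUMP → PUMPP` observation is discarded, which is the kind of labelling slip the floor is meant to filter out.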
rule_splitter
When a recognised cell contains │ or ─, the cell was actually a
sub-grid the detector missed. `rule_splitter`:

- Parses the text into a sub-grid.
- Expands the original `Layout` to add the sub-cells.
- Assigns the sub-strings to the new regions.

Requires the recogniser to be layout-aware (`Capability.LAYOUT_PRESERVING`).
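
The first step, parsing ruled text into a sub-grid, might look like the sketch below. `parse_subgrid` is a hypothetical helper, and treating `┼` as part of a horizontal rule is an assumption; the layout expansion and region assignment steps depend on scriva's `Layout` type and are not sketched here.

```python
def parse_subgrid(text: str) -> list:
    """Split text ruled with │ and ─ into rows of cell strings."""
    rows = []
    for line in text.splitlines():
        stripped = line.strip()
        # Lines made only of rule characters separate rows; they carry no cells.
        if not stripped or set(stripped) <= {"─", "┼"}:
            continue
        # Drop outer borders, then split on the vertical rule.
        cells = stripped.strip("│").split("│")
        rows.append([cell.strip() for cell in cells])
    return rows


grid = parse_subgrid("QTY │ PART\n────┼─────\n2   │ VALVE")
```

Each inner list corresponds to one row of the sub-grid, so the cell strings can then be assigned to newly created regions row by row.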
confidence_score
The confidence-scoring stage is itself pluggable. Two implementations ship:

- `postprocess.confidence_score.rendering(...)`: a round-trip cross-check against the original crop. Renders the recognised text as an image (via cairosvg + a configurable font), embeds both the original crop and the rendering, and scores by cosine similarity. When the rendered text doesn't visually match the crop, confidence drops. Engine-agnostic: pass any `ImageEmbedder`.
- `postprocess.confidence_score.self()`: uses the recogniser's own probability/logit output if available (`Capability.SELF_CONFIDENCE`). Cheap, but only as well-calibrated as the model.

Both populate `Recognition.confidence` in [0.0, 1.0].

`confidence_score.rendering(...)` is the canonical "cross-check against
the original data" lever in
Pipeline › High-accuracy patterns:
the model's answer has to look like the pixels it came from.
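
The scoring step of the rendering cross-check comes down to cosine similarity between two embedding vectors. The sketch below covers only that arithmetic; `cosine_confidence` is a hypothetical name, the vectors stand in for whatever an `ImageEmbedder` returns, and clamping negative similarities to 0.0 is an assumption about how the score is mapped into [0.0, 1.0].

```python
import math


def cosine_confidence(crop_vec, render_vec):
    """Cosine similarity of two embeddings, clamped into [0.0, 1.0]."""
    dot = sum(a * b for a, b in zip(crop_vec, render_vec))
    norm = math.hypot(*crop_vec) * math.hypot(*render_vec)
    if norm == 0.0:
        return 0.0  # a zero vector carries no evidence either way
    # Cosine similarity lies in [-1, 1]; clamp so the result fits [0, 1].
    return max(0.0, min(1.0, dot / norm))


same = cosine_confidence([1.0, 0.0], [1.0, 0.0])
orthogonal = cosine_confidence([1.0, 0.0], [0.0, 1.0])
```

Identical embeddings score at the top of the range and unrelated ones at the bottom, which matches the described behaviour of confidence dropping when the rendering does not resemble the crop.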
Chains
The pipeline runs post-processors in the order added. Order matters:

    import scriva
    from scriva.postprocess import (
        whitespace, blank_suppress, dictionary,
        rule_splitter, confidence_score,
    )

    pipeline = scriva.Pipeline(
        ...,
        whitespace(),                              # clean up first
        blank_suppress(),                          # then drop noise
        dictionary.from_yaml("corrections.yaml"),  # then fix
        rule_splitter(),                           # then restructure
        confidence_score.rendering(),              # score last, against final text
    )

For shared chains, bundle into one stage with `pipe`:

    import scriva
    from scriva.postprocess import pipe, whitespace, blank_suppress

    cleanup = pipe(whitespace(), blank_suppress())
    pipeline = scriva.Pipeline(..., cleanup, ...)

Ad-hoc with @postprocessor
For one-off transforms, skip the class:

    import scriva
    from scriva import postprocessor

    @postprocessor
    async def strip_quotes(page, layout, recognitions):
        # Strip surrounding double, then single, quotes; leave empty text as-is.
        return {
            rid: r.with_text(r.text.strip('"').strip("'")) if r.text else r
            for rid, r in recognitions.items()
        }

    pipeline = scriva.Pipeline(..., strip_quotes, ...)

The decorator accepts `name=` and `capabilities=` keyword args.
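
For readers curious how a decorator can work both bare and with keyword arguments, here is one common pattern. It is a stand-in written for illustration, not scriva's implementation: the wrapper class, the default of using the function's own name, and storing capabilities as a frozenset are all assumptions.

```python
import asyncio


def postprocessor(func=None, *, name=None, capabilities=()):
    """Hypothetical stand-in: wrap an async function as a stage object."""

    def wrap(fn):
        class _FunctionPostProcessor:
            async def process(self, page, layout, recognitions):
                return await fn(page, layout, recognitions)

        stage = _FunctionPostProcessor()
        stage.name = name or fn.__name__           # assumption: default to the function name
        stage.capabilities = frozenset(capabilities)
        return stage

    # Bare use (@postprocessor) receives the function directly;
    # parameterised use (@postprocessor(name=...)) returns the wrapper.
    return wrap(func) if func is not None else wrap


@postprocessor
async def identity(page, layout, recognitions):
    return dict(recognitions)


@postprocessor(name="drop-empty")
async def drop_empty(page, layout, recognitions):
    # Plain strings stand in for Recognition objects in this sketch.
    return {rid: text for rid, text in recognitions.items() if text}


kept = asyncio.run(drop_empty.process(None, None, {"a": "x", "b": ""}))
```

Either form yields an object with a `name` attribute and an async `process` method, which is the shape the protocol asks for.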
Writing your own (subclass form)

    from scriva import PostProcessor

    class StripQuotes(PostProcessor):
        async def process(self, page, layout, recognitions):
            return {
                rid: r.with_text(r.text.strip('"').strip("'")) if r.text else r
                for rid, r in recognitions.items()
            }

`Recognition` is immutable; use `.with_text()`, `.with_confidence()`, or
`.replace(**kwargs)` to derive new ones. This makes post-processor chains
safe to test and to re-run on the same input.
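
The "derive, don't mutate" style can be illustrated with a frozen dataclass. The `Recognition` below is a stand-in with two assumed fields; scriva's real class has more, and its `.replace(**kwargs)` method is approximated here by `dataclasses.replace`.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class Recognition:
    """Stand-in for scriva's Recognition (assumption: two fields only)."""
    text: Optional[str]
    confidence: float = 0.0

    def with_text(self, text):
        # Returns a copy; the original instance is never modified.
        return replace(self, text=text)

    def with_confidence(self, confidence):
        return replace(self, confidence=confidence)


original = Recognition(text='"PUMP"', confidence=0.4)
cleaned = original.with_text(original.text.strip('"'))
```

Since `original` still holds the quoted text after the call, a test can run the same stage twice on the same input and compare the outputs directly.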
When not to use a post-processor
If your refinement requires another VLM call, that is a second
recognizer, not a post-processor. Use recognize.fallback(...) or a
custom recognizer that wraps two engines. Post-processors are for transforms
on already-recognised text.