Post-processors

A post-processor refines Recognition objects after the recogniser has run. Post-processors form a chain: each one reads the current state, rewrites or replaces recognitions, and passes the result to the next.

Protocol

class PostProcessor(Engine, Protocol):
    name: str
    async def process(
        self,
        page: Page,
        layout: Layout,
        recognitions: dict[RegionId, Recognition],
    ) -> dict[RegionId, Recognition]: ...

Post-processors are pure with respect to the page: they read page for context (cropping, rendering), but the only thing they change is the recognitions mapping they return.

Built-in post-processors

Factories live under scriva.postprocess (lowercase); the corresponding classes live under scriva.postprocessors and can be subclassed:

Factory	Class	What it does
`postprocess.dictionary(...)`	`DictionaryCorrector`	Fixes known OCR errors using a supervised dictionary + fuzzy matching
`postprocess.vocabulary(...)`	`Vocabulary`	Snaps low-confidence outputs onto a domain vocabulary
`postprocess.rule_splitter(...)`	`RuleSplitter`	Detects internal ruled lines (`│`, `─`) in output and splits the region
`postprocess.language_detector()`	`LanguageDetector`	Sets `Recognition.language` from text
`postprocess.whitespace()`	`WhitespaceNormalizer`	Collapses whitespace, normalises newlines, strips trailing dashes
`postprocess.blank_suppress()`	`BlankSuppressor`	Demotes “noise” outputs (“None”, “-”, “N/A”) to `text=None`
`postprocess.confidence_score(...)`	`ConfidenceScorer`	Attaches a confidence score (see below)
`postprocess.filter(pred)`	`Filter`	Drops recognitions for which `pred(recognition) is False`
`postprocess.pipe(*processors)`	`Pipe`	Bundles a chain so it can be inserted as one stage

`dictionary`

A dictionary is the workhorse of the user-environment accuracy loop: you keep a YAML file (or a SampleStore) of project-specific corrections, and every run pulls it in.

Two dictionary types:

Supervised — {wrong: correct} pairs. Four passes: exact, substring (longest-first greedy), full-string fuzzy (difflib, threshold 0.7), token-level fuzzy.
Unsupervised — a vocabulary of known-good terms. Used for validation only; does not rewrite.

postprocess.dictionary.from_yaml("corrections.yaml")
postprocess.dictionary.from_pairs({"ENCL0SURE": "ENCLOSURE"})
postprocess.dictionary.from_csv("vendors.csv", from_col="raw", to_col="canonical")
postprocess.dictionary(supervised=..., unsupervised=..., fuzzy_threshold=0.7)
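
The four supervised passes listed above can be sketched in plain Python. This is an illustration, not scriva's implementation: the function name `correct` is hypothetical, the sketch stops at the first pass that changes the text, and it reuses the same threshold for the token-level pass, both of which are assumptions. Only `difflib` and `re` from the standard library are used.

```python
import difflib
import re


def correct(text: str, pairs: dict[str, str], fuzzy_threshold: float = 0.7) -> str:
    """Apply four passes in order and return the first rewrite that applies."""
    # Longest keys first, so a long correction wins over a shorter one it contains.
    keys = sorted((k for k in pairs if k), key=len, reverse=True)
    if not keys:
        return text

    # Pass 1: exact match on the whole string.
    if text in pairs:
        return pairs[text]

    # Pass 2: substring replacement in a single left-to-right scan.
    pattern = re.compile("|".join(re.escape(k) for k in keys))
    replaced = pattern.sub(lambda m: pairs[m.group(0)], text)
    if replaced != text:
        return replaced

    # Pass 3: fuzzy match of the full string against the known-wrong keys.
    close = difflib.get_close_matches(text, keys, n=1, cutoff=fuzzy_threshold)
    if close:
        return pairs[close[0]]

    # Pass 4: fuzzy match token by token; unmatched tokens are kept.
    # Note: splitting and re-joining collapses runs of whitespace.
    tokens = []
    for token in text.split():
        close = difflib.get_close_matches(token, keys, n=1, cutoff=fuzzy_threshold)
        tokens.append(pairs[close[0]] if close else token)
    return " ".join(tokens)
```

Stopping at the first successful pass keeps the cheap, precise passes from being overridden by the fuzzy ones; whether scriva does the same is not stated in this page.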

`dictionary.from_samples` — derive a dictionary from your labelled store

When you maintain a SampleStore of corrections (every time a human edits a recognised cell, the corrected label is saved alongside the original recognition.text), scriva can build a supervised dictionary from it automatically:

from scriva import samples
from scriva.postprocess import dictionary

store = samples.fs(".scriva_samples")

dictionary.from_samples(
    store,
    # everything after `store` is keyword-only; defaults shown
    where=None,            # Callable[[Sample], bool] | None: filter samples
    min_observations=2,    # int: only adopt pairs seen >= N times
    fuzzy_threshold=0.7,   # float
    refresh="startup",     # "startup", "every_run", or a float (seconds between refreshes)
)

A pair (wrong → right) is added whenever a sample’s recognition.text differs from its human label. The min_observations floor protects against one-off typos in the labels.
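
The pair-derivation rule above can be illustrated with a short sketch. `LabelledSample` and `derive_pairs` are hypothetical stand-ins, not scriva types; the tie-breaking rule (the most frequent correction wins when one wrong string was labelled two ways) is also an assumption.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class LabelledSample:
    """Hypothetical stand-in for a stored sample."""
    recognised: str  # what the recogniser produced
    label: str       # what the human corrected it to


def derive_pairs(samples, min_observations: int = 2) -> dict[str, str]:
    """Count (wrong, right) pairs and keep those seen often enough."""
    counts = Counter(
        (s.recognised, s.label) for s in samples if s.recognised != s.label
    )
    pairs: dict[str, str] = {}
    # most_common() yields the highest counts first, so the most frequent
    # correction for a given wrong string is the one that is kept.
    for (wrong, right), seen in counts.most_common():
        if seen >= min_observations and wrong not in pairs:
            pairs[wrong] = right
    return pairs
```

With the default floor of 2, a correction that a labeller made only once is ignored until it is observed again.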

This is the same store that RecognitionHint.from_store(...) reads for few-shot examples — see samples.md › Two roads from a SampleStore. A labelled store improves accuracy on the input side (few-shot exemplars guide the recogniser) and on the output side (corrections rewrite known errors). Build the store once; both paths benefit.

`rule_splitter`

When a recognised cell contains │ or ─, the cell was actually a sub-grid the detector missed. rule_splitter:

Parses the text into a sub-grid.
Expands the original Layout to add the sub-cells.
Assigns the sub-strings to the new regions.

Requires the recogniser to be layout-aware (Capability.LAYOUT_PRESERVING).
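
The first step, parsing ruled text into a sub-grid, might look roughly like the sketch below. It covers only the text-parsing part; expanding the Layout and assigning regions depend on scriva types not shown on this page. The function name `split_ruled_text` and the rule for recognising a separator line are assumptions.

```python
def split_ruled_text(text: str) -> list[list[str]]:
    """Parse text containing │ and ─ rules into rows of cell strings."""
    rows: list[list[str]] = []
    for line in text.splitlines():
        stripped = line.strip()
        if not stripped:
            continue
        # A line made only of rule characters is a horizontal separator.
        if "─" in stripped and set(stripped) <= {"─", "│", " "}:
            continue
        # Drop outer borders, then split on the vertical rule.
        cells = [cell.strip() for cell in stripped.strip("│").split("│")]
        rows.append(cells)
    return rows
```

For example, `split_ruled_text("A │ B\n─────\nC │ D")` yields two rows of two cells each, which is the shape the later steps would map onto new regions.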

`confidence_score`

The confidence-scoring stage is itself pluggable. Two implementations ship:

postprocess.confidence_score.rendering(...): a round-trip cross-check against the original crop. Renders the recognised text as an image (via cairosvg + a configurable font), embeds both the original crop and the rendering, and scores by cosine similarity. When the rendered text doesn’t visually match the crop, confidence drops. Engine-agnostic: pass any ImageEmbedder.
postprocess.confidence_score.self(): uses the recogniser’s own probability/logit output if available (Capability.SELF_CONFIDENCE). Cheap, but only as well-calibrated as the model.

Both populate Recognition.confidence in [0.0, 1.0].

confidence_score.rendering(...) is the canonical implementation of the “cross-check against the original data” lever in Pipeline › High-accuracy patterns: the model’s answer has to look like the pixels it came from.
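
The scoring step of the rendering check reduces to a cosine similarity between two embedding vectors. The sketch below shows only that step, on plain lists of floats; rendering and embedding are out of scope. The function names are hypothetical, and clamping negative similarities to 0.0 to stay within [0.0, 1.0] is an assumption about how the range is enforced.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    # A zero vector has no direction; treat it as no similarity.
    return dot / norm if norm else 0.0


def rendering_confidence(crop_embedding: list[float], rendered_embedding: list[float]) -> float:
    """Map similarity from [-1, 1] into [0, 1] by clamping (an assumption)."""
    return max(0.0, min(1.0, cosine_similarity(crop_embedding, rendered_embedding)))
```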

Chains

The pipeline runs post-processors in the order added. Order matters:

import scriva
from scriva.postprocess import (
    whitespace, blank_suppress, dictionary, rule_splitter, confidence_score,
)

pipeline = scriva.Pipeline(
    ...,
    whitespace(),                            # clean up first
    blank_suppress(),                        # then drop noise
    dictionary.from_yaml("corrections.yaml"),# then fix
    rule_splitter(),                         # then restructure
    confidence_score.rendering(),            # score last, against final text
)

For shared chains, bundle into one stage with pipe:

from scriva.postprocess import pipe, whitespace, blank_suppress
cleanup = pipe(whitespace(), blank_suppress())
pipeline = scriva.Pipeline(..., cleanup, ...)
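
Conceptually, bundling a chain amounts to calling each stage in order and feeding its result to the next. The sketch below illustrates that idea and why order matters; it is not scriva's `pipe`. `compose` and the two demo stages are hypothetical, and recognitions are simplified to plain strings.

```python
import asyncio


def compose(*processors):
    """Bundle async stages into one stage that runs them left to right."""
    async def run(page, layout, recognitions):
        for processor in processors:
            recognitions = await processor(page, layout, recognitions)
        return recognitions
    return run


async def strip_text(page, layout, recognitions):
    # Trim surrounding whitespace; leave missing text alone.
    return {rid: t.strip() if t is not None else None for rid, t in recognitions.items()}


async def blank_to_none(page, layout, recognitions):
    # Turn empty strings into None.
    return {rid: t or None for rid, t in recognitions.items()}


cleanup = compose(strip_text, blank_to_none)
reversed_cleanup = compose(blank_to_none, strip_text)

result = asyncio.run(cleanup(None, None, {"r1": "  ", "r2": " ok "}))
reversed_result = asyncio.run(reversed_cleanup(None, None, {"r1": "  ", "r2": " ok "}))
```

In the first ordering the whitespace-only cell is stripped to an empty string and then demoted to None; in the reversed ordering it is still non-empty when the blank check runs, so it survives as an empty string.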

Ad-hoc with `@postprocessor`

For one-off transforms, skip the class:

from scriva import postprocessor

@postprocessor
async def strip_quotes(page, layout, recognitions):
    return {
        rid: r.with_text(r.text.strip('"').strip("'")) if r.text else r
        for rid, r in recognitions.items()
    }

pipeline = scriva.Pipeline(..., strip_quotes, ...)

The decorator accepts name= and capabilities= keyword args.

Writing your own (subclass form)

from scriva import PostProcessor

class StripQuotes(PostProcessor):
    name = "strip_quotes"  # the protocol declares a `name: str` attribute

    async def process(self, page, layout, recognitions):
        return {
            rid: r.with_text(r.text.strip('"').strip("'")) if r.text else r
            for rid, r in recognitions.items()
        }

Recognition is immutable; use .with_text(), .with_confidence(), or .replace(**kwargs) to derive new ones. This makes post-processor chains safe to test and to re-run on the same input.
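
The immutability pattern can be illustrated with a frozen dataclass. `Rec` below is a hypothetical stand-in with two fields, not scriva's Recognition; it only shows how a `.with_text()`-style method can derive a new object while leaving the original untouched.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class Rec:
    """Hypothetical, simplified stand-in for an immutable recognition."""
    text: Optional[str]
    confidence: Optional[float] = None

    def with_text(self, text: Optional[str]) -> "Rec":
        # replace() builds a new instance; the original is not modified.
        return replace(self, text=text)


original = Rec(text='"hi"', confidence=0.9)
updated = original.with_text("hi")
```

Because `original` is unchanged after the call, the same input can be fed through a chain repeatedly and compared against each stage's output in tests.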

When not to use a post-processor

If your refinement requires another VLM call, it is a second recogniser, not a post-processor. Use recognize.fallback(...) or a custom recogniser that wraps two engines. Post-processors are for transforms on already-recognised text.
