Post-processors

A post-processor refines Recognition objects after the recogniser has run. Post-processors form a chain: each one reads the current state, updates or replaces recognitions, and passes the result to the next.

class PostProcessor(Engine, Protocol):
    name: str

    async def process(
        self,
        page: Page,
        layout: Layout,
        recognitions: dict[RegionId, Recognition],
    ) -> dict[RegionId, Recognition]: ...

Post-processors are pure with respect to the page: they read page for context (cropping, rendering), but they only mutate recognitions.
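A minimal sketch of the chain contract, using stand-in types rather than scriva's own: `Rec`, `strip`, `upper`, and `run_chain` are hypothetical names introduced for illustration, and page and layout are passed as plain placeholders. Each stage receives the mapping returned by the previous one and returns a new mapping.

```python
import asyncio
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class Rec:
    """Stand-in for a recognition; not scriva's real class."""
    text: Optional[str]


async def strip(page, layout, recognitions):
    # Return a new mapping; the input mapping is left untouched.
    return {rid: replace(r, text=r.text.strip()) if r.text else r
            for rid, r in recognitions.items()}


async def upper(page, layout, recognitions):
    # Same contract: read the current state, return a derived mapping.
    return {rid: replace(r, text=r.text.upper()) if r.text else r
            for rid, r in recognitions.items()}


async def run_chain(processors, page, layout, recognitions):
    # Feed each stage the output of the previous one, in order.
    for proc in processors:
        recognitions = await proc(page, layout, recognitions)
    return recognitions


initial = {"r1": Rec("  enclosure "), "r2": Rec(None)}
final = asyncio.run(
    run_chain([strip, upper], page=None, layout=None, recognitions=initial)
)
```

Because every stage builds a new mapping, `initial` is unchanged after the run and the same input can be fed through the chain again.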

Factories live under scriva.postprocess (lowercase); the corresponding subclassable classes live under scriva.postprocessors:

| Factory | Class | What it does |
| --- | --- | --- |
| postprocess.dictionary(...) | DictionaryCorrector | Fixes known OCR errors using a supervised dictionary + fuzzy matching |
| postprocess.vocabulary(...) | Vocabulary | Snaps low-confidence outputs onto a domain vocabulary |
| postprocess.rule_splitter(...) | RuleSplitter | Detects internal ruled lines in output and splits the region |
| postprocess.language_detector() | LanguageDetector | Sets Recognition.language from text |
| postprocess.whitespace() | WhitespaceNormalizer | Collapses whitespace, normalises newlines, strips trailing dashes |
| postprocess.blank_suppress() | BlankSuppressor | Demotes “noise” outputs (“None”, “-”, “N/A”) to text=None |
| postprocess.confidence_score(...) | ConfidenceScorer | Attaches a confidence score (see below) |
| postprocess.filter(pred) | Filter | Drops recognitions for which pred(recognition) is False |
| postprocess.pipe(*processors) | Pipe | Bundles a chain so it can be inserted as one stage |

A dictionary is the workhorse of the user-environment accuracy loop: you keep a YAML file (or a SampleStore) of project-specific corrections, and every run pulls it in.

Two dictionary types:

  • Supervised: {wrong: correct} pairs. Four passes: exact, substring (longest-first greedy), full-string fuzzy (difflib, threshold 0.7), token-level fuzzy.
  • Unsupervised: a vocabulary of known-good terms. Used for validation only; does not rewrite.
postprocess.dictionary.from_yaml("corrections.yaml")
postprocess.dictionary.from_pairs({"ENCL0SURE": "ENCLOSURE"})
postprocess.dictionary.from_csv("vendors.csv", from_col="raw", to_col="canonical")
postprocess.dictionary(supervised=..., unsupervised=..., fuzzy_threshold=0.7)
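The four supervised passes can be sketched with the standard library's difflib. This is an illustration under assumptions, not scriva's implementation: the function name `correct`, the early return after each successful pass, and whitespace tokenisation are all choices made here for clarity.

```python
import difflib


def correct(text, pairs, threshold=0.7):
    """Apply {wrong: correct} pairs in four passes (illustrative sketch)."""
    keys = list(pairs)

    # Pass 1: exact match on the whole string.
    if text in pairs:
        return pairs[text]

    # Pass 2: substring replacement, longest keys first so that a long
    # key wins over any shorter key it contains.
    out, replaced = text, False
    for wrong in sorted(keys, key=len, reverse=True):
        if wrong in out:
            out = out.replace(wrong, pairs[wrong])
            replaced = True
    if replaced:
        return out

    # Pass 3: fuzzy match on the whole string.
    match = difflib.get_close_matches(text, keys, n=1, cutoff=threshold)
    if match:
        return pairs[match[0]]

    # Pass 4: fuzzy match token by token; unmatched tokens pass through.
    fixed = []
    for token in text.split():
        m = difflib.get_close_matches(token, keys, n=1, cutoff=threshold)
        fixed.append(pairs[m[0]] if m else token)
    return " ".join(fixed)


pairs = {"ENCL0SURE": "ENCLOSURE"}
```

With these pairs, `correct("ENCL0SURE", pairs)` is resolved by the exact pass, `correct("MAIN ENCL0SURE A", pairs)` by the substring pass, `correct("ENCL0SUR", pairs)` by the full-string fuzzy pass, and `correct("MAIN PANEL ENCL0SUR A", pairs)` only by the token-level pass.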

dictionary.from_samples — derive a dictionary from your labelled store

When you maintain a SampleStore of corrections (every time a human edits a recognised cell, the corrected label is saved alongside the original recognition.text), scriva can build a supervised dictionary from it automatically:

from scriva import samples
from scriva.postprocess import dictionary

store = samples.fs(".scriva_samples")

# Keyword-only options, shown with their default values.
corrections = dictionary.from_samples(
    store,
    where=None,            # Callable[[Sample], bool] | None: filter samples
    min_observations=2,    # int: only adopt pairs seen >= N times
    fuzzy_threshold=0.7,   # float
    refresh="startup",     # "startup", "every_run", or a float (seconds between refreshes)
)

A pair (wrong → right) is added whenever a sample’s recognition.text differs from its human label. The min_observations floor protects against one-off typos in the labels.
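The adoption rule can be sketched as a counting step. The `Sample` stand-in and its field names (`recognised_text`, `label`) are assumptions for illustration and not scriva's schema; the tie-breaking rule (the most frequent correction wins) is also a choice made here.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Sample:
    """Stand-in sample; field names are hypothetical."""
    recognised_text: str
    label: str


def derive_pairs(samples, min_observations=2, where=None):
    # Count each (wrong, right) pair where the recognised text differs
    # from the human label, after applying the optional filter.
    counts = Counter(
        (s.recognised_text, s.label)
        for s in samples
        if (where is None or where(s)) and s.recognised_text != s.label
    )
    pairs = {}
    # most_common() yields the most frequent pairs first, so setdefault
    # keeps the most frequent correction for each wrong string.
    for (wrong, right), n in counts.most_common():
        if n >= min_observations:
            pairs.setdefault(wrong, right)
    return pairs


history = [
    Sample("ENCL0SURE", "ENCLOSURE"),
    Sample("ENCL0SURE", "ENCLOSURE"),
    Sample("PANE1", "PANEL"),   # seen once: below the default floor
    Sample("OK", "OK"),         # no correction: ignored
]
adopted = derive_pairs(history)
```

Lowering `min_observations` to 1 would also adopt the single `PANE1` correction, which is exactly the kind of one-off the floor is meant to filter out.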

This is the same store that RecognitionHint.from_store(...) reads for few-shot examples — see samples.md › Two roads from a SampleStore. A labelled store improves accuracy on the input side (few-shot exemplars guide the recogniser) and on the output side (corrections rewrite known errors). Build the store once; both paths benefit.

When a recognised cell contains ruled-line characters, the cell was actually a sub-grid the detector missed. rule_splitter:

  1. Parses the text into a sub-grid.
  2. Expands the original Layout to add the sub-cells.
  3. Assigns the sub-strings to the new regions.

Requires the recognizer to be layout-aware (Capability.LAYOUT_PRESERVING).
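The text-handling part of these steps can be sketched as follows. Everything specific here is an assumption for illustration: the separators ("|" between columns, a newline between rows), the `parent/r{row}c{col}` id scheme, and the function name. Expanding the Layout geometry is omitted.

```python
def split_ruled_cell(region_id, text, col_sep="|", row_sep="\n"):
    # Step 1: parse the text into a sub-grid of trimmed cell strings,
    # skipping rows that are empty after trimming.
    grid = [
        [cell.strip() for cell in row.split(col_sep)]
        for row in text.split(row_sep)
        if row.strip()
    ]
    # Steps 2 and 3: mint one hypothetical sub-region id per cell and
    # assign the matching sub-string to it.
    return {
        f"{region_id}/r{r}c{c}": cell
        for r, row in enumerate(grid)
        for c, cell in enumerate(row)
    }


sub_cells = split_ruled_cell("cell-7", "Qty | Unit\n4 | mm")
```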

The confidence-scoring stage is itself pluggable. Two implementations ship:

  • postprocess.confidence_score.rendering(...) — a round-trip cross-check against the original crop. Renders the recognised text as an image (via cairosvg + a configurable font), embeds both the original crop and the rendering, and scores by cosine similarity. When the rendered text doesn’t visually match the crop, confidence drops. Engine-agnostic — pass any ImageEmbedder.
  • postprocess.confidence_score.self() — uses the recognizer’s own probability/logit output if available (Capability.SELF_CONFIDENCE). Cheap, but only as well-calibrated as the model.

Both populate Recognition.confidence in [0.0, 1.0].
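The scoring step of the rendering check can be sketched as a cosine similarity between two embedding vectors, clamped to the documented range. The function name and the clamping of negative similarities to 0.0 are assumptions; rendering the text and embedding the images are outside this sketch.

```python
import math


def cosine_confidence(crop_vec, render_vec):
    # Cosine similarity between the embedding of the original crop and
    # the embedding of the rendered text.
    dot = sum(a * b for a, b in zip(crop_vec, render_vec))
    norm = (math.sqrt(sum(a * a for a in crop_vec))
            * math.sqrt(sum(b * b for b in render_vec)))
    if norm == 0.0:
        return 0.0  # a zero vector carries no evidence either way
    # Clamp into [0.0, 1.0]; negative similarity is treated as no match.
    return min(1.0, max(0.0, dot / norm))
```

Identical directions score 1.0, orthogonal or opposite embeddings score 0.0, and anything in between lowers the confidence proportionally.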

confidence_score.rendering(...) is the canonical implementation of the “cross-check against the original data” lever described in Pipeline › High-accuracy patterns: the model’s answer has to look like the pixels it came from.

The pipeline runs post-processors in the order added. Order matters:

import scriva
from scriva.postprocess import (
    whitespace, blank_suppress, dictionary, rule_splitter, confidence_score,
)

pipeline = scriva.Pipeline(
    ...,
    whitespace(),                              # clean up first
    blank_suppress(),                          # then drop noise
    dictionary.from_yaml("corrections.yaml"),  # then fix
    rule_splitter(),                           # then restructure
    confidence_score.rendering(),              # score last, against final text
)
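To see why order matters, consider a sketch with plain string functions standing in for whitespace() and blank_suppress(). The helper names and the exact-match noise set are assumptions for illustration, not scriva's implementation: a padded noise value is only suppressed if whitespace is normalised first.

```python
from typing import Optional

# Assumed noise set, taken from the blank_suppress examples in the table.
NOISE = {"None", "-", "N/A"}


def normalise_whitespace(text: Optional[str]) -> Optional[str]:
    # Collapse runs of whitespace and trim both ends.
    return " ".join(text.split()) if text is not None else None


def suppress_blank(text: Optional[str]) -> Optional[str]:
    # Demote exact noise values to None; everything else passes through.
    return None if text in NOISE else text


raw = "  N/A \n"
clean_then_drop = suppress_blank(normalise_whitespace(raw))
drop_then_clean = normalise_whitespace(suppress_blank(raw))
```

In the first ordering the value is cleaned to "N/A" and then suppressed; in the second, the padded string slips past the noise check and survives as "N/A".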

For shared chains, bundle into one stage with pipe:

from scriva.postprocess import pipe, whitespace, blank_suppress
cleanup = pipe(whitespace(), blank_suppress())
pipeline = scriva.Pipeline(..., cleanup, ...)

For one-off transforms, skip the class:

from scriva import postprocessor

@postprocessor
async def strip_quotes(page, layout, recognitions):
    return {
        rid: r.with_text(r.text.strip('"').strip("'")) if r.text else r
        for rid, r in recognitions.items()
    }

pipeline = scriva.Pipeline(..., strip_quotes, ...)

The decorator accepts name= and capabilities= keyword args.

from scriva import PostProcessor

class StripQuotes(PostProcessor):
    async def process(self, page, layout, recognitions):
        return {
            rid: r.with_text(r.text.strip('"').strip("'")) if r.text else r
            for rid, r in recognitions.items()
        }

Recognition is immutable; use .with_text(), .with_confidence(), or .replace(**kwargs) to derive new ones. This makes post-processor chains safe to test and to re-run on the same input.
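The immutability pattern can be sketched with a frozen dataclass. `Rec` is a stand-in, not scriva's Recognition, and its `with_text` is an assumed equivalent built on `dataclasses.replace`: deriving a new object leaves the original intact, and direct assignment is rejected.

```python
from dataclasses import FrozenInstanceError, dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class Rec:
    """Stand-in for an immutable recognition."""
    text: Optional[str]
    confidence: Optional[float] = None

    def with_text(self, text):
        # Derive a copy with one field changed; self is not modified.
        return replace(self, text=text)


original = Rec('"ENCLOSURE"', 0.9)
derived = original.with_text(original.text.strip('"'))

# Direct mutation of a frozen instance raises FrozenInstanceError.
try:
    original.text = "changed"
    mutated = True
except FrozenInstanceError:
    mutated = False
```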

If your refinement requires another VLM call, that is a second recognizer, not a post-processor. Use recognize.fallback(...) or a custom recognizer that wraps two engines. Post-processors are for transforms on already-recognised text.