Preprocessors

A preprocessor manipulates pixels before they reach the recognizer. scriva ships two flavours, distinguished by what they see:

  • Page preprocessors (Preprocessor) — transform a whole Page before detection. Deskew, denoise, rotate, normalise contrast.
  • Region preprocessors (RegionPreprocessor) — transform crops or split regions after detection, before recognition. Per-cell binarise, slice a tall region into rows, pad, dewarp, glare-remove.

Both live under scriva.preprocess and scriva.preprocessors. The pipeline slots each one into the correct phase by sniffing its Protocol — you write the imports flat:

from scriva.preprocess import (
    orientation, deskew,            # page-level
    binarize, slice_horizontal,     # region-level
)

A typical full chain:

preprocess (Page→Page) ──► detect ──► classify
──► preprocess (Layout→Layout) ──► recognize

The two phases sit either side of detect/classify because they work on different inputs (raw pixels vs identified crops).
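
One way a pipeline could sort flat arguments into phases is to look at the shape of each stage's process method. The sketch below is an assumption about the mechanism, not scriva's implementation: phase_of and group_by_phase are hypothetical names, and the two stand-in classes only mimic the page-level and region-level signatures.

```python
import inspect
from collections import defaultdict


class Deskew:                       # stand-in page preprocessor
    async def process(self, page):
        return page


class Binarize:                     # stand-in region preprocessor
    async def process(self, page, layout):
        return layout


def phase_of(stage) -> str:
    """Classify a stage by how many arguments its process() takes."""
    params = inspect.signature(stage.process).parameters
    return "page" if len(params) == 1 else "region"


def group_by_phase(stages):
    """Bucket stages by phase, keeping the written order inside each bucket."""
    phases = defaultdict(list)
    for stage in stages:
        phases[phase_of(stage)].append(stage)
    return dict(phases)


# Written out of order on purpose: region, page, region, page.
stages = [Binarize(), Deskew(), Binarize(), Deskew()]
grouped = group_by_phase(stages)
```

Because the grouping is a stable partition, the relative order of the two Deskew stand-ins (and of the two Binarize stand-ins) is exactly the order in which they were written.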


class Preprocessor(Engine, Protocol):
    name: str
    async def process(self, page: Page) -> Page: ...

The result is a new Page — preprocessors are pure, never in-place. The original page is preserved on ctx.original_page so downstream stages can refer back if they need to.

Factories under scriva.preprocess (lowercase) and classes under scriva.preprocessors (subclassable):

| Factory | Class | What it does |
| --- | --- | --- |
| preprocess.orientation(...) | OrientationCorrector | Coarse rotation (0/90/180/270°) via Tesseract OSD |
| preprocess.deskew(...) | Deskewer | Sub-degree skew correction via projection-profile alignment |
| preprocess.denoise(...) | Denoiser | Bilateral-filter noise removal |
| preprocess.normalize(...) | Normalizer | Contrast stretch and optional DPI resample |
| preprocess.crop(...) | Cropper | Fixed-rectangle crop applied before everything else |
| preprocess.dewarp(...) | Dewarper | Document-perspective correction (photo of a curled page) |
| preprocess.whiteboard_clean(...) | WhiteboardCleaner | Strip marker hue, normalise lighting (Office-Lens-style) |
| preprocess.binarize(...) | Binarizer | Otsu / adaptive binarisation for low-quality scans |
| preprocess.invert() | Inverter | Negative, for white-on-dark documents |

Phone-camera and document-scanner outputs often arrive 90°, 180°, or 270° off. orientation() uses Tesseract OSD (Orientation and Script Detection) to estimate the angle and then applies an exact-pixel PIL.Image.transpose — no interpolation blur, no JPEG recompression.

from scriva.preprocess import orientation
orientation(
    *,
    backend: str = "tesseract",    # or a custom OrientationEstimator
    on_failure: str = "skip",      # "skip" | "raise"
    save_format: str = "png",      # "png" preserves quality; "keep" reuses input format
)

If Tesseract is not on $PATH, the preprocessor logs once and leaves the page unchanged. Set on_failure="raise" to make missing OSD a build-time EngineError instead.
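
Rotating by a multiple of 90° only rearranges pixels, which is why no interpolation is involved. A pure-Python sketch on a nested list standing in for an image illustrates the idea; rotate_quarter_turns is a hypothetical helper, not part of scriva or PIL.

```python
def rotate_quarter_turns(grid, turns):
    """Rotate a row-major pixel grid counter-clockwise by turns * 90 degrees."""
    for _ in range(turns % 4):
        # Transpose, then reverse the row order: one counter-clockwise quarter turn.
        grid = [list(row) for row in zip(*grid)][::-1]
    return grid


pixels = [[1, 2, 3],
          [4, 5, 6]]
once = rotate_quarter_turns(pixels, 1)        # 2x3 grid becomes 3x2
full_circle = rotate_quarter_turns(pixels, 4)  # four quarter turns: back to the start
```

Every output value is a copy of exactly one input value, so four quarter turns reproduce the input exactly.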

Custom backends implement the OrientationEstimator protocol:

class OrientationEstimator(Protocol):
    async def estimate(self, page: Page) -> int: ...   # returns 0, 90, 180, or 270
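
A custom backend only needs an async estimate method that returns one of the four angles. The toy heuristic below is a sketch under loud assumptions: FakePage is a hypothetical stand-in, and the real Page may not expose width and height under those names.

```python
import asyncio
from dataclasses import dataclass


@dataclass(frozen=True)
class FakePage:              # hypothetical stand-in for scriva's Page
    width: int
    height: int


class LandscapeMeansSideways:
    """Guess 90 degrees for landscape pages and 0 otherwise (a toy heuristic)."""

    name = "landscape-heuristic"

    async def estimate(self, page) -> int:
        return 90 if page.width > page.height else 0


estimator = LandscapeMeansSideways()
angle = asyncio.run(estimator.estimate(FakePage(width=3300, height=2550)))
```

Going by the backend comment in the signature above, an instance like this would presumably be passed as orientation(backend=estimator).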

deskew() handles the small skew angles that sheet-feed scanners introduce. It is pure OpenCV / NumPy, with no model dependency. Running it on an already-straight scan costs less than 30 ms on a 2400 px page, and a safety threshold keeps that no-op cheap:

deskew(
    *,
    max_correction_deg: float = 5.0,   # never rotate by more than this
    threshold_deg: float = 0.3,        # below this, do nothing
)

Run orientation() first when both are present — a 90°-rotated scan has no meaningful “skew” to measure.
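
The interplay of the two deskew parameters can be sketched as a small decision function. This is an illustration rather than scriva's code: planned_rotation is a hypothetical name, and clamping angles beyond max_correction_deg (instead of skipping the page) is an assumption.

```python
def planned_rotation(measured_deg, threshold_deg=0.3, max_correction_deg=5.0):
    """Return the corrective rotation for a measured skew, in degrees."""
    if abs(measured_deg) < threshold_deg:
        return 0.0                       # already straight enough: no-op
    # Assumption: larger skews are clamped to the limit rather than skipped.
    correction = -measured_deg           # rotate against the measured skew
    return max(-max_correction_deg, min(max_correction_deg, correction))
```

With the defaults, a measured skew of 0.1° yields no rotation, 2.0° yields a 2.0° counter-rotation, and anything beyond 5° is capped at 5°.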

denoise() applies a bilateral filter. The defaults are conservative; raise strength only for phone-camera scans of textured paper.

denoise(*, strength: int = 1)

normalize() stretches contrast to [0, 255] and can resample to a target DPI. It is most useful in front of detectors trained at a known resolution.

normalize(*, target_dpi: int | None = None, contrast: bool = True)
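
A contrast stretch to [0, 255] is a linear remap of the darkest value to 0 and the brightest to 255. The sketch below works on a flat list of grey levels; stretch_contrast is a hypothetical helper, and the real Normalizer may use percentiles or per-channel logic instead.

```python
def stretch_contrast(values):
    """Linearly map the darkest value to 0 and the brightest to 255."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return list(values)              # flat image: nothing to stretch
    return [round((v - lo) * 255 / (hi - lo)) for v in values]


washed_out = [100, 120, 200]             # a low-contrast scan line
stretched = stretch_contrast(washed_out)
```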

crop() applies a fixed-rectangle crop in page-pixel coordinates before any other stage runs. It is useful when an upstream UI has already let the user select the region of interest.

crop(*, bbox: tuple[int, int, int, int]) # (x, y, w, h)
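
The (x, y, w, h) convention differs from the (left, upper, right, lower) box that Pillow's Image.crop expects, so a conversion is needed when mixing the two. xywh_to_box is a hypothetical helper shown only to make the difference concrete.

```python
def xywh_to_box(bbox):
    """Convert (x, y, w, h) to a (left, upper, right, lower) box."""
    x, y, w, h = bbox
    if w <= 0 or h <= 0:
        raise ValueError("crop width and height must be positive")
    return (x, y, x + w, y + h)


box = xywh_to_box((100, 50, 800, 600))
```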

dewarp, whiteboard_clean, binarize, invert

dewarp()                                  # photo of a curled page → flat
whiteboard_clean(saturation_drop=0.8,     # office-lens-style
                 contrast_boost=1.4)
binarize(method="otsu")                   # or "adaptive", "sauvola"
invert()                                  # white-on-dark → dark-on-white

binarize is the cheapest accuracy lever for low-DPI grayscale scans — many VLMs and all classical engines score noticeably higher on two-tone input.
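
For readers unfamiliar with the "otsu" method, the sketch below shows the textbook algorithm on a flat list of 8-bit grey levels: pick the threshold that maximises the between-class variance. It is a plain-Python illustration, not scriva's Binarizer; otsu_threshold and binarize_pixels are hypothetical names.

```python
def otsu_threshold(pixels):
    """Return the grey level t that maximises between-class variance.

    Pixels <= t form the dark class, pixels > t the light class.
    """
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1
    total = len(pixels)
    total_sum = sum(level * count for level, count in enumerate(hist))

    best_t, best_score = 0, -1.0
    dark_count, dark_sum = 0, 0
    for t in range(256):
        dark_count += hist[t]
        dark_sum += t * hist[t]
        light_count = total - dark_count
        if dark_count == 0 or light_count == 0:
            continue                      # one class is empty: not a valid split
        mean_dark = dark_sum / dark_count
        mean_light = (total_sum - dark_sum) / light_count
        # Between-class variance, up to a constant factor of 1 / total**2.
        score = dark_count * light_count * (mean_dark - mean_light) ** 2
        if score > best_score:
            best_t, best_score = t, score
    return best_t


def binarize_pixels(pixels):
    """Map every pixel to 0 or 255 using the Otsu threshold."""
    t = otsu_threshold(pixels)
    return [255 if p > t else 0 for p in pixels]


# A bimodal sample: dark ink around 20-30, light paper around 200-220.
sample = [20] * 50 + [30] * 10 + [200] * 40 + [220] * 10
two_tone = binarize_pixels(sample)
```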

import scriva
from scriva.preprocess import orientation, deskew, denoise
from scriva.detect import morphological_grid
pipeline = scriva.Pipeline(
    orientation(),           # rotate first: everything else assumes upright
    deskew(),
    denoise(),
    morphological_grid(),
    ...,
)

Out-of-order arguments are still slotted into the preprocessor phase, but the order within the phase is the order you wrote them. Put orientation before deskew, and put any geometry-changing preprocessor (crop, orientation) before any colour-space one (normalize, denoise).

Subclass for stateful preprocessors:

from scriva import Preprocessor, Page

class Grayscale(Preprocessor):
    async def process(self, page: Page) -> Page:
        return page.with_image(page.image.convert("L"))

…or decorate a function for stateless ones:

import scriva
from scriva import preprocessor

@preprocessor
async def grayscale(page):
    return page.with_image(page.image.convert("L"))

pipeline = scriva.Pipeline(grayscale, ...)

Page is immutable; use page.with_image(...) or page.replace(**kwargs) to derive a new one — same pattern as Recognition.
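
The derive-don't-mutate pattern can be sketched with a frozen dataclass. The Page class below is a hypothetical stand-in built only to show the shape of with_image and replace; the real Page carries more fields and may be implemented differently.

```python
import dataclasses
from dataclasses import dataclass


@dataclass(frozen=True)
class Page:                   # hypothetical stand-in, not scriva's Page
    image: tuple              # pixel data; a tuple keeps the sketch simple
    dpi: int = 300

    def with_image(self, image):
        """Return a copy with a different image; self is left untouched."""
        return dataclasses.replace(self, image=image)

    def replace(self, **kwargs):
        """Return a copy with arbitrary fields swapped out."""
        return dataclasses.replace(self, **kwargs)


original = Page(image=(1, 2, 3))
grey = original.with_image((9, 9, 9))     # new object; original keeps (1, 2, 3)
low_res = original.replace(dpi=150)
```

Assigning to a field of a frozen dataclass raises dataclasses.FrozenInstanceError, which is what makes accidental in-place edits fail loudly.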


A RegionPreprocessor operates on the layout after detection — so it can see what the detector found and apply different treatment per region. Two things it can do:

  1. Transform a region’s crop in place (binarise this cell only, sharpen all data cells, add 12 px of padding to header rows).
  2. Split a region into multiple sub-regions (slice a tall paragraph into N rows, split a misdetected merged cell into halves).

Both shapes use the same protocol; what differs is what the stage returns.

class RegionPreprocessor(Engine, Protocol):
    name: str
    async def process(self, page: Page, layout: Layout) -> Layout: ...

The returned Layout may have:

  • New regions appended (slicers — one in, N out).
  • Existing regions with crop_override set (transforms — the recognizer reads region.crop_override in preference to cropping page at region.bbox).
  • Existing regions with role changed (e.g. demoted to "blank" after a binarisation reveals the cell is empty).

Region.crop_override is a PageCrop — the same shape page.crop(...) returns. Transforms compose naturally: a region that has been sharpened, then padded, then binarised carries the final crop with all three steps baked in.
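
How overrides compose can be sketched with a stand-in Region whose crop is a nested list of grey levels. Everything here is hypothetical (Region, effective_crop, transform, and the two toy image functions); the point is only that each transform reads the current override, if any, and writes a new one.

```python
import dataclasses
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Region:                               # hypothetical stand-in
    bbox: tuple                             # (x, y, w, h)
    crop_override: Optional[list] = None    # None means "crop the page at bbox"


def effective_crop(page_rows, region):
    """What a recognizer would read: the override if set, else a fresh crop."""
    if region.crop_override is not None:
        return region.crop_override
    x, y, w, h = region.bbox
    return [row[x:x + w] for row in page_rows[y:y + h]]


def transform(page_rows, region, fn):
    """Apply fn to the current crop and store the result as the new override."""
    new_crop = fn(effective_crop(page_rows, region))
    return dataclasses.replace(region, crop_override=new_crop)


def invert(rows):
    return [[255 - p for p in row] for row in rows]


def threshold(rows, t=128):
    return [[255 if p > t else 0 for p in row] for row in rows]


page_rows = [[0, 10, 200],
             [30, 240, 50]]
region = Region(bbox=(1, 0, 2, 2))
region = transform(page_rows, region, invert)      # first step reads the page
region = transform(page_rows, region, threshold)   # second step reads the override
```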

All factories live under scriva.preprocess alongside the page ones — the pipeline assigns them to the right phase by sniffing the Protocol.

Crop transforms (one region → one region with crop_override)

| Factory | What it does |
| --- | --- |
| preprocess.pad(px=8) | Add uniform padding around each region’s crop |
| preprocess.pad_each(top=, right=, bottom=, left=) | Asymmetric padding |
| preprocess.binarize(method="otsu") | Per-cell binarisation (separately tuned per cell) |
| preprocess.sharpen(strength=1.0) | Unsharp mask |
| preprocess.contrast(factor=1.4) | Local contrast boost |
| preprocess.upscale(factor=2.0) | Lanczos upscale, useful for tiny cells in dense forms |
| preprocess.grayscale() | Drop chroma; cheaper for some VLM providers |
| preprocess.invert() | Per-cell inversion (white-on-dark cells in a mixed page) |
| preprocess.dewarp() | Region-local perspective correction |
| preprocess.glare_remove(strength=1.0) | Remove specular highlights in photo crops |
| preprocess.whiteboard_clean(...) | Region-local marker / lighting normalisation |
| preprocess.deskew_region(...) | Per-region micro-deskew (for cells rotated independently) |
| preprocess.mask(predicate) | Black out pixels for which predicate(x, y) is True, for redaction-style “show only the answer” prompts |

pad is the workhorse — many VLMs lose accuracy when a crop is tight to the glyph baseline. The default 4 px on the recognizer is a fallback; when you want different padding per role (more for headers, less for dense data cells), use pad(...) here instead.
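
Padding amounts to growing a bounding box before cropping. The sketch below clamps the grown box to the page edges, which is an assumption (the real pad may fill with background colour instead); padded_bbox and the role table are hypothetical.

```python
def padded_bbox(bbox, px, page_w, page_h):
    """Grow (x, y, w, h) by px on every side, clamped to the page bounds."""
    x, y, w, h = bbox
    left, top = max(0, x - px), max(0, y - px)
    right, bottom = min(page_w, x + w + px), min(page_h, y + h + px)
    return (left, top, right - left, bottom - top)


# Illustrative per-role padding: more for headers, less for dense data cells.
PADDING_BY_ROLE = {"header": 12, "data": 4}

header_box = padded_bbox((10, 10, 100, 20), PADDING_BY_ROLE["header"], 200, 200)
data_box = padded_bbox((50, 60, 100, 20), PADDING_BY_ROLE["data"], 200, 200)
```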

| Factory | What it does |
| --- | --- |
| preprocess.slice_horizontal(rows=N, overlap_px=0) | Split into N equal-height bands |
| preprocess.slice_vertical(cols=N, overlap_px=0) | Split into N equal-width bands |
| preprocess.slice_grid(rows=R, cols=C) | R × C grid of sub-regions |
| preprocess.slice_on_separator(direction=...) | Split where a horizontal or vertical whitespace gap ≥ min_gap_px is found |
| preprocess.slice_overflow(max_height_px=..., axis="vertical") | Slice only regions exceeding max_height_px; leave the rest alone |
| preprocess.slice_by(callable) | Caller-supplied splitter: Region → list[Region] |

Slicers keep the original region as the parent and give each child a fresh ID, with parent_region_id pointing back to it. Calling result.merge_slices() folds the children's recognition back into the parent; by default, the exporter does this automatically.
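
The band arithmetic and the parent/child bookkeeping can be sketched as follows. Region and slice_rows are hypothetical stand-ins, and details such as which band absorbs the rounding remainder are assumptions, not documented behaviour.

```python
import itertools
from dataclasses import dataclass
from typing import Optional

_ids = itertools.count(1)


@dataclass(frozen=True)
class Region:                      # hypothetical stand-in
    id: int
    bbox: tuple                    # (x, y, w, h)
    parent_region_id: Optional[int] = None


def slice_rows(region, rows, overlap_px=0):
    """Split a region into `rows` equal-height bands, each widened by overlap_px."""
    x, y, w, h = region.bbox
    band = h // rows
    children = []
    for i in range(rows):
        top = max(y, y + i * band - overlap_px)
        if i == rows - 1:
            bottom = y + h         # last band absorbs the division remainder
        else:
            bottom = min(y + h, y + (i + 1) * band + overlap_px)
        children.append(Region(next(_ids), (x, top, w, bottom - top), region.id))
    return children


parent = Region(next(_ids), (0, 100, 400, 1000))
children = slice_rows(parent, rows=3, overlap_px=10)
```

The overlap gives a recognizer a few shared pixel rows on either side of each cut, so a text line that straddles a boundary appears whole in at least one band.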

Every region preprocessor accepts an apply_to= callable that filters which regions it touches. Regions for which it returns False are passed through untouched:

from scriva.preprocess import binarize, pad, slice_horizontal, slice_on_separator

binarize(method="otsu", apply_to=lambda r: r.role == "data")
pad(px=12, apply_to=lambda r: r.role == "header")
slice_horizontal(rows=2, apply_to=lambda r: r.bbox.h > 1200)
slice_on_separator(direction="horizontal", min_gap_px=20,
                   apply_to=lambda r: r.role == "data")

apply_to is the single composition primitive — chains of “different treatment per role” stay readable.
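
The pass-through behaviour of apply_to can be sketched as a wrapper around a per-region transform. The names here (Region, filtered, and the two marker stages) are hypothetical and stand in for real image transforms.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Region:                  # hypothetical stand-in
    role: str
    note: str = ""


def filtered(transform, apply_to=lambda r: True):
    """Wrap a per-region transform so it only touches regions apply_to accepts."""
    def run(regions):
        # Rejected regions are returned as the very same objects, untouched.
        return [transform(r) if apply_to(r) else r for r in regions]
    return run


mark_binarized = filtered(lambda r: replace(r, note=r.note + "binarized "),
                          apply_to=lambda r: r.role == "data")
mark_padded = filtered(lambda r: replace(r, note=r.note + "padded "),
                       apply_to=lambda r: r.role == "header")

regions = [Region("header"), Region("data"), Region("blank")]
untouched = regions[2]
for stage in (mark_binarized, mark_padded):
    regions = stage(regions)
```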

import scriva
from scriva.preprocess import (
    orientation, deskew,                       # page-level
    binarize, pad, sharpen, slice_overflow,    # region-level
)
from scriva.detect import morphological_grid
from scriva.classify import rule_based
from scriva.recognize import openai

pipeline = scriva.Pipeline(
    orientation(), deskew(),                                        # page phase
    morphological_grid(),                                           # detect
    rule_based(),                                                   # classify
    pad(px=8, apply_to=lambda r: r.role == "data"),                 # region phase
    binarize(method="otsu", apply_to=lambda r: r.role == "data"),
    sharpen(strength=0.6),
    slice_overflow(max_height_px=1500),                             # split very tall cells
    openai(model="gpt-4o"),
)

The pipeline figures out that pad/binarize/sharpen/slice_overflow are region preprocessors (Protocol = RegionPreprocessor) and slots them between classify and recognize. Order within the phase is the order you wrote.

For shared chains, bundle into one stage with preprocess.pipe:

import scriva
from scriva.preprocess import pipe, binarize, sharpen, pad

clean_data_cells = pipe(
    pad(px=8),
    binarize(method="otsu"),
    sharpen(strength=0.6),
    apply_to=lambda r: r.role == "data",   # applied to all children
)
pipeline = scriva.Pipeline(..., clean_data_cells, ...)

Subclass:

from scriva import RegionPreprocessor, Layout, Page

class StripHeaderBars(RegionPreprocessor):
    async def process(self, page: Page, layout: Layout) -> Layout:
        for region in layout.regions:
            if region.role == "header":
                crop = page.crop(region.bbox, padding=0)
                # _strip_top_bar is a user-supplied image helper, not defined here
                region.crop_override = crop.with_image(_strip_top_bar(crop.image))
        return layout

…or decorate a function:

from scriva import region_preprocessor

@region_preprocessor(apply_to=lambda r: r.role == "header")
async def strip_header_bars(page, region):
    crop = page.crop(region.bbox, padding=0)
    return region.with_crop_override(crop.with_image(_strip_top_bar(crop.image)))

The decorated form gets called per-region — the framework iterates and respects apply_to=. The subclass form gets the whole layout in one call, which is what you want when the transform needs cross-region context (e.g. global statistics, neighbour-aware slicing).
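
As an illustration of cross-region context, the sketch below flags regions whose height is far above the median of the whole layout, a decision no per-region function could make on its own. It is a standalone approximation: Region is a stand-in, the layout is modelled as a plain list, and the "overflow" role is invented for the example.

```python
import asyncio
import statistics
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Region:            # hypothetical stand-in
    id: int
    height: int
    role: str = "data"


class FlagTallOutliers:
    """Mark regions taller than `factor` times the median height as 'overflow'."""

    name = "flag-tall-outliers"

    def __init__(self, factor=2.0):
        self.factor = factor

    async def process(self, page, regions):
        # The median is a global statistic: it needs every region at once.
        median = statistics.median(r.height for r in regions)
        return [replace(r, role="overflow") if r.height > self.factor * median else r
                for r in regions]


regions = [Region(1, 40), Region(2, 42), Region(3, 38), Region(4, 300)]
result = asyncio.run(FlagTallOutliers().process(page=None, regions=regions))
```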

Three recurring patterns:

  1. Mixed-quality cells in one form. Header rows are typeset; data cells are stamped or handwritten. Binarise only the data cells.
  2. Tall paragraphs that overflow a VLM’s effective resolution. slice_overflow(max_height_px=1500) splits them into bands; the recognizer sees clean, in-scale text on each.
  3. Tight bboxes from an aggressive detector. pad(px=12) around each crop recovers character tops and descenders without re-detecting.

Reach for a different stage in these cases:

  • If the transform needs to see the whole page in raw form, use a page preprocessor; region preprocessors get a cropped view by design.
  • If the transform is about output text (whitespace, dictionary correction, splitting on detected ruled lines), use a PostProcessor. Post-processors run after the recognizer; region preprocessors run before it.
  • If the split depends on what the recognizer said (e.g. “split this cell because the recognised text contains ”), use postprocess.rule_splitter, which ships exactly that.

See also:

  • Pipeline › High-accuracy patterns: region preprocessing is one of three accuracy levers; the others are human-in-the-loop and cross-checking.
  • Detectors: region preprocessors run after detection; if the detector itself needs work, start there.
  • Post-processors: the symmetric phase on the output side.