Preprocessors

A preprocessor manipulates pixels before they reach the recognizer. scriva ships two flavours, distinguished by what they see:

Page preprocessors (Preprocessor) — transform a whole Page before detection. Deskew, denoise, rotate, normalise contrast.
Region preprocessors (RegionPreprocessor) — transform crops or split regions after detection, before recognition. Per-cell binarise, slice a tall region into rows, pad, dewarp, glare-remove.

Both live under scriva.preprocess and scriva.preprocessors. The pipeline slots each one into the correct phase by sniffing its Protocol — you write the imports flat:

from scriva.preprocess import (
    orientation, deskew,        # page-level
    binarize, slice_horizontal, # region-level
)

A typical full chain:

preprocess (Page→Page) ──► detect ──► classify
                       ──► preprocess (Layout→Layout) ──► recognize

The two phases sit either side of detect/classify because they work on different inputs (raw pixels vs identified crops).
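How the pipeline sniffs the Protocol is not spelled out here. As a minimal sketch of one way it could work (an assumption, not scriva's actual code), the example below tells the two preprocessor flavours apart by the arity of process() and then uses a stable sort, so stages move to the right phase while keeping the order you wrote within a phase. The class names, the phase tag, and PHASE_ORDER are all hypothetical stand-ins.

```python
import inspect

# Phase order taken from the diagram above; the labels are illustrative.
PHASE_ORDER = ["page", "detect", "classify", "region", "recognize"]


class Deskew:
    async def process(self, page):            # Page -> Page
        return page


class Pad:
    async def process(self, page, layout):    # (Page, Layout) -> Layout
        return layout


class Detector:
    phase = "detect"                          # hypothetical explicit tag


def phase_of(stage) -> str:
    """Classify a stage by an explicit tag, else by the arity of process()."""
    explicit = getattr(stage, "phase", None)
    if explicit is not None:
        return explicit
    # inspect.signature on a bound method already excludes `self`.
    params = inspect.signature(stage.process).parameters
    return "page" if len(params) == 1 else "region"


def slot(stages):
    """Stable sort: phases are reordered, order inside a phase is kept."""
    return sorted(stages, key=lambda s: PHASE_ORDER.index(phase_of(s)))
```

Because sorted() is stable, two page preprocessors written in a given order stay in that order even when a detector or a region preprocessor was written between them.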

Page preprocessors

Protocol

class Preprocessor(Engine, Protocol):
    name: str
    async def process(self, page: Page) -> Page: ...

The result is a new Page — preprocessors are pure, never in-place. The original page is preserved on ctx.original_page so downstream stages can refer back if they need to.

Built-in page preprocessors

Factories under scriva.preprocess (lowercase) and classes under scriva.preprocessors (subclassable):

Factory	Class	What it does
`preprocess.orientation(...)`	`OrientationCorrector`	Coarse rotation (0/90/180/270°) via Tesseract OSD
`preprocess.deskew(...)`	`Deskewer`	Sub-degree skew correction via projection-profile alignment
`preprocess.denoise(...)`	`Denoiser`	Bilateral-filter noise removal
`preprocess.normalize(...)`	`Normalizer`	Contrast stretch and optional DPI resample
`preprocess.crop(...)`	`Cropper`	Fixed-rectangle crop applied before everything else
`preprocess.dewarp(...)`	`Dewarper`	Document-perspective correction (photo of a curled page)
`preprocess.whiteboard_clean(...)`	`WhiteboardCleaner`	Strip marker hue, normalise lighting (Office-Lens-style)
`preprocess.binarize(...)`	`Binarizer`	Otsu / adaptive binarisation for low-quality scans
`preprocess.invert()`	`Inverter`	Negative — for white-on-dark documents

`orientation`

Phone-camera and document-scanner outputs often arrive 90°, 180°, or 270° off. orientation() uses Tesseract OSD (Orientation and Script Detection) to estimate the angle and then applies an exact-pixel PIL.Image.transpose — no interpolation blur, no JPEG recompression.

from scriva.preprocess import orientation

orientation(
    *,
    backend: str = "tesseract",   # or a custom OrientationEstimator
    on_failure: str = "skip",     # "skip" | "raise"
    save_format: str = "png",     # "png" preserves quality; "keep" reuses input format
)

If Tesseract is not on $PATH, the preprocessor logs once and leaves the page unchanged. Set on_failure="raise" to make missing OSD a build-time EngineError instead.
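The skip-or-raise behaviour can be sketched with the standard library alone. This is an illustration of the pattern described above, not scriva's implementation: osd_available, the module-level warning flag, and the local EngineError class are hypothetical stand-ins.

```python
import logging
import shutil

log = logging.getLogger("orientation-sketch")
_warned = False


class EngineError(RuntimeError):
    """Stand-in for the library's error type."""


def osd_available(on_failure: str = "skip", binary: str = "tesseract") -> bool:
    """Return True if the OSD binary is on PATH; otherwise skip or raise."""
    global _warned
    if shutil.which(binary) is not None:
        return True
    if on_failure == "raise":
        raise EngineError(f"{binary!r} not found on PATH")
    if not _warned:
        # Log only once so a long batch does not flood the log.
        log.warning("%s not found on PATH; leaving pages unchanged", binary)
        _warned = True
    return False
```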

Custom backends implement the OrientationEstimator protocol:

class OrientationEstimator(Protocol):
    async def estimate(self, page: Page) -> int: ...   # returns 0, 90, 180, or 270
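A custom backend only has to return one of the four quadrant angles. The sketch below shows a hypothetical estimator that trusts a rotation hint stored with the page and snaps it to the nearest quadrant. The rotation_hint attribute and the MetadataEstimator name are assumptions made for illustration; a real Page may expose nothing of the kind.

```python
import asyncio
from types import SimpleNamespace


def snap_to_quadrant(angle_deg: float) -> int:
    """Round an arbitrary angle to the nearest of 0, 90, 180, 270."""
    return int(round((angle_deg % 360) / 90.0)) % 4 * 90


class MetadataEstimator:
    """Hypothetical backend: reads a rotation hint instead of running OSD."""

    async def estimate(self, page) -> int:
        hint = getattr(page, "rotation_hint", 0.0)   # assumed attribute
        return snap_to_quadrant(hint)


# Stand-in page object carrying only the assumed hint.
page = SimpleNamespace(rotation_hint=-92.0)
angle = asyncio.run(MetadataEstimator().estimate(page))
```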

`deskew`

deskew() handles small skew angles introduced by sheet-feed scanners. It is pure OpenCV / NumPy, with no model dependency. Running it on an already-straight scan costs less than 30 ms on a 2400 px page, and threshold_deg skips the rotation entirely when the measured skew is negligible:

deskew(
    *,
    max_correction_deg: float = 5.0,   # never rotate by more than this
    threshold_deg: float = 0.3,         # below this, do nothing
)

Run orientation() first when both are present — a 90°-rotated scan has no meaningful “skew” to measure.
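For readers unfamiliar with projection-profile skew estimation, here is a small pure-Python sketch of the general idea. It is not scriva's OpenCV / NumPy code: it works on a list of (x, y) ink pixels, and it interprets max_correction_deg as the search range and threshold_deg as a no-op cut-off, both of which are assumptions about how those parameters behave.

```python
import math
from collections import Counter


def estimate_skew_deg(ink, max_correction_deg=5.0, threshold_deg=0.3,
                      step_deg=0.1):
    """Estimate skew from (x, y) ink pixels with a projection profile.

    Each candidate angle shears the pixels back and histograms them by row.
    Straight text lines give the sharpest profile, measured here as the
    sum of squared row counts.
    """
    steps = int(round(max_correction_deg / step_deg))
    best_angle, best_score = 0.0, -1
    for k in range(-steps, steps + 1):
        angle = k * step_deg
        t = math.tan(math.radians(angle))
        rows = Counter(round(y - x * t) for x, y in ink)
        score = sum(c * c for c in rows.values())
        if score > best_score:
            best_angle, best_score = angle, score
    # Below the threshold, report "no skew" so the caller can skip rotation.
    return 0.0 if abs(best_angle) < threshold_deg else best_angle
```

A positive result means text rows drift toward larger y as x grows; the correction is a rotation by the negative of that angle.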

`denoise`

Bilateral filter. Conservative defaults; raise strength only for phone-camera scans of textured paper.

denoise(*, strength: int = 1)

`normalize`

Stretches contrast to [0, 255] and can resample to a target DPI. Most useful in front of detectors trained at a known resolution.

normalize(*, target_dpi: int | None = None, contrast: bool = True)
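As a rough illustration of what a contrast stretch and a DPI resample involve, the sketch below applies a plain min-max stretch to a flat list of grey values and computes the scale factor for a resample. Both helper names are hypothetical, and scriva may use a different stretch (for example, percentile clipping).

```python
def stretch_contrast(pixels):
    """Linearly map the darkest value to 0 and the brightest to 255."""
    lo, hi = min(pixels), max(pixels)
    if hi == lo:
        # Flat image: nothing to stretch, avoid dividing by zero.
        return list(pixels)
    return [round((p - lo) * 255 / (hi - lo)) for p in pixels]


def resample_scale(source_dpi, target_dpi):
    """Scale factor to apply to width and height; 1.0 means no resample."""
    return 1.0 if target_dpi is None else target_dpi / source_dpi
```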

`crop`

Applies a fixed-rectangle crop in page-pixel coordinates before any other stage runs. Useful when an upstream UI already let the user select the region of interest.

crop(*, bbox: tuple[int, int, int, int])   # (x, y, w, h)
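Note that the bbox is (x, y, w, h), whereas Pillow's Image.crop expects (left, upper, right, lower). If you compute crop rectangles yourself before handing them to image code, a small conversion helper avoids mixing the two conventions. The helper below is a hypothetical sketch; clamping to the page bounds is an assumption about the desired behaviour.

```python
def xywh_to_box(bbox, page_size):
    """Convert (x, y, w, h) to (left, upper, right, lower), clamped to the page."""
    x, y, w, h = bbox
    page_w, page_h = page_size
    left, upper = max(0, x), max(0, y)
    right, lower = min(page_w, x + w), min(page_h, y + h)
    if right <= left or lower <= upper:
        raise ValueError("crop rectangle lies outside the page")
    return (left, upper, right, lower)
```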

`dewarp`, `whiteboard_clean`, `binarize`, `invert`

dewarp()                                    # photo of a curled page → flat
whiteboard_clean(saturation_drop=0.8,       # office-lens-style
                 contrast_boost=1.4)
binarize(method="otsu")                     # or "adaptive", "sauvola"
invert()                                    # white-on-dark → dark-on-white

binarize is the cheapest accuracy lever for low-DPI grayscale scans — many VLMs and all classical engines score noticeably higher on two-tone input.
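For reference, Otsu's method picks the threshold that maximises the between-class variance of the grey-level histogram. The sketch below is a textbook pure-Python version over a flat list of 8-bit values, shown only to make the idea concrete; it is not scriva's implementation, and the function names are illustrative.

```python
def otsu_threshold(pixels):
    """Return the grey level t that best separates pixels <= t from pixels > t."""
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1
    total = len(pixels)
    sum_all = sum(i * c for i, c in enumerate(hist))

    best_t, best_var = 0, -1.0
    w0 = 0       # number of pixels at or below t
    sum0 = 0     # sum of grey levels at or below t
    for t in range(256):
        w0 += hist[t]
        sum0 += t * hist[t]
        if w0 == 0:
            continue
        w1 = total - w0
        if w1 == 0:
            break
        mean0 = sum0 / w0
        mean1 = (sum_all - sum0) / w1
        between = w0 * w1 * (mean0 - mean1) ** 2
        if between > best_var:
            best_t, best_var = t, between
    return best_t


def binarize_otsu(pixels):
    """Map every pixel to 0 or 255 around the Otsu threshold."""
    t = otsu_threshold(pixels)
    return [255 if p > t else 0 for p in pixels]
```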

Composing page preprocessors

import scriva
from scriva.preprocess import orientation, deskew, denoise
from scriva.detect    import morphological_grid

pipeline = scriva.Pipeline(
    orientation(),   # rotate first — everything else assumes upright
    deskew(),
    denoise(),
    morphological_grid(),
    ...,
)

Out-of-order arguments are still slotted into the preprocessor phase, but the order within the phase is the order you wrote them. Put orientation before deskew, and put any geometry-changing preprocessor (crop, orientation) before any colour-space one (normalize, denoise).

Writing your own page preprocessor

Subclass for stateful preprocessors:

from scriva import Preprocessor, Page

class Grayscale(Preprocessor):
    async def process(self, page: Page) -> Page:
        return page.with_image(page.image.convert("L"))

…or decorate a function for stateless ones:

import scriva
from scriva import preprocessor

@preprocessor
async def grayscale(page):
    return page.with_image(page.image.convert("L"))

pipeline = scriva.Pipeline(grayscale, ...)

Page is immutable; use page.with_image(...) or page.replace(**kwargs) to derive a new one — same pattern as Recognition.
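The derive-don't-mutate pattern can be sketched with a frozen dataclass. The Page class below is a stand-in written for illustration, with assumed fields (image, dpi); it only mirrors the with_image / replace method names mentioned above and says nothing about how scriva's Page is built.

```python
import dataclasses
from dataclasses import dataclass


@dataclass(frozen=True)
class Page:
    image: tuple          # stand-in for real pixel data
    dpi: int = 300        # assumed field, for illustration only

    def with_image(self, image):
        """Return a copy that differs only in its image."""
        return dataclasses.replace(self, image=image)

    def replace(self, **kwargs):
        """Return a copy with any subset of fields changed."""
        return dataclasses.replace(self, **kwargs)


original = Page(image=(1, 2, 3))
derived = original.with_image((9, 9, 9)).replace(dpi=150)
```

The original object is untouched, which is what lets a pipeline keep a reference to the first page while later stages work on derived copies.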

Region preprocessors

A RegionPreprocessor operates on the layout after detection — so it can see what the detector found and apply different treatment per region. Two things it can do:

Transform a region’s crop in place (binarise this cell only, sharpen all data cells, add 12 px of padding to header rows).
Split a region into multiple sub-regions (slice a tall paragraph into N rows, split a misdetected merged cell into halves).

Both shapes use the same protocol; what differs is what the stage returns.

Protocol

class RegionPreprocessor(Engine, Protocol):
    name: str
    async def process(self, page: Page, layout: Layout) -> Layout: ...

The returned Layout may have:

New regions appended (slicers — one in, N out).
Existing regions with crop_override set (transforms — the recognizer reads region.crop_override in preference to cropping page at region.bbox).
Existing regions with role changed (e.g. demoted to "blank" after a binarisation reveals the cell is empty).

Region.crop_override is a PageCrop — the same shape page.crop(...) returns. Transforms compose naturally: a region that has been sharpened, then padded, then binarised carries the final crop with all three steps baked in.
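One way to picture how transforms compose: each transform starts from the current crop (the override if one exists, otherwise a fresh crop of the page) and stores its result as the new override. The sketch below uses a flat tuple of grey values as a stand-in for PageCrop and a one-dimensional bbox; Region, current_crop, and apply_transform are hypothetical names, not scriva API.

```python
from dataclasses import dataclass, replace
from typing import Optional, Tuple

Crop = Tuple[int, ...]   # stand-in for PageCrop: a flat row of grey values


@dataclass(frozen=True)
class Region:
    bbox: Tuple[int, int]                 # (start, stop) into the page row
    crop_override: Optional[Crop] = None


def current_crop(page: Crop, region: Region) -> Crop:
    """Prefer the override; fall back to cropping the page at bbox."""
    if region.crop_override is not None:
        return region.crop_override
    start, stop = region.bbox
    return page[start:stop]


def apply_transform(page: Crop, region: Region, fn) -> Region:
    """Run fn on the current crop and carry the result forward."""
    return replace(region, crop_override=fn(current_crop(page, region)))


def pad(crop: Crop) -> Crop:
    return (255,) + crop + (255,)         # one white pixel on each side


def invert(crop: Crop) -> Crop:
    return tuple(255 - p for p in crop)
```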

Built-in region preprocessors

All factories live under scriva.preprocess alongside the page ones — the pipeline assigns them to the right phase by sniffing the Protocol.

Crop transforms (one region → one region with `crop_override`)

Factory	What it does
`preprocess.pad(px=8)`	Add uniform padding around each region’s crop
`preprocess.pad_each(top=, right=, bottom=, left=)`	Asymmetric padding
`preprocess.binarize(method="otsu")`	Per-cell binarisation (separately tuned per cell)
`preprocess.sharpen(strength=1.0)`	Unsharp mask
`preprocess.contrast(factor=1.4)`	Local contrast boost
`preprocess.upscale(factor=2.0)`	Lanczos upscale — useful for tiny cells in dense forms
`preprocess.grayscale()`	Drop chroma; cheaper for some VLM providers
`preprocess.invert()`	Per-cell inversion (white-on-dark cells in a mixed page)
`preprocess.dewarp()`	Region-local perspective correction
`preprocess.glare_remove(strength=1.0)`	Remove specular highlights in photo crops
`preprocess.whiteboard_clean(...)`	Region-local marker / lighting normalisation
`preprocess.deskew_region(...)`	Per-region micro-deskew (for cells rotated independently)
`preprocess.mask(predicate)`	Black out pixels for which `predicate(x, y) is True` — for redaction-style “show only the answer” prompts

pad is the workhorse — many VLMs lose accuracy when a crop is tight to the glyph baseline. The default 4 px on the recognizer is a fallback; when you want different padding per role (more for headers, less for dense data cells), use pad(...) here instead.
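To make the geometry concrete, the sketch below expands an (x, y, w, h) box by a number of pixels and clamps it to the page, with a per-role lookup for the padding amount. The helper names and the role-to-padding table are illustrative, and clamping at the page edge is an assumption (a real implementation might fill with background instead).

```python
def padded_box(bbox, px, page_size):
    """Grow (x, y, w, h) by px on every side without leaving the page."""
    x, y, w, h = bbox
    page_w, page_h = page_size
    left, top = max(0, x - px), max(0, y - px)
    right = min(page_w, x + w + px)
    bottom = min(page_h, y + h + px)
    return (left, top, right - left, bottom - top)


# Illustrative values only: more room for headers, less for dense data cells.
ROLE_PADDING = {"header": 12, "data": 4}


def pad_for_role(role, default=8):
    return ROLE_PADDING.get(role, default)
```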

Slicers (one region → N regions)

Factory	What it does
`preprocess.slice_horizontal(rows=N, overlap_px=0)`	Split into N equal-height bands
`preprocess.slice_vertical(cols=N, overlap_px=0)`	Split into N equal-width bands
`preprocess.slice_grid(rows=R, cols=C)`	R × C grid of sub-regions
`preprocess.slice_on_separator(direction=..., min_gap_px=...)`	Split where a horizontal or vertical whitespace gap ≥ `min_gap_px` is found
`preprocess.slice_overflow(max_height_px=..., axis="vertical")`	Slice only regions exceeding `max_height_px`; leave the rest alone
`preprocess.slice_by(callable)`	Caller-supplied splitter: `Region → list[Region]`

Slicers keep the original region as the parent and give each child a fresh ID, with parent_region_id pointing back to the parent's region.id. Calling result.merge_slices() merges the children's recognition results back into the parent; by default, the exporter does this automatically.
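The parent/child bookkeeping can be sketched as follows. This is a simplified model, not scriva's slicer: Region here is a stand-in dataclass, IDs come from a simple counter, and merge_slices is a naive top-to-bottom join of child texts.

```python
from dataclasses import dataclass
from itertools import count
from typing import Optional, Tuple

_ids = count(1)   # illustrative ID source


@dataclass(frozen=True)
class Region:
    id: int
    bbox: Tuple[int, int, int, int]            # (x, y, w, h)
    parent_region_id: Optional[int] = None


def slice_horizontal(region, rows, overlap_px=0):
    """Split a region into `rows` bands, each widened by overlap_px."""
    x, y, w, h = region.bbox
    edges = [y + (h * i) // rows for i in range(rows + 1)]
    children = []
    for top, bottom in zip(edges, edges[1:]):
        top = max(y, top - overlap_px)           # stay inside the parent
        bottom = min(y + h, bottom + overlap_px)
        children.append(
            Region(next(_ids), (x, top, w, bottom - top), region.id)
        )
    return children


def merge_slices(texts_by_child, children):
    """Join child texts top to bottom (naive stand-in for the real merge)."""
    ordered = sorted(children, key=lambda c: c.bbox[1])
    return "\n".join(texts_by_child[c.id] for c in ordered)
```

With a non-zero overlap, neighbouring bands share pixels, so a real merge would also need to de-duplicate text recognised twice; this sketch does not attempt that.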

Filtering: `apply_to=`

Every region preprocessor accepts an apply_to= callable that filters which regions it touches. Regions for which it returns False are passed through untouched:

from scriva.preprocess import binarize, pad, slice_horizontal, slice_on_separator

binarize(method="otsu", apply_to=lambda r: r.role == "data")
pad(px=12,              apply_to=lambda r: r.role == "header")
slice_horizontal(rows=2, apply_to=lambda r: r.bbox.h > 1200)
slice_on_separator(direction="horizontal", min_gap_px=20,
                   apply_to=lambda r: r.role == "data")

apply_to is the single composition primitive — chains of “different treatment per role” stay readable.
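The pass-through semantics are simple enough to sketch in a few lines. The wrapper below is an illustrative model with stand-in types, not the library's code: regions that fail the predicate are returned as the same object, untouched.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Region:
    role: str
    padding: int = 0


def filtered(transform, apply_to=lambda r: True):
    """Wrap a per-region transform so it only touches matching regions."""
    def run(regions):
        return [transform(r) if apply_to(r) else r for r in regions]
    return run


def add_padding(px):
    """Toy transform: record extra padding on the region."""
    return lambda r: replace(r, padding=r.padding + px)


regions = [Region("header"), Region("data")]
pad_headers = filtered(add_padding(12), apply_to=lambda r: r.role == "header")
result = pad_headers(regions)
```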

Composition

import scriva
from scriva.preprocess import (
    orientation, deskew,                              # page-level
    binarize, pad, sharpen, slice_overflow,           # region-level
)
from scriva.detect    import morphological_grid
from scriva.classify  import rule_based
from scriva.recognize import openai

pipeline = scriva.Pipeline(
    orientation(), deskew(),                          # page phase
    morphological_grid(),                             # detect
    rule_based(),                                     # classify
    pad(px=8, apply_to=lambda r: r.role == "data"),   # region phase
    binarize(method="otsu", apply_to=lambda r: r.role == "data"),
    sharpen(strength=0.6),
    slice_overflow(max_height_px=1500),               # split very tall cells
    openai(model="gpt-4o"),
)

The pipeline figures out that pad/binarize/sharpen/slice_overflow are region preprocessors (Protocol = RegionPreprocessor) and slots them between classify and recognize. Order within the phase is the order you wrote.

For shared chains, bundle into one stage with preprocess.pipe:

from scriva.preprocess import pipe, binarize, sharpen, pad

clean_data_cells = pipe(
    pad(px=8),
    binarize(method="otsu"),
    sharpen(strength=0.6),
    apply_to=lambda r: r.role == "data",   # applied to all children
)
pipeline = scriva.Pipeline(..., clean_data_cells, ...)
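Conceptually, a bundle like this is left-to-right function composition guarded by one shared predicate. The sketch below models that with plain functions over dict-shaped regions; it assumes the predicate is checked once against the incoming region, which may differ from how the library evaluates apply_to for each child stage. The names compose_steps and tag are hypothetical.

```python
from functools import reduce


def compose_steps(*steps, apply_to=lambda r: True):
    """Bundle per-region steps into one; skip regions failing apply_to."""
    def run(region):
        if not apply_to(region):
            return region
        # Feed the output of each step into the next, in written order.
        return reduce(lambda acc, step: step(acc), steps, region)
    return run


def tag(name):
    """Toy step: append its name to the region's list of applied steps."""
    return lambda r: {**r, "steps": r["steps"] + [name]}


clean = compose_steps(
    tag("pad"), tag("binarize"), tag("sharpen"),
    apply_to=lambda r: r["role"] == "data",
)
```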

Writing your own region preprocessor

Subclass:

from scriva import RegionPreprocessor, Layout, Page

class StripHeaderBars(RegionPreprocessor):
    async def process(self, page: Page, layout: Layout) -> Layout:
        for region in layout.regions:
            if region.role == "header":
                # Take the unpadded crop for this header region.
                crop = page.crop(region.bbox, padding=0)
                # _strip_top_bar is a user-supplied helper (image in, image out).
                region.crop_override = crop.with_image(_strip_top_bar(crop.image))
        return layout

…or decorate a function:

from scriva import region_preprocessor

@region_preprocessor(apply_to=lambda r: r.role == "header")
async def strip_header_bars(page, region):
    crop = page.crop(region.bbox, padding=0)
    return region.with_crop_override(crop.with_image(_strip_top_bar(crop.image)))

The decorated form gets called per-region — the framework iterates and respects apply_to=. The subclass form gets the whole layout in one call, which is what you want when the transform needs cross-region context (e.g. global statistics, neighbour-aware slicing).
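The relationship between the two forms can be sketched as a decorator that lifts a per-region coroutine into a whole-layout one. This is a simplified model with stand-in Region and Layout types and a hypothetical per_region decorator; it is not the library's region_preprocessor.

```python
import asyncio
from dataclasses import dataclass, replace
from typing import Tuple


@dataclass(frozen=True)
class Region:
    role: str
    text: str


@dataclass(frozen=True)
class Layout:
    regions: Tuple[Region, ...]


def per_region(apply_to=lambda r: True):
    """Lift fn(page, region) -> region into process(page, layout) -> layout."""
    def decorate(fn):
        async def process(page, layout):
            out = []
            for region in layout.regions:
                # The framework iterates; fn only ever sees matching regions.
                out.append(await fn(page, region) if apply_to(region) else region)
            return replace(layout, regions=tuple(out))
        return process
    return decorate


@per_region(apply_to=lambda r: r.role == "header")
async def upper_headers(page, region):
    return replace(region, text=region.text.upper())


layout = Layout((Region("header", "total"), Region("data", "42")))
new_layout = asyncio.run(upper_headers(None, layout))
```

A transform that needs neighbouring regions or page-wide statistics cannot be written this way, since fn never sees the rest of the layout; that is the case for the subclass form.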

When per-cell preprocessing pays off

Three recurring patterns:

Mixed-quality cells in one form. Header rows are typeset; data cells are stamped or handwritten. Binarise only the data cells.
Tall paragraphs that overflow a VLM’s effective resolution. slice_overflow(max_height_px=1500) splits them into bands; the recognizer sees clean, in-scale text on each.
Tight bboxes from an aggressive detector. pad(px=12) around each crop recovers character tops and descenders without re-detecting.

When not to use a region preprocessor

If the transform needs to see the whole page in raw form, use a page preprocessor: region preprocessors run after detection and are designed to work crop by crop.
If the transform is about output text (whitespace, dictionary correction, splitting on detected ruled lines), use a PostProcessor. Post-processors run after the recognizer; region preprocessors run before it.
If the split depends on what the recognizer said (e.g. “split this cell because the recognised text contains │”), use postprocess.rule_splitter — it ships exactly that.
