Preprocessors
A preprocessor manipulates pixels before they reach the recognizer.
scriva ships two flavours, distinguished by what they see:
- Page preprocessors (Preprocessor): transform a whole Page before detection. Deskew, denoise, rotate, normalise contrast.
- Region preprocessors (RegionPreprocessor): transform crops or split regions after detection, before recognition. Per-cell binarise, slice a tall region into rows, pad, dewarp, glare-remove.
Both live under scriva.preprocess and scriva.preprocessors. The
pipeline slots each one into the correct phase by sniffing its
Protocol, so you write the imports flat:

    from scriva.preprocess import (
        orientation, deskew,            # page-level
        binarize, slice_horizontal,     # region-level
    )

A typical full chain:

    preprocess (Page→Page) ──► detect ──► classify ──► preprocess (Layout→Layout) ──► recognize

The two phases sit either side of detect/classify because they
work on different inputs (raw pixels vs identified crops).
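The phase slotting described above can be illustrated in plain Python. This is a minimal sketch, not scriva's implementation: the phase attribute, the PHASES list, and the check on the signature of process are assumptions made for the example, and the toy stages are synchronous for brevity.

```python
import inspect

# Hypothetical phase names, in execution order (assumed for this sketch).
PHASES = ["page_preprocess", "detect", "classify", "region_preprocess", "recognize"]

def phase_of(stage):
    """Use an explicit phase tag if present, else sniff the shape of process()."""
    explicit = getattr(stage, "phase", None)
    if explicit is not None:
        return explicit
    params = inspect.signature(stage.process).parameters
    # A stage that also takes a layout is treated as a region preprocessor.
    return "region_preprocess" if "layout" in params else "page_preprocess"

def slot(stages):
    """Stable sort: phases are ordered, order within a phase is as written."""
    return sorted(stages, key=lambda s: PHASES.index(phase_of(s)))

# Toy stand-ins for real stages.
class Orientation:
    name = "orientation"
    def process(self, page): return page

class Deskew:
    name = "deskew"
    def process(self, page): return page

class Pad:
    name = "pad"
    def process(self, page, layout): return layout

class Detector:
    name, phase = "detect", "detect"

class Recognizer:
    name, phase = "recognize", "recognize"

written = [Pad(), Orientation(), Detector(), Deskew(), Recognizer()]
ordered = [s.name for s in slot(written)]
print(ordered)
```

Because Python's sort is stable, two stages in the same phase keep the order in which they were written, which matches the ordering rule this page describes.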
Page preprocessors
Protocol

    class Preprocessor(Engine, Protocol):
        name: str

        async def process(self, page: Page) -> Page: ...

The result is a new Page: preprocessors are pure, never in-place. The
original page is preserved on ctx.original_page so downstream stages
can refer back if they need to.
Built-in page preprocessors
Factories under scriva.preprocess (lowercase) and classes under
scriva.preprocessors (subclassable):
| Factory | Class | What it does |
|---|---|---|
| preprocess.orientation(...) | OrientationCorrector | Coarse rotation (0/90/180/270°) via Tesseract OSD |
| preprocess.deskew(...) | Deskewer | Sub-degree skew correction via projection-profile alignment |
| preprocess.denoise(...) | Denoiser | Bilateral-filter noise removal |
| preprocess.normalize(...) | Normalizer | Contrast stretch and optional DPI resample |
| preprocess.crop(...) | Cropper | Fixed-rectangle crop applied before everything else |
| preprocess.dewarp(...) | Dewarper | Document-perspective correction (photo of a curled page) |
| preprocess.whiteboard_clean(...) | WhiteboardCleaner | Strip marker hue, normalise lighting (Office-Lens-style) |
| preprocess.binarize(...) | Binarizer | Otsu / adaptive binarisation for low-quality scans |
| preprocess.invert() | Inverter | Negative image, for white-on-dark documents |
orientation
Phone-camera and document-scanner outputs often arrive 90°, 180°, or 270°
off. orientation() uses Tesseract OSD (Orientation and Script
Detection) to estimate the angle and then applies an exact-pixel
PIL.Image.transpose: no interpolation blur, no JPEG recompression.

    from scriva.preprocess import orientation

    orientation(
        *,
        backend: str = "tesseract",   # or a custom OrientationEstimator
        on_failure: str = "skip",     # "skip" | "raise"
        save_format: str = "png",     # "png" preserves quality; "keep" reuses input format
    )

If Tesseract is not on $PATH, the preprocessor logs once and leaves the
page unchanged. Set on_failure="raise" to make missing OSD a build-time
EngineError instead.
Custom backends implement the OrientationEstimator protocol:

    class OrientationEstimator(Protocol):
        async def estimate(self, page: Page) -> int: ...   # returns 0, 90, 180, or 270

deskew
deskew() handles small skew angles introduced by sheet-feed scanners.
Pure OpenCV / NumPy, no model dependency. Running it on an
already-straight scan costs less than 30 ms on a 2400 px page, and a
threshold keeps that no-op cheap by skipping the rotation entirely:

    deskew(
        *,
        max_correction_deg: float = 5.0,   # never rotate by more than this
        threshold_deg: float = 0.3,        # below this, do nothing
    )

Run orientation() first when both are present: a 90°-rotated scan has
no meaningful “skew” to measure.
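How the two deskew parameters could interact is sketched below. This is an illustration under stated assumptions, not the library's code: the helper name planned_rotation is hypothetical, and clamping an over-large angle to max_correction_deg (rather than skipping the page) is one possible reading of "never rotate by more than this".

```python
def planned_rotation(measured_deg, *, threshold_deg=0.3, max_correction_deg=5.0):
    """Return the rotation (in degrees) a deskew step would actually apply.

    Assumptions for this sketch:
    - angles smaller than threshold_deg are treated as "already straight";
    - larger angles are clamped to +/- max_correction_deg.
    """
    if abs(measured_deg) < threshold_deg:
        return 0.0  # no-op: skip the rotation entirely
    return max(-max_correction_deg, min(max_correction_deg, measured_deg))

print(planned_rotation(0.1))    # below the threshold
print(planned_rotation(1.5))    # within range, applied as measured
print(planned_rotation(-12.0))  # beyond the cap, clamped
```

A measurement far beyond the cap usually means the page needs orientation() rather than deskew(), which is another reason to run orientation first.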
denoise
Bilateral filter. Conservative defaults; raise strength only for
phone-camera scans of textured paper.

    denoise(*, strength: int = 1)

normalize
Stretches contrast to [0, 255] and can resample to a target DPI. Most
useful in front of detectors trained at a known resolution.

    normalize(*, target_dpi: int | None = None, contrast: bool = True)

crop
Applies a fixed-rectangle crop in page-pixel coordinates before any other stage runs. Useful when an upstream UI already let the user select the region of interest.

    crop(*, bbox: tuple[int, int, int, int])   # (x, y, w, h)

dewarp, whiteboard_clean, binarize, invert
    dewarp()                                   # photo of a curled page → flat
    whiteboard_clean(saturation_drop=0.8,      # office-lens-style
                     contrast_boost=1.4)
    binarize(method="otsu")                    # or "adaptive", "sauvola"
    invert()                                   # white-on-dark → dark-on-white

binarize is the cheapest accuracy lever for low-DPI grayscale scans:
many VLMs and all classical engines score noticeably higher on
two-tone input.
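For readers unfamiliar with the "otsu" method, the sketch below shows Otsu's threshold selection on a flat list of 8-bit grayscale values in pure Python. It illustrates the textbook algorithm only; it is not scriva's Binarizer, and the function names are chosen for this example.

```python
def otsu_threshold(pixels):
    """Pick the threshold that maximises between-class variance (Otsu's method)."""
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1
    total = len(pixels)
    sum_all = sum(level * n for level, n in enumerate(hist))

    best_t, best_var = 0, -1.0
    w0 = 0      # number of pixels at or below the candidate threshold
    sum0 = 0    # sum of their intensities
    for t in range(256):
        w0 += hist[t]
        sum0 += t * hist[t]
        w1 = total - w0
        if w0 == 0 or w1 == 0:
            continue  # one class is empty, variance is undefined
        m0 = sum0 / w0
        m1 = (sum_all - sum0) / w1
        between = w0 * w1 * (m0 - m1) ** 2
        if between > best_var:
            best_t, best_var = t, between
    return best_t

def binarize(pixels, threshold=None):
    """Map every pixel to 0 or 255 around the (Otsu-chosen) threshold."""
    t = otsu_threshold(pixels) if threshold is None else threshold
    return [255 if p > t else 0 for p in pixels]

cell = [10, 12, 11, 200, 205, 198]   # dark ink values and bright paper values
t = otsu_threshold(cell)
two_tone = binarize(cell)
print(t, two_tone)
```

The threshold lands between the dark cluster and the bright cluster, so every pixel ends up as pure black or pure white.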
Composing page preprocessors
    import scriva
    from scriva.preprocess import orientation, deskew, denoise
    from scriva.detect import morphological_grid

    pipeline = scriva.Pipeline(
        orientation(),          # rotate first; everything else assumes upright
        deskew(),
        denoise(),
        morphological_grid(),
        ...,
    )

Out-of-order arguments are still slotted into the preprocessor phase,
but the order within the phase is the order you wrote them. Put
orientation before deskew, and put any geometry-changing preprocessor
(crop, orientation) before any colour-space one (normalize,
denoise).
Writing your own page preprocessor
Subclass for stateful preprocessors:

    from scriva import Preprocessor, Page

    class Grayscale(Preprocessor):
        async def process(self, page: Page) -> Page:
            return page.with_image(page.image.convert("L"))

…or decorate a function for stateless ones:

    import scriva
    from scriva import preprocessor

    @preprocessor
    async def grayscale(page):
        return page.with_image(page.image.convert("L"))

    pipeline = scriva.Pipeline(grayscale, ...)

Page is immutable; use page.with_image(...) or
page.replace(**kwargs) to derive a new one — same pattern as
Recognition.
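The derive-a-new-object pattern can be illustrated with a frozen dataclass. The Page class below is a hypothetical stand-in with made-up fields (image, dpi); only the method names with_image and replace come from this page, and their real signatures may differ.

```python
import dataclasses

@dataclasses.dataclass(frozen=True)
class Page:
    """Hypothetical stand-in for an immutable page (fields are assumptions)."""
    image: tuple
    dpi: int = 300

    def with_image(self, image):
        # Returns a copy with a new image; self is left untouched.
        return dataclasses.replace(self, image=image)

    def replace(self, **kwargs):
        # General form: override any subset of fields on a copy.
        return dataclasses.replace(self, **kwargs)

original = Page(image=(1, 2, 3))
gray = original.with_image((9, 9, 9))
resampled = gray.replace(dpi=150)

try:
    original.image = (0, 0, 0)          # in-place mutation is rejected
    mutated = True
except dataclasses.FrozenInstanceError:
    mutated = False

print(original, gray, resampled, mutated)
```

Because every step returns a new object, the original page stays available for later stages that want to refer back to it.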
Region preprocessors
A RegionPreprocessor operates on the layout after detection, so
it can see what the detector found and apply different treatment per
region. Two things it can do:
- Transform a region’s crop in place (binarise this cell only, sharpen all data cells, add 12 px of padding to header rows).
- Split a region into multiple sub-regions (slice a tall paragraph into N rows, split a misdetected merged cell into halves).
Both shapes use the same protocol; what differs is what the stage returns.
Protocol
    class RegionPreprocessor(Engine, Protocol):
        name: str

        async def process(self, page: Page, layout: Layout) -> Layout: ...

The returned Layout may have:
- New regions appended (slicers: one in, N out).
- Existing regions with crop_override set (transforms: the recognizer reads region.crop_override in preference to cropping page at region.bbox).
- Existing regions with role changed (e.g. demoted to "blank" after a binarisation reveals the cell is empty).
Region.crop_override is a PageCrop — the same shape page.crop(...)
returns. Transforms compose naturally: a region that has been
sharpened, then padded, then binarised carries the final crop with all
three steps baked in.
Built-in region preprocessors
All factories live under scriva.preprocess alongside the page ones;
the pipeline assigns them to the right phase by sniffing the Protocol.
Crop transforms (one region → one region with crop_override)
| Factory | What it does |
|---|---|
| preprocess.pad(px=8) | Add uniform padding around each region’s crop |
| preprocess.pad_each(top=, right=, bottom=, left=) | Asymmetric padding |
| preprocess.binarize(method="otsu") | Per-cell binarisation (separately tuned per cell) |
| preprocess.sharpen(strength=1.0) | Unsharp mask |
| preprocess.contrast(factor=1.4) | Local contrast boost |
| preprocess.upscale(factor=2.0) | Lanczos upscale, useful for tiny cells in dense forms |
| preprocess.grayscale() | Drop chroma; cheaper for some VLM providers |
| preprocess.invert() | Per-cell inversion (white-on-dark cells in a mixed page) |
| preprocess.dewarp() | Region-local perspective correction |
| preprocess.glare_remove(strength=1.0) | Remove specular highlights in photo crops |
| preprocess.whiteboard_clean(...) | Region-local marker / lighting normalisation |
| preprocess.deskew_region(...) | Per-region micro-deskew (for cells rotated independently) |
| preprocess.mask(predicate) | Black out pixels for which predicate(x, y) is True, for redaction-style “show only the answer” prompts |
pad is the workhorse — many VLMs lose accuracy when a crop is tight to
the glyph baseline. The default 4 px on the recognizer is a fallback;
when you want different padding per role (more for headers, less for
dense data cells), use pad(...) here instead.
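The geometry behind padding a crop is small enough to sketch directly. The helper below is hypothetical and is not scriva's pad(): it grows an (x, y, w, h) box by px on every side and, as an assumption of this sketch, clamps the result to the page bounds. The per-role padding table is likewise illustrative.

```python
def pad_bbox(bbox, px, page_w, page_h):
    """Grow an (x, y, w, h) box by px on every side, clamped to the page."""
    x, y, w, h = bbox
    x0 = max(0, x - px)
    y0 = max(0, y - px)
    x1 = min(page_w, x + w + px)
    y1 = min(page_h, y + h + px)
    return (x0, y0, x1 - x0, y1 - y0)

# Illustrative per-role padding: more room for headers, less for dense data cells.
PADDING_BY_ROLE = {"header": 12, "data": 4}

inner = pad_bbox((10, 10, 100, 20), 8, page_w=200, page_h=200)
at_edge = pad_bbox((2, 3, 100, 20), 8, page_w=105, page_h=200)
header = pad_bbox((50, 50, 40, 10), PADDING_BY_ROLE["header"], 200, 200)
print(inner, at_edge, header)
```

Clamping matters for regions near the page border: without it, the padded box would reference pixels that do not exist.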
Slicers (one region → N regions)
| Factory | What it does |
|---|---|
| preprocess.slice_horizontal(rows=N, overlap_px=0) | Split into N equal-height bands |
| preprocess.slice_vertical(cols=N, overlap_px=0) | Split into N equal-width bands |
| preprocess.slice_grid(rows=R, cols=C) | R × C grid of sub-regions |
| preprocess.slice_on_separator(direction=..., min_gap_px=...) | Split where a horizontal or vertical whitespace gap ≥ min_gap_px is found |
| preprocess.slice_overflow(max_height_px=..., axis="vertical") | Slice only regions exceeding max_height_px; leave the rest alone |
| preprocess.slice_by(callable) | Caller-supplied splitter: Region → list[Region] |
Slicers keep the original region, with its region.id, as the parent and
assign each child a fresh ID with parent_region_id pointing back.
Recognition results are merged back into the parent by
result.merge_slices(); by default, the exporter does this automatically.
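The parent/child bookkeeping can be sketched with a toy horizontal slicer. Everything here is an assumption for illustration: the Region stand-in and its fields, the ID scheme, and the reading of overlap_px as "extend each band by this many pixels on both sides, clamped to the parent".

```python
from dataclasses import dataclass
from itertools import count
from typing import Optional, Tuple

@dataclass(frozen=True)
class Region:
    """Hypothetical stand-in; bbox is (x, y, w, h)."""
    id: int
    bbox: Tuple[int, int, int, int]
    parent_region_id: Optional[int] = None

def slice_rows(region, rows, overlap_px=0, ids=None):
    """Split a region into `rows` near-equal-height bands that point back to it."""
    ids = ids if ids is not None else count(region.id + 1)
    x, y, w, h = region.bbox
    children = []
    for i in range(rows):
        top = y + (i * h) // rows
        bottom = y + ((i + 1) * h) // rows
        # Extend into the neighbouring bands, but never outside the parent.
        top = max(y, top - overlap_px)
        bottom = min(y + h, bottom + overlap_px)
        children.append(Region(next(ids), (x, top, w, bottom - top), region.id))
    return children

parent = Region(id=1, bbox=(0, 0, 100, 300))
plain = slice_rows(parent, rows=3)
overlapped = slice_rows(parent, rows=3, overlap_px=10)
print([c.bbox for c in plain])
print([c.bbox for c in overlapped])
```

An overlap keeps a line of text that straddles a cut fully visible in at least one band, at the cost of possible duplicate text when the slices are merged.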
Filtering: apply_to=
Every region preprocessor accepts an apply_to= callable that filters
which regions it touches. Regions for which it returns False are
passed through untouched:

    from scriva.preprocess import binarize, pad, slice_horizontal, slice_on_separator

    binarize(method="otsu", apply_to=lambda r: r.role == "data")
    pad(px=12, apply_to=lambda r: r.role == "header")
    slice_horizontal(rows=2, apply_to=lambda r: r.bbox.h > 1200)
    slice_on_separator(direction="horizontal", min_gap_px=20,
                       apply_to=lambda r: r.role == "data")

apply_to is the single composition primitive: chains of “different
treatment per role” stay readable.
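The filtering behaviour can be sketched as a small wrapper. This is an illustrative model, not scriva's code: region_stage and the Region fields are hypothetical, and the stages are synchronous for brevity.

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class Region:
    """Hypothetical stand-in with just enough state to show the effect."""
    role: str
    padding: int = 0
    binarized: bool = False

def region_stage(transform, apply_to=None):
    """Wrap a Region -> Region transform so it only touches matching regions."""
    def run(regions):
        return [
            transform(r) if apply_to is None or apply_to(r) else r
            for r in regions
        ]
    return run

pad_headers = region_stage(lambda r: replace(r, padding=12),
                           apply_to=lambda r: r.role == "header")
binarize_data = region_stage(lambda r: replace(r, binarized=True),
                             apply_to=lambda r: r.role == "data")

layout = [Region("header"), Region("data")]
for stage in (pad_headers, binarize_data):
    layout = stage(layout)
print(layout)
```

Each stage carries its own predicate, so the chain reads as a list of "what to do, and to which regions" without any branching in the pipeline itself.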
Composition
    import scriva
    from scriva.preprocess import (
        orientation, deskew,                        # page-level
        binarize, pad, sharpen, slice_overflow,     # region-level
    )
    from scriva.detect import morphological_grid
    from scriva.classify import rule_based
    from scriva.recognize import openai

    pipeline = scriva.Pipeline(
        orientation(), deskew(),                                        # page phase
        morphological_grid(),                                           # detect
        rule_based(),                                                   # classify
        pad(px=8, apply_to=lambda r: r.role == "data"),                 # region phase
        binarize(method="otsu", apply_to=lambda r: r.role == "data"),
        sharpen(strength=0.6),
        slice_overflow(max_height_px=1500),                             # split very tall cells
        openai(model="gpt-4o"),
    )

The pipeline figures out that pad/binarize/sharpen/slice_overflow
are region preprocessors (Protocol = RegionPreprocessor) and slots
them between classify and recognize. Order within the phase is the
order you wrote.
For shared chains, bundle into one stage with preprocess.pipe:
    import scriva
    from scriva.preprocess import pipe, binarize, sharpen, pad

    clean_data_cells = pipe(
        pad(px=8),
        binarize(method="otsu"),
        sharpen(strength=0.6),
        apply_to=lambda r: r.role == "data",   # applied to all children
    )
    pipeline = scriva.Pipeline(..., clean_data_cells, ...)

Writing your own region preprocessor
Subclass:
    from scriva import RegionPreprocessor, Layout, Page

    # _strip_top_bar is a user-supplied image helper (not shown).
    class StripHeaderBars(RegionPreprocessor):
        async def process(self, page: Page, layout: Layout) -> Layout:
            for region in layout.regions:
                if region.role == "header":
                    crop = page.crop(region.bbox, padding=0)
                    region.crop_override = crop.with_image(_strip_top_bar(crop.image))
            return layout

…or decorate a function:

    from scriva import region_preprocessor

    @region_preprocessor(apply_to=lambda r: r.role == "header")
    async def strip_header_bars(page, region):
        crop = page.crop(region.bbox, padding=0)
        return region.with_crop_override(crop.with_image(_strip_top_bar(crop.image)))

The decorated form gets called per-region: the framework iterates and
respects apply_to=. The subclass form gets the whole layout in one
call, which is what you want when the transform needs cross-region
context (e.g. global statistics, neighbour-aware slicing).
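The difference between the two forms can be made concrete with a transform that cannot be written per-region. The sketch below is hypothetical throughout (the Region stand-in, the 25% cut-off, and both function names): it demotes rows that are much shorter than the median row, which needs every region's height before it can decide about any single one.

```python
from dataclasses import dataclass, replace
from statistics import median

@dataclass(frozen=True)
class Region:
    """Hypothetical stand-in: a role plus the crop height in pixels."""
    role: str
    height: int

def mark_tall(region):
    """Per-region form: decides from the region alone."""
    return replace(region, role="tall") if region.height > 1500 else region

def demote_short_rows(regions):
    """Whole-layout form: compares each region against a global statistic."""
    typical = median(r.height for r in regions)
    return [
        replace(r, role="blank") if r.height < 0.25 * typical else r
        for r in regions
    ]

rows = [Region("data", 100), Region("data", 110), Region("data", 10)]
per_region_result = [mark_tall(r) for r in rows]   # what a framework loop would do
whole_layout_result = demote_short_rows(rows)
print(per_region_result)
print(whole_layout_result)
```

The per-region function never sees its neighbours, so it leaves the 10 px row alone; the whole-layout function can tell that the row is an outlier.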
When per-cell preprocessing pays off
Three recurring patterns:
- Mixed-quality cells in one form. Header rows are typeset; data cells are stamped or handwritten. Binarise only the data cells.
- Tall paragraphs that overflow a VLM’s effective resolution. slice_overflow(max_height_px=1500) splits them into bands; the recognizer sees clean, in-scale text on each.
- Tight bboxes from an aggressive detector. pad(px=12) around each crop recovers character tops and descenders without re-detecting.
When not to use a region preprocessor
- If the transform needs to see the whole page in raw form, use a page preprocessor: region preprocessors get a cropped view by design.
- If the transform is about output text (whitespace, dictionary correction, splitting on detected ruled lines), use a PostProcessor. Post-processors run after the recognizer; region preprocessors run before it.
- If the split depends on what the recognizer said (e.g. “split this cell because the recognised text contains │”), use postprocess.rule_splitter, which ships exactly that.
What to read next
- Pipeline › High-accuracy patterns: region preprocessing is one of three accuracy levers; the others are human-in-the-loop and cross-checking.
- Detectors: region preprocessors run after detection; if the detector itself needs work, start there.
- Post-processors: the symmetric phase on the output side.