Splitters and classifiers

Detectors

A splitter turns a page into regions. It is the single step in the split phase of the Orchestrator, and it owns layout detection: its output is a Layout of Regions that the rest of the recipe operates on.

Every recipe has exactly one split step. If you omit it, the Orchestrator inserts .split.whole_page() for you.

.preprocess(...)  ──►  .split.<method>()  ──►  .classify(...)  ──►  .recognize(...)

Splitters are stateless and reentrant; you can share the same recipe across threads.

Built-in splitters

Method	What it does	When to reach for it
`.split.grid(...)`	Detect a tabular grid (auto by default)	Scanned tabular forms with visible rules
`.split.vertical(at=...)`	Split into vertical strips at given x-coords or auto	Column-oriented documents, ledger sheets
`.split.horizontal(at=...)`	Split into horizontal strips at given y-coords or auto	Banded receipts, statement rows
`.split.boxes(detector=...)`	Free-form box detection (ML model or callable)	Documents with arbitrary regions; ML layouts
`.split.from_layout(source, where=...)`	Read regions from a sidecar JSON or a prior `DocumentResult`	Re-run OCR on human-corrected regions; refine passes
`.split.whole_page()`	One region per page	Unstructured docs handed to a VLM as one shot

`.split.grid(...)`

Default strategy: OpenCV morphological line extraction + line grouping. Works on scanned tabular forms with visible rules.

.split.grid(
    rows=None,                       # int | None — fix the row count if known
    cols=None,                       # int | None — fix the column count if known
    strategy="morphological",        # "morphological" | "hough" | callable
    min_line_length_ratio=0.1,
    merge_threshold_px=4,
    padding_px=2,
)
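The "line grouping" half of the default strategy can be pictured with a small sketch. It assumes that `merge_threshold_px` is the distance below which neighbouring detected line positions collapse into a single rule; this section does not spell out that meaning, and `merge_close_positions` is a hypothetical helper, not a scriva function.

```python
def merge_close_positions(positions, merge_threshold_px=4):
    """Collapse line positions that sit within merge_threshold_px of a neighbour.

    Assumption: this mirrors what "line grouping" does; the library's own
    grouping logic may differ.
    """
    merged = []
    group = []
    for pos in sorted(positions):
        # Close the current group when the gap to the previous position is too large.
        if group and pos - group[-1] > merge_threshold_px:
            merged.append(sum(group) // len(group))   # integer mean of the group
            group = []
        group.append(pos)
    if group:
        merged.append(sum(group) // len(group))
    return merged


# A thick scanned rule often shows up as several adjacent detections.
rules = merge_close_positions([10, 12, 13, 100, 101, 250])
```

Here the three detections around y=10 and the two around y=100 each become one rule, which keeps a thick or slightly skewed line from producing spurious one-pixel rows.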

Two line-detection strategies ship:

"morphological" — the default. Cheap and reliable when ruled lines are clean.
"hough" — probabilistic Hough transform. Use for forms with broken or faded rules where morphology fails.

For “try morphology, then Hough” behaviour, pass a callable that falls back internally:

.split.grid(strategy=fallback_strategy(morphological, hough))
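One way such a fallback callable might be built is sketched here. The signature of a strategy callable is not specified in this section, so the sketch assumes a strategy takes the page image and returns a list of detected lines; the `morphological` and `hough` functions below are stand-ins for illustration only.

```python
def fallback_strategy(*strategies):
    """Return a callable that tries each strategy in order.

    Assumption: a strategy is `image -> list of lines`, and an empty list
    means "nothing usable found".
    """
    def combined(image):
        for strategy in strategies:
            lines = strategy(image)
            if lines:              # first non-empty result wins
                return lines
        return []                  # every strategy came up empty
    return combined


# Stand-in strategies, not the library's implementations.
def morphological(image):
    return []                      # pretend morphology found nothing

def hough(image):
    return [(0, 50, 200, 50)]      # one horizontal line as (x1, y1, x2, y2)


strategy = fallback_strategy(morphological, hough)
lines = strategy(None)
```

Keeping the fallback inside one callable preserves the "exactly one split step" rule: the recipe still sees a single grid splitter.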

`.split.vertical(...)` and `.split.horizontal(...)`

Strip splitters. Pass explicit cut coordinates or let the auto detector find whitespace gutters.

.split.vertical(at=[100, 200, 300])           # explicit cuts at x=100,200,300
.split.vertical()                              # auto — detect whitespace gutters
.split.horizontal(at=[80, 160, 240])           # explicit cuts at y=80,160,240
.split.horizontal()                            # auto

Auto-detection options:

.split.vertical(min_gap_px=12, threshold=0.95)

min_gap_px is the minimum width, in pixels, of a whitespace run that counts as a gutter; threshold is the fraction of pixels in a column that must be background for that column to count as whitespace.
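How the two options interact can be sketched in plain Python. This is an illustration under assumptions, not the library's detector: the page is a list of rows of 0 (background) and 1 (ink), cuts are placed at the centre of each gutter, and blank page margins are ignored. `find_vertical_gutters` is a hypothetical name.

```python
def find_vertical_gutters(ink, min_gap_px=12, threshold=0.95):
    """Return x-coordinates of cuts, one at the centre of each gutter.

    ink is a list of rows; each row is a list of 0 (background) or 1 (ink).
    """
    height = len(ink)
    width = len(ink[0]) if height else 0
    # A column is blank when at least `threshold` of its pixels are background.
    blank = [
        sum(1 for row in ink if row[x] == 0) / height >= threshold
        for x in range(width)
    ]
    cuts, run_start = [], None
    for x, is_blank in enumerate(blank + [False]):   # sentinel closes a final run
        if is_blank and run_start is None:
            run_start = x
        elif not is_blank and run_start is not None:
            run_len = x - run_start
            # Assumption: margins are not gutters, so require ink on both sides.
            if run_len >= min_gap_px and run_start > 0 and x < width:
                cuts.append(run_start + run_len // 2)
            run_start = None
    return cuts


# Two 10 px text columns separated by a 20 px gutter.
row = [1] * 10 + [0] * 20 + [1] * 10
page = [row[:] for _ in range(10)]
cuts = find_vertical_gutters(page)
```

Raising min_gap_px above the gutter width (20 px here) makes the same page yield no cuts, which is the lever to pull when narrow inter-word gaps are being mistaken for columns.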

`.split.boxes(detector=...)`

Hand off detection to an ML model or any callable that returns boxes. Useful when no axis-aligned grid exists.

.split.boxes(detector=MyMLDetector())
.split.boxes(detector=lambda page: [(x, y, w, h), ...])

The detector’s output is wrapped in Regions with role="data" and no grid. Add a .classify(...) step afterwards to tag them.
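Many detection models report corner coordinates and a score rather than (x, y, w, h) tuples, so a thin adapter is often needed. The sketch below assumes a hypothetical `predict` function that returns (x1, y1, x2, y2, score) tuples; `ThresholdedDetector` and `xyxy_to_xywh` are illustrative names, not part of the library.

```python
def xyxy_to_xywh(box):
    """Convert corner form (x1, y1, x2, y2) to (x, y, w, h)."""
    x1, y1, x2, y2 = box
    return (x1, y1, x2 - x1, y2 - y1)


class ThresholdedDetector:
    """Callable detector: keeps confident predictions, returns (x, y, w, h)."""

    def __init__(self, predict, min_score=0.5):
        self.predict = predict          # hypothetical: page -> [(x1, y1, x2, y2, score)]
        self.min_score = min_score

    def __call__(self, page):
        return [
            xyxy_to_xywh(pred[:4])
            for pred in self.predict(page)
            if pred[4] >= self.min_score
        ]


# Stub model output for illustration only.
def fake_predict(page):
    return [(10, 20, 110, 70, 0.9), (0, 0, 5, 5, 0.2)]


detector = ThresholdedDetector(fake_predict)
boxes = detector(None)
```

An instance like `detector` is the kind of callable that could be handed to `.split.boxes(detector=...)`, since it takes a page and returns a list of boxes.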

`.split.from_layout(...)`

Read regions from a sidecar JSON (the same shape result.to_json() writes) or directly from a prior DocumentResult. The human-in-the-loop (HITL) and confidence-driven re-OCR patterns both use this.

.split.from_layout("layout.json")                          # sidecar
.split.from_layout(prior_result)                           # prior run
.split.from_layout(prior_result,
                   where=lambda r: (r.confidence or 0) < 0.6)  # filter
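The `(r.confidence or 0)` idiom in the filter above treats a missing confidence as zero, so regions that were never scored are selected too. A reusable predicate might look like the following; the regions here are plain stand-in objects with hypothetical `id` and `confidence` attributes, not real library Regions.

```python
from types import SimpleNamespace


def below_confidence(limit):
    """Build a where= predicate; a missing confidence counts as 0."""
    def predicate(region):
        return (region.confidence or 0) < limit
    return predicate


# Stand-in regions for illustration only.
regions = [
    SimpleNamespace(id="a", confidence=0.95),
    SimpleNamespace(id="b", confidence=0.40),
    SimpleNamespace(id="c", confidence=None),   # never scored
]
needs_rerun = below_confidence(0.6)
selected = [r.id for r in regions if needs_rerun(r)]
```

Naming the predicate makes the threshold easy to share between a first pass and a refine pass.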

`.split.whole_page()`

One region per page. The default when no .split.* is in the recipe.

.split.whole_page()

Composing splitters

A recipe has exactly one split step, but the strategy= callable passed to .split.grid(...) lets you compose detectors internally — “try morphological, fall back to Hough” is one such composite.

For the “split then re-split on certain regions” pattern, run a refine slicer after split:

.split.grid()
.refine.slice_overflow(max_height_px=1500)    # split tall cells
.refine.slice_on_separator(direction="vertical", min_gap_px=20,
                           where=lambda r: r.role == "data")

See preprocessors.md › Region refine for the full refine reference.
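To make the "split tall cells" idea concrete, here is a minimal sketch of slicing an over-tall bounding box into pieces no taller than a limit. It is an assumption about the general technique, not the behaviour of `.refine.slice_overflow(...)` itself, which may pick cut positions differently (for example at whitespace).

```python
def slice_overflow(bbox, max_height_px=1500):
    """Cut an (x, y, w, h) box into stacked boxes of at most max_height_px."""
    if max_height_px <= 0:
        raise ValueError("max_height_px must be positive")
    x, y, w, h = bbox
    slices = []
    offset = 0
    while offset < h:
        piece = min(max_height_px, h - offset)   # last piece may be shorter
        slices.append((x, y + offset, w, piece))
        offset += piece
    return slices


tall_cell = (0, 100, 800, 3200)
pieces = slice_overflow(tall_cell)
```

The pieces tile the original box exactly, so no pixel is recognised twice and none is dropped.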

Writing your own splitter

Subclass Step:

from scriva import Step, Context, Layout, Region, Capability

class MySplitter(Step):
    phase = "split"
    name  = "my-splitter"
    capabilities = frozenset({Capability.GRID})

    async def run(self, ctx: Context) -> Context:
        # my_model stands for your own detector; it is not imported from scriva.
        boxes = my_model.predict(ctx.page.image)
        # One Region per box, with bbox in page-pixel (x, y, w, h) coordinates.
        regions = [Region(bbox=b, role="data") for b in boxes]
        ctx.layout = Layout.from_regions(regions, page=ctx.page)
        return ctx

…or decorate a function:

from scriva import step, Layout, Region, Capability

@step(phase="split", name="my-splitter",
      capabilities=frozenset({Capability.GRID}))
async def my_splitter(ctx):
    boxes = my_model.predict(ctx.page.image)
    ctx.layout = Layout.from_regions(
        [Region(bbox=b, role="data") for b in boxes],
        page=ctx.page,
    )
    return ctx

recipe = recipe.then(my_splitter)   # the escape hatch for non-standard steps

The only contract you must honour:

Every Region you produce must have a valid bbox in page-pixel coordinates (origin top-left).
If your regions are grid cells, set region.grid = GridCell(row=..., col=...).
If two regions are part of one logical merged region, share a merge_group_id. The Orchestrator handles the rest.

That is the whole interface.
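A custom splitter can guard the first rule with a small check before building Regions. The sketch assumes "valid" means a positive-sized (x, y, w, h) box that lies entirely inside the page; the source does not define validity more precisely, and `check_bbox` is a hypothetical helper.

```python
def check_bbox(bbox, page_width, page_height):
    """Return bbox unchanged if it is usable, otherwise raise ValueError.

    Assumption: a valid bbox has positive size and stays within the page.
    """
    x, y, w, h = bbox
    if w <= 0 or h <= 0:
        raise ValueError(f"bbox {bbox} has non-positive size")
    if x < 0 or y < 0 or x + w > page_width or y + h > page_height:
        raise ValueError(
            f"bbox {bbox} falls outside the {page_width}x{page_height} page"
        )
    return bbox


# Filter model output down to boxes that satisfy the contract.
raw_boxes = [(0, 0, 10, 10), (95, 0, 10, 10), (5, 5, 0, 20)]
good = []
for box in raw_boxes:
    try:
        good.append(check_bbox(box, page_width=100, page_height=100))
    except ValueError:
        pass   # in real code, log or clip instead of silently dropping
```

Whether to drop, clip, or raise on a bad box is a design choice; failing loudly during development tends to surface coordinate-system mix-ups early.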

Coordinate conventions

Origin: top-left.
Units: page pixels at the page’s native DPI.
Bounding boxes: (x, y, w, h). Polygons: list of (x, y) clockwise.
Rotation: splitters receive an already-deskewed page if a preprocess step ran; do not re-implement orientation correction in a splitter.
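The two shapes above convert into each other with a few lines. This sketch assumes (x, y) is the top-left corner of the box and that a polygon starts at that corner; with the origin at top-left and y growing downward, the order top-left, top-right, bottom-right, bottom-left reads clockwise on screen. Both function names are hypothetical.

```python
def bbox_to_polygon(bbox):
    """(x, y, w, h) -> four corners, clockwise on screen from the top-left."""
    x, y, w, h = bbox
    return [(x, y), (x + w, y), (x + w, y + h), (x, y + h)]


def polygon_to_bbox(points):
    """Smallest axis-aligned (x, y, w, h) box that contains every point."""
    xs = [px for px, _ in points]
    ys = [py for _, py in points]
    return (min(xs), min(ys), max(xs) - min(xs), max(ys) - min(ys))


corners = bbox_to_polygon((10, 20, 30, 40))
round_trip = polygon_to_bbox(corners)
```

The round trip is lossless for axis-aligned boxes; for a rotated polygon, `polygon_to_bbox` returns only its enclosing box.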

Performance notes

Splitters are CPU-bound. The Orchestrator runs them in an executor rather than on the asyncio event loop itself, so a slow splitter does not stall other coroutines.
A typical morphological grid splitter returns in 50–200 ms for a 2400 px page. If yours is slower, profile before parallelising — pages run sequentially by default and that is usually fine.
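For profiling a blocking splitter outside the library, the standard asyncio pattern for keeping such work off the event loop can double as a timer. This is a generic sketch using `loop.run_in_executor`; how the Orchestrator dispatches work internally is not described here, and `slow_split` is a stand-in for real detection code.

```python
import asyncio
import time


def slow_split(page_height):
    """Stand-in for blocking detection work."""
    time.sleep(0.05)
    return [(0, 0, 100, page_height)]


async def run_off_loop(func, *args):
    """Run a blocking function in the default executor and time it."""
    loop = asyncio.get_running_loop()
    start = time.perf_counter()
    # None selects the loop's default executor (a thread pool).
    result = await loop.run_in_executor(None, func, *args)
    elapsed_ms = (time.perf_counter() - start) * 1000
    return result, elapsed_ms


result, elapsed_ms = asyncio.run(run_off_loop(slow_split, 2400))
```

Comparing `elapsed_ms` against the 50–200 ms range quoted above is a quick first check before reaching for a full profiler.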
