Detectors

A detector finds regions on a page. scriva ships two families:

  • LayoutDetector — produces a Layout from a Page.
  • RegionClassifier — refines an existing Layout by tagging regions (blank, header, merged…).

Detectors are stateless and reentrant; you can share one instance across threads or pipeline instances.

class LayoutDetector(Engine, Protocol):
    name: str
    async def detect(self, page: Page) -> Layout: ...

class RegionClassifier(Engine, Protocol):
    name: str
    async def classify(self, page: Page, layout: Layout) -> Layout: ...

Both write into the pipeline Context; the standard implementations also emit a finished event with a count summary.
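To make the protocol shape and the "stateless and reentrant" claim concrete, here is a minimal sketch that runs without scriva installed. Page, Layout, LayoutDetectorLike and WholePageSketch are hypothetical stand-ins written for this example, not scriva's real types, and the Engine base is left out because its contents are not described here.

```python
import asyncio
from dataclasses import dataclass
from typing import Protocol, runtime_checkable


# Stand-in types for illustration only; scriva's real Page and Layout differ.
@dataclass(frozen=True)
class Page:
    width: int
    height: int


@dataclass(frozen=True)
class Layout:
    boxes: tuple = ()  # tuple of (x, y, w, h) boxes


@runtime_checkable
class LayoutDetectorLike(Protocol):
    name: str

    async def detect(self, page: Page) -> Layout: ...


class WholePageSketch:
    """Keeps no per-call state, so one instance can serve many pages."""

    name = "whole_page_sketch"

    async def detect(self, page: Page) -> Layout:
        return Layout(boxes=((0, 0, page.width, page.height),))


async def detect_all(detector: LayoutDetectorLike, pages: list) -> list:
    # The same detector instance is shared by every concurrent call.
    return await asyncio.gather(*(detector.detect(p) for p in pages))


detector = WholePageSketch()
pages = [Page(2400, 3200), Page(1200, 1600)]
layouts = asyncio.run(detect_all(detector, pages))
```

Because the protocol is structural, WholePageSketch matches it without inheriting from anything; whether scriva's own protocols are runtime-checkable is not stated above.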

Built-in detectors are importable as factories from scriva.detect (lowercase) or as classes from scriva.detectors (subclassable). They are the same objects; pick whichever reads better.

| Factory | Class | Strategy | When to reach for it |
|---|---|---|---|
| detect.morphological_grid(...) | MorphologicalGridDetector | OpenCV morphological line extraction + line grouping | Scanned tabular forms with visible rules |
| detect.hough_grid(...) | HoughGridDetector | Probabilistic Hough transform | Forms with broken or faded rules where morphology fails |
| detect.whole_page(...) | WholePageDetector | One region = the whole page | Unstructured docs you hand to an LLM as one shot |
| detect.box_annotations(...) | BoxAnnotationDetector | Read pre-annotated boxes from a JSON sidecar | Re-run OCR on human-corrected regions |
| detect.fallback(*detectors) | FallbackDetector | Try each in order | "Try morphology, then Hough" |

All grid detectors accept the same options: min_line_length_ratio, merge_threshold_px, padding_px.
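The "try each in order" strategy of detect.fallback can be pictured with a small self-contained sketch. It does not reproduce scriva's FallbackDetector: fallback_sketch and the two toy detectors are hypothetical, a layout is reduced to a list of (x, y, w, h) boxes, and the rule that a delegate "fails" when it raises or returns no regions is an assumption.

```python
import asyncio
from typing import Awaitable, Callable

# Hypothetical stand-ins: a "page" is any object and a "layout" is a list of
# (x, y, w, h) boxes. scriva's real Page and Layout types are richer.
Detector = Callable[[object], Awaitable[list]]


def fallback_sketch(*detectors: Detector) -> Detector:
    """Return a detector that tries each delegate in order.

    Assumption: a delegate "fails" when it raises or finds no regions.
    """
    if not detectors:
        raise ValueError("fallback needs at least one detector")

    async def detect(page: object) -> list:
        last_error = None
        for delegate in detectors:
            try:
                boxes = await delegate(page)
            except Exception as exc:  # remember the failure, try the next one
                last_error = exc
                continue
            if boxes:
                return boxes
        if last_error is not None:
            raise last_error  # every delegate failed; surface the last error
        return []

    return detect


async def no_grid(page):  # stand-in for a detector that finds nothing
    return []


async def one_box(page):  # stand-in for a detector that succeeds
    return [(10, 20, 300, 40)]


detect = fallback_sketch(no_grid, one_box)
result = asyncio.run(detect(object()))
```

Here the first delegate returns nothing, so the result comes from the second one.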

| Factory | Class | Marks | How |
|---|---|---|---|
| classify.rule_based(...) | RuleBasedClassifier | blank, merged | Adaptive threshold + ink-density + border presence |
| classify.embedding(...) | EmbeddingClassifier | blank, merged | Image-embedding + trained sklearn/LightGBM head |
| classify.hybrid(...) | HybridClassifier | both | Rule-based first; falls back to embedding when low-confidence |
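The hybrid row describes a confidence-gated fallback. The sketch below shows that control flow only. Everything in it is hypothetical: hybrid_sketch, rule_based_sketch, the min_confidence cut-off and the fixed confidence values are invented for illustration, and treating blank_density as an ink-density threshold is an assumption about what that option means.

```python
from typing import Callable

# A classifier here maps a region's ink density to (label, confidence).
Prediction = tuple[str, float]
Classifier = Callable[[float], Prediction]


def rule_based_sketch(ink_density: float, blank_density: float = 0.002) -> Prediction:
    """Confident only when the density is far from the threshold."""
    if ink_density <= blank_density / 2:
        return "blank", 0.95
    if ink_density >= blank_density * 2:
        return "data", 0.95
    # Near the threshold the rule still guesses, but with low confidence.
    return ("blank" if ink_density < blank_density else "data"), 0.5


def hybrid_sketch(rule: Classifier, learned: Classifier,
                  min_confidence: float = 0.8) -> Classifier:
    def classify(ink_density: float) -> Prediction:
        label, confidence = rule(ink_density)
        if confidence >= min_confidence:
            return label, confidence
        # The rule was unsure, so defer to the learned classifier.
        return learned(ink_density)

    return classify


def learned_stub(ink_density: float) -> Prediction:
    # Stand-in for an embedding-based classifier.
    return "data", 0.7


classify_region = hybrid_sketch(rule_based_sketch, learned_stub)
clearly_blank = classify_region(0.0005)
borderline = classify_region(0.0015)
clearly_inked = classify_region(0.01)
```

Only the borderline region reaches the learned classifier, which is the point of the design: the cheap rule handles the easy cases and the expensive model sees the rest.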

classify.embedding is engine-agnostic: pass any ImageEmbedder and any trained model that exposes predict_proba. Two embedders ship — OpenAIEmbedder and AzureVisionEmbedder — both behind optional extras.

classify.embedding.train(...) fits a head from a SampleStore and returns a ready-to-use RegionClassifier:

import scriva
from scriva import classify, samples
from scriva.classify import LightGBMHead
from scriva.detect import morphological_grid
from scriva.embedders import OpenAIEmbedder

store = samples.fs(".scriva_samples")
clf = classify.embedding.train(
    samples=store,
    where=lambda s: s.attrs.get("annotated") is True,  # optional filter
    embedder=OpenAIEmbedder(model="text-embedding-3-small"),
    head=LightGBMHead(),  # ClassifierHead protocol
    target=lambda s: s.attrs.get("role", "data"),  # what to predict
)
clf.save("models/cells.joblib")

# Use it in a pipeline
pipeline = scriva.Pipeline(
    morphological_grid(),
    classify.embedding.load("models/cells.joblib"),
    ...,
)

The head is a ClassifierHead protocol — anything with fit(X, y) / predict_proba(X) / save(path) / load(path). LightGBMHead, SklearnHead, and TorchMlpHead ship in core; bring your own for exotic shapes.
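As an illustration of "bring your own", here is a minimal sketch of a head with the four methods named above, using only the standard library. CentroidHead is hypothetical and not part of scriva. Making load a classmethod, storing the model as JSON, and having fit return self are all assumptions, since the exact signatures are not spelled out here.

```python
import json
import math
import tempfile
from pathlib import Path


class CentroidHead:
    """Nearest-centroid head: one mean vector per class."""

    def __init__(self):
        self.classes_ = []
        self.centroids_ = []

    def fit(self, X, y):
        groups = {}
        for row, label in zip(X, y):
            groups.setdefault(label, []).append(row)
        self.classes_ = sorted(groups)
        # Column-wise mean of the rows belonging to each class.
        self.centroids_ = [
            [sum(col) / len(groups[c]) for col in zip(*groups[c])]
            for c in self.classes_
        ]
        return self

    def predict_proba(self, X):
        # Softmax over negative Euclidean distance to each centroid.
        out = []
        for row in X:
            scores = [-math.dist(row, c) for c in self.centroids_]
            top = max(scores)
            weights = [math.exp(s - top) for s in scores]
            total = sum(weights)
            out.append([w / total for w in weights])
        return out

    def save(self, path):
        payload = {"classes": self.classes_, "centroids": self.centroids_}
        Path(path).write_text(json.dumps(payload))

    @classmethod
    def load(cls, path):
        data = json.loads(Path(path).read_text())
        head = cls()
        head.classes_ = data["classes"]
        head.centroids_ = data["centroids"]
        return head


head = CentroidHead().fit(
    X=[[0.0, 0.0], [0.2, 0.0], [1.0, 1.0], [0.8, 1.0]],
    y=["blank", "blank", "data", "data"],
)
probs = head.predict_proba([[0.1, 0.0]])[0]

# Round-trip through save/load in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    model_path = Path(tmp) / "head.json"
    head.save(model_path)
    restored = CentroidHead.load(model_path)
```

The columns of predict_proba follow the sorted class labels, so probs[0] is the probability of "blank" in this toy data.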

Training reads samples lazily — large stores are streamed, not materialised.
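Streaming rather than materialising usually means pulling samples from an iterator in bounded batches. The sketch below shows that general pattern only; it says nothing about how scriva's SampleStore is implemented. read_store and batched are hypothetical helpers, and the loaded list exists purely to make the laziness observable.

```python
from itertools import islice
from typing import Iterable, Iterator

loaded = []  # records which samples have actually been read so far


def read_store(count: int) -> Iterator[dict]:
    """Stand-in for a store that reads one sample at a time."""
    for i in range(count):
        loaded.append(i)
        yield {"id": i}


def batched(samples: Iterable, size: int) -> Iterator[list]:
    """Yield lists of up to `size` samples without reading ahead."""
    it = iter(samples)
    while batch := list(islice(it, size)):
        yield batch


stream = batched(read_store(10), size=4)
first = next(stream)
loaded_after_first = len(loaded)  # only the first batch has been read
rest = list(stream)
```

After the first batch, only four of the ten samples have been touched; the remainder is read as the consumer asks for it.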

A pipeline has one detector and at most one classifier, but you can wrap either in a composite:

from scriva.detect import fallback, morphological_grid, hough_grid
from scriva.classify import hybrid, rule_based, embedding

detect = fallback(
    morphological_grid(),
    hough_grid(min_line_length_ratio=0.05),
)
classify = hybrid(
    rule=rule_based(blank_density=0.002),
    learned=embedding.load("models/cells.joblib"),
)

Subclass:

from scriva import LayoutDetector, Layout, Region, Page, Capability

class MyDetector(LayoutDetector):
    capabilities = frozenset({Capability.GRID})

    async def detect(self, page: Page) -> Layout:
        boxes = my_model.predict(page.image)  # my_model: your own model object
        regions = [Region(bbox=b, role="data") for b in boxes]
        return Layout.from_regions(regions, page=page)

…or decorate a function:

from scriva import detector, Layout, Region, Capability

@detector(capabilities=frozenset({Capability.GRID}))
async def my_detector(page):
    boxes = my_model.predict(page.image)  # my_model: your own model object
    return Layout.from_regions(
        [Region(bbox=b, role="data") for b in boxes],
        page=page,
    )

The only contract you must honour:

  • Every Region you produce must have a valid bbox in page-pixel coordinates (origin top-left).
  • If your regions are grid cells, set region.grid = GridCell(row=..., col=...).
  • If two regions are part of one logical merged region, give them the same merge_group_id. The pipeline handles the rest.

That is the whole interface.
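A quick way to catch contract violations before handing regions to the pipeline is a small check like the one below. It is a sketch under stated assumptions: Region and GridCell are stand-in dataclasses rather than scriva's classes, contract_violations is a hypothetical helper, and "valid bbox" is taken to mean positive size and fully inside the page.

```python
from dataclasses import dataclass
from typing import Optional


# Stand-ins for illustration; scriva's own Region and GridCell may differ.
@dataclass(frozen=True)
class GridCell:
    row: int
    col: int


@dataclass
class Region:
    bbox: tuple  # (x, y, w, h) in page pixels, origin top-left
    role: str = "data"
    grid: Optional[GridCell] = None
    merge_group_id: Optional[str] = None


def contract_violations(regions, page_width, page_height):
    """Return a list of human-readable problems (empty if none)."""
    problems = []
    for i, region in enumerate(regions):
        x, y, w, h = region.bbox
        if w <= 0 or h <= 0:
            problems.append(f"region {i}: bbox has non-positive size")
        elif x < 0 or y < 0 or x + w > page_width or y + h > page_height:
            problems.append(f"region {i}: bbox extends outside the page")
        if region.grid is not None and (region.grid.row < 0 or region.grid.col < 0):
            problems.append(f"region {i}: negative grid index")
    return problems


regions = [
    # Two cells of one logical merged region share a merge_group_id.
    Region(bbox=(0, 0, 100, 40), grid=GridCell(0, 0), merge_group_id="m1"),
    Region(bbox=(100, 0, 100, 40), grid=GridCell(0, 1), merge_group_id="m1"),
    # This one pokes past the right edge of a 2400 px wide page.
    Region(bbox=(2350, 0, 100, 40), grid=GridCell(0, 2)),
]
problems = contract_violations(regions, page_width=2400, page_height=3200)
```

Running a check like this in a unit test for your detector is cheaper than debugging a malformed Layout further down the pipeline.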

  • Origin: top-left.
  • Units: page pixels at the page’s native DPI.
  • Bounding boxes: (x, y, w, h). Polygons: list of (x, y) clockwise.
  • Rotation: detectors receive an already-deskewed page if a preprocessor ran; do not re-implement orientation correction in a detector.
  • Detectors are CPU-bound. The pipeline runs them on the asyncio executor, not the event loop.
  • A typical morphological grid detector returns in 50–200 ms for a 2400 px page. If yours is slower, profile before parallelising — pages run sequentially by default and that is usually fine.
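The coordinate conventions above (top-left origin, (x, y, w, h) boxes, clockwise polygons) can be checked with a few lines of plain Python. Both helpers are hypothetical and not part of scriva, and "clockwise" is assumed to mean clockwise as seen on screen, with y growing downward.

```python
def bbox_to_polygon(bbox):
    """Convert (x, y, w, h) to four corners, clockwise from the top-left."""
    x, y, w, h = bbox
    return [(x, y), (x + w, y), (x + w, y + h), (x, y + h)]


def is_clockwise(polygon):
    """True if the vertices run clockwise in image coordinates (y down).

    Uses the shoelace sum: with y growing downward, a positive sum means
    the vertices are visited clockwise on screen.
    """
    shifted = polygon[1:] + polygon[:1]
    twice_area = sum(x1 * y2 - x2 * y1 for (x1, y1), (x2, y2) in zip(polygon, shifted))
    return twice_area > 0


polygon = bbox_to_polygon((10, 20, 30, 40))
clockwise = is_clockwise(polygon)
reversed_clockwise = is_clockwise(list(reversed(polygon)))
```

If a model emits polygons in the opposite winding, reversing the vertex list is enough to satisfy the convention.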