Detectors
A detector finds regions on a page. scriva ships two families:

- LayoutDetector: produces a Layout from a Page.
- RegionClassifier: refines an existing Layout by tagging regions (blank, header, merged…).
Detectors are stateless and reentrant; you can share one instance across threads or pipeline instances.
Protocols
```python
class LayoutDetector(Engine, Protocol):
    name: str

    async def detect(self, page: Page) -> Layout: ...


class RegionClassifier(Engine, Protocol):
    name: str

    async def classify(self, page: Page, layout: Layout) -> Layout: ...
```

Both write into the pipeline Context; the standard implementations also
emit a finished event with a count summary.
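Because both interfaces are protocols, a class can satisfy them by shape alone. The sketch below illustrates that idea with stand-in types: Page, Layout, and FullPage are hypothetical and defined locally, the Engine base is left out, and nothing here is imported from scriva.

```python
import asyncio
from dataclasses import dataclass, field
from typing import Protocol, runtime_checkable


# Stand-in types for illustration only; scriva's real Page and Layout differ.
@dataclass
class Page:
    width: int
    height: int


@dataclass
class Layout:
    regions: list = field(default_factory=list)


@runtime_checkable
class LayoutDetector(Protocol):
    name: str

    async def detect(self, page: Page) -> Layout: ...


class FullPage:
    """Satisfies the protocol structurally, with no inheritance."""

    name = "full_page"

    async def detect(self, page: Page) -> Layout:
        # One region covering the page, as an (x, y, w, h) box.
        return Layout(regions=[(0, 0, page.width, page.height)])


detector = FullPage()
layout = asyncio.run(detector.detect(Page(width=800, height=600)))
print(isinstance(detector, LayoutDetector), layout.regions)
```

The isinstance check only confirms that the name and detect attributes exist; it does not validate signatures or return types.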
Built-in layout detectors
Importable as factories from scriva.detect (lowercase) or as classes from
scriva.detectors (subclassable). They are the same objects; pick whichever
reads better.
| Factory | Class | Strategy | When to reach for it |
|---|---|---|---|
| detect.morphological_grid(...) | MorphologicalGridDetector | OpenCV morphological line extraction + line grouping | Scanned tabular forms with visible rules |
| detect.hough_grid(...) | HoughGridDetector | Probabilistic Hough transform | Forms with broken or faded rules where morphology fails |
| detect.whole_page(...) | WholePageDetector | One region = the whole page | Unstructured docs you hand to an LLM as one shot |
| detect.box_annotations(...) | BoxAnnotationDetector | Read pre-annotated boxes from a JSON sidecar | Re-run OCR on human-corrected regions |
| detect.fallback(*detectors) | FallbackDetector | Try each in order | “Try morphology, then Hough” |
All grid detectors accept the same options: min_line_length_ratio,
merge_threshold_px, padding_px.
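The "try each in order" strategy of detect.fallback can be pictured with a small composite. This is a sketch, not scriva's FallbackDetector: the Layout stand-in, the Fixed toy detector, and the success criterion (first layout with at least one region wins) are all assumptions made for illustration.

```python
import asyncio
from dataclasses import dataclass, field


@dataclass
class Layout:  # stand-in, not scriva's Layout
    regions: list = field(default_factory=list)


class Fallback:
    """Try each detector in order; return the first layout that has regions."""

    def __init__(self, *detectors):
        if not detectors:
            raise ValueError("at least one detector is required")
        self.detectors = detectors
        self.name = "fallback(" + ", ".join(d.name for d in detectors) + ")"

    async def detect(self, page):
        layout = Layout()
        for det in self.detectors:
            layout = await det.detect(page)
            if layout.regions:  # assumed success criterion
                return layout
        return layout  # every detector came back empty


class Fixed:
    """Toy detector that always returns the same regions."""

    def __init__(self, name, regions):
        self.name = name
        self.regions = regions

    async def detect(self, page):
        return Layout(regions=list(self.regions))


chain = Fallback(Fixed("morph", []), Fixed("hough", [(10, 10, 50, 20)]))
result = asyncio.run(chain.detect(page=None))
print(chain.name, result.regions)
```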
Built-in classifiers
| Factory | Class | Marks | How |
|---|---|---|---|
| classify.rule_based(...) | RuleBasedClassifier | blank, merged | Adaptive threshold + ink-density + border presence |
| classify.embedding(...) | EmbeddingClassifier | blank, merged | Image-embedding + trained sklearn/LightGBM head |
| classify.hybrid(...) | HybridClassifier | both | Rule-based first; falls back to embedding when low-confidence |
classify.embedding is engine-agnostic: pass any ImageEmbedder and any
trained model that exposes predict_proba. Two embedders ship —
OpenAIEmbedder and AzureVisionEmbedder — both behind optional extras.
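The hybrid row in the table above describes a confidence-gated fallback. A minimal sketch of that decision follows, assuming each classifier returns a (label, confidence) pair; the function names, the min_confidence parameter, and its 0.8 default are hypothetical and not part of scriva.

```python
from typing import Callable, Tuple

Prediction = Tuple[str, float]  # (label, confidence in [0, 1])


def hybrid_classify(
    region,
    rule: Callable[[object], Prediction],
    learned: Callable[[object], Prediction],
    min_confidence: float = 0.8,  # hypothetical threshold
) -> Prediction:
    """Use the rule result unless its confidence is below the threshold."""
    label, confidence = rule(region)
    if confidence >= min_confidence:
        return label, confidence
    return learned(region)


def ink_rule(region) -> Prediction:
    # Toy rule: very low ink density means "blank"; otherwise unsure.
    density = region["ink_density"]
    return ("blank", 0.95) if density < 0.002 else ("data", 0.5)


def toy_model(region) -> Prediction:
    # Stand-in for an embedding classifier.
    return ("merged", 0.9)


confident = hybrid_classify({"ink_density": 0.001}, ink_rule, toy_model)
unsure = hybrid_classify({"ink_density": 0.2}, ink_rule, toy_model)
print(confident, unsure)
```

Running the cheap rule first means the embedding model is only consulted for regions the rule cannot settle.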
Training an embedding classifier
classify.embedding.train(...) fits a head from a SampleStore
and returns a ready-to-use RegionClassifier:

```python
import scriva
from scriva import classify, samples
from scriva.classify import LightGBMHead
from scriva.detect import morphological_grid
from scriva.embedders import OpenAIEmbedder

store = samples.fs(".scriva_samples")

clf = classify.embedding.train(
    samples=store,
    where=lambda s: s.attrs.get("annotated") is True,  # optional filter
    embedder=OpenAIEmbedder(model="text-embedding-3-small"),
    head=LightGBMHead(),  # ClassifierHead protocol
    target=lambda s: s.attrs.get("role", "data"),  # what to predict
)
clf.save("models/cells.joblib")

# Use it in a pipeline
pipeline = scriva.Pipeline(
    morphological_grid(),
    classify.embedding.load("models/cells.joblib"),
    ...,
)
```

The head follows the ClassifierHead protocol: anything with
fit(X, y) / predict_proba(X) / save(path) / load(path). LightGBMHead,
SklearnHead, and TorchMlpHead ship in core; bring your own for
exotic shapes.
Training reads samples lazily — large stores are streamed, not materialised.
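To make the ClassifierHead shape concrete, here is a toy head written against the four method names listed above. It is a sketch only: CentroidHead is hypothetical, the JSON file format is an arbitrary choice, and treating load as a classmethod is an assumption, since the method's exact form is not spelled out here.

```python
import json
import math
import tempfile
from pathlib import Path


class CentroidHead:
    """Toy head: softmax over negative distances to per-class centroids."""

    def fit(self, X, y):
        sums, counts = {}, {}
        for row, label in zip(X, y):
            acc = sums.setdefault(label, [0.0] * len(row))
            for i, value in enumerate(row):
                acc[i] += value
            counts[label] = counts.get(label, 0) + 1
        self.classes_ = sorted(sums)
        self.centroids_ = [
            [v / counts[c] for v in sums[c]] for c in self.classes_
        ]
        return self

    def predict_proba(self, X):
        out = []
        for row in X:
            scores = [math.exp(-math.dist(row, c)) for c in self.centroids_]
            total = sum(scores)
            out.append([s / total for s in scores])
        return out

    def save(self, path):
        state = {"classes": self.classes_, "centroids": self.centroids_}
        Path(path).write_text(json.dumps(state))

    @classmethod
    def load(cls, path):  # assumed to be a classmethod
        state = json.loads(Path(path).read_text())
        head = cls()
        head.classes_ = state["classes"]
        head.centroids_ = state["centroids"]
        return head


head = CentroidHead().fit(
    X=[[0.0, 0.0], [0.1, 0.0], [1.0, 1.0], [0.9, 1.0]],
    y=["blank", "blank", "data", "data"],
)
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "head.json"
    head.save(path)
    restored = CentroidHead.load(path)

probs = restored.predict_proba([[0.05, 0.0]])[0]
print(restored.classes_, probs)
```

Each row of predict_proba follows the order of classes_, so the first probability here belongs to "blank".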
Composing detectors
A pipeline has one detector and at most one classifier, but you can wrap either in a composite:

```python
from scriva.detect import fallback, morphological_grid, hough_grid
from scriva.classify import hybrid, rule_based, embedding

detect = fallback(
    morphological_grid(),
    hough_grid(min_line_length_ratio=0.05),
)
classify = hybrid(
    rule=rule_based(blank_density=0.002),
    learned=embedding.load("models/cells.joblib"),
)
```

Writing your own detector
Subclass:

```python
from scriva import LayoutDetector, Layout, Region, Page, Capability


class MyDetector(LayoutDetector):
    capabilities = frozenset({Capability.GRID})

    async def detect(self, page: Page) -> Layout:
        # my_model is your own model object, defined elsewhere.
        boxes = my_model.predict(page.image)
        regions = [Region(bbox=b, role="data") for b in boxes]
        return Layout.from_regions(regions, page=page)
```

…or decorate a function:

```python
from scriva import detector, Layout, Region, Capability


@detector(capabilities=frozenset({Capability.GRID}))
async def my_detector(page):
    # my_model is your own model object, defined elsewhere.
    boxes = my_model.predict(page.image)
    return Layout.from_regions(
        [Region(bbox=b, role="data") for b in boxes],
        page=page,
    )
```

The only contract you must honour:
- Every Region you produce must have a valid bbox in page-pixel coordinates (origin top-left).
- If your regions are grid cells, set region.grid = GridCell(row=..., col=...).
- If two regions are part of one logical merged region, share a merge_group_id. The pipeline handles the rest.
That is the whole interface.
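A quick self-check can catch contract violations before a custom detector reaches a pipeline. The sketch below uses local stand-ins for Region and GridCell rather than scriva's classes, and its reading of "valid bbox" (positive size, non-negative origin, inside the page) is an assumption; bbox_problems is a hypothetical helper.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class GridCell:  # stand-in
    row: int
    col: int


@dataclass
class Region:  # stand-in
    bbox: Tuple[int, int, int, int]  # (x, y, w, h), origin top-left
    role: str = "data"
    grid: Optional[GridCell] = None
    merge_group_id: Optional[str] = None


def bbox_problems(region: Region, page_w: int, page_h: int) -> list:
    """Return human-readable problems with a region's bbox (empty if fine)."""
    x, y, w, h = region.bbox
    problems = []
    if w <= 0 or h <= 0:
        problems.append("non-positive size")
    if x < 0 or y < 0:
        problems.append("negative origin")
    if x + w > page_w or y + h > page_h:
        problems.append("extends past page")
    return problems


# Two cells of one logical merged region share a merge_group_id.
left = Region((0, 0, 100, 40), grid=GridCell(0, 0), merge_group_id="m1")
right = Region((100, 0, 100, 40), grid=GridCell(0, 1), merge_group_id="m1")
bad = Region((950, 0, 100, 40))

for r in (left, right, bad):
    print(r.bbox, bbox_problems(r, page_w=1000, page_h=1400))
```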
Coordinate conventions
- Origin: top-left.
- Units: page pixels at the page’s native DPI.
- Bounding boxes: (x, y, w, h). Polygons: list of (x, y) clockwise.
- Rotation: detectors receive an already-deskewed page if a preprocessor ran; do not re-implement orientation correction in a detector.
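The conventions above can be checked with a few lines of arithmetic. This sketch converts an (x, y, w, h) box into a clockwise polygon and verifies the winding with the shoelace formula; both helper names are hypothetical, and starting at the top-left corner is an assumption, since only the winding direction is fixed.

```python
def bbox_to_polygon(bbox):
    """Convert (x, y, w, h) to four corners, clockwise as seen on the page."""
    x, y, w, h = bbox
    return [
        (x, y),          # top-left
        (x + w, y),      # top-right
        (x + w, y + h),  # bottom-right
        (x, y + h),      # bottom-left
    ]


def signed_area(polygon):
    """Shoelace sum; positive means clockwise when the y axis points down."""
    total = 0
    for (x1, y1), (x2, y2) in zip(polygon, polygon[1:] + polygon[:1]):
        total += x1 * y2 - x2 * y1
    return total / 2


corners = bbox_to_polygon((10, 20, 30, 40))
print(corners, signed_area(corners))
```

With a top-left origin the y axis points down, so a visually clockwise polygon has a positive shoelace area, the opposite of the usual y-up convention.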
Performance notes
- Detectors are CPU-bound. The pipeline runs them on the asyncio executor, not the event loop.
- A typical morphological grid detector returns in 50–200 ms for a 2400 px page. If yours is slower, profile before parallelising — pages run sequentially by default and that is usually fine.
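If you call blocking detection code yourself from async code, the same offloading pattern applies. The sketch below uses the event loop's default thread pool via run_in_executor; find_boxes is a hypothetical stand-in for real detection work, and whether scriva uses the default executor or a custom one is not something this example asserts.

```python
import asyncio
import threading


def find_boxes(pixels):
    """Stand-in for blocking, CPU-bound detection work."""
    thread = threading.current_thread().name
    # Pretend every non-empty row is one region.
    boxes = [(0, i, len(row), 1) for i, row in enumerate(pixels) if any(row)]
    return boxes, thread


async def detect(pixels):
    loop = asyncio.get_running_loop()
    # None selects the loop's default ThreadPoolExecutor.
    return await loop.run_in_executor(None, find_boxes, pixels)


async def main():
    loop_thread = threading.current_thread().name
    boxes, worker_thread = await detect([[0, 0, 0], [0, 1, 0], [1, 1, 1]])
    return boxes, loop_thread, worker_thread


boxes, loop_thread, worker_thread = asyncio.run(main())
print(boxes, loop_thread != worker_thread)
```

The event loop stays free to schedule other coroutines while the worker thread runs the blocking call.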