Cookbook: rebuilding ocr-agent on scriva

ocr-agent is the production system scriva was extracted from: a browser OCR tool for Japanese tabular forms, agentic schema discovery, P&ID diagrams, and a Bedrock-driven annotation loop. This page is the ocr-agent → scriva translation, end-to-end. Every snippet is copy-pasteable into a fresh project.

The shape of the page mirrors ocr-agent’s own surface:

1. Normal OCR: grid detect → cell OCR → dictionary → rule-splitter → confidence → Excel.
2. Agentic OCR: discover fields, OCR rows, extract values into a key-value sheet.
3. P&ID OCR: find symbols by template, OCR each box, emit a structured catalogue.
4. Annotation loop: primary + oracle recognizers, sample store, classifier training.
5. Wrapping it in a FastAPI app: SSE progress, cancellation, human-in-the-loop pause/resume.

If you only read one section, read §1 — every other workflow is a rearrangement of the same five stage kinds.

Repository layout

These examples assume one application package on top of scriva:

ocr_app/
├── pipelines/
│   ├── forms.py        # §1
│   ├── agentic.py      # §2
│   ├── pid.py          # §3
│   └── annotation.py   # §4
├── stages/             # custom @postprocessor / @recognizer helpers
│   ├── blank_merges.py
│   └── refusal_strip.py
├── prompts/
│   └── jp.py           # locale-specific Prompt registrations
└── api/                # §5 — FastAPI app, routers, templates

Everything in pipelines/, stages/, and prompts/ is pure scriva configuration. Only api/ knows about HTTP, sessions, or job persistence — that boundary is the same boundary architecture.md draws.

Setup

Bring the extras you actually use:

pip install "scriva[openai,bedrock,excel,pgvector,pdf]"

…and configure the credentials each adapter needs through environment variables (OPENAI_API_KEY, AWS_REGION, AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, DATABASE_URL).
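
A fail-fast check at startup can turn a confusing adapter error into a readable one. The sketch below is not part of scriva: `missing_credentials` and `REQUIRED_ENV` are hypothetical names, and the variable list is simply the one given above, so trim it to the extras you actually installed.

```python
import os

# Names taken from the list above; drop the ones your extras do not need.
REQUIRED_ENV = (
    "OPENAI_API_KEY",
    "AWS_REGION",
    "AWS_ACCESS_KEY_ID",
    "AWS_SECRET_ACCESS_KEY",
    "DATABASE_URL",
)


def missing_credentials(env=None, required=REQUIRED_ENV):
    """Return the required variable names that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name)]
```

Calling `missing_credentials()` once when the application boots, and refusing to start if the list is non-empty, is usually enough.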

1. Normal OCR (tabular forms)

This is the workflow /api/ocr/upload runs in ocr-agent. Detect a grid, classify blank + merged cells, OCR each non-blank crop, correct against a dictionary, split internal ruled lines, score confidence, write Excel.

from pathlib import Path

import scriva
from scriva.preprocess  import orientation, deskew, crop
from scriva.detect      import fallback, morphological_grid, hough_grid
from scriva.classify    import hybrid, rule_based, embedding
from scriva.recognize   import openai
from scriva.postprocess import (
    whitespace, blank_suppress, dictionary,
    rule_splitter, confidence_score, postprocessor,
)
from scriva.export      import excel, json_
from scriva.cache       import Cache
from scriva.prompts     import Prompt

# ocr-agent-specific: a blank "merged" region is almost always a
# detection mistake — split it back into individual blank cells.
# See notes/excluded-from-docs.md §4 for why this stays in the app layer.
from ocr_app.stages.blank_merges import dissolve_blank_merges


def tabular_forms_pipeline(
    *,
    dict_yaml: Path | None = None,
    blank_density: float = 0.003,
    excel_out: Path = Path("out.xlsx"),
    json_out: Path | None = None,
    crop_bbox: tuple[int, int, int, int] | None = None,
) -> scriva.Pipeline:
    """The ocr-agent 'normal OCR' pipeline.

    `blank_density` corresponds to ocr-agent's strict/standard/loose
    presets (0.001 / 0.003 / 0.008). `crop_bbox` is what the Cropper.js
    selection becomes after the user confirms the crop screen.
    """
    stages: list = []

    # Preprocess
    if crop_bbox is not None:
        stages.append(crop(bbox=crop_bbox))
    stages += [orientation(), deskew()]

    # Detect: try morphology, fall back to Hough — same as grid_detect.py
    stages.append(fallback(
        morphological_grid(),
        hough_grid(min_line_length_ratio=0.05),
    ))

    # Classify: rule-based first, learned head for the in-between cases.
    # Mirrors `_detect_blank_cells_hybrid` (rulebased → CLIP for ambiguous).
    stages.append(hybrid(
        rule=rule_based(blank_density=blank_density),
        learned=embedding.load("models/cells.joblib"),  # optional; falls back if missing
    ))
    # Domain rule: dissolve all-blank merges back into singletons.
    stages.append(dissolve_blank_merges)

    # Recognize: GPT-4o, Japanese prompt, layered cache (exact + semantic).
    stages.append(openai(
        model="gpt-4o",
        prompt=Prompt.ocr(locale="ja"),
        cache=Cache.layered(".scriva_cache"),
        max_concurrency=8,
        crop_padding_px=0,   # ocr-agent does no padding
    ))

    # Post-process: clean → drop noise → dictionary → split ruled lines →
    # confidence. Order matters — see postprocessors.md.
    stages += [
        whitespace(),
        blank_suppress(),
    ]
    if dict_yaml is not None:
        stages.append(dictionary.from_yaml(dict_yaml))
    stages += [
        rule_splitter(),
        confidence_score.rendering(),  # cairosvg + Azure Vision embedder
    ]

    # Export: Excel (with confidence colouring + legend) and optional JSON
    stages.append(excel(
        excel_out,
        font="MS Gothic",
        confidence_thresholds=(0.6, 0.8),
        legend_sheet=True,
    ))
    if json_out is not None:
        stages.append(json_(json_out))

    return scriva.Pipeline(*stages, page_concurrency=1)
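
If your UI still sends ocr-agent's preset names rather than a number, a small lookup keeps the mapping in one place. The values are the ones quoted in the docstring above; `BLANK_DENSITY_PRESETS` and `blank_density_for` are hypothetical helper names, not scriva API.

```python
# strict / standard / loose, as quoted in the pipeline docstring.
BLANK_DENSITY_PRESETS = {
    "strict": 0.001,
    "standard": 0.003,
    "loose": 0.008,
}


def blank_density_for(preset: str) -> float:
    """Translate a preset name into the blank_density value it stands for."""
    try:
        return BLANK_DENSITY_PRESETS[preset]
    except KeyError:
        known = ", ".join(sorted(BLANK_DENSITY_PRESETS))
        raise ValueError(f"unknown preset {preset!r}; expected one of: {known}") from None
```

The result can be passed straight through as `tabular_forms_pipeline(blank_density=blank_density_for("loose"))`.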

Running it

pipeline = tabular_forms_pipeline(
    dict_yaml=Path("dict.yaml"),
    blank_density=0.003,
    excel_out=Path("outputs/jobs/abc123.xlsx"),
)
result = pipeline("scan.png")

print(f"{len(result.regions_with_text)} cells, "
      f"mean confidence {result.mean_confidence:.2%}")

That run wrote outputs/jobs/abc123.xlsx. For the confidence-coloured download that ocr-agent’s UI offers as a second button, swap the threshold form for a per-cell callback:

def colour(rec):
    if rec.confidence is None or rec.confidence >= 0.8:  return None
    if rec.confidence >= 0.6:                            return "yellow"
    return "red"

pipeline.replace("export-excel", excel(
    Path("outputs/jobs/abc123-coloured.xlsx"),
    confidence_fill=colour,
    legend_sheet=True,
    font="MS Gothic",
))
result = pipeline("scan.png")

Two custom stages worth seeing

The “dissolve all-blank merges” rule is only a few lines long, but it is not a @postprocessor: it runs against the layout, not the recognitions, so it has to sit ahead of recognize. That makes it a custom classifier-like stage, and the right shape is a Stage:

from scriva import Stage, Context, MergeInfo

class DissolveBlankMerges(Stage):
    name = "dissolve-blank-merges"
    capabilities = frozenset()

    async def run(self, ctx: Context) -> Context:
        regions = ctx.layout.regions
        new_regions = []
        for r in regions:
            if r.merge_group_id and r.role == "blank":
                # Strip the merge — leave the region as a standalone blank cell.
                r = r.model_copy(update={
                    "merge_group_id": None,
                    "merge": MergeInfo(),
                })
            new_regions.append(r)
        ctx.layout.regions = new_regions
        return ctx

dissolve_blank_merges = DissolveBlankMerges()

ocr-agent’s postprocess_vlm_text also strips VLM refusal boilerplate (「申し訳ありません … 読み取れません」), which the built-in blank_suppress does not cover. This rule does operate on recognitions, so it is a plain @postprocessor:

import re
from scriva import postprocessor

_REFUSAL = re.compile(
    r"(申し訳ありません|I'm sorry|I apologize|I cannot).*?"
    r"(?:読み取|確認|判読|認識|識別|read|unable).*?(?:できません|ません|cannot)[。．.]?\s*",
    re.IGNORECASE,
)

@postprocessor(name="strip-vlm-refusal")
async def strip_refusal(page, layout, recognitions):
    return {
        rid: r.with_text(_REFUSAL.sub("", r.text).strip()) if r.text else r
        for rid, r in recognitions.items()
    }

Drop it into the chain right after blank_suppress().
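
It is worth checking what the pattern does and does not catch before relying on it. The snippet below reuses the same regular expression on plain strings, with no scriva objects involved; `strip_refusal_text` is a hypothetical helper used only for this demonstration.

```python
import re

# Same pattern as the strip-vlm-refusal stage above.
_REFUSAL = re.compile(
    r"(申し訳ありません|I'm sorry|I apologize|I cannot).*?"
    r"(?:読み取|確認|判読|認識|識別|read|unable).*?(?:できません|ません|cannot)[。．.]?\s*",
    re.IGNORECASE,
)


def strip_refusal_text(text: str) -> str:
    """Remove refusal boilerplate and surrounding whitespace from one string."""
    return _REFUSAL.sub("", text).strip()


# A pure refusal collapses to the empty string.
print(repr(strip_refusal_text("申し訳ありませんが、この画像の文字は読み取れません。")))
# Real content in front of a refusal survives.
print(repr(strip_refusal_text("東京都 申し訳ありませんが、判読できません。")))
# Ordinary cell text is left alone.
print(repr(strip_refusal_text("株式会社テスト")))
```

As written, the pattern requires one of できません / ません / cannot after the verb, so an English refusal such as "I'm sorry, I cannot read this." passes through unchanged. If your model produces refusals of that form, extend the final alternation.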

Confidence-driven re-OCR (a second pipeline)

ocr-agent’s /api/ocr/{job}/reocr re-runs only the low-confidence cells through a different model, using the previous text as a hint. The canonical recipe lives in pipeline.md › Confidence-driven re-OCR; here is the ocr-agent adaptation:

from scriva import RecognitionHint
from scriva.detect    import box_annotations
from scriva.recognize import openai, anthropic

def reocr_pipeline(
    result: "scriva.DocumentResult",
    *,
    threshold: float = 0.6,
    model: str = "gpt-4o",
) -> scriva.Pipeline:
    Recognizer = openai if model.startswith(("gpt-", "o1", "o3", "o4")) else anthropic
    return scriva.Pipeline(
        box_annotations.from_result(
            result,
            where=lambda r: (r.confidence or 0) < threshold,
        ),
        Recognizer(
            model=model,
            prompt=scriva.prompts.Prompt.ocr_with_hint(),
            hints=RecognitionHint.from_result(result),
            cache=Cache.layered(".scriva_cache"),
        ),
        confidence_score.rendering(),
        excel(Path("outputs/jobs/abc123-refined.xlsx"), legend_sheet=True),
    )

refined = reocr_pipeline(result, threshold=0.6, model="gpt-4o")("scan.png")
combined = result.merge(refined, strategy="highest_confidence")
combined.to_excel("outputs/jobs/abc123-final.xlsx")
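
To make the merge step concrete, here is a minimal sketch of what a highest-confidence merge could look like on plain dictionaries. It assumes that strategy="highest_confidence" picks, per region, whichever recognition has the higher confidence; the real result.merge may treat ties or missing confidences differently. `Rec` and `merge_highest_confidence` are illustrative names, not scriva types.

```python
from typing import NamedTuple, Optional


class Rec(NamedTuple):
    text: str
    confidence: Optional[float]


def _score(rec: Rec) -> float:
    # A missing confidence ranks below every real score.
    return -1.0 if rec.confidence is None else rec.confidence


def merge_highest_confidence(base: dict, refined: dict) -> dict:
    """Per region id, keep whichever recognition is more confident."""
    merged = dict(base)
    for region_id, new in refined.items():
        old = merged.get(region_id)
        if old is None or _score(new) > _score(old):
            merged[region_id] = new
    return merged


first_pass = {"r1": Rec("東京", 0.92), "r2": Rec("大販", 0.41)}
second_pass = {"r2": Rec("大阪", 0.88)}
combined = merge_highest_confidence(first_pass, second_pass)
```

Only r2 was re-run, so r1 is carried over untouched and r2 takes the refined text because its confidence improved.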

2. Agentic OCR (schema discovery)

In ocr-agent’s agentic mode, a VLM discovers which fields exist, OCRs the form row by row, then extracts each field’s value with the row context in scope. The output is a key-value sheet (項目名 / 抽出値 / 参照行 / 備考), not a grid.

The whole thing collapses to the domains.agentic pack:

from pathlib import Path

import scriva
from scriva import domains
from scriva.cache import Cache


def agentic_pipeline(*, excel_out: Path) -> scriva.Pipeline:
    return domains.agentic.extract(
        schema=None,           # None = discover (3 passes by default)
        rounds=3,
        model="gpt-4o",
        json_out=excel_out.with_suffix(".json"),
    )

When you do know the schema up front — say you have a pydantic.BaseModel for the form — pass it as schema= and skip the discovery phase:

from pydantic import BaseModel

class InspectionForm(BaseModel):
    patient_name: str
    date_of_birth: str
    diagnosis_code: str
    dose_mg: float | None

pipeline = domains.agentic.extract(
    schema=InspectionForm,
    model="gpt-4o",
    json_out=Path("outputs/jobs/abc123.json"),
)

Customising the discovery prompt

ocr-agent’s discovery prompt is bilingual and includes the already-discovered list each round. To match that exactly, register your own and pass the pack a recognizer that uses it:

from scriva.prompts import Prompt
from scriva.recognize import openai

Prompt.register(
    "discover-ja",
    """この帳票画像を分析し、抽出すべきデータ項目（フィールド名）を列挙してください。
{% if known %}
【既に発見済みの項目】
{% for item in known %}- {{ item }}
{% endfor %}
上記以外にまだ発見されていない項目があれば、それらも含めて網羅的に列挙してください。
{% endif %}
各行は「- 項目名」の形式で出力。項目名のみ、値は不要です。""",
    locale="ja",
)

pipeline = domains.agentic.extract(
    rounds=3,
    discover_recognizer=openai(model="gpt-4o", prompt=Prompt.from_registered("discover-ja")),
    json_out=Path("outputs/jobs/abc123.json"),
)

Excel key-value output

DocumentResult.to_excel() writes a flat key-value table when the layout has no grid — exactly the shape ocr-agent’s _export_agentic_excel produces. Replace the bespoke exporter with:

result = pipeline("scan.png")
result.to_excel(
    "outputs/jobs/abc123.xlsx",
    font="MS Gothic",
    columns=["項目名", "抽出値", "参照行", "備考"],   # column header overrides
)

3. P&ID symbol OCR

Two routes match ocr-agent’s behaviour. Pick by how much of the workflow the user drives interactively.

3a. User drags patterns; scriva finds matches

ocr-agent’s UI lets the user drag-select a symbol, then runs cv2.matchTemplate to find similar instances. A custom detector wraps that:

from pathlib import Path

import cv2
import numpy as np
import scriva
from scriva import LayoutDetector, Layout, Region, BBox, Capability, detector
from scriva.recognize  import openai
from scriva.postprocess import whitespace
from scriva.export      import json_, excel
from scriva.prompts     import Prompt

from pydantic import BaseModel

class PidSymbol(BaseModel):
    types: str   # valve, pump, instrument, tank, …
    text: str    # equipment IDs, numbers, labels


@detector(name="pid-template-match", capabilities=frozenset({Capability.POLYGON}))
async def pid_template_match(page, *, templates: list[Path], threshold=0.8,
                             nms_overlap=0.2, min_distance=10):
    """Run cv2.matchTemplate for each pattern, NMS, min-distance filter."""
    image = page.image_as_ndarray()  # PIL or ndarray; both supported
    boxes: list[tuple[int, int, int, int, float, str]] = []
    for tpl_path in templates:
        tpl = cv2.imread(str(tpl_path))
        if tpl is None:  # cv2.imread returns None rather than raising
            raise FileNotFoundError(tpl_path)
        score = cv2.matchTemplate(image, tpl, cv2.TM_CCOEFF_NORMED)
        ys, xs = np.where(score >= threshold)
        for x, y in zip(xs, ys):
            boxes.append((int(x), int(y), tpl.shape[1], tpl.shape[0],
                          float(score[y, x]), tpl_path.stem))
    # NMS + min-distance: see template_matching.py for the reference impl.
    boxes = _nms(boxes, overlap=nms_overlap)
    boxes = _min_distance(boxes, distance=min_distance)
    regions = [
        Region(bbox=BBox(x, y, w, h), role="data",
               attrs={"pattern": pat, "match_score": s})
        for x, y, w, h, s, pat in boxes
    ]
    return Layout.from_regions(regions, page=page)


def pid_pipeline(templates: list[Path], *, json_out: Path,
                 excel_out: Path | None = None) -> scriva.Pipeline:
    stages: list = [
        pid_template_match(templates=templates, threshold=0.8,
                           nms_overlap=0.2, min_distance=10),
        openai(
            model="gpt-4o",
            prompt=Prompt.structured(schema=PidSymbol),
            crop_padding_px=int(0.10 * 100),  # 10 px: ocr-agent's 10% padding (_crop_box_image) for a ~100 px symbol
            max_concurrency=3,
        ),
        whitespace(),
        json_(json_out),
    ]
    if excel_out is not None:
        # embed_crops=True + crop_height_px=60 matches ocr-agent's
        # pid_export.py (one row per box with a 60px thumbnail).
        stages.append(excel(excel_out, embed_crops=True, crop_height_px=60,
                            crop_column="A"))
    return scriva.Pipeline(*stages)

(_nms and _min_distance are pure NumPy; lift them verbatim from ocr-agent’s pipeline/template_matching.py.)

pipeline = pid_pipeline(
    templates=[Path("uploads/pid_presets/valve.png"),
               Path("uploads/pid_presets/pump.png")],
    json_out=Path("outputs/pid/session-001.json"),
    excel_out=Path("outputs/pid/session-001.xlsx"),
)
result = pipeline("drawing.png")

3b. Whole-page VLM (no template matching)

The opinionated domains.pid pack drops template matching entirely and asks the VLM to find symbols directly. It is the right choice when you don’t have templates yet or the drawing style varies:

from scriva import domains

pipeline = domains.pid.diagram(
    model="gpt-4o",
    json_out=Path("outputs/pid/session-002.json"),
)
result = pipeline("drawing.png")

Both shapes are documented as legitimate; ocr-agent’s UI happens to use 3a because of the drag-to-select interaction.

4. Annotation & training loop

ocr-agent’s annotation backend (annotation_runner.py) does three distinct things — blank annotation, merge/border annotation, and OCR annotation — but all three follow the same pattern: primary classifier or recognizer runs first, results sorted by |score - 0.5|, Bedrock Qwen-VL re-runs the uncertain ones, every result becomes a training sample.

scriva.recognize.uncertainty_first + scriva.export.samples + scriva.classify.embedding.train cover it. The pre-built domain pack:

import scriva
from scriva import samples
from scriva.recognize  import openai, bedrock
from scriva.cache      import Cache


def annotation_pipeline(*, dsn: str, json_out: str) -> scriva.Pipeline:
    store = samples.layered(
        fs=samples.fs("training_data/samples"),
        index=samples.pgvector(dsn, dim=1024),
    )

    return scriva.domains.annotation.review(
        primary=openai(model="gpt-4o", cache=Cache.layered(".scriva_cache")),
        oracle=bedrock(model="qwen.qwen3-vl-235b-a22b"),
        store=store,
        k=20,                    # top-k most-uncertain go to the oracle
        json_out=json_out,
    )

pipeline = annotation_pipeline(
    dsn="postgresql+asyncpg://localhost/ocr_agent",
    json_out="outputs/annotations/2026-05-21.json",
)
pipeline("scan.png")    # writes samples to the pgvector store as it runs

The pack composes:

detect.morphological_grid
  ──► classify.rule_based
  ──► recognize.uncertainty_first(primary, oracle, k=20, store=store)
  ──► postprocess.whitespace
  ──► postprocess.confidence_score.rendering
  ──► export.json_ + export.samples(store)

So the recognizer wrapper persists both the primary recognition and the oracle’s correction on every sample it touches. That replaces both AnnotationRunner._register_sample (annotation_runner.py:88) and AnnotationRunner._register_pair_sample (annotation_runner.py:263).
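
The selection rule behind uncertainty_first is small enough to show in isolation. The sketch below only illustrates the |score - 0.5| ordering described at the top of this section; `most_uncertain` and `route` are hypothetical helper names, and the real wrapper's internals may differ.

```python
def most_uncertain(scores: dict[str, float], k: int) -> list[str]:
    """Return up to k region ids whose score lies closest to 0.5."""
    return sorted(scores, key=lambda rid: abs(scores[rid] - 0.5))[:k]


def route(scores: dict[str, float], k: int) -> dict[str, str]:
    """Label each region with the recognizer whose answer would be kept."""
    to_oracle = set(most_uncertain(scores, k))
    return {rid: "oracle" if rid in to_oracle else "primary" for rid in scores}


# Primary scores for four cells; b and d sit nearest the decision boundary.
scores = {"a": 0.98, "b": 0.52, "c": 0.10, "d": 0.45}
plan = route(scores, k=2)
```

With k=2, cells b and d go to the oracle while a and c keep the primary's answer. Raising k trades oracle cost for more corrected samples.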

Training a classifier from the store

Once samples accumulate, fit a head and drop it into your forms pipeline:

from scriva import classify, samples
from scriva.embedders import OpenAIEmbedder
from scriva.classify import LightGBMHead

# Same store the annotation pipeline writes to.
dsn = "postgresql+asyncpg://localhost/ocr_agent"
store = samples.layered(
    fs=samples.fs("training_data/samples"),
    index=samples.pgvector(dsn, dim=1024),
)

clf = classify.embedding.train(
    samples=store,
    where=lambda s: s.label is not None,     # only labelled samples
    embedder=OpenAIEmbedder(model="text-embedding-3-small"),
    head=LightGBMHead(),
    target=lambda s: "blank" if s.attrs.get("is_blank") else "data",
)
clf.save("models/cells.joblib")

…and back in §1, the call you already had — embedding.load("models/cells.joblib") — picks it up next run. That is the entire ocr-agent active-learning loop, on three named primitives.

Border-pair annotation (the merge mode)

ocr-agent’s merge annotation operates on pairs of adjacent cells, not single cells. The mapping is the same uncertainty_first wrapper, but with a pair detector instead of a cell detector and a different prompt:

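import scriva
from pydantic import BaseModel
from scriva import classify, export, recognize
from scriva.prompts import Prompt
from scriva.recognize import bedrock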

Prompt.register(
    "border-pair",
    """この画像は帳票の隣接する2つのセルを合わせた領域です。
2つのセルの間に境界線（罫線）が存在するかどうかを判定してください。

以下のJSON形式で返答してください:
{ "has_border": true|false, "direction": "horizontal"|"vertical" }""",
    locale="ja",
)

class BorderLabel(BaseModel):
    has_border: bool
    direction: str

# Load the head once; the same object classifies pairs and backs the primary.
border_classifier = classify.embedding.load("models/borders.joblib")

border_pipeline = scriva.Pipeline(
    pair_detector(),                                        # custom (lift _extract_pair_region)
    border_classifier,                                      # current border classifier
    recognize.uncertainty_first(
        primary=classify_recognizer(clf=border_classifier), # adapter wrapping the classifier
        oracle=bedrock(
            model="qwen.qwen3-vl-235b-a22b",
            prompt=Prompt.structured_few_shot(
                schema=BorderLabel,
                base="border-pair",
            ),
        ),
        store=store,
        k=20,
    ),
    export.samples(store),
)

The pair_detector is a 30-line custom @detector that yields one Region per adjacent cell pair (the body is _extract_pair_region from cell_detect.py:110). classify_recognizer is a thin @recognizer that calls a trained head and emits a Recognition whose confidence is the head’s predict_proba — its existence is what lets uncertainty_first sort pairs by |confidence - 0.5|.

5. FastAPI hosting

scriva’s architecture page is firm about this: the web layer is the application’s job. But the integration is small — events.to_sse and Pipeline.cancel() are the two primitives — so here’s the canonical wiring that replaces ocr-agent’s routers/ocr.py.

SSE progress for one job

import asyncio
import uuid
from pathlib import Path
from fastapi import APIRouter, BackgroundTasks, UploadFile
from sse_starlette.sse import EventSourceResponse

import scriva
from scriva.events import to_sse

from ocr_app.pipelines.forms import tabular_forms_pipeline

router = APIRouter(prefix="/api/ocr", tags=["ocr"])

# Application-layer registry (architecture.md disclaims this from scriva).
_pipelines: dict[uuid.UUID, scriva.Pipeline] = {}
_results:   dict[uuid.UUID, asyncio.Task] = {}
_paths:     dict[uuid.UUID, str] = {}   # saved upload per job, needed by /progress


@router.post("/upload")
async def upload(file: UploadFile, bg: BackgroundTasks) -> dict:
    job_id = uuid.uuid4()
    path = await _save_upload(file, job_id)

    pipeline = tabular_forms_pipeline(
        dict_yaml=Path("dict.yaml"),
        excel_out=Path(f"outputs/jobs/{job_id}.xlsx"),
    )
    _pipelines[job_id] = pipeline
    _paths[job_id]     = str(path)
    _results[job_id]   = asyncio.create_task(pipeline.aio(str(path)))
    return {"job_id": str(job_id)}


@router.get("/{job_id}/progress")
async def progress(job_id: uuid.UUID):
    pipeline = _pipelines[job_id]
    # `path` is local to upload(); look the document up by job instead.
    return EventSourceResponse(to_sse(pipeline.events(_paths[job_id])))


@router.post("/{job_id}/cancel")
async def cancel(job_id: uuid.UUID) -> dict:
    pipeline = _pipelines.get(job_id)
    if pipeline is None:
        return {"cancelled": False}
    pipeline.cancel()        # cooperative; stages exit at their next yield
    return {"cancelled": True}

The to_sse helper emits text/event-stream chunks shaped exactly the way ocr-agent’s existing UI already consumes. The event schema (stage, kind, payload) is documented in architecture.md › Observability and is stable across releases.
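
When writing a test client or debugging with curl, it helps to know what one frame looks like on the wire. The sketch below shows standard Server-Sent Events framing for an event with the three fields named above. The exact bytes to_sse produces are not specified on this page, so treat the layout, the helper names and the payload keys (`done`, `total`) as assumptions.

```python
import json


def sse_frame(stage: str, kind: str, payload: dict) -> str:
    """Format one event as a text/event-stream frame (terminated by a blank line)."""
    data = json.dumps(
        {"stage": stage, "kind": kind, "payload": payload},
        ensure_ascii=False,
    )
    return f"event: {kind}\ndata: {data}\n\n"


def parse_sse_frame(frame: str) -> dict:
    """Inverse of sse_frame: join the data lines and decode the JSON body."""
    data_lines = [
        line[len("data:"):].lstrip()
        for line in frame.splitlines()
        if line.startswith("data:")
    ]
    return json.loads("\n".join(data_lines))


frame = sse_frame("recognize", "progress", {"done": 3, "total": 10})
event = parse_sse_frame(frame)
```

A browser EventSource would deliver this frame to a listener registered for the "progress" event type, with the JSON string available as event.data.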

Human-in-the-loop pause/resume

ocr-agent’s interactive mode runs detection, pauses on awaiting_review, waits for the user to edit the cell map, then resumes with PUT /api/ocr/{job}/cells. The scriva-native version splits one pipeline into two:

# Phase 1 — write the layout to a sidecar and return
phase1 = scriva.Pipeline(
    orientation(), deskew(),
    fallback(morphological_grid(), hough_grid()),
    hybrid(rule=rule_based(), learned=embedding.load("models/cells.joblib")),
    dissolve_blank_merges,
    json_(f"outputs/jobs/{job_id}/layout.json", select={"regions"}),
)
phase1(str(path))
# Respond to the browser; status = "awaiting_review"


# Phase 2 — when the user PUTs the edited layout back
phase2 = scriva.Pipeline(
    box_annotations(f"outputs/jobs/{job_id}/layout.json"),
    openai(model="gpt-4o", prompt=Prompt.ocr(locale="ja"),
           cache=Cache.layered(".scriva_cache")),
    whitespace(), blank_suppress(), strip_refusal,
    dictionary.from_yaml("dict.yaml"),
    rule_splitter(), confidence_score.rendering(),
    excel(f"outputs/jobs/{job_id}.xlsx", legend_sheet=True),
)
phase2(str(path))

The sidecar’s schema is the same as the regions field of result.to_json(), so the UI can edit it in place and round-trip losslessly. box_annotations exists for exactly this case (see pipeline.md › Human-in-the-loop review).

What stays in the application layer

These pieces of ocr-agent are not in any of the scriva snippets above and are not meant to be. They live in ocr_app/api/:

FastAPI app, routers, Jinja2 templates, static assets.
HMAC session cookies, basic-auth middleware, role checks.
OCRJob SQL model, job lifecycle states, the /jobs page.
Cropper.js UI, cell-editor UI, P&ID drag-to-select.
Dictionary CRUD UI, training-data CRUD UI, CSV import/export endpoints.
P&ID PIDSession + pattern-preset library bookkeeping.

scriva’s surface for them is exactly two calls: pipeline.cancel() for stop, and pipeline.events(doc) → to_sse(...) for progress. Anything beyond that is your application.
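
The “cooperative” nature of pipeline.cancel() is easy to misread as an immediate abort. The toy below is not scriva code: `CancellableJob` is a hypothetical stand-in that shows the general pattern of setting a flag and letting the running coroutine notice it at its next checkpoint, which is why a cancelled job may still finish the item it was working on.

```python
import asyncio


class CancellableJob:
    """Toy stand-in for a pipeline run with cooperative cancellation."""

    def __init__(self) -> None:
        self._cancel_requested = False
        self.processed: list[str] = []

    def cancel(self) -> None:
        # Only sets a flag; nothing is interrupted here.
        self._cancel_requested = True

    async def run(self, items: list[str]) -> list[str]:
        for item in items:
            if self._cancel_requested:   # checkpoint between items
                break
            await asyncio.sleep(0)       # yield, as a real stage would on I/O
            self.processed.append(item)
        return self.processed


async def demo() -> list[str]:
    job = CancellableJob()
    task = asyncio.create_task(job.run(["page-1", "page-2", "page-3"]))
    await asyncio.sleep(0)   # give the job a chance to start
    job.cancel()
    return await task        # resolves normally, with partial results
```

Running `asyncio.run(demo())` returns only the items completed before the flag was seen. In the API layer this means a cancel endpoint should report “cancel requested” and let the progress stream deliver the final state.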

Where to go next

The domain packs used above (agentic, pid, annotation) are also documented standalone; see Domain packs.
For the protocol-level shape of each stage (so you can write your own), the reference pages: Detectors, Recognizers, Post-processors, Exporters, Sample stores.
For the full event payload table, Architecture › Observability.
For caching strategy in production, Caching.