Cookbook: rebuilding ocr-agent on scriva

ocr-agent is the production system scriva was extracted from: a browser OCR tool for Japanese tabular forms, agentic schema discovery, P&ID diagrams, and a Bedrock-driven annotation loop. This page is the ocr-agent → scriva translation, end-to-end. Every snippet is copy-pasteable into a fresh project.

The shape of the page mirrors ocr-agent’s own surface:

  1. Normal OCR — grid detect → cell OCR → dictionary → rule-splitter → confidence → Excel.
  2. Agentic OCR — discover fields, OCR rows, extract values into a key-value sheet.
  3. P&ID OCR — find symbols by template, OCR each box, emit a structured catalogue.
  4. Annotation loop — primary + oracle recognizers, sample store, classifier training.
  5. Wrapping it in a FastAPI app — SSE progress, cancellation, human-in-the-loop pause/resume.

If you only read one section, read §1 — every other workflow is a rearrangement of the same five stage kinds.

These examples assume one application package on top of scriva:

ocr_app/
├── pipelines/
│ ├── forms.py # §1
│ ├── agentic.py # §2
│ ├── pid.py # §3
│ └── annotation.py # §4
├── stages/ # custom @postprocessor / @recognizer helpers
│ ├── blank_merges.py
│ └── refusal_strip.py
├── prompts/
│ └── jp.py # locale-specific Prompt registrations
└── api/ # §5 — FastAPI app, routers, templates

Everything in pipelines/, stages/, and prompts/ is pure scriva configuration. Only api/ knows about HTTP, sessions, or job persistence — that boundary is the same boundary architecture.md draws.

Bring the extras you actually use:

pip install "scriva[openai,bedrock,excel,pgvector,pdf]"

…and configure the credentials each adapter needs through environment variables (OPENAI_API_KEY, AWS_REGION, AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, DATABASE_URL).

This is the workflow /api/ocr/upload runs in ocr-agent. Detect a grid, classify blank + merged cells, OCR each non-blank crop, correct against a dictionary, split internal ruled lines, score confidence, write Excel.

ocr_app/pipelines/forms.py
from pathlib import Path

import scriva
from scriva.preprocess import orientation, deskew, crop
from scriva.detect import fallback, morphological_grid, hough_grid
from scriva.classify import hybrid, rule_based, embedding
from scriva.recognize import openai
from scriva.postprocess import (
    whitespace, blank_suppress, dictionary,
    rule_splitter, confidence_score, postprocessor,
)
from scriva.export import excel, json_
from scriva.cache import Cache
from scriva.prompts import Prompt

# ocr-agent-specific: a blank "merged" region is almost always a
# detection mistake, so split it back into individual blank cells.
# See notes/excluded-from-docs.md §4 for why this stays in the app layer.
from ocr_app.stages.blank_merges import dissolve_blank_merges


def tabular_forms_pipeline(
    *,
    dict_yaml: Path | None = None,
    blank_density: float = 0.003,
    excel_out: Path = Path("out.xlsx"),
    json_out: Path | None = None,
    crop_bbox: tuple[int, int, int, int] | None = None,
) -> scriva.Pipeline:
    """The ocr-agent 'normal OCR' pipeline.

    `blank_density` corresponds to ocr-agent's strict/standard/loose
    presets (0.001 / 0.003 / 0.008). `crop_bbox` is what the Cropper.js
    selection becomes after the user confirms the crop screen.
    """
    stages: list = []

    # Preprocess
    if crop_bbox is not None:
        stages.append(crop(bbox=crop_bbox))
    stages += [orientation(), deskew()]

    # Detect: try morphology, fall back to Hough (same as grid_detect.py)
    stages.append(fallback(
        morphological_grid(),
        hough_grid(min_line_length_ratio=0.05),
    ))

    # Classify: rule-based first, learned head for the in-between cases.
    # Mirrors `_detect_blank_cells_hybrid` (rulebased → CLIP for ambiguous).
    stages.append(hybrid(
        rule=rule_based(blank_density=blank_density),
        learned=embedding.load("models/cells.joblib"),  # optional; falls back if missing
    ))

    # Domain rule: dissolve all-blank merges back into singletons.
    stages.append(dissolve_blank_merges)

    # Recognize: GPT-4o, Japanese prompt, layered cache (exact + semantic).
    stages.append(openai(
        model="gpt-4o",
        prompt=Prompt.ocr(locale="ja"),
        cache=Cache.layered(".scriva_cache"),
        max_concurrency=8,
        crop_padding_px=0,  # ocr-agent does no padding
    ))

    # Post-process: clean → drop noise → dictionary → split ruled lines →
    # confidence. Order matters; see postprocessors.md.
    stages += [
        whitespace(),
        blank_suppress(),
    ]
    if dict_yaml is not None:
        stages.append(dictionary.from_yaml(dict_yaml))
    stages += [
        rule_splitter(),
        confidence_score.rendering(),  # cairosvg + Azure Vision embedder
    ]

    # Export: Excel (with confidence colouring + legend) and optional JSON
    stages.append(excel(
        excel_out,
        font="MS Gothic",
        confidence_thresholds=(0.6, 0.8),
        legend_sheet=True,
    ))
    if json_out is not None:
        stages.append(json_(json_out))

    return scriva.Pipeline(*stages, page_concurrency=1)


pipeline = tabular_forms_pipeline(
    dict_yaml=Path("dict.yaml"),
    blank_density=0.003,
    excel_out=Path("outputs/jobs/abc123.xlsx"),
)
result = pipeline("scan.png")
print(f"{len(result.regions_with_text)} cells, "
      f"mean confidence {result.mean_confidence:.2%}")

That run wrote outputs/jobs/abc123.xlsx. For the confidence-coloured download that ocr-agent’s UI offers as a second button, swap the threshold form for a per-cell callback:

def colour(rec):
    if rec.confidence is None or rec.confidence >= 0.8:
        return None
    if rec.confidence >= 0.6:
        return "yellow"
    return "red"


pipeline.replace("export-excel", excel(
    Path("outputs/jobs/abc123-coloured.xlsx"),
    confidence_fill=colour,
    legend_sheet=True,
    font="MS Gothic",
))
result = pipeline("scan.png")

The “dissolve all-blank merges” rule looks like a five-line @postprocessor, but it runs against the layout, not the recognitions, so it has to run ahead of recognize. That makes it a custom classifier-like stage rather than a @postprocessor. The right shape is a Stage:

ocr_app/stages/blank_merges.py
from scriva import Stage, Context, MergeInfo


class DissolveBlankMerges(Stage):
    name = "dissolve-blank-merges"
    capabilities = frozenset()

    async def run(self, ctx: Context) -> Context:
        regions = ctx.layout.regions
        new_regions = []
        for r in regions:
            if r.merge_group_id and r.role == "blank":
                # Strip the merge and leave the region as a standalone blank cell.
                r = r.model_copy(update={
                    "merge_group_id": None,
                    "merge": MergeInfo(),
                })
            new_regions.append(r)
        ctx.layout.regions = new_regions
        return ctx


dissolve_blank_merges = DissolveBlankMerges()

The other thing ocr-agent’s postprocess_vlm_text does that the built-in blank_suppress does not cover is stripping VLM refusal boilerplate (「申し訳ありません … 読み取れません」). One more @postprocessor:

ocr_app/stages/refusal_strip.py
import re

from scriva import postprocessor

_REFUSAL = re.compile(
    r"(申し訳ありません|I'm sorry|I apologize|I cannot).*?"
    r"(?:読み取|確認|判読|認識|識別|read|unable).*?(?:できません|ません|cannot)[。..]?\s*",
    re.IGNORECASE,
)


@postprocessor(name="strip-vlm-refusal")
async def strip_refusal(page, layout, recognitions):
    return {
        rid: r.with_text(_REFUSAL.sub("", r.text).strip()) if r.text else r
        for rid, r in recognitions.items()
    }

Drop it into the chain right after blank_suppress().
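To see what the pattern does to cell text before wiring it into a pipeline, it can be exercised with the standard library alone. The sketch below copies the regex from refusal_strip.py; strip_refusal_text and the sample strings are hypothetical, introduced only for illustration.

```python
import re

# Same pattern as refusal_strip.py, copied so this runs without scriva.
_REFUSAL = re.compile(
    r"(申し訳ありません|I'm sorry|I apologize|I cannot).*?"
    r"(?:読み取|確認|判読|認識|識別|read|unable).*?(?:できません|ません|cannot)[。..]?\s*",
    re.IGNORECASE,
)


def strip_refusal_text(text: str) -> str:
    """Hypothetical helper: the per-cell text transform strip_refusal applies."""
    return _REFUSAL.sub("", text).strip()


samples = [
    "申し訳ありませんが、この画像の文字は読み取れません。",  # pure refusal
    "東京都 申し訳ありませんが、判読できません。",          # refusal after real text
    "12,500",                                                # ordinary cell value
    "I'm sorry, I cannot read this.",                        # English word order
]
cleaned = [strip_refusal_text(s) for s in samples]
for before, after in zip(samples, cleaned):
    print(repr(before), "->", repr(after))
```

The first two samples lose the refusal and the third is untouched. The fourth is also left as-is: the pattern expects the closing "cannot" or "ません" to come after the verb, which is Japanese word order, so English refusals phrased as "cannot read" may need an extra alternative if your model produces them.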

Confidence-driven re-OCR (a second pipeline)

ocr-agent’s /api/ocr/{job}/reocr re-runs only the low-confidence cells through a different model, using the previous text as a hint. The canonical recipe lives in pipeline.md › Confidence-driven re-OCR; here is the ocr-agent adaptation:

from pathlib import Path

import scriva
from scriva import RecognitionHint
from scriva.cache import Cache
from scriva.detect import box_annotations
from scriva.export import excel
from scriva.postprocess import confidence_score
from scriva.recognize import openai, anthropic


def reocr_pipeline(
    result: "scriva.DocumentResult",
    *,
    threshold: float = 0.6,
    model: str = "gpt-4o",
) -> scriva.Pipeline:
    # Pick the recognizer factory from the model name prefix.
    Recognizer = openai if model.startswith(("gpt-", "o1", "o3", "o4")) else anthropic
    return scriva.Pipeline(
        # Re-use the previous layout, keeping only the low-confidence cells.
        box_annotations.from_result(
            result,
            where=lambda r: (r.confidence or 0) < threshold,
        ),
        Recognizer(
            model=model,
            prompt=scriva.prompts.Prompt.ocr_with_hint(),
            hints=RecognitionHint.from_result(result),  # previous text as a hint
            cache=Cache.layered(".scriva_cache"),
        ),
        confidence_score.rendering(),
        excel(Path("outputs/jobs/abc123-refined.xlsx"), legend_sheet=True),
    )


refined = reocr_pipeline(result, threshold=0.6, model="gpt-4o")("scan.png")
combined = result.merge(refined, strategy="highest_confidence")
combined.to_excel("outputs/jobs/abc123-final.xlsx")

ocr-agent’s agentic mode: VLM discovers what fields exist, OCRs the form row-by-row, then extracts each field’s value with the row context in scope. The output is a key-value sheet (項目名 / 抽出値 / 参照行 / 備考), not a grid.

The whole thing collapses to the domains.agentic pack:

ocr_app/pipelines/agentic.py
from pathlib import Path

import scriva
from scriva import domains
from scriva.cache import Cache


def agentic_pipeline(*, excel_out: Path) -> scriva.Pipeline:
    return domains.agentic.extract(
        schema=None,  # None = discover (3 passes by default)
        rounds=3,
        model="gpt-4o",
        json_out=excel_out.with_suffix(".json"),
    )

When you do know the schema up front — say you have a pydantic.BaseModel for the form — pass it as schema= and skip the discovery phase:

from pydantic import BaseModel


class InspectionForm(BaseModel):
    patient_name: str
    date_of_birth: str
    diagnosis_code: str
    dose_mg: float | None


pipeline = domains.agentic.extract(
    schema=InspectionForm,
    model="gpt-4o",
    json_out=Path("outputs/jobs/abc123.json"),
)

ocr-agent’s discovery prompt is bilingual and includes the already-discovered list each round. To match that exactly, register your own and pass the pack a recognizer that uses it:

from scriva.prompts import Prompt
from scriva.recognize import openai

Prompt.register(
    "discover-ja",
    """この帳票画像を分析し、抽出すべきデータ項目(フィールド名)を列挙してください。
{% if known %}
【既に発見済みの項目】
{% for item in known %}- {{ item }}
{% endfor %}
上記以外にまだ発見されていない項目があれば、それらも含めて網羅的に列挙してください。
{% endif %}
各行は「- 項目名」の形式で出力。項目名のみ、値は不要です。""",
    locale="ja",
)

pipeline = domains.agentic.extract(
    rounds=3,
    discover_recognizer=openai(
        model="gpt-4o",
        prompt=Prompt.from_registered("discover-ja"),
    ),
    json_out=Path("outputs/jobs/abc123.json"),
)

DocumentResult.to_excel() writes a flat key-value table when the layout has no grid — exactly the shape ocr-agent’s _export_agentic_excel produces. Replace the bespoke exporter with:

result = pipeline("scan.png")
result.to_excel(
    "outputs/jobs/abc123.xlsx",
    font="MS Gothic",
    columns=["項目名", "抽出値", "参照行", "備考"],  # column header overrides
)

Two routes match ocr-agent’s behaviour. Pick by how much of the workflow the user drives interactively.

3a. User drags patterns; scriva finds matches

ocr-agent’s UI lets the user drag-select a symbol, then runs cv2.matchTemplate to find similar instances. A custom detector wraps that:

ocr_app/pipelines/pid.py
from pathlib import Path

import cv2
import numpy as np
from pydantic import BaseModel

import scriva
from scriva import LayoutDetector, Layout, Region, BBox, Capability, detector
from scriva.recognize import openai
from scriva.postprocess import whitespace
from scriva.export import json_, excel
from scriva.prompts import Prompt


class PidSymbol(BaseModel):
    types: str  # valve, pump, instrument, tank, …
    text: str   # equipment IDs, numbers, labels


@detector(name="pid-template-match", capabilities=frozenset({Capability.POLYGON}))
async def pid_template_match(page, *, templates: list[Path], threshold=0.8,
                             nms_overlap=0.2, min_distance=10):
    """Run cv2.matchTemplate for each pattern, NMS, min-distance filter."""
    image = page.image_as_ndarray()  # PIL or ndarray; both supported
    boxes: list[tuple[int, int, int, int, float, str]] = []
    for tpl_path in templates:
        tpl = cv2.imread(str(tpl_path))
        if tpl is None:  # cv2.imread returns None instead of raising
            raise FileNotFoundError(tpl_path)
        score = cv2.matchTemplate(image, tpl, cv2.TM_CCOEFF_NORMED)
        ys, xs = np.where(score >= threshold)
        for x, y in zip(xs, ys):
            # (x, y, width, height, match score, pattern name)
            boxes.append((int(x), int(y), tpl.shape[1], tpl.shape[0],
                          float(score[y, x]), tpl_path.stem))
    # NMS + min-distance: see template_matching.py for the reference impl.
    boxes = _nms(boxes, overlap=nms_overlap)
    boxes = _min_distance(boxes, distance=min_distance)
    regions = [
        Region(bbox=BBox(x, y, w, h), role="data",
               attrs={"pattern": pat, "match_score": s})
        for x, y, w, h, s, pat in boxes
    ]
    return Layout.from_regions(regions, page=page)


def pid_pipeline(templates: list[Path], *, json_out: Path,
                 excel_out: Path | None = None) -> scriva.Pipeline:
    stages: list = [
        pid_template_match(templates=templates, threshold=0.8,
                           nms_overlap=0.2, min_distance=10),
        openai(
            model="gpt-4o",
            prompt=Prompt.structured(schema=PidSymbol),
            # Evaluates to 10 px (10% of a nominal 100 px box), which
            # approximates ocr-agent's _crop_box_image padding.
            crop_padding_px=int(0.10 * 100),
            max_concurrency=3,
        ),
        whitespace(),
        json_(json_out),
    ]
    if excel_out is not None:
        # embed_crops=True + crop_height_px=60 matches ocr-agent's
        # pid_export.py (one row per box with a 60px thumbnail).
        stages.append(excel(excel_out, embed_crops=True, crop_height_px=60,
                            crop_column="A"))
    return scriva.Pipeline(*stages)

(_nms and _min_distance are pure NumPy; lift them verbatim from ocr-agent’s pipeline/template_matching.py.)

pipeline = pid_pipeline(
    templates=[Path("uploads/pid_presets/valve.png"),
               Path("uploads/pid_presets/pump.png")],
    json_out=Path("outputs/pid/session-001.json"),
    excel_out=Path("outputs/pid/session-001.xlsx"),
)
result = pipeline("drawing.png")

The opinionated domains.pid pack drops template matching entirely and asks the VLM to find symbols directly. It is the right choice when you don’t have templates yet or the drawing style varies:

from scriva import domains

pipeline = domains.pid.diagram(
    model="gpt-4o",
    json_out=Path("outputs/pid/session-002.json"),
)
result = pipeline("drawing.png")

Both shapes are documented as legitimate; ocr-agent’s UI happens to use 3a because of the drag-to-select interaction.

ocr-agent’s annotation backend (annotation_runner.py) does three distinct things — blank annotation, merge/border annotation, and OCR annotation — but all three follow the same pattern: primary classifier or recognizer runs first, results sorted by |score - 0.5|, Bedrock Qwen-VL re-runs the uncertain ones, every result becomes a training sample.

scriva.recognize.uncertainty_first + scriva.export.samples + scriva.classify.embedding.train cover it. The pre-built domain pack:

ocr_app/pipelines/annotation.py
import scriva
from scriva import samples
from scriva.recognize import openai, bedrock
from scriva.cache import Cache


def annotation_pipeline(*, dsn: str, json_out: str) -> scriva.Pipeline:
    store = samples.layered(
        fs=samples.fs("training_data/samples"),
        index=samples.pgvector(dsn, dim=1024),
    )
    return scriva.domains.annotation.review(
        primary=openai(model="gpt-4o", cache=Cache.layered(".scriva_cache")),
        oracle=bedrock(model="qwen.qwen3-vl-235b-a22b"),
        store=store,
        k=20,  # top-k most-uncertain go to the oracle
        json_out=json_out,
    )


pipeline = annotation_pipeline(
    dsn="postgresql+asyncpg://localhost/ocr_agent",
    json_out="outputs/annotations/2026-05-21.json",
)
pipeline("scan.png")  # writes samples to the pgvector store as it runs

The pack composes:

detect.morphological_grid
──► classify.rule_based
──► recognize.uncertainty_first(primary, oracle, k=20, store=store)
──► postprocess.whitespace
──► postprocess.confidence_score.rendering
──► export.json_ + export.samples(store)

— so the recognizer wrapper persists both the primary recognition and the oracle’s correction on every sample it touches. That replaces AnnotationRunner._register_sample (annotation_runner.py:88) and AnnotationRunner._register_pair_sample (annotation_runner.py:263) both.

Once samples accumulate, fit a head and drop it into your forms pipeline:

from scriva import classify, samples
from scriva.embedders import OpenAIEmbedder
from scriva.classify import LightGBMHead

# `dsn` is the same connection string passed to annotation_pipeline above.
store = samples.layered(
    fs=samples.fs("training_data/samples"),
    index=samples.pgvector(dsn, dim=1024),
)
clf = classify.embedding.train(
    samples=store,
    where=lambda s: s.label is not None,  # only labelled samples
    embedder=OpenAIEmbedder(model="text-embedding-3-small"),
    head=LightGBMHead(),
    target=lambda s: "blank" if s.attrs.get("is_blank") else "data",
)
clf.save("models/cells.joblib")

…and back in §1, the call you already had — embedding.load("models/cells.joblib") — picks it up next run. That is the entire ocr-agent active-learning loop, on three named primitives.

ocr-agent’s merge annotation operates on pairs of adjacent cells, not single cells. The mapping is the same uncertainty_first wrapper, but with a pair detector instead of a cell detector and a different prompt:

from pydantic import BaseModel

import scriva
from scriva import classify, export, recognize
from scriva.prompts import Prompt
from scriva.recognize import bedrock

Prompt.register(
    "border-pair",
    """この画像は帳票の隣接する2つのセルを合わせた領域です。
2つのセルの間に境界線(罫線)が存在するかどうかを判定してください。
以下のJSON形式で返答してください:
{ "has_border": true|false, "direction": "horizontal"|"vertical" }""",
    locale="ja",
)


class BorderLabel(BaseModel):
    has_border: bool
    direction: str


# Current border classifier, loaded once and shared by both stages below.
border_classifier = classify.embedding.load("models/borders.joblib")

# pair_detector and classify_recognizer are application helpers, described
# after this listing; `store` is the sample store built earlier.
border_pipeline = scriva.Pipeline(
    pair_detector(),  # custom (lift _extract_pair_region)
    border_classifier,
    recognize.uncertainty_first(
        primary=classify_recognizer(clf=border_classifier),  # adapter wrapping the classifier
        oracle=bedrock(
            model="qwen.qwen3-vl-235b-a22b",
            prompt=Prompt.structured_few_shot(
                schema=BorderLabel,
                base="border-pair",
            ),
        ),
        store=store,
        k=20,
    ),
    export.samples(store),
)

The pair_detector is a 30-line custom @detector that yields one Region per adjacent cell pair (the body is _extract_pair_region from cell_detect.py:110). classify_recognizer is a thin @recognizer that calls a trained head and emits a Recognition whose confidence is the head’s predict_proba — its existence is what lets uncertainty_first sort pairs by |confidence - 0.5|.
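The uncertainty ordering that both annotation pipelines rely on is simple enough to sketch without scriva. Given per-region confidences from the primary, rank by distance from 0.5 and send the k closest to the oracle. most_uncertain and the sample scores are hypothetical, introduced only for illustration.

```python
def most_uncertain(confidences: dict[str, float], k: int) -> list[str]:
    """Return the ids of the k items whose confidence is closest to 0.5.

    A score near 0 or 1 means the primary is sure either way; a score near
    0.5 means it is guessing, which is where an oracle label helps most.
    """
    ranked = sorted(confidences, key=lambda rid: abs(confidences[rid] - 0.5))
    return ranked[:k]


scores = {
    "pair-a": 0.98,  # confident: border present
    "pair-b": 0.52,  # coin flip
    "pair-c": 0.10,  # confident: no border
    "pair-d": 0.45,  # nearly a coin flip
    "pair-e": 0.70,
}
for_oracle = most_uncertain(scores, k=2)
print(for_oracle)
```

Note that a confident "no" (pair-c at 0.10) ranks as low-priority just like a confident "yes", which is why the primary must emit a calibrated probability rather than a bare label.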

scriva’s architecture page is firm about this: the web layer is the application’s job. But the integration is small — events.to_sse and Pipeline.cancel() are the two primitives — so here’s the canonical wiring that replaces ocr-agent’s routers/ocr.py.

ocr_app/api/jobs.py
import asyncio
import uuid
from pathlib import Path

from fastapi import APIRouter, HTTPException, UploadFile
from sse_starlette.sse import EventSourceResponse

import scriva
from scriva.events import to_sse
from ocr_app.pipelines.forms import tabular_forms_pipeline

router = APIRouter(prefix="/api/ocr", tags=["ocr"])

# Application-layer registry (architecture.md disclaims this from scriva).
_pipelines: dict[uuid.UUID, scriva.Pipeline] = {}
_paths: dict[uuid.UUID, Path] = {}
_results: dict[uuid.UUID, asyncio.Task] = {}


@router.post("/upload")
async def upload(file: UploadFile) -> dict:
    job_id = uuid.uuid4()
    # _save_upload is an application helper (not shown) that persists the
    # upload and returns its path.
    path = await _save_upload(file, job_id)
    pipeline = tabular_forms_pipeline(
        dict_yaml=Path("dict.yaml"),
        excel_out=Path(f"outputs/jobs/{job_id}.xlsx"),
    )
    _pipelines[job_id] = pipeline
    _paths[job_id] = path  # remembered so /progress can find the document
    _results[job_id] = asyncio.create_task(pipeline.aio(str(path)))
    return {"job_id": str(job_id)}


@router.get("/{job_id}/progress")
async def progress(job_id: uuid.UUID):
    pipeline = _pipelines.get(job_id)
    if pipeline is None:
        raise HTTPException(status_code=404, detail="unknown job")
    return EventSourceResponse(to_sse(pipeline.events(str(_paths[job_id]))))


@router.post("/{job_id}/cancel")
async def cancel(job_id: uuid.UUID) -> dict:
    pipeline = _pipelines.get(job_id)
    if pipeline is None:
        return {"cancelled": False}
    pipeline.cancel()  # cooperative; stages exit at their next yield
    return {"cancelled": True}

The to_sse helper emits text/event-stream chunks shaped exactly the way ocr-agent’s existing UI already consumes. The event schema (stage, kind, payload) is documented in architecture.md › Observability and is stable across releases.
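For readers who have not worked with server-sent events, the following sketch shows what one text/event-stream chunk looks like on the wire. It uses the generic SSE framing (an event: line, a data: line, a blank line) and is not to_sse itself: mapping kind to the SSE event name and serialising the whole record as JSON are assumptions, as are the sample stage and kind values.

```python
import json


def sse_frame(stage: str, kind: str, payload: dict) -> str:
    """Format one (stage, kind, payload) record as a server-sent event.

    Assumption: `kind` becomes the SSE event name and the full record is
    sent as single-line JSON in the data field.
    """
    data = json.dumps(
        {"stage": stage, "kind": kind, "payload": payload},
        ensure_ascii=False,  # keep Japanese text readable on the wire
    )
    # A blank line terminates the event; json.dumps without `indent` never
    # emits a newline, so one data: line is enough.
    return f"event: {kind}\ndata: {data}\n\n"


frame = sse_frame("recognize", "progress", {"done": 3, "total": 12})
print(frame, end="")
```

In the browser, an EventSource listener registered for the event name receives the data line as event.data and can JSON.parse it directly.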

ocr-agent’s interactive mode runs detection, pauses on awaiting_review, waits for the user to edit the cell map, then resumes with PUT /api/ocr/{job}/cells. The scriva-native version splits one pipeline into two:

# Phase 1: write the layout to a sidecar and return
phase1 = scriva.Pipeline(
    orientation(), deskew(),
    fallback(morphological_grid(), hough_grid()),
    hybrid(rule=rule_based(), learned=embedding.load("models/cells.joblib")),
    dissolve_blank_merges,
    json_(f"outputs/jobs/{job_id}/layout.json", select={"regions"}),
)
phase1(str(path))
# Respond to the browser; status = "awaiting_review"

# Phase 2: when the user PUTs the edited layout back
phase2 = scriva.Pipeline(
    box_annotations(f"outputs/jobs/{job_id}/layout.json"),
    openai(model="gpt-4o", prompt=Prompt.ocr(locale="ja"),
           cache=Cache.layered(".scriva_cache")),
    whitespace(), blank_suppress(), strip_refusal,
    dictionary.from_yaml("dict.yaml"),
    rule_splitter(), confidence_score.rendering(),
    excel(f"outputs/jobs/{job_id}.xlsx", legend_sheet=True),
)
phase2(str(path))

The sidecar’s schema is the same as the regions field of result.to_json(), so the UI can edit it in place and round-trip losslessly. box_annotations exists for exactly this case (see pipeline.md › Human-in-the-loop review).

These pieces of ocr-agent are not in any of the scriva snippets above and are not meant to be. They live in ocr_app/api/:

  • FastAPI app, routers, Jinja2 templates, static assets.
  • HMAC session cookies, basic-auth middleware, role checks.
  • OCRJob SQL model, job lifecycle states, the /jobs page.
  • Cropper.js UI, cell-editor UI, P&ID drag-to-select.
  • Dictionary CRUD UI, training-data CRUD UI, CSV import/export endpoints.
  • P&ID PIDSession + pattern-preset library bookkeeping.

scriva’s surface for them is exactly two calls: pipeline.cancel() for stop, and pipeline.events(doc) → to_sse(...) for progress. Anything beyond that is your application.