Cookbook: rebuilding ocr-agent on scriva
ocr-agent is the production system scriva was extracted from: a browser
OCR tool for Japanese tabular forms, agentic schema discovery, P&ID
diagrams, and a Bedrock-driven annotation loop. This page is the
ocr-agent → scriva translation, end-to-end. Every snippet is
copy-pasteable into a fresh project.
The shape of the page mirrors ocr-agent’s own surface:
- Normal OCR — grid detect → cell OCR → dictionary → rule-splitter → confidence → Excel.
- Agentic OCR — discover fields, OCR rows, extract values into a key-value sheet.
- P&ID OCR — find symbols by template, OCR each box, emit a structured catalogue.
- Annotation loop — primary + oracle recognizers, sample store, classifier training.
- Wrapping it in a FastAPI app — SSE progress, cancellation, human-in-the-loop pause/resume.
If you only read one section, read §1 — every other workflow is a rearrangement of the same five stage kinds.
Repository layout
These examples assume one application package on top of scriva:

    ocr_app/
    ├── pipelines/
    │   ├── forms.py         # §1
    │   ├── agentic.py       # §2
    │   ├── pid.py           # §3
    │   └── annotation.py    # §4
    ├── stages/              # custom @postprocessor / @recognizer helpers
    │   ├── blank_merges.py
    │   └── refusal_strip.py
    ├── prompts/
    │   └── jp.py            # locale-specific Prompt registrations
    └── api/                 # §5: FastAPI app, routers, templates

Everything in pipelines/, stages/, and prompts/ is pure scriva
configuration. Only api/ knows about HTTP, sessions, or job persistence
— that boundary is the same boundary architecture.md draws.
Bring the extras you actually use:
    pip install "scriva[openai,bedrock,excel,pgvector,pdf]"

…and configure the credentials each adapter needs through environment
variables (OPENAI_API_KEY, AWS_REGION, AWS_ACCESS_KEY_ID,
AWS_SECRET_ACCESS_KEY, DATABASE_URL).
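A start-up check can catch a missing credential before the first page is processed. The following is a minimal sketch, not part of scriva: the variable names are the ones listed above, while the `REQUIRED_ENV` grouping by adapter and the `missing_env` helper are assumptions made for illustration.

```python
import os

# Variable names come from the list above; which adapter needs which
# variable is an assumption of this sketch, not something scriva defines.
REQUIRED_ENV = {
    "openai": ["OPENAI_API_KEY"],
    "bedrock": ["AWS_REGION", "AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY"],
    "pgvector": ["DATABASE_URL"],
}


def missing_env(adapters, environ=None):
    """Return the sorted names of required variables that are unset or empty."""
    env = os.environ if environ is None else environ
    missing = set()
    for adapter in adapters:
        for name in REQUIRED_ENV[adapter]:
            if not env.get(name):
                missing.add(name)
    return sorted(missing)


if __name__ == "__main__":
    gaps = missing_env(["openai", "bedrock", "pgvector"])
    if gaps:
        raise SystemExit(f"missing environment variables: {', '.join(gaps)}")
```

Calling it once from the application's entry point keeps the failure close to the misconfiguration instead of deep inside a recognizer call.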
1. Normal OCR (tabular forms)
This is the workflow /api/ocr/upload runs in ocr-agent. Detect a grid,
classify blank + merged cells, OCR each non-blank crop, correct against a
dictionary, split internal ruled lines, score confidence, write Excel.
    from pathlib import Path

    import scriva
    from scriva.preprocess import orientation, deskew, crop
    from scriva.detect import fallback, morphological_grid, hough_grid
    from scriva.classify import hybrid, rule_based, embedding
    from scriva.recognize import openai
    from scriva.postprocess import (
        whitespace,
        blank_suppress,
        dictionary,
        rule_splitter,
        confidence_score,
        postprocessor,
    )
    from scriva.export import excel, json_
    from scriva.cache import Cache
    from scriva.prompts import Prompt

    # ocr-agent-specific: a blank "merged" region is almost always a
    # detection mistake, so split it back into individual blank cells.
    # See notes/excluded-from-docs.md §4 for why this stays in the app layer.
    from ocr_app.stages.blank_merges import dissolve_blank_merges


    def tabular_forms_pipeline(
        *,
        dict_yaml: Path | None = None,
        blank_density: float = 0.003,
        excel_out: Path = Path("out.xlsx"),
        json_out: Path | None = None,
        crop_bbox: tuple[int, int, int, int] | None = None,
    ) -> scriva.Pipeline:
        """The ocr-agent 'normal OCR' pipeline.

        `blank_density` corresponds to ocr-agent's strict/standard/loose
        presets (0.001 / 0.003 / 0.008). `crop_bbox` is what the Cropper.js
        selection becomes after the user confirms the crop screen.
        """
        stages: list = []

        # Preprocess
        if crop_bbox is not None:
            stages.append(crop(bbox=crop_bbox))
        stages += [orientation(), deskew()]

        # Detect: try morphology, fall back to Hough (same as grid_detect.py)
        stages.append(fallback(
            morphological_grid(),
            hough_grid(min_line_length_ratio=0.05),
        ))

        # Classify: rule-based first, learned head for the in-between cases.
        # Mirrors `_detect_blank_cells_hybrid` (rulebased → CLIP for ambiguous).
        stages.append(hybrid(
            rule=rule_based(blank_density=blank_density),
            learned=embedding.load("models/cells.joblib"),  # optional; falls back if missing
        ))
        # Domain rule: dissolve all-blank merges back into singletons.
        stages.append(dissolve_blank_merges)

        # Recognize: GPT-4o, Japanese prompt, layered cache (exact + semantic).
        stages.append(openai(
            model="gpt-4o",
            prompt=Prompt.ocr(locale="ja"),
            cache=Cache.layered(".scriva_cache"),
            max_concurrency=8,
            crop_padding_px=0,  # ocr-agent does no padding
        ))

        # Post-process: clean → drop noise → dictionary → split ruled lines →
        # confidence. Order matters; see postprocessors.md.
        stages += [
            whitespace(),
            blank_suppress(),
        ]
        if dict_yaml is not None:
            stages.append(dictionary.from_yaml(dict_yaml))
        stages += [
            rule_splitter(),
            confidence_score.rendering(),  # cairosvg + Azure Vision embedder
        ]

        # Export: Excel (with confidence colouring + legend) and optional JSON
        stages.append(excel(
            excel_out,
            font="MS Gothic",
            confidence_thresholds=(0.6, 0.8),
            legend_sheet=True,
        ))
        if json_out is not None:
            stages.append(json_(json_out))

        return scriva.Pipeline(*stages, page_concurrency=1)

Running it
    pipeline = tabular_forms_pipeline(
        dict_yaml=Path("dict.yaml"),
        blank_density=0.003,
        excel_out=Path("outputs/jobs/abc123.xlsx"),
    )
    result = pipeline("scan.png")

    print(f"{len(result.regions_with_text)} cells, "
          f"mean confidence {result.mean_confidence:.2%}")

That run wrote outputs/jobs/abc123.xlsx. For the
confidence-coloured download that ocr-agent’s UI offers as a second
button, swap the threshold form for a per-cell callback:
    def colour(rec):
        if rec.confidence is None or rec.confidence >= 0.8:
            return None
        if rec.confidence >= 0.6:
            return "yellow"
        return "red"

    pipeline.replace("export-excel", excel(
        Path("outputs/jobs/abc123-coloured.xlsx"),
        confidence_fill=colour,
        legend_sheet=True,
        font="MS Gothic",
    ))
    result = pipeline("scan.png")

Two custom stages worth seeing
The “dissolve all-blank merges” rule looks like a five-line
@postprocessor, but it runs against the layout, not the recognitions,
so it has to go ahead of recognize. That makes it a custom
classifier-like stage rather than a @postprocessor, and the right shape
is a Stage:
    from scriva import Stage, Context, MergeInfo

    class DissolveBlankMerges(Stage):
        name = "dissolve-blank-merges"
        capabilities = frozenset()

        async def run(self, ctx: Context) -> Context:
            regions = ctx.layout.regions
            new_regions = []
            for r in regions:
                if r.merge_group_id and r.role == "blank":
                    # Strip the merge; leave the region as a standalone blank cell.
                    r = r.model_copy(update={
                        "merge_group_id": None,
                        "merge": MergeInfo(),
                    })
                new_regions.append(r)
            ctx.layout.regions = new_regions
            return ctx

    dissolve_blank_merges = DissolveBlankMerges()

The other thing ocr-agent’s postprocess_vlm_text does that the
built-in blank_suppress does not cover is stripping VLM refusal
boilerplate (「申し訳ありません … 読み取れません」). One more
@postprocessor:
    import re
    from scriva import postprocessor

    _REFUSAL = re.compile(
        r"(申し訳ありません|I'm sorry|I apologize|I cannot).*?"
        r"(?:読み取|確認|判読|認識|識別|read|unable).*?(?:できません|ません|cannot)[。..]?\s*",
        re.IGNORECASE,
    )

    @postprocessor(name="strip-vlm-refusal")
    async def strip_refusal(page, layout, recognitions):
        return {
            rid: r.with_text(_REFUSAL.sub("", r.text).strip()) if r.text else r
            for rid, r in recognitions.items()
        }

Drop it into the chain right after blank_suppress().
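To see what the pattern does and does not remove, it can be exercised on plain strings without scriva. The sketch below copies the regular expression from the stage above unchanged; `strip_refusal_text` is a hypothetical helper name used only for this demonstration.

```python
import re

# Same pattern as the strip-vlm-refusal stage, repeated so this runs standalone.
_REFUSAL = re.compile(
    r"(申し訳ありません|I'm sorry|I apologize|I cannot).*?"
    r"(?:読み取|確認|判読|認識|識別|read|unable).*?(?:できません|ません|cannot)[。..]?\s*",
    re.IGNORECASE,
)


def strip_refusal_text(text: str) -> str:
    """Remove refusal boilerplate and trim the remainder (hypothetical helper)."""
    return _REFUSAL.sub("", text).strip()


if __name__ == "__main__":
    samples = [
        "申し訳ありませんが、この画像の文字は読み取れません。",  # whole cell is a refusal
        "申し訳ありませんが、判読できません。 123",              # refusal followed by real text
        "東京都港区",                                            # ordinary cell text
        "I'm sorry, I cannot read this text.",                   # "cannot" precedes "read"
    ]
    for s in samples:
        print(repr(s), "->", repr(strip_refusal_text(s)))
```

The pattern expects the closing できません / ません / cannot to come after the verb, so the last sample passes through unchanged. If English refusals matter for your model, that is the place to extend the expression.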
Confidence-driven re-OCR (a second pipeline)
ocr-agent’s /api/ocr/{job}/reocr re-runs only the low-confidence cells
through a different model, using the previous text as a hint. The
canonical recipe lives in pipeline.md › Confidence-driven
re-OCR; here is the ocr-agent
adaptation:
    from pathlib import Path

    import scriva
    from scriva import RecognitionHint
    from scriva.cache import Cache
    from scriva.detect import box_annotations
    from scriva.export import excel
    from scriva.postprocess import confidence_score
    from scriva.prompts import Prompt
    from scriva.recognize import openai, anthropic


    def reocr_pipeline(
        result: "scriva.DocumentResult",
        *,
        threshold: float = 0.6,
        model: str = "gpt-4o",
    ) -> scriva.Pipeline:
        # Pick the adapter from the model name; anything non-OpenAI goes to anthropic.
        Recognizer = openai if model.startswith(("gpt-", "o1", "o3", "o4")) else anthropic
        return scriva.Pipeline(
            # Only cells below the threshold (or with no score) are re-run.
            box_annotations.from_result(
                result,
                where=lambda r: (r.confidence or 0) < threshold,
            ),
            Recognizer(
                model=model,
                prompt=Prompt.ocr_with_hint(),
                hints=RecognitionHint.from_result(result),
                cache=Cache.layered(".scriva_cache"),
            ),
            confidence_score.rendering(),
            excel(Path("outputs/jobs/abc123-refined.xlsx"), legend_sheet=True),
        )


    refined = reocr_pipeline(result, threshold=0.6, model="gpt-4o")("scan.png")
    combined = result.merge(refined, strategy="highest_confidence")
    combined.to_excel("outputs/jobs/abc123-final.xlsx")

2. Agentic OCR (schema discovery)
ocr-agent’s agentic mode: VLM discovers what fields exist, OCRs the
form row-by-row, then extracts each field’s value with the row context
in scope. The output is a key-value sheet (項目名 / 抽出値 / 参照行 / 備考), not a grid.
The whole thing collapses to the domains.agentic pack:
    from pathlib import Path

    import scriva
    from scriva import domains
    from scriva.cache import Cache

    def agentic_pipeline(*, excel_out: Path) -> scriva.Pipeline:
        return domains.agentic.extract(
            schema=None,      # None = discover (3 passes by default)
            rounds=3,
            model="gpt-4o",
            json_out=excel_out.with_suffix(".json"),
        )

When you do know the schema up front (say you have a pydantic.BaseModel
for the form), pass it as schema= and skip the discovery phase:
    from pydantic import BaseModel

    class InspectionForm(BaseModel):
        patient_name: str
        date_of_birth: str
        diagnosis_code: str
        dose_mg: float | None

    pipeline = domains.agentic.extract(
        schema=InspectionForm,
        model="gpt-4o",
        json_out=Path("outputs/jobs/abc123.json"),
    )

Customising the discovery prompt
ocr-agent’s discovery prompt is bilingual and includes the already-discovered list each round. To match that exactly, register your own and pass the pack a recognizer that uses it:
    from scriva.prompts import Prompt
    from scriva.recognize import openai

    Prompt.register(
        "discover-ja",
        """この帳票画像を分析し、抽出すべきデータ項目(フィールド名)を列挙してください。
    {% if known %}
    【既に発見済みの項目】
    {% for item in known %}- {{ item }}
    {% endfor %}
    上記以外にまだ発見されていない項目があれば、それらも含めて網羅的に列挙してください。
    {% endif %}
    各行は「- 項目名」の形式で出力。項目名のみ、値は不要です。
    """,
        locale="ja",
    )

    pipeline = domains.agentic.extract(
        rounds=3,
        discover_recognizer=openai(model="gpt-4o", prompt=Prompt.from_registered("discover-ja")),
        json_out=Path("outputs/jobs/abc123.json"),
    )

Excel key-value output
DocumentResult.to_excel() writes a flat key-value table when the
layout has no grid — exactly the shape ocr-agent’s
_export_agentic_excel produces. Replace the bespoke exporter with:
    result = pipeline("scan.png")
    result.to_excel(
        "outputs/jobs/abc123.xlsx",
        font="MS Gothic",
        columns=["項目名", "抽出値", "参照行", "備考"],   # column header overrides
    )

3. P&ID symbol OCR
Two routes match ocr-agent’s behaviour. Pick by how much of the workflow the user drives interactively.
3a. User drags patterns; scriva finds matches
ocr-agent’s UI lets the user drag-select a symbol, then runs
cv2.matchTemplate to find similar instances. A custom detector wraps
that:
    from pathlib import Path

    import cv2
    import numpy as np
    import scriva
    from scriva import LayoutDetector, Layout, Region, BBox, Capability, detector
    from scriva.recognize import openai
    from scriva.postprocess import whitespace
    from scriva.export import json_, excel
    from scriva.prompts import Prompt

    from pydantic import BaseModel

    class PidSymbol(BaseModel):
        types: str       # valve, pump, instrument, tank, …
        text: str        # equipment IDs, numbers, labels

    @detector(name="pid-template-match", capabilities=frozenset({Capability.POLYGON}))
    async def pid_template_match(page, *, templates: list[Path], threshold=0.8,
                                 nms_overlap=0.2, min_distance=10):
        """Run cv2.matchTemplate for each pattern, NMS, min-distance filter."""
        image = page.image_as_ndarray()    # PIL or ndarray; both supported
        boxes: list[tuple[int, int, int, int, float, str]] = []
        for tpl_path in templates:
            tpl = cv2.imread(str(tpl_path))
            if tpl is None:                # cv2.imread returns None on a bad path
                raise FileNotFoundError(tpl_path)
            score = cv2.matchTemplate(image, tpl, cv2.TM_CCOEFF_NORMED)
            ys, xs = np.where(score >= threshold)
            for x, y in zip(xs, ys):
                boxes.append((int(x), int(y), tpl.shape[1], tpl.shape[0],
                              float(score[y, x]), tpl_path.stem))
        # NMS + min-distance: see template_matching.py for the reference impl.
        boxes = _nms(boxes, overlap=nms_overlap)
        boxes = _min_distance(boxes, distance=min_distance)
        regions = [
            Region(bbox=BBox(x, y, w, h), role="data",
                   attrs={"pattern": pat, "match_score": s})
            for x, y, w, h, s, pat in boxes
        ]
        return Layout.from_regions(regions, page=page)

    def pid_pipeline(templates: list[Path], *, json_out: Path,
                     excel_out: Path | None = None) -> scriva.Pipeline:
        stages: list = [
            pid_template_match(templates=templates, threshold=0.8,
                               nms_overlap=0.2, min_distance=10),
            openai(
                model="gpt-4o",
                prompt=Prompt.structured(schema=PidSymbol),
                crop_padding_px=int(0.10 * 100),   # 10% padding ≈ ocr-agent's _crop_box_image
                max_concurrency=3,
            ),
            whitespace(),
            json_(json_out),
        ]
        if excel_out is not None:
            # embed_crops=True + crop_height_px=60 matches ocr-agent's
            # pid_export.py (one row per box with a 60px thumbnail).
            stages.append(excel(excel_out, embed_crops=True, crop_height_px=60,
                                crop_column="A"))
        return scriva.Pipeline(*stages)

(_nms and _min_distance are pure NumPy; lift them verbatim from
ocr-agent’s pipeline/template_matching.py.)
    pipeline = pid_pipeline(
        templates=[Path("uploads/pid_presets/valve.png"),
                   Path("uploads/pid_presets/pump.png")],
        json_out=Path("outputs/pid/session-001.json"),
        excel_out=Path("outputs/pid/session-001.xlsx"),
    )
    result = pipeline("drawing.png")

3b. Whole-page VLM (no template matching)
The opinionated domains.pid pack drops template matching entirely and
asks the VLM to find symbols directly. Use it when you don’t have
templates yet or the drawing style varies:
    from scriva import domains

    pipeline = domains.pid.diagram(
        model="gpt-4o",
        json_out=Path("outputs/pid/session-002.json"),
    )
    result = pipeline("drawing.png")

Both shapes are documented as legitimate; ocr-agent’s UI happens to use 3a because of the drag-to-select interaction.
4. Annotation & training loop
ocr-agent’s annotation backend (annotation_runner.py) does three
distinct things — blank annotation, merge/border annotation, and OCR
annotation — but all three follow the same pattern: primary classifier
or recognizer runs first, results sorted by |score - 0.5|, Bedrock
Qwen-VL re-runs the uncertain ones, every result becomes a training
sample.
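The selection step in that pattern is small enough to show on its own. The sketch below is illustrative only and does not use scriva: `split_by_uncertainty` is a hypothetical name, and it assumes each result carries a score in [0, 1] where 0.5 is the least certain.

```python
def split_by_uncertainty(scores: dict[str, float], k: int) -> tuple[list[str], list[str]]:
    """Rank region ids by |score - 0.5| (most uncertain first) and split
    them into (ids sent to the oracle, ids kept from the primary)."""
    ranked = sorted(scores, key=lambda rid: abs(scores[rid] - 0.5))
    return ranked[:k], ranked[k:]


if __name__ == "__main__":
    scores = {"a": 0.98, "b": 0.52, "c": 0.10, "d": 0.45}
    to_oracle, kept = split_by_uncertainty(scores, k=2)
    print(to_oracle)  # the two ids closest to 0.5
    print(kept)
```

A score near 0 is as confident as a score near 1 under this ranking, which is why the distance from 0.5 is used rather than the raw score.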
scriva.recognize.uncertainty_first + scriva.export.samples +
scriva.classify.embedding.train cover it. The pre-built domain pack:
    import scriva
    from scriva import samples
    from scriva.recognize import openai, bedrock
    from scriva.cache import Cache

    def annotation_pipeline(*, dsn: str, json_out: str) -> scriva.Pipeline:
        store = samples.layered(
            fs=samples.fs("training_data/samples"),
            index=samples.pgvector(dsn, dim=1024),
        )

        return scriva.domains.annotation.review(
            primary=openai(model="gpt-4o", cache=Cache.layered(".scriva_cache")),
            oracle=bedrock(model="qwen.qwen3-vl-235b-a22b"),
            store=store,
            k=20,                          # top-k most-uncertain go to the oracle
            json_out=json_out,
        )

    pipeline = annotation_pipeline(
        dsn="postgresql+asyncpg://localhost/ocr_agent",
        json_out="outputs/annotations/2026-05-21.json",
    )
    pipeline("scan.png")    # writes samples to the pgvector store as it runs

The pack composes:

    detect.morphological_grid
      ──► classify.rule_based
      ──► recognize.uncertainty_first(primary, oracle, k=20, store=store)
      ──► postprocess.whitespace
      ──► postprocess.confidence_score.rendering
      ──► export.json_ + export.samples(store)

So the recognizer wrapper persists both the primary recognition and
the oracle’s correction on every sample it touches. That replaces
AnnotationRunner._register_sample (annotation_runner.py:88) and
AnnotationRunner._register_pair_sample (annotation_runner.py:263)
both.
Training a classifier from the store
Once samples accumulate, fit a head and drop it into your forms pipeline:
    from scriva import classify, samples
    from scriva.embedders import OpenAIEmbedder
    from scriva.classify import LightGBMHead

    dsn = "postgresql+asyncpg://localhost/ocr_agent"   # same DSN as the annotation pipeline
    store = samples.layered(
        fs=samples.fs("training_data/samples"),
        index=samples.pgvector(dsn, dim=1024),
    )

    clf = classify.embedding.train(
        samples=store,
        where=lambda s: s.label is not None,    # only labelled samples
        embedder=OpenAIEmbedder(model="text-embedding-3-small"),
        head=LightGBMHead(),
        target=lambda s: "blank" if s.attrs.get("is_blank") else "data",
    )
    clf.save("models/cells.joblib")

…and back in §1, the call you already had,
embedding.load("models/cells.joblib"), picks it up next run. That is
the entire ocr-agent active-learning loop, on three named primitives.
Border-pair annotation (the merge mode)
ocr-agent’s merge annotation operates on pairs of adjacent cells, not
single cells. The mapping is the same uncertainty_first wrapper, but
with a pair detector instead of a cell detector and a different
prompt:
    from pydantic import BaseModel

    import scriva
    from scriva import classify, export, recognize
    from scriva.prompts import Prompt
    from scriva.recognize import bedrock

    Prompt.register(
        "border-pair",
        """この画像は帳票の隣接する2つのセルを合わせた領域です。
    2つのセルの間に境界線(罫線)が存在するかどうかを判定してください。

    以下のJSON形式で返答してください:
    { "has_border": true|false, "direction": "horizontal"|"vertical" }
    """,
        locale="ja",
    )

    class BorderLabel(BaseModel):
        has_border: bool
        direction: str

    # `store` is the layered sample store built in the previous snippet.
    border_classifier = classify.embedding.load("models/borders.joblib")   # current border classifier

    border_pipeline = scriva.Pipeline(
        pair_detector(),                       # custom (lift _extract_pair_region)
        border_classifier,
        recognize.uncertainty_first(
            primary=classify_recognizer(clf=border_classifier),   # adapter wrapping the classifier
            oracle=bedrock(
                model="qwen.qwen3-vl-235b-a22b",
                prompt=Prompt.structured_few_shot(
                    schema=BorderLabel,
                    base="border-pair",
                ),
            ),
            store=store,
            k=20,
        ),
        export.samples(store),
    )

The pair_detector is a 30-line custom @detector that yields one
Region per adjacent cell pair (the body is _extract_pair_region
from cell_detect.py:110). classify_recognizer is a thin
@recognizer that calls a trained head and emits a Recognition whose
confidence is the head’s predict_proba — its existence is what lets
uncertainty_first sort pairs by |confidence - 0.5|.
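The geometric core of such a pair detector, enumerating adjacent cells and merging their boxes, can be sketched without scriva. This is not the body of `_extract_pair_region`. It assumes cells arrive as a row-major grid of `(x, y, w, h)` boxes, and it labels each pair with the orientation of the border the two cells share, which is one possible reading of the `direction` field in the prompt above.

```python
from collections.abc import Iterator

BBox = tuple[int, int, int, int]  # x, y, w, h


def union(a: BBox, b: BBox) -> BBox:
    """Smallest box covering both inputs."""
    x0, y0 = min(a[0], b[0]), min(a[1], b[1])
    x1 = max(a[0] + a[2], b[0] + b[2])
    y1 = max(a[1] + a[3], b[1] + b[3])
    return (x0, y0, x1 - x0, y1 - y0)


def adjacent_pairs(grid: list[list[BBox]]) -> Iterator[tuple[str, BBox]]:
    """Yield (border orientation, merged bbox) for every pair of cells that
    are direct horizontal or vertical neighbours in a row-major grid."""
    for r, row in enumerate(grid):
        for c, cell in enumerate(row):
            # Left/right neighbours share a vertical border.
            if c + 1 < len(row):
                yield ("vertical", union(cell, row[c + 1]))
            # Top/bottom neighbours share a horizontal border.
            if r + 1 < len(grid) and c < len(grid[r + 1]):
                yield ("horizontal", union(cell, grid[r + 1][c]))


if __name__ == "__main__":
    grid = [
        [(0, 0, 10, 10), (10, 0, 10, 10)],
        [(0, 10, 10, 10), (10, 10, 10, 10)],
    ]
    for direction, bbox in adjacent_pairs(grid):
        print(direction, bbox)
```

In a real detector each yielded box would become one Region, with the two source cell ids stored alongside it so the oracle's answer can be written back to the right border.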
5. FastAPI hosting
scriva’s architecture page is firm about this: the web layer is the
application’s job. But the integration is small — events.to_sse and
Pipeline.cancel() are the two primitives — so here’s the
canonical wiring that replaces ocr-agent’s routers/ocr.py.
SSE progress for one job
    import asyncio
    import uuid
    from pathlib import Path

    from fastapi import APIRouter, BackgroundTasks, UploadFile
    from sse_starlette.sse import EventSourceResponse

    import scriva
    from scriva.events import to_sse

    from ocr_app.pipelines.forms import tabular_forms_pipeline

    router = APIRouter(prefix="/api/ocr", tags=["ocr"])

    # Application-layer registry (architecture.md disclaims this from scriva).
    _pipelines: dict[uuid.UUID, scriva.Pipeline] = {}
    _paths: dict[uuid.UUID, Path] = {}          # upload path per job, needed by /progress
    _results: dict[uuid.UUID, asyncio.Task] = {}

    @router.post("/upload")
    async def upload(file: UploadFile, bg: BackgroundTasks) -> dict:
        job_id = uuid.uuid4()
        path = await _save_upload(file, job_id)   # app-level helper, not shown

        pipeline = tabular_forms_pipeline(
            dict_yaml=Path("dict.yaml"),
            excel_out=Path(f"outputs/jobs/{job_id}.xlsx"),
        )
        _pipelines[job_id] = pipeline
        _paths[job_id] = path
        _results[job_id] = asyncio.create_task(pipeline.aio(str(path)))
        return {"job_id": str(job_id)}

    @router.get("/{job_id}/progress")
    async def progress(job_id: uuid.UUID):
        pipeline = _pipelines[job_id]
        path = _paths[job_id]                     # remembered at upload time
        return EventSourceResponse(to_sse(pipeline.events(str(path))))

    @router.post("/{job_id}/cancel")
    async def cancel(job_id: uuid.UUID) -> dict:
        pipeline = _pipelines.get(job_id)
        if pipeline is None:
            return {"cancelled": False}
        pipeline.cancel()          # cooperative; stages exit at their next yield
        return {"cancelled": True}

The to_sse helper emits text/event-stream chunks shaped exactly the
way ocr-agent’s existing UI already consumes. The event schema
(stage, kind, payload) is documented in
architecture.md › Observability and is
stable across releases.
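For readers who have not worked with server-sent events, one frame on the wire looks roughly like the output of the sketch below. This is not scriva's `to_sse`. The three field names (stage, kind, payload) come from the text above, while using `kind` as the SSE event name and JSON-encoding the whole record into the `data:` line are assumptions made for illustration.

```python
import json


def format_sse(stage: str, kind: str, payload: dict) -> str:
    """Encode one event as a text/event-stream frame.

    A frame is a set of `field: value` lines terminated by a blank line.
    json.dumps without `indent` never emits a raw newline, so the record
    always fits on a single `data:` line.
    """
    data = json.dumps(
        {"stage": stage, "kind": kind, "payload": payload},
        ensure_ascii=False,
    )
    return f"event: {kind}\ndata: {data}\n\n"


if __name__ == "__main__":
    print(format_sse("recognize", "progress", {"done": 3, "total": 10}), end="")
```

On the browser side an `EventSource` listener registered for the event name receives the JSON string in `event.data`.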
Human-in-the-loop pause/resume
ocr-agent’s interactive mode runs detection, pauses on
awaiting_review, waits for the user to edit the cell map, then
resumes with PUT /api/ocr/{job}/cells. The scriva-native version
splits one pipeline into two:
    # Phase 1: write the layout to a sidecar and return
    phase1 = scriva.Pipeline(
        orientation(), deskew(),
        fallback(morphological_grid(), hough_grid()),
        hybrid(rule=rule_based(), learned=embedding.load("models/cells.joblib")),
        dissolve_blank_merges,
        json_(f"outputs/jobs/{job_id}/layout.json", select={"regions"}),
    )
    phase1(str(path))
    # Respond to the browser; status = "awaiting_review"

    # Phase 2: when the user PUTs the edited layout back
    phase2 = scriva.Pipeline(
        box_annotations(f"outputs/jobs/{job_id}/layout.json"),
        openai(model="gpt-4o", prompt=Prompt.ocr(locale="ja"),
               cache=Cache.layered(".scriva_cache")),
        whitespace(), blank_suppress(), strip_refusal,
        dictionary.from_yaml("dict.yaml"),
        rule_splitter(),
        confidence_score.rendering(),
        excel(f"outputs/jobs/{job_id}.xlsx", legend_sheet=True),
    )
    phase2(str(path))

The sidecar’s schema is the same as the regions field of
result.to_json(), so the UI can edit it in place and round-trip
losslessly. box_annotations exists for exactly this case (see
pipeline.md › Human-in-the-loop review).
What stays in the application layer
These pieces of ocr-agent are not in any of the scriva snippets
above and are not meant to be. They live in ocr_app/api/:
- FastAPI app, routers, Jinja2 templates, static assets.
- HMAC session cookies, basic-auth middleware, role checks.
- OCRJob SQL model, job lifecycle states, the /jobs page.
- Cropper.js UI, cell-editor UI, P&ID drag-to-select.
- Dictionary CRUD UI, training-data CRUD UI, CSV import/export endpoints.
- P&ID PIDSession + pattern-preset library bookkeeping.
scriva’s surface for them is exactly two calls: pipeline.cancel() for
stop, and pipeline.events(doc) → to_sse(...) for progress. Anything
beyond that is your application.
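"Cooperative" cancellation means the running job checks a flag at safe points instead of being interrupted mid-stage. The toy model below illustrates that idea with plain asyncio. It is not how scriva implements `Pipeline.cancel()`; `CancellableRun` and the stage names are invented for this sketch.

```python
import asyncio


class CancellableRun:
    """Toy stand-in for a pipeline run that checks a flag between stages."""

    def __init__(self, stages):
        self._stages = stages          # list of (name, async callable)
        self._cancelled = False
        self.completed: list[str] = []

    def cancel(self) -> None:
        # Only records the request; the run loop notices it at its next check.
        self._cancelled = True

    async def run(self) -> list[str]:
        for name, stage in self._stages:
            if self._cancelled:        # safe point between stages
                break
            await stage()
            self.completed.append(name)
        return self.completed


async def demo() -> list[str]:
    async def detect():
        await asyncio.sleep(0)

    async def recognize():
        run.cancel()                   # simulate a cancel request arriving mid-stage

    async def export():
        await asyncio.sleep(0)

    run = CancellableRun([("detect", detect), ("recognize", recognize), ("export", export)])
    return await run.run()


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

The stage that was running when the request arrived finishes normally and later stages are skipped, so a cancelled job never leaves a half-written export behind.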
Where to go next
- The four domain packs above are also documented standalone; see Domain packs.
- For the protocol-level shape of each stage (so you can write your own), the reference pages: Detectors, Recognizers, Post-processors, Exporters, Sample stores.
- For the full event payload table, Architecture › Observability.
- For caching strategy in production, Caching.