Exporters
An exporter serialises a DocumentResult to a format. There are two
places exporters live:
- As methods on DocumentResult (.to_excel(...), .to_json(), …), for ad-hoc serialisation after a run.
- As stages in a pipeline, for in-run writes that participate in the event stream and respect page_concurrency.
Multiple exporters per pipeline are allowed and useful — write .xlsx for
humans and .json for downstream systems in the same run.
Result serialisation verbs
The simplest path is on the result itself; no exporter stage required:

    result = pipeline("scan.png")

    text = result.render()          # str: reading-order text
    data = result.to_json()         # dict
    df = result.to_dataframe()      # pandas: one row per region
    path = result.to_excel("out.xlsx", confidence_thresholds=(0.6, 0.8))
    md = result.to_markdown()       # str
    hocr = result.to_hocr()         # str
    xml = result.to_alto()          # str
    pdf = result.to_pdf("out.pdf", source="scan.pdf")  # searchable PDF (text over image)
    result.show()                   # matplotlib (optional extra)

Each verb takes the same select= set for field filtering and the same
format-specific options as the equivalent in-pipeline exporter.
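As a rough mental model of select=, the sketch below filters a recognition dictionary down to the requested field names. It is not scriva's implementation: apply_select is a hypothetical helper, and treating select=None as "keep everything" and silently ignoring unknown field names are both assumptions.

```python
from typing import Any, Optional


def apply_select(recognition: dict[str, Any], select: Optional[set[str]]) -> dict[str, Any]:
    """Keep only the selected fields (hypothetical helper, not part of scriva)."""
    if select is None:
        # Assumption: no selection means every field is kept.
        return dict(recognition)
    # Assumption: names in `select` that do not exist are simply ignored.
    return {key: value for key, value in recognition.items() if key in select}


# Field names follow the recognition object shown in the JSON schema on this page.
recognition = {
    "text": "Hello",
    "confidence": 0.92,
    "language": "en",
    "source": "openai:gpt-4o",
    "cache": None,
}
slim = apply_select(recognition, {"text", "confidence"})
```

Here slim holds only the text and confidence entries, which is the shape a reader would expect from select={"text", "confidence"}.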
In-pipeline exporter stages
Use these when you want the write to be part of the run: recorded in the
event stream, written before pipeline(doc) returns, and parallelisable
across pages.
| Factory | Class | Output | Extra | Notes |
|---|---|---|---|---|
| export.excel(path) | ExcelExporter | .xlsx | excel | Grid layout, merged cells, confidence colouring |
| export.json_(path) | JSONExporter | .json | core | Full result dump; stable schema; the default |
| export.markdown(path) | MarkdownExporter | .md | core | Reading-order text + optional table rendering |
| export.hocr(path) | HocrExporter | .hocr (HTML) | core | hOCR markup; consumable by hOCR viewers / pdf-merge |
| export.alto(path) | AltoExporter | .xml | core | ALTO XML; library-archive friendly |
| export.csv_(path) | CsvExporter | .csv | core | Grid only; one row per layout row |
| export.jsonl(path) | JsonlExporter | .jsonl | core | One region per line; right for ML training sets |
| export.pdf(path) | PdfExporter | .pdf | pdf | Searchable PDF: text overlay on source image |
| export.samples(store) | SamplesExporter | side effect | core | Writes one Sample per region to a SampleStore |
| export.null() | NullExporter | nothing | core | For test runs and dry-runs |
(The trailing underscores on json_, csv_ avoid clashing with stdlib
module names.)
export.excel is the exporter most callers actually use. It handles merged cells, rotated text, and confidence colouring:

    from scriva.export import excel

    excel(
        "out.xlsx",
        confidence_thresholds=(0.6, 0.8),  # red / yellow / green
        confidence_fill=None,              # per-cell callback, see below
        font="MS Gothic",
        show_merge_borders=True,
        embed_crops=False,                 # see below
        legend_sheet=False,                # add a 凡例/Legend sheet
    )

Per-cell confidence colouring
confidence_thresholds=(low, high) is the easy default: three buckets,
fixed colours. When you need a per-cell rule, pass confidence_fill=:
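
    excel(
        "out.xlsx",
        # Colour only cells whose confidence is known and below 0.5.
        # An explicit None check is used because `r.confidence or 1`
        # would also treat a confidence of 0.0 as missing.
        confidence_fill=lambda r: (
            "red" if r.confidence is not None and r.confidence < 0.5 else None
        ),
    )

confidence_fill is Callable[[Recognition], Color | None]. Returning
None opts a cell out of fill, which is useful for “only colour the bad ones”
audits. When both options are set, confidence_fill wins; the threshold
form is kept for terse defaults.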
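The two colouring modes and their precedence can be sketched in plain Python. This is a minimal illustration, not the exporter's code. The Recognition stand-in, threshold_colour, and cell_fill are hypothetical names. Two behaviours are assumptions: a value exactly on a threshold falls into the higher bucket, and a missing confidence gets no fill.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple


@dataclass
class Recognition:
    """Stand-in for scriva's Recognition; only the fields used here."""
    text: str
    confidence: Optional[float] = None


def threshold_colour(confidence: Optional[float], low: float, high: float) -> Optional[str]:
    """Three fixed buckets: below low, between low and high, high or above."""
    if confidence is None:
        return None  # Assumption: unknown confidence is left unfilled.
    if confidence < low:
        return "red"
    if confidence < high:
        return "yellow"
    return "green"


def cell_fill(
    recognition: Recognition,
    thresholds: Tuple[float, float] = (0.6, 0.8),
    confidence_fill: Optional[Callable[[Recognition], Optional[str]]] = None,
) -> Optional[str]:
    """Pick a fill colour; the callback wins when both options are given."""
    if confidence_fill is not None:
        return confidence_fill(recognition)
    low, high = thresholds
    return threshold_colour(recognition.confidence, low, high)


def only_bad(r: Recognition) -> Optional[str]:
    """Per-cell rule: colour only cells with a known confidence below 0.5."""
    return "red" if r.confidence is not None and r.confidence < 0.5 else None
```

With the defaults, a confidence of 0.55 lands in the red bucket. With confidence_fill=only_bad the same cell gets no fill, because the callback overrides the thresholds.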
If the layout has no grid (no region.grid set), the exporter writes a
flat key-value table instead.
Embedded crop thumbnails
For workflows where the spreadsheet is meant to be reviewed alongside
the source image (P&ID symbol lists, annotation review, training-data
audits), set embed_crops=True:

    excel(
        "out.xlsx",
        embed_crops=True,
        crop_height_px=60,
        crop_column="A",  # which column hosts the thumbnail
    )

The exporter writes one cropped PNG per region (cached under
outputs/.crops/ so re-exports are cheap) and uses
openpyxl.drawing.image.Image to anchor each one in the row. Row heights
are adjusted to fit. Aspect ratio is preserved; crop_height_px is the
height budget.
embed_crops adds noticeable file size — a 200-region sheet runs ~5 MB
versus ~50 KB without embedded images. Default off.
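The "height budget" rule for thumbnails comes down to a little arithmetic, sketched below. fit_to_height and px_to_points are hypothetical helpers, not scriva functions. Not upscaling crops that are already shorter than the budget is an assumption. Spreadsheet row heights are measured in points, so the pixel-to-point conversion assumes the common 96 DPI screen convention.

```python
def fit_to_height(width_px: int, height_px: int, crop_height_px: int = 60) -> tuple[int, int]:
    """Scale a crop to fit the height budget while preserving aspect ratio."""
    if width_px <= 0 or height_px <= 0:
        raise ValueError("crop dimensions must be positive")
    if height_px <= crop_height_px:
        # Assumption: crops that already fit are not enlarged.
        return width_px, height_px
    scale = crop_height_px / height_px
    # Keep at least one pixel of width for very tall, narrow crops.
    return max(1, round(width_px * scale)), crop_height_px


def px_to_points(pixels: float, dpi: float = 96.0) -> float:
    """Convert pixels to points (1 pt = 1/72 inch), assuming the given DPI."""
    return pixels * 72.0 / dpi


thumb_width, thumb_height = fit_to_height(400, 120, crop_height_px=60)
row_height_pt = px_to_points(thumb_height)
```

A 400 x 120 crop with a 60 px budget becomes 200 x 60, and the hosting row would need about 45 points of height under the 96 DPI assumption.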
export.pdf
Writes a searchable PDF: the source page images with a text layer
overlaid. The exporter assembles per-page hOCR and feeds it through
hocr-pdf (bundled with the pdf extra). The result opens in any PDF
reader and is greppable.

    from scriva.export import pdf as export_pdf

    export_pdf(
        "out.pdf",
        source=None,                  # Path | Document | None; None re-uses doc.source
        dpi=300,                      # int
        text_visibility="invisible",  # "invisible" | "subtle" | "visible"
    )

text_visibility="invisible" is the standard searchable-PDF behaviour
(text is selectable but not rendered). "subtle" makes the text faintly
visible, which is useful for proofing detection alignment. "visible" renders
the recognised text over the image as a debugging view.
The equivalent verb on a result:

    result.to_pdf("out.pdf", source="scan.pdf")

export.samples
Persists one Sample per recognised region to a
SampleStore. The exporter is the bridge between a normal
OCR run and the training / annotation loop:

    import scriva
    from scriva import samples
    from scriva.export import samples as export_samples

    # morphological_grid, openai, and confidence_score are stage factories
    # assumed to be imported elsewhere.
    pipeline = scriva.Pipeline(
        morphological_grid(),
        openai(model="gpt-4o"),
        confidence_score.rendering(),
        export_samples(
            store=samples.fs(".scriva_samples"),
            label_from="recognition",  # "recognition" | "none" | Callable
            embed=None,                # ImageEmbedder | None; required if store has no EMBEDDED_INDEX
            select=None,               # set[str] | None; which Recognition fields to persist
        ),
    )

label_from:
- "recognition" (default): the primary recognizer’s text becomes the sample’s label. Right for “I’ll correct the wrong ones later” workflows.
- "none": leave label=None. The sample is unlabelled training data awaiting annotation.
- A Callable[[Recognition], str | None]: application-specific logic (e.g. only label when confidence > 0.9).
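The three label_from modes listed above can be pictured as a small dispatch function. This is a sketch under stated assumptions, not the exporter's implementation. The Recognition stand-in, resolve_label, and confident_only are hypothetical names, and raising ValueError for an unknown mode is an assumption.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Union


@dataclass
class Recognition:
    """Stand-in for scriva's Recognition; only the fields used here."""
    text: str
    confidence: Optional[float] = None


LabelFrom = Union[str, Callable[[Recognition], Optional[str]]]


def resolve_label(recognition: Recognition, label_from: LabelFrom = "recognition") -> Optional[str]:
    """Turn a label_from setting into the label stored on a sample."""
    if callable(label_from):
        # Application-specific rule supplied by the caller.
        return label_from(recognition)
    if label_from == "recognition":
        # The recognised text becomes the label, to be corrected later.
        return recognition.text
    if label_from == "none":
        # Unlabelled sample awaiting annotation.
        return None
    # Assumption: unknown modes are rejected rather than ignored.
    raise ValueError(f"unsupported label_from: {label_from!r}")


def confident_only(r: Recognition) -> Optional[str]:
    """Example callable: label only when confidence is above 0.9."""
    return r.text if r.confidence is not None and r.confidence > 0.9 else None
```

Passing confident_only as label_from keeps the recognised text for high-confidence regions and leaves the rest unlabelled for annotation.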
The exporter participates in the event stream — one export/progress
per persisted sample — and respects page_concurrency. See
domains.annotation for the canonical
annotation pipeline that builds on it.
json_ schema
The schema is stable and versioned:

    {
      "scriva_version": "0.1.0",
      "schema_version": 1,
      "document": { "source": "scan.png", "pages": 1 },
      "pages": [
        {
          "index": 0,
          "size": [2480, 3508],
          "regions": [
            {
              "id": "p0-r0",
              "bbox": [10, 20, 200, 50],
              "role": "data",
              "grid": { "row": 0, "col": 0, "rowspan": 1, "colspan": 1 },
              "recognition": {
                "text": "Hello",
                "confidence": 0.92,
                "language": "en",
                "source": "openai:gpt-4o",
                "cache": null
              }
            }
          ]
        }
      ]
    }

Round-trips losslessly through DocumentResult.from_json().
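Because the dump is plain JSON, downstream systems can consume it with the standard library alone. The sketch below walks the pages and regions of a dump shaped like the example above and collects low-confidence regions. low_confidence_regions is a hypothetical helper, the schema_version check is an assumption about how a consumer might guard itself, and the second region in the sample is made up for illustration.

```python
import json


def low_confidence_regions(dump: dict, threshold: float = 0.8) -> list[tuple[str, str, float]]:
    """Return (region id, text, confidence) for regions below the threshold."""
    if dump.get("schema_version") != 1:
        # Assumption: a consumer refuses schema versions it does not know.
        raise ValueError("this sketch only understands schema_version 1")
    flagged = []
    for page in dump["pages"]:
        for region in page["regions"]:
            recognition = region.get("recognition") or {}
            confidence = recognition.get("confidence")
            if confidence is not None and confidence < threshold:
                flagged.append((region["id"], recognition.get("text", ""), confidence))
    return flagged


# Trimmed dump; region p0-r1 is illustrative and not from the documentation.
sample = json.loads("""
{
  "scriva_version": "0.1.0",
  "schema_version": 1,
  "document": {"source": "scan.png", "pages": 1},
  "pages": [
    {
      "index": 0,
      "size": [2480, 3508],
      "regions": [
        {"id": "p0-r0", "bbox": [10, 20, 200, 50], "role": "data",
         "recognition": {"text": "Hello", "confidence": 0.92}},
        {"id": "p0-r1", "bbox": [10, 60, 200, 90], "role": "data",
         "recognition": {"text": "W0rld", "confidence": 0.41}}
      ]
    }
  ]
}
""")
flagged = low_confidence_regions(sample)
```

Checking schema_version before reading the rest is a cheap way for a consumer to fail loudly if the format ever changes.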
Selecting fields
Most exporters take a select argument that controls which Recognition
fields make it into the output. This is useful for sharing results without
debug data; the example keeps only text and confidence:

    result.to_json(select={"text", "confidence"})
    export.json_("out.json", select={"text", "confidence"})

Writing your own
Subclass:

    from scriva import Exporter, DocumentResult, ExportArtifact

    class MyExporter(Exporter):
        async def export(self, result: DocumentResult) -> ExportArtifact:
            # render() is a placeholder for your own serialisation function.
            return ExportArtifact.bytes(render(result), suffix=".bin")

…or decorate a function:

    from scriva import exporter, ExportArtifact

    @exporter
    async def my_exporter(result):
        return ExportArtifact.bytes(render(result), suffix=".bin")

The artifact can be a path (.path(...)), in-memory bytes (.bytes(...)),
or a stream (.stream(...)). The pipeline does not care which.