
Exporters

An exporter serialises a DocumentResult to an output format such as .xlsx, .json, or PDF. Exporters live in two places:

  • As methods on DocumentResult (.to_excel(...), .to_json(), …), for ad-hoc serialisation after a run.
  • As stages in a pipeline, for in-run writes that participate in the event stream and respect page_concurrency.

Multiple exporters per pipeline are allowed and useful — write .xlsx for humans and .json for downstream systems in the same run.

The simplest path is on the result itself; no exporter stage required:

result = pipeline("scan.png")
text = result.render() # str — reading-order text
data = result.to_json() # dict
df = result.to_dataframe() # pandas — one row per region
path = result.to_excel("out.xlsx", confidence_thresholds=(0.6, 0.8))
md = result.to_markdown() # str
hocr = result.to_hocr() # str
xml = result.to_alto() # str
pdf = result.to_pdf("out.pdf", source="scan.pdf") # searchable PDF (text over image)
result.show() # matplotlib (optional extra)

Each verb takes the same select= set for field filtering and the same format-specific options as the equivalent in-pipeline exporter.

Use in-pipeline exporter stages when you want the write to be part of the run: recorded in the event stream, written before pipeline(doc) returns, and parallelisable across pages.

| Factory | Class | Output | Extra | Notes |
| --- | --- | --- | --- | --- |
| export.excel(path) | ExcelExporter | .xlsx | excel | Grid layout, merged cells, confidence colouring |
| export.json_(path) | JSONExporter | .json | core | Full result dump; stable schema; the default |
| export.markdown(path) | MarkdownExporter | .md | core | Reading-order text + optional table rendering |
| export.hocr(path) | HocrExporter | .hocr (HTML) | core | W3C hOCR; consumable by hOCR viewers / pdf-merge |
| export.alto(path) | AltoExporter | .xml | core | ALTO XML; library-archive friendly |
| export.csv_(path) | CsvExporter | .csv | core | Grid only; one row per layout row |
| export.jsonl(path) | JsonlExporter | .jsonl | core | One region per line; right for ML training sets |
| export.pdf(path) | PdfExporter | .pdf | pdf | Searchable PDF; text overlay on source image |
| export.samples(store) | SamplesExporter | side effect | core | Writes one Sample per region to a SampleStore |
| export.null() | NullExporter | nothing | core | For test runs and dry-runs |

(The trailing underscores on json_, csv_ avoid clashing with stdlib module names.)

The Excel exporter is the one most callers actually use. It handles merged cells, rotated text, and confidence colouring:

from scriva.export import excel

excel(
    "out.xlsx",
    confidence_thresholds=(0.6, 0.8),  # red / yellow / green
    confidence_fill=None,              # per-cell callback, see below
    font="MS Gothic",
    show_merge_borders=True,
    embed_crops=False,                 # see below
    legend_sheet=False,                # add a 凡例/Legend sheet
)

confidence_thresholds=(low, high) is the easy default — three buckets, fixed colours. When you need a per-cell rule, pass confidence_fill=:

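excel(
    "out.xlsx",
    # Compare against None explicitly so that a confidence of 0.0 is still coloured.
    confidence_fill=lambda r: "red" if r.confidence is not None and r.confidence < 0.5 else None,
)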

confidence_fill is Callable[[Recognition], Color | None]. Returning None opts a cell out of fill — useful for “only colour the bad ones” audits. When both options are set, confidence_fill wins; the threshold form is kept for terse defaults.
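The precedence between the two options can be sketched in plain Python. This is a minimal illustration, not scriva's implementation: the helper names resolve_fill and only_bad, the stand-in Recognition dataclass, the colour names, leaving cells with no confidence unfilled, and placing a value exactly equal to a threshold in the higher bucket are all assumptions.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple


@dataclass
class Recognition:
    """Stand-in for scriva's Recognition; only the fields used here."""
    text: str
    confidence: Optional[float] = None


def resolve_fill(
    rec: Recognition,
    thresholds: Optional[Tuple[float, float]] = (0.6, 0.8),
    confidence_fill: Optional[Callable[[Recognition], Optional[str]]] = None,
) -> Optional[str]:
    """Return a fill colour name for one cell, or None for no fill."""
    # A per-cell callback, when given, takes precedence over thresholds.
    if confidence_fill is not None:
        return confidence_fill(rec)
    # Assumption: cells with no confidence, or no thresholds, stay unfilled.
    if thresholds is None or rec.confidence is None:
        return None
    low, high = thresholds
    if rec.confidence < low:
        return "red"
    if rec.confidence < high:
        return "yellow"
    return "green"


def only_bad(rec: Recognition) -> Optional[str]:
    """Colour only low-confidence cells; 0.0 counts as a valid confidence."""
    if rec.confidence is not None and rec.confidence < 0.5:
        return "red"
    return None
```

With the default thresholds, a confidence of 0.7 resolves to "yellow"; passing confidence_fill=only_bad makes the same cell resolve to None, because the callback is consulted first.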

If the layout has no grid (no region.grid set), the exporter writes a flat key-value table instead.
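To make the grid layout concrete, the sketch below maps a region's grid entry (the row, col, rowspan, colspan shape used in the JSON output) to an A1-style cell range of the kind a spreadsheet merge uses. The function names and the assumption that grid indices are zero-based are illustrative only; the exporter's internals are not described here.

```python
def column_letter(index: int) -> str:
    """Convert a 1-based column index to spreadsheet letters (1 is A, 27 is AA)."""
    if index < 1:
        raise ValueError("column index must be >= 1")
    letters = ""
    while index > 0:
        index, remainder = divmod(index - 1, 26)
        letters = chr(ord("A") + remainder) + letters
    return letters


def grid_to_range(grid: dict) -> str:
    """Map a zero-based grid entry to an A1 range such as 'A1' or 'B2:C3'."""
    top = grid["row"] + 1
    left = grid["col"] + 1
    bottom = top + grid.get("rowspan", 1) - 1
    right = left + grid.get("colspan", 1) - 1
    start = f"{column_letter(left)}{top}"
    end = f"{column_letter(right)}{bottom}"
    # A 1x1 region is a single cell, so no merge range is needed.
    return start if start == end else f"{start}:{end}"
```

A region with rowspan or colspan greater than 1 yields a two-corner range, which is the form a merged cell takes in a worksheet.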

For workflows where the spreadsheet is meant to be reviewed alongside the source image — P&ID symbol lists, annotation review, training-data audits — set embed_crops=True:

excel(
    "out.xlsx",
    embed_crops=True,
    crop_height_px=60,
    crop_column="A",  # which column hosts the thumbnail
)

The exporter writes one cropped PNG per region (cached under outputs/.crops/ so re-exports are cheap) and uses openpyxl.drawing.image.Image to anchor each one in the row. Row heights are adjusted to fit. Aspect ratio is preserved; crop_height_px is the height budget.
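The sizing arithmetic behind a height budget is short enough to sketch. The helper names below are hypothetical, and two details are assumptions: that crops shorter than the budget are scaled up to it, and that pixels convert to spreadsheet row-height points at 96 DPI.

```python
from typing import Tuple


def fit_crop(size: Tuple[int, int], crop_height_px: int = 60) -> Tuple[int, int]:
    """Scale (width, height) to the height budget, preserving aspect ratio."""
    width, height = size
    if width <= 0 or height <= 0:
        raise ValueError("crop dimensions must be positive")
    scale = crop_height_px / height
    # Keep at least one pixel of width for very tall, narrow crops.
    return max(1, round(width * scale)), crop_height_px


def row_height_points(height_px: int, dpi: int = 96) -> float:
    """Convert a pixel height to points (1 pt = 1/72 inch)."""
    return height_px * 72 / dpi
```

For a 400×120 crop and the 60 px budget, fit_crop returns a 200×60 thumbnail, and the hosting row needs 45 points of height under the 96 DPI assumption.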

embed_crops adds noticeable file size — a 200-region sheet runs ~5 MB versus ~50 KB without embedded images. Default off.

The PDF exporter writes a searchable PDF: the source page images with a text layer overlaid. It assembles per-page hOCR and feeds it through hocr-pdf (bundled with the pdf extra). The result opens in any PDF reader and is greppable.

from scriva.export import pdf as export_pdf

export_pdf(
    "out.pdf",
    source=None,                  # Path | Document | None; None re-uses doc.source
    dpi=300,                      # int
    text_visibility="invisible",  # "invisible" | "subtle" | "visible"
)

text_visibility="invisible" is the standard searchable-PDF behaviour (text is selectable but not rendered). "subtle" makes the text faintly visible — useful for proofing detection alignment. "visible" renders the recognised text over the image as a debugging view.

The equivalent verb on a result:

result.to_pdf("out.pdf", source="scan.pdf")

The samples exporter persists one Sample per recognised region to a SampleStore. It is the bridge between a normal OCR run and the training / annotation loop:

import scriva
from scriva import samples
from scriva.export import samples as export_samples

# morphological_grid, openai, and confidence_score are stage factories; their imports are omitted here.
pipeline = scriva.Pipeline(
    morphological_grid(),
    openai(model="gpt-4o"),
    confidence_score.rendering(),
    export_samples(
        store=samples.fs(".scriva_samples"),
        label_from="recognition",  # "recognition" | "none" | Callable
        embed=None,                # ImageEmbedder | None; required if store has no EMBEDDED_INDEX
        select=None,               # set[str] | None; which Recognition fields to persist
    ),
)

label_from:

  • "recognition" (default) — the primary recognizer’s text becomes the sample’s label. Right for “I’ll correct the wrong ones later” workflows.
  • "none" — leave label=None. The sample is unlabelled training data awaiting annotation.
  • A Callable[[Recognition], str | None] — application-specific logic (e.g. only label when confidence > 0.9).
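The three label_from modes listed above can be sketched as one small dispatch function. This is an illustration under stated assumptions, not the exporter's code: resolve_label, confident_only, and the stand-in Recognition dataclass are hypothetical names, and raising ValueError on an unknown string is a choice made for the sketch.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Union


@dataclass
class Recognition:
    """Stand-in for scriva's Recognition; only the fields used here."""
    text: str
    confidence: Optional[float] = None


LabelFrom = Union[str, Callable[[Recognition], Optional[str]]]


def resolve_label(rec: Recognition, label_from: LabelFrom = "recognition") -> Optional[str]:
    """Return the label a sample would get under each label_from mode."""
    if callable(label_from):
        return label_from(rec)
    if label_from == "recognition":
        return rec.text
    if label_from == "none":
        return None
    raise ValueError(f"unsupported label_from: {label_from!r}")


def confident_only(rec: Recognition) -> Optional[str]:
    """Label a sample only when the recognizer was confident."""
    if rec.confidence is not None and rec.confidence > 0.9:
        return rec.text
    return None
```

Passing confident_only as label_from labels a region recognised at 0.92 with its text and leaves one recognised at 0.5 unlabelled, ready for annotation.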

The exporter participates in the event stream — one export/progress per persisted sample — and respects page_concurrency. See domains.annotation for the canonical annotation pipeline that builds on it.

The JSON exporter's schema is stable and versioned:

{
  "scriva_version": "0.1.0",
  "schema_version": 1,
  "document": { "source": "scan.png", "pages": 1 },
  "pages": [
    {
      "index": 0,
      "size": [2480, 3508],
      "regions": [
        {
          "id": "p0-r0",
          "bbox": [10, 20, 200, 50],
          "role": "data",
          "grid": { "row": 0, "col": 0, "rowspan": 1, "colspan": 1 },
          "recognition": {
            "text": "Hello",
            "confidence": 0.92,
            "language": "en",
            "source": "openai:gpt-4o",
            "cache": null
          }
        }
      ]
    }
  ]
}

Round-trips losslessly through DocumentResult.from_json().
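Because the output is plain JSON, a downstream consumer needs only the standard library. The sketch below checks schema_version and collects each region's text using the field names from the sample above. The function names, the strict version check, and the row-then-column ordering are assumptions made for illustration.

```python
import json
from typing import List, Tuple

SUPPORTED_SCHEMA_VERSION = 1  # matches the sample above


def _grid_key(region: dict) -> Tuple[int, int]:
    """Sort key: grid row, then column; regions without a grid sort first."""
    grid = region.get("grid") or {}
    return (grid.get("row", -1), grid.get("col", -1))


def region_texts(payload: str) -> List[str]:
    """Return recognised text per region, ordered by page, then grid row/col."""
    data = json.loads(payload)
    version = data.get("schema_version")
    if version != SUPPORTED_SCHEMA_VERSION:
        raise ValueError(f"unsupported schema_version: {version!r}")
    texts = []
    for page in sorted(data["pages"], key=lambda p: p["index"]):
        for region in sorted(page["regions"], key=_grid_key):
            recognition = region.get("recognition") or {}
            if recognition.get("text") is not None:
                texts.append(recognition["text"])
    return texts
```

Checking the version up front means a consumer fails loudly on a schema it was not written for, instead of silently misreading fields.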

Most exporters take a select argument that controls which Recognition fields make it into the output. This is useful for sharing results without provenance or debug fields such as source and cache:

result.to_json(select={"text", "confidence"})
export.json_("out.json", select={"text", "confidence"})
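The effect of select on a result dict can be approximated with a small filter over the recognition objects. This is a sketch of the idea only: apply_select is a hypothetical helper, and it assumes that select touches nothing outside each region's recognition entry, which the text here does not specify.

```python
import copy
from typing import Set


def apply_select(document: dict, select: Set[str]) -> dict:
    """Return a copy of a result dict keeping only the selected recognition fields."""
    filtered = copy.deepcopy(document)  # leave the caller's dict untouched
    for page in filtered.get("pages", []):
        for region in page.get("regions", []):
            recognition = region.get("recognition")
            if recognition is not None:
                region["recognition"] = {
                    key: value for key, value in recognition.items() if key in select
                }
    return filtered
```

Applied with select={"text", "confidence"}, each recognition entry keeps only those two keys, while ids, bounding boxes, and grid positions remain as they were.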

To write a custom exporter, subclass Exporter:

from scriva import Exporter, DocumentResult, ExportArtifact

class MyExporter(Exporter):
    async def export(self, result: DocumentResult) -> ExportArtifact:
        # render() stands for your own function that turns a result into bytes.
        return ExportArtifact.bytes(render(result), suffix=".bin")

…or decorate a function:

from scriva import exporter, ExportArtifact

@exporter
async def my_exporter(result):
    return ExportArtifact.bytes(render(result), suffix=".bin")

The artifact can be a path (.path(...)), in-memory bytes (.bytes(...)), or a stream (.stream(...)). The pipeline does not care which.