Exporters

An exporter serialises a DocumentResult to a format. There are two places exporters live:

As methods on DocumentResult — .to_excel(...), .to_json(), … — for ad-hoc serialisation after a run.
As stages in a pipeline — for in-run writes that participate in the event stream and respect page_concurrency.

Multiple exporters per pipeline are allowed and useful — write .xlsx for humans and .json for downstream systems in the same run.

Result serialisation verbs

The simplest path is on the result itself; no exporter stage required:

result = pipeline("scan.png")

text    = result.render()                        # str — reading-order text
data    = result.to_json()                       # dict
df      = result.to_dataframe()                  # pandas — one row per region
path    = result.to_excel("out.xlsx", confidence_thresholds=(0.6, 0.8))
md      = result.to_markdown()                   # str
hocr    = result.to_hocr()                       # str
xml     = result.to_alto()                       # str
pdf     = result.to_pdf("out.pdf", source="scan.pdf")  # searchable PDF (text over image)
result.show()                                    # matplotlib (optional extra)

Each verb accepts the same format-specific options as the equivalent in-pipeline exporter, including the select= set for field filtering where that exporter supports it.

In-pipeline exporter stages

Use these when you want the write to be part of the run — recorded in the event stream, written before pipeline(doc) returns, parallelisable across pages.

Factory	Class	Output	Extra	Notes
`export.excel(path)`	`ExcelExporter`	`.xlsx`	`excel`	Grid layout, merged cells, confidence colouring
`export.json_(path)`	`JSONExporter`	`.json`	core	Full result dump; stable schema; the default
`export.markdown(path)`	`MarkdownExporter`	`.md`	core	Reading-order text + optional table rendering
`export.hocr(path)`	`HocrExporter`	`.hocr` (HTML)	core	hOCR (HTML-based OCR markup); consumable by hOCR viewers / pdf-merge
`export.alto(path)`	`AltoExporter`	`.xml`	core	ALTO XML; library-archive friendly
`export.csv_(path)`	`CsvExporter`	`.csv`	core	Grid only; one row per layout row
`export.jsonl(path)`	`JsonlExporter`	`.jsonl`	core	One region per line; right for ML training sets
`export.pdf(path)`	`PdfExporter`	`.pdf`	`pdf`	Searchable PDF — text overlay on source image
`export.samples(store)`	`SamplesExporter`	side effect	core	Writes one `Sample` per region to a `SampleStore`
`export.null()`	`NullExporter`	nothing	core	For test runs and dry-runs

(The trailing underscores on json_, csv_ avoid clashing with stdlib module names.)

`excel`

The exporter most callers actually use. It handles merged cells and rotated text, and can colour cells by recognition confidence:

from scriva.export import excel
excel(
    "out.xlsx",
    confidence_thresholds=(0.6, 0.8),  # red / yellow / green
    confidence_fill=None,              # see below — per-cell callback
    font="MS Gothic",
    show_merge_borders=True,
    embed_crops=False,                 # see below
    legend_sheet=False,                # add a 凡例/Legend sheet
)

Per-cell confidence colouring

confidence_thresholds=(low, high) is the easy default — three buckets, fixed colours. When you need a per-cell rule, pass confidence_fill=:

excel(
    "out.xlsx",
    confidence_fill=lambda r: "red" if r.confidence is not None and r.confidence < 0.5 else None,
)

confidence_fill is Callable[[Recognition], Color | None]. Returning None opts a cell out of fill — useful for “only colour the bad ones” audits. When both options are set, confidence_fill wins; the threshold form is kept for terse defaults.
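The precedence rule above can be sketched in plain Python. This is an illustration, not the exporter's actual code: the `Recognition` stand-in, the `resolve_fill` helper, the colour names, the treatment of boundary values (a confidence equal to a threshold falls into the higher bucket), and the handling of a missing confidence (no fill) are all assumptions.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple


@dataclass
class Recognition:
    """Hypothetical stand-in for scriva's Recognition; only the fields used here."""
    text: str
    confidence: Optional[float]


def resolve_fill(
    r: Recognition,
    confidence_thresholds: Optional[Tuple[float, float]] = None,
    confidence_fill: Optional[Callable[[Recognition], Optional[str]]] = None,
) -> Optional[str]:
    """Return a fill colour name for one cell, or None for no fill."""
    # The callback wins whenever it is set, even if thresholds are also given.
    if confidence_fill is not None:
        return confidence_fill(r)
    # Without thresholds or a confidence value there is nothing to bucket.
    if confidence_thresholds is None or r.confidence is None:
        return None
    low, high = confidence_thresholds
    if r.confidence < low:
        return "red"
    if r.confidence < high:
        return "yellow"
    return "green"


def only_bad(r: Recognition) -> Optional[str]:
    """Colour only cells with a known confidence below 0.5."""
    if r.confidence is not None and r.confidence < 0.5:
        return "red"
    return None
```

Testing `r.confidence is not None` explicitly, rather than relying on truthiness, keeps a confidence of exactly 0.0 in the "bad" bucket.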

If the layout has no grid (no region.grid set), the exporter writes a flat key-value table instead.

Embedded crop thumbnails

For workflows where the spreadsheet is meant to be reviewed alongside the source image — P&ID symbol lists, annotation review, training-data audits — set embed_crops=True:

excel(
    "out.xlsx",
    embed_crops=True,
    crop_height_px=60,
    crop_column="A",      # which column hosts the thumbnail
)

The exporter writes one cropped PNG per region (cached under outputs/.crops/ so re-exports are cheap) and uses openpyxl.drawing.image.Image to anchor each one in the row. Row heights are adjusted to fit. Aspect ratio is preserved; crop_height_px is the height budget.
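The sizing rule can be illustrated with a little arithmetic. The sketch below is not taken from the exporter: the helper names are hypothetical, it assumes crops shorter than the budget are not upscaled, and it assumes a 96 DPI display when converting pixels to the points that spreadsheet row heights are measured in.

```python
from typing import Tuple


def thumbnail_size(width_px: int, height_px: int, crop_height_px: int = 60) -> Tuple[int, int]:
    """Scale a crop to fit the height budget while preserving aspect ratio."""
    if width_px <= 0 or height_px <= 0:
        raise ValueError("crop dimensions must be positive")
    # Assumption: crops shorter than the budget keep their original size.
    scale = min(1.0, crop_height_px / height_px)
    # Keep at least one pixel in each dimension after rounding.
    return max(1, round(width_px * scale)), max(1, round(height_px * scale))


def row_height_points(height_px: int, dpi: int = 96) -> float:
    """Convert a pixel height to points (1 pt = 1/72 inch) for a row height."""
    return height_px * 72 / dpi


size = thumbnail_size(400, 120)          # a 400x120 crop under the default budget
row_height = row_height_points(size[1])  # row height needed to show it
```

Here a 400x120 crop is halved to 200x60, and the hosting row needs 45 points at the assumed 96 DPI.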

embed_crops adds noticeable file size — a 200-region sheet runs ~5 MB versus ~50 KB without embedded images. Default off.

`export.pdf`

Writes a searchable PDF — the source page images with a text layer overlaid. The exporter assembles per-page hOCR and feeds it through hocr-pdf (bundled with the pdf extra). The result opens in any PDF reader and is greppable.

from scriva.export import pdf as export_pdf

export_pdf(
    "out.pdf",
    source=None,                  # Path | Document | None: the image/PDF; None re-uses doc.source
    dpi=300,                      # int
    text_visibility="invisible",  # "invisible" | "subtle" | "visible"
)

text_visibility="invisible" is the standard searchable-PDF behaviour (text is selectable but not rendered). "subtle" makes the text faintly visible — useful for proofing detection alignment. "visible" renders the recognised text over the image as a debugging view.

The equivalent verb on a result:

result.to_pdf("out.pdf", source="scan.pdf")

`export.samples`

Persists one Sample per recognised region to a SampleStore. The exporter is the bridge between a normal OCR run and the training / annotation loop:

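import scriva
from scriva import samples
from scriva.export import samples as export_samples

# morphological_grid, openai and confidence_score are stage factories; their imports are omitted here.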

pipeline = scriva.Pipeline(
    morphological_grid(),
    openai(model="gpt-4o"),
    confidence_score.rendering(),
    export_samples(
        store=samples.fs(".scriva_samples"),
        label_from="recognition",     # "recognition" | "none" | Callable
        embed=None,                   # ImageEmbedder | None — required if store has no EMBEDDED_INDEX
        select=None,                  # set[str] | None — which Recognition fields to persist
    ),
)

label_from:

"recognition" (default) — the primary recognizer’s text becomes the sample’s label. Right for “I’ll correct the wrong ones later” workflows.
"none" — leave label=None. The sample is unlabelled training data awaiting annotation.
A Callable[[Recognition], str | None] — application-specific logic (e.g. only label when confidence > 0.9).
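The three forms can be illustrated with a small dispatch function. This is a sketch under stated assumptions: `Recognition` is a minimal stand-in, and `resolve_label` and `confident_only` are hypothetical names that are not part of scriva.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Union


@dataclass
class Recognition:
    """Hypothetical stand-in for scriva's Recognition."""
    text: str
    confidence: Optional[float]


LabelFrom = Union[str, Callable[[Recognition], Optional[str]]]


def confident_only(r: Recognition) -> Optional[str]:
    """Label a sample only when the recognizer is confident (> 0.9)."""
    if r.confidence is not None and r.confidence > 0.9:
        return r.text
    return None


def resolve_label(r: Recognition, label_from: LabelFrom = "recognition") -> Optional[str]:
    """Illustrative dispatch over the three documented label_from forms."""
    if callable(label_from):
        return label_from(r)
    if label_from == "recognition":
        return r.text
    if label_from == "none":
        return None
    raise ValueError(f"unsupported label_from: {label_from!r}")
```

A callable such as `confident_only` would then be passed to the exporter as `label_from=confident_only`, leaving low-confidence samples unlabelled for later annotation.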

The exporter participates in the event stream — one export/progress per persisted sample — and respects page_concurrency. See domains.annotation for the canonical annotation pipeline that builds on it.

`json_` schema

The schema is stable and versioned:

{
  "scriva_version": "0.1.0",
  "schema_version": 1,
  "document": { "source": "scan.png", "pages": 1 },
  "pages": [
    {
      "index": 0,
      "size": [2480, 3508],
      "regions": [
        {
          "id": "p0-r0",
          "bbox": [10, 20, 200, 50],
          "role": "data",
          "grid": { "row": 0, "col": 0, "rowspan": 1, "colspan": 1 },
          "recognition": {
            "text": "Hello",
            "confidence": 0.92,
            "language": "en",
            "source": "openai:gpt-4o",
            "cache": null
          }
        }
      ]
    }
  ]
}

The JSON output round-trips losslessly through DocumentResult.from_json().
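Because the schema is versioned, a downstream consumer can check schema_version before reading regions. The following is a minimal sketch using only the standard library. The helper name, the 0.95 threshold, and the decision to reject unknown versions are assumptions for illustration.

```python
import json

# A compact document following the schema shown above.
EXPORTED = """
{
  "scriva_version": "0.1.0",
  "schema_version": 1,
  "document": {"source": "scan.png", "pages": 1},
  "pages": [
    {
      "index": 0,
      "size": [2480, 3508],
      "regions": [
        {"id": "p0-r0", "bbox": [10, 20, 200, 50], "role": "data",
         "grid": {"row": 0, "col": 0, "rowspan": 1, "colspan": 1},
         "recognition": {"text": "Hello", "confidence": 0.92,
                         "language": "en", "source": "openai:gpt-4o", "cache": null}}
      ]
    }
  ]
}
"""

SUPPORTED_SCHEMA = 1


def low_confidence_regions(payload: dict, threshold: float = 0.95) -> list:
    """Return (region id, text) pairs whose confidence is below the threshold."""
    if payload.get("schema_version") != SUPPORTED_SCHEMA:
        raise ValueError(f"unexpected schema_version: {payload.get('schema_version')!r}")
    flagged = []
    for page in payload["pages"]:
        for region in page["regions"]:
            rec = region.get("recognition") or {}
            conf = rec.get("confidence")
            # Regions without a confidence value are skipped, not flagged.
            if conf is not None and conf < threshold:
                flagged.append((region["id"], rec.get("text")))
    return flagged


payload = json.loads(EXPORTED)
flagged = low_confidence_regions(payload)
```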

Selecting fields

Most exporters take a select argument that controls which Recognition fields make it into the output. Useful for sharing results without debug fields such as source and cache:

result.to_json(select={"text", "confidence"})
export.json_("out.json", select={"text", "confidence"})
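As a rough illustration of the effect on a single recognition object, the sketch below treats select as a whitelist of field names. `apply_select` is a hypothetical helper, and whether structural fields such as id, bbox, or grid are affected is not stated here, so the sketch touches only the recognition fields.

```python
from typing import Optional, Set


def apply_select(recognition: dict, select: Optional[Set[str]] = None) -> dict:
    """Keep only the selected Recognition fields; None keeps everything."""
    if select is None:
        return dict(recognition)
    return {key: value for key, value in recognition.items() if key in select}


recognition = {
    "text": "Hello",
    "confidence": 0.92,
    "language": "en",
    "source": "openai:gpt-4o",
    "cache": None,
}
trimmed = apply_select(recognition, {"text", "confidence"})
```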

Writing your own

Subclass Exporter:

from scriva import Exporter, DocumentResult, ExportArtifact

class MyExporter(Exporter):
    async def export(self, result: DocumentResult) -> ExportArtifact:
        # render() is your own function that turns the result into bytes
        return ExportArtifact.bytes(render(result), suffix=".bin")

…or decorate a function:

from scriva import exporter, ExportArtifact

@exporter
async def my_exporter(result):
    # render() is your own function that turns the result into bytes
    return ExportArtifact.bytes(render(result), suffix=".bin")

The artifact can be a path (.path(...)), in-memory bytes (.bytes(...)), or a stream (.stream(...)). The pipeline does not care which.
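Both examples above call a render(result) helper that is left to the caller. One possible shape is sketched here, assuming result.to_json() returns a plain dict as listed under the result verbs. `FakeResult` is a hypothetical stand-in used only to exercise the function.

```python
import json


def render(result) -> bytes:
    """Serialise a result to UTF-8 JSON bytes via its to_json() verb."""
    payload = result.to_json()  # documented above as returning a dict
    # Sorted keys and compact separators give byte-stable output for identical input.
    text = json.dumps(payload, ensure_ascii=False, sort_keys=True, separators=(",", ":"))
    return text.encode("utf-8")


class FakeResult:
    """Hypothetical stand-in for DocumentResult, used only to exercise render()."""

    def to_json(self) -> dict:
        return {"schema_version": 1, "pages": []}


data = render(FakeResult())
```

Deterministic bytes make the artifact easy to diff or hash between runs. A real exporter would replace the JSON encoding with its own format.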
