Quickstart

A five-minute tour. We’ll go from the one-liner to a full custom pipeline that OCRs a scanned tabular form and writes the result to an Excel file.

Terminal window
pip install "scriva[openai,excel]"

Extras pull in optional engine adapters: openai, anthropic, bedrock, tesseract, excel, pdf, vector-cache. The core library has no mandatory dependency on any OCR engine.

Terminal window
export OPENAI_API_KEY=sk-...

import scriva
text = scriva.read("scan.png")

scriva.read picks a sensible detector + recognizer based on the extras you installed and returns the recognised text. A few common knobs:

scriva.read("scan.png") # -> str
scriva.read("scan.pdf", as_="json") # -> dict
scriva.read("scan.png", as_="dataframe") # -> pandas.DataFrame
scriva.read("scan.png", to="out.xlsx") # writes file, returns Path
scriva.read("scan.png", language="ja", recognizer="anthropic")

That is the entire API for the 80% case. The rest of this page is for when you want to inspect, swap, or extend.

For multi-page inputs the loader exposes a few knobs:

from scriva import Document

# Every keyword below is shown with its default value.
doc = Document.load(
    "scan.pdf",
    pages=None,          # Sequence[int] | slice | None; None = all pages
    dpi=300,             # int; raster resolution for non-text PDFs
    password=None,       # str | None; for encrypted PDFs
    renderer="pdfium",   # "pdfium" or "poppler"
)

PDFs without an embedded text layer are rasterised at dpi and then treated as page images by the rest of the pipeline. PDFs with a text layer are still rasterised — scriva does not currently short-circuit on embedded text; treat that as an OCR cache, not a substitute. The pdf extra installs pypdfium2; renderer="poppler" requires the poppler binary on $PATH.
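
To get a feel for what dpi costs you in pixels, here is a rough back-of-the-envelope sketch. It relies only on the fact that PDF page sizes are measured in points (72 per inch); the helper name raster_size and the round-to-nearest behaviour are assumptions for illustration, not scriva's actual renderer logic.

```python
# Sketch: pixel dimensions of a PDF page rasterised at a given dpi.
# PDF user space is measured in points, 72 points per inch.
POINTS_PER_INCH = 72


def raster_size(width_pt: float, height_pt: float, dpi: int = 300) -> tuple[int, int]:
    """Return (width_px, height_px). Rounding to nearest is an assumption."""
    width_px = round(width_pt * dpi / POINTS_PER_INCH)
    height_px = round(height_pt * dpi / POINTS_PER_INCH)
    return width_px, height_px


# US Letter is 612 x 792 points (8.5 x 11 inches).
print(raster_size(612, 792))           # (2550, 3300)
print(raster_size(612, 792, dpi=150))  # (1275, 1650)
```

Halving dpi quarters the pixel count, which is usually the first thing to try when large scans are slow.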

pages accepts a list ([0, 2, 5]) or a slice (slice(0, 10)). One-page PDFs do not need this — pipeline("file.pdf") always works.
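
If it helps to see how a list, a slice, and None can all collapse into the same thing, here is a minimal sketch of page-selection normalisation. resolve_pages is a hypothetical helper, not part of scriva; zero-based indexing and Python-style slice clamping are assumptions.

```python
from collections.abc import Sequence
from typing import Optional, Union

PageSpec = Optional[Union[Sequence[int], slice]]


def resolve_pages(pages: PageSpec, page_count: int) -> list[int]:
    """Expand a page selection into explicit zero-based page indices."""
    if pages is None:
        # None means "every page".
        return list(range(page_count))
    if isinstance(pages, slice):
        # slice.indices clamps the slice to the document length.
        return list(range(*pages.indices(page_count)))
    indices = list(pages)
    for i in indices:
        if not 0 <= i < page_count:
            raise IndexError(f"page {i} out of range for a {page_count}-page document")
    return indices


print(resolve_pages(None, 3))          # [0, 1, 2]
print(resolve_pages([0, 2, 5], 8))     # [0, 2, 5]
print(resolve_pages(slice(0, 10), 4))  # [0, 1, 2, 3]
```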

import scriva
from scriva.preprocess import deskew
from scriva.detect import morphological_grid
from scriva.classify import rule_based
from scriva.recognize import openai
from scriva.postprocess import dictionary, rule_splitter
from scriva.export import excel

pipeline = scriva.Pipeline(
    deskew(),
    morphological_grid(),
    rule_based(blank_density=0.002),
    openai(model="gpt-4o", max_concurrency=8, cache=".scriva_cache"),
    dictionary.from_yaml("corrections.yaml"),
    rule_splitter(),
    excel("out.xlsx"),
    page_concurrency=4,
)

result = pipeline("scan.png")
print(f"Recognised {len(result.regions_with_text())} cells, "
      f"mean confidence {result.mean_confidence:.2f}")

A few things to notice:

  • Positional, ordered. Pipeline(*stages) slots each stage into its phase by Protocol (Preprocessor, LayoutDetector, Recognizer, …). Order matters only within a phase.
  • pipeline(doc) is sync. Internally async; the call blocks until done. Use await pipeline.aio(doc) from an async caller, or async for page in pipeline.stream(doc): … for streaming.
  • Cache is a string. cache=".scriva_cache" becomes a FileSystemCache. See caching.md for Cache.layered(...), Redis, and friends.
  • scriva.preprocess, scriva.detect, … are factory namespaces. Lowercase, callable, return configured stage instances. The class form (scriva.detectors.MorphologicalGridDetector) is still there for subclassing.
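
The first two bullets are easier to picture with a toy model. The sketch below is not scriva's implementation: MiniPipeline, the three protocols, and their method names are stand-ins invented here. It shows one way stages passed in any order could be slotted into phases with runtime-checkable Protocols, and how a blocking call could wrap an async core.

```python
import asyncio
from typing import Protocol, runtime_checkable


@runtime_checkable
class Preprocessor(Protocol):
    async def preprocess(self, page): ...


@runtime_checkable
class LayoutDetector(Protocol):
    async def detect(self, page): ...


@runtime_checkable
class Recognizer(Protocol):
    async def recognize(self, page): ...


# Phase order is fixed; each phase name doubles as the method to call.
PHASES = [("preprocess", Preprocessor), ("detect", LayoutDetector), ("recognize", Recognizer)]


class MiniPipeline:
    def __init__(self, *stages):
        self.phases = {name: [] for name, _ in PHASES}
        for stage in stages:
            for name, proto in PHASES:
                if isinstance(stage, proto):  # structural check: method is present
                    self.phases[name].append(stage)
                    break
            else:
                raise TypeError(f"{stage!r} matches no known phase")

    async def aio(self, page):
        # Run phases in fixed order, stages within a phase in the order given.
        for name, _ in PHASES:
            for stage in self.phases[name]:
                page = await getattr(stage, name)(page)
        return page

    def __call__(self, page):
        # Blocking entry point over the async core.
        return asyncio.run(self.aio(page))


class Strip:
    async def preprocess(self, page):
        return page.strip()


class Upper:
    async def preprocess(self, page):
        return page.upper()


class Tag:
    async def recognize(self, page):
        return f"<{page}>"


# Tag is passed first but still runs last, because it belongs to a later phase.
pipe = MiniPipeline(Tag(), Strip(), Upper())
print(pipe("  scan "))  # <SCAN>
```

Note that asyncio.run cannot be called from inside a running event loop, which is presumably why an awaitable entry point exists alongside the blocking one.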

The DocumentResult knows how to serialise itself, so you don’t need to wire an exporter just to print things:

result.render() # str — reading-order text
result.to_json() # dict
result.to_dataframe() # pandas
result.to_excel("out.xlsx") # writes file, returns Path
result.show() # matplotlib viz of boxes (optional extra)

Use the in-pipeline excel(...) exporter when you want the write to happen during the run (it emits an event and respects page_concurrency); use result.to_excel(...) for ad-hoc serialisation.

Every stage emits structured events. Pass a callback to see them live:

pipeline("scan.png", on_event=print)

Or stream them yourself:

# Run this inside an async function.
async for event in pipeline.events("scan.png"):
    print(event.stage, event.kind, event.payload)

Sample output:

preprocess started
preprocess finished ms=42
detect_grid finished rows=12 cols=6 cells=72
classify finished blank=14 merged=3
recognize progress done=10 total=58 cache_hits=3
recognize finished ms=4910
post_process finished stage=dictionary corrected=8
export finished path=out.xlsx
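
Since on_event takes any callable, you can pass something more useful than print. Here is a small sketch of a callback that keeps per-stage timings. The Event dataclass is a stand-in for whatever scriva actually emits; the stage, kind, and payload attribute names come from the snippet above, while payload being a dict with an "ms" key is an assumption based on the sample output.

```python
from dataclasses import dataclass, field


@dataclass
class Event:
    """Hypothetical stand-in for a pipeline event."""
    stage: str
    kind: str
    payload: dict = field(default_factory=dict)


class TimingCollector:
    """Callable that records the duration reported by each finished stage."""

    def __init__(self):
        self.ms_by_stage = {}

    def __call__(self, event):
        # Only "finished" events that carry a duration are of interest.
        if event.kind == "finished" and "ms" in event.payload:
            self.ms_by_stage[event.stage] = event.payload["ms"]

    def slowest(self):
        return max(self.ms_by_stage, key=self.ms_by_stage.get)


collector = TimingCollector()
for ev in [
    Event("preprocess", "started"),
    Event("preprocess", "finished", {"ms": 42}),
    Event("recognize", "progress", {"done": 10, "total": 58, "cache_hits": 3}),
    Event("recognize", "finished", {"ms": 4910}),
]:
    collector(ev)

print(collector.ms_by_stage)  # {'preprocess': 42, 'recognize': 4910}
print(collector.slowest())    # recognize
```

With a real pipeline you would pass the collector as on_event=collector instead of feeding it events by hand.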

See Pipeline › Observing for SSE and async-iterator patterns.

Want a different recognizer? Replace that phase in place:

from scriva.recognize import anthropic
pipeline.replace("recognize", anthropic(model="claude-opus-4-7"))

Want a different layout detector? Implement the protocol — five lines:

from scriva import LayoutDetector, Layout, Page

class MyDetector(LayoutDetector):
    async def detect(self, page: Page) -> Layout:
        ...

…or skip the class entirely with the decorator:

from scriva import detector, Layout, Region

@detector
async def my_detector(page):
    boxes = my_model.predict(page.image)
    return Layout.from_regions([Region(bbox=b, role="data") for b in boxes],
                               page=page)

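
If you are curious how a decorator can stand in for the class form, here is one plausible shape. This is a sketch, not scriva's source: FunctionDetector and this toy detector function are invented for illustration, and pages and layouts are reduced to plain tuples and lists.

```python
import asyncio
import functools
import inspect


class FunctionDetector:
    """Adapter that gives a plain async function a detect() method."""

    def __init__(self, fn):
        self._fn = fn
        functools.update_wrapper(self, fn)  # keep the function's name and docstring

    async def detect(self, page):
        return await self._fn(page)


def detector(fn):
    """Hypothetical decorator: wrap an async function as a detector object."""
    if not inspect.iscoroutinefunction(fn):
        raise TypeError("detector expects an async function")
    return FunctionDetector(fn)


@detector
async def whole_page(page):
    # Toy layout: one box covering the whole page, given as (width, height).
    width, height = page
    return [(0, 0, width, height)]


print(asyncio.run(whole_page.detect((100, 50))))  # [(0, 0, 100, 50)]
print(whole_page.__name__)                        # whole_page
```

The decorated name is now an object with a detect method, so it can go anywhere a class-based detector would.
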
  • Concepts — the abstractions you just used.
  • Pipeline — the full builder API.
  • Domains — pre-built pipelines you can grab off the shelf.