Quickstart
A five-minute tour. We’ll go from the one-liner to a full custom pipeline that OCRs a scanned tabular form and writes Excel.
Install
    pip install "scriva[openai,excel]"

Extras pull in optional engine adapters: openai, anthropic, bedrock,
tesseract, excel, pdf, vector-cache. The core library has no
mandatory dependency on any OCR engine.
Set credentials
    export OPENAI_API_KEY=sk-...

The one-liner
    import scriva
    text = scriva.read("scan.png")

scriva.read picks a sensible detector + recognizer based on the extras you
installed and returns the recognised text. A few common knobs:
    scriva.read("scan.png")                     # -> str
    scriva.read("scan.pdf", as_="json")         # -> dict
    scriva.read("scan.png", as_="dataframe")    # -> pandas.DataFrame
    scriva.read("scan.png", to="out.xlsx")      # writes file, returns Path
    scriva.read("scan.png", language="ja", recognizer="anthropic")

That is the entire API for the 80% case. The rest of this page is for when you want to inspect, swap, or extend.
Loading PDFs
For multi-page inputs the loader exposes a few knobs:
    from scriva import Document

    # Keyword arguments are shown with their default values.
    doc = Document.load(
        "scan.pdf",
        pages=None,          # Sequence[int] | slice | None; None = all pages
        dpi=300,             # int; raster resolution
        password=None,       # str | None; for encrypted PDFs
        renderer="pdfium",   # "pdfium" or "poppler"
    )

PDFs without an embedded text layer are rasterised at dpi and then
treated as page images by the rest of the pipeline. PDFs with a text
layer are still rasterised — scriva does not currently short-circuit on
embedded text; treat that as an OCR cache, not a substitute. The pdf
extra installs pypdfium2; renderer="poppler" requires the poppler
binary on $PATH.
pages accepts a list ([0, 2, 5]) or a slice (slice(0, 10)). One-page
PDFs do not need this — pipeline("file.pdf") always works.
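
To make the pages argument concrete, here is a minimal sketch of how a loader could turn a list, a slice, or None into page indices. It is not scriva's implementation: resolve_pages is a hypothetical helper, and zero-based indexing and the out-of-range behaviour are assumptions.

```python
from __future__ import annotations

from collections.abc import Sequence


def resolve_pages(pages: Sequence[int] | slice | None, page_count: int) -> list[int]:
    """Turn a `pages` argument into a concrete list of zero-based page indices."""
    if pages is None:
        # None means every page in the document.
        return list(range(page_count))
    if isinstance(pages, slice):
        # slice.indices() clamps start and stop to the document length.
        return list(range(*pages.indices(page_count)))
    # Explicit sequence: reject bad indices instead of silently dropping them
    # (an assumption; a real loader might choose differently).
    for index in pages:
        if not 0 <= index < page_count:
            raise IndexError(f"page {index} out of range for a {page_count}-page document")
    return list(pages)
```

With this sketch, slice(0, 10) on a four-page file simply yields all four pages, while a list that names a missing page raises.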
Build a pipeline
    import scriva
    from scriva.preprocess import deskew
    from scriva.detect import morphological_grid
    from scriva.classify import rule_based
    from scriva.recognize import openai
    from scriva.postprocess import dictionary, rule_splitter
    from scriva.export import excel

    pipeline = scriva.Pipeline(
        deskew(),
        morphological_grid(),
        rule_based(blank_density=0.002),
        openai(model="gpt-4o", max_concurrency=8, cache=".scriva_cache"),
        dictionary.from_yaml("corrections.yaml"),
        rule_splitter(),
        excel("out.xlsx"),
        page_concurrency=4,
    )

    result = pipeline("scan.png")
    print(f"Recognised {len(result.regions_with_text())} cells, "
          f"mean confidence {result.mean_confidence:.2f}")

A few things to notice:
- Positional, ordered. Pipeline(*stages) slots each stage into its phase by Protocol (Preprocessor, LayoutDetector, Recognizer, …). Order matters only within a phase.
- pipeline(doc) is sync. Internally async; the call blocks until done. Use await pipeline.aio(doc) from an async caller, or async for page in pipeline.stream(doc): … for streaming.
- Cache is a string. cache=".scriva_cache" becomes a FileSystemCache. See caching.md for Cache.layered(...), Redis, and friends.
- scriva.preprocess, scriva.detect, … are factory namespaces. Lowercase, callable, return configured stage instances. The class form (scriva.detectors.MorphologicalGridDetector) is still there for subclassing.
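
The first two points can be illustrated with a toy model. The sketch below is not scriva's code: ToyPipeline, the two stand-in protocols, and the Strip and Bracket stages are all hypothetical, and they operate on plain strings. It only shows one way to bucket positional stages into phases with runtime-checkable protocols and to put a blocking call in front of an async run.

```python
import asyncio
from typing import Protocol, runtime_checkable


@runtime_checkable
class Preprocessor(Protocol):
    async def preprocess(self, doc: str) -> str: ...


@runtime_checkable
class Recognizer(Protocol):
    async def recognize(self, doc: str) -> str: ...


# Phase order is fixed by the pipeline, not by the caller.
PHASES = (Preprocessor, Recognizer)


class ToyPipeline:
    def __init__(self, *stages):
        # Bucket each stage under the first protocol it satisfies; inside a
        # bucket, stages keep the order in which they were passed.
        self._by_phase = {phase: [] for phase in PHASES}
        for stage in stages:
            for phase in PHASES:
                if isinstance(stage, phase):
                    self._by_phase[phase].append(stage)
                    break
            else:
                raise TypeError(f"{stage!r} matches no known phase")

    async def aio(self, doc: str) -> str:
        for stage in self._by_phase[Preprocessor]:
            doc = await stage.preprocess(doc)
        for stage in self._by_phase[Recognizer]:
            doc = await stage.recognize(doc)
        return doc

    def __call__(self, doc: str) -> str:
        # Synchronous facade: block until the async run completes.
        return asyncio.run(self.aio(doc))


class Strip:
    async def preprocess(self, doc: str) -> str:
        return doc.strip()


class Bracket:
    async def recognize(self, doc: str) -> str:
        return f"[{doc}]"


# The recognizer is passed first but still runs after the preprocessor.
result = ToyPipeline(Bracket(), Strip())("  scan  ")
```

asyncio.run raises if an event loop is already running in the current thread, which is one plausible reason for offering a separate awaitable entry point to async callers.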
Get text out
The DocumentResult knows how to serialise itself, so you don’t need to
wire an exporter just to print things:
    result.render()              # str, reading-order text
    result.to_json()             # dict
    result.to_dataframe()        # pandas
    result.to_excel("out.xlsx")  # writes file, returns Path
    result.show()                # matplotlib viz of boxes (optional extra)

Use the in-pipeline excel(...) exporter when you want the write to happen
during the run (it emits an event and respects page_concurrency); use
result.to_excel(...) for ad-hoc serialisation.
Inspect what happened
Every stage emits structured events. Pass a callback to see them live:
    pipeline("scan.png", on_event=print)

Or stream them yourself:
    # Run this inside an async function.
    async for event in pipeline.events("scan.png"):
        print(event.stage, event.kind, event.payload)

Sample output:
    preprocess    started
    preprocess    finished  ms=42
    detect_grid   finished  rows=12 cols=6 cells=72
    classify      finished  blank=14 merged=3
    recognize     progress  done=10 total=58 cache_hits=3
    recognize     finished  ms=4910
    post_process  finished  stage=dictionary corrected=8
    export        finished  path=out.xlsx

See Pipeline › Observing for SSE and async-iterator patterns.
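
To show the shape of the callback pattern, here is a minimal sketch of a stage reporting its lifecycle through an on_event hook. The Event dataclass mirrors the stage, kind, and payload fields printed above, but its definition and the run_stage helper are assumptions, not scriva's actual types.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Any, Callable


@dataclass(frozen=True)
class Event:
    stage: str
    kind: str  # "started", "progress", or "finished"
    payload: dict[str, Any] = field(default_factory=dict)


def run_stage(
    name: str,
    work: Callable[[], dict[str, Any]],
    on_event: Callable[[Event], None] | None = None,
) -> None:
    """Run one stage, reporting its lifecycle through an optional callback."""
    emit = on_event or (lambda event: None)
    emit(Event(name, "started"))
    payload = work()  # the stage returns its summary counters
    emit(Event(name, "finished", payload))


# Collect events in a list instead of printing them.
seen: list[Event] = []
run_stage("detect_grid", lambda: {"rows": 12, "cols": 6, "cells": 72}, on_event=seen.append)
```

Any callable that accepts one event works as the hook, so print, list.append, or a logger method can be passed without an adapter.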
Swap any piece
    from scriva.recognize import anthropic
    pipeline.replace("recognize", anthropic(model="claude-opus-4-7"))

Want a different layout detector? Implement the protocol in five lines:
    from scriva import LayoutDetector, Layout, Page

    class MyDetector(LayoutDetector):
        async def detect(self, page: Page) -> Layout:
            ...

…or skip the class entirely with the decorator:
    from scriva import detector, Layout, Region

    # my_model is your own detection model, defined elsewhere.
    @detector
    async def my_detector(page):
        boxes = my_model.predict(page.image)
        return Layout.from_regions(
            [Region(bbox=b, role="data") for b in boxes],
            page=page,
        )
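
For readers curious how a decorator can stand in for a class, the sketch below wraps an async function in an object that exposes a detect() method. It is a guess at the general pattern, not scriva's implementation: FunctionDetector and this local detector are hypothetical stand-ins, and the page is modelled as a (width, height) tuple.

```python
import asyncio
from typing import Awaitable, Callable


class FunctionDetector:
    """Adapter exposing a plain async function through a detect() method."""

    def __init__(self, func: Callable[[object], Awaitable[object]]):
        self._func = func
        self.__name__ = func.__name__

    async def detect(self, page):
        return await self._func(page)


def detector(func):
    # Return an object with the detect() method a pipeline would look for.
    return FunctionDetector(func)


@detector
async def whole_page(page):
    # Hypothetical layout: a single region covering the full page.
    width, height = page
    return [(0, 0, width, height)]


layout = asyncio.run(whole_page.detect((800, 600)))
```

After decoration, whole_page is an adapter instance rather than a function, so it can be passed wherever a class-based detector with the same method would be accepted.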