Sample stores

A sample store persists labelled crops — the input + output of a recognition, with optional human/oracle correction. It is to labelled crops what Cache is to recognised crops: a swappable protocol whose adapters back onto filesystem / sqlite / pgvector / S3.

Four parts of the library use sample stores:

export.samples writes one sample per recognised region as a pipeline stage.
classify.embedding.train reads samples to fit a RegionClassifier.
RecognitionHint.from_store retrieves nearest-neighbour samples to populate few-shot examples in the recognizer prompt.
postprocess.dictionary.from_samples derives a correction dictionary from labelled samples whose label differs from recognition.text.

SampleStore is opt-in. A pipeline without one behaves exactly as before — no on-disk side effects.

Two roads from a SampleStore

A SampleStore is the canonical “user-environment” surface for accuracy improvement. Once you keep one — typically ./.scriva_samples/ checked in alongside the project, or a shared pgvector index for a team — corrections you make today raise accuracy on every run after.

From the same store, scriva derives two complementary improvements:

                  ┌──────────────────────────────────────────────────┐
                  │             SampleStore (your env)               │
                  │   crop + recognition.text + label (correction)   │
                  └────────────────┬──────────────┬──────────────────┘
                                   │              │
        few-shot exemplars  ◄──────┘              └──────►  derived dictionary
        (input side, before                                  (output side, after
        recognition runs)                                    recognition runs)
                  │                                                  │
                  ▼                                                  ▼
   RecognitionHint.from_store(store,            postprocess.dictionary.from_samples(store)
       near=region, k=5)                            inserted after recognize

Few-shot road (input side). For each region the recognizer is about to OCR, the store retrieves its nearest labelled neighbours and splices them into the prompt as exemplars. The model sees “here’s what cells that look like this should read as.” Best when crops are visually distinctive (handwriting styles, stamps, language-specific glyphs).
Dictionary road (output side). Pairs (recognition.text → label) observed in the store become a supervised dictionary. The post-processor rewrites the same OCR errors on future runs without the recognizer ever knowing. Best when errors are systematic (ENCL0SURE → ENCLOSURE, Acme Ind. → Acme Industries, Inc.).
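As a rough illustration of the dictionary road, the sketch below counts (recognition text → label) pairs and keeps the corrections seen at least min_observations times. It is not scriva's implementation: derive_dictionary and the plain tuples standing in for samples are hypothetical, and the real dictionary.from_samples may resolve conflicting corrections differently.

```python
from collections import Counter


def derive_dictionary(pairs, min_observations=2):
    """Build {ocr_text: corrected_text} from (recognised, label) pairs.

    `pairs` is a hypothetical stand-in for what a store would yield.
    Unlabelled samples (label is None) and already-correct samples are skipped.
    """
    counts = Counter(
        (recognised, label)
        for recognised, label in pairs
        if label is not None and label != recognised
    )
    corrections = {}
    # most_common() yields the highest counts first, so the first correction
    # kept for a given OCR text is also its most frequently observed one.
    for (recognised, label), seen in counts.most_common():
        if seen >= min_observations and recognised not in corrections:
            corrections[recognised] = label
    return corrections


observed = [
    ("ENCL0SURE", "ENCLOSURE"),
    ("ENCL0SURE", "ENCLOSURE"),
    ("Acme Ind.", "Acme Industries, Inc."),  # seen once: below the threshold
    ("Total", "Total"),                      # already correct: ignored
    ("N/A", None),                           # unlabelled: ignored
]
corrections = derive_dictionary(observed, min_observations=2)
```

The threshold is what keeps a one-off reviewer typo from becoming a rule applied to every future run.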

The two roads are complementary, not competing. Use them together and the store earns its keep twice per run:

import scriva
from scriva import samples, RecognitionHint
from scriva.recognize import openai
from scriva.postprocess import dictionary, whitespace, confidence_score
from scriva.prompts import Prompt
from scriva.detect import morphological_grid

store = samples.layered(
    fs=samples.fs(".scriva_samples"),
    index=samples.pgvector("postgresql://localhost/scriva", dim=1536),
)

pipeline = scriva.Pipeline(
    morphological_grid(),
    openai(
        model="gpt-4o",
        prompt=Prompt.ocr_few_shot(),                              # input road
        hints=lambda region: RecognitionHint.from_store(store, near=region, k=5),
    ),
    whitespace(),
    dictionary.from_samples(store, min_observations=2),            # output road
    confidence_score.rendering(),
)

Every correction a reviewer makes against the store strengthens both roads. That is the loop the annotation domain pack automates.

The shortest path

from scriva import samples

dsn = "postgresql://localhost/scriva"        # Postgres connection string

# Each assignment below is an alternative; pick one.
store = samples.fs(".scriva_samples")        # filesystem
store = samples.sqlite("samples.db")         # sqlite with embedding BLOB
store = samples.pgvector(dsn, dim=1536)      # Postgres + pgvector
store = samples.layered(                     # storage + nearest-neighbour split
    fs=samples.fs(".scriva_samples"),
    index=samples.pgvector(dsn, dim=1536),
)

When a stage takes a store= argument, a string path is shorthand for samples.fs(path) — same convention as cache=.
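The shorthand amounts to a small coercion step at the top of any stage that accepts store=. A minimal sketch, assuming nothing about scriva's internals: FsStoreStandIn and resolve_store are hypothetical names used only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FsStoreStandIn:
    """Hypothetical stand-in for the object samples.fs(path) returns."""
    path: str


def resolve_store(store):
    """Turn a string path into a filesystem store; pass real stores through."""
    if isinstance(store, str):
        return FsStoreStandIn(store)
    return store


from_path = resolve_store(".scriva_samples")
explicit = FsStoreStandIn("elsewhere")
passed_through = resolve_store(explicit)
```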

What a sample carries

class Sample(BaseModel):
    id: SampleId
    crop: bytes                       # PNG bytes of the region
    label: str | None                 # human/oracle text; None until annotated
    recognition: Recognition          # what the primary recognizer said
    embedding: np.ndarray | None      # filled by the store on put() if it has one
    source: SampleSource              # document, page index, region_id, run_id, timestamp
    attrs: dict[str, Any] = {}        # open bag — role, language, app-specific tags

label is intentionally separate from recognition.text: the recognizer’s guess and the ground-truth correction are different things, and most workflows want both. A sample with label is None is unlabelled raw training data; setting label to a string promotes it to a labelled example.

Sample is immutable. Derive new ones with .with_label(...) / .replace(**kwargs).
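The derive-instead-of-mutate pattern can be sketched with a frozen dataclass. This is an analogy only: the real Sample is a pydantic model with more fields, and SketchSample and its recognised_text field are hypothetical stand-ins.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class SketchSample:
    """Hypothetical, trimmed-down stand-in for Sample."""
    id: str
    crop: bytes
    recognised_text: str
    label: Optional[str] = None  # None until a human or oracle annotates it

    def with_label(self, label: str) -> "SketchSample":
        # Returns a new object; the original is left untouched.
        return replace(self, label=label)


raw = SketchSample(id="s1", crop=b"\x89PNG", recognised_text="ENCL0SURE")
labelled = raw.with_label("ENCLOSURE")
```

Because raw is never modified, a store or cache that still holds a reference to it cannot observe a half-annotated sample.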

Protocol

class SampleStore(Engine, Protocol):
    name: str

    async def put(self, sample: Sample) -> SampleId: ...
    async def get(self, id: SampleId) -> Sample | None: ...
    async def find(
        self,
        *,
        where: Callable[[Sample], bool] | None = None,
        near: bytes | Region | None = None,   # nearest-neighbour over embedding
        limit: int = 50,
    ) -> Sequence[Sample]: ...
    async def remove(self, id: SampleId) -> None: ...

The same Engine machinery used for recognizers and caches gives sample stores a capabilities set. A store that does not implement embedding search does not declare Capability.NEAREST_NEIGHBOUR, and construction reports an actionable error when something asks for find(near=...) against it, before you load anything.

Capability	Meaning
`NEAREST_NEIGHBOUR`	Supports `find(near=...)`
`PERSISTENT`	Survives process restart
`EMBEDDED_INDEX`	Computes embeddings on `put` without an external call
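A construction-time capability check can be pictured roughly as follows. Only the three capability names come from the table above; the Capability enum shape, MissingCapabilityError, require, and FsLikeStore are hypothetical, and scriva's actual error type and message may differ.

```python
from enum import Enum, auto


class Capability(Enum):
    NEAREST_NEIGHBOUR = auto()
    PERSISTENT = auto()
    EMBEDDED_INDEX = auto()


class MissingCapabilityError(TypeError):
    """Hypothetical error raised while wiring a pipeline, not at query time."""


def require(store, needed: Capability) -> None:
    # Fail early, with a message that says what to change.
    if needed not in store.capabilities:
        raise MissingCapabilityError(
            f"store {store.name!r} does not declare {needed.name}; "
            "use a store that does, such as a pgvector-backed one"
        )


class FsLikeStore:
    """Stand-in for a store that only persists samples."""
    name = "fs"
    capabilities = frozenset({Capability.PERSISTENT})


try:
    require(FsLikeStore(), Capability.NEAREST_NEIGHBOUR)
    error_message = None
except MissingCapabilityError as exc:
    error_message = str(exc)
```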

Built-in adapters

Factory	Class	Backend	Extra	Capabilities
`samples.fs(path)`	`FileSystemSampleStore`	crops on disk + `samples.jsonl`	core	`PERSISTENT`
`samples.sqlite(path)`	`SqliteSampleStore`	one row per sample, BLOB crops	core	`PERSISTENT`
`samples.pgvector(dsn, dim=...)`	`PgvectorSampleStore`	Postgres + pgvector	`pgvector`	`PERSISTENT`, `NEAREST_NEIGHBOUR`
`samples.layered(fs=, index=)`	`LayeredSampleStore`	one store for bytes, another for embeddings	core	union of children
`samples.memory()`	`InMemorySampleStore`	dict + in-memory NumPy index	core	`NEAREST_NEIGHBOUR` (no persistence)

samples.layered(...) is the common production shape: filesystem for the crops, pgvector for nearest-neighbour. Lookups dispatch to whichever child has the relevant capability.
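To make the dispatch concrete, here is a toy layered store, written under the assumption that each child exposes a capabilities set. All class and method names (CropChild, IndexChild, Layered, nearest_ids, find_near) are hypothetical, capabilities are plain strings for brevity, and the distance metric is ordinary squared Euclidean distance rather than whatever pgvector is configured to use.

```python
import asyncio


class CropChild:
    """Stand-in for the child that holds crop bytes."""
    capabilities = frozenset({"PERSISTENT"})

    def __init__(self):
        self._crops = {}

    async def put(self, sample_id, crop, embedding):
        self._crops[sample_id] = crop

    async def get(self, sample_id):
        return self._crops.get(sample_id)


class IndexChild:
    """Stand-in for the child that answers nearest-neighbour queries."""
    capabilities = frozenset({"PERSISTENT", "NEAREST_NEIGHBOUR"})

    def __init__(self):
        self._vectors = {}

    async def put(self, sample_id, crop, embedding):
        self._vectors[sample_id] = embedding

    async def nearest_ids(self, query, limit):
        def squared_distance(item):
            return sum((a - b) ** 2 for a, b in zip(item[1], query))
        ranked = sorted(self._vectors.items(), key=squared_distance)
        return [sample_id for sample_id, _ in ranked[:limit]]


class Layered:
    def __init__(self, fs, index):
        self.children = (fs, index)
        # The layered store advertises the union of its children's capabilities.
        self.capabilities = fs.capabilities | index.capabilities

    def _child_with(self, capability):
        for child in self.children:
            if capability in child.capabilities:
                return child
        raise LookupError(f"no child declares {capability}")

    async def put(self, sample_id, crop, embedding):
        for child in self.children:
            await child.put(sample_id, crop, embedding)

    async def find_near(self, query, limit=5):
        index = self._child_with("NEAREST_NEIGHBOUR")
        ids = await index.nearest_ids(query, limit)
        crops = self.children[0]  # the bytes live in the first child
        return [(sample_id, await crops.get(sample_id)) for sample_id in ids]


async def demo():
    store = Layered(CropChild(), IndexChild())
    await store.put("a", b"crop-a", [0.0, 0.0])
    await store.put("b", b"crop-b", [1.0, 1.0])
    await store.put("c", b"crop-c", [5.0, 5.0])
    return store.capabilities, await store.find_near([0.9, 1.1], limit=2)


capabilities, neighbours = asyncio.run(demo())
```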

Computing embeddings

If your store has EMBEDDED_INDEX, it embeds on put. Otherwise pass an embedder when constructing the store, or write sample.embedding yourself before put:

from scriva import samples
from scriva.embedders import OpenAIEmbedder

store = samples.pgvector(
    dsn,                                     # Postgres connection string
    dim=1536,                                # vector size the pgvector column stores
    embedder=OpenAIEmbedder(model="text-embedding-3-small"),
)

The embedder protocol is the same one used by Cache.vector and classify.embedding — any ImageEmbedder works.
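The embed-on-put behaviour can be sketched as below. This is an assumption-laden illustration, not the library's code: SketchSample, toy_embedder, and EmbeddingStore are hypothetical, the toy embedder is not a real image embedder, and raising when neither an embedding nor an embedder is available is a guess at reasonable behaviour.

```python
import asyncio
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class SketchSample:
    """Hypothetical, trimmed-down stand-in for Sample."""
    id: str
    crop: bytes
    embedding: Optional[tuple] = None


def toy_embedder(crop: bytes) -> tuple:
    """Stand-in for an ImageEmbedder: (length, mean byte value)."""
    return (float(len(crop)), sum(crop) / max(len(crop), 1))


class EmbeddingStore:
    def __init__(self, embedder=None):
        self._embedder = embedder
        self._rows = {}

    async def put(self, sample):
        if sample.embedding is None:
            if self._embedder is None:
                # Assumed behaviour: refuse rather than store an unsearchable row.
                raise ValueError("sample has no embedding and the store has no embedder")
            sample = replace(sample, embedding=self._embedder(sample.crop))
        self._rows[sample.id] = sample
        return sample.id

    async def get(self, sample_id):
        return self._rows.get(sample_id)


async def demo():
    store = EmbeddingStore(embedder=toy_embedder)
    await store.put(SketchSample("auto", b"\x01\x03"))                      # embedded by the store
    await store.put(SketchSample("manual", b"\x09", embedding=(9.0, 9.0)))  # caller-supplied
    return await store.get("auto"), await store.get("manual")


auto_sample, manual_sample = asyncio.run(demo())
```

A caller-supplied embedding is kept as-is, which matches the option of writing sample.embedding yourself before put.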

What not to store

Cached recognitions. That is what Cache is for. A cache keys on the crop hash; a sample store keys on identity and embedding. They look similar but solve different problems.
Pipeline events. Subscribe to the event stream and write your own log; events are not regions.
Raw pages without a region. Samples are crops. Persist whole pages through Document plugins or your own application layer.

Writing your own backend

Implement the protocol. Two common patterns:

Redis + S3. S3 for crops, Redis for the JSONL index. Five lines on top of redis.asyncio + aioboto3.
MinIO + DuckDB. Same shape, fully on-prem. DuckDB’s vector extension handles find(near=...) once your store declares Capability.NEAREST_NEIGHBOUR.

The protocol is intentionally narrow so that new backends stay easy to write.
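For a sense of scale, a dict-backed store covering the four protocol methods might look like the sketch below. It is deliberately not wired to scriva: SketchSample and DictSampleStore are hypothetical, the Engine base and capabilities declaration are omitted, and find(near=...) is rejected because this toy has no embedding index.

```python
import asyncio
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SketchSample:
    """Hypothetical, trimmed-down stand-in for Sample."""
    id: str
    crop: bytes
    label: Optional[str] = None


class DictSampleStore:
    name = "dict"

    def __init__(self):
        self._rows = {}

    async def put(self, sample):
        self._rows[sample.id] = sample
        return sample.id

    async def get(self, id):
        return self._rows.get(id)

    async def find(self, *, where=None, near=None, limit=50):
        if near is not None:
            # This sketch has no index, so it must not claim nearest-neighbour.
            raise NotImplementedError("this store does not support find(near=...)")
        matches = [s for s in self._rows.values() if where is None or where(s)]
        return matches[:limit]

    async def remove(self, id):
        self._rows.pop(id, None)


async def demo():
    store = DictSampleStore()
    await store.put(SketchSample("s1", b"a", label="ENCLOSURE"))
    await store.put(SketchSample("s2", b"b"))
    await store.put(SketchSample("s3", b"c", label="Total"))
    labelled = await store.find(where=lambda s: s.label is not None, limit=1)
    await store.remove("s2")
    return labelled, await store.get("s2"), await store.get("s3")


labelled, removed, kept = asyncio.run(demo())
```

A real backend would swap the dict for network calls and declare its capabilities, but the method surface stays the same.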