Sample stores
A sample store persists labelled crops: the input and output of a
recognition, with optional human/oracle correction. It is to labelled
crops what Cache is to recognised crops: a swappable protocol whose
adapters back onto filesystem / sqlite / pgvector / S3.
Four things in one library use sample stores:
- export.samples writes one sample per recognised region as a pipeline stage.
- classify.embedding.train reads samples to fit a RegionClassifier.
- RecognitionHint.from_store retrieves nearest-neighbour samples to populate few-shot examples in the recognizer prompt.
- postprocess.dictionary.from_samples derives a correction dictionary from labelled samples whose label differs from recognition.text.
SampleStore is opt-in. A pipeline without one behaves exactly as
before — no on-disk side effects.
Two roads from a SampleStore
A SampleStore is the canonical “user-environment” surface for
accuracy improvement. Once you keep one — typically ./.scriva_samples/
checked in alongside the project, or a shared pgvector index for a
team — corrections you make today raise accuracy on every run after.
From the same store, scriva derives two complementary improvements:
```text
            ┌──────────────────────────────────────────────────┐
            │              SampleStore (your env)              │
            │   crop + recognition.text + label (correction)   │
            └────────────────┬──────────────┬──────────────────┘
                             │              │
   few-shot exemplars ◄──────┘              └──────► derived dictionary
   (input side, before                               (output side, after
    recognition runs)                                 recognition runs)
            │                                                 │
            ▼                                                 ▼
RecognitionHint.from_store(store,           postprocess.dictionary.from_samples(store)
    near=region, k=5)                       inserted after recognize
```

- Few-shot road (input side). For each region the recognizer is about to OCR, the store retrieves its nearest labelled neighbours and splices them into the prompt as exemplars. The model sees “here’s what cells that look like this should read as.” Best when crops are visually distinctive (handwriting styles, stamps, language-specific glyphs).
- Dictionary road (output side). Pairs (recognition.text → label) observed in the store become a supervised dictionary. The post-processor rewrites the same OCR errors on future runs without the recognizer ever knowing. Best when errors are systematic (ENCL0SURE → ENCLOSURE, Acme Ind. → Acme Industries, Inc.).
The two roads are complementary, not competing. Use them together and the store earns its keep twice per run:
```python
import scriva
from scriva import samples, RecognitionHint
from scriva.recognize import openai
from scriva.postprocess import dictionary, whitespace, confidence_score
from scriva.prompts import Prompt
from scriva.detect import morphological_grid

store = samples.layered(
    fs=samples.fs(".scriva_samples"),
    index=samples.pgvector("postgresql://localhost/scriva", dim=1536),
)

pipeline = scriva.Pipeline(
    morphological_grid(),
    openai(
        model="gpt-4o",
        prompt=Prompt.ocr_few_shot(),  # input road
        hints=lambda region: RecognitionHint.from_store(store, near=region, k=5),
    ),
    whitespace(),
    dictionary.from_samples(store, min_observations=2),  # output road
    confidence_score.rendering(),
)
```

Every correction a reviewer makes against the store strengthens both roads. That is the loop the annotation domain pack automates.
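To make the dictionary road concrete, here is a stand-alone sketch of how a correction dictionary could be derived from stored pairs. It is not scriva's implementation: `derive_dictionary` and the `(recognised_text, label)` tuples are hypothetical stand-ins, and reading `min_observations` as "how many times the same pair must be seen" is an assumption based on the parameter name.

```python
from collections import Counter


def derive_dictionary(pairs, min_observations=2):
    """Build {recognised_text: label} from (recognised_text, label) pairs.

    Hypothetical stand-in; assumes min_observations is a per-pair count.
    """
    counts = Counter(
        (text, label)
        for text, label in pairs
        if label is not None and label != text  # keep only real corrections
    )
    best = {}
    for (text, label), n in counts.items():
        if n < min_observations:
            continue
        # Keep the most frequently observed correction for each OCR output.
        if text not in best or n > best[text][1]:
            best[text] = (label, n)
    return {text: label for text, (label, _) in best.items()}


observed = [
    ("ENCL0SURE", "ENCLOSURE"),
    ("ENCL0SURE", "ENCLOSURE"),
    ("Acme Ind.", "Acme Industries, Inc."),  # seen once: below the threshold
    ("Total", "Total"),                      # recognizer was right: not a correction
    ("Qty", None),                           # unlabelled: ignored
]
corrections = derive_dictionary(observed, min_observations=2)
```

With these inputs only the `ENCL0SURE` rewrite survives. A threshold like this keeps one-off reviewer edits from being applied to every future run.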
The shortest path
```python
from scriva import samples

store = samples.fs(".scriva_samples")      # filesystem
store = samples.sqlite("samples.db")       # sqlite with embedding BLOB
store = samples.pgvector(dsn, dim=1536)    # Postgres + pgvector
store = samples.layered(                   # storage + nearest-neighbour split
    fs=samples.fs(".scriva_samples"),
    index=samples.pgvector(dsn, dim=1536),
)
```

When a stage takes a store= argument, a string path is shorthand for
samples.fs(path), the same convention as cache=.
What a sample carries
```python
class Sample(BaseModel):
    id: SampleId
    crop: bytes                     # PNG bytes of the region
    label: str | None               # human/oracle text; None until annotated
    recognition: Recognition        # what the primary recognizer said
    embedding: np.ndarray | None    # filled by the store on put() if it has one
    source: SampleSource            # document, page index, region_id, run_id, timestamp
    attrs: dict[str, Any] = {}      # open bag: role, language, app-specific tags
```

label is intentionally separate from recognition.text: the recognizer’s
guess and the ground-truth correction are different things, and most
workflows want both. A sample with label is None is unlabelled raw
training data; setting label to a string promotes it to a labelled
example.
Sample is immutable. Derive new ones with .with_label(...) /
.replace(**kwargs).
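The immutability contract can be illustrated with a frozen dataclass. `MiniSample` below is a hypothetical three-field stand-in, not the real Sample model, and it only mirrors the `.with_label(...)` behaviour described above: the original object is left untouched and a new one is returned.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class MiniSample:
    """Hypothetical stand-in for Sample with only three fields."""
    id: str
    text: str                     # plays the role of recognition.text
    label: Optional[str] = None   # None until annotated

    def with_label(self, label: str) -> "MiniSample":
        # Return a new object; the original stays untouched.
        return replace(self, label=label)


raw = MiniSample(id="s1", text="ENCL0SURE")
fixed = raw.with_label("ENCLOSURE")
```

Because `raw` is never mutated, it stays safe to share between pipeline stages while a reviewer's correction travels as a separate object.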
Protocol
```python
class SampleStore(Engine, Protocol):
    name: str

    async def put(self, sample: Sample) -> SampleId: ...
    async def get(self, id: SampleId) -> Sample | None: ...
    async def find(
        self,
        *,
        where: Callable[[Sample], bool] | None = None,
        near: bytes | Region | None = None,  # nearest-neighbour over embedding
        limit: int = 50,
    ) -> Sequence[Sample]: ...
    async def remove(self, id: SampleId) -> None: ...
```

The same Engine machinery used for recognizers and caches gives
sample stores a capabilities set. A store that does not implement
embedding search leaves Capability.NEAREST_NEIGHBOUR out of that set, and
construction reports an actionable error when something asks for
find(near=...) against it, before you load anything.
| Capability | Meaning |
|---|---|
| NEAREST_NEIGHBOUR | Supports find(near=...) |
| PERSISTENT | Survives process restart |
| EMBEDDED_INDEX | Computes embeddings on put without an external call |
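A fail-fast capability check of the kind described above might look like the following sketch. The three capability names come from the table; modelling them as an `Enum`, the `require` helper and the `MissingCapability` error are assumptions made for illustration, not scriva's Engine machinery.

```python
from enum import Enum, auto


class Capability(Enum):
    NEAREST_NEIGHBOUR = auto()
    PERSISTENT = auto()
    EMBEDDED_INDEX = auto()


class MissingCapability(TypeError):
    """Hypothetical error type for this sketch."""


def require(store_name, capabilities, needed):
    """Fail fast if a store lacks a capability a stage depends on."""
    missing = needed - capabilities
    if missing:
        names = ", ".join(sorted(c.name for c in missing))
        raise MissingCapability(
            f"store {store_name!r} lacks {names}; "
            "use a store that declares it, or drop the stage that needs it"
        )


fs_caps = {Capability.PERSISTENT}
pg_caps = {Capability.PERSISTENT, Capability.NEAREST_NEIGHBOUR}

require("pgvector", pg_caps, {Capability.NEAREST_NEIGHBOUR})  # passes silently

try:
    require("fs", fs_caps, {Capability.NEAREST_NEIGHBOUR})
except MissingCapability as exc:
    message = str(exc)
```

Running the check when the pipeline is assembled, rather than on the first `find(near=...)` call, means a misconfigured store is caught before any document is processed.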
Built-in adapters
| Factory | Class | Backend | Extra | Capabilities |
|---|---|---|---|---|
| samples.fs(path) | FileSystemSampleStore | crops on disk + samples.jsonl | core | PERSISTENT |
| samples.sqlite(path) | SqliteSampleStore | one row per sample, BLOB crops | core | PERSISTENT |
| samples.pgvector(dsn, dim=...) | PgvectorSampleStore | Postgres + pgvector | pgvector | PERSISTENT, NEAREST_NEIGHBOUR |
| samples.layered(fs=, index=) | LayeredSampleStore | one store for bytes, another for embeddings | core | union of children |
| samples.memory() | InMemorySampleStore | dict + in-memory NumPy index | core | NEAREST_NEIGHBOUR (no persistence) |
samples.layered(...) is the common production shape: filesystem for the
crops, pgvector for nearest-neighbour. Lookups dispatch to whichever
child has the relevant capability.
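The dispatch rule can be sketched in a few lines. `Child` and `Layered` are hypothetical stand-ins that track only names and capability strings; how LayeredSampleStore actually routes calls, and which child wins when both declare a capability, is not specified here, so the "first match" rule below is an assumption.

```python
class Child:
    """Hypothetical child store: a name plus a set of capability names."""

    def __init__(self, name, capabilities):
        self.name = name
        self.capabilities = frozenset(capabilities)


class Layered:
    def __init__(self, fs, index):
        self.children = (fs, index)
        # The layered store advertises everything either child can do.
        self.capabilities = fs.capabilities | index.capabilities

    def child_for(self, capability):
        """Return the first child that declares the capability."""
        for child in self.children:
            if capability in child.capabilities:
                return child
        raise LookupError(f"no child store declares {capability}")


store = Layered(
    fs=Child("fs", {"PERSISTENT"}),
    index=Child("pgvector", {"PERSISTENT", "NEAREST_NEIGHBOUR"}),
)
```

In this sketch a nearest-neighbour lookup resolves to the pgvector child, while anything that only needs persistence resolves to the filesystem child because it is listed first.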
Computing embeddings
If your store has EMBEDDED_INDEX, it embeds on put. Otherwise pass an
embedder when constructing the store, or write sample.embedding
yourself before put:
```python
from scriva import samples
from scriva.embedders import OpenAIEmbedder

store = samples.pgvector(
    dsn,
    dim=1536,
    embedder=OpenAIEmbedder(model="text-embedding-3-small"),
)
```

The embedder protocol is the same one used by Cache.vector and
classify.embedding; any ImageEmbedder works.
What not to store
- Cached recognitions. That is what Cache is for. A cache keys on the crop hash; a sample store keys on identity and embedding. They look similar but solve different problems.
- Pipeline events. Subscribe to the event stream and write your own log; events are not regions.
- Raw pages without a region. Samples are crops. Persist whole pages through Document plugins or your own application layer.
Writing your own backend
Implement the protocol. Two common patterns:
- Redis + S3. S3 for crops, Redis for the JSONL index. Five lines on top of redis.asyncio + aioboto3.
- MinIO + DuckDB. Same shape, fully on-prem. DuckDB’s vector extension handles find(near=...) once you set Capability.NEAREST_NEIGHBOUR.
The protocol is intentionally narrow so the surface stays writeable.
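As a rough illustration of how small a backend can be, the sketch below implements the four protocol methods over a plain dict. It is a toy under stated assumptions: `MiniSample` and `DictSampleStore` are hypothetical, the class does not inherit from scriva's Engine or declare capabilities, and `near` is taken to be an embedding vector already, whereas the real protocol accepts crop bytes or a Region.

```python
import asyncio
import math
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class MiniSample:
    """Hypothetical stand-in for Sample; embedding is a plain tuple of floats."""
    id: str
    label: Optional[str]
    embedding: tuple


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class DictSampleStore:
    name = "dict"

    def __init__(self):
        self._samples = {}

    async def put(self, sample):
        self._samples[sample.id] = sample
        return sample.id

    async def get(self, id):
        return self._samples.get(id)

    async def find(self, *, where=None, near=None, limit=50):
        hits = [s for s in self._samples.values() if where is None or where(s)]
        if near is not None:
            # Assumption: `near` is already an embedding vector in this sketch.
            hits.sort(key=lambda s: cosine(s.embedding, near), reverse=True)
        return hits[:limit]

    async def remove(self, id):
        self._samples.pop(id, None)


async def demo():
    store = DictSampleStore()
    await store.put(MiniSample("a", "ENCLOSURE", (1.0, 0.0)))
    await store.put(MiniSample("b", "INVOICE", (0.0, 1.0)))
    await store.put(MiniSample("c", None, (0.9, 0.1)))  # unlabelled
    nearest = await store.find(
        where=lambda s: s.label is not None, near=(1.0, 0.05), limit=1
    )
    await store.remove("b")
    return nearest[0].id, await store.get("b")


result = asyncio.run(demo())
```

The linear scan in `find` is fine for a few thousand samples; a real backend would push the `near` ordering down into its vector index.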