
Sample stores

A sample store persists labelled crops — the input + output of a recognition, with optional human/oracle correction. It is to labelled crops what Cache is to recognised crops: a swappable protocol whose adapters back onto filesystem / sqlite / pgvector / S3.

Four things in one library use sample stores:

SampleStore is opt-in. A pipeline without one behaves exactly as before — no on-disk side effects.

A SampleStore is the canonical “user-environment” surface for accuracy improvement. Once you keep one — typically ./.scriva_samples/ checked in alongside the project, or a shared pgvector index for a team — corrections you make today raise accuracy on every run after.

From the same store, scriva derives two complementary improvements:

         ┌──────────────────────────────────────────────────┐
         │              SampleStore (your env)              │
         │   crop + recognition.text + label (correction)   │
         └────────────────┬──────────────┬──────────────────┘
                          │              │
few-shot exemplars ◄──────┘              └──────► derived dictionary
(input side, before                               (output side, after
 recognition runs)                                 recognition runs)
         │                                                 │
         ▼                                                 ▼
RecognitionHint.from_store(store,                 postprocess.dictionary.from_samples(store)
    near=region, k=5)                             inserted after recognize
  • Few-shot road (input side). For each region the recognizer is about to OCR, the store retrieves its nearest labelled neighbours and splices them into the prompt as exemplars. The model sees “here’s what cells that look like this should read as.” Best when crops are visually distinctive (handwriting styles, stamps, language-specific glyphs).
  • Dictionary road (output side). Pairs (recognition.text → label) observed in the store become a supervised dictionary. The post-processor rewrites the same OCR errors on future runs without the recognizer ever knowing. Best when errors are systematic (ENCL0SURE → ENCLOSURE, Acme Ind. → Acme Industries, Inc.).
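The dictionary road can be pictured as a counting pass over (recognition.text, label) pairs. The sketch below is not scriva's implementation: `derive_dictionary` and `apply_dictionary` are hypothetical helpers, and treating `min_observations` as a simple count threshold is an assumption about how such filtering could work.

```python
from collections import Counter, defaultdict


def derive_dictionary(pairs, min_observations=2):
    """Build {ocr_text: corrected_text} from (recognised, label) pairs.

    Hypothetical helper: a correction is kept only when the most common
    label for a recognised string was seen at least min_observations times.
    """
    counts = defaultdict(Counter)
    for recognised, label in pairs:
        # Skip unlabelled samples and samples the recognizer got right.
        if label is not None and label != recognised:
            counts[recognised][label] += 1

    dictionary = {}
    for recognised, labels in counts.items():
        label, seen = labels.most_common(1)[0]
        if seen >= min_observations:
            dictionary[recognised] = label
    return dictionary


def apply_dictionary(text, dictionary):
    """Rewrite a recognised string if the dictionary has an entry for it."""
    return dictionary.get(text, text)


pairs = [
    ("ENCL0SURE", "ENCLOSURE"),
    ("ENCL0SURE", "ENCLOSURE"),
    ("Acme Ind.", "Acme Industries, Inc."),  # seen once: below threshold
    ("TOTAL", "TOTAL"),                      # already correct: ignored
    ("DRAFT", None),                         # unlabelled: ignored
]
corrections = derive_dictionary(pairs, min_observations=2)
```

With a threshold of 2, only the repeated ENCL0SURE correction survives; a single observation of Acme Ind. is not enough to rewrite future output.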

The two roads are complementary, not competing. Use them together and the store earns its keep twice per run:

import scriva
from scriva import samples, RecognitionHint
from scriva.recognize import openai
from scriva.postprocess import dictionary, whitespace, confidence_score
from scriva.prompts import Prompt
from scriva.detect import morphological_grid

store = samples.layered(
    fs=samples.fs(".scriva_samples"),
    index=samples.pgvector("postgresql://localhost/scriva", dim=1536),
)

pipeline = scriva.Pipeline(
    morphological_grid(),
    openai(
        model="gpt-4o",
        prompt=Prompt.ocr_few_shot(),  # input road
        hints=lambda region: RecognitionHint.from_store(store, near=region, k=5),
    ),
    whitespace(),
    dictionary.from_samples(store, min_observations=2),  # output road
    confidence_score.rendering(),
)

Every correction a reviewer makes against the store strengthens both roads. That is the loop the annotation domain pack automates.

from scriva import samples

store = samples.fs(".scriva_samples")    # filesystem
store = samples.sqlite("samples.db")     # sqlite with embedding BLOB
store = samples.pgvector(dsn, dim=1536)  # Postgres + pgvector
store = samples.layered(                 # storage + nearest-neighbour split
    fs=samples.fs(".scriva_samples"),
    index=samples.pgvector(dsn, dim=1536),
)

When a stage takes a store= argument, a string path is shorthand for samples.fs(path) — same convention as cache=.

class Sample(BaseModel):
    id: SampleId
    crop: bytes                   # PNG bytes of the region
    label: str | None             # human/oracle text; None until annotated
    recognition: Recognition      # what the primary recognizer said
    embedding: np.ndarray | None  # filled by the store on put() if it has one
    source: SampleSource          # document, page index, region_id, run_id, timestamp
    attrs: dict[str, Any] = {}    # open bag: role, language, app-specific tags

label is intentionally separate from recognition.text: the recognizer’s guess and the ground-truth correction are different things, and most workflows want both. A sample with label is None is unlabelled raw training data; setting label to a string promotes it to a labelled example.

Sample is immutable. Derive new ones with .with_label(...) / .replace(**kwargs).
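The immutable-update pattern can be illustrated with a frozen dataclass. This is a sketch only: `SampleSketch` is a hypothetical stand-in with a reduced field set, and it uses the standard library's `dataclasses.replace` rather than whatever mechanism scriva's `Sample` uses internally.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class SampleSketch:
    """Hypothetical stand-in for Sample with only the fields needed here."""
    id: str
    recognised_text: str         # stands in for recognition.text
    label: Optional[str] = None  # None until a reviewer annotates

    def with_label(self, label: str) -> "SampleSketch":
        # Returns a new object; the original is left untouched.
        return replace(self, label=label)

    @property
    def is_labelled(self) -> bool:
        return self.label is not None


raw = SampleSketch(id="s1", recognised_text="ENCL0SURE")
fixed = raw.with_label("ENCLOSURE")
```

Here `raw` stays unlabelled while `fixed` carries both the recognizer's guess and the correction, which is the pair the dictionary road needs.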

class SampleStore(Engine, Protocol):
    name: str

    async def put(self, sample: Sample) -> SampleId: ...
    async def get(self, id: SampleId) -> Sample | None: ...
    async def find(
        self,
        *,
        where: Callable[[Sample], bool] | None = None,
        near: bytes | Region | None = None,  # nearest-neighbour over embedding
        limit: int = 50,
    ) -> Sequence[Sample]: ...
    async def remove(self, id: SampleId) -> None: ...

The same Engine machinery used for recognizers and caches gives sample stores a capabilities set. A store that does not implement embedding search omits Capability.NEAREST_NEIGHBOUR from that set, and construction reports an actionable error when something asks for find(near=...) against it, before you load anything.
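A construction-time capability check might look like the following. It is a sketch under assumptions: the `Capability` enum here is a local stand-in that reuses the documented member names, while `MissingCapabilityError` and `check_capabilities` are hypothetical and not part of scriva.

```python
from enum import Enum


class Capability(Enum):
    # Local stand-in; member names follow the documented capabilities.
    NEAREST_NEIGHBOUR = "nearest_neighbour"
    PERSISTENT = "persistent"
    EMBEDDED_INDEX = "embedded_index"


class MissingCapabilityError(Exception):
    """Hypothetical error raised while the pipeline is being built."""


def check_capabilities(store_name, declared, required):
    """Fail fast if a store does not declare everything a stage needs."""
    missing = sorted(c.name for c in set(required) - set(declared))
    if missing:
        raise MissingCapabilityError(
            f"{store_name} lacks {', '.join(missing)}; "
            "use a store that declares it, for example a pgvector-backed one"
        )


fs_caps = frozenset({Capability.PERSISTENT})
pg_caps = frozenset({Capability.PERSISTENT, Capability.NEAREST_NEIGHBOUR})

# A store that declares the capability passes silently.
check_capabilities("pgvector", pg_caps, {Capability.NEAREST_NEIGHBOUR})

# A filesystem-only store is rejected before any data is touched.
try:
    check_capabilities("fs", fs_caps, {Capability.NEAREST_NEIGHBOUR})
    error_message = None
except MissingCapabilityError as exc:
    error_message = str(exc)
```

Running the check when the pipeline is assembled, rather than on the first `find(near=...)` call, is what keeps the failure cheap.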

| Capability | Meaning |
| --- | --- |
| NEAREST_NEIGHBOUR | Supports find(near=...) |
| PERSISTENT | Survives process restart |
| EMBEDDED_INDEX | Computes embeddings on put without an external call |

| Factory | Class | Backend | Extra | Capabilities |
| --- | --- | --- | --- | --- |
| samples.fs(path) | FileSystemSampleStore | crops on disk + samples.jsonl | core | PERSISTENT |
| samples.sqlite(path) | SqliteSampleStore | one row per sample, BLOB crops | core | PERSISTENT |
| samples.pgvector(dsn, dim=...) | PgvectorSampleStore | Postgres + pgvector | pgvector | PERSISTENT, NEAREST_NEIGHBOUR |
| samples.layered(fs=, index=) | LayeredSampleStore | one store for bytes, another for embeddings | core | union of children |
| samples.memory() | InMemorySampleStore | dict + in-memory NumPy index | core | NEAREST_NEIGHBOUR (no persistence) |

samples.layered(...) is the common production shape: filesystem for the crops, pgvector for nearest-neighbour. Lookups dispatch to whichever child has the relevant capability.
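The dispatch idea can be sketched with two toy children. Everything here is hypothetical (`DictStore`, `VectorIndex`, `LayeredSketch`, and the `find_near` method name); it only illustrates the split between a store that holds payloads and an index that answers nearest-neighbour queries, not how LayeredSampleStore is actually written.

```python
import asyncio


class DictStore:
    """Toy storage child: keeps payloads in a dict (stands in for fs)."""

    def __init__(self):
        self.rows = {}

    async def put(self, sample_id, payload):
        self.rows[sample_id] = payload

    async def get(self, sample_id):
        return self.rows.get(sample_id)


class VectorIndex:
    """Toy index child: brute-force nearest neighbour over small vectors."""

    def __init__(self):
        self.vectors = {}

    async def put(self, sample_id, vector):
        self.vectors[sample_id] = vector

    async def nearest(self, query, limit):
        def sq_dist(v):
            return sum((a - b) ** 2 for a, b in zip(v, query))

        ranked = sorted(self.vectors, key=lambda sid: sq_dist(self.vectors[sid]))
        return ranked[:limit]


class LayeredSketch:
    """Writes go to both children; each read goes to the child that can serve it."""

    def __init__(self, storage, index):
        self.storage, self.index = storage, index

    async def put(self, sample_id, payload, vector):
        await self.storage.put(sample_id, payload)
        await self.index.put(sample_id, vector)

    async def get(self, sample_id):
        return await self.storage.get(sample_id)

    async def find_near(self, query, limit=5):
        # The index returns ids; the storage child hydrates them.
        ids = await self.index.nearest(query, limit)
        return [await self.storage.get(sid) for sid in ids]


async def demo():
    store = LayeredSketch(DictStore(), VectorIndex())
    await store.put("a", "ENCLOSURE", (0.0, 0.0))
    await store.put("b", "INVOICE", (1.0, 1.0))
    await store.put("c", "TOTAL", (5.0, 5.0))
    return await store.find_near((0.9, 1.2), limit=2)


nearest = asyncio.run(demo())
```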

If your store has EMBEDDED_INDEX, it embeds on put. Otherwise pass an embedder when constructing the store, or write sample.embedding yourself before put:

from scriva import samples
from scriva.embedders import OpenAIEmbedder

store = samples.pgvector(
    dsn,
    dim=1536,
    embedder=OpenAIEmbedder(model="text-embedding-3-small"),
)

The embedder protocol is the same one used by Cache.vector and classify.embedding — any ImageEmbedder works.
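The embed-on-put behaviour can be pictured as follows. This is a sketch with hypothetical names (`CropSample`, `toy_embedder`, `EmbeddingStoreSketch`); the embedder is a plain callable from crop bytes to a vector, which is an assumption and may not match the real ImageEmbedder interface.

```python
import asyncio
from dataclasses import dataclass, replace
from typing import Callable, Optional, Sequence


@dataclass(frozen=True)
class CropSample:
    """Hypothetical reduced sample: just an id, crop bytes and an embedding."""
    id: str
    crop: bytes
    embedding: Optional[Sequence[float]] = None


def toy_embedder(crop: bytes) -> list:
    """Placeholder embedder: mean byte value and length. Not a real model."""
    mean = sum(crop) / len(crop) if crop else 0.0
    return [mean, float(len(crop))]


class EmbeddingStoreSketch:
    def __init__(self, embedder: Optional[Callable[[bytes], Sequence[float]]] = None):
        self.embedder = embedder
        self.rows = {}

    async def put(self, sample: CropSample) -> str:
        # Only fill the embedding when the caller has not supplied one.
        if sample.embedding is None and self.embedder is not None:
            sample = replace(sample, embedding=self.embedder(sample.crop))
        self.rows[sample.id] = sample
        return sample.id


async def demo():
    store = EmbeddingStoreSketch(embedder=toy_embedder)
    await store.put(CropSample(id="s1", crop=b"\x00\x02\x04"))
    await store.put(CropSample(id="s2", crop=b"\xff", embedding=[9.0, 9.0]))
    return store.rows


rows = asyncio.run(demo())
```

A caller-supplied embedding wins over the store's embedder, which matches the option of setting the embedding yourself before put.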

A sample store is not the right place for:

  • Cached recognitions. That is what Cache is for. A cache keys on the crop hash; a sample store keys on identity and embedding. They look similar but solve different problems.
  • Pipeline events. Subscribe to the event stream and write your own log; events are not regions.
  • Raw pages without a region. Samples are crops. Persist whole pages through Document plugins or your own application layer.

Implement the protocol. Two common patterns:

  • Redis + S3. S3 for crops, Redis for the JSONL index. Five lines on top of redis.asyncio + aioboto3.
  • MinIO + DuckDB. Same shape, fully on-prem. DuckDB’s vector extension handles find(near=...) once you set Capability.NEAREST_NEIGHBOUR.

The protocol is intentionally narrow so the surface stays writeable.
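To show how small a conforming store can be, here is a dict-backed sketch with the four protocol methods. `MiniSample` and `MiniStore` are hypothetical; the sketch omits the Engine base, capability declarations and the `near` argument, so it illustrates the shape of the protocol and is not a drop-in scriva store.

```python
import asyncio
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass(frozen=True)
class MiniSample:
    """Hypothetical reduced stand-in for Sample."""
    id: str
    text: str
    label: Optional[str] = None


class MiniStore:
    """Dict-backed store exposing put / get / find / remove."""

    name = "mini"

    def __init__(self) -> None:
        self._rows: Dict[str, MiniSample] = {}

    async def put(self, sample: MiniSample) -> str:
        self._rows[sample.id] = sample
        return sample.id

    async def get(self, id: str) -> Optional[MiniSample]:
        return self._rows.get(id)

    async def find(
        self,
        *,
        where: Optional[Callable[[MiniSample], bool]] = None,
        limit: int = 50,
    ) -> List[MiniSample]:
        matches = [s for s in self._rows.values() if where is None or where(s)]
        return matches[:limit]

    async def remove(self, id: str) -> None:
        # Removing an unknown id is a no-op in this sketch.
        self._rows.pop(id, None)


async def demo():
    store = MiniStore()
    await store.put(MiniSample(id="s1", text="ENCL0SURE", label="ENCLOSURE"))
    await store.put(MiniSample(id="s2", text="TOTAL"))
    labelled = await store.find(where=lambda s: s.label is not None)
    await store.remove("s2")
    gone = await store.get("s2")
    return labelled, gone


labelled, gone = asyncio.run(demo())
```

A Redis + S3 or MinIO + DuckDB adapter would keep the same four methods and swap the dict for network calls.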