Memmap Cache

The dnanet.data.cache package provides a lazy, memmap-backed disk cache for HIDDataset. It lets large datasets be loaded without holding all pixel data in RAM.

Layout

Each cache key maps to a directory containing:

File               Contents
data.bin           (N, D, L) int16: fluorescence signal
annotation.bin     (N, D, L) int8: scanpoint mask
scaler.bin         (N, L) float32: base-pair mapping
index.parquet      N rows of path + sidecar keys
panels.json        Deduped adjusted panels
alleles.json       Deduped allele annotations
shapes.json        Row counts, array shapes, dtypes
fingerprint.json   SHA256 of sources + config
manifest.jsonl     Append-only writer log
_COMPLETE          Sentinel for successful write
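
The shapes and dtypes listed above fix the size of each binary file, which gives a cheap consistency check before any memmap is opened. The sketch below is illustrative only: expected_sizes is a hypothetical helper, not part of dnanet, and it assumes the files are raw, headerless arrays with N as the row count.

```python
from math import prod

# Bytes per element for the dtypes named in the layout table.
ITEMSIZE = {"int16": 2, "int8": 1, "float32": 4}


def expected_sizes(n: int, d: int, length: int) -> dict:
    """Return the expected size in bytes of each binary file.

    Hypothetical helper: assumes raw, headerless arrays on disk.
    """
    layout = {
        "data.bin": ((n, d, length), "int16"),
        "annotation.bin": ((n, d, length), "int8"),
        "scaler.bin": ((n, length), "float32"),
    }
    return {
        name: prod(shape) * ITEMSIZE[dtype]
        for name, (shape, dtype) in layout.items()
    }


# Example dimensions chosen only for illustration.
sizes = expected_sizes(n=1000, d=6, length=4096)
```

Comparing these numbers against the sizes reported by the filesystem would catch a truncated write even if the _COMPLETE sentinel were somehow present.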

Reader

from dnanet.data.cache.reader import IndexEntry, MemmapCacheReader

MemmapCacheReader opens memmaps lazily per-process (fork-safe for DataLoader workers). Use load_index() to read the index without touching pixel data, then get_row(idx) to retrieve (data, annotation, scaler) slices.
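
The lazy, per-process opening described above can be sketched as follows. This is not the MemmapCacheReader implementation; LazyMemmap is a hypothetical class that shows one common way to get the same behaviour with NumPy: defer np.memmap until the first access and reopen it when the process ID changes.

```python
import os
import tempfile

import numpy as np


class LazyMemmap:
    """Hypothetical wrapper: opens a read-only memmap on first access, per process."""

    def __init__(self, path, dtype, shape):
        self.path, self.dtype, self.shape = path, dtype, shape
        self._array = None
        self._pid = None

    def _get(self):
        # Open on first use, and reopen if we are now in a forked child.
        if self._array is None or self._pid != os.getpid():
            self._array = np.memmap(
                self.path, dtype=self.dtype, mode="r", shape=self.shape
            )
            self._pid = os.getpid()
        return self._array

    def __getitem__(self, idx):
        return self._get()[idx]

    def __getstate__(self):
        # Drop the open mapping when pickled (e.g. for spawn-based workers).
        state = self.__dict__.copy()
        state["_array"] = None
        state["_pid"] = None
        return state


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "data.bin")
    # Write a tiny (N=2, D=3, L=4) int16 array as a raw file.
    np.arange(24, dtype=np.int16).reshape(2, 3, 4).tofile(path)

    data = LazyMemmap(path, np.int16, (2, 3, 4))  # nothing is mapped yet
    row = np.array(data[1])  # maps the file, then copies one row out
    del data  # release the mapping before the directory is removed
```

Only the requested row is read from disk; the rest of the file is never loaded into memory.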

Writer

from dnanet.data.cache.writer import MemmapCacheWriter

MemmapCacheWriter appends rows to binary files and writes a JSONL manifest. On finalize(), the manifest is converted to index.parquet with deduped JSON sidecars.

Usage:

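# cache_dir, images, config_payload and source_paths are supplied by the caller.
with MemmapCacheWriter(cache_dir) as writer:
    for image in images:
        writer.write(image)  # appends one row to the binary files and the manifest
    # Converts the manifest to index.parquet and writes the deduped JSON sidecars.
    writer.finalize(config_payload, source_paths)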
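
To make the append-then-finalize pattern concrete, here is a minimal standard-library sketch. It is not the MemmapCacheWriter code: TinyAppendWriter is hypothetical, it writes only data.bin, and it swaps index.parquet for a plain index.json so that no Parquet library is needed.

```python
import json
import tempfile
from array import array
from pathlib import Path


class TinyAppendWriter:
    """Hypothetical stand-in: appends int16 rows and logs each one as JSONL."""

    def __init__(self, cache_dir):
        self.dir = Path(cache_dir)
        self.dir.mkdir(parents=True, exist_ok=True)
        self.rows = 0

    def __enter__(self):
        self._data = open(self.dir / "data.bin", "ab")
        self._manifest = open(self.dir / "manifest.jsonl", "a", encoding="utf-8")
        return self

    def write(self, path, values):
        array("h", values).tofile(self._data)  # "h" is a signed 16-bit integer
        entry = {"row": self.rows, "path": path}
        self._manifest.write(json.dumps(entry) + "\n")
        self.rows += 1

    def finalize(self):
        self._data.flush()
        self._manifest.flush()
        lines = (self.dir / "manifest.jsonl").read_text(encoding="utf-8").splitlines()
        entries = [json.loads(line) for line in lines]
        # Stand-in for index.parquet in the real layout.
        (self.dir / "index.json").write_text(json.dumps(entries), encoding="utf-8")
        # The sentinel is written last, so its presence implies a complete cache.
        (self.dir / "_COMPLETE").touch()

    def __exit__(self, *exc):
        self._data.close()
        self._manifest.close()
        return False


cache_dir = Path(tempfile.mkdtemp())
with TinyAppendWriter(cache_dir) as writer:
    writer.write("sample_a", [1, 2, 3])
    writer.write("sample_b", [4, 5, 6])
    writer.finalize()
```

Writing the sentinel as the final step means a reader can treat any directory without _COMPLETE as an interrupted write and rebuild it.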

Fingerprinting

Two hashes are used:

  • key: 16-char hex derived from config only. Drives the cache directory name.
  • fingerprint: full SHA256 from config + source file contents. Stored inside the cache dir and validated on load.

Because the fingerprint hashes file contents rather than paths, a cache stays valid across copies to NFS shares, mount point changes, and file renames.
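
The two hashes can be sketched with hashlib as below. The details are assumptions, since the page does not specify them: the key is taken here as the first 16 hex characters of a SHA256 over canonical JSON, per-file digests are sorted so that order and file names do not matter, and the config fields are invented for the example. config_key and fingerprint are hypothetical names.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def _canonical(config: dict) -> bytes:
    # Sorted keys and fixed separators give a stable byte representation.
    return json.dumps(config, sort_keys=True, separators=(",", ":")).encode("utf-8")


def _file_digest(path, chunk_size=1 << 20) -> str:
    # Hash the file in chunks so large sources are never fully in memory.
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def config_key(config: dict) -> str:
    """16-char hex key derived from the config only (assumed: truncated SHA256)."""
    return hashlib.sha256(_canonical(config)).hexdigest()[:16]


def fingerprint(config: dict, source_paths) -> str:
    """Full SHA256 over the config plus the contents of every source file."""
    h = hashlib.sha256(_canonical(config))
    for digest in sorted(_file_digest(p) for p in source_paths):
        h.update(digest.encode("ascii"))
    return h.hexdigest()


work = Path(tempfile.mkdtemp())
original = work / "run1.dat"
original.write_bytes(b"signal-bytes")
renamed = work / "moved" / "renamed.dat"
renamed.parent.mkdir()
renamed.write_bytes(original.read_bytes())  # same content, different path
edited = work / "edited.dat"
edited.write_bytes(b"different-bytes")

config = {"window": 128, "normalize": True}  # hypothetical config fields
fp_original = fingerprint(config, [original])
fp_renamed = fingerprint(config, [renamed])
fp_edited = fingerprint(config, [edited])
```

Here fp_original and fp_renamed are equal because only the bytes are hashed, while fp_edited differs because the content changed.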