Data Pipeline
This page describes the complete data loading pipeline from HID files on disk to PyTorch tensors fed into the training loop.
HID File Format
HID files are XML-based containers produced by Applied Biosystems capillary electrophoresis instruments (e.g., 3500 Genetic Analyzer). Each file contains:
- Raw data — Unprocessed fluorescence intensity per scan point per dye
- Analyzed data — Instrument-processed data (baseline-subtracted, smoothed)
- Size standard channel — The last dye channel (e.g., WEN ILS or LIZ)
- Metadata — Sample name, instrument settings, run date
DNANet supports three loading strategies:
- raw — Use only the raw data
- analyzed — Use the instrument-analyzed data
- superior — Use raw data with DNANet's own baseline estimation (recommended)
Size Standard Calibration
The size standard channel maps scan-point positions to base-pair (bp) values. This calibration is critical for allele identification.
PPF6C (WEN ILS)
- Detect peaks above 180 RFU (adaptive: 120 RFU for tail peaks)
- Remove duplicates within 15 scan points
- Take the last 19 peaks (excluding the final)
- Validate: pixel-per-bp ratio must be 7–13 for all adjacent pairs
- Fit cubic spline through (peak_idx, expected_bp) pairs
- Rescale profile to 4096 scan points over 65–475 bp
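The deduplication and spacing-validation steps above can be sketched in plain Python. This is a minimal illustration rather than DNANet's implementation: the function names, the `(position, height)` tuple layout, and the rule of keeping the taller peak of a duplicate pair are all assumptions, since the steps only state the thresholds.

```python
def dedupe_peaks(peaks, min_gap=15):
    """Collapse peaks that lie within min_gap scan points of each other.

    peaks: iterable of (scan_position, height_rfu) tuples.
    Keeping the taller peak is an assumption; the documented step only
    says that duplicates within 15 scan points are removed.
    """
    kept = []
    for pos, height in sorted(peaks):
        if kept and pos - kept[-1][0] <= min_gap:
            if height > kept[-1][1]:
                kept[-1] = (pos, height)
        else:
            kept.append((pos, height))
    return kept


def spacing_is_valid(peak_idx, expected_bp, lo=7.0, hi=13.0):
    """Check that every adjacent pair spans lo..hi scan points per bp."""
    pairs = list(zip(peak_idx, expected_bp))
    for (x0, bp0), (x1, bp1) in zip(pairs, pairs[1:]):
        ratio = (x1 - x0) / (bp1 - bp0)
        if not lo <= ratio <= hi:
            return False
    return True
```

A calibration attempt would be rejected as soon as `spacing_is_valid` returns `False` for the selected peaks, before any spline is fitted.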
GlobalFiler (GeneScan 600 LIZ)
- Detect peaks above 300 RFU
- Remove duplicates within 15 scan points
- Iterative polynomial fitting (degree 2):
  - Take last N peaks matching expected bp count
  - Fit quadratic polynomial
  - If max deviation > 5.0 bp, trim last expected bp and retry
  - Up to 10 shrinkage iterations
- Fit cubic spline and rescale to 4096 scan points over 60–480 bp
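The iterative fitting loop can be sketched with the standard library alone. The code below is an illustration under stated assumptions, not DNANet's implementation: `fit_quadratic` and `calibrate` are hypothetical names, peaks are assumed to be a sorted list of scan positions, and the case where fewer peaks than expected sizes are available is handled by shrinking as well, which the steps above do not specify.

```python
def fit_quadratic(xs, ys):
    """Least-squares quadratic fit; returns a function mapping x to fitted y."""
    mean = sum(xs) / len(xs)
    scale = max(abs(x - mean) for x in xs) or 1.0
    ts = [(x - mean) / scale for x in xs]  # centered and scaled for stability
    s = [sum(t ** k for t in ts) for k in range(5)]  # sums of t^k
    r = [sum(y * t ** k for t, y in zip(ts, ys)) for k in range(3)]
    # Augmented normal equations for the coefficients of 1, t, t^2.
    rows = [[s[i + j] for j in range(3)] + [r[i]] for i in range(3)]
    for i in range(3):  # Gaussian elimination with partial pivoting
        p = max(range(i, 3), key=lambda k: abs(rows[k][i]))
        rows[i], rows[p] = rows[p], rows[i]
        for k in range(i + 1, 3):
            f = rows[k][i] / rows[i][i]
            rows[k] = [a - f * b for a, b in zip(rows[k], rows[i])]
    c = [0.0, 0.0, 0.0]
    for i in (2, 1, 0):  # back substitution
        tail = sum(rows[i][j] * c[j] for j in range(i + 1, 3))
        c[i] = (rows[i][3] - tail) / rows[i][i]

    def predict(x):
        t = (x - mean) / scale
        return c[0] + c[1] * t + c[2] * t * t

    return predict


def calibrate(peaks, expected_bp, max_dev=5.0, max_shrink=10):
    """Return (matched_peaks, matched_bp), or None if no fit is acceptable."""
    expected = list(expected_bp)
    for _ in range(max_shrink + 1):  # first attempt plus up to 10 shrinkages
        n = len(expected)
        if n < 3:  # a quadratic needs at least three points
            break
        xs = peaks[-n:]
        if len(xs) == n:
            predict = fit_quadratic(xs, expected)
            dev = max(abs(predict(x) - bp) for x, bp in zip(xs, expected))
            if dev <= max_dev:
                return xs, expected
        expected = expected[:-1]  # trim the largest expected size and retry
    return None
```

Trimming from the large-fragment end reflects the situation where the longest size-standard fragments were not captured in the run: the last N detected peaks then line up with the first N expected sizes once enough sizes have been dropped.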
Annotation Pipeline
For the NFI R&D dataset, annotations come from AlleleReport TXT files:
- Mapping CSV — 2p_5p_hid_to_annotation.csv maps each HID filename to an annotation name in the AlleleReport
- AlleleReport parsing — Tab-separated file with columns for each marker; rows contain allele calls per sample
- Panel lookup — Called allele names are resolved to base-pair positions and bin widths from the SGPanel XML
- Mask construction — For each allele, pixels between bp ± bin_width are set to 1 in the binary mask
- Adjustment (optional) — Masks are refined by finding actual peaks:
  - top — Set only the peak apex to 1
  - complete — Set the entire peak (boundary to boundary) to 1
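The mask-construction step can be sketched as follows. This is a hedged illustration: it assumes the rescaled profile uses a uniform bp axis (4096 points over 65–475 bp, as in the PPF6C calibration), and `bp_to_index` and `build_mask` are hypothetical helper names, not DNANet functions.

```python
def bp_to_index(bp, bp_min=65.0, bp_max=475.0, n_points=4096):
    """Map a base-pair value to the nearest index on a uniform bp axis."""
    frac = (bp - bp_min) / (bp_max - bp_min)
    return round(frac * (n_points - 1))


def build_mask(alleles, bp_min=65.0, bp_max=475.0, n_points=4096):
    """Build a binary mask from (bp_position, bin_width) pairs.

    Every index between bp - bin_width and bp + bin_width (inclusive)
    is set to 1; everything else stays 0.
    """
    mask = [0] * n_points
    for bp, bin_width in alleles:
        lo = max(0, bp_to_index(bp - bin_width, bp_min, bp_max, n_points))
        hi = min(n_points - 1,
                 bp_to_index(bp + bin_width, bp_min, bp_max, n_points))
        for i in range(lo, hi + 1):
            mask[i] = 1
    return mask
```

The optional adjustment would then narrow each marked region using the measured signal, either to the apex index (top) or to the peak's boundaries (complete).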
Ladder-Based Panel Adjustment
Ladder files contain known alleles at fixed positions. By comparing detected ladder peaks to expected positions, the panel can be adjusted for run-specific instrument drift:
- Parse ladder HID file and calibrate its size standard
- For each marker, detect the allele peak and compute its actual bp position
- Update the panel's bin positions to match the ladder measurements
This produces an adjusted panel that is more accurate than the default XML panel for each specific electrophoresis run.
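As a rough sketch of the update step, the panel can be modeled as a nested dictionary whose bin centers are overwritten by ladder measurements. The data layout and the `adjust_panel` name are assumptions made for illustration, and the bp values in the usage note are invented; the real panel comes from the SGPanel XML.

```python
def adjust_panel(panel, ladder_bp):
    """Return a copy of panel with bin centers replaced by ladder measurements.

    panel:     {marker: {allele_name: default_bp}}
    ladder_bp: {marker: {allele_name: measured_bp}} from the calibrated ladder
    Alleles without a ladder measurement keep their default position.
    """
    adjusted = {}
    for marker, bins in panel.items():
        measured = ladder_bp.get(marker, {})
        adjusted[marker] = {
            allele: measured.get(allele, bp) for allele, bp in bins.items()
        }
    return adjusted
```

For example, calling `adjust_panel({"D3S1358": {"15": 120.0}}, {"D3S1358": {"15": 120.4}})` moves the bin for allele 15 to the measured position while leaving the input panel untouched.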
Persistent Cache
Parsed and pre-processed profiles are persisted to an on-disk memmap-backed
cache so that subsequent runs only open a small index and stream single rows
on demand. See Dataset Caching for the layout, fingerprint
invalidation, deduped sidecars, RAM guard, and the cache-inspect tool.
HIDDataset Construction
HIDDataset orchestrates the full loading pipeline:
dataset = HIDDataset(
    root="data/2p_5p_Dataset_NFI/Raw data .HID files",
    scaling_strategy=scaling_strategy,
    dataset_strategy=dataset_strategy,
    cache_dir="data/cache/dnanet_rd",
    adjustment_of_annotations="complete",
)
Steps:
- Inject strategies — Pass a kit scaling strategy and a dataset strategy
- Load mappings — Annotation mapping CSV, ladder paths CSV, ladder catalog
- Collect files — Walk root recursively, filter via DatasetStrategy.categorize_file()
- Load images — For each sample:
  - Build adjusted panel from its ladder (cached per ladder file)
  - Create HIDImage with panel, annotation file, and metadata
  - Trigger lazy load and validate (skip if data is None)
- Adjust annotations — Optionally snap masks to actual peaks
DNANetDataModule
The Lightning DataModule bridges the domain dataset to PyTorch:
datamodule = DNANetDataModule(
    dataset=dataset,      # TransformableDataset (e.g. HIDDataset)
    batch_size=16,
    num_workers=0,
    shuffle_train=True,
    val_fraction=0.2,     # 80/20 train/val split
    seed=42,              # via **split_kwargs → dataset_splitter()
)
Internally:
1. Calls the dataset strategy split helper when available, otherwise dataset.split(...)
2. Applies the dataset transformer and collate function
3. Creates DataLoader instances for Lightning's fit() / validate() / predict()
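To illustrate what a seeded 80/20 split means in practice, here is a minimal stand-in written with the standard library. It is not `dataset_splitter()` itself, whose behavior is not described on this page; `split_indices` is a hypothetical helper that only shows how `val_fraction` and `seed` could yield a reproducible partition.

```python
import random


def split_indices(n_samples, val_fraction=0.2, seed=42):
    """Return (train_indices, val_indices) for a reproducible random split."""
    indices = list(range(n_samples))
    # A dedicated Random instance keeps the split independent of global state.
    random.Random(seed).shuffle(indices)
    n_val = int(round(n_samples * val_fraction))
    return indices[n_val:], indices[:n_val]
```

Because the shuffle depends only on `seed`, repeated runs see the same validation samples, which keeps validation metrics comparable across experiments.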