Data Layer

The dnanet.data package handles everything from raw HID files to PyTorch-ready tensors.

HIDImage

The central data container. It wraps the path to a HID file and loads the data lazily, on first access. HIDImage receives the kit scaling strategy directly.

from dnanet.data.image import HIDImage

Properties:
  • data (np.ndarray | None): Shape (num_dyes, signal_length, 1). Triggers load on first access.
  • annotation (ScanpointAnnotation | None): Ground-truth segmentation mask.
  • scaler (np.ndarray): Shape (1, signal_length). Maps pixel → base pair.
  • adjusted_panel (Panel | None): Reference panel.
  • dimensions (tuple[int, int]): Data array shape (num_dyes, signal_length).
  • meta (dict): Metadata (NOC, ladder path, etc.).
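
The load-on-first-access behaviour described for data can be illustrated with a small, generic sketch. LazyImage and fake_loader are hypothetical stand-ins written for this example; they are not part of dnanet, and nested lists replace the NumPy array so the snippet has no dependencies.

```python
from pathlib import Path
from typing import Callable, Optional


class LazyImage:
    """Hypothetical stand-in for a lazily loaded container (not HIDImage itself)."""

    def __init__(self, path: Path, loader: Callable[[Path], list]):
        self.path = path
        self._loader = loader
        self._data: Optional[list] = None  # nothing is read at construction time
        self.load_count = 0  # only here to make the caching visible

    @property
    def data(self) -> list:
        # Load on first access, then reuse the cached result.
        if self._data is None:
            self._data = self._loader(self.path)
            self.load_count += 1
        return self._data


def fake_loader(path: Path) -> list:
    # Placeholder for real HID parsing: 2 dyes x 4 scan points x 1 channel.
    return [[[0.0] for _ in range(4)] for _ in range(2)]


image = LazyImage(Path("example.hid"), fake_loader)
loads_before = image.load_count  # still 0: the file has not been touched
first = image.data               # triggers the load
second = image.data              # served from the cache
```

Constructing many such objects stays cheap because only the path is stored until the data is actually needed.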

Methods:
  • adjust_annotations(method): Snap the annotation mask to the actual peaks. Accepted values for method:
    • "top": Label only the peak apex
    • "complete": Label the entire peak, boundary to boundary
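
To make the difference between the two modes concrete, here is a minimal sketch on a single 1-D signal. label_peak is a hypothetical helper, and the boundary rule it uses (walk downhill until the signal reaches the threshold or starts rising again) is an assumption, not necessarily the rule adjust_annotations applies.

```python
def label_peak(signal, peak_idx, method, threshold=0.0):
    """Return a 0/1 mask marking one peak (hypothetical helper, not dnanet API)."""
    mask = [0] * len(signal)
    if method == "top":
        # Label only the apex.
        mask[peak_idx] = 1
        return mask
    if method == "complete":
        # Walk left and right while the signal keeps falling and stays above threshold.
        left = peak_idx
        while left > 0 and threshold < signal[left - 1] <= signal[left]:
            left -= 1
        right = peak_idx
        while right < len(signal) - 1 and threshold < signal[right + 1] <= signal[right]:
            right += 1
        for i in range(left, right + 1):
            mask[i] = 1
        return mask
    raise ValueError(f"unknown method: {method!r}")


signal = [0, 1, 4, 9, 4, 1, 0, 2, 0]
top_mask = label_peak(signal, peak_idx=3, method="top")
complete_mask = label_peak(signal, peak_idx=3, method="complete")
# top_mask      -> [0, 0, 0, 1, 0, 0, 0, 0, 0]
# complete_mask -> [0, 1, 1, 1, 1, 1, 0, 0, 0]
```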

Datasets

TransformableDataset

Abstract base class for datasets of HIDImage objects. Provides properties for images, transform, and dataset_strategy.

from dnanet.data.dataset import TransformableDataset

HIDDataset

Loads HID files from a directory. Implements TransformableDataset and torch.utils.data.Dataset.

from dnanet.data.hid_dataset import HIDDataset

Constructor args: See the datasets guide (/guides/datasets) for details.

DataModule

DNANetDataModule

Lightning DataModule bridging TransformableDataset → PyTorch DataLoaders.

from dnanet.data.datamodule import DNANetDataModule

Args:
  • dataset: A TransformableDataset (e.g. HIDDataset)
  • batch_size: Batch size
  • num_workers: Number of DataLoader workers
  • shuffle_train: Whether to shuffle the training DataLoader
  • **split_kwargs: Passed to dataset_splitter(): val_fraction, test_fraction, seed, etc.
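
The split arguments can be pictured with a seeded index split. The sketch below is an illustration only: split_indices is a hypothetical function, and the actual behaviour of dataset_splitter() (rounding, stratification, ordering) may differ.

```python
import random


def split_indices(n_items, val_fraction=0.1, test_fraction=0.1, seed=0):
    """Split range(n_items) into train/val/test index lists (hypothetical helper)."""
    if val_fraction < 0 or test_fraction < 0 or val_fraction + test_fraction >= 1:
        raise ValueError("fractions must be non-negative and sum to less than 1")
    indices = list(range(n_items))
    # A dedicated Random instance keeps the split reproducible for a given seed.
    random.Random(seed).shuffle(indices)
    n_val = int(n_items * val_fraction)
    n_test = int(n_items * test_fraction)
    val = indices[:n_val]
    test = indices[n_val:n_val + n_test]
    train = indices[n_val + n_test:]
    return train, val, test


train, val, test = split_indices(100, val_fraction=0.2, test_fraction=0.1, seed=42)
```

Fixing the seed means the same files end up in the same split on every run, which keeps validation and test results comparable across experiments.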

Parsing

HID Parsing

from dnanet.data.parsing.hid import get_peak_data

get_peak_data(path, strategy, data_loading_strategy) — Parse a HID file and return the raw/analyzed data as a NumPy array.

Preprocessing

Peak Detection

from dnanet.data.preprocessing.peaks import (
    find_peaks_above_threshold,
    find_peak_boundary,
    find_peak_near_idx,
    find_peak_idx_near_or_in_range,
    find_valley_idx_in_range,
    find_absolute_peak_idx_in_range,
)
  • find_peaks_above_threshold(signal, threshold) — Detect peaks including flat-top peaks
  • find_peak_boundary(signal, peak_idx, threshold) — Walk left/right to find peak start and end
  • find_peak_near_idx(signal, idx) — Find nearest peak at least as high
  • find_peak_idx_near_or_in_range(signal, index_range, threshold) — Find dominant peak within or near an index range
  • find_valley_idx_in_range(signal, index_range, threshold) — Find signal minimum within an index range
  • find_absolute_peak_idx_in_range(signal, index_range, threshold) — Find peak by absolute value within an index range
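
Two of the ideas in this list, flat-top peak detection and the left/right boundary walk, can be sketched in plain Python. The functions below are hypothetical re-implementations written for illustration. They do not reproduce the dnanet code, and details such as reporting the middle of a plateau are assumptions.

```python
def find_peaks_with_plateaus(signal, threshold):
    """Indices of local maxima above threshold; a flat top yields its middle index."""
    peaks = []
    n = len(signal)
    i = 1
    while i < n - 1:
        if signal[i] > threshold and signal[i] > signal[i - 1]:
            # Extend over a possible plateau of equal values.
            j = i
            while j < n - 1 and signal[j + 1] == signal[i]:
                j += 1
            # It is a peak only if the signal drops after the plateau.
            if j < n - 1 and signal[j + 1] < signal[i]:
                peaks.append((i + j) // 2)
            i = j + 1
        else:
            i += 1
    return peaks


def walk_peak_boundary(signal, peak_idx, threshold):
    """Walk downhill from the apex until the signal hits threshold or rises again."""
    left = peak_idx
    while left > 0 and threshold < signal[left - 1] <= signal[left]:
        left -= 1
    right = peak_idx
    while right < len(signal) - 1 and threshold < signal[right + 1] <= signal[right]:
        right += 1
    return left, right


signal = [0, 2, 5, 5, 5, 2, 0, 3, 1, 0]
peaks = find_peaks_with_plateaus(signal, threshold=1)                   # [3, 7]
bounds = [walk_peak_boundary(signal, p, threshold=1) for p in peaks]    # [(1, 5), (7, 7)]
```

A strict "greater than both neighbours" test would miss the saturated peak at indices 2 to 4, which is why the plateau is handled explicitly.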

Baseline Estimation

from dnanet.data.preprocessing.baseline import (
    baseline_superior,
    baseline_classic,
    baseline_enhanced,
)
  • baseline_superior(signal) — DNANet's recommended baseline method (100-pt window, 20th percentile)
  • baseline_classic(signal) — Traditional rolling-minimum approach (551-pt window, 20th percentile)
  • baseline_enhanced(signal) — Improved classic with piecewise weighted linear fits and Savitzky-Golay smoothing
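
The window and percentile parameters quoted above suggest a rolling-percentile estimate. The sketch below shows that general technique under stated assumptions: rolling_percentile_baseline is a hypothetical function, the window is centred and truncated at the edges, and the percentile uses a nearest-rank rule. None of these details are confirmed for the dnanet functions.

```python
def rolling_percentile_baseline(signal, window=100, percentile=20.0):
    """Estimate a baseline as a low percentile within a sliding window."""
    n = len(signal)
    half = window // 2
    baseline = []
    for i in range(n):
        # Centred window, shortened near the ends of the signal.
        lo = max(0, i - half)
        hi = min(n, i + half + 1)
        values = sorted(signal[lo:hi])
        # Nearest-rank percentile on the sorted window.
        rank = int(round((percentile / 100.0) * (len(values) - 1)))
        baseline.append(values[rank])
    return baseline


# Flat background of 10 with a single spike at index 25.
signal = [10.0] * 50
signal[25] = 500.0
baseline = rolling_percentile_baseline(signal, window=11, percentile=20.0)
corrected = [s - b for s, b in zip(signal, baseline)]
```

Because a low percentile ignores the few high values inside each window, the spike does not pull the baseline upward, and subtracting the baseline leaves the peak height intact relative to the background.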

Strategies

ScalingStrategy

Abstract base for kit-specific base-pair calibration.

from dnanet.data.strategies.scaling import ScalingStrategy, PowerPlexFusion6CStrategy, GlobalFilerStrategy

Concrete implementations:
  • PowerPlexFusion6CStrategy: PPF6C kit with WEN ILS (bp range 65–475)
  • GlobalFilerStrategy: GlobalFiler kit with GeneScan 600 LIZ (bp range 60–480)
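
As a rough picture of what base-pair calibration involves, the sketch below maps scan points to base pairs by piecewise-linear interpolation between size-standard peaks, then keeps only sizes inside a kit's bp range. This is a simplification chosen for illustration and not necessarily the method the strategies use. interpolate_base_pairs and the ladder positions are made up; only the 60–480 range comes from the list above.

```python
from bisect import bisect_right


def interpolate_base_pairs(scan_points, ladder_scans, ladder_bp):
    """Map scan points to base pairs by piecewise-linear interpolation (hypothetical)."""
    sizes = []
    for x in scan_points:
        # Pick the ladder segment containing x, clamping to the outer segments.
        k = bisect_right(ladder_scans, x) - 1
        k = min(max(k, 0), len(ladder_scans) - 2)
        x0, x1 = ladder_scans[k], ladder_scans[k + 1]
        y0, y1 = ladder_bp[k], ladder_bp[k + 1]
        sizes.append(y0 + (x - x0) * (y1 - y0) / (x1 - x0))
    return sizes


# Invented ladder: scan positions of three size-standard peaks and their known sizes.
ladder_scans = [1000, 2000, 4000]
ladder_bp = [60.0, 100.0, 200.0]

sizes = interpolate_base_pairs([500, 1500, 3000, 4000], ladder_scans, ladder_bp)
# sizes -> [40.0, 80.0, 150.0, 200.0]

# Keep only positions inside the kit's base-pair range (60-480 as listed above).
bp_min, bp_max = 60.0, 480.0
in_range = [bp_min <= bp <= bp_max for bp in sizes]
# in_range -> [False, True, True, True]
```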

DatasetStrategy

Abstract base for dataset-specific file handling.

from dnanet.data.strategies.datasets import DatasetStrategy, NFIRnDStrategy, ProvedItStrategy

Concrete implementations:
  • NFIRnDStrategy: NFI R&D 2p/5p dataset
  • ProvedItStrategy: PROVEDIt court validation dataset

Dataset strategies are instantiated from config and passed directly to HIDDataset, HIDImage, parsing helpers, and transformers.