Architecture Overview
DNANet is organized into six packages, each with a clear responsibility:
dnanet/
├── core/ Domain models (alleles, markers, panels, annotations)
├── data/ Data loading, parsing, preprocessing, strategies
├── models/ Neural network architectures (stateless nn.Module)
├── modules/ PyTorch Lightning modules (training logic)
├── evaluation/ Metrics, allele calling, visualization
└── tasks/ CLI task runners (train, evaluate, cross-validate)
Package Dependencies
The dependency graph flows strictly downward — no circular imports:
- core depends on nothing (pure domain models + stdlib)
- data depends on core + NumPy/SciPy
- models depends on core + PyTorch
- modules depends on models + Lightning
- evaluation depends on core + data + NumPy
- tasks depends on everything (orchestration layer)
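The "no circular imports" claim can be checked mechanically. The sketch below encodes only the internal package dependencies listed above (third-party libraries are left out) and uses the standard library's `graphlib` to derive a valid import order. The `DEPENDENCIES` dict and the `import_order` helper are illustrative names, not part of DNANet.

```python
from graphlib import TopologicalSorter

# Package -> internal packages it depends on, as listed above.
# Third-party dependencies (NumPy, SciPy, PyTorch, Lightning) are omitted.
DEPENDENCIES = {
    "core": set(),
    "data": {"core"},
    "models": {"core"},
    "modules": {"models"},
    "evaluation": {"core", "data"},
    "tasks": {"core", "data", "models", "modules", "evaluation"},
}


def import_order(deps):
    """Return the packages ordered so that dependencies come first.

    Raises graphlib.CycleError if the graph contains a circular dependency.
    """
    return list(TopologicalSorter(deps).static_order())


order = import_order(DEPENDENCIES)
print(order)
```

A check like this could run in CI against the real import graph, so that an accidental upward import fails fast instead of surfacing as an import error at runtime.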
Key Design Decisions
1. Separation of Architecture from Training
Neural network architectures (models/) are pure nn.Module classes with
no training logic. Training loops, optimizers, metrics, and logging live
in Lightning modules (modules/). This allows:
- Reusing the same UNet architecture for segmentation or reconstruction
- Unit-testing architectures without Lightning overhead
- Swapping training strategies without touching model code
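The split can be illustrated without any framework. The sketch below is plain Python, not DNANet code: `TinyModel` stands in for an `nn.Module` in models/, and `TrainingWrapper` stands in for a Lightning module in modules/. All class and function names here are hypothetical.

```python
from typing import Callable, Sequence

Signal = Sequence[float]


class TinyModel:
    """Stand-in for an architecture in models/: forward pass only, no training logic."""

    def __init__(self, scale: float) -> None:
        self.scale = scale

    def forward(self, x: Signal) -> list[float]:
        return [self.scale * v for v in x]


class TrainingWrapper:
    """Stand-in for a training module in modules/: owns the loss, not the layers."""

    def __init__(self, model: TinyModel, loss_fn: Callable[[Signal, Signal], float]) -> None:
        self.model = model
        self.loss_fn = loss_fn

    def training_step(self, x: Signal, target: Signal) -> float:
        return self.loss_fn(self.model.forward(x), target)


def mean_squared_error(pred: Signal, target: Signal) -> float:
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def mean_absolute_error(pred: Signal, target: Signal) -> float:
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(pred)


# The same architecture object is reused under two different training objectives.
model = TinyModel(scale=2.0)
task_a = TrainingWrapper(model, mean_squared_error)
task_b = TrainingWrapper(model, mean_absolute_error)
```

Because `TinyModel` knows nothing about losses or optimizers, it can be tested with a single `forward` call, and a new objective only requires a new wrapper.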
2. Strategy Pattern for Kit & Dataset Variance
Forensic DNA kits (PPF6C, GlobalFiler) and datasets (NFI R&D, ProvedIt) have
different conventions for size standards, file naming, and annotations. Rather
than scattering if kit == "PPF6C" conditions, each variant is a Strategy
class. Kit scaling strategies and dataset strategies are injected directly
into the data pipeline.
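A minimal sketch of the pattern follows. The interface, the class names (`KitStrategy`, `KitA`, `KitB`, `Pipeline`) and the size-standard values are placeholders chosen for illustration; the real ScalingStrategy API may look quite different.

```python
from abc import ABC, abstractmethod


class KitStrategy(ABC):
    """Hypothetical interface: one subclass per kit, no kit-name conditionals."""

    @abstractmethod
    def size_standard(self) -> str:
        """Name of the size standard this kit is run with."""

    def describe(self, sample: str) -> str:
        # Shared logic lives in the base class; only the kit-specific part varies.
        return f"{sample}: scaled with {self.size_standard()}"


class KitA(KitStrategy):
    def size_standard(self) -> str:
        return "standard-A"  # placeholder value, not a real size standard


class KitB(KitStrategy):
    def size_standard(self) -> str:
        return "standard-B"  # placeholder value, not a real size standard


class Pipeline:
    """The pipeline receives a strategy and never inspects which kit it is."""

    def __init__(self, strategy: KitStrategy) -> None:
        self.strategy = strategy

    def process(self, sample: str) -> str:
        return self.strategy.describe(sample)


print(Pipeline(KitA()).process("sample-1"))
print(Pipeline(KitB()).process("sample-1"))
```

Supporting a new kit then means adding one subclass, while the pipeline code stays unchanged.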
3. Hydra Composition over Monolithic Config
Configuration is split into independent groups (data, model, training, evaluation, logging) that are composed at runtime. This avoids the combinatorial explosion of maintaining separate config files for every experiment combination.
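The arithmetic behind this is easy to see in a plain-Python sketch. It mimics the idea of composition only and does not use Hydra's API; in Hydra the fragments would be YAML files in config group directories, selected through a defaults list or command-line overrides. The option names and values below are hypothetical; only the group names come from the text.

```python
from itertools import product

# One small fragment per option, grouped by concern.
GROUPS = {
    "data": {"kit_a": {"batch_size": 16}, "kit_b": {"batch_size": 32}},
    "model": {"unet": {"depth": 4}, "autoencoder": {"latent_dim": 64}},
    "training": {"fast": {"max_epochs": 5}, "full": {"max_epochs": 100}},
}


def compose(**choices):
    """Build one experiment config by picking one option per group."""
    return {group: dict(GROUPS[group][option]) for group, option in choices.items()}


config = compose(data="kit_a", model="unet", training="fast")

# Fragments to maintain grow with the sum of options per group...
fragments = sum(len(options) for options in GROUPS.values())
# ...while the experiments they can express grow with the product.
combinations = len(list(product(*GROUPS.values())))

print(config)
print(fragments, combinations)
```

With three groups of two options each, six fragments cover eight experiments; the gap widens quickly as groups and options are added.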
4. Lazy Loading for HIDImage
HID files are parsed on first access to the .data property, not at
construction time. Scanning a directory of 800 files therefore only creates
lightweight objects and returns almost immediately; the I/O cost is paid
only when data is actually needed.
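One common way to get this behavior in Python is `functools.cached_property`. The sketch below is an assumption about the general technique, not the actual HIDImage implementation; `LazyFile` is a hypothetical stand-in that reads raw bytes instead of parsing HID data.

```python
from functools import cached_property
from pathlib import Path


class LazyFile:
    """Hypothetical stand-in for HIDImage: reads its file on first .data access."""

    def __init__(self, path) -> None:
        # Only the path is stored; the file is not opened here.
        self.path = Path(path)

    @cached_property
    def data(self) -> bytes:
        # Runs once, on first access; the result is cached on the instance.
        return self.path.read_bytes()


# Building the list touches no file contents, so it stays cheap
# no matter how large the individual files are.
images = [LazyFile(p) for p in Path(".").glob("*.hid")]
```

A side effect of this design is that a missing or corrupt file is reported at first access rather than at construction, which callers need to be prepared for.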
Module Map
| Module | Lines | Key Classes | Design Patterns |
|---|---|---|---|
| core/allele.py | ~50 | Allele | Value Object |
| core/marker.py | ~80 | Marker | Value Object |
| core/panel.py | ~200 | Panel | Factory Method (from_xml) |
| core/annotation.py | ~60 | Annotation | Null Object |
| data/image.py | ~250 | HIDImage | Lazy Loading (Virtual Proxy) |
| data/dataset.py | ~25 | TransformableDataset | Abstract Base Class |
| data/hid_dataset.py | ~440 | HIDDataset | Facade |
| data/datamodule.py | ~150 | DNANetDataModule | Adapter, Bridge |
| data/strategies/scaling.py | ~460 | ScalingStrategy, PPF6CStrategy, GlobalFilerStrategy | Strategy, Template Method |
| models/unet.py | ~160 | UNet, EncoderBlock, DecoderBlock | Composite |
| models/autoencoder.py | ~300 | Conv1dAutoencoder, FourierAutoencoder | - |
| models/peaknet.py | ~350 | CombinedClassifier, FiLMCombiner | Strategy (combiner) |
| modules/segmentation.py | ~120 | SegmentationModule | Mediator, Template Method |
| evaluation/allele_caller.py | ~150 | AlleleCaller, NearestBasePairCaller | Strategy |
| tasks/train.py | ~280 | run(), run_with_data() | Facade, Command |
| cli.py | ~60 | main() | Command |