Datasets¶
DNANet supports loading forensic DNA profiles from HID files produced by capillary electrophoresis instruments (e.g., Applied Biosystems 3500).
Supported Datasets¶
NFI R&D Dataset (data=dnanet_rd)¶
The Netherlands Forensic Institute Research & Development dataset contains mixture profiles from 2 to 5 contributors, amplified with the PowerPlex Fusion 6C (PPF6C) kit.
Statistics:
- 350 sample HID files (87×2p, 87×3p, 88×4p, 86×5p contributors)
- 72 ladder files
- 6 mixture dataset directories (different electrophoresis runs)
- Annotation format: AlleleReport TXT files (tab-separated)
- Size standard: WEN ILS (19 peaks)
Required files:
| File | Description |
|---|---|
| Raw data .HID files/ | Root directory with HID files |
| txt_annotations_2024/ | AlleleReport TXT annotation files |
| 2p_5p_hid_to_annotation.csv | Maps HID filenames → annotation names |
| best_ladder_paths_DTH.csv | Maps samples → best ladder file |
| ladder_alleles.csv | Expected ladder alleles per marker |
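Before loading, it can help to confirm that the required files are in place. The check below is a minimal sketch and not part of DNANet: `missing_files` is a hypothetical helper, and it assumes that all five entries sit directly under a single root directory, which the table does not state.

```python
from pathlib import Path

# Names taken from the table above; their location under one root is an assumption.
REQUIRED = [
    "Raw data .HID files",
    "txt_annotations_2024",
    "2p_5p_hid_to_annotation.csv",
    "best_ladder_paths_DTH.csv",
    "ladder_alleles.csv",
]


def missing_files(root: Path) -> list[str]:
    """Return the required entries that do not exist under root (hypothetical helper)."""
    return [name for name in REQUIRED if not (root / name).exists()]
```

An empty result means every entry was found; otherwise the returned names show what still needs to be copied or renamed.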
Annotation pipeline:
1. CSV maps each HID stem to an annotation name in the AlleleReport file
2. Called alleles are parsed from the AlleleReport (marker → allele list)
3. Alleles are mapped to pixel positions using the base-pair scaler
4. Binary segmentation masks are built from allele bin ranges
5. Optionally, masks are refined by snapping to actual peak positions
(adjustment_of_annotations: top or complete)
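Steps 2 to 4 above can be sketched as follows. This is an illustration only, not DNANet's implementation: the linear `bp_to_pixel` scaler, the 65 to 480 bp grid range, the layout of the `bins` dictionary and the marker positions are all assumptions made for the example. The real scaler is derived from the size standard fit.

```python
import numpy as np

NUM_DYES = 5                    # analysis channels
GRID_WIDTH = 4096               # columns on the base-pair grid
BP_MIN, BP_MAX = 65.0, 480.0    # assumed bp range covered by the grid


def bp_to_pixel(bp: float) -> int:
    """Hypothetical linear base-pair scaler: map a size in bp to a grid column."""
    frac = (bp - BP_MIN) / (BP_MAX - BP_MIN)
    return int(round(frac * (GRID_WIDTH - 1)))


def build_mask(called, bins):
    """Build a binary mask from called alleles and their bin ranges.

    called: {marker: [allele, ...]}, as parsed from an annotation file
    bins:   {(marker, allele): (dye_row, start_bp, end_bp)} (assumed layout)
    """
    mask = np.zeros((NUM_DYES, GRID_WIDTH), dtype=np.uint8)
    for marker, alleles in called.items():
        for allele in alleles:
            dye_row, start_bp, end_bp = bins[(marker, allele)]
            lo, hi = bp_to_pixel(start_bp), bp_to_pixel(end_bp)
            mask[dye_row, lo:hi + 1] = 1  # mark the whole bin, end inclusive
    return mask


# Made-up alleles and bin positions, for illustration only.
called = {"D3S1358": ["15", "16"]}
bins = {
    ("D3S1358", "15"): (0, 119.5, 120.5),
    ("D3S1358", "16"): (0, 123.5, 124.5),
}
mask = build_mask(called, bins)
```

The optional refinement in step 5 would then move each marked range toward the nearest observed peak; that step is not shown here.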
PROVEDIt Dataset (data=provedit)¶
The Project on Validation of Evidence Data Interpretation Tools (PROVEDIt) is a publicly available court validation dataset using the GlobalFiler kit.
Statistics:
- ~750 sample HID files across multiple injection times
- 85 ladder files
- Size standard: GeneScan 600 LIZ (34 peaks, iterative fit)
Notes:
- PROVEDIt currently loads without allele annotations (segmentation masks
are all-zero). Full genotype annotation loading from the XLSX file is
planned.
- The GlobalFiler size standard parser uses an iterative shrinking fit.
Files with poor size standard quality are still loaded with a warning
(best-effort fit), matching the original implementation.
- Ladders are auto-discovered by matching the well prefix (e.g., sample
B03_RD14-... matches ladder B03_Ladder-GF_... in the same
directory).
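The ladder auto-discovery described in the last note can be sketched as below. This is a hedged illustration rather than DNANet's code: `well_prefix` and `find_ladder` are hypothetical helpers, the well format (a letter A to H followed by two digits and an underscore) is an assumption, and so is detecting ladders by the word "ladder" in the file name.

```python
import re
from pathlib import Path
from typing import Optional

WELL_PREFIX = re.compile(r"^([A-H]\d{2})_")  # assumed well format, e.g. "B03_"


def well_prefix(file_name: str) -> Optional[str]:
    """Return the well prefix of a file name, or None if it has none."""
    match = WELL_PREFIX.match(file_name)
    return match.group(1) if match else None


def find_ladder(sample: Path, candidates: list[Path]) -> Optional[Path]:
    """Return the first ladder in the sample's directory that shares its well prefix."""
    well = well_prefix(sample.name)
    if well is None:
        return None
    for path in sorted(candidates):
        same_dir = path.parent == sample.parent
        is_ladder = "ladder" in path.name.lower()
        if same_dir and is_ladder and well_prefix(path.name) == well:
            return path
    return None


# Hypothetical file names, for illustration only.
sample = Path("run1/B03_sample.hid")
candidates = [
    Path("run1/C04_Ladder-GF_x.hid"),
    Path("run1/B03_Ladder-GF_x.hid"),
    Path("run2/B03_Ladder-GF_x.hid"),
]
ladder = find_ladder(sample, candidates)
```

Here only the `run1` ladder with prefix `B03` qualifies: the `C04` ladder has a different well, and the `run2` ladder lives in another directory.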
Data Pipeline¶
The data loading pipeline follows this flow:
HID files on disk
│
▼
HIDDataset.__init__()
├─ receives scaling_strategy
├─ receives dataset_strategy
├─ _create_annotation_mapping() ← CSV lookup
├─ _load_ladder_paths() ← CSV lookup
├─ _collect_files() ← walk + filter via strategy
└─ _load_images() ← create HIDImage instances
│
▼
HIDImage._load() (per file, lazy)
├─ get_peak_data(path, strategy) ← parse HID XML
├─ scaling.parse_size_standard(lane) ← validate & fit
├─ profile[:, rescaled_indices] ← rescale to bp grid
├─ parse_called_alleles(...) ← load annotation
└─ _build_segmentation(...) ← binary mask
│
▼
HIDImage(data, annotation, scaler, meta)
│
▼
adjust_annotations(method) ← optional: snap to peaks
│
▼
TransformableDataset (list of HIDImages)
│
▼
DNANetDataModule
├─ split(val_fraction) → train/val
└─ transformer/collate_fn → (x, y) tensors
│
▼
DataLoader → Lightning Trainer
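The `split(val_fraction)` step near the end of this flow can be pictured with the sketch below. It is an assumption-laden illustration: the standalone `split` function, the seeded shuffle and the rounding rule are choices made for the example, and the page does not say whether DNANet shuffles or seeds its split.

```python
import random


def split(items, val_fraction, seed=0):
    """Split items into (train, val) lists after a seeded shuffle (hypothetical helper)."""
    indices = list(range(len(items)))
    random.Random(seed).shuffle(indices)       # reproducible ordering
    n_val = int(round(len(items) * val_fraction))
    val = [items[i] for i in indices[:n_val]]
    train = [items[i] for i in indices[n_val:]]
    return train, val


# Integers stand in for HIDImage instances.
train, val = split(list(range(10)), val_fraction=0.2)
```

Fixing the seed keeps the validation set identical across runs, which makes training curves comparable.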
Data Shapes¶
| Stage | Shape | Description |
|---|---|---|
| Raw HID | (num_dyes+1, ~10000) | Raw scan points per dye |
| After rescaling | (num_dyes, 4096) | Uniform base-pair grid |
| Segmentation mask | (num_dyes, 4096) | Binary annotation |
| Torch input (x) | task-specific | Produced by the configured transformer |
| Torch target (y) | task-specific | Produced by the configured transformer |
Where num_dyes = 5 (analysis channels; size standard excluded by default).
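The shape changes in the table can be reproduced with a few lines of NumPy. This is a sketch under stated assumptions: random numbers stand in for a parsed HID file, the size standard is assumed to be the last row, and `rescaled_indices` is generated linearly here, whereas the pipeline derives it from the size standard fit.

```python
import numpy as np

rng = np.random.default_rng(0)
raw = rng.random((6, 10000))     # (num_dyes + 1, ~10000) raw scan points

# Assumption: the size standard is the last row; drop it to keep 5 analysis dyes.
profile = raw[:-1]

# Assumed scaler output: for each of the 4096 grid columns,
# the index of the raw scan point to read from.
rescaled_indices = np.linspace(2000, 9000, 4096).round().astype(int)

rescaled = profile[:, rescaled_indices]          # (num_dyes, 4096)
mask = np.zeros_like(rescaled, dtype=np.uint8)   # empty mask, same shape
```

Because the mask shares the rescaled grid, a column index means the same base-pair position in both arrays.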
Adding a New Dataset¶
- Create a dataset strategy in src/dnanet/data/strategies/datasets/:

      class MyDatasetStrategy(DatasetStrategy):
          @classmethod
          def categorize_file(cls, file_name: str) -> FileCategory:
              # Classify as "sample", "ladder", "control", or "unknown"
              ...

          @classmethod
          def get_number_of_contributors(cls, file_name: str) -> int | None:
              # Extract NOC from filename (e.g., "2p")
              ...

          @classmethod
          def get_sample_id(cls, file_name: str) -> str:
              # Extract unique sample ID from filename
              ...

          # ... implement remaining abstract methods

- Create a config in conf/data/my_dataset.yaml and point dataset_strategy._target_ at the strategy class:

      name: my_dataset
      dataset:
        _target_: dnanet.data.hid_dataset.HIDDataset
        root: data/my_dataset/
        scaling_strategy: ${data.scaling_strategy}
        dataset_strategy: ${data.dataset_strategy}
      scaling_strategy:
        _target_: dnanet.data.strategies.scaling.powerplex_fusion_6c.PowerPlexFusion6CStrategy
        scanpoint_resolution: 4096
      dataset_strategy:
        _target_: dnanet.data.strategies.datasets.my_dataset.MyDatasetStrategy

- Run:

      dnanet task=train data=my_dataset model=unet training=segmentation
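As a starting point for the method bodies in the first step, the sketch below shows one way to parse file names. It is hypothetical throughout: the functions are standalone rather than `DatasetStrategy` classmethods, `categorize_file` returns plain strings instead of a `FileCategory`, and the naming scheme (a `.hid` extension, the words "ladder" and "control", an isolated token such as `2p`) is an assumption about your data.

```python
import re
from typing import Optional

# A single digit followed by "p", not embedded in a longer alphanumeric token.
NOC_PATTERN = re.compile(r"(?<![0-9A-Za-z])([1-9])p(?![0-9A-Za-z])")


def categorize_file(file_name: str) -> str:
    """Classify a file as "sample", "ladder", "control", or "unknown"."""
    lower = file_name.lower()
    if not lower.endswith(".hid"):
        return "unknown"
    if "ladder" in lower:
        return "ladder"
    if "control" in lower:
        return "control"
    return "sample"


def get_number_of_contributors(file_name: str) -> Optional[int]:
    """Extract the number of contributors from a token such as "2p", if present."""
    match = NOC_PATTERN.search(file_name)
    return int(match.group(1)) if match else None


def get_sample_id(file_name: str) -> str:
    """Use the file name without its extension as the sample ID."""
    return file_name.rsplit(".", 1)[0]


# Hypothetical file name, for illustration only.
name = "A01_mix_2p_run1.hid"
category = categorize_file(name)
noc = get_number_of_contributors(name)
sample_id = get_sample_id(name)
```

Once the logic works on a handful of real file names, it can be moved into the classmethods of the strategy class and adapted to return the project's `FileCategory` values.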