A probabilistic deep learning approach for choroid plexus segmentation in autism spectrum disorder – NPP — Digital Psychiatry and Neuroscience
Building machine learning systems that generalize beyond their training data typically depends on painstaking, manual “ground truth” labels. That approach does not scale. A newer strategy is to quantify model uncertainty so we can assess reliability even when labels are unavailable. In this work, the ASCHOPLEX choroid plexus segmentation model was adapted and tested for generalization to autism spectrum disorder (ASD), with a particular focus on probabilistic uncertainty estimates as a proxy for out-of-distribution detection.
Study goal and hypothesis
The team asked: can ASCHOPLEX segment the choroid plexus in ASD reliably after targeted finetuning, and can uncertainty metrics flag cases where performance might degrade? They hypothesized the model would be most certain on individuals most similar to the finetuning cohort—specifically, adults—showing lower uncertainty for ASD and control (CON) adults compared with children.
Cohorts and imaging
Local cohort (finetuning and evaluation). T1-weighted multi-echo MPRAGE (MEMPRAGE) scans were collected from 65 adults (36 ASD, 29 CON; ages 18–40) at 1 mm isotropic resolution on a 3T TIM Trio scanner. ASD diagnoses were made by board-certified psychiatrists and corroborated with ADI-R and ADOS-2. Institutional review board approval and informed consent/assent procedures were followed.
ABIDE cohort (external generalization). The Autism Brain Imaging Data Exchange provided 2,226 subjects (1,060 ASD; 1,166 CON) from 24 sites with varied scanners and protocols. After a two-rater visual quality control pass (checking for missing data, artifacts, field-of-view issues, orientation, and severe motion), 410 scans were excluded. The resulting 1,802 scans formed the final cohort: 708 males with ASD, 109 females with ASD, 774 male CON, and 252 female CON.
Manual ground truth
For the local dataset, choroid plexus labels in the lateral ventricles were manually traced in OsiriX by a radiology trainee and quality-checked by an experienced MRI researcher. Tracing proceeded slice-by-slice in axial view, with boundary refinements in coronal and sagittal planes to avoid cerebrospinal fluid or brain tissue.
ASCHOPLEX finetuning
ASCHOPLEX is an ensemble of five deterministic deep neural networks for choroid plexus segmentation. To tailor it to ASD, the team finetuned the ensemble on 12 local subjects (6 train: 3 ASD, 3 CON; 6 validation: 3 ASD, 3 CON), balanced by sex. Performance was then tested on the remaining 53 adults (30 ASD, 23 CON) using Dice similarity coefficient against manual labels. FreeSurfer v6.0.0 served as a classical, non–deep learning comparator.
From deterministic to probabilistic
Ensembles alone provide only coarse uncertainty estimates, derived from disagreement among their member models. To obtain fine-grained, voxel-level uncertainty, the researchers implemented a probabilistic variant of ASCHOPLEX using Monte Carlo (MC) Dropout. They:
- Enabled dropout layers during finetuning and inference across all five models to reduce overfitting and capture stochasticity.
- Empirically tested dropout rates (0.1, 0.25, 0.4, 0.5) on the local set; 0.1 offered the best Dice while preserving sufficient variance for uncertainty estimation.
- Modified post-processing to retain voxel-wise probabilities (0–1) instead of immediately thresholding at 0.5.
This approach approximates a Bayesian neural network and yields predictive distributions per voxel, enabling uncertainty decomposition.
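For illustration, here is a minimal PyTorch-style sketch of how dropout layers can be kept stochastic at inference time; the function name and the assumption that each ensemble member is a standard nn.Module are ours, not the authors'.

```python
import torch.nn as nn

def enable_mc_dropout(model: nn.Module, rate: float = 0.1) -> nn.Module:
    """Put the network in eval mode but keep its Dropout layers sampling,
    so repeated forward passes give stochastic predictions (MC Dropout).
    The 0.1 rate mirrors the value the authors selected empirically."""
    model.eval()                                   # freezes BatchNorm statistics, etc.
    for module in model.modules():
        if isinstance(module, (nn.Dropout, nn.Dropout2d, nn.Dropout3d)):
            module.p = rate                        # dropout probability under test
            module.train()                         # keep sampling dropout masks
    return model
```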
Inference and segmentation
Deterministic pipeline. Each model in the ensemble produced a binary mask; final labels came from majority voting.
Probabilistic pipeline. For each subject, the team ran 20 MC dropout passes per model (five models), generating 100 stochastic segmentations. The voxel-wise mean probability map was then thresholded at 0.5 to create the final binary mask. Sensitivity analyses indicated that thresholds above 0.2 yielded similar Dice on held-out data.
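A hedged sketch of that probabilistic inference loop, assuming each ensemble member returns voxel-wise logits for a single foreground class; the tensor shapes, function names, and sigmoid output are illustrative assumptions rather than the authors' code.

```python
import torch
import torch.nn as nn

@torch.no_grad()
def mc_ensemble_segmentation(models, image, passes_per_model=20, threshold=0.5):
    """Average voxel-wise probabilities over MC-dropout passes from every
    ensemble member (20 passes x 5 models = 100 stochastic samples)."""
    samples = []
    for model in models:                      # the five finetuned ensemble members
        model.eval()
        for m in model.modules():             # keep only the dropout layers stochastic
            if isinstance(m, (nn.Dropout, nn.Dropout2d, nn.Dropout3d)):
                m.train()
        for _ in range(passes_per_model):
            prob = torch.sigmoid(model(image))        # (1, 1, D, H, W) foreground probability
            samples.append(prob)
    probs = torch.stack(samples)              # (n_samples, 1, 1, D, H, W)
    mean_prob = probs.mean(dim=0)             # voxel-wise predictive mean
    mask = (mean_prob >= threshold).to(torch.uint8)   # binarize (0.5 by default)
    return mask, probs
```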
Uncertainty metrics
For each subject, voxel-wise predictive means were computed by averaging across stochastic samples. Four uncertainty maps were derived: total uncertainty (the entropy of the mean prediction), aleatoric uncertainty (the entropy of each stochastic prediction, averaged over samples), epistemic uncertainty (their difference, computed via Bayesian Active Learning by Disagreement), and a standard deviation map as an intuitive dispersion index. Uncertainty was calculated per voxel across both inter-model and intra-model variability (dropout realizations), then masked to the predicted choroid plexus and summarized into a single subject-level score.
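A minimal sketch of this entropy-based decomposition applied to the stacked MC samples; the array shapes, natural-log entropy, and pooling by the mean over predicted voxels are assumptions, and the authors' exact subject-level summary may differ.

```python
import numpy as np

def decompose_uncertainty(probs, mask, eps=1e-8):
    """probs: (n_samples, D, H, W) foreground probabilities from the MC samples.
    mask:  (D, H, W) binary predicted choroid plexus used to pool voxel scores."""
    def binary_entropy(p):
        return -(p * np.log(p + eps) + (1 - p) * np.log(1 - p + eps))

    mean_p = probs.mean(axis=0)                       # voxel-wise predictive mean
    total = binary_entropy(mean_p)                    # total (predictive) uncertainty
    aleatoric = binary_entropy(probs).mean(axis=0)    # expected entropy of each sample
    epistemic = total - aleatoric                     # BALD / mutual information
    std_map = probs.std(axis=0)                       # dispersion index

    pool = mask.astype(bool)                          # assumes a non-empty prediction
    return {name: float(vals[pool].mean())            # one subject-level score each
            for name, vals in [("total", total), ("aleatoric", aleatoric),
                               ("epistemic", epistemic), ("std", std_map)]}
```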
Sampling schemes:
- Local dataset: 100 samples per subject (20 passes × 5 models).
- ABIDE subset: 100 samples for a representative 86-subject subset.
- ABIDE full: 25 samples per subject (5 passes × 5 models) for computational feasibility.
Generalization was assessed by comparing uncertainty between ABIDE children (ages 5–17) and adults (18–64).
Evaluation metrics
Primary accuracy metric. Dice similarity coefficient between automated and manual segmentations on the 53 held-out local subjects.
Secondary metrics. Hausdorff distance, volume similarity, and Pearson correlation of segmented volumes (details in Supplementary Materials).
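For reference, a small sketch of how the overlap and volume metrics can be computed on binary masks with NumPy/SciPy; the edge-case handling (e.g., empty masks) is an assumption.

```python
import numpy as np
from scipy.stats import pearsonr

def dice_coefficient(pred, truth):
    """Dice = 2|A ∩ B| / (|A| + |B|) for binary masks."""
    pred, truth = pred.astype(bool), truth.astype(bool)
    denom = pred.sum() + truth.sum()
    return 2.0 * np.logical_and(pred, truth).sum() / denom if denom else 1.0

def volume_similarity(pred, truth):
    """VS = 1 - |V_pred - V_truth| / (V_pred + V_truth)."""
    vp, vt = int(pred.sum()), int(truth.sum())
    return 1.0 - abs(vp - vt) / (vp + vt) if (vp + vt) else 1.0

# Pearson correlation of segmented vs. manual volumes across held-out subjects:
# r, p = pearsonr(predicted_volumes, manual_volumes)
# Hausdorff distance can be taken over surface-voxel coordinates, e.g. with
# scipy.spatial.distance.directed_hausdorff evaluated in both directions.
```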
Bias analysis
A linear mixed-effects model tested whether performance varied by diagnosis (ASD vs. CON), sex (male vs. female), and segmentation procedure (FreeSurfer; ASCHOPLEX deterministic without finetuning; ASCHOPLEX deterministic with finetuning; ASCHOPLEX probabilistic with finetuning). Dice was the dependent variable, with all interactions considered and subject as a random intercept. Post-hoc t-tests were FDR-corrected.
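As an illustration only, a model of this form could be specified in Python with statsmodels roughly as follows; the column names, input file, and choice of software are hypothetical and not taken from the paper.

```python
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.multitest import multipletests

# df: one row per subject x segmentation procedure, with columns
# dice, diagnosis (ASD/CON), sex (M/F), procedure (4 levels), subject (ID)
df = pd.read_csv("dice_long_format.csv")                     # hypothetical input file

model = smf.mixedlm("dice ~ diagnosis * sex * procedure",    # all interactions
                    data=df, groups=df["subject"])           # random intercept per subject
result = model.fit()
print(result.summary())

# FDR (Benjamini-Hochberg) correction for the post-hoc pairwise t-tests:
# rejected, p_adj, _, _ = multipletests(pvals, method="fdr_bh")
```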
Why this matters
- Scalable validation: Probabilistic segmentation enables reliability checks on large, unlabeled datasets, reducing dependence on manual ground truth.
- Domain shift awareness: Uncertainty surfaces highlight when a model is less confident—critical for multi-site, multi-age cohorts like ABIDE.
- Targeted finetuning: A small, representative sample (12 subjects) can meaningfully adapt a model to a new clinical population.
- Transparent thresholds: Explicit probability maps and sensitivity analyses around binarization thresholds help standardize downstream analyses.
Takeaway
This study positions uncertainty-aware deep learning as a practical path for deploying medical imaging models across diverse clinical populations and scanners. By integrating MC Dropout into ASCHOPLEX and rigorously comparing deterministic and probabilistic pipelines, the team demonstrates how to both boost segmentation performance through finetuning and monitor reliability via uncertainty metrics—especially when generalizing from adults to children in ASD cohorts.