Comparative analysis of deep learning models and strategies for multiclass brain MRI segmentation


Magnetic Resonance Imaging (MRI) is one of the most advanced techniques for studying brain structure and pathology, and it has become a fundamental tool in the diagnosis and monitoring of neurological disorders. Its clinical utility can be further enhanced by accurate and efficient image segmentation: delineating anatomical and pathological regions is essential for quantitative analysis, treatment planning, and disease monitoring. Although manual segmentation by experts remains the gold standard, it is laborious, error-prone, and impractical in large-scale or time-critical clinical contexts. These limitations have motivated the development of automated segmentation methods based on deep learning. However, despite the outstanding progress achieved in recent years, current solutions still face significant challenges, including limited availability of annotated data, severe class imbalance, and difficulties in generalization across datasets and acquisition protocols.
This Master’s Thesis addresses these challenges by performing a comparative analysis of three reference deep learning architectures for semantic segmentation: U-Net, DeepLabV3, and Fully Convolutional Network (FCN). The work was conducted in the context of multiclass brain MRI segmentation, using a synthetic dataset of phantoms. The study is structured around three complementary experimental approaches, designed to assess both architectural performance and improvement strategies. The first approach is based on direct training from scratch and was conducted on a phantom dataset. This dataset was specifically created to simulate realistic brain structures and included multimodal acquisitions (T1, T2, PD), manually and semi-automatically segmented into six classes: background, tumor, white matter, grey matter, blood vessels, and external markers. Masks were generated and standardized using a pipeline combining Label Studio with the Segment Anything Model (SAM), ensuring consistent and reproducible annotations. The second approach explored transfer learning through pretraining on the publicly available BraTS2020 dataset, followed by fine-tuning on the phantom dataset. The aim was to investigate whether representations pretrained on real patient data could improve performance on a smaller, domain-specific dataset. The third approach introduced advanced data augmentation strategies to address class imbalance and limited sample size. All three architectures were first trained under homogeneous configurations to ensure fair comparison, and subsequently with customized hyperparameters optimized per model. Evaluation relied on widely adopted segmentation metrics (Dice coefficient, Intersection over Union, pixel accuracy), with complementary analysis of training dynamics, per-class performance, computational cost, and qualitative visual inspection of representative cases.
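The abstract names the evaluation metrics without defining them; as a reference, the per-class Dice coefficient and Intersection over Union, along with overall pixel accuracy, can be sketched as follows for integer label maps (the function names and the zero-denominator convention here are illustrative choices, not taken from the thesis):

```python
import numpy as np

def dice(pred, target, cls):
    # Dice = 2|A ∩ B| / (|A| + |B|), computed for one class label
    p, t = (pred == cls), (target == cls)
    denom = p.sum() + t.sum()
    return 2.0 * np.logical_and(p, t).sum() / denom if denom else 1.0

def iou(pred, target, cls):
    # IoU = |A ∩ B| / |A ∪ B|, computed for one class label
    p, t = (pred == cls), (target == cls)
    union = np.logical_or(p, t).sum()
    return np.logical_and(p, t).sum() / union if union else 1.0

def pixel_accuracy(pred, target):
    # Fraction of pixels whose predicted label matches the ground truth
    return (pred == target).mean()
```

In a multiclass setting such as the six-class phantom dataset, the per-class scores are typically averaged (with or without the background class) to obtain a single figure per model.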
Results from the first approach revealed clear architectural differences. Under homogeneous hyperparameters, DeepLabV3 and FCN outperformed U-Net, establishing the initial ranking among the three models. After introducing architecture-specific adjustments, such as tailored learning rates, performance improved substantially in all cases: DeepLabV3 achieved the best overall results, while U-Net remained the weakest. These findings highlight the importance of both architectural design and hyperparameter tuning. In terms of computational efficiency, FCN was the fastest and least memory-demanding, while U-Net was the most resource-intensive, requiring significantly more GPU memory and FLOPs.
In contrast, transfer learning with BraTS2020 did not translate into performance gains under the conditions of this study. While pretrained weights led to smoother training dynamics and mitigated overfitting, the final performance of the three architectures was lower than that obtained by direct training from scratch. The third approach, based on augmentation strategies, achieved results comparable to direct training while mitigating overfitting and improving segmentation of certain classes. White and grey matter benefited the most, with Dice scores consistently improving across models. The tumor class also showed slight but consistent gains, demonstrating that targeted augmentation enhanced minority-class learning. However, improvements were less pronounced for highly underrepresented classes, where data scarcity remained a limiting factor. The qualitative visual analysis reinforced these observations, showing that DeepLabV3 and FCN produced cleaner masks with sharper boundaries, whereas U-Net tended to generate blurrier contours and spurious predictions in minority classes. After augmentation, all models produced visually more consistent masks, particularly in complex regions, confirming the benefits of increased data variability.
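The abstract does not specify which augmentation transforms were used, but a key constraint in segmentation is that every geometric transform must be applied identically to the image and its label mask so annotations stay aligned. A minimal sketch, assuming lossless transforms (flips and 90-degree rotations) that require no interpolation of the label map:

```python
import numpy as np

def augment(image, mask, rng):
    # Apply the SAME random geometric transform to image and mask,
    # so class labels remain spatially aligned with the anatomy.
    if rng.random() < 0.5:
        image, mask = np.fliplr(image), np.fliplr(mask)
    k = int(rng.integers(0, 4))  # rotate by 0, 90, 180, or 270 degrees
    image, mask = np.rot90(image, k), np.rot90(mask, k)
    return image, mask
```

Transforms that interpolate (elastic deformation, scaling) would additionally require nearest-neighbor resampling of the mask to avoid inventing fractional class labels.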
Taken together, the comparative analysis demonstrates that DeepLabV3 is the most balanced and robust model for multiclass brain MRI segmentation, combining high accuracy with moderate computational cost. Beyond the numerical results, this work highlights the crucial role of dataset quality, annotation consistency, and class balance in training reliable medical segmentation models. The main limitations of the study are the small dataset size and the strong class imbalance, which constrained the ability to evaluate minority structures.
In conclusion, this thesis primarily provides a systematic comparison of three reference deep learning architectures for multiclass brain MRI segmentation, establishing their relative strengths and weaknesses under different conditions. Building on this benchmark, alternative strategies such as transfer learning and data augmentation were explored to address the limitations encountered. The study thus contributes practical insights into the trade-offs between model performance, computational cost, and dataset constraints, offering guidance for future developments in AI-assisted neuroimaging. Among the evaluated models, DeepLabV3 emerges as the most balanced candidate for potential integration into clinical workflows, supporting tasks such as tumor delineation, treatment planning, and disease monitoring.
