Evaluating and Optimizing CNN–Transformer Architectures for Musculoskeletal Disease Classification

Moulay Youssef Ichahane; Noureddine Assad

Authors

Moulay Youssef Ichahane, LTI Laboratory, ENSA, Chouaib Doukkali University, El Jadida, Morocco, https://orcid.org/0000-0001-7382-1019
Noureddine Assad, LTI Laboratory, ENSA, Chouaib Doukkali University, El Jadida, Morocco

Keywords

Deep Learning, Dataset Scaling, Computer Vision, Neural Network Architecture

Abstract

This study examines the impact of dataset scale on deep learning performance in musculoskeletal disease detection, focusing on osteoporosis and rheumatoid arthritis. Using over 200,000 annotated X-ray, DXA, and MRI images, the performance of Vision Transformer (ViT), ConvNeXt, and Swin Transformer models was systematically evaluated in terms of scalability, robustness, and multi-modal integration. Results demonstrate that increasing dataset scale significantly enhances model generalization, with Swin Transformer achieving the best performance (AUC = 0.94, p < 0.001). These findings underscore the critical role of self-attention mechanisms and model scaling strategies in medical image classification, providing new benchmarks for dataset requirements and guiding the development of more reliable AI-driven diagnostic systems. Furthermore, the study emphasizes the necessity of large, diverse datasets to mitigate overfitting and improve real-world applicability. It also highlights the potential of hybrid architectures for integrating multi-source medical data. Overall, this research contributes to advancing explainable and scalable AI solutions for musculoskeletal imaging in clinical practice.
