Proceedings of the 2025 3rd International Conference on Image, Algorithms, and Artificial Intelligence (ICIAAI 2025)

Analysis of Distributed Training Systems and Optimization Algorithms

Authors
Peixin Yang1, *
1Morrissey College of Arts and Sciences, Boston College, Chestnut Hill, MA 02467, USA
*Corresponding author. Email: yangbjt@bc.edu
Available Online 31 August 2025.
DOI
10.2991/978-94-6463-823-3_107
Keywords
Distributed Training Systems; Parameter Server; Communication Optimization; Fault Tolerance
Abstract

The growing depth of machine learning models and the growing scale of training data have outstripped the capacity of a single machine, making distributed training systems a necessity. This paper presents a thorough treatment of two key architectural paradigms, the parameter server and decentralized architectures, and analyzes their trade-offs in terms of scalability, communication efficiency, and fault tolerance. It focuses on three of the most fundamental optimization algorithms, synchronous SGD, asynchronous SGD, and ADMM, showing the trade-offs between communication cost and convergence stability. Through case studies of frameworks such as TensorFlow and Horovod, this study shows that a decentralized architecture eliminates 75% of the communication overhead of a centralized design, and that hybrid models such as DeepSpeed enable elastic deployment in the face of fast-changing models. The trade-offs of synchronization overhead, gradient staleness, and system brittleness are addressed by means of gradient compression, adaptive synchronization strategies, and resilient checkpointing. The paper also points to future directions, including adaptation to network dynamism, standardized evaluation metrics, and privacy-preserving techniques, to further advance the scalability and reliability of distributed training systems. Its central point is that architectural and algorithmic innovation both play a key role in scaling training to large sizes.
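To make the synchronous/asynchronous trade-off described in the abstract concrete, the following is a minimal sketch in plain Python/NumPy (an illustration of the general techniques, not the paper's implementation; the model, worker batches, and function names are hypothetical). It contrasts a synchronous step, in which all workers' gradients are averaged (as an all-reduce would do) before one consistent update, with an asynchronous parameter-server-style step, in which a worker applies its gradient immediately and may therefore act on stale parameters.

```python
import numpy as np

def local_gradient(w, x, y):
    """Least-squares gradient computed on one worker's mini-batch."""
    return 2.0 * x.T @ (x @ w - y) / len(y)

def synchronous_sgd_step(w, batches, lr=0.01):
    """Synchronous SGD: every worker's gradient is averaged (as an
    all-reduce would do) before a single, consistent update."""
    grads = [local_gradient(w, x, y) for x, y in batches]
    return w - lr * np.mean(grads, axis=0)

def asynchronous_sgd_step(w, batch, lr=0.01):
    """Asynchronous SGD: one worker pushes its gradient to the shared
    parameters as soon as it is ready; relative to other workers the
    gradient may be stale, trading stability for throughput."""
    x, y = batch
    return w - lr * local_gradient(w, x, y)

# Hypothetical setup: four workers, each with a small random mini-batch.
rng = np.random.default_rng(0)
w = np.zeros(4)
batches = [(rng.normal(size=(8, 4)), rng.normal(size=8)) for _ in range(4)]

w = synchronous_sgd_step(w, batches)      # one barrier-synchronized step
w = asynchronous_sgd_step(w, batches[0])  # one lock-free worker update
```

The synchronous step pays a barrier cost at every iteration but keeps all replicas identical; the asynchronous step removes the barrier at the price of gradient staleness, which is exactly the trade-off the paper analyzes.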

Copyright
© 2025 The Author(s)
Open Access
This chapter is licensed under the terms of the Creative Commons Attribution-NonCommercial 4.0 International License (http://creativecommons.org/licenses/by-nc/4.0/), which permits any noncommercial use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.


Volume Title
Proceedings of the 2025 3rd International Conference on Image, Algorithms, and Artificial Intelligence (ICIAAI 2025)
Series
Advances in Computer Science Research
Publication Date
31 August 2025
ISBN
978-94-6463-823-3
ISSN
2352-538X
DOI
10.2991/978-94-6463-823-3_107

Cite this article

TY  - CONF
AU  - Peixin Yang
PY  - 2025
DA  - 2025/08/31
TI  - Analysis of Distributed Training Systems and Optimization Algorithms
BT  - Proceedings of the 2025 3rd International Conference on Image, Algorithms, and Artificial Intelligence (ICIAAI 2025)
PB  - Atlantis Press
SP  - 1115
EP  - 1127
SN  - 2352-538X
UR  - https://doi.org/10.2991/978-94-6463-823-3_107
DO  - 10.2991/978-94-6463-823-3_107
ID  - Yang2025
ER  -