Analysis of Distributed Training Systems and Optimization Algorithms
- DOI
- 10.2991/978-94-6463-823-3_107
- Keywords
- Distributed Training Systems; Parameter Server; Communication Optimization; Fault Tolerance
- Abstract
The growing depth of machine learning models and the scale of training data exceed the capacity of a single machine, making distributed training systems a necessity. This paper presents a thorough treatment of two key architectural paradigms, the parameter server and decentralized architectures, and analyzes their trade-offs in scalability, communication efficiency, and fault tolerance. It focuses on three fundamental optimization algorithms, synchronous SGD, asynchronous SGD, and ADMM, and examines how each balances communication cost against convergence stability. Case studies on frameworks such as TensorFlow and Horovod show that a decentralized architecture reduces communication overhead by 75% relative to a centralized design, and that hybrid models such as DeepSpeed enable elastic deployment for rapidly evolving models. The trade-offs of synchronization overhead, gradient staleness, and system brittleness are addressed through gradient compression, adaptive synchronization strategies, and resilient checkpointing. The paper also outlines future directions, including adaptation to network dynamism, standardized evaluation metrics, and privacy-preserving techniques, to further advance the scalability and reliability of distributed training systems. It concludes that architectural and algorithmic innovation both play a key role in scaling to large models.
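To make the synchronous/asynchronous distinction discussed in the abstract concrete, the following is a minimal illustrative sketch (not taken from the paper) of a toy in-memory parameter server: the synchronous step waits for all workers and applies one averaged update, while the asynchronous step applies each worker's gradient as it arrives, at the cost of possible staleness. The class and method names are hypothetical.

```python
# Illustrative sketch only: a toy parameter server contrasting synchronous
# and asynchronous SGD updates on a flat list-of-floats parameter vector.
from typing import List


class ToyParameterServer:
    """Minimal in-memory parameter server holding a flat parameter vector."""

    def __init__(self, params: List[float], lr: float = 0.1):
        self.params = list(params)
        self.lr = lr

    def synchronous_step(self, worker_grads: List[List[float]]) -> None:
        # Synchronous SGD: wait for all workers, average their gradients,
        # then apply a single update. Stable, but pays a synchronization barrier.
        n = len(worker_grads)
        avg = [sum(g[i] for g in worker_grads) / n for i in range(len(self.params))]
        self.params = [p - self.lr * g for p, g in zip(self.params, avg)]

    def asynchronous_step(self, grad: List[float]) -> None:
        # Asynchronous SGD: apply each worker's gradient as soon as it arrives.
        # No barrier, but the gradient may have been computed on stale parameters.
        self.params = [p - self.lr * g for p, g in zip(self.params, grad)]


if __name__ == "__main__":
    ps = ToyParameterServer([1.0, -2.0])
    # Two workers report gradients for the same step; apply one averaged update.
    ps.synchronous_step([[0.2, -0.4], [0.4, -0.2]])
    print("after synchronous step:", ps.params)
    # A late gradient is applied immediately, without waiting for other workers.
    ps.asynchronous_step([0.1, -0.1])
    print("after asynchronous step:", ps.params)
```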
- Copyright
- © 2025 The Author(s)
- Open Access
- Open Access This chapter is licensed under the terms of the Creative Commons Attribution-NonCommercial 4.0 International License (http://creativecommons.org/licenses/by-nc/4.0/), which permits any noncommercial use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.
- Cite this article
TY  - CONF
AU  - Peixin Yang
PY  - 2025
DA  - 2025/08/31
TI  - Analysis of Distributed Training Systems and Optimization Algorithms
BT  - Proceedings of the 2025 3rd International Conference on Image, Algorithms, and Artificial Intelligence (ICIAAI 2025)
PB  - Atlantis Press
SP  - 1115
EP  - 1127
SN  - 2352-538X
UR  - https://doi.org/10.2991/978-94-6463-823-3_107
DO  - 10.2991/978-94-6463-823-3_107
ID  - Yang2025
ER  -