Analysis of Distributed Training Systems and Optimization Algorithms
- DOI
- 10.2991/978-94-6463-823-3_107
- Keywords
- Distributed Training Systems; Parameter Server; Communication Optimization; Fault Tolerance
- Abstract
The growing depth of machine learning models and the scale of training data exceed the capacity of a single machine, making distributed training systems a necessity. This paper presents a thorough treatment of two key architectural paradigms, the parameter server and decentralized architectures, and analyzes their trade-offs in scalability, communication efficiency, and fault tolerance. It focuses on three fundamental optimization algorithms, synchronous SGD, asynchronous SGD, and ADMM, and examines how each balances communication cost against convergence stability. Case studies on frameworks such as TensorFlow and Horovod show that a decentralized architecture reduces communication overhead by 75% relative to a centralized design, and that hybrid models such as DeepSpeed enable elastic deployment for rapidly evolving models. The trade-offs of synchronization overhead, gradient staleness, and system brittleness are addressed through gradient compression, adaptive synchronization strategies, and resilient checkpointing. The paper also outlines future directions, including adaptation to network dynamism, standardized evaluation metrics, and privacy-preserving techniques, to further advance the scalability and reliability of distributed training systems. It concludes that architectural and algorithmic innovation both play a key role in scaling to large models.
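To make the synchronous/asynchronous distinction discussed in the abstract concrete, the following is a minimal illustrative sketch (not taken from the paper) of a toy in-memory parameter server: the synchronous step waits for all workers and applies one averaged update, while the asynchronous step applies each worker's gradient as it arrives, at the cost of possible staleness. The class and method names are hypothetical.

```python
# Illustrative sketch only: a toy parameter server contrasting synchronous
# and asynchronous SGD updates on a flat list-of-floats parameter vector.
from typing import List


class ToyParameterServer:
    """Minimal in-memory parameter server holding a flat parameter vector."""

    def __init__(self, params: List[float], lr: float = 0.1):
        self.params = list(params)
        self.lr = lr

    def synchronous_step(self, worker_grads: List[List[float]]) -> None:
        # Synchronous SGD: wait for all workers, average their gradients,
        # then apply a single update. Stable, but pays a synchronization barrier.
        n = len(worker_grads)
        avg = [sum(g[i] for g in worker_grads) / n for i in range(len(self.params))]
        self.params = [p - self.lr * g for p, g in zip(self.params, avg)]

    def asynchronous_step(self, grad: List[float]) -> None:
        # Asynchronous SGD: apply each worker's gradient as soon as it arrives.
        # No barrier, but the gradient may have been computed on stale parameters.
        self.params = [p - self.lr * g for p, g in zip(self.params, grad)]


if __name__ == "__main__":
    ps = ToyParameterServer([1.0, -2.0])
    # Two workers report gradients for the same step; apply one averaged update.
    ps.synchronous_step([[0.2, -0.4], [0.4, -0.2]])
    print("after synchronous step:", ps.params)
    # A late gradient is applied immediately, without waiting for other workers.
    ps.asynchronous_step([0.1, -0.1])
    print("after asynchronous step:", ps.params)
```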
- Copyright
- © 2025 The Author(s)
- Open Access
- Open Access This chapter is licensed under the terms of the Creative Commons Attribution-NonCommercial 4.0 International License (http://creativecommons.org/licenses/by-nc/4.0/), which permits any noncommercial use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.
- Cite this article
TY  - CONF
AU  - Peixin Yang
PY  - 2025
DA  - 2025/08/31
TI  - Analysis of Distributed Training Systems and Optimization Algorithms
BT  - Proceedings of the 2025 3rd International Conference on Image, Algorithms, and Artificial Intelligence (ICIAAI 2025)
PB  - Atlantis Press
SP  - 1115
EP  - 1127
SN  - 2352-538X
UR  - https://doi.org/10.2991/978-94-6463-823-3_107
DO  - 10.2991/978-94-6463-823-3_107
ID  - Yang2025
ER  -