Practice of large language model training optimization based on a large-scale AI cluster with more than 10 000 domestic NPUs

Bibliographic Details
Main Authors: LOU Tao, NIU Hongweihua, ZHANG Pengfei, DONG Jiangfan, LI Panpan, LI Daotong, XU Weidong, YAO Chenghui, XUE Lianhao, TANG Ting, XIANG Jie
Format: Article
Language: Chinese (zho)
Published: Beijing Xintong Media Co., Ltd, 2025-07-01
Series: Dianxin kexue (Telecommunications Science)
Online Access: http://www.telecomsci.com/zh/article/doi/10.11959/j.issn.1000-0801.2025166/
Description
Summary: To address the problems of low compute utilization, poor stability, difficult training optimization, and an immature domestic accelerator ecosystem in AI cluster model training at the scale of more than 10 000 NPUs, a large language model training optimization solution based on a fully domestic AI cluster was proposed. Through automatic distributed strategy recommendation, pipeline parallel optimization, overlap optimization, and full-link profiling, the model FLOPS utilization (MFU) reached 45.13% when training a 405B large language model on 16 384 domestic NPUs, more than 10% higher than the baseline performance. In addition, a stability assurance mechanism covering the entire training process was built, providing real-time monitoring of key indicators before and during training, as well as rapid fault diagnosis after a training task is interrupted. The experimental results show that the proposed solution effectively improves compute utilization and offers important guidance for the future construction of domestic AI clusters and for large language model training.
ISSN: 1000-0801
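
Since the abstract's headline result is an MFU figure, a minimal sketch of how MFU is commonly computed for dense transformers may be useful. The per-NPU peak FLOPS and cluster throughput below are illustrative assumptions, not values reported in the paper.

# Minimal sketch of the model FLOPS utilization (MFU) calculation.
# All concrete inputs below are assumptions for illustration only.

def mfu(params: float, tokens_per_second: float,
        num_devices: int, peak_flops_per_device: float) -> float:
    # Dense-transformer training cost is commonly estimated at
    # ~6 * params FLOPs per token (forward + backward pass).
    achieved_flops = 6.0 * params * tokens_per_second
    peak_flops = num_devices * peak_flops_per_device
    return achieved_flops / peak_flops

# Hypothetical example: a 405B-parameter model on 16 384 NPUs, assuming
# 300 TFLOPS peak per NPU and ~9.1e5 tokens/s cluster throughput
# (both assumed), which lands near the reported ~45% MFU.
print(f"MFU = {mfu(405e9, 9.1e5, 16_384, 300e12):.2%}")

Under these assumed numbers the formula yields roughly 45% MFU; the actual throughput and per-NPU peak used in the paper would determine the exact 45.13% figure.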