Supercomputer networking to accelerate large-scale AI training
Summary
OpenAI has introduced Multipath Reliable Connection (MRC), a networking protocol designed to improve the efficiency and reliability of the supercomputer clusters used to train large-scale AI models. Developed in collaboration with AMD, Broadcom, Intel, Microsoft, and NVIDIA, MRC enables multi-plane network architectures that use adaptive packet spraying and static source routing based on SRv6 (Segment Routing over IPv6). Together, these techniques minimize network congestion, let traffic route quickly around hardware failures without disrupting training jobs, and reduce overall infrastructure complexity, which in turn supports the training of increasingly capable frontier AI models.
(Source: OpenAI)
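
To make the idea of adaptive packet spraying over several network planes more concrete, here is a minimal toy sketch in Python. It is not MRC's actual algorithm, which this summary does not describe. The names `Path`, `Sprayer`, `mark_failed`, and `demo` are hypothetical, the "least outstanding packets" rule is a stand-in for whatever congestion signal a real implementation uses, and the `hops` tuple only loosely imitates a fixed source route such as an SRv6 segment list.

```python
from dataclasses import dataclass


@dataclass
class Path:
    """One pre-computed path through a single network plane (hypothetical model)."""
    plane: int            # index of the network plane this path belongs to
    hops: tuple           # fixed hop list, a stand-in for a static source route
    queued: int = 0       # packets sent on this path and not yet acknowledged
    healthy: bool = True  # set to False once a failure is detected


class Sprayer:
    """Spreads packets over healthy paths, preferring the least-loaded one."""

    def __init__(self, paths):
        self.paths = list(paths)

    def pick(self):
        candidates = [p for p in self.paths if p.healthy]
        if not candidates:
            raise RuntimeError("no healthy path available")
        # Assumption: outstanding packet count approximates congestion.
        # Ties are broken by plane index so the demo is deterministic.
        return min(candidates, key=lambda p: (p.queued, p.plane))

    def send(self, packet_id):
        # packet_id is unused in this toy model; a real sender would
        # attach the chosen route to the packet header.
        path = self.pick()
        path.queued += 1
        return path.plane

    def ack(self, plane):
        # An acknowledgement frees one slot on the given plane's path.
        for p in self.paths:
            if p.plane == plane and p.queued > 0:
                p.queued -= 1

    def mark_failed(self, plane):
        # Later packets avoid this plane; the sending job keeps running.
        for p in self.paths:
            if p.plane == plane:
                p.healthy = False


def demo():
    paths = [Path(plane=i, hops=(f"leaf{i}", f"spine{i}")) for i in range(4)]
    sprayer = Sprayer(paths)
    before = [sprayer.send(i) for i in range(8)]
    sprayer.mark_failed(2)  # simulate a hardware failure in plane 2
    after = [sprayer.send(i) for i in range(8, 14)]
    return before, after


if __name__ == "__main__":
    before, after = demo()
    print("before failure:", before)
    print("after failure: ", after)
```

In this sketch the sender spreads packets evenly over all four planes until plane 2 is marked as failed, after which traffic continues over the remaining three planes without interruption. A real protocol would also have to handle out-of-order delivery and retransmission of packets lost on the failed path, which this toy model leaves out.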