Supercomputer networking to accelerate large-scale AI training

OpenAI
OpenAI introduced MRC, a new networking protocol developed with industry partners to improve performance and resilience in large-scale AI training clusters.

Summary

OpenAI has introduced Multipath Reliable Connection (MRC), a networking protocol designed to improve the efficiency and reliability of the supercomputer clusters used to train large-scale AI models. Developed in collaboration with AMD, Broadcom, Intel, Microsoft, and NVIDIA, MRC enables multi-plane network architectures that use adaptive packet spraying and SRv6-based static source routing. Together, these techniques minimize network congestion, let traffic route quickly around hardware failures without disrupting training jobs, and reduce overall infrastructure complexity, supporting the development of increasingly capable frontier AI models.
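
To make the idea of adaptive packet spraying more concrete, here is a minimal toy sketch in Python. It is not the MRC algorithm, whose details are not given in this summary. It assumes a simple policy in which each packet goes to the least-loaded healthy path, with ties broken at random, and a failed path is skipped. The names `Path`, `Sprayer`, `pick_path`, `mark_failed`, and `demo` are hypothetical, and SRv6 source routing is not modeled.

```python
import random
from dataclasses import dataclass, field


@dataclass
class Path:
    """One hypothetical network path, e.g. one plane of a multi-plane fabric."""
    name: str
    healthy: bool = True
    queued: int = 0  # packets assigned so far; a stand-in for a congestion signal


@dataclass
class Sprayer:
    """Toy per-packet load balancer (an assumption, not the MRC design)."""
    paths: list
    rng: random.Random = field(default_factory=lambda: random.Random(0))

    def pick_path(self) -> Path:
        # Consider only paths currently believed to be up.
        candidates = [p for p in self.paths if p.healthy]
        if not candidates:
            raise RuntimeError("no healthy path available")
        # "Adaptive" here means: prefer the least-loaded path, breaking
        # ties at random so traffic does not pile onto a single path.
        least = min(p.queued for p in candidates)
        return self.rng.choice([p for p in candidates if p.queued == least])

    def send(self, packet_id: int) -> str:
        # Assign the packet to a path and record the extra load.
        path = self.pick_path()
        path.queued += 1
        return path.name

    def mark_failed(self, name: str) -> None:
        # Later packets avoid this path; the sender keeps running.
        for p in self.paths:
            if p.name == name:
                p.healthy = False


def demo():
    sprayer = Sprayer([Path(f"plane{i}") for i in range(4)])
    before = [sprayer.send(i) for i in range(8)]
    sprayer.mark_failed("plane2")  # simulate a hardware failure
    after = [sprayer.send(i) for i in range(8, 20)]
    return before, after
```

Calling `demo()` spreads the first eight packets evenly over four paths, then spreads the next twelve over the three that remain after `plane2` is marked failed. A real implementation would react to measured congestion and loss rather than a local counter.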

(Source: OpenAI)