RoCE networks for distributed AI training at scale
- AI networks play an important role in interconnecting tens of thousands of GPUs, forming the foundational infrastructure for training and enabling large models with hundreds of billions of parameters such as LLAMA 3.1 405B.
- This week at ACM SIGCOMM 2024 in Sydney, Australia, we're sharing details on the network we have built at Meta over the past few years to support our large-scale distributed AI training workloads.
- Our paper, "RDMA over Ethernet for Distributed AI Training at Meta Scale," provides the details on how we design, implement, and operate one of the world's largest AI networks at scale.
The growing prevalence of AI has introduced a new era of communication demands. Distributed training, in particular, imposes the most significant strain on data center networking infrastructure. For instance, a typical generative AI (GenAI) job may require tight coordination of tens of thousands of GPUs over the course of several weeks. Building a reliable, high-performance network infrastructure capable of accommodating this growing demand calls for a rethinking of data center network design.
When Meta introduced distributed GPU-based training, we decided to construct specialized data center networks tailored for these GPU clusters. We opted for RDMA Over Converged Ethernet version 2 (RoCEv2) as the inter-node communication transport for the majority of our AI capacity.
We have successfully expanded our RoCE networks, evolving from prototypes to the deployment of numerous clusters, each accommodating thousands of GPUs. These RoCE clusters support an extensive range of production distributed GPU training jobs, including ranking, content recommendation, content understanding, natural language processing, and GenAI model training, among other workloads.
Topology
We built a dedicated backend network specifically for distributed training. This allowed us to evolve, operate, and scale independently from the rest of the data center network. To support large language models (LLMs), we expanded the backend network toward the DC-scale, e.g., incorporating topology-awareness into the training job scheduler.
The separation
The training cluster relies on two independent networks: the frontend (FE) network for tasks such as data ingestion, checkpointing, and logging, and the backend (BE) network for training, as depicted below.
A training rack is connected to both the FE and BE of the data center network. The FE has a hierarchy of network layers – rack switches (RSWs), fabric switches (FSWs), and higher – that houses the storage warehouse, which provides GPUs with the input data they need for training workloads. We ensure that there is enough ingress bandwidth on the rack switch so that it does not hinder the training workload.
The BE is a specialized fabric that connects all RDMA NICs in a non-blocking architecture, providing high bandwidth, low latency, and lossless transport between any two GPUs in the cluster, regardless of their physical location. This backend fabric uses the RoCEv2 protocol, which encapsulates the RDMA service in UDP packets for transport over the network.
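For a rough picture of what that encapsulation looks like on the wire, here is a minimal sketch of the RoCEv2 header stack in plain Python. The field values are placeholders; only the layering itself and the well-known UDP destination port 4791 come from the RoCEv2 specification.

```python
# Illustrative layout of a RoCEv2 packet: Ethernet / IP / UDP (dst 4791) /
# InfiniBand Base Transport Header / RDMA payload / ICRC. Values are placeholders.
ROCEV2_UDP_DST_PORT = 4791  # well-known UDP destination port for RoCEv2

packet_layers = [
    ("Ethernet", {"src_mac": "...", "dst_mac": "...", "ethertype": "IPv4/IPv6"}),
    ("IP",       {"src": "gpu-nic-a", "dst": "gpu-nic-b", "dscp": "lossless class"}),
    ("UDP",      {"src_port": "varies per flow (entropy for load balancing)",
                  "dst_port": ROCEV2_UDP_DST_PORT}),
    ("IB BTH",   {"opcode": "RDMA WRITE", "dest_qp": "0x...", "psn": "sequence number"}),
    ("Payload",  {"bytes": "RDMA data"}),
    ("ICRC",     {"bytes": 4}),  # invariant CRC over the IB headers and payload
]

for name, fields in packet_layers:
    print(f"{name:8s} {fields}")
```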
AI Zone
Our BE networks have undergone several transformations. Initially, our GPU clusters used a simple star topology with a few AI racks connected to a central Ethernet switch running the non-routable RoCEv1 protocol. This setup had clear limitations in GPU scale and switch redundancy. Therefore, we swiftly transitioned to a fabric-based architecture for extended scalability and higher availability.
We designed a two-stage Clos topology for AI racks, known as an AI Zone. The rack training switch (RTSW), serving as the leaf switch, offers scale-up connectivity for GPUs within the rack using copper-based DAC cables. The spine tier, composed of modular cluster training switches (CTSWs), provides scale-out connectivity among all racks in the cluster. The CTSW has deep buffers statically divided over the ports in the chassis. The RTSWs connect to the CTSWs via single-mode fiber and 400G pluggable transceivers.
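The sketch below builds an illustrative connectivity map for one such two-stage Clos zone: every RTSW (leaf) connects to every CTSW (spine). The switch counts are made-up placeholders, not our production numbers.

```python
# Illustrative two-stage Clos (leaf/spine) wiring for one AI Zone.
# Switch counts are hypothetical placeholders, not production values.
NUM_RTSW = 8   # rack training switches (leaf), one per AI rack
NUM_CTSW = 4   # cluster training switches (spine)

def build_ai_zone(num_rtsw: int, num_ctsw: int) -> dict[str, list[str]]:
    """Return an adjacency map: each leaf has an uplink to every spine."""
    fabric: dict[str, list[str]] = {}
    for r in range(num_rtsw):
        leaf = f"rtsw{r}"
        # Scale-out connectivity: every RTSW reaches every CTSW.
        fabric[leaf] = [f"ctsw{c}" for c in range(num_ctsw)]
    return fabric

zone = build_ai_zone(NUM_RTSW, NUM_CTSW)
for leaf, spines in zone.items():
    print(leaf, "->", ", ".join(spines))
```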
The AI Zones are designed to support a large number of interconnected GPUs in a non-blocking manner. However, emerging AI advances, such as LLMs like Llama, demand a GPU scale larger than a single AI Zone provides. To accommodate this, we designed an aggregator training switch (ATSW) layer that connects the CTSWs in a data center building, expanding the RoCE domain beyond a single AI Zone.
Note that the cross-AI Zone connectivity is oversubscribed by design, with network traffic balanced using ECMP. To mitigate the performance bottleneck for cross-AI Zone traffic, we enhanced the training job scheduler to find a "minimum cut" when dividing the training nodes into different AI Zones, reducing cross-AI Zone traffic and thus collective completion time. The scheduler does this by learning the position of GPU servers in the logical topology to recommend a rank assignment.
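The scheduler's actual logic isn't spelled out here, but as a rough sketch of the idea, the hypothetical heuristic below greedily co-locates the rank pairs that exchange the most traffic, which approximates the "minimum cut" objective of reducing cross-zone bytes. The traffic matrix, zone count, and capacity are invented for illustration.

```python
# Hypothetical sketch: place ranks into AI Zones so that heavily communicating
# pairs land in the same zone, approximating a minimum-cut partition.

def place_ranks(traffic: dict[tuple[int, int], float],
                num_ranks: int, zone_capacity: int) -> dict[int, int]:
    """Greedy heuristic: walk rank pairs by descending traffic volume and try to
    co-locate them; otherwise fall back to the least-loaded zone with space."""
    zone_of: dict[int, int] = {}
    load = [0, 0]  # two zones in this toy example

    def assign(rank: int, zone: int) -> None:
        zone_of[rank] = zone
        load[zone] += 1

    for (a, b), _volume in sorted(traffic.items(), key=lambda kv: -kv[1]):
        for rank, peer in ((a, b), (b, a)):
            if rank in zone_of:
                continue
            # Prefer the peer's zone if it still has room, to keep the pair local.
            if peer in zone_of and load[zone_of[peer]] < zone_capacity:
                assign(rank, zone_of[peer])
            else:
                assign(rank, min(range(len(load)), key=lambda z: load[z]))
    # Ranks with no recorded traffic go to the least-loaded zone.
    for rank in range(num_ranks):
        if rank not in zone_of:
            assign(rank, min(range(len(load)), key=lambda z: load[z]))
    return zone_of

# Toy example: ranks 0-3, where 0<->1 and 2<->3 exchange the most data.
traffic = {(0, 1): 100.0, (2, 3): 90.0, (1, 2): 5.0}
print(place_ranks(traffic, num_ranks=4, zone_capacity=2))
# -> {0: 0, 1: 0, 2: 1, 3: 1}: only the small 1<->2 flow crosses zones.
```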
Routing
The scaling of compute power and network topology discussed above raised the question of how to efficiently balance and route the massive training traffic. Specifically, AI training workloads have several challenging characteristics:
- Low entropy: Compared to traditional data center workloads, the number and the diversity of flows for AI workloads are much smaller, and the flow patterns are usually repetitive and predictable.
- Burstiness: On the time dimension, the flows usually exhibit an "on and off" nature at the time granularity of milliseconds.
- Elephant flows: For each burst, the intensity of each flow could reach up to the line rate of the NICs.
ECMP and path pinning
We initially considered the widely adopted ECMP, which places flows randomly based on a hash of the five-tuple: source and destination IPs, source and destination UDP ports, and protocol. However, and as expected, ECMP delivered poor performance for the training workload because of the low flow entropy.
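To make the entropy problem concrete, here is a toy model of ECMP next-hop selection; the hash function and uplink count are illustrative stand-ins, not the switch ASIC's actual behavior. With only a handful of five-tuples per GPU pair, several flows easily hash onto the same uplink while others sit idle.

```python
# Toy ECMP model: hash the five-tuple, pick an uplink.
# The hash function and uplink count are illustrative only.
import zlib
from collections import Counter

NUM_UPLINKS = 4

def ecmp_uplink(src_ip: str, dst_ip: str, src_port: int, dst_port: int,
                proto: str = "udp") -> int:
    five_tuple = f"{src_ip}|{dst_ip}|{src_port}|{dst_port}|{proto}".encode()
    return zlib.crc32(five_tuple) % NUM_UPLINKS

# A training job with few, repetitive flows: one flow per GPU pair,
# all targeting the RoCEv2 UDP destination port 4791.
flows = [(f"10.0.0.{i}", f"10.0.1.{i}", 49152 + i, 4791) for i in range(8)]
usage = Counter(ecmp_uplink(*f) for f in flows)
print("flows per uplink:", dict(usage))  # typically uneven with so few flows
```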
Alternatively, we designed and deployed a path-pinning scheme in the initial years of our deployment. This scheme routed packets to specific paths based on the destination "slice" (the index of the RTSW downlink). This worked well if every rack was fully assigned to the same job and there were no failures in the network. However, this was seldom true. We observed that a rack could be partially allocated to a job, with only one of the two hosts in the rack using the uplink bandwidth. This fragmented job placement caused uneven traffic distribution and congestion on the uplinks of the affected RTSW, and degraded training performance by more than 30%. Further, network failures on an uplink or a CTSW caused the affected flows to be unevenly reassigned to other CTSWs by ECMP. These reassigned flows collided with other existing flows and slowed down the whole training job.
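For contrast, a bare-bones version of the path-pinning idea might look like the following, where the uplink is chosen deterministically from the destination slice rather than from a hash; the slice numbering and uplink count are placeholders, not our production mapping.

```python
# Path pinning: deterministic uplink choice keyed on the destination "slice"
# (the index of the destination's RTSW downlink). Placeholder values only.
NUM_UPLINKS = 4

def pinned_uplink(dst_slice_index: int) -> int:
    # Each destination slice is pinned to one uplink (and hence one CTSW path).
    return dst_slice_index % NUM_UPLINKS

# Works well when a rack is fully used by one job; with fragmented placement,
# traffic concentrates on the uplinks pinned to the active slices while the
# others stay underused.
for slice_index in range(8):
    print(f"slice {slice_index} -> uplink {pinned_uplink(slice_index)}")
```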
We mitigated the immediate impact of these flow collisions by upgrading the bandwidth of the RTSW uplinks by 2x, allowing the RTSW uplink capacity to be 1:2 under-subscribed compared to the RTSW downlink capacity. While this mitigated the immediate performance impact, it was an expensive solution since it required 2x network capacity. Thus, we recognized it as a short-term mitigation and proceeded to further stages of routing evolution.
Queue pair scaling
We next revisited ECMP with the intent of increasing the number of flows for hierarchical collectives through the queue pair (QP) scaling software feature in the collective library.
To account for this, we configured switches to perform Enhanced ECMP (E-ECMP), which additionally hashes on the destination QP field of a RoCE packet using the UDF capability of the switch ASIC. This increased entropy and, compared to baseline ECMP without QP scaling, we observed that E-ECMP together with QP scaling delivered a performance improvement of up to 40% for the AllReduce collective.
We evaluated two QP scaling strategies. The first involved splitting each message intended to be posted over a single QP onto multiple QPs instead, resulting in multiple flows. But it also produced smaller message sizes on the fabric as well as multiple ACKs. The second approach involved posting each message to a different queue, in a round-robin fashion. For the NIC message sizes seen in our production use of NCCL, we observed the latter to perform well. This feature has been essential for ECMP scalability by increasing the number of network flows for hierarchical collectives like AllReduce.
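A hedged sketch of the second strategy follows: whole messages are posted round-robin across several QPs so that E-ECMP, which also hashes the destination QP, sees more distinct flows. The QP "handles" and post_write() call are stand-ins for illustration, not the real verbs or NCCL interfaces.

```python
# Illustrative round-robin QP scaling: each whole message goes to the next QP,
# so E-ECMP (which also hashes the destination QP) sees more flows.
# FakeQP and post_write() are stand-ins, not real verbs/NCCL APIs.
from itertools import cycle

class FakeQP:
    def __init__(self, qp_num: int) -> None:
        self.qp_num = qp_num

    def post_write(self, message: bytes) -> None:
        print(f"QP {self.qp_num}: RDMA write of {len(message)} bytes")

def send_round_robin(messages: list[bytes], qps: list[FakeQP]) -> None:
    """Strategy 2: keep messages whole, rotate across QPs (more flows, same
    message size on the fabric, no extra per-message splitting or ACKs)."""
    for qp, msg in zip(cycle(qps), messages):
        qp.post_write(msg)

qps = [FakeQP(n) for n in (101, 102, 103, 104)]
send_round_robin([b"x" * 65536 for _ in range(8)], qps)
```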
While we improved ECMP performance with QP scaling, the underlying probabilistic nature of hashing remained a persistent downside of this routing scheme. Also, the need to customize the QP scaling factor and method based on the workload type, while workable in the short term, introduced long-term operational complexity.
Congestion control
As we transitioned to 400G deployments, we attempted to tune DCQCN to adapt to the new network speeds and topology. However, with default DCQCN settings and doubled ECN thresholds compared to 200G networks, performance degraded. Further investigation revealed that the DCQCN implementation in firmware had changed, introducing bugs and reduced visibility, with problems relating to correct CNP counting.
We proceeded without DCQCN for our 400G deployments. To date, we have had over a year of experience with just PFC for flow control, without any other transport-level congestion control. We have observed stable performance and an absence of persistent congestion for training collectives.
Receiver-driven traffic admission
To mitigate congestion for 400G and beyond, we co-designed the collective library and RoCE transport to enforce receiver-driven traffic admission for better performance. The diagram below shows that the GPU-to-GPU communication architecture in our production training clusters predominantly uses two-stage copy and receiver-initiated communication via the NCCL collective library. Each GPU's high bandwidth memory (HBM) maintains multiple channels for parallel transmission of chunked collective messages. The sender GPU threads first copy data from the compute buffer to an available channel buffer. The sender CPU proxy thread can only post an RDMA write request after receiving a clear-to-send (CTS) packet from the receiver, which includes the size and memory information. The receiver's GPU threads then copy the channel buffer contents into the destination compute buffer. Finally, CPU proxy threads on both sides recycle the channel buffer, and the receiver CPU proxy sends another CTS packet once the channel buffer is ready.
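The sketch below models that handshake with plain Python queues: a sender proxy only issues a (simulated) RDMA write once it holds a CTS from the receiver, and the receiver grants a fresh CTS each time a channel buffer is recycled. It is a simplified model of the behavior described above, not NCCL's actual proxy code.

```python
# Simplified model of receiver-driven traffic admission (not NCCL internals):
# the sender may only post an RDMA write once it holds a CTS credit, so the
# in-flight data is capped by (number of channels) x (channel buffer size).
import queue
import threading

NUM_CHANNELS = 2                      # channel buffers per peer (illustrative)
CHANNEL_BUFFER_BYTES = 512 * 1024     # hypothetical channel buffer size
chunks = [b"x" * CHANNEL_BUFFER_BYTES for _ in range(6)]

cts_queue: "queue.Queue[int]" = queue.Queue()                      # CTS credits (buffer ids)
data_queue: "queue.Queue[tuple[int, bytes] | None]" = queue.Queue()

def receiver() -> None:
    for buf_id in range(NUM_CHANNELS):
        cts_queue.put(buf_id)          # advertise every free channel buffer
    while (item := data_queue.get()) is not None:
        buf_id, _chunk = item
        # "Copy" the channel buffer into the destination compute buffer, then
        # recycle it by granting a new CTS for the same buffer.
        cts_queue.put(buf_id)

def sender() -> None:
    for chunk in chunks:
        buf_id = cts_queue.get()       # block until the receiver grants credit
        data_queue.put((buf_id, chunk))  # stands in for the RDMA write
        print(f"posted {len(chunk)} bytes into channel buffer {buf_id}")
    data_queue.put(None)               # signal completion

rx = threading.Thread(target=receiver)
tx = threading.Thread(target=sender)
rx.start(); tx.start(); tx.join(); rx.join()
```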
We effectively leverage this mechanism as receiver-driven traffic admission to limit the amount of in-flight traffic on the network, especially when congestion starts to build up. However, configuring the right settings can be challenging because:
- The number of channels is limited due to resource contention between GPU threads and concurrent compute operations;
- Setting the channel buffer size requires a more careful balance between congestion spreading and bandwidth under-utilization than on InfiniBand, due to RoCE's more coarse-grained flow control and possible end-host slowness.
Thus, we took two steps to improve performance. First, we experimentally determined the right parameter settings for the number of channels and the channel buffer size across various training job sizes and collective types. Second, we implemented high-priority queuing at switches for CTS packets to expedite the notifications and mitigate potential bandwidth starvation.
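To see why the channel count and buffer size need tuning, a quick bandwidth-delay-product calculation shows how much CTS-granted data must be outstanding just to keep a 400G NIC busy; anything much larger mostly feeds congestion spreading. The round-trip time and buffer size below are assumed illustrative values, not measured production numbers.

```python
# Back-of-the-envelope sizing: outstanding (CTS-granted) bytes must cover the
# bandwidth-delay product or the 400G link goes idle. RTT and buffer size are
# assumed illustrative values, not measured production numbers.
LINE_RATE_BPS = 400e9               # 400G NIC
RTT_SECONDS = 20e-6                 # assumed fabric round-trip time
CHANNEL_BUFFER_BYTES = 512 * 1024   # hypothetical channel buffer size

bdp_bytes = LINE_RATE_BPS / 8 * RTT_SECONDS
channels_needed = -(-bdp_bytes // CHANNEL_BUFFER_BYTES)  # ceiling division

print(f"bandwidth-delay product: {bdp_bytes / 1e6:.2f} MB")
print(f"channels of {CHANNEL_BUFFER_BYTES // 1024} KiB needed to cover it: "
      f"{int(channels_needed)}")
# Too little outstanding data -> the NIC idles waiting for CTS (under-utilization);
# too much -> more in-flight data than a congested fabric can absorb.
```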
Congestion control has been a focal point of research in RDMA networks. DCQCN has been the gold standard for storage-focused networks. However, our experience with distributed AI training workloads provides a different perspective on tailoring congestion control algorithms. Despite turning off DCQCN, and despite multiple instances of RTSWs sending PFC to a deep-buffer CTSW, we have not encountered a scenario over the past four years where production AI training traffic causes the CTSWs to send PFC to RTSWs persistently.
Our current solution relies on careful coordination between the collective communication library and the network. It may depend on the relative throughput between GPU and network, which may not be applicable to all scenarios. We encourage the research community to place more focus on this topic.
Moving forward
The design and operation of large-scale RoCE networks for distributed AI training workloads have evolved to meet the increasing demands of computational density and scale. By segregating the FE and BE networks, employing different routing schemes, and optimizing collective traffic patterns, we have been able to build a performant and reliable network infrastructure. These designs and insights underline the importance of deeply understanding the training workload and translating those implications into network component design, ultimately contributing to the advancement of distributed AI training infrastructure.
With the fast-growing trend of GenAI workloads, our network infrastructure will continue to evolve rapidly.
Read the paper
RDMA over Ethernet for Distributed AI Training at Meta Scale
Acknowledgements
We would like to thank all contributors to the paper, including Rui Miao, Shengbao Zheng, Sai Jayesh Bondu, Guilherme Goes, Hany Morsy, Rohit Puri, Adi Mohammad Riftadi, Ashmitha Jeevaraj Shetty, Jingyi Yang, Shuqiang Zhang, Mikel Jimenez Fernandez, Shashi Gandham, and Omar Baldonado. Many current and former members of the Network Infrastructure team at Meta have contributed to productionizing RoCE networks for AI training over the years. In particular, we would like to acknowledge Srinivas Sridharan, Petr Lapukhov, Jose Leitao, and Brandon Taylor. This work is a close collaboration with our partners in Meta's AI Production Engineering, AI and Systems Co-design, and AI Hardware Systems teams.