Feature: The popularity of generative AI models has exploded over the past two years, and keeping pace requires ever-larger quantities of accelerators.
Without a breakthrough in machine learning, and with power becoming a limiting factor, AI's growth may ultimately depend on a new kind of supercomputer, one that spans entire nations and possibly even continents.
The idea is simple enough: if it's no longer feasible to build bigger datacenters, you can start combining the ones you already have, and that's where the industry appears to be heading. "Distribution is inevitable," Dell'Oro analyst Sameh Boujelbene told The Register.
She's not the only one who thinks so. Gilad Shainer, senior veep of networking at Nvidia, believes that "in the coming generation you will see the ability to actually build those remote datacenters together and form a large, virtual, single datacenter."
Distributing large workloads across multiple machines is nothing new in high-performance computing. It's how all modern supercomputers, AI and scientific alike, work: thousands of nodes stitched together with high-speed interconnects like Nvidia's InfiniBand or HPE's Slingshot.
In many ways, distributing workloads between multiple datacenters is a continuation of that model, but it comes with its own unique challenges.
The good news is that the infrastructure needed to connect datacenters already exists, at least in part. Cloud providers have long used high-speed datacenter interconnects (DCI); the technology is nothing new.
For traditional scientific workloads, Nvidia's Mellanox unit has offered its MetroX product line, which uses dense wavelength division multiplexing (DWDM) to bridge InfiniBand fabrics across datacenters up to 40 kilometers apart.
Unfortunately, the latest generation was released just weeks before ChatGPT ignited the AI gold rush, and it was designed for disaster recovery and high availability, not the large-scale AI training that has taken off in the years since the chatbot's launch.
Shainer says research is underway to extend the range of this kind of technology from tens to thousands of kilometers, which would help solve the power challenge by allowing datacenters in different regions to work together. But the sheer distances involved and the nature of AI workloads present challenges of their own.
Balancing bandwidth and latency
AI workloads, in general, love bandwidth but hate latency. Inside the datacenter, the biggest challenge is packet loss or stalled connections, which leave compute sitting idle while data is retransmitted. According to AMD, 30 percent of training time is spent waiting on the network.
Several technologies have been developed to overcome this. Nvidia's InfiniBand is one solution, while specialized data processing units (DPUs) and AI-optimized switches tackle the same challenges on Ethernet.
When talking about datacenter-to-datacenter networks, latency is an unavoidable fact of life. Light can only travel through glass fiber so fast: roughly 4.9 microseconds per kilometer. That's quick, but over a span of 1,000 kilometers it adds up to a round trip of about 10 milliseconds before you factor in protocol and processing overheads, and retransmissions become all the more costly over such distances.
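For a rough sense of the arithmetic, here's a back-of-the-envelope calculation in Python using the figures above; real links would add protocol, processing, and retransmission overheads on top.

```python
# Fiber propagation delay, using the ~4.9 us/km figure cited above.
US_PER_KM = 4.9        # speed of light through glass fiber
DISTANCE_KM = 1_000    # example span between two datacenters

one_way_ms = DISTANCE_KM * US_PER_KM / 1_000
round_trip_ms = 2 * one_way_ms

print(f"One way:    {one_way_ms:.1f} ms")     # 4.9 ms
print(f"Round trip: {round_trip_ms:.1f} ms")  # ~9.8 ms before overheads
```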
Depending on the bandwidth and distances involved, repeaters and amplifiers may be needed to boost the signal, which can exacerbate the issue. Rodney Wilson, chief technologist for research networks at optics vendor Ciena, told El Reg that new technologies on the horizon could help address the problem. Hollow core fiber is one of them: light travels faster through its air-filled core than through solid glass, and fewer repeaters are needed along the way, both of which cut latency. But hollow core fiber is still relatively new, and there's plenty of dark fiber already in the ground.
The problem isn't just latency; it's also bandwidth. The scale-out networks used to connect GPU servers inside the datacenter typically give each server eight 400Gbps links, one per GPU, for a total of 3.2Tbps. Extending that scale-out network over the DCI would require multiple petabits of aggregate bandwidth. Wilson says modern optics in carrier networks can now support up to 1.6Tbps per wavelength, but even with multiple wavelengths per fiber, you're looking at a pretty hefty bundle.
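To put those figures in context, here's a quick sketch of the math; the 1,000-server cluster size is an assumption chosen for illustration, not a number from Wilson or Nvidia.

```python
# Rough aggregate-bandwidth estimate for stretching a GPU scale-out
# network over a DCI. The cluster size is hypothetical.
GPUS_PER_SERVER = 8
LINK_GBPS = 400
SERVERS = 1_000          # assumed cluster size, for illustration only
WAVELENGTH_TBPS = 1.6    # per-wavelength ceiling Wilson cites

server_tbps = GPUS_PER_SERVER * LINK_GBPS / 1_000   # 3.2 Tbps per server
aggregate_pbps = SERVERS * server_tbps / 1_000      # 3.2 Pbps in total
wavelengths = SERVERS * server_tbps / WAVELENGTH_TBPS

print(f"Aggregate: {aggregate_pbps} Pbps")
print(f"Wavelengths at 1.6 Tbps each: {wavelengths:,.0f}")  # 2,000
```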
Shainer says software optimization can mitigate many of these latency and bandwidth issues. Depending on how workloads are distributed across datacenters, it's possible to minimize the bandwidth required while hiding the latency.
For example, if you wanted to run a training workload across two physically distant clusters, you'd distribute the work so that the heavy calculation completes locally within each datacenter, and only send data across the datacenter interconnect when it's time to combine the results.
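As a minimal sketch of that idea, the toy Python below simulates hierarchical gradient aggregation, assuming two sites with eight workers each; the function names and sizes are illustrative, not any vendor's API.

```python
# Toy hierarchical reduction: reduce inside each datacenter first, so
# only one already-averaged tensor per site crosses the interconnect.
import numpy as np

def local_reduce(gradients):
    """Average gradients over the fast in-building scale-out fabric."""
    return np.mean(gradients, axis=0)

def cross_dc_reduce(site_results):
    """Exchange only the reduced tensors over the slow, long-haul DCI."""
    return np.mean(site_results, axis=0)

# Eight workers per site, each holding a 1M-parameter gradient
dc_a = [np.random.randn(1_000_000) for _ in range(8)]
dc_b = [np.random.randn(1_000_000) for _ in range(8)]

# Step 1: traffic for these reductions stays inside each site
summary_a, summary_b = local_reduce(dc_a), local_reduce(dc_b)

# Step 2: two tensors cross the DCI instead of sixteen
global_grad = cross_dc_reduce([summary_a, summary_b])
```

In this toy setup, local reduction cuts cross-datacenter traffic by a factor of eight, which is the kind of structuring Shainer is describing.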
"The way that you run the job determines how much bandwidth you need between the datacenters," he added. "It could be 10 percent of the total [scale-out network] bandwidth… It depends on how you structure the network."

Practical realities