Feature: The popularity of generative AI models has exploded over the past two years, and keeping pace requires ever-larger quantities of accelerators.
Without a breakthrough in machine learning, and with power becoming a limiting factor, AI's growth may ultimately depend on a new kind of supercomputer, one that spans entire nations and possibly even continents.
The idea is simple enough: if it's no longer feasible to build bigger datacenters, you can start combining the ones you already have, and that's where the industry appears to be heading. "Distribution is inevitable," Dell'Oro analyst Sameh Boujelbene told The Register.
She's not the only one who thinks so. Gilad Shainer, senior veep of networking at Nvidia, believes that "in the coming generation you will see the ability to actually build those remote datacenters together and form a large, virtual, single datacenter."
Distributing large workloads across multiple machines is nothing new in high-performance computing. It's how all modern supercomputers, AI and scientific alike, work: thousands of nodes stitched together with high-speed interconnects like Nvidia's InfiniBand or HPE's Slingshot.
In many ways, distributing workloads between multiple datacenters is a continuation of that model, but it comes with its own unique challenges.
The good news is that the infrastructure needed to connect datacenters already exists, at least in part. Cloud providers have long used high-speed datacenter interconnects (DCI); the technology is nothing new.
For traditional scientific workloads, Nvidia's Mellanox unit has offered its MetroX product line, which uses dense wavelength division multiplexing (DWDM) to bridge InfiniBand fabrics across datacenters up to 40 kilometers apart.
Unfortunately, the latest generation was released just weeks before ChatGPT ignited the AI gold rush, and it was designed for disaster recovery and high availability, not the large-scale AI training that has taken off in the years since the chatbot's launch.
Shainer says research is underway to extend the range of this kind of technology from tens to thousands of kilometers, which would help solve the power challenge by allowing datacenters in different regions to work together. But the sheer distances involved and the nature of AI workloads present challenges of their own.
Balancing bandwidth and latency
AI workloads, in general, love bandwidth but hate latency. Inside the datacenter, the biggest challenge is packet loss or stalled connections, which leave compute sitting idle while data is retransmitted. According to AMD, 30 percent of training time is spent waiting on the network.
Several technologies have been developed to overcome this. Nvidia's InfiniBand is one solution, while specialized data processing units (DPUs) and AI-optimized switches tackle the same challenges on Ethernet.
When talking about datacenter-to-datacenter networks, latency is an unavoidable fact of life. Light can only travel through glass fiber so fast: roughly 4.9 microseconds per kilometer. That's quick, but over a span of 1,000 kilometers it adds up to a round trip of about 10 milliseconds before you factor in protocol and processing overheads, and retransmissions become all the more costly over such distances.
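For a rough sense of the arithmetic, here's a back-of-the-envelope calculation in Python using the figures above; real links would add protocol, processing, and retransmission overheads on top.

```python
# Fiber propagation delay, using the ~4.9 us/km figure cited above.
US_PER_KM = 4.9        # speed of light through glass fiber
DISTANCE_KM = 1_000    # example span between two datacenters

one_way_ms = DISTANCE_KM * US_PER_KM / 1_000
round_trip_ms = 2 * one_way_ms

print(f"One way:    {one_way_ms:.1f} ms")     # 4.9 ms
print(f"Round trip: {round_trip_ms:.1f} ms")  # ~9.8 ms before overheads
```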
Depending on the bandwidth and distances involved, repeaters and amplifiers may be needed to boost the signal, which can exacerbate the issue. Rodney Wilson, chief technologist for research networks at optics vendor Ciena, told El Reg that new technologies on the horizon could help address the problem. Hollow core fiber is one of them: light travels faster through its air-filled core than through solid glass, and fewer repeaters are needed along the way, both of which cut latency. But hollow core fiber is still relatively new, and there's plenty of dark fiber already in the ground.
The problem isn't just latency; it's also bandwidth. The scale-out networks used to connect GPU servers inside the datacenter typically give each server eight 400Gbps links, one per GPU, for a total of 3.2Tbps. Extending that scale-out network over the DCI would require multiple petabits of aggregate bandwidth. Wilson says modern optics in carrier networks can now support up to 1.6Tbps per wavelength, but even with multiple wavelengths per fiber, you're looking at a pretty hefty bundle.
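To put those figures in context, here's a quick sketch of the math; the 1,000-server cluster size is an assumption chosen for illustration, not a number from Wilson or Nvidia.

```python
# Rough aggregate-bandwidth estimate for stretching a GPU scale-out
# network over a DCI. The cluster size is hypothetical.
GPUS_PER_SERVER = 8
LINK_GBPS = 400
SERVERS = 1_000          # assumed cluster size, for illustration only
WAVELENGTH_TBPS = 1.6    # per-wavelength ceiling Wilson cites

server_tbps = GPUS_PER_SERVER * LINK_GBPS / 1_000   # 3.2 Tbps per server
aggregate_pbps = SERVERS * server_tbps / 1_000      # 3.2 Pbps in total
wavelengths = SERVERS * server_tbps / WAVELENGTH_TBPS

print(f"Aggregate: {aggregate_pbps} Pbps")
print(f"Wavelengths at 1.6 Tbps each: {wavelengths:,.0f}")  # 2,000
```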
Shainer says software optimization can mitigate many of these latency and bandwidth issues. Depending on how workloads are distributed across datacenters, it's possible to minimize the bandwidth required while hiding the latency.
For example, if you wanted to run a training workload across two physically distant clusters, you'd distribute the work so that the heavy calculation completes locally within each datacenter, and only send data across the datacenter interconnect when it's time to combine the results.
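As a minimal sketch of that idea, the toy Python below simulates hierarchical gradient aggregation, assuming two sites with eight workers each; the function names and sizes are illustrative, not any vendor's API.

```python
# Toy hierarchical reduction: reduce inside each datacenter first, so
# only one already-averaged tensor per site crosses the interconnect.
import numpy as np

def local_reduce(gradients):
    """Average gradients over the fast in-building scale-out fabric."""
    return np.mean(gradients, axis=0)

def cross_dc_reduce(site_results):
    """Exchange only the reduced tensors over the slow, long-haul DCI."""
    return np.mean(site_results, axis=0)

# Eight workers per site, each holding a 1M-parameter gradient
dc_a = [np.random.randn(1_000_000) for _ in range(8)]
dc_b = [np.random.randn(1_000_000) for _ in range(8)]

# Step 1: traffic for these reductions stays inside each site
summary_a, summary_b = local_reduce(dc_a), local_reduce(dc_b)

# Step 2: two tensors cross the DCI instead of sixteen
global_grad = cross_dc_reduce([summary_a, summary_b])
```

In this toy setup, local reduction cuts cross-datacenter traffic by a factor of eight, which is the kind of structuring Shainer is describing.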
"The way that you run the job determines how much bandwidth you need between the datacenters," he added. "It could be 10 percent of the total [scale-out network] bandwidth… It depends on how you structure the network."

Practical realities