DDN wants to use its high-performance computing experience to become a leader in storage for artificial intelligence. It has been in the HPC sector for years and now has $300m from a fund that is focused on AI.
DDN made its name in high-performance computing (HPC), and it has now received $300m in investment from US fund Blackstone. According to DDN, the money will be used to translate its leadership in supercomputing into storage solutions for artificial intelligence (AI).
While arrays designed for either workload must keep up with high-performance processors, the two differ. HPC workloads read relatively few mathematical formulations in order to produce huge amounts of simulation data.
AI is the opposite: a large amount of data must be read in order to produce a small model during training, or to respond to an application's or a human's prompt during inference.
DDN's EXAScaler, adapted to AI
DDN launched its EXAScaler arrays in the HPC market. They use the Lustre parallel file system, which appeared around 20 years ago. An EXAScaler array is made up of a number of storage nodes, one of which serves as an index for the contents of all the others. Compute nodes query that index node to determine which of the other nodes to read or write data blocks from, then communicate directly with that node.
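To make that division of labour concrete, here is a minimal sketch in Python of the Lustre-style read path described above. The node names and data layout are invented for illustration; this is not DDN's implementation.

    # Hypothetical stand-ins for the index (metadata) node and the data nodes.
    METADATA_NODE = {
        # file -> ordered list of (data node, stripe id)
        "training.dat": [("oss-1", 0), ("oss-2", 1), ("oss-1", 2)],
    }
    OBJECT_NODES = {
        "oss-1": {("training.dat", 0): b"block-0", ("training.dat", 2): b"block-2"},
        "oss-2": {("training.dat", 1): b"block-1"},
    }

    def read_file(path: str) -> bytes:
        """One query to the index node, then direct I/O with each data node."""
        layout = METADATA_NODE[path]  # single round-trip to the index node
        return b"".join(OBJECT_NODES[node][(path, stripe)]  # direct node reads
                        for node, stripe in layout)

    print(read_file("training.dat"))  # b'block-0block-1block-2'

The point of the design is that the index node is consulted once per file, after which reads and writes scale with the number of data nodes.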
For the compute nodes to take part, they must run a Lustre client and have a network connection to all the storage servers. In practice, this usually means an InfiniBand network, with no packet loss and the ability for controllers to copy data directly into the RAM, or the NVMe storage, of the host machines (RDMA).
DDN's AI400X2 arrays, designed for AI workloads, retain this functionality. They use the same 2U nodes as EXAScaler but add Nvidia Spectrum-X Ethernet cards. These cards, based on Nvidia's BlueField DPUs, bring Ethernet networks the same benefits as InfiniBand: no packet loss, RDMA over Converged Ethernet (RoCE) when writing data, and direct access to the memory of Nvidia graphics processing units (GPUs) via GPUDirect.
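For readers unfamiliar with GPUDirect, the sketch below shows what a direct storage-to-GPU read looks like from application code. It uses the open-source kvikio wrapper around Nvidia's cuFile (GPUDirect Storage) API; the mount path is a hypothetical placeholder, and on systems without GPUDirect support kvikio falls back to a compatibility mode.

    import cupy
    import kvikio

    # Destination buffer allocated in GPU memory.
    buf = cupy.empty(1024 * 1024, dtype=cupy.uint8)

    # DMA from the storage array straight into GPU memory, bypassing host RAM.
    with kvikio.CuFile("/mnt/exascaler/shard-000.bin", "r") as f:  # hypothetical path
        nbytes = f.read(buf)

    print(f"read {nbytes} bytes directly into GPU memory")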
DDN storage for training data
The AI400X2 arrays are designed to communicate with GPUs as quickly as possible during training workloads. That makes them a potentially expensive option for storing the huge quantities of data from which models are trained. For that purpose, DDN has offered its Infinia arrays since 2023. These are S3 object storage systems to which drives can be added without disruption.
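Since Infinia speaks the standard S3 protocol, any S3 client can use it. Below is a minimal sketch with the boto3 library; the endpoint URL, bucket, key, and credentials are hypothetical placeholders, not DDN defaults.

    import boto3

    s3 = boto3.client(
        "s3",
        endpoint_url="https://infinia.example.local:9000",  # hypothetical endpoint
        aws_access_key_id="ACCESS_KEY",                     # placeholder credentials
        aws_secret_access_key="SECRET_KEY",
    )

    # Archive a trained model's weights, then read them back.
    s3.put_object(Bucket="models", Key="llm-v1/weights.bin", Body=b"\x00" * 16)
    obj = s3.get_object(Bucket="models", Key="llm-v1/weights.bin")
    print(len(obj["Body"].read()), "bytes retrieved")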
DDN has decomposed the functions of S3 storage into containers: metadata servers, storage servers, and so on. By deploying certain S3 containers on the compute nodes themselves, DDN can replicate in Infinia functionality similar to Lustre's. Spectrum-X cards can also be added to Infinia arrays to increase transfer speeds.
DDN says it knows better than anyone how storage behaves under intensive workloads. Consistency problems can occur when GPUs write data simultaneously and then read it back shortly afterwards. Checkpointing is a way to regulate this, but the operation is resource-intensive and produces no useful data. DDN says it can avoid these delays by carefully managing data flows and by using caching.
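To illustrate why checkpointing costs time without producing useful data, here is a minimal PyTorch sketch, with a stand-in model and a hypothetical storage path: training pauses while the full model and optimiser state are serialised to the array.

    import torch

    model = torch.nn.Linear(4096, 4096)               # stand-in for a large model
    optimizer = torch.optim.AdamW(model.parameters())

    def checkpoint(step: int) -> None:
        # GPUs sit idle until this write completes; these are the delays DDN
        # says it hides with data-flow management and caching.
        torch.save(
            {"step": step,
             "model": model.state_dict(),
             "optimizer": optimizer.state_dict()},
            f"/mnt/checkpoints/step-{step:07d}.pt",   # hypothetical path
        )

    checkpoint(step=1000)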
DDN is preparing a big announcement
DDN is already established in AI: one of its customers, Elon Musk's xAI, has deployed a Colossus supercomputer with 100,000 H100 graphics processors. The purpose of the $300m therefore isn't entirely clear.
Blackstone has positioned itself in a number of AI-focused businesses, and now has a member on the DDN board. Last year, the fund provided financial backing to CoreWeave, a provider of AI-focused infrastructure as a service.
DDN promises a major announcement on 20 February, trailed with the slogan "We're Making AI Real."
By: Adam Armstrong