After a year-long collaboration, Zyphra, AMD, and IBM have demonstrated that AMD’s GPUs and platform can effectively support the training of large-scale AI models, culminating in the creation of ZAYA1.
Jointly developed, ZAYA1 is billed as the first large-scale Mixture-of-Experts (MoE) foundation model trained entirely on AMD GPUs and networking technology. The result challenges the prevailing notion that NVIDIA is the only viable option for scaling AI workloads.
The training leveraged AMD’s MI300X GPUs, Pensando networking hardware, and ROCm software stack, all deployed on IBM Cloud’s infrastructure. Remarkably, the system architecture mirrors that of a typical enterprise cluster, eschewing experimental hardware and complex configurations, and, notably, it contains no NVIDIA components.
Cost-Effective AI Training with AMD GPUs Without Sacrificing Performance
When budgeting for AI training, organizations prioritize memory capacity, communication bandwidth, and consistent iteration times over raw throughput. AMD’s MI300X GPUs, each equipped with 192GB of high-bandwidth memory, provide ample headroom, enabling initial training phases without immediately resorting to extensive parallelism. This flexibility simplifies tuning and reduces project fragility.
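To see why 192GB of HBM per GPU provides that headroom, a back-of-envelope estimate helps. The sketch below assumes bf16 weights and gradients plus fp32 Adam-style optimizer state (a common but illustrative layout; the source does not specify ZAYA1's exact memory budget) and checks whether full model state for an 8.3-billion-parameter model fits within a single eight-GPU MI300X node:

```python
# Back-of-envelope memory estimate: does full model state for an
# 8.3B-parameter model fit on one 8-GPU MI300X node without model
# parallelism? Byte counts are illustrative assumptions.
PARAMS = 8.3e9
BYTES_WEIGHTS = 2          # bf16 weights
BYTES_GRADS = 2            # bf16 gradients
BYTES_OPT = 4 + 4 + 4      # fp32 master weights + two fp32 moments

per_param = BYTES_WEIGHTS + BYTES_GRADS + BYTES_OPT      # 16 bytes/param
total_gb = PARAMS * per_param / 1e9                      # ~133 GB
node_hbm_gb = 8 * 192                                    # 1536 GB per node

print(f"model state: {total_gb:.0f} GB vs node HBM: {node_hbm_gb} GB")
assert total_gb < node_hbm_gb  # fits with room for activations
```

Under these assumptions the full optimizer state occupies well under a tenth of a node's HBM, which is the headroom that lets early training phases skip extensive parallelism.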
Zyphra’s cluster design features nodes with eight MI300X GPUs interconnected via InfinityFabric, each paired with a dedicated Pollara network card. A separate network manages dataset access and checkpointing. This straightforward setup minimizes switch costs and stabilizes iteration times by reducing network complexity.
ZAYA1: A Compact AI Model Delivering Competitive Results
The ZAYA1-base model activates 760 million parameters out of a total of 8.3 billion and was trained on an extensive dataset of 12 trillion tokens through a three-phase process. Its architecture incorporates compressed attention mechanisms, an optimized routing system to direct tokens to specialized experts, and refined residual scaling to maintain stability in deeper layers.
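The routing idea at the heart of an MoE layer can be sketched in a few lines. The snippet below shows standard top-k gating, where each token is sent to its k highest-scoring experts and the gate weights are renormalized; it illustrates the general mechanism, not ZAYA1's specific (optimized) router:

```python
import math
import random

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def route_token(logits, k=2):
    """Top-k gating sketch: pick the k highest-scoring experts for one
    token and renormalize their gate weights so they sum to 1."""
    topk = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    gates = softmax([logits[i] for i in topk])
    return list(zip(topk, gates))

random.seed(0)
logits = [random.gauss(0, 1) for _ in range(8)]  # router scores for 8 experts
print(route_token(logits))  # [(expert_id, gate_weight), ...]
```

Only the selected experts run their feed-forward computation for that token, which is why an MoE model can hold 8.3B parameters while activating only 760M per token.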
Training employed a hybrid optimizer approach combining Muon and AdamW. To optimize Muon for AMD hardware, Zyphra fused computational kernels and minimized unnecessary memory traffic, preventing the optimizer from becoming a bottleneck. Batch sizes were progressively increased, contingent on high-throughput storage pipelines capable of delivering tokens efficiently.
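A hybrid Muon/AdamW setup typically splits parameters by shape: Muon is designed for 2-D weight matrices, while biases, norms, and embeddings commonly stay on AdamW. The grouping rule below is a conventional sketch of that split, not Zyphra's confirmed configuration:

```python
def assign_optimizer(name, ndim):
    """Sketch of hybrid-optimizer grouping: 2-D weight matrices go to
    Muon; biases, norms, and embeddings go to AdamW. The exact rule
    used for ZAYA1 is an assumption here."""
    if ndim == 2 and "embed" not in name:
        return "muon"
    return "adamw"

# Hypothetical parameter names and dimensionalities for illustration.
params = [
    ("layers.0.attn.wq", 2),
    ("layers.0.norm.weight", 1),
    ("tok_embed.weight", 2),
    ("layers.0.mlp.w1", 2),
]
groups = {name: assign_optimizer(name, ndim) for name, ndim in params}
print(groups)
```

Partitioning this way keeps Muon's matrix-orthogonalization updates on the tensors they were designed for while AdamW handles everything else.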
Despite its relatively modest size, ZAYA1 competes with larger models such as Qwen3-4B, Gemma3-12B, Llama-3-8B, and OLMoE. The MoE architecture’s advantage lies in activating only a fraction of the model at any time, reducing inference memory demands and computational overhead.
For instance, a financial institution could develop a specialized investigative AI model without complex parallelism in early stages. The MI300X’s generous memory capacity facilitates iterative development, while ZAYA1’s compressed attention reduces evaluation latency.
Adapting ROCm for Optimal AMD GPU Performance
Transitioning from a mature NVIDIA-based workflow to AMD’s ROCm platform required deliberate optimization. Rather than a direct port, Zyphra’s team analyzed AMD hardware behavior and adjusted model parameters, GEMM (General Matrix Multiply) patterns, and microbatch sizes to align with MI300X’s optimal compute characteristics.
InfinityFabric achieves peak efficiency when all eight GPUs in a node participate in collective operations, while Pollara networking performs best with larger message sizes. Accordingly, fusion buffers were sized to maximize throughput. Long-context training, spanning 4,000 to 32,000 tokens, utilized ring attention for sharded sequences and tree attention during decoding to prevent bottlenecks.
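Sizing fusion buffers for large messages amounts to packing many small per-tensor gradients into a few big collective operations. The greedy bucketing sketch below illustrates the idea; the cap value and packing policy are assumptions, since real tuning targets the network's measured sweet spot:

```python
def bucket_grads(sizes_bytes, bucket_cap):
    """Greedily pack per-tensor gradient sizes into fusion buckets so
    each all-reduce message approaches bucket_cap bytes. Returns lists
    of tensor indices, one list per bucket (illustrative sketch)."""
    buckets, cur, cur_bytes = [], [], 0
    for i, size in enumerate(sizes_bytes):
        if cur and cur_bytes + size > bucket_cap:
            buckets.append(cur)          # flush the full bucket
            cur, cur_bytes = [], 0
        cur.append(i)
        cur_bytes += size
    if cur:
        buckets.append(cur)
    return buckets

# Five gradients (bytes) packed against a hypothetical 100-byte cap.
print(bucket_grads([40, 70, 30, 90, 10], bucket_cap=100))
```

Fewer, larger messages keep networks like Pollara near peak bandwidth instead of paying per-message latency on thousands of tiny transfers.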
Storage strategies were tailored to workload demands: smaller models require high IOPS, whereas larger models depend on sustained bandwidth. Dataset shards were consolidated to minimize scattered reads, and per-node page caches were expanded to accelerate checkpoint recovery, which is critical for lengthy training runs where rollbacks are common.
Ensuring Cluster Stability During Extended Training
Long-duration training jobs often encounter hardware and network hiccups. Zyphra’s Aegis monitoring service continuously analyzes logs and system metrics to detect issues such as NIC errors or ECC memory faults, automatically initiating corrective measures. Additionally, RCCL timeouts were extended to prevent transient network interruptions from terminating entire jobs.
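The general pattern behind a service like Aegis is to match log streams against known fault signatures and map each hit to a corrective action. The sketch below illustrates that triage loop; the signatures and action names are assumptions, not Aegis's actual rules:

```python
import re

# Illustrative fault signatures mapped to corrective actions, in the
# spirit of an automated health-check service. Patterns and actions
# here are assumptions for demonstration only.
FAULT_RULES = [
    (re.compile(r"ECC (?:un)?correctable error"), "cordon_node"),
    (re.compile(r"NIC link (?:down|flap)"), "restart_nic"),
]

def triage(log_lines):
    """Scan log lines for known fault signatures and return the
    (action, matching_line) pairs to execute."""
    actions = []
    for line in log_lines:
        for pattern, action in FAULT_RULES:
            if pattern.search(line):
                actions.append((action, line))
    return actions

sample = [
    "08:12:01 node07 NIC link down on eth2",
    "08:12:02 node07 heartbeat ok",
    "08:12:09 node12 ECC uncorrectable error on GPU3",
]
print(triage(sample))
```

Acting on these signals automatically, rather than just logging them, is what preserves GPU compute time across week-long runs.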
Checkpointing is distributed across all GPUs, avoiding bottlenecks associated with centralized saving. This approach achieves checkpoint speeds over ten times faster than naive methods, enhancing cluster uptime and reducing operator intervention.
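The core idea of distributed checkpointing is that every rank writes only its own shard in parallel, instead of funneling the full model state through a single writer. The sketch below simulates that layout with local files; the shard naming and JSON format are illustrative, not Zyphra's on-disk format:

```python
import json
import pathlib
import tempfile

def save_sharded(state_per_rank, out_dir):
    """Each 'rank' writes only its own shard, avoiding the rank-0
    bottleneck of centralized saving (sketch; file layout assumed)."""
    out = pathlib.Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    for rank, state in enumerate(state_per_rank):
        (out / f"shard_{rank:05d}.json").write_text(json.dumps(state))
    return sorted(p.name for p in out.iterdir())

with tempfile.TemporaryDirectory() as d:
    names = save_sharded([{"w": [0.1]}, {"w": [0.2]}], d)
    print(names)  # ['shard_00000.json', 'shard_00001.json']
```

Because every writer works concurrently, aggregate checkpoint bandwidth scales with cluster size, which is where the order-of-magnitude speedup over naive centralized saving comes from.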
Implications of ZAYA1’s AMD-Based Training for AI Infrastructure
The project highlights clear parallels between the NVIDIA and AMD ecosystems: NVLink versus InfinityFabric, NCCL versus RCCL, cuBLASLt versus hipBLASLt, among others. The findings suggest that AMD’s software and hardware stack has matured sufficiently to support large-scale AI model development.
This does not imply that enterprises should immediately replace existing NVIDIA clusters. A pragmatic strategy involves maintaining NVIDIA for production workloads while leveraging AMD’s MI300X GPUs and ROCm platform for training phases that benefit from larger memory capacity and open software. This diversification mitigates supplier risk and expands training throughput without significant disruption.
Key recommendations emerging from this work include treating model architecture as adaptable rather than fixed, designing networks around actual collective communication patterns, implementing fault tolerance that preserves GPU compute time rather than merely logging errors, and modernizing checkpointing to maintain training momentum.
These insights offer a practical framework for organizations aiming to scale AI capabilities beyond reliance on a single vendor, providing a viable alternative path for large-scale AI training.