People might try to lock up GPU resources for an entire day even when they do not need them, but Nvidia's KAI Scheduler aims to put a stop to that.
Nvidia's KAI Scheduler is a Kubernetes-native graphics processing unit (GPU) scheduling tool that has been made open source under the Apache 2.0 licence.
KAI Scheduler, part of the Nvidia Run:ai platform, is designed to manage AI workloads on GPUs as well as central processing units (CPUs). According to Nvidia, KAI Scheduler can manage fluctuating GPU demand, reduce wait times for access to compute, and offer resource guarantees and allocation.
The KAI Scheduler GitHub repository states that it supports the entire AI lifecycle, from small interactive jobs requiring minimal resources to large-scale training and inference, all on the same cluster. Nvidia claims it delivers resource fairness alongside optimal resource allocation. The tool, which can run alongside other schedulers on a Kubernetes cluster, allows administrators to dynamically assign GPU resources to workloads.
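Running an alternative scheduler alongside the default one is a standard Kubernetes mechanism: a pod opts in via the `spec.schedulerName` field. A minimal sketch, assuming the scheduler is installed under the name `kai-scheduler` and that queues are referenced through a pod label (the exact scheduler name and label key should be checked against the project's quickstart; the image name is a placeholder):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-job
  labels:
    # Hypothetical queue label; the real key is defined by the KAI install
    kai.scheduler/queue: team-a
spec:
  schedulerName: kai-scheduler   # hand this pod to KAI instead of the default scheduler
  containers:
    - name: train
      image: my-training-image   # placeholder
      resources:
        limits:
          nvidia.com/gpu: 1      # request one GPU via the Nvidia device plugin
```

Pods without the `schedulerName` field continue to be placed by the default Kubernetes scheduler, which is what allows both to coexist on one cluster.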
Ronen Dar, vice president of software systems at Nvidia, and Ekin Karabulut, an Nvidia data scientist, wrote in a blog post that a user might need only one GPU for interactive tasks such as data exploration, but then require multiple GPUs for distributed training or several experiments running in parallel. "Traditional schedulers have a hard time dealing with such variability," they wrote.
According to them, KAI Scheduler continuously recalculates fair-share values and adjusts quotas and limits in real time, automatically matching them to current workload demands. This dynamic approach, according to Dar and Karabulut, ensures efficient GPU allocation without constant administrative intervention.
The scheduler combines what they call "gang scheduling", GPU sharing and a hierarchical queueing system that allows users to submit batches of jobs, reducing wait times for machine learning engineers. Dar and Karabulut said the jobs are launched in accordance with priorities and fairness as soon as resources become available.
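The hierarchical queueing described above is typically expressed as queue objects with parent/child relationships and per-queue quotas. A purely illustrative fragment, with hypothetical field names that are not the real CRD schema (consult the KAI Scheduler repository for the actual Queue resource definition):

```yaml
# Hypothetical queue hierarchy: a department queue with a team sub-queue.
# Field names are illustrative, not the project's actual schema.
kind: Queue
metadata:
  name: research           # parent (department-level) queue
spec:
  quota:
    gpu: 16                # total GPUs guaranteed to the department
---
kind: Queue
metadata:
  name: team-a
spec:
  parentQueue: research    # child queue draws from the parent's quota
  quota:
    gpu: 8                 # guaranteed share; idle capacity may be borrowed
```

The point of the hierarchy is that fairness is enforced at each level: teams compete within their department's allocation, and departments compete within the cluster.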
Dar and Karabulut wrote that KAI Scheduler uses bin packing and consolidation to optimise for fluctuating GPU and CPU demand. They said this maximises compute utilisation by combating resource fragmentation, packing smaller tasks onto partially used GPUs and CPUs.
They added that it also addresses node-level fragmentation by reallocating tasks across nodes. KAI Scheduler uses a second technique, spreading workloads across nodes, GPUs and CPUs, to reduce per-node load and maximise resource availability for each workload.
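The two placement strategies pull in opposite directions, and a tiny sketch makes the contrast concrete. This is illustrative pseudologic, not KAI Scheduler code: bin packing fills the fullest node that still fits a job (fighting fragmentation), while spreading targets the emptiest node (keeping per-node load low).

```python
# Illustrative sketch of the two placement strategies the article describes.
# `nodes` maps node name -> free GPUs; `demand` is the GPUs a job needs.

def place_binpack(nodes, demand):
    """Bin packing: pick the node with the LEAST free GPUs that still fits."""
    candidates = [n for n in nodes if nodes[n] >= demand]
    return min(candidates, key=lambda n: nodes[n]) if candidates else None

def place_spread(nodes, demand):
    """Spreading: pick the node with the MOST free GPUs to balance load."""
    candidates = [n for n in nodes if nodes[n] >= demand]
    return max(candidates, key=lambda n: nodes[n]) if candidates else None

free_gpus = {"node-a": 1, "node-b": 3, "node-c": 8}
print(place_binpack(free_gpus, 1))  # node-a: tops up the nearly full node
print(place_spread(free_gpus, 1))   # node-c: targets the emptiest node
```

Bin packing leaves node-c's eight GPUs untouched for a future large job, which is exactly the fragmentation argument Dar and Karabulut make.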
Nvidia also said KAI Scheduler addresses a common problem in shared clusters. Dar and Karabulut said some researchers reserve more GPUs than they need early in the day to guarantee availability later. They said this practice can lead to under-utilisation of resources, even while other teams still have unused quotas.
Nvidia says KAI Scheduler tackles this by enforcing resource guarantees. Dar and Karabulut said this approach promotes cluster efficiency and prevents resource hoarding.
KAI Scheduler also offers what Nvidia calls a built-in PodGrouper, which automatically detects and connects with tools and frameworks such as Kubeflow and Argo, as well as the Training Operator. This, it claims, reduces configuration complexity and helps speed up development.
By Sean Kerner.