
Call for Collaboration on Balancing Power Demand in AI Training
Researchers from Microsoft, Nvidia, and OpenAI have called on software, hardware, and infrastructure designers, along with utilities, to collaborate on ways to normalize power demand during AI training.
Nearly 60 scientists from the three firms have co-authored a paper addressing the power-management challenges of AI training workloads. Their concern is that AI training's fluctuating power demand strains the electrical grid's ability to handle variable loads.
The paper, titled 'Power Stabilization for AI Training Datacenters,' argues that the oscillation in power demand between the power-intensive GPU compute phase and the less taxing communication phase of each training step represents a barrier to AI model development.
The authors note the extreme difference in power consumption between the two phases: the compute phase approaches the GPU's thermal limits, while the communication phase draws close to idle power.
This variation in power demand originates at the individual node (server) level, but because nodes across the data center swing through the same phases in unison, it becomes visible at the rack, data center, and power grid levels. The effect is akin to 50,000 hairdryers (roughly 2,000 watts each), about 100 MW of load, being switched on simultaneously.
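For a sense of scale, here is a back-of-the-envelope sketch in Python (not from the paper; the node count and per-node figures are illustrative assumptions) of how synchronized compute and communication phases across tens of thousands of nodes add up to facility-scale swings:

# Illustrative numbers only: shows how per-node phase swings aggregate
# into grid-scale power oscillations when all nodes move in unison.
NUM_NODES = 50_000            # assumed number of synchronized nodes
COMPUTE_POWER_W = 2_000       # assumed per-node draw near thermal limits
COMM_POWER_W = 200            # assumed per-node draw during communication

def aggregate_power_mw(per_node_watts: int) -> float:
    """Total facility draw in megawatts when every node is in the same phase."""
    return NUM_NODES * per_node_watts / 1e6

peak = aggregate_power_mw(COMPUTE_POWER_W)    # ~100 MW during compute
trough = aggregate_power_mw(COMM_POWER_W)     # ~10 MW during communication
print(f"Swing of ~{peak - trough:.0f} MW every training step")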
The researchers have evaluated software-based approaches, GPU-level firmware features, and data center-level battery energy storage systems, arguing that an optimal solution involves a combination of all three.
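To make the software angle concrete, here is a minimal sketch of one such technique: keeping GPUs busy with low-priority filler computation while an asynchronous gradient all-reduce is in flight, so per-node power does not collapse to near idle during the communication phase. This is a simplified illustration of the general idea, not the authors' implementation; the function name, filler tensors, and minimum-iteration floor are assumptions.

import torch
import torch.distributed as dist

def all_reduce_with_filler(grads, filler_a, filler_b, min_iters=4):
    """Launch an asynchronous all-reduce and burn a controlled amount of
    GPU power with throwaway matrix multiplies until it completes."""
    work = dist.all_reduce(grads, async_op=True)
    iters = 0
    while not work.is_completed() or iters < min_iters:
        # The result is discarded; the matmul exists only to keep the
        # GPU's power draw elevated during the communication phase.
        torch.matmul(filler_a, filler_b)
        iters += 1
    work.wait()

# Usage (assuming torch.distributed is already initialized and the
# filler tensors are pre-allocated on the GPU):
#   all_reduce_with_filler(gradients, filler_a, filler_b)

Filler work of this kind trades extra energy for a flatter power profile; the paper argues such software techniques work best in combination with firmware features and battery storage rather than on their own.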
They urge AI framework and system designers to focus on asynchronous and power-aware training algorithms, and call on utility and grid operators to share resonance and ramp specifications.