Watt's Really Holding AI Back

Our industry has a dim secret: the AI bottleneck isn't compute anymore. It's electricity.
Companies are now choosing where to build data centers based on grid access, not talent pools. A single rack of next-gen GPUs pulls close to a megawatt. And agentic AI, aka what everyone's racing to deploy, can consume 100x the energy of a chatbot.
The industry isn't building toward a power crisis. It's already in one.
That's the conversation Deep Geeks Ep. 1 rips open. Host Dr. Serena Huang sits down with Dasha Mukhortova, Head of Sustainability at Nebius, and Val Bercovici, Chief AI Officer at WEKA — not for a polite sustainability panel, but for an engineering-level dissection of what actually makes AI efficient and what's quietly wasting watts at scale.
The numbers that should make you uncomfortable
You're probably used to hearing about GPU clusters and training runs. But here's the scale Val puts into perspective: a single rack of the latest-generation GPUs consumes close to a megawatt of power. That's not a data center… that's one rack. And the big cloud and frontier lab deals? They're not announced in dollars or FLOPS anymore. They're announced in gigawatts. Each one roughly the output of a nuclear power plant.
Now layer on agents. Val breaks down the math: going from chat to reasoning is a 10x jump in energy consumption. From reasoning to agents? Another 10x. That's a 100x multiplier over today's chatbot workloads. Every inefficiency in your stack gets amplified a hundredfold.
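To make that compounding concrete, here is a minimal sketch of the arithmetic. The two 10x factors are the ones Val quotes; the 1 Wh chat baseline and the waste fraction are hypothetical placeholders, not figures from the episode.

```python
# Multipliers quoted in the episode.
CHAT_TO_REASONING = 10
REASONING_TO_AGENTS = 10


def relative_energy(stage: str) -> int:
    """Energy per task relative to a chat request (chat = 1)."""
    factors = {
        "chat": 1,
        "reasoning": CHAT_TO_REASONING,
        "agents": CHAT_TO_REASONING * REASONING_TO_AGENTS,
    }
    return factors[stage]


def wasted_energy_wh(stage: str, waste_fraction: float,
                     chat_energy_wh: float = 1.0) -> float:
    """Energy lost to inefficiency per task, in watt-hours.

    chat_energy_wh and waste_fraction are illustrative assumptions.
    """
    return chat_energy_wh * relative_energy(stage) * waste_fraction


if __name__ == "__main__":
    for stage in ("chat", "reasoning", "agents"):
        print(stage, relative_energy(stage), wasted_energy_wh(stage, 0.2))
```

The same waste fraction costs a hundred times more energy per task at the agent stage than at the chat stage, which is the whole point of the 100x figure.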
Software is the efficiency lever you're underestimating
It's easy to fixate on chips and cooling. But Dasha makes a point that reframes the whole conversation: the software layer — how workloads are orchestrated, how failures are caught, how clusters scale — is where the biggest efficiency gains hide.
Watch: Why software optimization matters more than hardware alone (~1:00)
In this clip, Dasha explains why idling GPUs are an efficiency killer, and Val introduces "goodput" — the metric that reveals whether your busy system is actually producing useful output, or just burning watts.
Idling draws massive power, especially with GPUs. Nebius tackles this with autoscaling that matches bursty AI workloads to right-sized clusters in real time. And WEKA's focus on goodput — useful output per watt, not just raw utilization — shows why a "busy" system and a "productive" system aren't the same thing.
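One way to see the gap between "busy" and "productive" is to compute both numbers side by side. The sketch below is an illustration under our own assumptions, not WEKA's or Nebius's definition: the class name, the fields, and every number are hypothetical, and it measures useful tokens per kilowatt-hour because energy is what a meter actually records.

```python
from dataclasses import dataclass


@dataclass
class ClusterSample:
    """Hypothetical measurements for one cluster over one time window."""
    gpu_hours_total: float   # GPU-hours provisioned
    gpu_hours_busy: float    # GPU-hours spent executing work
    energy_kwh: float        # energy drawn in the window
    tokens_produced: int     # all tokens generated
    tokens_discarded: int    # tokens thrown away (retries, recomputation)

    @property
    def utilization(self) -> float:
        """Share of provisioned GPU time that was busy."""
        return self.gpu_hours_busy / self.gpu_hours_total

    @property
    def goodput_per_kwh(self) -> float:
        """Useful tokens delivered per kWh consumed."""
        useful = self.tokens_produced - self.tokens_discarded
        return useful / self.energy_kwh


# Two made-up clusters with identical utilization and energy draw.
busy = ClusterSample(100, 90, 50.0, 1_000_000, 400_000)
productive = ClusterSample(100, 90, 50.0, 1_000_000, 50_000)

if __name__ == "__main__":
    for name, c in (("busy", busy), ("productive", productive)):
        print(name, c.utilization, c.goodput_per_kwh)
```

Both clusters report 90% utilization, yet the second delivers far more useful output for the same energy. Utilization alone cannot tell them apart.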
AI storage changes the efficiency equation
If we may... here is where it gets interesting. Dasha and Val both land on a point that doesn't get enough airtime: storage has to be purpose-built for AI workloads, not retrofitted from traditional infrastructure.
Watch: Why AI-tailored storage and memory tiering matter (~1:30)
Dasha breaks down the difference between active and cold storage for AI pipelines. Then Val explains how memory tiering — decoupling GPU memory from GPU compute — eliminates the most common source of GPU waste in inference today.
The core insight: when you need more memory for inference, the default move is to provision more GPUs just to get the memory that comes attached. Those GPUs then sit largely idle during inference, which wastes both capital and energy. Decoupling memory from compute, using storage that performs at memory speed, rebalances the system so you provision only what you actually need.
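A rough sizing sketch shows why this matters. It assumes a deliberately simple model in which the GPU count is set by whichever is larger, the memory requirement or the compute requirement. The function name, the parameters, and all the numbers are hypothetical; this is not how any specific vendor's tiering works.

```python
import math


def gpus_needed(memory_gb: float, compute_units: float,
                gpu_memory_gb: float, gpu_compute_units: float,
                external_memory_gb: float = 0.0) -> int:
    """GPUs to provision under a simple max(memory, compute) model.

    external_memory_gb is the part of the working set served from a
    memory-speed storage tier instead of GPU memory (an assumption).
    """
    memory_on_gpu = max(memory_gb - external_memory_gb, 0.0)
    by_memory = math.ceil(memory_on_gpu / gpu_memory_gb)
    by_compute = math.ceil(compute_units / gpu_compute_units)
    return max(by_memory, by_compute, 1)


# Hypothetical inference workload: large working set, modest compute.
coupled = gpus_needed(1600, 4, gpu_memory_gb=80, gpu_compute_units=1)
tiered = gpus_needed(1600, 4, gpu_memory_gb=80, gpu_compute_units=1,
                     external_memory_gb=1280)

if __name__ == "__main__":
    print("memory coupled to GPUs:", coupled)  # 20 GPUs, sized by memory
    print("memory tiered out:", tiered)        # 4 GPUs, sized by compute
```

In the coupled case, 16 of the 20 GPUs exist only to hold memory. Once most of the working set moves to an external tier, the count falls to what the compute actually requires.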
Waste heat as a community resource (yes, really)
A highlight of the episode is Dasha's description of the Nebius data center near Helsinki. It's engineered not just as a consumer of energy, but as a contributor to the local energy grid. Waste heat from servers feeds back into the municipal heating system — and in 2025, households in the area spent 10% less on heating because of it.
On an annual basis, Nebius recovers 20–30% of its electricity consumption as usable heat. That's not a footnote. That's infrastructure designed as part of an ecosystem, not in spite of one.
The AI metric to watch in 2026
Val leaves the audience with a transparency challenge: ask your AI infrastructure provider whether they're using proper augmented memory technologies. Are they warehousing tokens efficiently, or dropping them on the floor as they pump them out of their AI factories? The difference is 75–90% in efficiency gains — and it shows up directly in tokens per watt.
Dasha's parting advice is equally direct: challenge every provider on how they build their stack. Nebius's custom-designed servers draw 20% less power than pre-built alternatives. Last year in Finland, the combined hardware and cooling efficiencies avoided 50 gigawatt-hours of electricity, enough to run their Paris site for five to seven months. Voilà!
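For a sense of scale, the 50 GWh figure can be turned into an implied average power draw. This is back-of-the-envelope arithmetic on the two numbers quoted above, assuming a flat load and an average month of 730 hours. It is our own estimate, not a figure stated in the episode.

```python
HOURS_PER_MONTH = 8760 / 12  # 730 hours in an average month


def implied_average_mw(energy_gwh: float, months: float) -> float:
    """Average power in MW that would use energy_gwh over the given months."""
    energy_mwh = energy_gwh * 1000
    return energy_mwh / (months * HOURS_PER_MONTH)


if __name__ == "__main__":
    # 50 GWh stretched over five and over seven months.
    print(round(implied_average_mw(50, 5), 1))  # about 13.7 MW
    print(round(implied_average_mw(50, 7), 1))  # about 9.8 MW
```

Under those assumptions, the avoided electricity corresponds to a site drawing very roughly 10 to 14 MW on average.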
Why this conversation matters for you
Whether you're building AI infrastructure, buying it, or making decisions about where to deploy it, the takeaway from this episode is clear: efficiency isn't a sustainability nice-to-have. It's the competitive differentiator. You can't compete on token pricing without an efficient system — hardware, software, and everything that connects them.
The providers who engineer efficiency into every layer of their stack will set the prices. Everyone else will pay them.
Watch Deep Geeks Ep. 1 now — and if this changes how you think about AI infrastructure, share it with someone who needs to hear it.
Deep Geeks is available on YouTube, Spotify, Apple Podcasts, and wherever you get your podcasts.
Download the AI Energy Metrics Cheatsheet we developed based on this episode. It outlines six metrics and seven tips for efficient AI infrastructure.


