PODCAST

Can AI Survive Its Own Energy Appetite? | DeepGeeks Ep. 1

The energy is then translated into AI outputs, but it doesn't tell you the full story. How much energy per workload has been used, or how much energy per token has been used?

These resource demands are so intense right now that every ounce of inefficiency is not just bad for the environment, it's bad for business, and it restricts the agility of these companies.

Welcome to Deep Geeks. I'm Dr. Serena Huang. Today, we are going deep on a question that keeps a lot of people in our industry up at night, and it's not about models or data or compute in the way you might expect. Can AI innovation outpace its own energy demands, or are we building toward a power crisis? I have two incredible guests. First, Daria Mukhortova, Head of Sustainability at Nebius. And I'm also joined by Val Bercovici, Chief AI Officer at WEKA.

Thank you, Serena. I'm very happy to be here. Thank you for having me.

Pleasure, Serena. Looking forward to this.

Dasha, let's start with you. Tell us a little bit about your work at Nebius and why sustainability sits at the center of an AI infrastructure company.

My key task from the very beginning was to ensure that we don't just figure out what Nebius is and what it builds, but also how. This translates into a few principles that the team and I agreed on early on. First, we don't treat sustainability as something that comes after; we treat it as a principle. For us, sustainability is basically a synonym for efficiency and a synonym for reliability, which makes it understandable as an engineering approach. That pretty much summarizes the core things I focus on internally, and I'm looking forward to the discussion.

Likewise. Can't wait to dive in. Such an interesting perspective. And, Val, you are approaching this from a different angle as Chief AI Officer at WEKA.
And given your background building the foundation layers of cloud and storage infrastructure, what does energy efficiency mean to you when you are thinking about how AI systems are actually built?

Storage for accelerated compute for AI, as Johnson calls it, is very different than storage for traditional computing. It's this very deep-geek technical challenge of performance efficiency, not just capacity efficiency. And performance efficiency takes very sophisticated engineering for that high performance. You can say that general performance equals revenue and performance efficiency equals profit, and that's why, as Dasha was saying, there are really great aligned incentives here. It's not just that we're doing this out of altruism. It really aligns with us as a business trying to help our customers extract maximum profit out of these unprecedented capital and operational expenditures they have.

I am blown away by how these two perspectives are now actually converging, the sustainability lens and the infrastructure lens. For a long time, I remember them as very separate conversations. And now we see how energy efficiency isn't just a check-the-box exercise for sustainability. It's a real engineering constraint. So let's dive into how companies like Nebius are actually solving for it. Dasha, let me start with something concrete here. Nebius built its data center infrastructure near Helsinki, and that wasn't an accident. Could you please walk us through the thinking behind that? Specifically, I'm curious about the heat recovery piece, because the idea of recycling waste heat into usable energy feels like it flips this whole conversation.

Yes, indeed. Our data center in Finland is about thirty minutes south of Helsinki. It's what we internally call our playground for all the specific innovations that we then try to roll out across other sites.
It's the first data center that we built. What's interesting about that specific site is how it's engineered to be not just a consumer of energy, but also a contributor to the energy system. The specific example that you mentioned, heat recovery, is built into the cooling system cycle. For us, this is a win-win across many fields, and I can walk you through the benefits of having the system.

First of all, it's a great economic contribution to the local community, because by reusing the server heat that we capture, which is basically free, you are able to reduce the cost of producing heat for the municipality. Last year, in 2025, households spent ten percent less on heating because they were able to leverage the free server heat as a resource. If we talk about numbers and treat heat as a byproduct of electricity consumption, that would be recovering around twenty to thirty percent on an annual basis and giving it back to the local energy system. And that, I think, is key: thinking about infrastructure as an interplay of solutions that can be efficient on their own, but are also connected with the local energy system and find ways to power that system back.

I am going to keep that story in my back pocket, because I purposely spend a lot of time with people I would call AI skeptics. One of the reasons people say they don't want to touch AI is energy consumption. And you just illustrated a very different way of approaching AI that can actually translate into greater good for the whole community. So thank you for sharing that. And, Val, from your vantage point, when customers come to WEKA, what does this energy efficiency Dasha just described actually unlock for them? What becomes possible when infrastructure is built this way?
If you're striving for, again, these two forms of efficiency in the storage world, capacity efficiency and performance efficiency, that's a direct impact on your bottom line and your CapEx. Particularly since late 2025 and the beginning of 2026, we are now seeing the rise of agents. If we go from chat to reasoning, that's an order of magnitude more energy consumption and more demanding performance efficiency requirements. As we move from reasoning to agents, it's another 10x, meaning 100x more intensive compute workloads than for chat. And so the performance efficiency benefits I'm talking about get amplified 100x between WEKA and Nebius right now. It really puts a very big lens on how efficient you're being, because these resource demands are so intense right now that every ounce of inefficiency is not just bad for the environment, it's bad for business, and it restricts the agility of these companies. You're seeing some of the leading minds, the leading voices, the CEOs of OpenAI and Anthropic, publicly talk about balancing cash flows to survive as these very high-paced startups of unprecedented scale, and performance efficiency and overall efficiency are a big part of that.

So now we're seeing that efficiency at the infrastructure layer is not just about saving money or electricity bills. It's about what you can do with AI. And I see that the organizations that crack this are going to be able to run more experiments, train more models, and serve more inference, which brings me to the next question. So, Val, help me out here. What should we be thinking about? What should we be measuring? I've heard terms like microwatts per token and PUE, but help us understand the benchmarks that actually matter.

I'll start with a transparent one.
Maybe one of the things to remind people of is that this is a very transparent industry, particularly on the inference side. Token pricing is a very public metric and very, very competitive.

Yes.

Nebius and WEKA do a lot of work on optimizing token pricing right now. But with regard to this discussion, I'll let Dasha focus on PUE because I think that's one of our specialties. We can focus on tokens per watt. Very often when it comes to performance, we try to measure tokens per second, sometimes tokens per GPU, and GPUs, and now LPUs and other accelerator types, are diversifying. But tokens per watt is a really key metric. Inefficient inference is really where we're at today. Think of Amazon. Amazon isn't legendary for their factories. Amazon is legendary for their warehouses and their delivery logistics. Inference being so nascent, the way it works right now is that there are no warehouses in the inferencing reality of delivering tokens to users. There are just factories with lots of inefficiencies. Introducing a concept of token warehouses lets you not waste factory output and lets you optimize and scale the delivery of tokens to your users. It shows up very much in the actual tokens per watt, so much so that inefficient implementations end up consuming, per chat session or per agent session, as much as an individual household would use in a single day.

Wow.

Exactly. Whereas efficient warehousing of tokens, if you will, and efficient memory for tokens, not just storage, ends up reducing that by eighty percent or more, as in an active project we have with Nebius right now.

Wow. That's a huge difference. Dasha, do you want to chime in there on some of the metrics?

I guess I can definitely cover the PUE part. PUE is basically a metric that measures how efficiently energy is delivered to IT equipment. Right?
However, I would note that power usage effectiveness is great on its own for seeing how you design the physical infrastructure that's built around the specific GPUs. It doesn't tell you how efficiently the energy is then translated into AI outputs. So from my perspective, it's very important to take a broader look at how we can measure efficiency, and what Val was talking about should definitely be part of the picture. What I also see in the industry is that there is a good understanding of what PUE is, and hence people tend to compare different sites and different providers based on PUE, but it doesn't tell you the full story. So from my perspective, it's important to also introduce the measurement of, let's say, how much energy per workload has been used or how much energy per token has been used. Another level of depth here would be how much useful output has been produced per megawatt, or per watt, of energy that entered the system. We also differentiate between goodput and throughput here.

This brings me to one thought about how we can treat the infrastructure. The discussion is largely focused on hardware: how the chips are performing, how the servers are performing, what the cooling systems are, and how they are contributing. But the big question is about the software, because software plays an important role in orchestrating all of that hardware and infrastructure and allocating the workloads in a way that leverages all the available pieces of the fabric, so that servers do not stay idle. Idling is actually a killer of efficiency.
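To make the distinction between PUE and output-based metrics concrete, here is a minimal Python sketch. PUE is the standard ratio of total facility energy to IT equipment energy; the `Site` class, its field names, and every number below are hypothetical and are not Nebius or WEKA figures.

```python
from dataclasses import dataclass


@dataclass
class Site:
    """Hypothetical monthly measurements for one data center."""
    facility_kwh: float   # all energy entering the site (IT + cooling + losses)
    it_kwh: float         # energy delivered to IT equipment
    total_tokens: int     # every token generated (throughput view)
    useful_tokens: int    # tokens that met the service target (goodput view)

    def pue(self) -> float:
        # Power usage effectiveness: facility energy / IT energy (ideal is 1.0).
        return self.facility_kwh / self.it_kwh

    def wh_per_useful_token(self) -> float:
        # Facility energy spent per useful token, in watt-hours.
        return self.facility_kwh * 1000.0 / self.useful_tokens

    def goodput_ratio(self) -> float:
        # Share of generated tokens that counted as useful output.
        return self.useful_tokens / self.total_tokens


# Two made-up sites with the same PUE but very different useful output.
site_a = Site(facility_kwh=1200.0, it_kwh=1000.0,
              total_tokens=500_000_000, useful_tokens=450_000_000)
site_b = Site(facility_kwh=1200.0, it_kwh=1000.0,
              total_tokens=500_000_000, useful_tokens=100_000_000)

for name, site in (("A", site_a), ("B", site_b)):
    print(f"site {name}: PUE={site.pue():.2f}, "
          f"goodput={site.goodput_ratio():.0%}, "
          f"Wh per useful token={site.wh_per_useful_token():.4f}")
```

Both made-up sites report a PUE of 1.2, yet site B spends 4.5 times as much energy per useful token, which is the gap that PUE alone cannot show.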
I would frame this discussion as something that needs to take into account all the pieces and all the layers that infrastructure is composed of, starting from how you build a chip and how you build the enclosure of the chip, on to what your software is doing, how it's performing, and how it's orchestrating the workload.

And so goodput literally shows that you can have a very busy system. Right? We talk about GPU efficiency a lot because GPUs are such expensive assets and they're so expensive to operate. You can have a very busy system that's not really producing a lot of useful output, or producing it at a very, very slow rate, and that's if you measure throughput or just general utilization. But if you really focus on goodput, the question is: what is the actual useful output? What is the actual performance level and efficiency level that you want? You want utilization to focus on that, versus inefficiently being busy and delivering output that's not adequate. So the actual final tokens per second is something that's more important and more relevant to measure than just general utilization of the infrastructure to generate that output. It differentiates and elevates the conversation from just throughput, for example.

I think we keep hearing about hardware. Frankly, it's all we hear about most of the time, but we forget about the software. Dasha, is there anything else you want to add on software optimization when it comes to sustainability decisions?

From my perspective, I can also say that this is something Nebius decided to invest in from the very beginning: having the software layer on top of the hardware stack, because we see efficiency gains at that level. I can provide a couple of examples of how this can work. The software layer can be responsible for tracking certain failures in your workloads.
In the nodes, for instance, it's in your interest to fix this seamlessly so the workload doesn't start to retrain, because retraining means drawing twice as many resources. Auto scaling is necessary to ensure that the bursty consumption that AI workloads are normally characterized by is met with a right-sized cluster. That means ensuring that the servers or GPUs are not over-provisioned, meaning that they're not staying idle. Idling can actually draw quite a lot of power, especially when we talk about GPUs. So it's always a question of leveraging as much of the available capacity as possible to meet the workload, versus over-provisioning it, let's say, to one specific client and then locking it away from any useful output it could produce. That's what auto scaling at the software layer is also solving.

There are many examples like that, and I would also invite Val to contribute with his knowledge of storage systems, because what I also see is that storage has to be AI-tailored. There are several types of data that need to be stored. It could be a question of having active storage or cold storage, depending on what your workload actually needs. If it's training data, it has to go to active storage, meaning that you need faster access to it, and fewer bottlenecks means less energy being drawn. If you are talking about some historical logs, it's best to allocate them to another type of storage that consumes close to zero power when idling. Right? So these questions about how we manage and orchestrate the workload while it is running can bring significant efficiency gains on top of what infrastructure can provide in terms of cooling savings and drawing less power per run.
So tiering memory is probably the hottest thing in AI right now for inference, to respond to the agent demand, and the ability to provide storage capacities with memory performance to leading offerings like the Nebius Token Factory is probably one of the most exciting things that we're working on together. One of the reasons why memory tiering is so important is that today, as a general practice, if you want to scale inference, the memory that traditionally comes with GPUs is very tightly coupled to the GPUs. And you need more and more memory for inference; inference is really a memory problem. So you have to over-provision GPUs just for the memory they bring along for the ride. And during inference, those GPUs are largely idle. So it's really a waste of capital resources, and it's inefficient energy-wise as well. If you decouple, for the first time in the AI era, the compute from the memory, the GPUs from the GPU memory tiers, you can rebalance the system and provision only the memory needed for inference without over-provisioning those idle, wasteful GPUs at inference time. That's the magic here: the software combined with the hardware to balance the system, weed out all the inefficiencies, and yield that goodput that we're all chasing.

For most organizations, even very sophisticated ones, what happens between plugging in power and getting tokens out is completely opaque. And that's actually very expensive. So we're going to open this up. Let's contextualize the scale here, because I think the numbers are shocking to people who haven't looked at this closely. Val, what is the intelligence on how much energy AI actually consumes, from training to inference to just keeping the lights on at an AI data center?

We're now in the era of GPUs, which instead of thousands of cores per rack have millions of cores per rack. So it's a fundamentally different level of parallelization.
It's not just having to understand the engineering for how to make something operate a million times in parallel versus a thousand times in parallel. That's already a very steep engineering challenge from a compute, networking, storage, and software perspective. But the energy consumption for a rack of GPUs is hundreds of kilowatts, and we're now in the era of the latest generation of processors. I think by the time we're airing this, the latest generation of GPUs will be consuming up to a megawatt per rack.

Wow.

Which is, in historical context, almost an insane amount of energy consumption. And I think there's a real reason why many of the big announcements between the frontier labs that get a lot of headlines and the major GPU suppliers like NVIDIA and AMD and Google and others aren't in a performance metric or even a dollar value. A lot of these deals are announced in how many gigawatts of capacity they've agreed to provide each other. That's a really interesting evolution of how we discuss an industry: instead of a compute metric or an economic metric like dollars, we're already talking about gigawatts now, multiple gigawatts of energy for these deployments, largely for inference now.

And it has evolved so quickly. This is really innovation and scale. We couldn't imagine this probably even three years ago. Right?

A gigawatt is typically what one nuclear power plant generates.

Right.

We're talking about multiple nuclear power plants now dedicated just to one or more large-scale AI data centers, AI factories, and, with efficiencies, AI warehouses, token warehouses as well.

Yeah. Incredible. Well, Dasha, in your work with organizations on sustainability, I'm curious what misconceptions you run into.
Well, first of all, apart from the discussion around energy consumption, there is also this big focus on what happens with water resources. I guess the intuitive reaction of many people would be to assume that data centers and this infrastructure consume a lot of water. And that can be the case. However, the question is, how is the cooling system designed? This is the question I think the industry should be asking first, before jumping to the assumption that one site or another consumes a lot of water. To give the example of Nebius here: we do not rely on water intake, even though we're now introducing a liquid cooling system. That's because the system we are building is closed-loop. It doesn't include any evaporative components. We use outside air through dry coolers to bring down the temperature of the fluid, which circulates within the same loop for thousands, even millions, of cycles.

Another question is why this design was possible, if we go deeper into understanding the technology and link back to seeing this as a system as a whole. In our case, the reason we designed the system this way is also that our servers operate at higher temperatures. They are specifically designed to be resilient and high-performing at temperatures of up to forty-five degrees. So it means that we don't need as much power and as much cooling. That's one thing about water consumption.

When thinking about the energy footprint of an AI workload, and we have discussed this quite a few times already, the first thought is about hardware and the physical infrastructure. However, what gets overlooked is how the model itself is designed, because the design of the model, the setup of the model, defines to a big extent how efficiently it will operate on given hardware.
At Nebius, we were also thinking about that. Token Factory recently introduced a post-training optimization tool, which links us back to the discussion about the importance of software. What this tool does, in a nutshell, is tune and tweak the post-training setup of the model so it can run more efficiently on the hardware where it's hosted, meaning that the post-training will be completed faster, without failures, and that it will draw less power. So overlooking the model setup when talking about the footprint is another misconception that I face a lot in discussions.

Probably the third one I would mention is that there is a big discussion now, and it gets a lot of traction in the news, about adding newly built energy capacity. There's a lot of investment in this area, literally building an energy facility that would be linked to your data center to supply power. And the question that I would ask is how much of this new build will actually be translated into useful compute, or is it being added and then lost to overheads? This is the question that I think will become even more important than the discussion about how much capacity company A or facility A needs to add to meet demand.

We all hear about how expensive and how energy-consuming data centers are, but there's a lot more happening inside the black box than most people realize.

And the other point of emphasis, we just can't repeat it enough in this conversation, is systems efficiency across hardware and software. One of the transparency metrics that hopefully will become more prominent throughout 2026 is: are you using proper memory technologies, augmented memory technologies? Are you warehousing tokens, or dropping them on the floor as you're pumping them out from your AI factories?
Because in doing that, it's proven now that you get anywhere from seventy-five to ninety percent efficiencies by balancing systems with the appropriate level of memory, versus not doing it in the nascent era of inference.

As for what advice we could give to customers that are choosing an AI infrastructure provider, I would say first that every infrastructure provider has to be challenged on how they're building their stack. I understand that, of course, customers will largely be driven by things such as cost efficiency, the price of the offering, as well as reliability. But another question is, how is that achieved? Not every company would be the same here. In our case, in the case of Nebius, price and affordability are managed through the efficiencies achieved throughout the stack, which gives us the freedom to set a price that is more favorable for clients, not just because we choose to, but because we can optimize our costs. Our servers draw twenty percent less power than any off-the-shelf solution because we design them in house, and this also results in a very tangible cost saving for the team. Right? Then it translates into savings at the cooling system level, at the PUE level that we discussed. And it adds up on top of that. I can even give you an example from Finland, since I have these numbers already. Last year, because of hardware efficiency, meaning server efficiency and cooling system efficiency, we were able to avoid fifty gigawatt-hours of electricity use. Using that power, we could run our Paris site for, I guess, five to seven months, 24/7. So this is a lot of cost saving that then translates into cost efficiency for clients.
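As a back-of-the-envelope check on these figures, the Python sketch below converts fifty gigawatt-hours spread over five to seven months of continuous operation into an implied average power draw. The months-to-hours conversion (8,760 hours per year divided by 12) is an assumption, the function name is hypothetical, and the resulting megawatt range is only what the quoted numbers imply, not a figure stated in the conversation.

```python
HOURS_PER_MONTH = 8760 / 12  # assumption: average month length, 730 hours


def implied_average_mw(energy_gwh: float, months: float) -> float:
    """Average power (MW) that uses `energy_gwh` over `months` of 24/7 operation."""
    energy_mwh = energy_gwh * 1000.0
    return energy_mwh / (months * HOURS_PER_MONTH)


avoided_gwh = 50.0  # figure quoted for the Finland site
for months in (5, 7):
    print(f"{months} months: about {implied_average_mw(avoided_gwh, months):.1f} MW")
```

Under that assumption, the quoted range corresponds to an average draw of roughly 10 to 14 MW.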
And I think this will become a growing interest for many clients, for enterprises trying to understand what kind of footprint, energy footprint and carbon footprint, the workloads they run with you actually produce. As a provider, you should be able to show that to clients. And if you're not able to do that, that should also be something to discuss and consider.

Very helpful and very practical. As we close out today, I'm curious if you can leave our listeners with some parting thoughts. What is one thing you want them to remember? What is one mindset shift? What is one key message?

Yeah. I think the message, probably even more for providers than for consumers, is that this is such a competitive industry, and it's so literally at the forefront of science and technology and engineering, that market forces will drive efficiency as part of overall competitive offerings. You really won't be able to compete, particularly on token pricing, which is almost everything in the inference industry, if you don't have an efficient system: hardware, software, environmental, et cetera. So I'm actually generally optimistic. Right? It's easy to be a doomer in this industry, but when you take a look at the realities of how you build systems to compete, efficiency is not optional. It was optional in the past for IT systems. It was nice to have. It's just on the critical path. It's fundamental and essential to having a competitive offering in the AI training and particularly inference markets.

From my perspective, I think we all agree that AI brings benefits. It can accelerate drug discovery and research on diseases, and many other critical fields can benefit from AI. So the use of energy is actually worth it. It makes AI worth the use of energy. And what will define the future of AI is how well we engineer it.
And if we think about this as an ecosystem of different decisions made at different levels by different players, from chip providers to infrastructure providers and how they design the system, on to software and how different tools are being built, and if we all work together, this is how I think we can achieve a sustainable AI that produces maximum value but is also mindful and conscious of the resources it is using.

I feel so inspired and just encouraged by that. Thanks for listening to Deep Geeks. A huge thank you to our guests today, Dasha and Val, for bringing such depth and honesty to a conversation the whole industry really needs right now. If our episode today made you think differently about how AI gets built or powered, please share it with someone who needs to hear it. Find Deep Geeks on Spotify, YouTube, or wherever you get your podcasts. Until next time.

AI infrastructure is scaling at an unprecedented pace — but at what energy cost? In the debut episode of Deep Geeks, host Dr. Serena Huang sits down with Daria Mukhortova, Head of Sustainability at Nebius, and Val Bercovici, Chief AI Officer at WEKA, to unpack the real relationship between AI performance and energy efficiency. From data center heat recovery to token warehousing, this conversation goes deep on the engineering decisions that separate sustainable AI from wasteful AI.

Transcript

Meet the Speakers:

Dr. Serena Huang (Host) — Data and AI strategist with 10+ years of Fortune 100 leadership across GE, Kraft Heinz, and PayPal. Author of The Inclusion Equation (Wiley, 2025) and founder of Data with Serena.

Daria Mukhortova — Head of Sustainability at Nebius, where she embeds efficiency-first principles across Nebius’s AI infrastructure stack — from custom server design to closed-loop cooling systems.

Val Bercovici — Chief AI Officer at WEKA. Former CTO at NetApp/SolidFire with patents in AI agent smart contracts and streaming data integrity. At WEKA, Val drives product strategy around high-performance storage for accelerated compute and AI inference.

00:00

Why Sustainability Is an Engineering Problem, Not an Afterthought

Serena: Dasha, tell us about your work at Nebius and why sustainability sits at the center of an AI infrastructure company.

Dasha: My key task from the very beginning was to ensure that we don’t just figure out what Nebius is and what it builds, but also how. This translates into a few principles that myself and the team agreed on early on. First, we don’t treat sustainability as something that comes after — we treat it as a principle. For us, sustainability is a synonym for efficiency, a synonym for reliability, which makes it understandable as an engineering approach.

Serena: And Val, you’re approaching this from a different angle. As Chief AI Officer at WEKA, what does energy efficiency mean when you think about how AI systems are actually built?

Val: Storage for accelerated compute for AI is very different than storage for traditional computing. It’s a deep technical challenge of performance efficiency, not just capacity efficiency. Performance efficiency requires very sophisticated engineering. You can say that general performance equals revenue, and performance efficiency equals profit. As Dasha was saying, there are really great aligned incentives here. It’s not just altruistic — it aligns with us as a business trying to help our customers extract maximum profit out of these unprecedented capital and operational expenditures.

03:12

How Nebius Turns Data Center Waste Heat Into Community Energy

Serena: Let’s dive into something concrete. Nebius built data center infrastructure near Helsinki, and that wasn’t an accident. Walk us through the thinking — specifically the heat recovery piece, because the idea of recycling waste heat into usable energy flips this whole conversation.

Dasha: Our data center in Finland, about 30 minutes south of Helsinki, is what we internally call our playground for innovations that we then roll out across other sites. What’s interesting about this site is how it’s engineered to be not just a consumer of energy, but a contributor to the energy system. The heat recovery is built into the cooling system cycle.

First, it’s a great economic contribution to the local community. By reusing server heat — which is essentially free — you reduce the cost of producing heat for the municipality. In 2025, households spent 10% less on heating because they were able to leverage that free server heat as a resource. In terms of numbers, we’re recovering around 20–30% of electricity consumption as heat on an annual basis and giving it back to the local energy system. When we think about infrastructure as an interplay of solutions that can be efficient on their own but also connected with the local energy system, that’s the key.

Serena: I’m going to keep that story in my back pocket. I spend a lot of time with AI skeptics, and one of the top reasons people resist AI is energy consumption. You just illustrated a very different approach that can translate into greater good for the whole community.

06:40

What Energy-Efficient Infrastructure Unlocks for AI Customers

Serena: Val, from your vantage point, when customers come to WEKA, what does this energy efficiency actually unlock for them?

Val: If you’re striving for capacity efficiency and performance efficiency in the storage world, it has a direct impact on your bottom line and CapEx. Since late 2025 and into 2026, we’ve been seeing the rise of agents. If we go from chat to reasoning, that’s an order of magnitude more energy consumption. From reasoning to agents, it’s another 10x, meaning agent workloads are 100x more compute-intensive than chat.

The performance efficiency benefits get amplified 100x between WEKA and Nebius right now. Every ounce of inefficiency is not just bad for the environment, it’s bad for business, and it restricts agility. You’re seeing leading voices — the CEOs of OpenAI and Anthropic — publicly talk about balancing cash flows to survive as these unprecedented-scale startups. Performance efficiency is a big part of that.

08:37

Tokens Per Watt, PUE, and the Metrics That Actually Matter for AI Efficiency

Serena: What should we be measuring? I’ve heard terms like microwatts per token and PUE — help us understand the benchmarks that matter.

Val: I’ll start with transparency. Token pricing is a very public metric, very competitive. Nebius and WEKA do a lot of work optimizing token pricing. But with regard to this discussion, I’ll let Dasha focus on PUE. We can focus on tokens per watt.

Very often we measure tokens per second, sometimes tokens per GPU, and GPUs and other accelerator types are diversifying. But tokens per watt is a really key metric. Inefficient inference — which is where we are today — is like a logistics system with no warehouses. Amazon isn’t legendary for their factories; they’re legendary for their warehouses and delivery logistics. In the inference world right now, there are no warehouses — just factories with lots of inefficiencies.

Introducing the concept of token warehouses lets you stop wasting factory output and optimize delivery of tokens to users. It shows up directly in tokens per watt. Inefficient implementations can consume as much energy per chat or agent session as an individual household uses in a single day. Efficient token warehousing and efficient memory for tokens reduce that by 80% or more. That’s an active project we have with Nebius right now.

Dasha: PUE — power usage effectiveness — measures how efficiently energy is delivered to IT equipment. However, it doesn’t tell you how efficiently the energy is translated into AI outputs. It’s important to have a broader look at efficiency. What Val was talking about should definitely be part of the picture.

There’s good understanding of PUE in the industry, and people tend to compare providers based on it, but it doesn’t tell the full story. It’s important to also measure how much energy per workload or per token has been used. Another level of depth is differentiating between goodput and throughput — how much useful output is produced per watt of energy entering the system.
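To make the distinction concrete, here is a minimal Python sketch. It uses the standard definition of PUE (total facility energy divided by IT equipment energy); the two sites and all of their figures are hypothetical, chosen only to show that identical PUE values can hide very different energy per token.

```python
def pue(total_facility_kwh: float, it_equipment_kwh: float) -> float:
    """Power usage effectiveness: total facility energy / IT equipment energy."""
    if it_equipment_kwh <= 0:
        raise ValueError("IT equipment energy must be positive")
    return total_facility_kwh / it_equipment_kwh


def facility_joules_per_token(total_facility_kwh: float, tokens: int) -> float:
    """Facility energy spent per generated token, in joules (1 kWh = 3.6e6 J)."""
    if tokens <= 0:
        raise ValueError("token count must be positive")
    return total_facility_kwh * 3.6e6 / tokens


# Hypothetical sites: same facility and IT energy, different token output.
sites = {
    "A": {"total_kwh": 1200.0, "it_kwh": 1000.0, "tokens": 2_000_000_000},
    "B": {"total_kwh": 1200.0, "it_kwh": 1000.0, "tokens": 500_000_000},
}

for name, s in sites.items():
    print(
        name,
        "PUE:", round(pue(s["total_kwh"], s["it_kwh"]), 2),
        "J/token:", round(facility_joules_per_token(s["total_kwh"], s["tokens"]), 2),
    )
```

Both hypothetical sites report the same PUE, yet site B spends four times as much energy per token. Since a watt is one joule per second, "tokens per second per watt" is simply the reciprocal of joules per token.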

12:57

Why Software Optimization Matters More Than Hardware Alone

Dasha: This brings me to how we treat infrastructure. The discussion is largely focused on hardware — chips, servers, cooling systems. But the big question is about software, because software plays an important role in orchestrating all of that hardware and allocating workloads in a way that leverages all available capacity and keeps servers from staying idle. Idling is actually a killer to efficiency. This discussion needs to account for all layers, from how you build a chip to what your software is doing and how it orchestrates workloads.

Val: Goodput literally shows that you can have a very busy system that isn’t producing useful output, or that’s producing it at a very slow rate. If you measure only throughput or general utilization, you miss this. What you want is utilization spent on goodput, meaning actual useful output at the performance and efficiency level you need, rather than a system that is inefficiently busy. Final tokens per second is more important than general utilization of the infrastructure.
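A small Python sketch of the goodput versus throughput distinction follows. The definition of a "useful" token (one that is delivered to a user and not discarded or recomputed) and every number in it are assumptions for illustration, not measurements from WEKA or Nebius.

```python
from dataclasses import dataclass


@dataclass
class Window:
    """One measurement window for a serving system (all fields hypothetical)."""
    tokens_generated: int    # everything the system produced
    tokens_useful: int       # tokens actually delivered and kept (assumed definition)
    seconds: float           # length of the window
    avg_power_watts: float   # average power draw over the window


def throughput(w: Window) -> float:
    """Raw tokens per second, useful or not."""
    return w.tokens_generated / w.seconds


def goodput(w: Window) -> float:
    """Useful tokens per second."""
    return w.tokens_useful / w.seconds


def useful_tokens_per_joule(w: Window) -> float:
    """Useful tokens per joule of energy (watts * seconds = joules)."""
    return w.tokens_useful / (w.avg_power_watts * w.seconds)


busy = Window(tokens_generated=1_000_000, tokens_useful=400_000,
              seconds=100.0, avg_power_watts=10_000.0)
lean = Window(tokens_generated=600_000, tokens_useful=570_000,
              seconds=100.0, avg_power_watts=8_000.0)

for name, w in (("busy", busy), ("lean", lean)):
    print(name, throughput(w), goodput(w), useful_tokens_per_joule(w))
```

In this made-up comparison the "busy" system wins on throughput but loses on both goodput and useful tokens per joule, which is the pattern a utilization dashboard alone would miss.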

Dasha: This is something Nebius decided to invest in from the very beginning — a software layer on top of the hardware stack. For example, the software layer can track failures in workloads and fix them seamlessly so the workload doesn’t have to retrain. Retraining means drawing twice as many resources.

Autoscaling is critical for matching the bursty consumption that characterizes AI workloads to a right-sized cluster, so that GPUs aren’t overprovisioned and sitting idle. Idling draws significant power, especially with GPUs. Autoscaling at the software layer solves this.
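The right-sizing idea can be sketched in a few lines of Python. This is not Nebius’s autoscaler; the sizing rule, the 20% headroom, the per-GPU capacity, and the hourly demand figures are all hypothetical, and a real system would also account for scale-up latency and minimum cluster sizes.

```python
import math


def gpus_needed(demand_tokens_per_s: float,
                per_gpu_tokens_per_s: float,
                headroom: float = 0.2,
                min_gpus: int = 1) -> int:
    """GPUs required to serve a demand level with a safety headroom (assumed rule)."""
    if per_gpu_tokens_per_s <= 0:
        raise ValueError("per-GPU capacity must be positive")
    required = demand_tokens_per_s * (1 + headroom) / per_gpu_tokens_per_s
    return max(min_gpus, math.ceil(required))


# Hypothetical bursty demand, one value per hour, in tokens per second.
hourly_demand = [2100, 2600, 11500, 3300, 1900, 1400]
PER_GPU = 1000.0  # hypothetical tokens per second one GPU can serve

per_hour = [gpus_needed(d, PER_GPU) for d in hourly_demand]
autoscaled_gpu_hours = sum(per_hour)
static_gpu_hours = max(per_hour) * len(per_hour)  # sized for the peak all day

print("GPUs per hour:", per_hour)
print("autoscaled GPU-hours:", autoscaled_gpu_hours)
print("static (peak-sized) GPU-hours:", static_gpu_hours)
```

The gap between the two totals is the idle capacity that a peak-sized static cluster would keep powered during the quiet hours.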

Storage also has to be AI-tailored. There are different types of data — active storage versus cold storage. Training data needs active storage for faster access and fewer bottlenecks, which means less energy drawn. Historical logs are better allocated to storage that consumes near-zero power when idle. Managing and orchestrating workloads at runtime can bring significant efficiency gains on top of what infrastructure provides through cooling savings and reduced power draw.

16:49

How Memory Tiering Reduces GPU Waste in AI Inference

Val: Tiering memory is probably the hottest thing in AI right now for inference, to respond to agent demand. The ability to provide storage capacities with memory performance to offerings like the Nebius Token Factory is one of the most exciting things we’re working on together.

Today, the memory used for inference is the memory that comes with the GPUs, and it is tightly coupled to them. Inference is fundamentally a memory problem, so as you need more memory, the general practice is to overprovision GPUs just for the memory they bring along. During inference, those GPUs are largely idle. It’s a waste of capital resources and inefficient energy-wise.

If you decouple compute from memory for the first time in the AI era — GPUs from GPU memory tiers — you can rebalance the system and only provision the memory needed for inference without overprovisioning idle, wasteful GPUs. That’s the magic: software combined with hardware to balance the system, weed out inefficiencies, and yield the goodput we’re all chasing.
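The provisioning arithmetic behind that argument can be sketched as follows. This is a simplified model with hypothetical round numbers, not a description of WEKA’s or Nebius’s actual design: it assumes only a "hot" working set must sit in GPU memory once an external memory tier exists, and it ignores the latency and bandwidth of that tier.

```python
import math


def gpus_coupled(compute_gpus: int, memory_needed_gb: float, gb_per_gpu: float) -> int:
    """GPU count when GPU memory is the only place inference state can live."""
    return max(compute_gpus, math.ceil(memory_needed_gb / gb_per_gpu))


def gpus_decoupled(compute_gpus: int, hot_memory_gb: float, gb_per_gpu: float) -> int:
    """GPU count when only the hot working set must sit in GPU memory."""
    return max(compute_gpus, math.ceil(hot_memory_gb / gb_per_gpu))


# Hypothetical workload: compute needs 8 GPUs, inference state needs 4000 GB.
COMPUTE_GPUS = 8
TOTAL_STATE_GB = 4000.0
HOT_STATE_GB = 600.0   # assumed share that must stay on-GPU
GB_PER_GPU = 100.0     # hypothetical memory per GPU

coupled = gpus_coupled(COMPUTE_GPUS, TOTAL_STATE_GB, GB_PER_GPU)
decoupled = gpus_decoupled(COMPUTE_GPUS, HOT_STATE_GB, GB_PER_GPU)
external_tier_gb = TOTAL_STATE_GB - HOT_STATE_GB

print("coupled GPUs:", coupled)
print("decoupled GPUs:", decoupled)
print("GB served from external tier:", external_tier_gb)
```

In the coupled case the GPU count is set by memory rather than compute, so most of the fleet exists only to hold state; in the decoupled case the count falls back to what compute requires.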

18:09

How Much Energy Does AI Actually Consume? From Racks to Gigawatts

Serena: Let’s contextualize the scale, because these numbers are shocking. Val, how much energy does AI actually consume — from training to inference to keeping the lights on?

Val: We’re now in the era of GPUs which, instead of thousands of cores per rack, have millions of cores per rack. A fundamentally different level of parallelization. Energy consumption for a rack of GPUs is hundreds of kilowatts. With the latest generation processors, we’re heading toward a megawatt per rack.

In historical context, that’s almost insane. Many of the big announcements between frontier labs and GPU suppliers like NVIDIA, AMD, and Google aren’t in performance metrics or dollar values anymore — they’re announced in how many gigawatts of capacity they’ve agreed to provide each other. A gigawatt is typically what one nuclear power plant generates. We’re talking about multiple nuclear power plants now dedicated to large-scale AI data centers.

20:48

Common Misconceptions About AI's Environmental Footprint

Serena: Dasha, what are the most common misconceptions you encounter?

Dasha: First, beyond energy consumption, there’s a big focus on water resources. The intuitive reaction is that data centers consume a lot of water, and it can be the case. But the question is: how is the cooling system designed? That’s what the industry should ask first.

At Nebius, we don’t rely on water intake. Even though we’re introducing liquid cooling, the system is closed-loop with no evaporative components. We use outside air through dry coolers to reduce the temperature of fluid that circulates within the same loop through millions of cycles. This design is possible because our servers are specifically engineered to be resilient and high-performing at temperatures up to 45 degrees Celsius, so we need less cooling power.

Second, what gets overlooked is how the model itself is designed. The model’s setup defines, to a big extent, how efficiently it operates on given hardware. Nebius Token Factory recently introduced a post-training optimization tool that tunes the model’s setup so it runs more efficiently on the hosting hardware — meaning post-training completes faster, without failures, and draws less power.

Third, there’s a big discussion around adding energy capacity — building facilities linked to data centers to supply power. The question I would ask is: how much of this new build will actually be translated into useful compute? Or is it adding capacity and then losing it to overheads? That question will become more important than how much capacity a company needs to add.

25:07

Why Efficiency Is the New Competitive Advantage in AI Infrastructure

Val: Systems efficiency of hardware and software together is critical. One transparency metric I hope becomes more prominent throughout 2026 is whether providers are using proper augmented memory technologies. Are you warehousing tokens or dropping them on the floor as you pump them out of your AI factories? It’s proven now that you get 75–90% efficiency gains by balancing systems with appropriate memory versus not doing it.

Dasha: For customers choosing an AI infrastructure provider, my advice is: challenge every provider on how they’re building their stack. Customers are driven by cost efficiency and reliability, but also ask how that’s achieved. In our case at Nebius, affordability is managed through efficiencies achieved throughout the stack, which gives us freedom to set favorable pricing — not arbitrarily, but because we can optimize our costs.

Our servers draw 20% less power than off-the-shelf solutions because we design them in-house. That translates into savings at the cooling system level and the PUE level. Last year in Finland, because of hardware and cooling system efficiency, we avoided 50 gigawatt-hours of electricity use. That saved power could run our Paris site for five to seven months, 24/7.
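As a back-of-envelope check on those figures, the Python sketch below converts 50 gigawatt-hours spread over five to seven months into an average continuous draw. The result is only what the quoted numbers imply when taken at face value; it is not a stated capacity for any site, and the average month length of 730 hours is an approximation.

```python
HOURS_PER_MONTH = 8760 / 12  # average month length in hours (730)


def implied_average_mw(energy_gwh: float, months: float) -> float:
    """Average continuous draw (MW) that would consume energy_gwh over the given months."""
    if months <= 0:
        raise ValueError("months must be positive")
    return energy_gwh * 1000 / (months * HOURS_PER_MONTH)  # GWh -> MWh, then / hours


for months in (5, 7):
    print(months, "months:", round(implied_average_mw(50.0, months), 1), "MW")
```

On these assumptions the avoided energy corresponds to a load of roughly 10 to 14 megawatts running around the clock.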

This will be of growing interest to enterprise clients trying to understand what kind of energy and carbon footprint their workloads carry. As a provider, you should be able to report that to clients, and if you can’t, that’s something to consider.

28:01

The Future of Sustainable AI: Closing Thoughts

Serena: As we close out, what’s one thing you want listeners to remember?

Val: This message is for providers and consumers alike: this is such a competitive industry at the forefront of science, technology, and engineering that market forces will drive efficiency as part of competitive offerings. You won’t be able to compete on token pricing without an efficient system — hardware, software, environmentals. I’m generally optimistic. It’s easy to be a doomer in this industry, but when you look at how you build systems to compete, efficiency is not optional. It was optional in the past for IT systems. Now it’s on the critical path — fundamental and essential for a competitive offering in AI training and especially inference.

Dasha: We all agree that AI brings benefits. It can accelerate drug discovery, disease research, and many other critical fields — which means the use of energy is worth it. What will define the future of AI is how well we engineer it. If we think about this as an ecosystem of decisions made at different levels with different players — from chip providers to infrastructure providers to software builders — and if we all work together, that’s how we achieve sustainable AI that produces maximum value while being mindful of the resources it uses.

Serena: Thanks for listening to Deep Geeks. A huge thank you to Dasha and Val for bringing such depth and honesty to a conversation the whole industry needs right now. If this episode made you think differently about how AI gets built or powered, share it with someone who needs to hear it. Find Deep Geeks on Spotify, YouTube, or wherever you get your podcasts. Until next time.
