
Your GPUs Are Waiting. The IO Blender and Memory Wall Explain Why.

Kai Williams of SemiAnalysis sat down with WEKA Chief AI Officer Val Bercovici at GTC 2026 to unpack the IO blender problem, the memory wall now hitting AI infrastructure, and why software efficiency has become a competitive imperative.

Below is a transcript of the conversation, which has been lightly edited for clarity.

Kai Williams: Well, welcome to another SAIL interview. Can you introduce yourself?

Val Bercovici: Absolutely. Val Bercovici, Chief AI Officer at WEKA.

Kai: And so what does WEKA do?

Val: WEKA provides storage and memory software, and I’ll explain what memory software means for AI infrastructure. We have a traditionally strong brand in the supercomputing and high-performance computing space. Post-ChatGPT, we all realized it’s the same basic hardware infrastructure, and it turns out that our storage software works really, really well for AI. We’ll talk a bit more about how that actually applies to KV cache memory as well.

Kai: When ChatGPT happened, was it an immediate thing? Like, “This is the future of our company now”? Is this one of those JFK moments where you remember exactly where you were?

Val: Very much so. My background, just going a little further back: I used to be CTO at NetApp after they acquired a cloud storage company called SolidFire. That was around 2015-2017. I actually spent time with folks at Google, taking their container orchestration platform and helping open-source it as Kubernetes. So I actually thought, “Hey, storage is done. You can’t innovate more than storage for Kubernetes.” I went away and did some cybersecurity work for a while. Then November 2022 happened, and all of a sudden everyone had the realization that not only do you need storage for AI, but storage is exciting again, because it’s a very different kind of storage, very parallel, that we’d never seen before commercially outside of a few national labs. So yes, very exciting.

What Is the IO Blender Problem in AI Infrastructure?

Kai: Can you walk me through the problem you’re trying to solve? You mentioned this parallel thing, but what are the new challenges?

Val: So there are old and new challenges in AI. There are pre- and post-ChatGPT challenges, to simplify it. Pre-ChatGPT, a lot of time was spent on model training, regular language models as well as large ones. One thing that only infrastructure people realize is that when you’re training models, you’re not training off just a handful of files. It’s billions of files, and they’re not the same size. They’re not just large, they’re not just small; they’re scattered everywhere. So there are metadata requirements. We call this the “IO blender.” Pre-large language models, just training a model was really complicated. You had to run very popular systems like Lustre, for example, and it took full-time jobs for many, many teams of people just to make the file system perform across large files, small files, and lots of metadata lookups. These are problems we solved well with just one simple, consistent software layer.
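To make that access pattern concrete, here is a minimal, hypothetical Python sketch (not WEKA code; the paths, numbers, and function name are made up) of what a training data loader asks of a file system: a mix of small and large reads over huge file counts, each one preceded by metadata lookups.

```python
# Illustrative sketch of the "IO blender" access pattern during training data loading.
import os
import random

def load_training_shard(paths, sample_size=1000, rng=random.Random(0)):
    """Read a random sample of files, mimicking one step of a training data loader."""
    stats = {"metadata_ops": 0, "small_reads": 0, "large_reads": 0}
    for path in rng.sample(paths, k=min(sample_size, len(paths))):
        size = os.stat(path).st_size           # metadata lookup before any data moves
        stats["metadata_ops"] += 1
        with open(path, "rb") as f:            # another metadata op, then the read itself
            f.read()
        if size < 1 << 20:                     # files under 1 MiB: latency-bound small IO
            stats["small_reads"] += 1
        else:                                  # larger files: bandwidth-bound streaming IO
            stats["large_reads"] += 1
    return stats
```

Run across billions of files of mixed sizes, this blend of small reads, large reads, and constant metadata traffic is what a single consistent storage layer has to absorb.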

Post-ChatGPT, it didn’t happen right away. It was something I literally joined the company anticipating, that you bake the cake when you train the model, and then you have to eat the cake; you have to monetize that huge capital investment. So we realized inference was going to be a really big deal. Quite frankly, we were probably a few months ahead of the curve, because we were here talking inference, inference, inference at GTC 2025, and a lot of the audience and a lot of the keynotes weren’t that inference-centric yet. It just became obvious to us that inference would be a thing.

The really notable element was that there were no real agents in production or widely used at GTC ‘25. Then May happened, Claude Code started to appear, and everyone said, “Oh wow, this is really good.” Then December was, I think, the inflection point Jensen Huang was referring to in his keynote, and all of a sudden there’s no end to the rise and growth in token demand, and especially agentic token demand. That’s the world we find ourselves in today. I wouldn’t say we’re taking a victory lap, but we’re very gratified that the market has realized our vision. And we have a product in the market today that addresses this, one that’s upward-compatible with something called Context Memory Storage (CMX) from NVIDIA, which will be available later this year when Rubin and BlueField-4 appear.

Why Training and Inference Require Completely Different Infrastructure

Kai: Can you talk a little about the specific challenges of inference? Because we have these training challenges that have been around for a while. I was talking with a networking guy who said they have to wire data centers differently because peak demand is so much higher. But what specifically are the challenges you need to solve with inference that weren’t there before?

Val: Yeah. To summarize it: When you’re doing training, you really are building a supercomputer. You have to have scale-up and scale-out networking, very often InfiniBand, as Jensen elaborated last year. You really have to make multiple GPUs across multiple racks behave as one, and there are some amazing technologies NVIDIA has brought to market. The Mellanox acquisition really accelerated that.

Inference is very, very different. Inference — particularly the way we’re enabling it — is a lot like web serving. It’s much more stateless if you do it right. It’s much more decentralized than centralized. In theory, and I think this is why we saw the Groq LPU announcement at the keynote, you construct inference with very different infrastructure, and the requirements are different. There’s no big variance in latency for training; it’s like a giant set of batch jobs, basically checkpoints.

For inference, we’re now seeing segmentation. It’s not just low latency you want, but fixed latency for smooth, non-awkward voice conversations with agents, the way they were mocked during the Super Bowl ads, right? And then you want maybe a middle tier for chat sessions and research sessions. And then you need almost parallel batch sessions for agent swarms, because we’re seeing these multi-hour swarms and multi-day swarms right now. You don’t want to wait days for these things to come back. If you can cut the latency for every turn of a multi-turn agent, all those compounding delays become compounding shrinkages of time and latency. So these tiers are emerging, and that’s really the infrastructure difference between training and inference.
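As a rough, back-of-the-envelope illustration of that compounding effect (all numbers below are assumptions, not measurements):

```python
# Illustrative arithmetic only: per-turn latency compounds across a multi-turn agent session.
turns = 400          # turns in one long-running agent session (assumed)
slow_turn_s = 6.0    # seconds per turn at a higher-latency tier (assumed)
fast_turn_s = 1.5    # seconds per turn after cutting per-turn latency (assumed)

slow_min = turns * slow_turn_s / 60
fast_min = turns * fast_turn_s / 60
print(f"one agent session: {slow_min:.0f} min -> {fast_min:.0f} min")
# A swarm repeats the same saving for every concurrent session it runs.
```

The point is not the specific numbers but the shape: whatever you shave off a single turn gets multiplied by every turn of every session in the swarm.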

What Is the AI Memory Wall and Why Has It Already Arrived?

Kai: I guess it’s a lot of distinct queries you need to handle at once and be able to scale up and scale down. Do you have a sense of what’s coming next? You were early on the inference drum, and we’re seeing agents really start to take off — really just since November, so it’s still super new. As we get through 2026, are we going to hit a token crunch? What’s in the pipeline?

Val: Yeah. I don’t want to be too dramatic here, but we have already hit something called the “memory wall,” where the traditional balance in scientific computing, inside GPU kernels, between FLOPs and memory has been completely disrupted. The demands of model weights for large language models, coupled with large KV caches per user and per agent session in a swarm, have literally hit a wall right now. There’s an imbalance: you just don’t have nearly enough memory, and that puts a lot of pressure on recomputation. So instead of logically prefilling once, you’re redundantly prefilling thousands or millions of times across agent swarms.
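A toy Python sketch of the reuse idea (purely illustrative; compute_kv and cached_prefill are hypothetical stand-ins, not WEKA or NVIDIA APIs): prefill a shared context once and reuse the resulting KV state across agent sessions instead of recomputing it for each one.

```python
# Toy illustration of prefill reuse; not a real inference stack.
from functools import lru_cache

def compute_kv(prefix: str) -> tuple:
    """Stand-in for an expensive prefill pass that builds the KV cache for `prefix`."""
    return tuple(hash(token) for token in prefix.split())  # placeholder "KV state"

@lru_cache(maxsize=1024)  # a real system would tier this across HBM, DRAM, and flash
def cached_prefill(prefix: str) -> tuple:
    return compute_kv(prefix)

shared_context = "system prompt plus a long shared document ..."
for _ in range(1000):                       # a swarm of agents sharing one context
    kv = cached_prefill(shared_context)     # computed once, reused 999 times
    # ... each agent's decode turns would then run on top of the reused KV state ...
```

Without that reuse, every session pays the full prefill again, which is exactly the redundant recomputation the memory wall forces.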

Solving the memory wall is how you relieve that pressure and how you deliver memory latency and memory bandwidth. This was a big emphasis in the keynote in terms of Groq bandwidth and SRAM, but also memory capacity. You want very large memory capacities now to scale past this memory wall. That’s one thing the industry is moving toward, and again, that’s what the CMX platform from NVIDIA is going to address later this year. But with exponential growth, the opportunity cost is right now: you don’t want to wait until the end of the year to scale past this memory wall.

The other thing is simply that the industry can’t fabricate enough. The supply chain can’t fabricate enough GPUs, can’t fabricate enough DRAM DIMMs or HBM, can’t fabricate enough NAND flash drives, or even hard drives. So efficiency, which used to be a luxury for well-capitalized frontier labs doing training and even for well-capitalized GPU and inference providers, has out of Machiavellian necessity risen to the fore: performance efficiency, performance density per watt per GPU.

Why Power Will Become a Key Bottleneck in 2026

Kai: Do you have a sense of the different bottlenecks coming? There’s power. We’re seeing memory prices rise, so it’s clearly a bottleneck. Beyond the power for data centers and the memory, are there other bottlenecks that are going to hit hard this year, price-spike style?

Val: You’re starting at exactly the right point. Power is the ultimate bottleneck, and we can’t get enough of it. In terms of price spikes, there have been so many spikes over the past few weeks and months. There are going to be increases; I hope those are more spikes than sustained climbs. One of the hot takes we have right now, and this is just anecdotal, is that there’s opportunistic panic-buying whenever there are extreme, intense shortages of supply for highly demanded things. We’re actually seeing some of that panic-buying inventory coming back and being sold back into the market, particularly for NAND flash devices. I wouldn’t take that as gospel or assume the problem is solved. I would say that instead of constant, up-and-to-the-right price increases or spikes, we’re going to see some jaggedness as supply and demand adjust to speculative buying.

But the fundamental, first-principles problem is that we just can’t bring new fabrication plants online fast enough. They’re multi-year projects, multi-billion-dollar capital investments. So no easy answer, no easy relief. Hence, the solution is in software, in being more efficient with the very scarce resources you have. And the good news is, if you talk to either model researchers or especially infrastructure specialists, we’re still leaving a lot of efficiency on the table. I always say it’s akin to running AI factories before the assembly line, before the Model T moment.

Kai: Well, hardware bottlenecks and hopefully software solutions. Thank you so much for taking the time to chat.

Val: Absolute pleasure.
