00:00
What is Causing the AI Inference Capacity Crisis?
Matt Marshall, CEO & Editor-in-Chief, VentureBeat: We’re going to be talking about architecting for efficiency. You were painting a startling picture in our prep call.
Leading model providers, OpenAI, Anthropic, and others, are literally turning away their top customers because they can’t meet this insatiable demand for inference. That’s where things are headed, and something happening right now at the infrastructure level is causing this crisis. I think that’s the question we wanted to start off with: How real is this token rate limit problem for the enterprises in this room? Is it just the AI labs, or is it everyone?
Val Bercovici, Principal Product Manager, Strategy & AI at WEKA: Yeah, it’s good to follow the money in this industry, right? There are a lot of public companies, and the private companies are massive and very well reported on. We’re seeing a difference between some gross numbers and some net numbers.
If you take a look at some gross numbers, you’ll see people saying, “Well, the cost of inference has plunged almost a thousandfold over the last two years. So naturally inference must be cheap, and AI must be really cheap, right?”
But that’s not the net picture. The net picture is when you factor in the overwhelming consumption and volume of tokens. As the price of tokens is declining in one dimension, the demand for them is spiking and dwarfing the decline in price. The net effect—the net reality—is that the cost of AI is going up.
So much so, in fact, that there’s a lot of reporting lately around the fact that the economics of AI apps, and the entire stack that provides them, are fundamentally upside down. We’re back in the classic Uber game of investors subsidizing the real cost of the product. When you look at the net unit economics, they’re negative right now. The joke amongst practitioners and analysts is that everyone just wants to climb up to zero in unit economics.
Matt: So in other words, OpenAI and Anthropic are delivering their services at a net loss for now, when you do all the accounting?
Val: Right.
02:13
How Did Reasoning Models Change the Economics of AI?
Matt: Okay, so let’s go back to 2023–24. We had this wonderful revolution, and applications were great, right? We were all experimenting with GenAI. You didn’t have as much token usage or scaling. Can you walk us through what has changed in 2025?
Val: The big thing came at the end of last year. It’s amazing to think it was only nine months ago. OpenAI was one of the first to publicly introduce the concept of reasoning models.
And Jensen Huang himself said on NVIDIA’s last earnings call a couple of weeks ago that the volume of reasoning tokens generated now is about 100 times (two orders of magnitude) more than before reasoning, when you just had general pretrained models.
So one dimension is: we now have 100 times more tokens just at the base model layer. That was the turn of the year, from last year to this year.
Now, particularly this summer, I’m seeing the hype around agents completely fade and the reality of their enormous business value materialize. Agents themselves represent another compounding 100x in token volume.
So even if the price of inference has decreased by maybe a factor of 1,000 (optimistically), the actual demand for tokens is at least 10,000x now. That explains the situation we find ourselves in today.
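To make those multipliers concrete, here is a quick back-of-the-envelope sketch using the round numbers from this conversation; the baseline figures are arbitrary placeholders, not real pricing data.

```python
# Back-of-the-envelope token economics, using the round numbers from
# the conversation. Baseline values are placeholders, not real data.

baseline_cost_per_token = 1.0   # arbitrary unit cost two years ago
baseline_tokens = 1.0           # arbitrary baseline token volume

price_decline = 1_000           # ~1,000x cheaper per token (optimistic)
reasoning_multiplier = 100      # ~100x more tokens from reasoning models
agent_multiplier = 100          # ~100x more tokens from agent swarms

cost_per_token = baseline_cost_per_token / price_decline
token_demand = baseline_tokens * reasoning_multiplier * agent_multiplier

net_spend_ratio = (cost_per_token * token_demand) / (
    baseline_cost_per_token * baseline_tokens
)
print(f"Demand multiplier: {reasoning_multiplier * agent_multiplier:,}x")
print(f"Net spend vs. two years ago: {net_spend_ratio:.0f}x")
# Demand is up ~10,000x while price fell ~1,000x, so total spend still
# rises roughly 10x: cheaper tokens, higher overall cost.
```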
03:40
Why Do Agent Swarms Drive Exponential Token Demand?
Matt: Let’s unpack that. You were talking about this new term, the agent swarm, that’s basically making Andrej Karpathy’s “vibe coding” obsolete, right? And that’s what’s causing this token demand spike. Can you talk a little bit about what’s going on there?
Val: It’s a little bit provocative. On the one hand, he’s a god. I’m Canadian, he’s Canadian, so we all love each other. He coined the term “vibe coding,” which is not new at this point.
But the state of the art—the way professional developers take full advantage of generative coding tools—is no longer vibe coding. It’s swarm coding.
And it’s not so much about what a cool model or AI agent can do for you; it’s about specification- and test-driven development. A swarm of agents works together in parallel. In my mind, there’s no such thing as “an agent.” If agents are successful at all, it’s always as a swarm: many tasks and subtasks running in parallel. That is the value of agents.
That swarm of agents is a 100x amplification factor in terms of token demand.
The net benefits are astonishing. If you’re a professional developer using generative swarms today, you really stop writing code. You spend your time thinking like a product manager: writing detailed specs, detailed tests, and requirements for passing or failing tests, much like defining reward functions in reinforcement learning.
The swarm then takes those specifications and, through tools like Claude Code, Cursor agents, and others, transforms them into architectures, pseudocode, and code versions, and finally into tested, documented applications with security scanning, vulnerability scanning, and performance tuning built in.
These are very complete systems now—not just demos.
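To illustrate the pattern Val describes, here is a minimal sketch of spec-driven swarm coding. The roles and the run_agent helper are hypothetical stand-ins for real agent calls; no actual Claude Code or Cursor API is assumed.

```python
import asyncio

# Hypothetical subtasks derived from a written spec. In a real swarm,
# each would go to a coding agent (Claude Code, Cursor agents, etc.).
SPEC = {
    "architecture": "design the module layout from the spec",
    "implementation": "generate code that satisfies the spec",
    "tests": "write tests encoding the pass/fail requirements",
    "security": "scan the generated code for vulnerabilities",
}

async def run_agent(role: str, task: str) -> str:
    # Placeholder for an actual model/agent invocation.
    await asyncio.sleep(0)  # simulates async I/O to a model endpoint
    return f"[{role}] completed: {task}"

async def run_swarm(spec: dict[str, str]) -> list[str]:
    # Fan out: every subtask runs concurrently. This parallelism is
    # what multiplies token consumption relative to a single agent.
    return await asyncio.gather(
        *(run_agent(role, task) for role, task in spec.items())
    )

if __name__ == "__main__":
    for result in asyncio.run(run_swarm(SPEC)):
        print(result)
```

The point of the sketch is the fan-out: the human writes the spec once, and every subtask then consumes tokens in parallel, which is where the 100x amplification comes from.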
Matt: So this is literally happening over the last couple of months?
Val: Yes.
06:07
When Did AI Agents Cross the Line Into AGI Potential?
Matt: I teased you in our prep call that you’re kind of “AGI-pilled”—all in, drinking the Kool-Aid. Why? You’ve been working on these fundamental technologies for decades. Why do you now feel we’re on the doorstep of AGI? Can you ground that in specifics, with step-change increases in intelligence?
Val: To set the record straight, up until May of this year, I cringed when I heard the term “feel the AGI.” It was pure marketing hype, Sam Altman recruiting talent, and not credible.
Since May, things have turned a corner. We saw the adoption of Cursor dramatically improve developer productivity, by 30% or more. Claude Code has been a revolution for professional developers. Even one of WEKA’s most skeptical engineers, the “Prince of Darkness,” started using Claude Code in May and June, and projects that used to take weeks or months were being completed in hours.
That’s when I realized: agents generating quality code had turned a corner. For me, agents are now the new scaling laws. Combine compute, data, reasoning models, test-time compute, and agent swarms, and you start to see how we’re getting to AGI.
08:05
What Architectural Decisions Should Leaders Make for AI Inference?
Matt: So what does this mean for the people in this room? Given this capacity crisis, what is the single most important architectural decision leaders need to make today?
Val: First, revisit your processes. The way you code today is not the way swarms code. Humans like modularized functions split across different files, easy to read. Models and swarms prefer everything in one giant file; context is everything.
Then, revisit your infrastructure. If you can’t afford tokens, you can’t do anything. The nuances of inference are largely undocumented, and expertise is scarce. Work with experts, optimize inference, and move from negative unit economics to positive.
Matt: And when those negative economics flip—like with Uber—what happens? Is this good for the rest of us?
Val: I think they’ll flip suddenly. Public innovations in infrastructure will drive the change. At WEKA, our mission is to decouple memory from GPUs, radically transforming inference economics. Today’s AI factories have no assembly lines. First movers who introduce efficiency and assembly-line structure into inference will gain insurmountable market advantages.
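As one way to picture what decoupling memory from GPUs can mean, here is a conceptual sketch of a two-tier KV cache that spills cold entries from fast memory to an NVMe-style tier instead of throwing them away; this illustrates the general tiering idea only, not WEKA’s actual implementation.

```python
from collections import OrderedDict

class TieredKVCache:
    """Toy two-tier cache: a small 'hot' tier standing in for GPU/HBM
    memory, and an unbounded 'cold' tier standing in for NVMe flash."""

    def __init__(self, hot_capacity: int):
        self.hot: OrderedDict[str, bytes] = OrderedDict()
        self.cold: dict[str, bytes] = {}
        self.hot_capacity = hot_capacity

    def put(self, key: str, value: bytes) -> None:
        self.hot[key] = value
        self.hot.move_to_end(key)
        # Evict least-recently-used entries to the cold tier instead of
        # dropping them; reusing a spilled KV cache avoids recomputing
        # an expensive prefill, which is what changes the economics.
        while len(self.hot) > self.hot_capacity:
            old_key, old_value = self.hot.popitem(last=False)
            self.cold[old_key] = old_value

    def get(self, key: str) -> bytes | None:
        if key in self.hot:
            self.hot.move_to_end(key)
            return self.hot[key]
        if key in self.cold:
            # Promote back to the hot tier on reuse.
            self.put(key, self.cold.pop(key))
            return self.hot[key]
        return None
```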
Matt: Great. Let’s open it up to Q&A.
11:41
What Role Do Small Language Models Play in Token Economics?
Audience Member: As models evolve, small language models today can do what large models did a few years ago. How does that play into this? How does infrastructure need to change?
Val: Great question. It won’t be a monolithic world of large or small models; it’s a spectrum. Planning is done by large models, execution by small ones. But don’t assume fewer tokens: smaller models at scale can generate even more of them. It all comes back to the token economy and rethinking infrastructure principles, like using NVMe flash as a memory extension to radically change inference economics.
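A minimal sketch of the planner/executor split described here, with hypothetical model names and a placeholder call_model helper rather than any specific vendor API:

```python
PLANNER_MODEL = "large-reasoning-model"  # hypothetical name
EXECUTOR_MODEL = "small-fast-model"      # hypothetical name

def call_model(model: str, prompt: str) -> str:
    # Stand-in for a real inference call to a hosted or local model.
    return f"<{model} output for: {prompt!r}>"

def handle_request(request: str) -> list[str]:
    # The large model produces the plan: few calls, many reasoning tokens.
    plan = call_model(PLANNER_MODEL, f"Plan the steps for: {request}")
    # Placeholder parsing: treat each line of the plan as one subtask.
    steps = plan.splitlines() or [plan]
    # Each step fans out to a cheap small model. Note that many small
    # calls at scale can still consume more tokens overall than one
    # large call, which is the point made above.
    return [call_model(EXECUTOR_MODEL, step) for step in steps]

if __name__ == "__main__":
    for output in handle_request("triage this week's support tickets"):
        print(output)
```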
14:32
How Quickly Did Swarm Coding Disrupt Infrastructure?
Audience Member: When did swarm coding manifest—was it sudden or gradual?
Val: It was sudden. A year ago, the idea of a single-person billion-dollar startup was laughable. Now, with agent swarms, people use Claude Code to process email, plan trips, and handle HR, accounting, marketing, and development. It’s scaling beyond developers. Continuing to handcraft code will soon be unproductive, and even competitively dangerous.
16:18
How Can AI Optimize Infrastructure and Inference Pipelines?
Audience Member: How can tools like Claude Code and Cursor maintain infrastructure context, given all the YAML files and overall complexity?
Val: Great question. I’m surprised AI Ops for GPU clusters hasn’t taken off yet, but it will, likely in the Linux Foundation’s vLLM community. Configuring Kubernetes and handling GPU error rates are ripe areas for startups and open source projects. Inference configurations have billions of permutations, which is overwhelming for humans; this is ripe for AI-driven optimization.
18:27
What Are the Security Implications of Agent Swarms?
Audience Member: What about security implications?
Val: Security is a huge, under-discussed use case for AI and agents. Think red teams and blue teams—now moving to red swarms and blue swarms. AI is both a tool and a weapon. Swarms will be essential for cyber defense, managing log volume, and SOC operations. It’s a major horizontal use case with huge startup opportunity.
Matt: Unfortunately, we’re out of time. Val, thank you very much.
Val: Thank you. I’ll stick around and meet people.
Matt: Great conversation. Thanks.