Bypassing the Inference Memory Wall to Scale Agentic AI
Dan Nishball, Director of Research at SemiAnalysis, and Val Bercovici, WEKA Chief AI Officer, break down the token demand explosion, the memory wall slowing AI inference, and why cybersecurity agent swarms may be enterprise AI’s next frontier.
The token ROI flywheel
Val Bercovici: Hey there. I’m Val Bercovici, Chief AI Officer at WEKA.
Dan Nishball: Hey, I’m Dan Nishball, Director of Research at SemiAnalysis.
Val: So Dan, we just got off the stage talking about tokenmaxxing for cybersecurity. What are some of your initial thoughts about the talk?
Dan: Well, I think it’s all about how we provide for that insatiable demand in tokens.
There’s an incredible amount of value accruing to everyone across the supply chain, across the user base. Users are experiencing phenomenal ROI. They’re getting things done that would take 15 hours in 30 minutes. Human cost for that could be in the order of thousands of dollars, and they get it done for $20 or $30.
And of course, that’s not to say we’re going to need fewer humans. It just means we’re going to do a lot more, we’re going to cover a lot more ground, and we’re going to do much more intelligent work. But a prerequisite for that is actually getting the inference systems to produce a lot more tokens a lot more efficiently and maximizing the compute resources that we do have.
What the memory wall actually looks like in production
Dan: So Val, what kind of areas have you been researching in how you get tokens to be produced more efficiently, how you get inference systems to work more flexibly and more productively?
Val: So it’s been a year now, and we were just reflecting on stage, there actually was no token demand for agents about a year ago. But it really picked up right around May of 2025, so about 12 or 13 months ago with Claude Code.
And then it grew really steadily. We’ve observed memory bottlenecks — what the industry calls a memory wall — where, essentially, once you load in model weights and you start having multiple user sessions, especially now with multiple agent sessions and subtasks, you’re just very quickly out of memory.
You’re out of that HBM (high-bandwidth memory) on the GPU. You’re out of the DRAM on the motherboard. You’re thrashing, you’re giving bad response times to the users and subtasks, and sometimes your net token production even goes down, or it’s very inconsistent.
So balancing out these systems with more memory — abundant memory basically, which is a bit of an oxymoron, but more memory per GPU, almost disaggregating the GPU from the memory — is a really big win.
We’re seeing much more balanced performance, better consistent latency, much lower latency per subtask and per request. But also really good aggregate tokens per second that just keep scaling with more agent swarm concurrency, without adding more GPUs and without burning more energy.
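The memory wall Val describes can be made concrete with a back-of-the-envelope estimate. The sketch below is not from the conversation: the model shape, HBM capacity, weight size, and context length are all hypothetical values chosen for illustration, and the formula assumes a standard transformer that stores one key and one value vector per layer per KV head for every token. It shows how quickly concurrent sessions exhaust the memory left over after the weights are loaded.

```python
from dataclasses import dataclass

GIB = 1024 ** 3


@dataclass(frozen=True)
class ModelShape:
    """Hypothetical transformer shape; does not describe any specific model."""
    num_layers: int
    num_kv_heads: int
    head_dim: int
    bytes_per_value: int  # e.g. 2 for FP16/BF16


def kv_bytes_per_token(shape: ModelShape) -> int:
    # Factor of 2: one key vector and one value vector per layer per KV head.
    return (2 * shape.num_layers * shape.num_kv_heads
            * shape.head_dim * shape.bytes_per_value)


def max_resident_sessions(shape: ModelShape, hbm_bytes: int,
                          weight_bytes: int, context_tokens: int) -> int:
    """Sessions whose full KV cache fits in the HBM left after the weights."""
    free_bytes = hbm_bytes - weight_bytes
    if free_bytes <= 0:
        return 0
    return free_bytes // (kv_bytes_per_token(shape) * context_tokens)


# Assumed numbers, for illustration only.
shape = ModelShape(num_layers=80, num_kv_heads=8, head_dim=128, bytes_per_value=2)
hbm = 8 * 80 * GIB        # eight accelerators with 80 GiB of HBM each
weights = 140 * GIB       # weights already resident in HBM
context = 32_768          # tokens of context held per session

print(kv_bytes_per_token(shape))                               # 327680 bytes (320 KiB)
print(max_resident_sessions(shape, hbm, weights, context))     # 50
```

Under these assumptions each session's context occupies 10 GiB, so only 50 sessions fit before the system has to evict or recompute cache entries. An agent swarm that spawns subtasks can exceed that quickly, which is the point at which a larger memory tier outside the GPU starts to pay off.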
The walls AI keeps hitting: training, communication, memory
Dan: Yeah. And we’ve had many walls throughout this whole AI journey. We’ve had the training wall, right? The whole Chinchilla limit. We had the communication wall, which we busted wide open with 72 GPUs in one rack, and even that’s going to expand to larger world sizes. We’ve got the dual 3D torus, which is a massive world size of 9,000.
You just talked about the memory wall. So I think what’s interesting is this is a story about how we overcome each subsequent wall we hit. And memory, HBM will be one of the biggest constraints, as will silicon. So it’s going to be really important.
How do we tackle these walls? How do we think about surpassing Moore’s Law and delivering more inference with scarcer and scarcer compute resources in terms of silicon wafers?
Why there isn’t enough memory on Earth
Val: Yeah. And it’s an honor to have you here because you’re a worldwide expert on this memory supply chain, for example. And my layman’s observation is there literally isn’t enough manufacturing capacity in the world to generate enough memory, whether it’s HBM or DRAM or NVMe, to satisfy the rising token demand.
So we have to engineer systems better. We have to keep scaling these walls and address the challenge through just better engineering because the raw materials are just not there.
Dan: Like we mentioned earlier, the ROI is so good to end users that there is this lack of price elasticity, which has driven up prices of memory multiple-fold over the last year, which means we just have to be smarter.
Peak oil and the case for efficient AI engineering
Dan: Obviously we can think about the peak oil framework of the 1970s, when we just had to build more efficient cars. So this is really much the same. And we don’t see the memory shortage getting any less dire. We’re seeing prices go up for DRAM, consumer memory, for HBM, and we see AI accelerators consuming even more of the memory resources.
Dan: So we’ll just have to be smarter about how we consume all kinds of memory.
Mythos and the 17-year-old bug
Val: Last question, just to put you a bit on the spot here, Mythos, right? You’re prolific users, so you have empathy firsthand for coding agents and for research agents.
Blue swarms vs red swarms as the next enterprise workload
Val: My instinct is that these cybersecurity agents, these blue agent swarms to fight red agent swarms, might become the next dominant enterprise workload. What are your thoughts about that?
Dan: Well, I think you made an interesting anecdote the other day that Mythos was able to find a bug that was hidden for 17 years. Was that right, Val?
Val: Yeah, and it may be longer, but it’s like so long that the world’s leading cybersecurity experts have had decades to find this, and they just were never going to.
Dan: Yeah. So this is an example of where this is not about reducing the amount of work we do; it’s about expanding the amount of work we do, looking in so many places. And when you have these red swarms looking at every conceivable place, that is, in a way, what we’re trying to do with all our agentic research: we’re trying to look in places that we can’t.
We’re trying to ingest more and more content that we humanly couldn’t do before. And if you think about red swarms, blue swarms, adversaries that can do that, the only way to really keep up is to be aware of it and to potentially be able to outpace them and leverage your resources to be able to match that capability.
Val: And I think that’s a good note to wrap it up on. This is Val and Dan, and we’ll catch you next time.
Featured Speakers
- Val Bercovici, WEKA