Inference Has a Memory Problem. What Comes Next?


GPU memory is AI’s scarcest resource, and the path forward isn’t more hardware. Here’s what you need to know:
- HBM supply is structurally constrained for years, driving up costs and forcing organizations to rethink their AI infrastructure strategies
- NeuralMesh augments GPU memory with pooled flash storage at latencies the GPU can’t distinguish from HBM — at a fraction of the cost
- As AI shifts from training to inference at scale, software-defined memory infrastructure is critical
The AI industry has a memory problem, and throwing more hardware at it is no longer the answer.
As demand for large-scale inference grows, organizations are hitting a ceiling that faster GPUs alone cannot break. The roadblock isn’t compute. It’s memory: how much of it exists, how fast it can be accessed, and what it costs. For investors, developers, and enterprise technology leaders, understanding this constraint is essential to making smart decisions in the years ahead.
Why GPU Memory Has Become AI’s Scarcest Resource
High-bandwidth memory (HBM), the stacked memory packaged directly alongside the GPU die, is finite, expensive, and increasingly hard to source. As WEKA Co-Founder and CEO Liran Zvibel explained in a recent conversation with Asit Sharma of The Motley Fool: “We are now at the point that NVIDIA can’t build more GPUs — not because it can’t make more Blackwells, or soon Veras, at TSMC — it’s because it cannot glue any more HBM on top of them.”
This isn’t a temporary shortage. It’s a structural constraint. Micron and other chip makers are investing tens of billions in new fabrication capacity, but those facilities will take years to come online. In the meantime, memory prices have surged. As Liran put it: “When you go and buy a standard server, it’s expensive. And the most expensive component — for a long time the CPU — is now by far the memory.”
If you’re scaling AI workloads today, this is the wall you will hit.
How Flash Storage Can Solve the GPU Memory Bottleneck
The conventional assumption is that solving a memory problem requires more memory. WEKA’s position challenges this assumption directly. When pooled flash storage is connected to GPU memory at low enough latency, the GPU cannot distinguish between actual HBM and data delivered from a high-speed, software-based storage layer.
Liran described it this way: “We can guarantee low enough latency into that memory that the GPU memory cannot actually tell if the other side is another memory or WEKA. And this is magnificent because doing it on flash is orders of magnitude cheaper than doing it on actual memory.”
The key insight for technical decision-makers is this: the viability of WEKA’s approach scales with context window size. For small reads, on the order of bytes, memory will always win. But real-world AI workloads operate at kilobytes, megabytes, and increasingly gigabytes. At this scale, as Liran explained, delivery over 400- or 800-gigabit Ethernet or InfiniBand is “completely indistinguishable” from pooled RAM.
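To make that scale argument concrete, here is a rough back-of-envelope sketch. The link speeds and the 10-microsecond fixed latency below are illustrative placeholders, not figures from the conversation or from WEKA benchmarks; the point is simply that once payloads reach megabytes, serialization time swamps any fixed latency floor, so where the data comes from matters far less than how much of it moves.

```python
# Back-of-envelope: delivery time = fixed latency + serialization time.
# All figures are illustrative placeholders, not measured numbers.

def delivery_time_us(payload_bytes: int, link_gbps: float, fixed_latency_us: float) -> float:
    """Time in microseconds to deliver a payload over a link of the given speed."""
    serialization_us = payload_bytes * 8 / (link_gbps * 1e3)  # Gb/s -> bits per microsecond
    return fixed_latency_us + serialization_us

# A tiny read, a megabyte-scale chunk, and a gigabyte-scale context transfer.
payloads = {"64 B": 64, "1 MB": 1_000_000, "1 GB": 1_000_000_000}

for label, size in payloads.items():
    t400 = delivery_time_us(size, link_gbps=400, fixed_latency_us=10)
    t800 = delivery_time_us(size, link_gbps=800, fixed_latency_us=10)
    print(f"{label:>5}: ~{t400:,.1f} us over 400 Gb/s, ~{t800:,.1f} us over 800 Gb/s")
```

For the 64-byte read, the fixed latency dominates, which is where local memory always wins; at a megabyte and beyond, the transfer itself dominates, which is the regime the quote above describes.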
Liran is direct about where NeuralMesh stands relative to alternatives: “We already deliver this kind of upside to customers, allowing them essentially to augment the HBM memory of the GPU to practically an infinite amount of NAND today with either NeuralMesh or NeuralMesh Axon on Augmented Memory Grid, and we’re showing 400 to 10,000x better than anything else on the market today.”
The practical implication? Organizations do not need to wait for memory supply chains to recover or for CXL-based pooling hardware to mature. Software-defined approaches, deployed on infrastructure that already exists in most data centers, can bridge the gap today and in the future.
What Large-Scale AI Inference Actually Requires
Training and inference are different problems, and they reward different architectures. This distinction matters enormously for how organizations should plan their AI infrastructure investments.
“Long-term, inference is going to be a bigger market than training,” Liran noted. “That’s not the case today. But with inference at scale, what wins is not necessarily the biggest and fastest. Inference at scale needs to be low-latency enough. You cannot frustrate the human at the end. It needs to be cost-effective.”
This is useful framing for any organization evaluating its AI stack. The question to ask is not, “What is the most powerful configuration?” but rather, “What is the most economical configuration that still delivers acceptable latency at scale?” These are very different procurement decisions, and conflating them is an expensive mistake.
Context memory management is central to this idea. As more developers build on top of existing codebases and use AI to reason over large bodies of existing work, the KV (key-value) cache and context storage become critical infrastructure instead of afterthoughts. “The context windows only grow,” Liran observed. “So far, we achieve improvement by requiring more memory, not less.”
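The pressure Liran describes can be quantified with the standard per-token KV-cache formula: two tensors (keys and values) per layer, each sized by heads, head dimension, and bytes per element. The model dimensions below are made up for illustration and do not describe any particular model or WEKA’s implementation.

```python
# Rough sketch of how KV-cache size grows with context length.
# Model dimensions are illustrative only, not any specific model.

def kv_cache_bytes(context_tokens: int, layers: int, kv_heads: int,
                   head_dim: int, dtype_bytes: int = 2) -> int:
    """Bytes of KV cache for one sequence: keys + values across all layers."""
    per_token = 2 * layers * kv_heads * head_dim * dtype_bytes
    return context_tokens * per_token

# Hypothetical 80-layer model, 8 KV heads of dimension 128, FP16 cache.
for ctx in (8_000, 128_000, 1_000_000):
    gib = kv_cache_bytes(ctx, layers=80, kv_heads=8, head_dim=128) / 2**30
    print(f"{ctx:>9,} tokens -> ~{gib:,.1f} GiB of KV cache per sequence")
```

Even with these modest assumptions, a million-token context lands in the hundreds of gigabytes per sequence, well beyond the HBM on a single accelerator, which is why a cheaper tier for this cache becomes attractive.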
Which Technology Companies Are Built to Last in the AI Era?
Beyond infrastructure, Liran and Asit’s conversation surfaced a pointed question about the broader software landscape: Which companies have real staying power as AI lowers the cost of building software?
“Companies need to deliver intrinsic deep technology and value, and they need to have a significant moat. Just figuring out distribution because you’ve been there long enough and a customer base that grows is a very high risk.”
For investors and enterprise buyers alike, this is a useful filter. Convenience is no longer a defensible moat. As Liran noted, AI agents can now be deployed to mirror a system of record, verify parity over time, and quietly replace it without ever triggering a formal procurement decision. The companies most likely to endure are those with deep technical differentiation, genuine switching costs, or regulatory requirements that mandate third-party certification.
Asit offered complementary framing for evaluating the market: “If I’m buying, I want the lowest total cost of ownership over a given amount of time, maybe through the life of the GPUs.” This lens — TCO over a hardware lifecycle rather than point-in-time capability — cuts through a lot of vendor noise.
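That lens reduces to simple arithmetic: capital cost plus the operating costs that accumulate over the hardware’s service life. The sketch below uses placeholder figures only (none come from the conversation or from any vendor price list) to show how a lower sticker price can still lose on lifetime cost.

```python
# Minimal total-cost-of-ownership comparison over a fleet's service life.
# Every figure is a placeholder for illustration, not real pricing.

def tco(capex: float, power_kw: float, dollars_per_kwh: float,
        other_opex_per_year: float, years: int) -> float:
    """Capex plus energy and other operating costs over the hardware lifetime."""
    energy = power_kw * 24 * 365 * dollars_per_kwh * years
    return capex + energy + other_opex_per_year * years

config_a = tco(capex=1_000_000, power_kw=40, dollars_per_kwh=0.10,
               other_opex_per_year=50_000, years=5)
config_b = tco(capex=800_000, power_kw=55, dollars_per_kwh=0.10,
               other_opex_per_year=80_000, years=5)
print(f"Config A: ${config_a:,.0f} over 5 years")
print(f"Config B: ${config_b:,.0f} over 5 years")
```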
The Long-Term Outlook for AI Infrastructure Investment
The opportunity ahead is significant, and the constraints are real but solvable. What matters now is where organizations place their bets.
The developers and enterprises that will come out ahead are those who recognize that the AI stack is not static, that training-era assumptions about compute and memory don’t hold in an inference-dominated world, and that software-defined approaches can deliver capabilities once thought to require specialized hardware.
As Liran put it: “What we’re seeing now is just the tip of the iceberg.”
The wall is real. And the path around it already exists. Get five practical strategies for keeping AI and compute running when memory is scarce in our “NAND Flash Shortage Survival Guide”.
Want to hear more insights from this conversation? Check out the full video on YouTube.