Why AI Storage Will Define the Future of Inference at Scale
How AI Storage Became the Bottleneck Defining the Future of Inference
Vikram Sekar: Welcome to the Semi Dope podcast. I’m Vikram from Vik’s Newsletter on Substack, and with me is a special guest today, Val Bercovici, Chief AI Officer at WEKA. WEKA is a California-based technology company that provides an AI-native data platform designed to solve the massive data bottlenecks found in modern high-performance computing. Founded in 2013, the company is currently valued at over $1.6 billion by focusing on the infrastructure needs of the AI era. In today’s episode, we’ll go over WEKA’s technologies, especially in the context of the AI era, and what it means for the future of inference at scale. Thanks, Val, for being on the podcast with us. How are you doing?
Val Bercovici: Really good. Excited to be here. Very enthusiastic and a new fan of the newsletter.
Vik: Great. Thank you so much for giving feedback on my article on context memory storage. It is a big area, and I learned a lot from all the articles on the WEKA website, so it was fantastic. The whole thing I learned about context storage is really fascinating, and I think we’ll get into some of that today with our listeners too and break down a lot of what’s going on. How does that sound?
Val: Absolutely. My favorite topics. Let’s go.
Vik: Awesome. Let’s start with a brief background. So you were a former CTO at NetApp. My internet travels showed me you were called the “cloud czar” for leading the cloud storage strategy, and now you are Chief AI Officer at WEKA. So what’s your transition here? How did you get to working on storage solutions, or is that something you’ve always been doing?
Val: Yes, and yes, and yes. You know, it feels like déjà vu, because I remember I left NetApp, actually joined a small cloud-native storage startup called SolidFire out of Boulder, Colorado, and then got reacquired back into NetApp, surprisingly enough. And by then we were doing cloud storage, and that was after VMware, kind of a golden era of enterprise storage and these NAS and SAN acronyms of the past. And I figured, you know, Kubernetes was really cool. I actually got involved with the Borg team at Google that created Kubernetes, Craig McLuckie and Sarah Novotny, Joe Beda and others, and we created the Cloud Native Computing Foundation under the Linux Foundation together. You know, I got on the board, the first governing board of the CNCF. So I thought around 2017 there was nothing left to do in storage, and I left it.
But it was a very disruptive time back then, you know, 2012 to 2015. It’s hard to remember that now, but cloud was hyper-controversial, right? Kind of as controversial as AI eating software today. Back then it was cloud software eating the world. So it’s déjà vu in that it’s just big transitions all over again. And at WEKA I joined again on the promise of actually not joining another storage company. And what I mean by that is that we were using the term “data platform” last year. And fundamentally that confuses some people because it’s an overloaded term. There’s so many things up and down the stack, which we’ll get into, that could be called platforms. But fundamentally what WEKA provides is high-speed storage and memory for AI infrastructure, which ultimately is the key bottleneck that we’ll be diving into right now.
Vik: That’s awesome. It’s amazing that you’re saying that in 2017 nobody really thought about storage solutions too much. It was considered boring and probably commodity, but fast forward not even a decade, things have turned around entirely. We have entire generative AI and people running agents all over the place, and we want to store infinite context, and the need for storage has completely changed within a short span of a decade. Who would have thought, right?
Val: Exactly. Exactly. And yeah, it changed even three weeks ago, which we’ll get into, you know, with Jensen’s latest announcements at the CES keynote.
What Is Context Memory in AI and Why Is Demand Exploding?
Vik: Yeah. Let’s get into that straight away then, because for those who missed that announcement, at CES ’26 Jensen announced that they are introducing an inference context memory storage platform as a first-class citizen into the whole Rubin platform, and this is something that’s here to stay, because we need a lot of context going forward, especially in the agentic AI era. But maybe we should just start with what context memory is and why we need it so badly today.
Val: Yeah. What we’ve learned, obviously, in the just breathless evolution since that ChatGPT moment, three years and about two or three months ago now, November of 2022, right? What we’ve learned is that prompt engineering came and went as an important focus. You know, instructing the models was important at one time, when the models really didn’t understand how to process large attachments, if you will, large memory. Then we reached the era of RAG, retrieval-augmented generation, where we figured out ways to attach just the right snippet of large attachments into the limited memory and be able to instruct the models to give us very clever answers. So chatting with your PDF, you know, was one of the most interesting things you could do, particularly when it was a large PDF and you only had to chat with a subset of it automatically.
We’re very much in the era of agents right now. So, of course, the viral agent of this week is Claude.computer and Model.bot and the amazing things people are doing with it. What we are seeing right now is that agents, I like to say, are not really a singular thing. You either have an agent swarm on your hands with dozens and hundreds of sub-agent tasks and sessions, or you’re not really doing an agent at all. And at that point now you basically have hundreds of parallel instructions to a model with hundreds of parallel attachments, so to speak, whether that’s a codebase, whether that’s documentation, whether it’s a video stream, telemetry, logs, etc. And there’s still, even though the models have gotten smarter about understanding larger and larger context windows, there’s never enough, right?
So context memory now, a way to extend the memory windows, extend the interactivity of these models, is the hottest field, and there’s lots of different ways to create memory for LLMs and for even diffusion models. And I always like to say if you think you know how to do this from your experience even two to three years ago, it’s almost easier to just unlearn what you know about memory and storage and relearn from scratch, because it really is very different this time.
Vik: That’s amazing. I want to get into what exactly has changed. But before that, I just want to quickly go through what you mentioned about prompt engineering, which was all the rage in the early days of GPT. And I think it evolved most recently into a slightly different form, context engineering, which I think Manus AI has some really good documents on their website about how they go about this. And the key idea of context engineering is that you want to keep as much of the previous conversation as possible in the cache the model is already holding and just add on sequentially to what you can. Don’t delete anything from your instruction sequence; instead add on to it, because what that allows you to do is reload a lot of the KV cache that was already stored, and then, you know, it’s incremental, so you don’t have to recompute.
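That append-only discipline can be sketched with a toy prefix cache. To be clear, this is purely illustrative, a hash-keyed dict standing in for the real per-layer key/value tensors an inference server would hold; the class and names here are invented for the sketch:

```python
# Toy prefix KV cache: reuse everything already computed for a prompt
# prefix, and only "prefill" the newly appended suffix.
import hashlib

class PrefixKVCache:
    def __init__(self):
        self.store = {}  # prefix hash -> simulated KV entry

    @staticmethod
    def _key(tokens):
        return hashlib.sha256(" ".join(tokens).encode()).hexdigest()

    def lookup(self, tokens):
        """Return the length of the longest cached prefix of `tokens`."""
        for n in range(len(tokens), 0, -1):
            if self._key(tokens[:n]) in self.store:
                return n
        return 0

    def extend(self, tokens):
        """Compute KV state only for the uncached suffix; cache each new prefix."""
        hit = self.lookup(tokens)
        for n in range(hit + 1, len(tokens) + 1):
            self.store[self._key(tokens[:n])] = f"kv[{n}]"
        return len(tokens) - hit  # number of tokens actually recomputed

cache = PrefixKVCache()
turn1 = ["sys", "hello", "world"]
recompute1 = cache.extend(turn1)        # cold start: all 3 tokens prefilled
turn2 = turn1 + ["how", "are", "you"]   # append-only follow-up turn
recompute2 = cache.extend(turn2)        # warm: only the 3 new tokens
```

Because turn two only appends to turn one, the warm call redoes work for just the three new tokens; deleting or editing anything earlier in the sequence would invalidate every cached prefix past the edit point, which is exactly why the append-only rule matters.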
So all of the storage stuff in the era of context engineering and now agents and sub-agents comes down to, I suppose, storing these key and value matrices, the key-value cache, which really explodes with the larger and larger context that you put into the system. So with this in mind, how exactly has stuff changed from a few years ago to now? What exactly has changed?
How GPU Memory Hierarchies Work and Why They Create AI Bottlenecks
Val: The biggest change really is the fact that there’s never enough memory. If you just do the math, and the math is not complicated. It’s not calculus, but it is a multivariable formula. When you do the math of, for example, loading the weights of the models, which are very big, we’re in the era of trillion parameter models right now. The hyper-popular Opus 4.5 is a trillion parameter model. The latest DeepSeeks are trillion parameters. Qwen 2.5 just came out yesterday. The thinking version. That’s a high-end trillion parameter model. Merely loading the weights of these models into memory is more than a big GPU server can handle. It’s literally more than an H100 can handle. And that’s still a very, very popular GPU server. You need an H200 very often. You need multiple Hoppers or even more than one Blackwell server, which is eight GPUs per server, just to handle the weights of these models.
And then you and I enter the picture and we open up a prompt session, right? And we start chatting with the model. We start feeding attachments. We start feeding context into the model. And very quickly the math gets quite brutal, because in round numbers, about 100,000 tokens is roughly a megabyte of actual data, and that could be the system prompts, which can be at the model level, at the agent level, at our own preference level. Then you’ve got the prompt itself. Then you’ve got the data in question, the real context that we want to focus on, those attachments, so to speak. 100,000 tokens, one megabyte of data, translates to 50 gigabytes of KV cache working memory. And that’s because you’re vectorizing. You’re adding 10,000 to 20,000 dimensions of intelligence to these 100,000 tokens, this one megabyte of data.
So if one megabyte equals 50 gigabytes, and that’s just at the beginning of one session for one user, and we have multiple users kicking off agent swarms, so hundreds of subtasks per user, you see that quickly you’re very much out of the working memory, the high-bandwidth sort of primary memory tier that’s on the GPU package itself.
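The 1 MB to 50 GB ratio can be sanity-checked with standard KV-cache sizing arithmetic. The model shape below, 64 layers with 2,048-wide key/value projections stored in fp16, is an assumed example chosen to land near Val’s figure, not the dimensions of any specific model:

```python
# KV-cache sizing, back of the envelope.
# kv_dim = num_kv_heads * head_dim (the per-layer key/value width).

def kv_cache_bytes(tokens, layers, kv_dim, bytes_per_elem=2):
    # 2x for keys and values; one kv_dim-wide entry per layer per token;
    # bytes_per_elem=2 assumes fp16/bf16 storage
    return 2 * layers * kv_dim * bytes_per_elem * tokens

gb = kv_cache_bytes(tokens=100_000, layers=64, kv_dim=2048) / 1e9
print(f"~{gb:.0f} GB of KV cache for a ~1 MB, 100k-token context")  # ~52 GB
```

Different attention designs (grouped-query, multi-head latent) move the per-token constant around, but the linear growth with context length is what makes the math brutal.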
And last year saw a really nice evolution of the engineering, realizing we have these old concepts, right? Virtual memory concepts of being able to use other storage tiers, or other tiers of memory if you will, and basically being able to flush old memory pages out and bring fresh memory pages in. So we’ve applied these algorithms and concepts to context memory now, which means, and NVIDIA has a very nice hierarchy now, I think you included it in your newsletter, where essentially the first level is the high-bandwidth memory on the GPU package itself. If you’re a gamer you know this, right? It’s memory on the GPU card itself, so to speak. It’s independent of the motherboard’s memory. On a server with eight GPUs, each one has its own dedicated memory; some people call it VRAM, video RAM, if you come from a gaming background. Data center people call it HBM, high-bandwidth memory.
So there’s dedicated memory per GPU, and that’s where most of the work really happens, but there’s never enough. By the time you load the model weights and a few of our own prompts, you’re out of that memory. And now the software layers, so we have very popular inference servers, vLLM, SGLang, Triton, you know, TRT-LLM from NVIDIA, that understand now with these KV cache managers how to tier the memory, how to basically tier down to other memory.
And now we go to the motherboard, the shared memory across all the GPUs on the motherboard. So there’s about one terabyte of shared memory there, and across the eight GPUs the dedicated HBM adds up to almost one to two terabytes in aggregate. So let’s say you’ve got about four terabytes of actual working memory. Again, with today’s latest models, with today’s very context-rich user sessions and agents, not enough.
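Those round numbers tally quickly. The per-GPU HBM capacity below is an assumption, since it varies by GPU generation; the point is just that eight GPUs plus motherboard DRAM cap out at a few terabytes:

```python
# Working-memory budget for one 8-GPU server, in round numbers.
hbm_per_gpu_gb = 192   # assumed per-GPU HBM; varies by generation
gpus = 8
dram_gb = 1024         # ~1 TB of shared motherboard DRAM
total_tb = (hbm_per_gpu_gb * gpus + dram_gb) / 1024
print(f"~{total_tb:.1f} TB of working memory per server")
```

With these numbers it’s about 2.5 TB; parts with more HBM per GPU push toward the roughly four terabytes Val cites, and either way it’s quickly consumed at 50 GB of KV cache per 100,000-token session.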
So context memory now, and context memory engineering, is all about where do we go next? Unfortunately, there are these 10-year-old standards called the NVMe, non-volatile memory express was the original term or definition of NVMe. I like to call it non-volatile memory extension because I like to joke there’s no storage in NVMe. It is designed to treat NAND flash as a memory extension using more native memory semantics. Flash memory is memory. It’s just slower memory than DRAM.
And now there’s ways to engineer storage solutions based on this NVMe protocol into this tiering hierarchy. And the art of it, you know, it is obviously still code, so it is science, but it seems a bit like alchemy at the moment: how do you make this lower performance, higher capacity, lower cost tier of memory, NVMe, behave just like regular memory to the model so that you and I, especially in a voice chat, don’t notice awkward latency with a voice agent and things like that?
Vik: Awesome. That’s a great rundown of the whole technology situation we are in right now. So just to summarize, there is a memory hierarchy, and ideally we would want to keep any and all information, model weights, KV cache, everything, in HBM, and the reason is that it’s the G1 tier, as NVIDIA calls it, the highest tier of memory, and it is the fastest, lowest latency. Great if we could have infinite amounts of it, but because of the way it’s made, the density it allows, and the supply you can actually get hold of, it’s virtually impossible to do that, right?
Then there is the second tier, the backup option, which is DRAM on the CPU host, and that on the Rubin platform seems to be about 1 TB or more, which is a significant step up. But like you say, 100,000 tokens, 1 MB of input context, explodes to 50 GB of storage requirements. DRAM isn’t going to cut it either. You’re going to have hundreds of users, thousands of users, each kicking off tens or hundreds of sub-agents. You know, the problem is just staggering.
So the next question is that obviously you go to an even higher capacity tier, the backup of the backup plan, and you go to NVMe storage. Now you hope, or science your way out of this, by making it as fast as possible, so we can’t tell that it’s actually being pulled from NVMe SSDs and it behaves just like HBM. That’s the ideal goal, right?
Val: That’s the ideal goal. And there is, again, some complicated formulas, but we can maybe walk through those to understand how that can actually work if you piece these puzzles together correctly.
Why Legacy Storage Protocols Like NFS and Lustre Fail Modern AI Workloads
Vik: Yeah. Yeah. Absolutely. Now I want to get to that formula, and I love formulas. I’m definitely going to get to it. But I just want to understand one thing. So NVMe storage at scale is not a new technology. It’s just a bunch of flash. It’s been around for a long time. So what is different today between the storage solutions that are fast enough with low latency and, you know, the older solutions like NFS and Lustre, which have been around for decades?
Val: That’s exactly it. They’ve been around for decades. And when those solutions started, NFS and Lustre in particular, they’re very, very popular in the AI world, were literally built around the era of hard disk drives. We have to remind ourselves, this is now history. It’s funny because I started my career with these things, but it’s history now. These are antiques where essentially you have rotating media, these rotating platters, many, many platters, and you have this head on an actuator which goes across all these platters, and you have to wait for a platter to spin before you can read data off of it. Then the actuator has to move the head to the right sector, the right track, which part of the record the data is on, like finding the right groove on a vinyl record, and then finally retrieve it. It really is like a record player, and very much an antique.
And NFS and Lustre were excellent protocols for that because they made assumptions around the latencies of accessing metadata. You know, basically finding out exactly where the index card is to find the ultimate data, like looking at the index of a book and so forth. That was the way these protocols worked.
Fast forward to flash drives. NAND flash drives first started out as these SCSI, small computer systems interface, devices underneath these network protocols. Then you had Fibre Channel. You had serial attached SCSI, SAS, all these other protocols that came out of the spinning rust, spinning disc era. Only when the NVMe standard came around were we finally able to treat NAND flash as a true memory extension.
And even then you really didn’t have the latency sensitivity that you needed to make flash behave more like memory. NVMe devices, and this is going to be very important for your audience and your readers going forward, are also going to go through an upcoming evolution called high-bandwidth flash. These flash devices speaking the NVMe protocol are nothing more than layers and layers of NAND flash with ARM-style microcontrollers in front of them. And each of these microcontrollers has a work queue, and you have multiple queues per device. You have multiple devices per NVMe fabric.
So the art to this is a modern protocol. We call it NeuralMesh. And it’s a modern protocol that actually understands that there are thousands of queues across tens of thousands of devices in a fabric. And you have to understand natively the depth of each queue, so that at a global level, for every read and write, you understand what the latency is going to be across the entire system by understanding all the individual components. That’s what we call NeuralMesh. That’s the magic of WEKA, and it’s because WEKA has the advantage of timing. We were designed in an era with no spinning disc. We tier to it, but we don’t use it directly. You know, there isn’t really a need for NFS or Lustre. It was just a brand new era with NVMe.
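The queue-depth idea can be illustrated in a few lines. NeuralMesh itself is WEKA’s proprietary software and its internals aren’t public, so this toy only demonstrates the stated principle: track the depth of every queue in the fabric and route each I/O to the globally shallowest one:

```python
# Toy queue-depth-aware I/O routing across an NVMe fabric.
import heapq

class Fabric:
    def __init__(self, n_devices, queues_per_device):
        # min-heap of (outstanding ops, device id, queue id)
        self.heap = [(0, d, q)
                     for d in range(n_devices)
                     for q in range(queues_per_device)]
        heapq.heapify(self.heap)

    def submit(self):
        """Route one I/O to the globally least-loaded queue."""
        depth, dev, q = heapq.heappop(self.heap)
        heapq.heappush(self.heap, (depth + 1, dev, q))
        return dev, q

fabric = Fabric(n_devices=4, queues_per_device=2)
targets = [fabric.submit() for _ in range(8)]
# after 8 submissions, each of the 8 queues holds exactly one op
```

A min-heap keeps the globally shallowest queue at the top, so load spreads evenly across every queue on every device instead of piling onto whichever drive the data happens to live behind.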
And another key thing here is when WEKA was born, the networks, the high-speed networks at supercomputers were faster than the motherboard. And even today, I have to keep reminding people that everyone just assumes the memory hierarchies are static. The motherboard is always the fastest way to move data around, and everything on the network is the next compromise down. And it’s inverted now. And NVIDIA, by acquiring Mellanox and building now, as Jensen will say, they don’t ship chips, they ship systems. And the systems are these chips networked together to form these amazing AI factories, these supercomputers. They all depend on the fact that the network is actually faster than a motherboard. And that’s how the GPUs are not kept waiting, are actually able to collaborate together and behave like a giant GPU.
Vik: That’s awesome. There is so much to talk about there. I’m glad you mentioned all the things that you did. So I want to hit on the latency aspect first. Essentially the biggest concern historically with the storage solutions of the past has been latency, and they were primarily designed around spinning rust discs, which are completely different from the NAND flash, NVMe storage of today, so those protocols are not really valid anymore and kind of need to be rewritten. So in terms of the equation you mentioned earlier, how does all this latency translate into, say, time to first token? How does that whole thing work out, how does it relate?
Val: Yeah. So, a very important question, a vital question. You know, the motherboard math is again non-intuitive, because you have to factor the software and the hardware into the final latency equation for all of these pieces. Latency particularly, because we have two kinds of kernels. We have the kernel that boots the server, the Linux kernel that runs on the CPU. Then we have the kernels that actually run the inference server on the GPU. We hear about CUDA all the time, right? CUDA kernels in the NVIDIA world. So those are two different kinds of kernels that have to communicate on the same server, and the inference server is what gives us our tokens from DeepSeek or Qwen or GLM or all these wonderful new models, MiniMax, etc., OpenAI of course and Anthropic included. These kernels are communicating with each other.
So when an LLM is processing your request, before it runs out of memory, it communicates at nanosecond speeds with high-bandwidth memory, very, very low nanoseconds. We aren’t even talking about this little thing called SRAM, which we should come back and talk about, because that’s really what the whole Groq, with a Q, acquisition was about.
Vik: I always joke—
Val: Yeah, I always joke that NVIDIA should have labeled SRAM as the G-zero tier, but I think they will. They have nowhere else to go now, right? So it has to become the G-zero tier.
Vik: Absolutely.
Val: But in order to not confuse matters, let’s just focus on G1 and above. So the G1 layer, that’s nanoseconds, and that’s the GPU kernel. As we know, we’re now basically tiering and including G2 in this hierarchy very commonly today, in every modern LLM inference environment. And that crosses the motherboard, because you’re going from the GPU kernel across something called a bounce buffer historically, sometimes now a chip-to-chip interconnect, between the GPU and the CPU and the memory that the CPU addresses. And the DRAM DIMMs on the server motherboard that the CPU controls, on paper that’s also nanoseconds, but not in reality. By the time that communication between the GPU kernel and the CPU kernel happens, we’re into microseconds of latency. So already, you know, three orders of magnitude between the HBM and the effective DRAM latency, and that’s not great, but it’s functional, right? The inference servers definitely do enough parallelism and enough asynchronous memory copies and things like that that it doesn’t feel like a big step down.
However, the big step down now comes from various legacy protocols entering this equation by tiering the G2 memory, the DRAM, the CPU DRAM, to storage. They introduce enough overheads because of the way they go between user mode and kernel mode and the way they deal with metadata that you’re not just going another three orders of magnitude from, you know, microseconds to milliseconds. You’re going to many, many milliseconds, two, three, four, five milliseconds, thousands and thousands of microseconds as you make that transition. And that’s where you hit a wall, the kind of the memory wall, because that’s where it’s very noticeable additional latency to the model and then to you the user. And an agent really notices that because if the agent’s spinning up hundreds of subtasks, they can either complete in minutes or they can complete in hours with this additional latency.
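The minutes-versus-hours gap falls straight out of the round numbers. The subtask and access counts below are assumptions chosen for illustration; only the microsecond and millisecond tiers come from the discussion, and real agents overlap some of this work rather than serializing it all:

```python
# Same agent swarm, two latency tiers, accesses treated as serialized.
accesses = 200 * 10_000   # 200 subtasks x 10k memory lookups each (assumed)
dram_like_us = 10         # tiered flash held at DRAM-class microseconds
legacy_ms = 3             # legacy protocol path: several milliseconds

fast_min = accesses * dram_like_us / 1e6 / 60   # total wait, in minutes
slow_hr = accesses * legacy_ms / 1e3 / 3600     # total wait, in hours
print(f"fast path: {fast_min:.1f} min, legacy path: {slow_hr:.1f} h")
```

Parallelism hides some of the penalty in practice, but a roughly 300x per-access difference is hard to hide completely, which is exactly the minutes-or-hours contrast Val describes.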
So the key design goal for WEKA is to take this real-world transition, from high-bandwidth memory in nanoseconds to DRAM in microseconds, and hold it at flat microsecond latencies, at DRAM performance levels, but with the benefit of flash capacity. So it is a best-of-both-worlds scenario when you understand the engineering and the math. And instead of capping out at four terabytes of memory per server, you can extend that G2 layer now to hundreds of terabytes easily, to petabytes. And in some extreme cases, we’ve done the math. Some of the biggest LLM apps, like the Cursors of the world, and the various models they use, are processing tens of trillions of tokens a day, and you get the kind of KV cache numbers that you refer to, especially in that great Manus blog from last summer. I call it the two-billion-dollar blog because, of course, they’re part of Meta now for two billion dollars. Servicing trillions of tokens a day effectively requires an exabyte of KV cache.
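The exabyte claim survives a back-of-the-envelope check if you keep the 1 MB to 50 GB ratio from earlier and assume, hypothetically, ten trillion tokens a day with no cache reuse:

```python
# Daily KV-cache volume at frontier-app scale.
tokens_per_day = 10e12                # "tens of trillions"; assumed value
kv_bytes_per_token = 50e9 / 100_000   # 50 GB per 100k tokens = 0.5 MB/token
total_eb = tokens_per_day * kv_bytes_per_token / 1e18
print(f"~{total_eb:.0f} EB of raw KV cache per day")
```

That lands at single-digit exabytes, so even generous prefix reuse and eviction still leave the working set in exabyte territory.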
So we’re going to be seeing this. This is why Jensen called it the biggest future storage market in the world, because when you’re dealing with an exabyte of memory, you have to think about some kind of storage semantics, at least. And even the industry now, referring specifically to context memory, refers to it in terms of cache writes and cache reads, sort of like a database or a file system, but remembering it’s at these crazy memory speeds, hundreds and hundreds of gigabytes per second, ultimately terabytes per second, at these crazy capacities, at these rigid, tight microsecond latencies.
How High-Speed Networking Enables Petabyte-Scale AI Storage
Vik: Awesome. And all of this is possible today only because of that network layer that you mentioned earlier. The fact that there is so much innovation across the interconnect and networking space, like InfiniBand or Ethernet that hooks up scale-up and scale-out fabrics across the data center. You’ve got this really high-bandwidth fabric that connects everything together, including storage. So now you’re able to serve petabytes of data over this high-bandwidth fabric right to the GPU. And I believe that there are technologies from NVIDIA for this, like GPUDirect Storage. So it bypasses a lot of these host controllers and all these middlemen and just gives the information straight to the GPU, right?
Val: Precisely. So, so many acronyms here, so many protocols we’re throwing out here. Hopefully we’re not losing people. But after you deal with the storage protocols, there is effectively a new memory protocol. Memory access is a native thing: you access the HBM, you access DRAM, etc. Direct memory access lets devices access that memory without the CPU in the middle. Remote direct memory access is where things get interesting, where you access memory at memory latencies and memory throughputs, but you access it over a high-speed network of some kind. That can be these other, more obscure memory pool technologies from CXL, like extending PCIe over a network, or it can be the Mellanox-style high-performance RDMA networks.
And unfortunately for the readers, you just have to get used to these things being called multiple names meaning the same thing. So NVLink is a brand, that’s one name for it. Backend network is another name for it. East-west network is another name for it. NCCL network is another name for it. If you’re a developer, they actually all mean the same thing. They basically mean another set of network adapters on a GPU server that are just dedicated to very high-speed traffic.
And what’s fascinating to me, and it’s good to be in the networking business nowadays, is that what Jensen announced three weeks ago is effectively a third network on these servers. You have the regular communications network, which again goes by multiple names. Front-end network is a common one. North-south network is another, a topology name. Storage network, or just, you know, the internet-facing public access network. That’s one way to get inside these servers. That’s often one or two high-speed network adapters.
Then, traditionally, you’ve had eight or more of these east-west network adapters. Oracle Cloud, for example, gives you 16 east-west network adapters. Now there’s a third network being introduced with the Rubin generation, so the second half of this year, that will be powered by these more powerful data processing units, these super network adapters called DPUs. The brand name is BlueField, the BlueField 4 generation.
So now with BlueField 4 on a third dedicated network, you can isolate the GPU traffic, NVIDIA’s preferred way to do context memory traffic, and the regular front-end network for all the other traffic from one another, if you will.
And what’s fascinating to me is you’ve got resource-constrained geniuses like DeepSeek over in Hangzhou in China, and they don’t play by NVIDIA’s rules, right? They utilize every single ounce of resource that’s available to them. So they’re going to be some of the first, I predict, perhaps alongside WEKA, we’ve been doing this as well, to publish how you can use all these networks dynamically, just in time, to get all the bandwidth you can, just in time. You don’t really have to stay in these dedicated toll lanes, if you will.
What Is the Context Memory Network and How Does It Work?
Vik: Does this network have a name? I know we have scale up, scale out, and scale across. What is this thing called?
Val: I think it’s going to be called the context memory network, right? There’s no other logical way to describe it. You can add fancier, you know, sub-adjectives to it and all that and superlatives, but it is a dedicated network for the BlueField adapters, and it’s supposed to be dedicated in a clean architecture, just for context memory, so it doesn’t interfere with the occasional bullet train of traffic on the GPU network, and it doesn’t collide with the regular front-end network for regular user traffic and other more traditional legacy storage traffic.
What Is High-Bandwidth Flash and How Does It Change AI Infrastructure?
Vik: Awesome. Now, before I get into NeuralMesh and, you know, Augmented Memory Grid, which is WEKA’s unique solution here, I just want to touch upon what you mentioned about high-bandwidth flash, because you’re right, that is a very interesting concept and idea for the listeners of this podcast. So where does that fit in? Because the idea of high-bandwidth flash is that you take a bunch of flash chips and you stack them up like you would with HBM, and all of a sudden you have this enormous capacity. You could get, I don’t know, 4 terabytes on a little thing that looks like HBM.
And now the question I think I have had, and a lot of people would, I assume, have too, is where do you put this thing? If you put it next to the GPU, just for context storage, you’re now taking away from the beachfront property of HBM. So nobody wants to give up HBM for high-bandwidth flash. And then there is always the question of the endurance of flash in general. If it fails in the network storage, you can swap out the flash drive. No worries. But how do you change it out on a GPU? It’s terrible when that happens. So where does high-bandwidth flash fit in?
Val: It’s a great question. I think we’re going to be talking about this a lot this year, especially when these devices ship very, very soon. So, high-bandwidth flash for me, it’s like a game of musical chairs. The first-principles science doesn’t change. NAND flash is still NAND flash. If you stack a lot of it, you basically have to figure out exactly how the wear leveling works, because what most laypeople don’t know about flash is that it has a very finite life cycle: how many writes, how many four-kilobyte blocks you can write to it before it just stops accepting writes and the device becomes read-only. So kind of great for archive after a certain point, basically, because it becomes inherently immutable, but that’s not what people want out of storage. And it absolutely does not work for memory, right? Memory has to have these very rapid load-store operations, billions and trillions of them in parallel.
So high-bandwidth flash is essentially taking all the same components and stacking more NAND layers. Now, instead of stacking QLC, these quad-level cells which are very, very dense but don’t endure writes very well over many, many years, maybe it’s stacking more TLC, triple-level cells that endure more writes. But it’s 3D stacking of flash, depending, again, on the grade of the flash, and then how many controllers and queues you put in front of it.
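Val’s endurance point can be made concrete with drive-writes-per-day (DWPD) arithmetic. The capacity and ratings below are assumed, typical-ish values for dense QLC versus higher-endurance TLC, not the spec sheet of any real part:

```python
# Lifetime writes implied by a DWPD endurance rating.
capacity_tb = 30.72            # assumed drive capacity
dwpd_qlc, dwpd_tlc = 0.3, 1.0  # assumed drive-writes-per-day ratings
warranty_years = 5

tbw_qlc = capacity_tb * dwpd_qlc * 365 * warranty_years  # terabytes written
tbw_tlc = capacity_tb * dwpd_tlc * 365 * warranty_years
print(f"QLC: ~{tbw_qlc/1000:.0f} PB, TLC: ~{tbw_tlc/1000:.0f} PB over warranty")
```

For a cache absorbing constant rewrites at memory-like rates, that multi-fold gap is why the cell grade in front of the stack matters as much as the raw density.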
So high-bandwidth flash is not magic. It’s just a lot more controllers, right? A lot denser ARM-style controllers and other kinds of ASICs in front of denser and denser high-endurance flash, so that you can have more queues. And ironically, at a fabric level today, what WEKA provides is a single way to see thousands and thousands, tens of thousands of queues across thousands of NVMe devices as effectively one drive. You know, that’s what this WEKA NeuralMesh software actually does.
So what we're going to see is just more and more density available to the market. It absolutely will be a denser option for our own stack and through our partners like Dell, Supermicro, Hitachi, HPE, and so forth. You're going to see these meshes of even denser high-bandwidth flash. Essentially the promise is, at a lower cost, to extend context memory comfortably into the petabytes and exabytes, because I predict by the end of this year we're going to be seeing more than an exabyte of context memory per GPU superpod.
How WEKA’s NeuralMesh™ and Axon Solve the AI Storage Bottleneck
Vik: Fascinating. So the nice thing about NeuralMesh™ is that it gives you all this on a networked drive that, to a GPU, looks as if it's local storage; from a software perspective, the GPU doesn't really know what hardware is behind it. From looking at WEKA's website, it seems like Axon is the critical innovation that enables this whole idea of converged storage. Could you dig into that a little so we understand what the technology is all about?
Val: Axon is the classic definition of luck equals preparation plus opportunity, right? Axon was designed from the start to be intelligently native to a GPU server. A GPU server, as we've already said, has eight GPUs, memory on the GPUs, shared memory on the motherboard in DIMMs, but it also often comes with up to 16 NVMe drives per server before you add remote network storage. These embedded drives were originally kind of left alone. They were very convenient for booting the Linux kernel on the CPU side, very convenient for holding large model weights and loading them quickly from storage into GPU memory to begin training or inferring on the weights and tokens respectively, but they were largely ignored.
And what we're seeing with the severe supply chain crunch in the industry today is that if you're a GPU provider, instead of getting your data center built, with all the real estate and power and cooling and water management that involves, and then waiting another three or four months just for your storage systems to arrive, giving up millions and sometimes billions of dollars of inference revenue while you wait, you've already got built-in storage and memory resources on those GPU servers themselves in those stranded drives.
Axon is simply a way to install software, no additional hardware, and transform those drives into a large-capacity pool of storage that's even faster for loading weights, and even faster for logs and temporary storage. And when you have enough of them, the math works out to around 50 to 60 drives, but 72 is a nice one-to-one match: 72 GPUs in an NVL72 rack at about one drive per GPU. Seventy-two drives can actually give you the bandwidth of memory performance, hundreds and hundreds of gigabytes per second of effective DRAM performance, to these kernels.
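As a sanity check on that claim, the aggregate-bandwidth arithmetic is simple. The per-drive figure below is an assumed PCIe Gen5 NVMe sequential-read rate for illustration, not a measured WEKA number.

```python
# 72 local NVMe drives, one per GPU in an NVL72 rack, assuming roughly
# 14 GB/s sequential read per Gen5 drive (an illustrative figure).
drives = 72
per_drive_gb_s = 14
aggregate_gb_s = drives * per_drive_gb_s
print(f"{aggregate_gb_s} GB/s aggregate read bandwidth")  # 1008 GB/s
```

Even with conservative per-drive numbers, the pooled total lands in the hundreds of gigabytes per second, which is the DRAM-class aggregate figure Val cites.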
And it also gives you the latency we talked about, if you can manage the queues on all those drives as if they were memory instead of storage. So not using the NFS protocol, not using Lustre file systems with their complicated metadata management and so forth, but using a native mesh architecture, you actually create this new category of software-defined memory out of standard, vanilla, generic GPU servers. And you can decide how much of those NVMe devices you allocate to storage and how much you treat as memory, as truly extended context memory.
Vik: That's fancy. So what I'm hearing is that you have all these local SSDs in a rack that are not always fully used, and Axon is a way of intelligently carving out a portion of each GPU's NVMe drive, pooling those together, and presenting them as a memory tier at high access bandwidth, rather than as storage.
Val: And this is really elegant for most NVL72 racks. If you pick that kind of configuration, and here I'm talking about the NVIDIA world rather than AMD's Helios or others, you get a built-in scale-up network, NVLink, in these racks. So without purchasing additional network ports, which are very expensive at 400 and 800 gigabits per second, and without purchasing expensive 800-gigabit switches with 32 or 64 ports, this is all built in. It's literally just adding WEKA software, Axon, in this configuration, and almost magically getting additional storage capacity, and really magically getting additional memory capacity without memory compromises, out of the system.
Vik: That’s a really clever way of engineering it just out of software and existing hardware.
Val: Our engineers are geeky to a fault, and they love very elegant, clever, deep tech solutions. And as I said, that was a preparation. The opportunity is, of course, the severe NVMe and DRAM crunch today, and so the timing couldn’t be more ideal for this.
What Is Augmented Memory Grid and How Does It Extend AI Context Memory?
Vik: Perfect. So if you scale this entire thing out over the network, not local SSDs that you carve out with NeuralMesh, but just a bunch of flash drives on a network, is that what Augmented Memory Grid is? And what is the concept of a token warehouse? Maybe we can explain all of that.
Val: Yeah, really important here. So let’s start with the first part of the question. What is Augmented Memory Grid? It’s another software configuration that is optional. So it makes Axon optional because it doesn’t depend on using the, if you will, built-in or stranded drives inside an NVL 72 or GPU rack of any kind. It works there very, very well. In fact, it’s the easiest way to get Augmented Memory Grid up and running. But it’s independent of that. It fundamentally works across the three kinds of networks we have today. We have the traditional north-south network where a lot of conventional storage traffic goes, and some people call that high-performance storage. It’s just all relative now compared to memory. But, you know, that’s one way to configure AMG, Augmented Memory Grid.
The second one, in the near future, second half of this year, will be over the NVIDIA context memory network, the BlueField network that goes to just a bunch of flash. The best practices way to do it today is also the third way, and these are not mutually exclusive. You can do all of them. So the third way is where literally, as long as there is that memory network, that GPU compute network available, and that could be in the rack, that can be across an aisle, that can be across a whole fleet. Pretty much the limit is inside the data center. Speed of light is a thing, and, you know, switch ports are expensive, and, you know, spine-leaf and all that gets complicated and expensive if you try and scale it out too much.
But as long as there's enough of that east-west bandwidth, and again, the brand is NVLink, but it's not really NVLink traffic, as long as there's enough of it, six, eight, maybe 16 ports of dedicated east-west GPU network bandwidth, the backend bandwidth, AMG takes all of that and naturally extends the G1 HBM memory and the G2 DRAM memory into a tier we actually call G1.5, because it performs at the level of G2 memory.
But the way inference happens, if we have time to get into the two key phases of inference, prefill and decode, by being able to optimize those things, we deliver better than G2 performance. It sometimes can rival G1 performance using today’s NVMe technology, but in aggregate, of course.
How Pooled DRAM and CXL Compare to Augmented Memory Grid for AI Inference
Vik: Where does pooled DRAM come into this? How do the speeds compare between pooling DRAM over CXL and running AMG?
Val: That's a great question, and the key thing is, as I mentioned earlier with the bounce-buffer concept, the two kernels communicating are actually storing and forwarding memory back and forth between two different buffers. CXL is a great example of the fact that that latency is already into the microseconds; memory access through CXL is not nanosecond memory anymore. I think CXL is a fantastic stopgap solution today. It's an innovation that's also about 10 years old, where, if the HBM is so precious that we just don't have enough of it, and the DRAM, the G2, is so precious that we don't have enough of it, there's actually a lot of stranded DRAM in servers. And what I mean by that is it can be stranded in time. There can be model weights, in a mixture of experts, for example, where only a subset of the weights is active at any one time.
So there could be traffic patterns that come into an inference service where certain GPU servers and certain GPUs on those servers are busier than others, and certain DRAMs are fuller than others. So if you have spare DRAM that’s not being used at a particular moment, you want to make that available to another server that’s starved for DRAM. And again, you do this over yet another kind of network. Again, it’s not really a traditional Ethernet-style or even InfiniBand network. It’s a CXL memory pooled network.
But it's a way to take this precious, finite, limited resource and not have any of it be stranded; a way to maximize utilization of that G2 layer. And it has plenty of advantages, because who doesn't want more access to available G2 memory? But it's finite. DIMMs are DIMMs. They have the same very high cost profile as CPU DRAM, they're just as limited in the market right now, and they simply don't have the capacity you need to do what Manus advises, which is to maximize your KV cache hit rates. We really haven't used this term yet: not having to evict KV cache, which commonly happens after about five minutes of inference operation, especially for agents, but keeping KV cache around so you can bump that hit rate above the industry norm of 50 or 60%, into the best-practice 70% that's achievable without extended or augmented memory.
And it’s really exponential benefits if you keep going up to 80 and 90 and 99%. And that’s what naturally happens when you have large capacity KV cache without the memory latency compromise. You inherently get to 99% KV cache hit rates almost automatically. And there’s a video link we can leave here later on for the reader to follow up that discussion, but that is the nirvana of context engineering when you effectively can automatically get 99% KV cache hit rates. There are no latencies for your users, for, you know, for big agentic jobs, for voice agents in particular, and more importantly, you’re very energy efficient now because the really energy-intensive, compute-intensive process of prefill is done once and you decode forever for those tokens. You don’t have to reprefill every five minutes.
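The hit-rate economics here can be made concrete with a blended-cost formula. The prices below are placeholders, not any provider's actual rates; the 10:1 gap between input and cache-read pricing mirrors the ratio discussed later in the conversation.

```python
# Blended input-token cost as a function of KV-cache hit rate: hits bill at
# the cache-read rate, misses trigger a full prefill recompute billed at the
# normal input-token rate. Prices (per million tokens) are illustrative.

def blended_cost_per_mtok(hit_rate: float,
                          input_price: float = 3.00,
                          cache_read_price: float = 0.30) -> float:
    return hit_rate * cache_read_price + (1 - hit_rate) * input_price

for hr in (0.50, 0.70, 0.90, 0.99):
    print(f"hit rate {hr:.0%}: ${blended_cost_per_mtok(hr):.2f} per M input tokens")
```

The benefit compounds toward the top: going from 90% to 99% hits cuts the blended cost nearly in half again, which is why the gains read as exponential.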
If you want to pay a premium to Anthropic and Google and others, you can reprefill only every hour or so. But the reality is we need this information for days and weeks, and as the science evolves, we might find even months of value in these KV caches. If you can prefill once and decode forever, that is the ideal goal here. In programmer talk, it's taking an O(N) operation down to O(1), the ideal algorithmic optimization. And that's really, mathematically and technically, what the Augmented Memory Grid offers. I'll pause there, because I also want to talk about what the token warehouse does over and above that.
What Are Token Warehouses and How Do They Improve Inference Economics?
Vik: Yeah, the token warehouse is obviously the next concept we need to hit. The whole point about KV cache hit rates is that if you have a large amount of storage available to the GPU at very low latency, the inferencing solution can store KV cache, and as long as you keep getting KV cache hits, the cost of inference drops dramatically. Recomputing is a very expensive proposition: on a cache miss, the cost is treated as input tokens because you have to redo the prefill phase.
Val: Exactly.
Vik: If you can hit the cache every time you need KV, that is the ideal solution, and it dramatically changes the landscape of tokenomics and how agents work, because now, like you say, you can keep the context for a week. If you look at Anthropic's pricing right now, holding the cache for an hour is one price, and holding it for five minutes is another.
Val: Exactly. Exactly. That comes from a fundamental hardware limitation. If you can store it indefinitely, the tokenomics change completely, and that's where token warehouses come in.
Vik: Exactly.
Val: So where token warehouses come in is the widely reported reality that inference economics today are negative. The joke is that people want to climb up to positive unit economics in inference serving, and that's because it's expensive: you have to recompute these tokens over and over again, every five minutes or in some cases every hour, and you're always paying for the privilege of extending that recompute. And whether you're an AI-native startup or an enterprise in production, it's hard to meet the insatiable demand for tokens at that cadence.
I kind of like to call it like assembly lines, you know, before Henry Ford and Model T, we were basically having all the workers come to a car, do their job, and then leave and go to another car or something, but the workers moved around. It was very expensive to move them around. And the assembly line brought the work to them, right? Don’t move the workers around. Pump out 10 times, 100 times more cars than before just by reorganizing your factory.
So that's what we're seeing with these token warehouses: they enable the ability to not evict KV cache, to minimize that prefill. At data center scale, all sorts of systems, over the north-south network, over the context memory network, and especially over the GPU-direct network, can participate in not recomputing those tokens, and can decide, based on their service levels, where to read those precomputed tokens from. And because the precomputed tokens are always available in the warehouse, you can now choose whether you want to burn trees, so to speak, and light up an entire rack's worth of energy to recompute tokens, or just read them from the token warehouse.
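That recompute-or-read choice can be sketched as a simple breakeven test. Every constant here is an assumption for illustration (prefill throughput, KV-cache bytes per token, storage read bandwidth); a real system would measure these per deployment, and faster GPU generations shift the breakeven point.

```python
# Decide whether to restore a precomputed KV cache from the token warehouse
# or recompute it with a fresh prefill pass. All constants are illustrative.

def should_read_from_warehouse(context_tokens: int,
                               prefill_tok_per_s: float = 20_000,
                               kv_bytes_per_token: int = 160_000,
                               read_gb_per_s: float = 100.0) -> bool:
    """True when reading the stored KV cache beats recomputing it."""
    recompute_s = context_tokens / prefill_tok_per_s
    read_s = context_tokens * kv_bytes_per_token / (read_gb_per_s * 1e9)
    return read_s < recompute_s

# For a 1M-token context: ~50 s of prefill vs ~1.6 s of reading back the cache.
print(should_read_from_warehouse(1_000_000))  # True
```

The same function shows why the equation isn't static: crank the assumed prefill rate up far enough (a future GPU generation) and the answer flips back to recompute.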
And this decision is going to become more and more important, because everything changes. The evolution from Hopper to Blackwell, and Blackwell to Rubin, means the precomputing gets faster and more efficient, so it's not a static equation where reading from the token warehouse, from the stored KV cache, is always cheaper. But the hack, underneath all the gobbledygook we're talking about here, if you don't follow the technology, is very transparent pricing. The way to really understand tokenomics, and whether you can be a positive-gross-margin AI-based business, is to go to sites like OpenRouter, which are wonderful because openrouter.ai will let you look at your category and then your models, and for any particular model.
So let's pick DeepSeek R1. Let's pick OpenAI's GPT-4o model. You can actually see all the inference providers that serve that model across the world, hopefully more than one for some of your models, so you have competitive choice. And it's very transparent: there's the price per million input tokens you were referring to, a price per million output tokens, and, just being filled in now, a price per million cache-read tokens and a price per million cache-write tokens, which aren't often populated yet because these are advanced features that most inference providers don't offer.
But you can see it right there. The real hack is to compare the cache-read token price to the full input token price; those are often 10-to-1 differences today. If you see that narrowing to 5-to-1, 3-to-1, 2-to-1, 1.5-to-1, you know someone has a token warehouse. You know they're being efficient, and you know their pricing reflects positive unit economics, which means your own app economics, your agent economics, can be positive unit economics as well. Because you don't want fewer tokens; the billions of tokens you want are only going up, but your price per token can't continue to be what it is today.
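The pricing heuristic here is just a ratio; spelled out with made-up prices it looks like this. The interpretation of the ratio follows the passage above, and none of the prices are any provider's actual rates.

```python
# Compare a provider's input-token price to its cache-read price
# (both per million tokens; illustrative numbers only).
def cache_read_ratio(input_price: float, cache_read_price: float) -> float:
    return input_price / cache_read_price

print(f"{cache_read_ratio(3.00, 0.30):.1f}x")  # 10.0x, the typical gap today
print(f"{cache_read_ratio(3.00, 1.50):.1f}x")  # 2.0x, the narrowing described above
```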
Vik: Yeah. So the message is: if there's any way to make the GPUs work less, that saves money. Find a way to make the GPUs lazy, because if you can just store the stuff you need and retrieve it, that's far better than churning those GPUs.
Val: And the natural consequence is that you don't buy fewer GPUs; that's just not in the vocabulary of anyone in this industry. You always preciously consume whatever allocation of GPUs you get. But you can reorder them, as in an assembly line, so much more efficiently that in this two-phase approach of prefill and decode, you allocate fewer GPUs to the redundant prefill operation and more GPUs to the money-making decode operation, which maps directly to output token pricing. You're generating more tokens off the same CapEx, the same OpEx, the same energy consumption, which again is the ideal optimization state.
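As a toy version of that reallocation, suppose (all numbers invented) that prefill would consume 40% of a rack's GPU time with no caching; at a 99% hit rate, only the missed 1% of traffic still needs prefill, freeing nearly all of those GPUs for decode.

```python
# GPUs available for decode as the KV-cache hit rate climbs. All figures invented.
total_gpus = 72
prefill_share_uncached = 0.40  # fraction of GPU time prefill would eat with no cache

def decode_gpus(hit_rate: float) -> int:
    prefill_share = prefill_share_uncached * (1 - hit_rate)
    return round(total_gpus * (1 - prefill_share))

print(decode_gpus(0.0))   # 43 GPUs decoding
print(decode_gpus(0.99))  # 72 GPUs decoding: same rack, more sellable output tokens
```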
What Is the Future of AI Storage: NVMe, Ethernet, Optics, and Beyond
Vik: That's fascinating, really good insight. Thank you for that. What are your views on the future of storage? Where are we going? What's next? What can we expect down the line?
Val: So there's all sorts of exotic media coming out: DNA-based storage for archival, glass- and diamond-based storage, and all that. But ultimately we're still bound by the reality of modern storage protocols. So, more innovation in NVMe and NVMe fabrics, more innovation in the networks supporting NVMe: better scale-up, whether it's InfiniBand or, more and more, Ethernet. Certainly more innovations like Ultra Ethernet, and RDMA, RoCE, RDMA over Converged Ethernet, over Ultra Ethernet. Those are going to be at the heart of innovation going forward. Bob Metcalfe is grinning right now, right? Ethernet just keeps winning and keeps being the most manageable way to run large networks, even with the insanely fast serialization/deserialization those networks require.
Optics, right? Going from copper to more optics to not have these furnaces basically on the network side, you know, with the very, very hot transceivers. That’s a big innovation. CPO, co-packaged optics, that’s a big innovation. So optics and Ethernet and, you know, these modern storage protocols, NVMe and NVMe fabrics, understanding all the queues across all the devices and having just native software that understands that without some of the burden of 30-year-old technologies that were designed for spinning rust.
Vik: Yeah, and all these innovations affect everything else. Everything is so squeezed now that, as Jensen described, it's extreme co-optimization across networking, compute, and storage that's really going to drive the industry forward. It's not any single thing that changes the world forever; maybe we already saw that with the LLM. There's so much required everywhere.
Val: Absolutely. I mean, there’s never been a better time to be a true systems engineer, right? You’ve got to engineer all sorts of discrete components into a very balanced, cohesive, reliable, single system.
Vik: All right, that’s it for this episode. Thanks so much, Val, for spending the time to be with us. This was an enlightening conversation. I learned a lot about this, and I hope our listeners did, too. Please check out weka.io. I know I learned a lot from their fantastic collection of blogs on their website that talk so much about storage protocols and technologies from a very understandable point of view and a lot of pictures. So definitely do check that out. And if you enjoyed this podcast, please consider leaving us a review or a like on your favorite platform, and thanks for listening.
Val: Absolute pleasure, Vik.
Related Resources
AI Infra Summit: Shimon Ben-David, WEKA and Nave Algarici, NVIDIA discuss what is next in AI Infrastructure