VIDEO

The Agentic AI Infrastructure Playbook

WEKA CTO Shimon Ben-David joins VentureBeat’s CEO and Editor in Chief Matt Marshall in a fireside chat to discuss how insufficient GPU memory is the leading bottleneck to unlocking maximum AI inference efficiency.

Speakers:

  • Shimon Ben-David - CTO, WEKA
  • Matt Marshall - CEO and Editor in Chief, VentureBeat

Transcript

00:00

Understanding the Memory Wall Problem in AI Inference and Why Enterprise Agents Need KV Cache Solutions

Matt Marshall: WEKA is on the front lines of a lot of things having to do with agents. If you've been involved in building enterprise agents, you've probably heard about the memory wall: If you're using a lot of tokens, there's the context that's necessary to keep those tokens in memory from chat to chat or from instance to instance. WEKA has been working on solving some of those things. So we're going to be hearing from Shimon Ben-David, WEKA's CTO, who can help sum up some of these things on the infrastructure side.

Thank you for supporting this event, Shimon. WEKA has arrived with the right product at the right time. You just closed with that piece around unit economics. One of the other things that is broken right now with agents is statefulness, right? Because of the context that needs to be carried from call to call. And that has not been there. Google just announced an interactions API. You're going to see this statefulness — you have that ability to save that context. We see longer context windows. So the cost of these tokens is going down, right? We keep hearing about the cost, but it turns out that token use is actually growing faster than the cost of tokens is going down, because of things like statefulness.

Shimon Ben-David: So that’s a good problem to have, right? People are using AI more and more.

Matt Marshall: Yeah, it’s definitely a good problem.

01:30

Why Managing Hundreds of AI Models Creates Inference Bottlenecks in Production Environments

Matt Marshall: Enterprises are struggling with just managing the model zoo, right? These hundreds of models where you’re turning to the models that work for specific cases. So why is the model repository becoming such a chokepoint for inference?

Shimon Ben-David: So I think — and it seems like we have a very technical audience, so that’s great — I think when we’re looking at inference environments, there’s a big difference between environments in pre-production, in benchmarks, and those of us that are taking our inference environment to production-grade, running multiple models.

It's not one or two models that I run as an organization. Definitely when I'm running an agent, multiple agents, agent swarms — which we're starting to see more and more — all of these models need to be activated and deactivated quickly.

And sometimes there's this: We're all talking about high-level AI use cases, but eventually there's physics behind it. These models take up a certain capacity that needs to be loaded into GPU memory or TPU (tensor processing unit) memory, loaded and activated. We see customers with hundreds of terabytes or petabytes of model repositories, and these models actually need to be activated quickly, because eventually, in inferencing, latency is king, right?

If I have a user and I reply within one second and that's my SLA, or it takes 10 seconds, maybe now they'll go to the next model. So the ability to move between models and activate models in an inference environment at scale is very challenging. Now the challenge is that my physical components, my GPU servers, don't have the capacity to accommodate all of my models.

And then suddenly, what do I do? I'll give an example: a customer that we worked with, Cohere, a large foundation model provider running on multiple cloud environments. One of their challenges was that they have a dynamic inference environment: hundreds of GPU servers, thousands of GPUs. Obviously they have multiple models, but they also had variations per customer and per sub-customer, and they had an environment that could accommodate a certain amount of prompts at a certain SLA.

Eventually, when they got a burst of more inference requests or prompts, they needed to scale out in an elastic way. This is the cloud. They needed to scale in and out in an efficient way. It took them around 5 to 15 minutes to get a new instance with the model loaded up and running, ready to serve a peak of inference requests. That’s a long time. We were able to decrease it, but you need that elasticity. You need to be able to accommodate for multiple models.
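
As a rough illustration of why model load time dominates elastic scale-out, here is a back-of-the-envelope sketch of how long it takes to stream a model onto a new instance. The model size and bandwidth figures are illustrative assumptions, not Cohere's or WEKA's measured numbers.

```python
# Back-of-the-envelope sketch: how long does it take to stand up a new
# inference instance, given model size and the bandwidth of the path that
# feeds the GPUs? Numbers below are illustrative assumptions.

def load_time_seconds(model_size_gb: float, effective_bandwidth_gbps: float) -> float:
    """Time to stream a model of `model_size_gb` gigabytes over a path
    delivering `effective_bandwidth_gbps` gigabytes per second."""
    return model_size_gb / effective_bandwidth_gbps

model_gb = 140  # e.g. a 70B-parameter model in FP16 (hypothetical)

# Pulling from slow object storage vs. a fast shared data path.
for label, bw in [("object storage ~0.5 GB/s", 0.5), ("fast shared storage ~20 GB/s", 20.0)]:
    print(f"{label}: {load_time_seconds(model_gb, bw):.0f} s")

# The difference between minutes and seconds is what decides whether elastic
# scale-out during a burst of inference requests is practical.
```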

Additionally, what we're also seeing is that it's not only that — because inferencing is growing — but there are still a lot of training environments. So between training environments and inferencing environments, how do you manage your GPU utilization efficiently? How do you utilize both environments concurrently to get the most out of them?

04:41

How GPU Memory Limits and HBM Capacity Create the KV Cache Problem

Matt Marshall: This is really interesting. So as you hit this stateful issue and you’re hitting this memory wall, what’s stopping these customers? You mentioned Cohere, but generally these enterprise companies do have a multi-cloud environment. Why don’t they just go to the cloud and load up on GPUs for that inference?

Shimon Ben-David: You know, some of them wish they did. There are some problems that you cannot throw enough money at to solve. GPUs are a scarce resource. And when we're — again, being a deep tech company — when we're looking at the infrastructure of inferencing, inferencing is not a GPU cycles challenge. It's mostly a GPU memory problem.

If you're looking at the notion of inferencing, you have this construct called the KV cache, which is essentially the memory, the context window of the models you're using. In production-grade environments, again, you're trying to spread the load across hundreds or thousands of GPUs to answer all these massive prompts. The challenge is actually your context window, your GPU memory, which is stored in the GPU's HBM (high-bandwidth memory).

Imagine a code development environment. You're slinging tons of code, and tons of code eventually translates to capacity. I'll give you an example: Depending on the model and the number of layers, 100,000 tokens can be 40GB of memory. That's out of your 80GB or 140GB of HBM, right? So if I'm throwing in two, three, four books that are 100,000 tokens each, that's it. I've run out of KV cache capacity on my HBM.
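
To make the arithmetic behind that 40GB figure concrete, here is a minimal sketch of KV cache sizing using the standard per-token formula. The model dimensions are illustrative assumptions; the talk does not specify which model the figure comes from.

```python
# Minimal sketch of KV cache sizing. Model dimensions below are illustrative
# assumptions; the exact model behind the "100,000 tokens ~ 40GB" figure
# isn't specified in the talk.

def kv_cache_bytes(num_tokens: int, num_layers: int, num_kv_heads: int,
                   head_dim: int, bytes_per_elem: int = 2) -> int:
    """Per-sequence KV cache size: 2 (K and V) * layers * KV heads * head dim
    * bytes per element, for every token kept in context."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem * num_tokens

# Example: a large dense model without grouped-query attention
# (80 layers, 64 KV heads, head_dim 128, FP16).
gb = kv_cache_bytes(100_000, num_layers=80, num_kv_heads=64, head_dim=128) / 1e9
print(f"100k tokens, no GQA: ~{gb:.0f} GB of KV cache")   # ~262 GB

# With grouped-query attention (e.g. 8 KV heads) the same context is ~33 GB,
# in the ballpark of the 40GB figure -- still a huge slice of an 80GB or
# 141GB HBM budget.
gb_gqa = kv_cache_bytes(100_000, num_layers=80, num_kv_heads=8, head_dim=128) / 1e9
print(f"100k tokens, 8 KV heads: ~{gb_gqa:.0f} GB of KV cache")
```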

Suddenly what the inference environment needs to do is drop data. I pre-calculate, I prefill, I create that KV cache and I start decoding on it, sending out tokens. But as I generate more data, I now need to drop my previous data. Eventually when I need it again, I need to recalculate it. We constantly see GPUs in inference environments that are recalculating things they already did. So you’re prefilling, decoding, prefilling again.
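
The decision the serving layer keeps making can be sketched as a simple cache check: if the KV cache for a prompt is still resident, decode directly; if it was dropped, the GPU has to prefill it again. This is a hypothetical illustration, not a real inference framework's API.

```python
# Sketch of the reuse-or-recompute decision a serving layer faces once HBM
# fills up and data gets dropped. Names are hypothetical placeholders.

kv_cache = {}  # prompt hash -> cached K/V state (stand-in: token count)

def prefill(prompt_tokens: list[int]) -> int:
    """Expensive: recompute attention K/V for every prompt token."""
    return len(prompt_tokens)          # stand-in for the real K/V tensors

def serve(prompt_tokens: list[int]) -> str:
    key = hash(tuple(prompt_tokens))
    if key in kv_cache:
        # Cache hit: skip straight to decoding new tokens.
        return "decode using cached context"
    # Cache miss: the GPU burns cycles redoing work it already did once.
    kv_cache[key] = prefill(prompt_tokens)
    return "prefill, then decode"

print(serve([101, 202, 303]))   # first call: prefill, then decode
print(serve([101, 202, 303]))   # same prompt again: decode using cached context
```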

The same thing, more in depth — we actually see large model providers (and I think if you look at the pricing it's very apparent: Anthropic, OpenAI and others) teaching you how to structure your prompts so that they hit the same GPU — so that you land on the GPU that already has your KV cache. Then they can just start decoding your data instead of recalculating it, because they would like to generate more tokens for you.
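
As a rough illustration of that routing idea, here is a sketch of prefix-aware request routing: requests that share a stable prompt prefix land on the same worker, so its KV cache can be reused. The worker list and hashing scheme are assumptions for illustration, not how any particular provider implements it.

```python
# Sketch of prefix-aware routing: send a request to the worker most likely
# to already hold the KV cache for its prompt prefix, so it can decode
# instead of re-prefilling. Purely illustrative.

import hashlib

WORKERS = ["gpu-0", "gpu-1", "gpu-2", "gpu-3"]

def route(prompt: str, prefix_chars: int = 2048) -> str:
    """Hash a stable prompt prefix (e.g. a shared system prompt) so that
    identical prefixes land on the same worker and reuse its KV cache."""
    digest = hashlib.sha256(prompt[:prefix_chars].encode()).digest()
    return WORKERS[int.from_bytes(digest[:4], "big") % len(WORKERS)]

print(route("You are a helpful assistant. ..."))  # same prefix -> same GPU
```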

We call that the “memory wall”: How do you climb that memory wall? How do you pass it? Eventually, that’s the key for modern, cost-effective inferencing. You can try to throw more GPUs at it. You can try to complicate your orchestration environment. We do see multiple environments, multiple companies trying to solve that in different ways.

For example, there are new models — linear models — that are trying to create smaller KV caches to be more efficient. There are environments that are saying, "Hey, I already calculated the KV cache on one GPU, it's in my GPU memory. Let's try to copy it, or maybe use my local environment for that." But how do you do that at scale, in a cost-effective manner that doesn't strain your memory and doesn't strain your networking? That's something we're helping some of these customers with.
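
The "spill it over instead of dropping it" idea can be sketched as a two-tier cache: hot KV blocks stay in HBM, cold blocks move to a larger external tier, and a restore is a copy rather than a recompute. The class, tier names and sizes here are assumptions for illustration, not WEKA's actual interface.

```python
# Sketch of a two-tier KV cache: keep hot blocks in HBM, spill cold blocks to
# a larger external tier instead of dropping them, and restore on reuse.
# Tier names and sizes are assumptions, not a real product interface.

from collections import OrderedDict

class TieredKVCache:
    def __init__(self, hbm_capacity_blocks: int):
        self.hbm = OrderedDict()       # block_id -> data, kept in LRU order
        self.external = {}             # spilled tier: CPU RAM / shared storage
        self.capacity = hbm_capacity_blocks

    def put(self, block_id, data):
        self.hbm[block_id] = data
        self.hbm.move_to_end(block_id)
        while len(self.hbm) > self.capacity:
            cold_id, cold_data = self.hbm.popitem(last=False)
            self.external[cold_id] = cold_data   # spill, don't recompute later

    def get(self, block_id):
        if block_id in self.hbm:
            self.hbm.move_to_end(block_id)
            return self.hbm[block_id]
        if block_id in self.external:
            # Restore is a copy over the network or PCIe -- far cheaper than
            # re-running prefill on the GPU.
            self.put(block_id, self.external.pop(block_id))
            return self.hbm[block_id]
        return None                              # true miss: must prefill
```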

08:11

Real Cost Savings from KV Cache Acceleration: 4.2x Performance Improvement Case Studies

Matt Marshall: It sounds like it's very much a big-company issue so far, with OpenAI, Anthropic, some of these bigger cluster users. But it's definitely something that's coming with agentic usage. We heard last time from Wonder, a company in food delivery, that was basically getting notifications from Azure in Northern Virginia saying they had a capacity crunch and to go find another location. So this sort of crunch is happening. Can you talk a little bit about the economics of the KV cache processes? How much are your customers saving?

Shimon Ben-David: So mileage will vary across use cases. For example, if I have a simple chatbot just getting a few questions and answering them, my context window is not that big, so maybe it's less valuable. But some use cases are very KV cache-heavy: code development, tax returns, anything with regulation — you have a lot of context and benefit from the KV cache.

In some of the benchmarking and working with customers — some inference providers, some LLM providers — we saw we can accelerate that by a factor of up to 4.2x. That’s a real number with multi-tenant variation inferencing.

Just to explain the magnitude of 4.2x: It sounds like a small number, right? Imagine that you have 100 GPUs and these 100 GPUs are emitting a certain amount of tokens. Now imagine that these 100 GPUs are working as if they were 420 GPUs, just by adding the KV cache acceleration layer we provide. That's a ridiculous amount of money. We're looking at some use cases where the savings would be millions of dollars per day for these inference providers.

Matt Marshall: That's significant.
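
As back-of-the-envelope arithmetic on what a 4.2x throughput factor implies, here is a short sketch; the hourly GPU rate is an illustrative assumption, not a figure from the talk.

```python
# Back-of-the-envelope on the 4.2x figure. The hourly GPU rate below is an
# illustrative assumption, not a number from the talk.

gpus = 100
speedup = 4.2
gpu_hour_cost = 3.00   # assumed $/GPU-hour

equivalent_gpus = gpus * speedup                   # 100 GPUs serving like 420
avoided_gpus = equivalent_gpus - gpus              # 320 GPUs you didn't buy or rent
daily_savings = avoided_gpus * gpu_hour_cost * 24
print(f"Equivalent fleet: {equivalent_gpus:.0f} GPUs, "
      f"avoided spend ~ ${daily_savings:,.0f}/day")

# At fleet sizes of thousands of GPUs, the same ratio is where the
# "millions of dollars per day" framing comes from.
```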

10:11

How LinkedIn Optimized AI Inference Using Speculative Decoding for 4x Throughput Gains

Matt Marshall: We’re out of time on the talk, let’s go to questions. Is anyone running inference and running into the efficiency issue?

Audience Member (AI Inference Optimizations at LinkedIn): We recently rolled out our own on-prem-hosted hiring assistant for LinkedIn, and I was working on that. We were facing a lot of memory-bound problems in the decoding stages, and we found that speculative decoding really saved us a lot of latency and increased throughput by around 4x — these are public numbers on the engineering blog, so feel free to look them up. But I'm just curious: How have you seen speculative decoding evolve, and how can we make it more efficient? Because right now it just looks at the context you have passed so far, but maybe that can be augmented with the token factories.

Shimon Ben-David: I think it really depends on what our customers are doing on our platform. I'm not familiar with any customer of ours that's running speculative decoding. But as I mentioned, it really depends on the amount of KV cache you're generating, and it's one more way to decrease the amount of KV cache, right? Because you generate less of it. But eventually there's this paradox that the more you improve, the more you do, right? So even with that, imagine adding that acceleration layer on top of it and getting even further gains.
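
For readers unfamiliar with the technique the questioner describes, here is a toy sketch of the speculative decoding loop: a cheap draft model proposes a few tokens, the expensive target model verifies them, and only the agreeing prefix is kept. Both "models" below are stand-in functions, not LinkedIn's or any framework's implementation.

```python
# Toy sketch of speculative decoding: a cheap draft model proposes k tokens,
# the expensive target model checks them, and only the longest agreeing
# prefix is kept. Both "models" are stand-in functions.

def target_next(context: list[int]) -> int:
    """Stand-in for the large model's next-token choice (expensive per call)."""
    return (sum(context) * 31 + len(context)) % 1000

def draft_next(context: list[int]) -> int:
    """Stand-in for the small draft model: right most of the time."""
    guess = target_next(context)
    return guess if len(context) % 5 else (guess + 1) % 1000  # occasional miss

def speculative_step(context: list[int], k: int = 4) -> list[int]:
    # 1. Draft k candidate tokens cheaply, one after another.
    proposed = []
    for _ in range(k):
        proposed.append(draft_next(context + proposed))
    # 2. Verify: keep the longest prefix the target model agrees with
    #    (in a real system this check is a single batched forward pass).
    accepted = []
    for tok in proposed:
        if tok == target_next(context + accepted):
            accepted.append(tok)
        else:
            break
    # 3. Always emit at least one token from the target model itself.
    accepted.append(target_next(context + accepted))
    return context + accepted

ctx = [1, 2, 3]
for _ in range(3):
    ctx = speculative_step(ctx)
print(ctx)  # several tokens emitted per target-model pass worth of work
```

Because each step emits several tokens for roughly one target-model pass, latency drops and fewer decode iterations touch the KV cache, which is where the throughput gains the questioner cites come from.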

12:02

Fractional GPU Solutions and Multi-Tenancy Strategies for Cloud Providers and Data Centers

Audience Member (Security at Equinix): One of the things that at least we are seeing with our customers: A lot of them are opting for fractional GPUs. So how does that help solve the inferencing problem, or have you seen your customers opting for fractional GPUs? Because I think, especially from a data center or cloud service provider perspective, that's one of the things people are figuring out on the back end.

Shimon Ben-David: Brilliant. I think it's a good question. We're actually working with a lot of the neoclouds, hyperscale cloud services and, increasingly now, sovereign clouds, where multi-tenancy — in terms of clustering the GPUs and clustering the storage — is a major concern.

We see some of them offering fractional GPUs. Realistically, what we most often see are actually clusters of GPU servers. Because it’s so expensive, tenants are just buying large fractions, large numbers of GPUs for themselves. We see some of the neoclouds that are offering an on-demand environment.

Eventually, a fractional GPU also means a fraction of that GPU's memory. So these sorts of solutions can take it even further: If you're getting a fractional GPU, you're also getting fractional memory, and now suddenly you're able to say, "Hey, I can spill my context over — beyond whatever my fractional memory can hold — and be significantly accelerated."

It's funny, because I also heard from another one of these GPU clouds that they're going to deploy their core environment — the core cloud — plus edge environments and, obviously, edge aggregation: a smaller data center, but still a data center. And their take on this type of acceleration was, "You're actually safeguarding my investment, because if I'm putting in somewhat slower GPUs — not best of breed, and not a lot of them — suddenly with this I can do much more."

14:26

Best Practices for KV Cache Storage Time Limits and LRU Eviction Policies in Production

Audience Member (Data Scientist): I have this question about the KV cache time limit. Do you have any time limit on how long the cache can be stored?

Shimon Ben-David: I think that's the million-dollar question. When we started, I have to say we were, engineering-wise, very naive: "Let's just extend the GPU memory and drop the data there." And then it was, "Oh yeah, we need to delete it at some point, because otherwise you're creating this sprawling token warehouse of information."

Eventually, what we're doing is working with customers that are looking at it, and what we're seeing is that different customers have different SLAs. For example, if we're working with an inference provider that says, "My SLA is 10 minutes for the data, and afterward everything is best effort," then if the data spills over to our data environment, you'd like to keep it as long as possible. From there, a least-recently-used algorithm applies: data stays as long as possible according to the customer's SLA, and if it's not needed, it can be evicted.
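
That retention policy can be sketched as a cache that guarantees entries for an SLA window and then evicts least-recently-used entries first when space is needed. The class name, capacity and SLA value below are hypothetical illustrations, not WEKA's implementation.

```python
# Sketch of the retention policy described: entries are guaranteed for an SLA
# window (e.g. 10 minutes), kept best-effort after that, and evicted
# least-recently-used first when space is needed. Values are hypothetical.

import time
from collections import OrderedDict

class SlaLruCache:
    def __init__(self, capacity: int, sla_seconds: float = 600.0):
        self.entries = OrderedDict()   # key -> (value, last_access_time), LRU order
        self.capacity = capacity
        self.sla = sla_seconds

    def put(self, key, value):
        self.entries[key] = (value, time.monotonic())
        self.entries.move_to_end(key)
        self._evict_if_needed()

    def get(self, key):
        if key not in self.entries:
            return None
        value, _ = self.entries[key]
        self.entries[key] = (value, time.monotonic())
        self.entries.move_to_end(key)
        return value

    def _evict_if_needed(self):
        now = time.monotonic()
        for key in list(self.entries):        # oldest (least recently used) first
            if len(self.entries) <= self.capacity:
                break
            _, last_access = self.entries[key]
            if now - last_access >= self.sla:  # never evict inside the SLA window
                del self.entries[key]
        # If every entry is still inside its SLA window, the cache temporarily
        # overshoots capacity -- the guarantee takes priority over the limit.

cache = SlaLruCache(capacity=2, sla_seconds=600)
cache.put("session-a", "kv blocks for session a")
```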

It's a really good question. One more thing I'll say is that we obviously see this seeping into enterprises. We're talking a lot about large inference providers, hyperscalers, neoclouds, but we see next year as the year of the enterprise. AI is becoming more and more dominant. And I mentioned this 4.2x improvement: Imagine an enterprise that's running 100 GPUs getting even just 120 GPUs' worth — 20 GPUs is still hundreds of thousands of dollars.

16:09

Power Efficiency and Green Computing Requirements for AI Data Centers in 2025

Audience Member (Data Center Admin): I’m really interested in hearing all this because it helped me understand what you guys are trying to solve. My question is: We’ve been trying to solve how to get more power to build the clusters. How is it really location-agnostic? Is it really latency-agnostic? I feel like this is something that sparks a lot of questions.

Matt Marshall: Maybe what your customers want.

Shimon Ben-David: So, customers want it for free and at zero power.

[Laughter]

One of our investors is Al Gore's Generation IM, which has a very green mentality. In the past, the green mentality meant, "Let's have a checkbox. We're green. Boom." Now I'm sad, but happy, to say that the green mentality also equates to dollars.

So the No. 1 requirement we see from our customers is: Can you be more cost-effective, more power-efficient? My power utilization doubles or quadruples every several years — if I can even get that power to begin with. By the way, we see our customers being very creative about how they generate power. A lot of these neoclouds are located in environments where they can generate the power: power plants, nuclear power plants, renewable energy, hydro. And then it's: let's use it as efficiently as possible.

One thing I would mention along these lines: The denser your solution is — the less power, the fewer switches, the less networking, cabling, rack footprint, cooling and heating — the better. So that's No. 1 out of a list of 100 requirements.

18:20

GPU Cluster Optimization Techniques: From Instruction-Level Parallelism to Coarse-Grained Computing

Audience Member: This reminds me of some of the work I did on instruction-level parallelism back in the early 1990s — a lot of code extraction and code optimization in the L1 cache. Do you see more happening at the coarse-grained level now — L1, L2, L3 — so not optimizing the KV cache at the instruction level, but across a cluster of GPUs, doing coarse-grained optimization similar to how traditional HPC computing was done?

Shimon Ben-David: Really good topic. We were actually working with NVIDIA on their Dynamo project, and one of the things they mentioned is that we do see other methods trying to solve this KV cache challenge. And yes, we definitely see multiple tiers within the server itself being allocated for it, which is nice. There are also some proprietary hardware devices meant to accelerate cache within a server. The problem at scale is somewhat different.

Audience Member: It’s eventually morphed into those kinds of problems.

Shimon Ben-David: Exactly.

19:28

Future of GPU Costs, HBM Shortages, and Enterprise AI Cost Predictability Strategies

Audience Member: I have a two-part question. First, we have seen the overall cost of GPUs — all of this hardware — getting lower over time. What do you see in the near future? Can the overall cost of the GPU, including the memory, decrease significantly?

Second, in the enterprise environment, we usually hope to get the cost upfront and very predictable. It's all tradeoffs, right? It's about the SLA, the latency, all the parts. At the end of the day, we also want to balance the cost. So if we can get the overall predicted cost, then we can optimize what we do. We can also optimize a certain workflow so that, if it's not urgent, it can be deprioritized and run later to help optimize the cost in general.

Shimon Ben-David: What do we see with GPU costs and memory? It's actually worsening. As you know, there's a shortage in flash devices worldwide, and that extends to memory: NAND and HBM costs are increasing significantly, along with flash devices, NVMe and SSDs. So it's going to get worse before it gets better. I was talking with somebody today about how, during COVID, everybody bought a certain amount of something and then bought it again — and that's the kind of rush we're seeing with everybody buying NAND today.

I would say that if you can use whatever is available, it gives you a lot of flexibility. I will point to WEKA a bit — it's not a sales pitch, as we mentioned — but because we created a software-defined environment, we can use multiple types of NAND. So whatever is available, we will use. More than that, we created an environment that can fold in and run all of that within your GPU servers, without the need for external NAND. The more flexible you are, the more you'll be able to provide during this time of shortage — and obviously you want to do that. It will also be harder to get GPUs, so the more you can get out of your GPUs, obviously, the better.

To the second part of your question, around predictability: I think in enterprises that's the key, because we see a lot of enterprise AI workloads that are still pointed inward. A lot of AI use cases are not external-facing — they're still being used internally: internal chatbots, A/B environments. Next year we think we'll see more of it going outbound, especially with sovereign clouds that are advancing technology and research for their sovereign environments. Their ability to be predictable is going to be more important.

Taking that shortage together with this SLA and predictability requirement, what we're saying is: If we can provide an environment where you have a set amount of known FLOPS (floating-point operations per second) on your GPUs, known memory, and known KV cache capacity — and the knob you need to turn to meet your SLAs is this augmented memory environment, which is a fairly cheap knob compared to buying 100 or 1,000 more GPUs — that's definitely an advantageous way to go.
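
The predictability argument can be sketched as simple capacity planning: with known HBM per GPU and a known per-request KV cache footprint, you can compute how many long contexts fit on-GPU and how much augmented capacity closes the gap to a target concurrency. All numbers below are illustrative assumptions.

```python
# Sketch of the predictability argument: with known HBM per GPU and known
# per-request KV cache footprint, compute how many concurrent long contexts
# fit on-GPU and how much augmented (external) KV capacity closes the gap to
# a target concurrency. All numbers are illustrative assumptions.

hbm_per_gpu_gb = 141           # e.g. an H200-class GPU
weights_and_overheads_gb = 90  # model shard + activations + runtime overhead
kv_per_request_gb = 40         # ~100k-token context, as in the earlier example

target_concurrency_per_gpu = 8

kv_budget_on_gpu = hbm_per_gpu_gb - weights_and_overheads_gb
fits_on_gpu = int(kv_budget_on_gpu // kv_per_request_gb)
shortfall = max(0, target_concurrency_per_gpu - fits_on_gpu)
augmented_capacity_gb = shortfall * kv_per_request_gb

print(f"On-GPU KV budget: {kv_budget_on_gpu} GB -> {fits_on_gpu} long contexts")
print(f"Need ~{augmented_capacity_gb} GB of external KV capacity per GPU "
      f"to hit {target_concurrency_per_gpu} concurrent contexts")
```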

Like This Discussion? There’s More!

Shimon Ben-David is passionate about helping customers solve their memory crises. He recently spoke at the AI Infra Summit in San Francisco about how memory-first architectures will help solve inference challenges.