VIDEO

The Agentic AI Infrastructure Playbook

WEKA CTO Shimon Ben-David joins VentureBeat’s CEO and Editor in Chief Matt Marshall in a fireside chat to discuss how insufficient GPU memory is the leading bottleneck to unlocking maximum AI inference efficiency.

Speakers:

  • Shimon Ben-David - CTO, WEKA
  • Matt Marshall - CEO and Editor in Chief, VentureBeat

Transcript

00:00

Understanding the Memory Wall Problem in AI Inference and Why Enterprise Agents Need KV Cache Solutions

Matt Marshall: WEKA is on the front lines of a lot of things having to do with agents. If you've been involved in building enterprise agents, you've probably heard about the memory wall: If you're using a lot of tokens, there's the context that's necessary to keep those tokens in memory from chat to chat or from instance to instance. WEKA has been working on solving some of those things. So we're going to be hearing from Shimon Ben-David, WEKA's CTO, who can help sum up some of these things on the infrastructure side.

Thank you for supporting this event, Shimon. WEKA has arrived with the right product at the right time. You just closed with that piece around unit economics. One of the other things that is broken right now with agents is statefulness, right? Because of the context that needs to be carried from call to call. And that has not been there. Google just announced an interactions API. You're going to see this statefulness — you have that ability to save that context. We see longer context windows. So the cost of these tokens is going down, right? We keep hearing about the cost, but it turns out that token use is actually growing faster than the cost of tokens is going down, because of things like statefulness.

Shimon Ben-David: So that’s a good problem to have, right? People are using AI more and more.

Matt Marshall: Yeah, it’s definitely a good problem.

01:30

Why Managing Hundreds of AI Models Creates Inference Bottlenecks in Production Environments

Matt Marshall: Enterprises are struggling with just managing the model zoo, right? These hundreds of models where you’re turning to the models that work for specific cases. So why is the model repository becoming such a chokepoint for inference?

Shimon Ben-David: So I think — and it seems like we have a very technical audience, so that’s great — I think when we’re looking at inference environments, there’s a big difference between environments in pre-production, in benchmarks, and those of us that are taking our inference environment to production-grade, running multiple models.

It's not one or two models that I run as an organization. Definitely when I'm running an agent, multiple agents, agent swarms — which we're starting to see more and more — all of these models need to be activated and deactivated quickly.

And sometimes there's this: We're all talking about high-level AI use cases, but eventually there's physics behind it. These models take up a certain capacity that needs to be loaded into GPU memory or TPU (tensor processing unit) memory, loaded and activated. We see customers with hundreds of terabytes or petabytes of model repositories, and these models actually need to be activated quickly, because eventually, in inferencing, latency is king, right?

If I have a user and I reply within one second and that's my SLA, or it takes 10 seconds, maybe now they'll go to the next model. So the ability to move between models and activate models in an inference environment at scale is very challenging. Now the challenge is that my physical components, my GPU servers, don't have the capacity to accommodate all of my models.

And then suddenly, what do I do? I'll give an example: a customer that we worked with, Cohere, a large foundation model provider running on multiple cloud environments. One of their challenges was that they have a dynamic inference environment: hundreds of GPU servers, thousands of GPUs. Obviously they have multiple models, but they also had variations per customer and per sub-customer, and they had an environment that could accommodate a certain amount of prompts at a certain SLA.

Eventually, when they got a burst of more inference requests or prompts, they needed to scale out in an elastic way. This is the cloud. They needed to scale in and out in an efficient way. It took them around 5 to 15 minutes to get a new instance with the model loaded up and running, ready to serve a peak of inference requests. That’s a long time. We were able to decrease it, but you need that elasticity. You need to be able to accommodate for multiple models.
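
As a rough illustration of why model load time dominates elastic scale-out, here is a back-of-the-envelope sketch of how long it takes to stream a model onto a new instance. The model size and bandwidth figures are illustrative assumptions, not Cohere's or WEKA's measured numbers.

```python
# Back-of-the-envelope sketch: how long does it take to stand up a new
# inference instance, given model size and the bandwidth of the path that
# feeds the GPUs? Numbers below are illustrative assumptions.

def load_time_seconds(model_size_gb: float, effective_bandwidth_gbps: float) -> float:
    """Time to stream a model of `model_size_gb` gigabytes over a path
    delivering `effective_bandwidth_gbps` gigabytes per second."""
    return model_size_gb / effective_bandwidth_gbps

model_gb = 140  # e.g. a 70B-parameter model in FP16 (hypothetical)

# Pulling from slow object storage vs. a fast shared data path.
for label, bw in [("object storage ~0.5 GB/s", 0.5), ("fast shared storage ~20 GB/s", 20.0)]:
    print(f"{label}: {load_time_seconds(model_gb, bw):.0f} s")

# The difference between minutes and seconds is what decides whether elastic
# scale-out during a burst of inference requests is practical.
```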

Additionally, what we're also seeing is that it's not only that — because inferencing is growing — but there are still a lot of training environments. So between training environments and inferencing environments, how do you manage your GPU utilization efficiently? How do you utilize both environments concurrently to get the most out of them?

04:41

How GPU Memory Limits and HBM Capacity Create the KV Cache Problem

Matt Marshall: This is really interesting. So as you hit this stateful issue and you’re hitting this memory wall, what’s stopping these customers? You mentioned Cohere, but generally these enterprise companies do have a multi-cloud environment. Why don’t they just go to the cloud and load up on GPUs for that inference?

Shimon Ben-David: You know, some of them wish they did. There are some problems that you cannot throw enough money at to solve. GPUs are a scarce resource. And when we're — again, being a deep tech company — when we're looking at the infrastructure of inferencing, inferencing is not a GPU cycles challenge. It's mostly a GPU memory problem.

If you're looking at the notion of inferencing, you have this construct called the KV cache, which is essentially the memory, the context window of the models you're using. In production-grade environments, again, you're trying to spread the load across hundreds or thousands of GPUs to answer all these massive prompts. The challenge is actually your context window, your GPU memory, which is stored in the GPU's HBM (high-bandwidth memory).

Imagine a code development environment. You're slinging tons of code, and tons of code eventually translates to capacity. I'll give you an example: Depending on the model and the number of layers, 100,000 tokens can be 40GB of memory. That's out of your 80GB or 140GB of HBM, right? So if I'm throwing in two, three, four books that are 100,000 tokens each, that's it. I've run out of KV cache capacity on my HBM.
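
To make the arithmetic behind that 40GB figure concrete, here is a minimal sketch of KV cache sizing using the standard per-token formula. The model dimensions are illustrative assumptions; the talk does not specify which model the figure comes from.

```python
# Minimal sketch of KV cache sizing. Model dimensions below are illustrative
# assumptions; the exact model behind the "100,000 tokens ~ 40GB" figure
# isn't specified in the talk.

def kv_cache_bytes(num_tokens: int, num_layers: int, num_kv_heads: int,
                   head_dim: int, bytes_per_elem: int = 2) -> int:
    """Per-sequence KV cache size: 2 (K and V) * layers * KV heads * head dim
    * bytes per element, for every token kept in context."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem * num_tokens

# Example: a large dense model without grouped-query attention
# (80 layers, 64 KV heads, head_dim 128, FP16).
gb = kv_cache_bytes(100_000, num_layers=80, num_kv_heads=64, head_dim=128) / 1e9
print(f"100k tokens, no GQA: ~{gb:.0f} GB of KV cache")   # ~262 GB

# With grouped-query attention (e.g. 8 KV heads) the same context is ~33 GB,
# in the ballpark of the 40GB figure -- still a huge slice of an 80GB or
# 141GB HBM budget.
gb_gqa = kv_cache_bytes(100_000, num_layers=80, num_kv_heads=8, head_dim=128) / 1e9
print(f"100k tokens, 8 KV heads: ~{gb_gqa:.0f} GB of KV cache")
```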

Suddenly what the inference environment needs to do is drop data. I pre-calculate, I prefill, I create that KV cache and I start decoding on it, sending out tokens. But as I generate more data, I now need to drop my previous data. Eventually when I need it again, I need to recalculate it. We constantly see GPUs in inference environments that are recalculating things they already did. So you’re prefilling, decoding, prefilling again.
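
The decision the serving layer keeps making can be sketched as a simple cache check: if the KV cache for a prompt is still resident, decode directly; if it was dropped, the GPU has to prefill it again. This is a hypothetical illustration, not a real inference framework's API.

```python
# Sketch of the reuse-or-recompute decision a serving layer faces once HBM
# fills up and data gets dropped. Names are hypothetical placeholders.

kv_cache = {}  # prompt hash -> cached K/V state (stand-in: token count)

def prefill(prompt_tokens: list[int]) -> int:
    """Expensive: recompute attention K/V for every prompt token."""
    return len(prompt_tokens)          # stand-in for the real K/V tensors

def serve(prompt_tokens: list[int]) -> str:
    key = hash(tuple(prompt_tokens))
    if key in kv_cache:
        # Cache hit: skip straight to decoding new tokens.
        return "decode using cached context"
    # Cache miss: the GPU burns cycles redoing work it already did once.
    kv_cache[key] = prefill(prompt_tokens)
    return "prefill, then decode"

print(serve([101, 202, 303]))   # first call: prefill, then decode
print(serve([101, 202, 303]))   # same prompt again: decode using cached context
```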

The same thing, more in depth — we actually see large model providers (and I think if you look at the pricing it's very apparent: Anthropic, OpenAI and others) teaching you how to structure your prompts so that they hit the same GPU — so that you land on the GPU that already has your KV cache. Then they can just start decoding your data instead of recalculating it, because they would like to generate more tokens for you.
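
As a rough illustration of that routing idea, here is a sketch of prefix-aware request routing: requests that share a stable prompt prefix land on the same worker, so its KV cache can be reused. The worker list and hashing scheme are assumptions for illustration, not how any particular provider implements it.

```python
# Sketch of prefix-aware routing: send a request to the worker most likely
# to already hold the KV cache for its prompt prefix, so it can decode
# instead of re-prefilling. Purely illustrative.

import hashlib

WORKERS = ["gpu-0", "gpu-1", "gpu-2", "gpu-3"]

def route(prompt: str, prefix_chars: int = 2048) -> str:
    """Hash a stable prompt prefix (e.g. a shared system prompt) so that
    identical prefixes land on the same worker and reuse its KV cache."""
    digest = hashlib.sha256(prompt[:prefix_chars].encode()).digest()
    return WORKERS[int.from_bytes(digest[:4], "big") % len(WORKERS)]

print(route("You are a helpful assistant. ..."))  # same prefix -> same GPU
```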

We call that the “memory wall”: How do you climb that memory wall? How do you pass it? Eventually, that’s the key for modern, cost-effective inferencing. You can try to throw more GPUs at it. You can try to complicate your orchestration environment. We do see multiple environments, multiple companies trying to solve that in different ways.

For example, there are new models — linear models — that are trying to create smaller KV caches to be more efficient. There are environments that are saying, "Hey, I already calculated the KV cache on one GPU, it's in my GPU memory. Let's try to copy it, or maybe use my local environment for that." But how do you do that at scale, in a cost-effective manner that doesn't strain your memory and doesn't strain your networking? That's something we're helping some of these customers with.
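
The "spill it over instead of dropping it" idea can be sketched as a two-tier cache: hot KV blocks stay in HBM, cold blocks move to a larger external tier, and a restore is a copy rather than a recompute. The class, tier names and sizes here are assumptions for illustration, not WEKA's actual interface.

```python
# Sketch of a two-tier KV cache: keep hot blocks in HBM, spill cold blocks to
# a larger external tier instead of dropping them, and restore on reuse.
# Tier names and sizes are assumptions, not a real product interface.

from collections import OrderedDict

class TieredKVCache:
    def __init__(self, hbm_capacity_blocks: int):
        self.hbm = OrderedDict()       # block_id -> data, kept in LRU order
        self.external = {}             # spilled tier: CPU RAM / shared storage
        self.capacity = hbm_capacity_blocks

    def put(self, block_id, data):
        self.hbm[block_id] = data
        self.hbm.move_to_end(block_id)
        while len(self.hbm) > self.capacity:
            cold_id, cold_data = self.hbm.popitem(last=False)
            self.external[cold_id] = cold_data   # spill, don't recompute later

    def get(self, block_id):
        if block_id in self.hbm:
            self.hbm.move_to_end(block_id)
            return self.hbm[block_id]
        if block_id in self.external:
            # Restore is a copy over the network or PCIe -- far cheaper than
            # re-running prefill on the GPU.
            self.put(block_id, self.external.pop(block_id))
            return self.hbm[block_id]
        return None                              # true miss: must prefill
```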

08:11

Real Cost Savings from KV Cache Acceleration: 4.2x Performance Improvement Case Studies

Matt Marshall: It sounds like it's very much a big-company issue so far, with OpenAI, Anthropic, some of these bigger cluster users. But it's definitely something that's coming with agentic usage. We heard last time from Wonder, a company in food delivery, that was basically getting notifications from Azure in Northern Virginia saying they had a capacity crunch and to go find another location. So this sort of crunch is happening. Can you talk a little bit about the economics of the KV cache processes? How much are your customers saving?

Shimon Ben-David: So mileage will vary across use cases. For example, if I have a simple chatbot just getting a few questions and answering them, my context window is not that big, so maybe it's less valuable. But some use cases are very KV cache-heavy: code development, tax returns, anything with regulation — you have a lot of context and benefit from the KV cache.

In some of the benchmarking and working with customers — some inference providers, some LLM providers — we saw we can accelerate that by a factor of up to 4.2x. That’s a real number with multi-tenant variation inferencing.

Just to explain the magnitude of 4.2x: It sounds like a small number, right? Imagine that you have 100 GPUs and these 100 GPUs are emitting a certain amount of tokens. Now imagine that these 100 GPUs are working as if they were 420 GPUs, just by adding the KV cache acceleration layer we provide. That's a ridiculous amount of money. We're looking at some use cases where the savings would be millions of dollars per day for these inference providers.

Matt Marshall: That's significant.
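
As back-of-the-envelope arithmetic on what a 4.2x throughput factor implies, here is a short sketch; the hourly GPU rate is an illustrative assumption, not a figure from the talk.

```python
# Back-of-the-envelope on the 4.2x figure. The hourly GPU rate below is an
# illustrative assumption, not a number from the talk.

gpus = 100
speedup = 4.2
gpu_hour_cost = 3.00   # assumed $/GPU-hour

equivalent_gpus = gpus * speedup                   # 100 GPUs serving like 420
avoided_gpus = equivalent_gpus - gpus              # 320 GPUs you didn't buy or rent
daily_savings = avoided_gpus * gpu_hour_cost * 24
print(f"Equivalent fleet: {equivalent_gpus:.0f} GPUs, "
      f"avoided spend ~ ${daily_savings:,.0f}/day")

# At fleet sizes of thousands of GPUs, the same ratio is where the
# "millions of dollars per day" framing comes from.
```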

10:11

How LinkedIn Optimized AI Inference Using Speculative Decoding for 4x Throughput Gains

Matt Marshall: We’re out of time on the talk, let’s go to questions. Is anyone running inference and running into the efficiency issue?

Audience Member (AI Inference Optimizations at LinkedIn): We recently rolled out our own on-prem-hosted hiring assistant for LinkedIn, and I was working on that. We were facing a lot of memory-bound problems in the decoding stages, and we found that speculative decoding really saved us a lot of latency and increased throughput by around 4x — these are public numbers on the engineering blog, so feel free to look them up. But I'm just curious: How have you seen speculative decoding evolve, and how can we make it more efficient? Because right now it just looks at the context you have passed so far, but maybe that can be augmented with the token factories.

Shimon Ben-David: I think it really depends on what our customers are doing on our platform. I'm not familiar with any customer of ours that's running speculative decoding. But as I mentioned, it really depends on the amount of KV cache you're generating, and it's one more way to decrease the amount of KV cache, right? Because you generate less of it. But eventually there's this paradox that the more you improve, the more you do, right? So even with that, imagine adding that acceleration layer on top of it and getting even further gains.
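
For readers unfamiliar with the technique the questioner describes, here is a toy sketch of the speculative decoding loop: a cheap draft model proposes a few tokens, the expensive target model verifies them, and only the agreeing prefix is kept. Both "models" below are stand-in functions, not LinkedIn's or any framework's implementation.

```python
# Toy sketch of speculative decoding: a cheap draft model proposes k tokens,
# the expensive target model checks them, and only the longest agreeing
# prefix is kept. Both "models" are stand-in functions.

def target_next(context: list[int]) -> int:
    """Stand-in for the large model's next-token choice (expensive per call)."""
    return (sum(context) * 31 + len(context)) % 1000

def draft_next(context: list[int]) -> int:
    """Stand-in for the small draft model: right most of the time."""
    guess = target_next(context)
    return guess if len(context) % 5 else (guess + 1) % 1000  # occasional miss

def speculative_step(context: list[int], k: int = 4) -> list[int]:
    # 1. Draft k candidate tokens cheaply, one after another.
    proposed = []
    for _ in range(k):
        proposed.append(draft_next(context + proposed))
    # 2. Verify: keep the longest prefix the target model agrees with
    #    (in a real system this check is a single batched forward pass).
    accepted = []
    for tok in proposed:
        if tok == target_next(context + accepted):
            accepted.append(tok)
        else:
            break
    # 3. Always emit at least one token from the target model itself.
    accepted.append(target_next(context + accepted))
    return context + accepted

ctx = [1, 2, 3]
for _ in range(3):
    ctx = speculative_step(ctx)
print(ctx)  # several tokens emitted per target-model pass worth of work
```

Because each step emits several tokens for roughly one target-model pass, latency drops and fewer decode iterations touch the KV cache, which is where the throughput gains the questioner cites come from.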

12:02

Fractional GPU Solutions and Multi-Tenancy Strategies for Cloud Providers and Data Centers

Audience Member (Security at Equinix): One of the things that at least we are seeing with our customers: A lot of them are opting for fractional GPUs. So how does that help solve the inferencing problem, or have you seen your customers opting for fractional GPUs? Because I think, especially from a data center or cloud service provider perspective, that's one of the things people are figuring out on the back end.

Shimon Ben-David: Brilliant. I think it's a good question. We're actually working with a lot of the neoclouds, hyperscale cloud services and, increasingly now, sovereign clouds, where multi-tenancy — in terms of clustering the GPUs and clustering the storage — is a major concern.

We see some of them offering fractional GPUs. Realistically, what we most often see are actually clusters of GPU servers. Because it’s so expensive, tenants are just buying large fractions, large numbers of GPUs for themselves. We see some of the neoclouds that are offering an on-demand environment.

Eventually, a fractional GPU also means a fraction of that GPU's memory. So these sorts of solutions can take it even further: If you're getting a fractional GPU, you're also getting fractional memory, and now suddenly you're able to say, "Hey, I can spill my context over — beyond whatever my fractional memory can hold — and be significantly accelerated."

It's funny, because I also heard from another one of these GPU clouds that they're going to deploy their core environment — the core cloud — plus edge environments and, obviously, edge aggregation: a smaller data center, but still a data center. And their take on this type of acceleration was, "You're actually safeguarding my investment, because if I'm putting in somewhat slower GPUs — not best of breed, and not a lot of them — suddenly with this I can do much more."

14:26

Best Practices for KV Cache Storage Time Limits and LRU Eviction Policies in Production

Audience Member (Data Scientist): I have this question about the KV cache time limit. Do you have any time limit on how long the cache can be stored?

Shimon Ben-David: I think that's the million-dollar question. When we started, I have to say we were, engineering-wise, very naive: "Let's just extend the GPU memory and drop the data there." And then it was, "Oh yeah, we need to delete it at some point, because otherwise you're creating this sprawling token warehouse of information."

Eventually, what we're doing is working with customers that are looking at it, and what we're seeing is that different customers have different SLAs. For example, if we're working with an inference provider that says, "My SLA is 10 minutes for the data, and afterward everything is best effort," then if the data spills over to our data environment, you'd like to keep it as long as possible. From there, a least-recently-used algorithm applies: data stays as long as possible according to the customer's SLA, and if it's not needed, it can be evicted.
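
That retention policy can be sketched as a cache that guarantees entries for an SLA window and then evicts least-recently-used entries first when space is needed. The class name, capacity and SLA value below are hypothetical illustrations, not WEKA's implementation.

```python
# Sketch of the retention policy described: entries are guaranteed for an SLA
# window (e.g. 10 minutes), kept best-effort after that, and evicted
# least-recently-used first when space is needed. Values are hypothetical.

import time
from collections import OrderedDict

class SlaLruCache:
    def __init__(self, capacity: int, sla_seconds: float = 600.0):
        self.entries = OrderedDict()   # key -> (value, last_access_time), LRU order
        self.capacity = capacity
        self.sla = sla_seconds

    def put(self, key, value):
        self.entries[key] = (value, time.monotonic())
        self.entries.move_to_end(key)
        self._evict_if_needed()

    def get(self, key):
        if key not in self.entries:
            return None
        value, _ = self.entries[key]
        self.entries[key] = (value, time.monotonic())
        self.entries.move_to_end(key)
        return value

    def _evict_if_needed(self):
        now = time.monotonic()
        for key in list(self.entries):        # oldest (least recently used) first
            if len(self.entries) <= self.capacity:
                break
            _, last_access = self.entries[key]
            if now - last_access >= self.sla:  # never evict inside the SLA window
                del self.entries[key]
        # If every entry is still inside its SLA window, the cache temporarily
        # overshoots capacity -- the guarantee takes priority over the limit.

cache = SlaLruCache(capacity=2, sla_seconds=600)
cache.put("session-a", "kv blocks for session a")
```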

It's a really good question. One more thing I'll say is that we obviously see this seeping into enterprises. We're talking a lot about large inference providers, hyperscalers, neoclouds, but we see next year as the year of the enterprise. AI is becoming more and more dominant. And I mentioned this 4.2x improvement: Imagine an enterprise that's running 100 GPUs getting even just 120 GPUs' worth — 20 GPUs is still hundreds of thousands of dollars.

16:09

Power Efficiency and Green Computing Requirements for AI Data Centers in 2025

Audience Member (Data Center Admin): I’m really interested in hearing all this because it helped me understand what you guys are trying to solve. My question is: We’ve been trying to solve how to get more power to build the clusters. How is it really location-agnostic? Is it really latency-agnostic? I feel like this is something that sparks a lot of questions.

Matt Marshall: Maybe what your customers want.

Shimon Ben-David: So, customers want it for free and at zero power.

[Laughter]

One of our investors is Al Gore's Generation IM, which has a very green mentality. In the past, the green mentality meant, "Let's have a checkbox. We're green. Boom." Now I'm sad, but happy, to say that the green mentality also equates to dollars.

So the No. 1 requirement we see from our customers is: Can you be more cost-effective, more power-efficient? My power utilization doubles or quadruples every several years — if I can even get that power to begin with. By the way, we see our customers being very creative about how they generate power. A lot of these neoclouds are located in environments where they can generate the power: power plants, nuclear power plants, renewable energy, hydro. And then it's: let's use it as efficiently as possible.

One thing I would mention along these lines: The denser your solution is — the less power, the fewer switches, the less networking, cabling, rack footprint, cooling and heating — the better. So that's No. 1 out of a list of 100 requirements.

18:20

GPU Cluster Optimization Techniques: From Instruction-Level Parallelism to Coarse-Grained Computing

Audience Member: This reminds me of some of the work I did on instruction-level parallelism back in the early 1990s — a lot of code extraction and code optimization in the L1 cache. Do you see more happening at the coarse-grained level now — L1, L2, L3 — so not optimizing the KV cache at the instruction level, but across a cluster of GPUs, doing coarse-grained optimization similar to how traditional HPC computing was done?

Shimon Ben-David: Really good topic. We were actually working with NVIDIA on their Dynamo project, and one of the things they mentioned is that we do see other methods trying to solve this KV cache challenge. And yes, we definitely see multiple tiers within the server itself being allocated for it, which is nice. There are also some proprietary hardware devices meant to accelerate cache within a server. The problem at scale is somewhat different.

Audience Member: It’s eventually morphed into those kinds of problems.

Shimon Ben-David: Exactly.

19:28

Future of GPU Costs, HBM Shortages, and Enterprise AI Cost Predictability Strategies

Audience Member: I have a two-part question. First, we have seen the overall cost of GPUs — all of this hardware — getting lower over time. What do you see in the near future? Can the overall cost of the GPU, including the memory, decrease significantly?

Second, in the enterprise environment, we usually hope to get the cost upfront and very predictable. It's all tradeoffs, right? It's about the SLA, the latency, all the parts. At the end of the day, we also want to balance the cost. So if we can get the overall predicted cost, then we can optimize what we do. We can also optimize a certain workflow so that, if it's not urgent, it can be deprioritized and run later to help optimize the cost in general.

Shimon Ben-David: What do we see with GPU costs and memory? It's actually worsening. As you know, there's a shortage in flash devices worldwide, and that extends to memory: NAND and HBM costs are increasing significantly, along with flash devices, NVMe and SSDs. So it's going to get worse before it gets better. I was talking with somebody today about how, during COVID, everybody bought a certain amount of something and then bought it again — and that's the kind of rush we're seeing with everybody buying NAND today.

I would say that if you can use whatever is available, it gives you a lot of flexibility. I will point to WEKA a bit — it's not a sales pitch, as we mentioned — but because we created a software-defined environment, we can use multiple types of NAND. So whatever is available, we will use. More than that, we created an environment that can fold in and run all of that within your GPU servers, without the need for external NAND. The more flexible you are, the more you'll be able to provide during this time of shortage — and obviously you want to do that. It will also be harder to get GPUs, so the more you can get out of your GPUs, obviously, the better.

To the second part of your question, around predictability: I think in enterprises that's the key, because we see a lot of enterprise AI workloads that are still pointed inward. A lot of AI use cases are not external-facing — they're still being used internally: internal chatbots, A/B environments. Next year we think we'll see more of it going outbound, especially with sovereign clouds that are advancing technology and research for their sovereign environments. Their ability to be predictable is going to be more important.

Taking that shortage together with this SLA and predictability requirement, what we're saying is: If we can provide an environment where you have a set amount of known FLOPS (floating-point operations per second) on your GPUs, known memory, and known KV cache capacity — and the knob you need to turn to meet your SLAs is this augmented memory environment, which is a fairly cheap knob compared to buying 100 or 1,000 more GPUs — that's definitely an advantageous way to go.
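
The predictability argument can be sketched as simple capacity planning: with known HBM per GPU and a known per-request KV cache footprint, you can compute how many long contexts fit on-GPU and how much augmented capacity closes the gap to a target concurrency. All numbers below are illustrative assumptions.

```python
# Sketch of the predictability argument: with known HBM per GPU and known
# per-request KV cache footprint, compute how many concurrent long contexts
# fit on-GPU and how much augmented (external) KV capacity closes the gap to
# a target concurrency. All numbers are illustrative assumptions.

hbm_per_gpu_gb = 141           # e.g. an H200-class GPU
weights_and_overheads_gb = 90  # model shard + activations + runtime overhead
kv_per_request_gb = 40         # ~100k-token context, as in the earlier example

target_concurrency_per_gpu = 8

kv_budget_on_gpu = hbm_per_gpu_gb - weights_and_overheads_gb
fits_on_gpu = int(kv_budget_on_gpu // kv_per_request_gb)
shortfall = max(0, target_concurrency_per_gpu - fits_on_gpu)
augmented_capacity_gb = shortfall * kv_per_request_gb

print(f"On-GPU KV budget: {kv_budget_on_gpu} GB -> {fits_on_gpu} long contexts")
print(f"Need ~{augmented_capacity_gb} GB of external KV capacity per GPU "
      f"to hit {target_concurrency_per_gpu} concurrent contexts")
```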

Like This Discussion? There’s More!

Shimon Ben-David is passionate about helping customers solve their memory crises. He recently spoke at the AI Infra Summit in San Francisco about how memory-first architectures will help solve inference challenges.