00:00 How AI Workloads Have Evolved From Training to Agentic Systems
From Classical ML to Agentic AI
We are in the age of agentic AI. What does that mean? It means running models, code, and software that become an integral part of our workforce, performing operations for us.
If we look at AI workloads across the years and phases, we started many years ago with massive AI and ML training environments: feeding data to models and building the models. That is where the notion of data as code comes from: the more data you have, the more accurate your models become.
The Progression Through AI Eras
Then, a few years ago, we moved on to large language models, the ChatGPT-style environments where models had already been built and could generate content with little context.
Then to reasoning, where you're not just generating the next letter or word: the model generates the next words, doubts itself, makes a better decision, and provides a better answer.
All the way to agentic workloads—agents taking information, reasoning around it, and then performing a task.
02:26 What's the Difference Between AI Training and Inference?
Understanding the Distinction
Just to make it obvious: Training is generating the models, whether it’s classical AI or ML. Inferencing is running these models at scale, or with the purpose of running them at scale.
Training Environments: The HPC Model
If we’re looking at the environments we’re seeing today, training is a certain environment:
- High GPU utilization to maximize throughput
- Very long batch operations; we work with organizations running training for hours, days, and weeks
- It's very much an HPC environment: you throw jobs at it as batch operations, and a job completes in a few weeks
- All the notions around checkpointing the environment and recovering from checkpoints apply (a minimal sketch follows this list)
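To make the checkpointing point concrete, here is a minimal sketch of what checkpoint-and-recover looks like in a plain PyTorch training loop. The path and function names are illustrative, not any specific framework's API.

```python
# Minimal checkpoint/restore sketch for a long-running training job (PyTorch).
# All names and paths here are illustrative; real training frameworks wrap this logic.
import torch

CKPT_PATH = "/shared/checkpoints/run_042.pt"  # hypothetical shared-storage path

def save_checkpoint(model, optimizer, step):
    # Persist everything needed to resume: weights, optimizer state, progress.
    torch.save(
        {"model": model.state_dict(),
         "optimizer": optimizer.state_dict(),
         "step": step},
        CKPT_PATH,
    )

def load_checkpoint(model, optimizer):
    # Recover from the last checkpoint instead of restarting a weeks-long job.
    ckpt = torch.load(CKPT_PATH, map_location="cpu")
    model.load_state_dict(ckpt["model"])
    optimizer.load_state_dict(ckpt["optimizer"])
    return ckpt["step"]
```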
Inference Environments: Real-Time and Bursty
Inferencing, on the other hand—which is what the majority of organizations are going to move to—is completely different in terms of requirements.
Inferencing is a very low-latency, real-time environment. Think about it: when your kid or your employee runs a query against ChatGPT or your corporate AI workloads, getting an answer in a second versus a minute is a completely different experience.
These inferencing environments are completely different from training:
- Real-time
- Bursty: employees work at different hours of the day, asking for different amounts of data
- Not sustained: you could have a peak of inference and then a drop
04:22 Why Compute Alone Doesn't Solve AI Inference Scaling
Compute Alone Doesn’t Solve the Problem
As you deploy inference environments at scale, you’re bringing up massive GPU environments or accelerator environments. You’re throwing more GPU cycles, more GPU memory. All of these GPUs have memory within them. You’re throwing that at the environment, and all the prompts are going into your GPUs and asking for information.
As you scale in a production-grade environment—not when you’re running on your laptop or one or two POC environments, but in a production environment at scale—compute alone doesn’t solve your scaling problem.
Throwing more GPUs at the problem to accommodate more users doesn't necessarily solve it in an economical way. If you look at how inference actually works, it eventually becomes bottlenecked on GPU memory.
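A rough back-of-the-envelope calculation shows why memory, not compute, becomes the bottleneck. The model dimensions and batch size below are illustrative assumptions, not tied to any specific model.

```python
# Back-of-the-envelope KV-cache sizing: why inference is memory-bound.
# The model parameters below are illustrative, not any specific model.
layers      = 80          # transformer layers
kv_heads    = 8           # KV heads (with grouped-query attention)
head_dim    = 128
bytes_fp16  = 2
context_len = 128_000     # tokens of context per request
batch       = 32          # concurrent requests

# 2x for keys and values, per layer, per token.
kv_bytes_per_token = 2 * layers * kv_heads * head_dim * bytes_fp16
total_gb = kv_bytes_per_token * context_len * batch / 1e9

print(f"{kv_bytes_per_token/1e3:.0f} KB per token, "
      f"{total_gb:.0f} GB of KV cache for this batch")
# ~328 KB per token and ~1,342 GB of KV cache for the batch:
# far beyond the HBM of any single GPU.
```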
The Reality of Production Inference
Initially, when organizations are playing with inferencing, they think inference looks like, “I have a user, the user queries an LLM, gets an answer, and that’s it.” That’s correct in a POC environment.
In a production environment, the reality is usually closer to this: multiple components, including LLM routers, tens to thousands of GPUs and GPU servers, multiple vector databases, graph databases, and reasoning environments. It's a production-grade environment, and inference environments have a lot of moving parts.
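To give a feel for those moving parts, here is a hypothetical sketch of a single request's path through such a stack. Every endpoint, service name, and field in it is an assumption for illustration only.

```python
# Illustrative request path through a production inference stack.
# Every endpoint, client, and function name here is hypothetical; the point is
# the number of moving parts between a user prompt and an answer.
import httpx

ROUTER_URL    = "http://llm-router.internal/route"   # hypothetical
VECTOR_DB_URL = "http://vector-db.internal/search"   # hypothetical

def answer(prompt: str, tenant: str) -> str:
    with httpx.Client(timeout=30) as http:
        # 1. The router picks a model/GPU pool for this tenant and workload.
        target = http.post(ROUTER_URL, json={"prompt": prompt, "tenant": tenant}).json()

        # 2. Retrieval augments the prompt from a vector (or graph) database.
        docs = http.post(VECTOR_DB_URL, json={"query": prompt, "top_k": 5}).json()

        # 3. The chosen inference server runs prefill + decode and returns tokens.
        completion = http.post(
            target["endpoint"] + "/v1/completions",
            json={"prompt": prompt, "context": docs, "max_tokens": 512},
        ).json()
    return completion["text"]
```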
06:03 Understanding Latency, Prefill, and Decode in Production AI
No One-Size-Fits-All Solution
There's no single inference environment that can satisfy all of your inference requirements. A video search and summarization inference environment is not identical to a physical intelligence environment, which is not identical to a chatbot or a risk-assessment inference environment.
The Critical Role of Latency
However, there are some commonalities. For example, the importance of latency: the lower the latency of the environment, the better across all your inference workloads.
If I’m chatting with a smart city environment and asking for suspicious characters, and get an answer from my vector database and graph database in a millisecond versus 10 seconds? Completely different value.
If I’m looking for some data on my storage and getting it in microseconds instead of milliseconds? Completely different value.
Latency is hugely important.
Prefill vs. Decode: The Economic Equation
There are two macro notions within GPU inferencing: prefill and decode. Obviously it’s much more complicated than that, but in general:
- Prefill: When you’re inputting data to the LLM (your text documents, an image, your code)
- Decode: When the LLM generates outputs
You'd like to decode as much as possible, because prefill is a necessary evil. Decoding is where you emit tokens, where you generate value. If you're selling those tokens, that's what you can charge for.
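A minimal sketch with the Hugging Face transformers library makes the two phases concrete. The tiny model and token counts are illustrative, but the split between one prefill pass and many decode steps is the general pattern.

```python
# Prefill vs. decode made concrete with Hugging Face transformers.
# The model name is just an example; any causal LM works the same way.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok   = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

prompt_ids = tok("Summarize this design doc:", return_tensors="pt").input_ids

with torch.no_grad():
    # Prefill: one big, compute-heavy pass over the whole prompt.
    # It produces the KV cache (past_key_values) that decode will reuse.
    out = model(prompt_ids, use_cache=True)
    past = out.past_key_values
    next_id = out.logits[:, -1].argmax(dim=-1, keepdim=True)

    # Decode: one token at a time, each step reading the growing KV cache.
    # This memory-bandwidth-bound phase is where the output tokens (the value) come from.
    for _ in range(32):
        out = model(next_id, past_key_values=past, use_cache=True)
        past = out.past_key_values
        next_id = out.logits[:, -1].argmax(dim=-1, keepdim=True)
```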
06:59 The GPU Memory Wall Problem: Why Context Sizes Outpace Hardware
Agent Swarms Compound the Problem
Now we have the rise of agent swarms. We talked about agentic workloads; the problem is compounded further because agents are generating and controlling more agents, each of them reasoning and generating more tokens. Your already strained GPU memory environment becomes an even bigger problem.
The Economic Reality
How do you create these environments in a cost-efficient way? Just throwing more GPUs, more networking—that doesn’t solve it economically. You need a good ROI on your inference environments. Otherwise, what’s the value?
The Token Cost Paradox
As accelerators, newer transformer architectures, and newer models generate more tokens per second, you would expect the cost per token to decrease. Generating 2,000 tokens per second today and 4,000 tomorrow should obviously drive the cost per token down.
But we're seeing models whose context sizes increase significantly and which generate far more tokens. The cost per token decreases, yet the overall cost of your token environment increases because you need so many more tokens. That's one paradox we're aware of.
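A toy calculation with made-up numbers illustrates the paradox:

```python
# The token cost paradox with illustrative numbers (not real pricing).
cost_per_token_today    = 1.00   # arbitrary units
cost_per_token_tomorrow = 0.50   # hardware/software gets 2x more efficient

tokens_per_task_today    = 10_000    # short-context, single-shot answers
tokens_per_task_tomorrow = 50_000    # long context + reasoning + agent chatter

print(cost_per_token_today * tokens_per_task_today)        # 10,000
print(cost_per_token_tomorrow * tokens_per_task_tomorrow)  # 25,000
# Cost per token halved, but each task now costs 2.5x more overall.
```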
The Numbers Behind the Memory Wall
This is where agentic AI is hitting the memory limits of your infrastructure. That’s the big problem.
We can see that memory capacity across GPUs is increasing, roughly doubling every two years: 40 GB, 80 GB, 130 GB in some accelerators, and more. Meanwhile, the context sizes of models have increased by 240x over the last two years.
Models now require amounts of GPU memory that the GPUs simply don't have. So you need to be creative: tensor parallelism, running across multiple GPUs, creating more inference servers. That's the memory wall: the gap between the GPU memory you have and the memory the model needs to run keeps growing.
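As one example of that creativity, open-source serving stacks such as vLLM can shard a model and its KV cache across several GPUs with tensor parallelism. The sketch below is illustrative only, and the model choice and exact parameter names depend on your vLLM version.

```python
# One common way past a single GPU's memory limit: shard the model and its
# KV cache across GPUs with tensor parallelism. Sketch using vLLM's Python API;
# verify parameter names against your vLLM version.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-70B-Instruct",  # illustrative model choice
    tensor_parallel_size=4,                     # split weights and KV cache across 4 GPUs
    max_model_len=32768,                        # cap context so it fits the shards
)

outputs = llm.generate(
    ["Explain the memory wall in one paragraph."],
    SamplingParams(max_tokens=256),
)
print(outputs[0].outputs[0].text)
```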
Scaling the Memory Wall
How do we create an economic inference environment at scale? We need to scale that memory wall. We need to make sure that even though we have that memory limit, we go beyond it in an economical way.
This is where we introduce the augmented memory revolution.
10:21 How Token Warehouses Enable Prefill Once, Decode Forever
The Vision: Endless GPU Memory Capacity
What if you could actually have endless capacity for your GPU memory? If my GPU has 80 GB or 130 GB but I now need terabytes or petabytes of tokens, what if I could have that without constantly recalculating my cache, without taking my data and recreating it over and over?
The Token Warehouse™ Concept
Essentially, that's the notion of a token warehouse: if I've already prefilled data and created tokens and KV-cache structures, then instead of deleting them when I run out of GPU memory, I offload them to my augmented memory environment.
That would allow me to prefill once and decode forever. Every piece of data, my tax documents, my pictures, my code, would always be available to the GPUs, even though it doesn't strictly live in GPU memory. Instead of recalculating it, the GPU just loads it.
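As an application-level illustration of the idea (a real token warehouse implements this inside the serving and storage stack, not in user code), a sketch might look like the following. The cache directory, key scheme, and the assumption that past_key_values is a picklable tuple of tensors are all illustrative.

```python
# Conceptual "prefill once, decode forever": offload the KV cache for a prompt
# prefix to shared storage and reload it instead of re-running prefill.
# Paths and names are hypothetical; past_key_values is assumed to be the
# legacy tuple-of-tensors format, which torch.save can serialize.
import hashlib, os, torch

CACHE_DIR = "/mnt/augmented-memory/kv-cache"   # hypothetical shared filesystem

def _key(prompt: str) -> str:
    return hashlib.sha256(prompt.encode()).hexdigest()

def offload_kv(prompt: str, past_key_values) -> None:
    # Persist the prefill result instead of throwing it away under memory pressure.
    torch.save(past_key_values, os.path.join(CACHE_DIR, _key(prompt) + ".pt"))

def load_kv(prompt: str):
    # On a repeat prefix, load the cache and skip straight to decode.
    path = os.path.join(CACHE_DIR, _key(prompt) + ".pt")
    if os.path.exists(path):
        return torch.load(path, map_location="cuda")
    return None  # cache miss: fall back to prefill
```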
Previous Attempts and Why They Failed
This is something that has been tried in the past:
- With DRAM, very local to the GPU server
- With local NVMe, still available for the GPU servers
- With shared storage (basic or high-performance shared storage) that couldn’t handle it and wasn’t fast enough
Basically, recalculating was more efficient than loading.
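A quick break-even calculation with illustrative numbers shows why: offloading only pays off when the storage path can deliver the cache faster than the GPU can recompute it.

```python
# Why earlier attempts failed: if loading the saved cache is slower than
# recomputing it, offload has no value. Illustrative numbers only.
kv_cache_gb          = 40     # cache for a long context (see sizing above)
prefill_seconds      = 8.0    # time to recompute it on the GPU
storage_gbytes_per_s = 2.0    # a modest shared-storage path

load_seconds = kv_cache_gb / storage_gbytes_per_s
print(load_seconds)          # 20 s to load vs 8 s to recompute: recompute wins

# At roughly 40+ GB/s of delivered bandwidth the balance flips:
print(kv_cache_gb / 40.0)    # 1 s to load vs 8 s to recompute: load wins
```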
11:48 Real-World Results: Achieving 4.2x More Tokens with Augmented Memory
Testing the Theory
We had that theory at WEKA, and what we did is we tested it in a production environment.
We took eight GPU servers and ran a real engineering development use case on them. By treating WEKA as augmented memory for those eight GPU servers, we could generate 4.2x more tokens for developers.
The Economic Impact
To put it in perspective: imagine that the memory of our 64-GPU environment could behave like that of 256 GPUs. How much does a GPU cost? There's real value here.
12:37 How to Build Cost-Effective Production AI Inference Infrastructure
The Essential Components
What does it take to win in a scale-out, production-grade inference environment, which the majority of organizations are now going to build?
A cost-effective, high-performance token warehouse:
- Prefill once, decode forever
- Don’t throw away tokens
- Generate more tokens by increasing KV cache hit rate
By doing this, you generate more tokens, it's more economical, and the user experience improves.
If I'm a developer working in your environment, I get better feedback: better time to first token and better inter-token latency.
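As one concrete, openly documented lever for raising the KV cache hit rate, serving frameworks such as vLLM offer automatic prefix caching so that shared prompt prefixes are prefilled once and reused. The model name and parameters below are illustrative; check your vLLM version.

```python
# One lever for KV-cache hit rate in open-source serving stacks:
# automatic prefix caching in vLLM, so repeated prompt prefixes (system prompts,
# shared code context) are prefilled once and reused.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # illustrative model choice
    enable_prefix_caching=True,                # reuse KV blocks for shared prefixes
)

system = "You are a code assistant for repository X. Follow the style guide.\n"
params = SamplingParams(max_tokens=128)

# Both requests share the long system prefix; the second reuses its KV cache,
# improving time to first token.
llm.generate([system + "Refactor function A."], params)
llm.generate([system + "Add tests for function B."], params)
```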
Beyond Performance: The Complete Solution
Inferencing is not just about performance and tokens per second.
It’s also:
- How can I create this inference environment in a simple fashion?
- How can I do it in a secure way? We work with a lot of GPU clouds where multi-tenancy is a must. How can you do it in a way that your tenants can infer securely and separately?
- How do you create model repositories where your agentic workloads can just load multiple models without copying the models between the different GPUs and local NVMe?
That’s what it takes to win.
Profitable AI requires overcoming the memory wall, simply because you’ll get better outcomes and it will be more cost-effective.