VIDEO

How Memory-First Architecture Solves AI Inference Challenges

As organizations move from training to inference, WEKA CTO Shimon Ben-David explains how scaling the memory wall is the key to unlocking AI’s strategic value.

00:00

How AI Workloads Have Evolved From Training to Agentic Systems

From Classical ML to Agentic AI

We are in the age of agentic AI. What does that mean? It means running models, code, and software that become an integral part of our workforce, performing operations on our behalf.

If we look at AI workloads across the years and phases, we started many years ago with massive AI and ML environments: training environments feeding data to models and building the models. The underlying notion was data as code: the more data you have, the more accurate your models become.

The Progression Through AI Eras

Then, a few years ago, we moved on to large language models: the ChatGPT-style environments, where pre-trained models could generate data from very little context.

Moving on to reasoning, where you’re not just generating the next letter or word. You’re actually using models that generate the next words, then doubt themselves, make a better decision, and provide a better answer.

All the way to agentic workloads—agents taking information, reasoning around it, and then performing a task.

02:26

What's the Difference Between AI Training and Inference?

Understanding the Distinction

Just to make it obvious: Training is generating the models, whether it’s classical AI or ML. Inferencing is running these models at scale, or with the purpose of running them at scale.

Training Environments: The HPC Model

If we’re looking at the environments we’re seeing today, training is a certain environment:

  • High GPU utilization to maximize throughput
  • Very long batch operations; we work with organizations running training for hours, days, and weeks
  • It’s very much an HPC environment: you submit jobs as batch operations, and a job completes in a few weeks
  • All the notions around checkpointing the environment and recovering from checkpoints apply (see the sketch after this list)
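
As a sketch of that checkpoint-and-recover pattern, here is a minimal PyTorch-style example; the model, optimizer, and checkpoint path are placeholders rather than anything specific to the environments described in the talk.

```python
# Minimal checkpoint/restore sketch for a long-running training job.
# Model, optimizer, and CKPT_PATH are placeholders; torch.save/torch.load are standard PyTorch.
import os
import torch

CKPT_PATH = "checkpoint.pt"

def save_checkpoint(model, optimizer, step):
    # Persist everything needed to resume: weights, optimizer state, and progress.
    torch.save(
        {"step": step,
         "model": model.state_dict(),
         "optimizer": optimizer.state_dict()},
        CKPT_PATH,
    )

def load_checkpoint(model, optimizer):
    # Start fresh if no checkpoint exists; otherwise resume where the job stopped.
    if not os.path.exists(CKPT_PATH):
        return 0
    ckpt = torch.load(CKPT_PATH)
    model.load_state_dict(ckpt["model"])
    optimizer.load_state_dict(ckpt["optimizer"])
    return ckpt["step"]
```
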
Inference Environments: Real-Time and Bursty

Inferencing, on the other hand—which is what the majority of organizations are going to move to—is completely different in terms of requirements.

Inferencing is a very low latency, real-time environment. Think about it: when your kid or your employee runs a query against ChatGPT or your corporate AI workloads, getting an answer in a second versus a minute is a completely different experience.

These inferencing environments are completely different from training:

  • Real-time
  • Bursty: employees work at different hours of the day, asking for different amounts of data
  • Not sustained: you could have a peak of inference and then a drop

04:22

Why Compute Alone Doesn't Solve AI Inference Scaling

Compute Alone Doesn’t Solve the Problem

As you deploy inference environments at scale, you’re bringing up massive GPU or accelerator environments. You’re throwing more GPU cycles and more GPU memory at the problem (every GPU has memory within it), and all the prompts go into your GPUs asking for information.

As you scale in a production-grade environment—not when you’re running on your laptop or one or two POC environments, but in a production environment at scale—compute alone doesn’t solve your scaling problem.

Throwing more GPUs at the problem to accommodate more users doesn’t necessarily solve it in an economical way. If you look into how inference actually works, it is ultimately bottlenecked on the memory of the GPUs.
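
To see why, it helps to run the back-of-envelope math on the KV cache, the per-request working state an LLM keeps in GPU memory during inference. The model dimensions below are illustrative (roughly Llama-style), not tied to any specific deployment.

```python
# Back-of-envelope KV cache sizing for a hypothetical Llama-style model (illustrative numbers).
n_layers, n_kv_heads, head_dim, bytes_per_elem = 32, 8, 128, 2   # fp16 keys/values

# Keys + values, for every layer, per token:
kv_bytes_per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem   # 131,072 B ≈ 128 KiB

context_tokens = 32_768
per_request_gib = kv_bytes_per_token * context_tokens / 2**30    # 4.0 GiB per request

concurrent_requests = 40
print(per_request_gib * concurrent_requests)   # 160 GiB of KV cache: well past an 80 GB GPU
```

Even before accounting for the model weights themselves, a modest number of concurrent long-context users exhausts the memory of a single GPU.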

The Reality of Production Inference

Initially, when organizations are playing with inferencing, they think inference looks like, “I have a user, the user queries an LLM, gets an answer, and that’s it.” That’s correct in a POC environment.

Usually, in a production environment, the reality is closer to this: you have multiple components, including LLM routers, tens or thousands of GPUs and GPU servers, multiple vector databases, graph databases, and reasoning environments. Inference environments have a lot of moving parts.
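
To make those moving parts concrete, here is a highly simplified sketch of one request’s path through such an environment. Every object here (router, vector_db, graph_db, inference_pool) is a placeholder interface used for illustration, not any specific product’s API.

```python
# Illustrative request path through a production inference stack (placeholder interfaces).
def answer(query, router, vector_db, graph_db, inference_pool):
    # 1. Pull supporting context from the data layer.
    documents = vector_db.search(query, top_k=5)
    relations = graph_db.neighbors(query)

    # 2. The LLM router picks a model/endpoint suited to this kind of request.
    endpoint = router.route(query)

    # 3. One of many GPU servers runs the actual model call with the retrieved context.
    return inference_pool.generate(endpoint, query, context=documents + relations)
```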

06:03

Understanding Latency, Prefill, and Decode in Production AI

No One-Size-Fits-All Solution

There’s no single inference environment that can satisfy all of your inference requirements. A video search and summarization inference environment is not identical to a physical intelligence environment, which is not identical to a chatbot or a risk-assessment inference environment.

The Critical Role of Latency

However, there are some commonalities. For example, the importance of latency: the lower the latency of the environment, the better across all your inference workloads.

If I’m chatting with a smart city environment, asking it to flag suspicious characters, and I get an answer from my vector database and graph database in a millisecond versus 10 seconds? Completely different value.

If I’m looking for some data on my storage and getting it in microseconds instead of milliseconds? Completely different value.

Latency is hugely important.

Prefill vs. Decode: The Economic Equation

There are two macro notions within GPU inferencing: prefill and decode. Obviously it’s much more complicated than that, but in general:

  • Prefill: When you’re inputting data to the LLM (your text documents, an image, your code)
  • Decode: When the LLM generates outputs

You’d like to decode as much as possible, because prefill is a necessary evil. Decoding is where you emit tokens and generate value; if you’re selling tokens, decode is where you have more tokens to charge for.
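
A minimal sketch of the two phases, using a hypothetical model interface (prefill and decode_step are illustrative names, not a real framework API):

```python
# Sketch of the two inference phases with a hypothetical `model` object.
def generate(model, prompt_tokens, max_new_tokens):
    # Prefill: process the whole prompt in one pass and build the KV cache.
    # Compute-heavy, and it produces no billable output tokens.
    kv_cache, last_token = model.prefill(prompt_tokens)

    # Decode: generate output tokens one at a time, reusing the KV cache.
    # This is the phase that actually produces (sellable) tokens.
    output = []
    for _ in range(max_new_tokens):
        last_token, kv_cache = model.decode_step(last_token, kv_cache)
        output.append(last_token)
    return output
```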

06:59

The GPU Memory Wall Problem: Why Context Sizes Outpace Hardware

Agent Swarms Compound the Problem

We have the rise of agent swarms. We talked about agentic workloads; the problem is now compounded further because agents are generating and controlling more agents, and each of them is reasoning and generating more tokens. Your already strained GPU memory environment becomes an even bigger problem.

The Economic Reality

How do you create these environments in a cost-efficient way? Just throwing more GPUs, more networking—that doesn’t solve it economically. You need a good ROI on your inference environments. Otherwise, what’s the value?

The Token Cost Paradox

As accelerators and newer transformers and models generate more tokens per second, you would expect the cost per token to decrease. Generating 2,000 tokens per second today and 4,000 tomorrow obviously lowers the cost per token.

But we’re seeing models whose context sizes have increased significantly, and they’re generating more tokens. The cost per token decreases, yet the overall cost of your tokenized environment increases because you need far more tokens. That’s one paradox we’re aware of.
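
A tiny worked example of the paradox, with made-up numbers:

```python
# Illustrative only: per-token cost halves, but queries consume 10x more tokens.
cost_per_token_today,    tokens_per_query_today    = 1.00, 1_000
cost_per_token_tomorrow, tokens_per_query_tomorrow = 0.50, 10_000   # faster chips, longer contexts

print(cost_per_token_today * tokens_per_query_today)        # 1000.0
print(cost_per_token_tomorrow * tokens_per_query_tomorrow)  # 5000.0
# Cost per token fell 2x, yet the overall cost of the workload rose 5x.
```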

The Numbers Behind the Memory Wall

This is where agentic AI is hitting the memory limits of your infrastructure. That’s the big problem.

We can see memory capacity across GPUs increasing, roughly doubling every two years: 40 GB, 80 GB, 130 GB or more in some accelerators. Meanwhile, the context sizes of models have increased by 240x in the last two years.

Models now require massive amounts of GPU memory that the GPUs simply don’t have. So you need to get creative: tensor parallelism, running across multiple GPUs, creating more inference servers. That’s the memory wall: the memory your GPUs have keeps falling further behind the memory your models need to run.
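
Putting the two growth rates quoted above side by side shows how quickly the gap opens; these are the talk’s figures, not new measurements.

```python
# Single-GPU memory roughly doubled over two years, while context sizes grew ~240x.
hbm_growth_2yr = 80 / 40          # e.g. 40 GB -> 80 GB
context_growth_2yr = 240

print(context_growth_2yr / hbm_growth_2yr)   # ~120x: demand outgrew single-GPU memory
```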

Scaling the Memory Wall

How do we create an economic inference environment at scale? We need to scale that memory wall. We need to make sure that even though we have that memory limit, we go beyond it in an economical way.

This is where we introduce the augmented memory revolution.

10:21

How Token Warehouses Enable Prefill Once, Decode Forever

The Vision: Endless GPU Memory Capacity

What if you could actually have endless capacity for your GPU memory? If my GPU has 80 GB or 130 GB but I now need terabytes or petabytes of tokens, what if I could have that capacity without constantly recalculating my cache or recreating my data?

The Token Warehouse™ Concept

Essentially, that’s the notion of a token warehouse. If I’ve already prefilled data, created tokens, and built KV cache structures, then instead of deleting them when GPU memory runs out, I offload them to my augmented memory environment.

Then, essentially, that would allow me to prefill once and decode forever. Every piece of data (my tax documents, my pictures, my code) would always be available to the GPUs, even though it doesn’t strictly reside in GPU memory. Instead of recalculating it, the GPU just loads it.
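
Conceptually, a token warehouse is a cache keyed by input you have already prefilled. The sketch below is illustrative only: the warehouse object stands in for any sufficiently fast shared store, and none of these calls are a WEKA API.

```python
# Conceptual token warehouse: offload the KV cache instead of discarding it,
# then reload it on the next request so each prompt is prefilled only once.
import hashlib

def get_kv_cache(model, prompt_tokens, warehouse):
    key = hashlib.sha256(str(prompt_tokens).encode()).hexdigest()

    cached = warehouse.get(key)              # hit: load KV blocks from the shared store
    if cached is not None:
        return cached                        # skip prefill entirely

    kv_cache, _ = model.prefill(prompt_tokens)   # miss: pay the prefill cost once...
    warehouse.put(key, kv_cache)                 # ...then keep it for every future request
    return kv_cache
```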

Previous Attempts and Why They Failed

This is something that has been tried in the past:

  • With DRAM, very local to the GPU server
  • With local NVMe, still available for the GPU servers
  • With shared storage (basic or high-performance shared storage) that couldn’t handle it and wasn’t fast enough

Basically, recalculating was more efficient than loading.
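
The break-even is easy to see with rough numbers: reloading only wins if the store can deliver KV cache faster than the GPU can recompute it. All figures below are assumptions for illustration.

```python
# When is reloading a KV cache faster than recomputing it? (Illustrative assumptions.)
kv_bytes_per_token = 128 * 1024      # ~128 KiB/token, as in the earlier sizing sketch
prefill_tokens_per_s = 10_000        # assumed recompute (prefill) rate on the GPU

def reload_tokens_per_s(storage_gb_per_s):
    return storage_gb_per_s * 1e9 / kv_bytes_per_token

print(reload_tokens_per_s(1))    # ~7,600 tok/s: slower than recomputing
print(reload_tokens_per_s(50))   # ~381,000 tok/s: far faster than recomputing
```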

11:48

Real-World Results: Achieving 4.2x More Tokens with Augmented Memory

Testing the Theory

We had that theory at WEKA, and we tested it in a production environment.

We took eight GPU servers and ran a real engineering development use case on them. By treating WEKA as augmented memory for those eight GPU servers, we could generate 4.2x more tokens for developers.

The Economic Impact

To put it in perspective: imagine that a 64-GPU environment could behave, memory-wise, like 256 GPUs. How much does a GPU cost? There’s real value here.

12:37

How to Build Cost-Effective Production AI Inference Infrastructure

The Essential Components

What does it take to win in a scale-out, production-grade inference environment, which the majority of organizations are now going to build?

A cost-effective, high-performance token warehouse:

  • Prefill once, decode forever
  • Don’t throw away tokens
  • Generate more tokens by increasing KV cache hit rate

By doing this, you generate more tokens, it’s more economical, and your user experience improves.

If I’m a code developer working in your environment, I’m getting better feedback: a shorter time to first token and shorter intervals between tokens.
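
A simple way to see the user-experience effect is to model the expected time to first token (TTFT) as a function of the KV cache hit rate; the latency figures here are illustrative assumptions.

```python
# Expected TTFT vs. KV cache hit rate (illustrative latencies, in seconds).
def expected_ttft(hit_rate, ttft_on_hit=0.2, ttft_on_miss=2.0):
    # On a hit the prefill is skipped and the cache is loaded; on a miss we prefill from scratch.
    return hit_rate * ttft_on_hit + (1 - hit_rate) * ttft_on_miss

for rate in (0.0, 0.5, 0.9):
    print(f"hit rate {rate:.0%}: ~{expected_ttft(rate):.2f}s to first token")
# hit rate 0%: ~2.00s, 50%: ~1.10s, 90%: ~0.38s
```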

Beyond Performance: The Complete Solution

Inferencing is not just about performance and tokens per second.

It’s also:

  • How can I create this inference environment in a simple fashion?
  • How can I do it in a secure way? We work with a lot of GPU clouds where multi-tenancy is a must. How can you do it in a way that your tenants can infer securely and separately?
  • How do you create model repositories where your agentic workloads can just load multiple models without copying the models between the different GPUs and local NVMe?

That’s what it takes to win.

Profitable AI requires overcoming the memory wall, simply because you’ll get better outcomes and it will be more cost-effective.
