00:00
Why Do AI Unit Economics Matter More Than Traditional Performance Metrics?
Not all queries are created equal. One user prompt might cost $0.05 to process. Another might cost $5.
That quick query, “How do I get to Fort Mason?”, is much easier, faster, and cheaper than a 30-prompt session asking the model to plan a full trip to South America. But right now, we count both as the same thing.
So what is time to first token, and why does it matter? Time to first token is the first signal that your system is alive. It builds trust and engagement. If it’s slow or unpredictable, the whole experience starts to break down.
Imagine running a restaurant. One dish takes 5 minutes and costs $10. Another takes 45 minutes and costs $100. If your only metric is how many meals go out the door, you miss whether customers are happy and whether the business is viable. That’s how we’re measuring AI today: by the wrong metrics. We need better instruments and more precise measurements.
Throughput alone tells you nothing about value or profitability. What matters is unit economics—cost per token and efficiency per workload.
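To make that concrete, here’s a back-of-the-envelope sketch in Python. Every price and token count below is a made-up placeholder, not real pricing; the point is that two requests that look identical in a throughput metric can differ enormously in cost per query.

```python
# All prices and token counts are illustrative assumptions.
PRICE_PER_1K_INPUT_TOKENS = 0.01   # $ per 1K prompt tokens (assumed)
PRICE_PER_1K_OUTPUT_TOKENS = 0.03  # $ per 1K generated tokens (assumed)

def query_cost(input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of a single query."""
    return (input_tokens / 1000) * PRICE_PER_1K_INPUT_TOKENS \
         + (output_tokens / 1000) * PRICE_PER_1K_OUTPUT_TOKENS

# "How do I get to Fort Mason?" -- short prompt, short answer.
quick = query_cost(input_tokens=20, output_tokens=150)

# A 30-prompt trip-planning session, where each turn re-reads a
# growing context (assumed ~2K tokens per prior turn).
trip = sum(query_cost(input_tokens=2_000 * turn, output_tokens=800)
           for turn in range(1, 31))

print(f"quick query:   ${quick:.4f}")   # ~$0.005
print(f"trip planning: ${trip:.2f}")    # ~$10
# A request counter treats these as equal; unit economics do not.
```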
01:47
How Do AI Factories and Token Warehousing Reduce Inference Costs?
This leads me to one of the most important ideas gaining traction right now: AI Factories with token warehousing. It’s a new approach to infrastructure that stores and reuses the results of the most expensive part of AI inference, instead of spending processing time recomputing them on every request.
Every inference process has two parts:
- Prefill: the model reads the prompt and builds an internal state.
- Decode: the model generates the actual tokens. This is where the value is created.
Here’s the problem: prefill is expensive. It eats up memory, it eats up compute, and it is often repeated unnecessarily. Imagine your users asking the same question again and again and again. Without reuse, the system redoes that expensive prefill every single time.
The goal is simple: prefill once, decode many times. That’s where token warehousing comes in: by storing prefilled state, the most expensive part of inference, and recalling it instantly, you unlock massive performance and cost gains.
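As a rough illustration of that pattern, here’s a toy Python sketch. The `expensive_prefill` and `decode` functions and the `TokenWarehouse` class are stand-ins I’ve invented, not a real model or Weka API; the idea is simply that caching prefilled state lets repeated prompts skip straight to decode.

```python
import hashlib

# Toy stand-ins for the two phases of inference. In a real system,
# prefill builds the model's internal state from the prompt and
# decode generates tokens from that state.
def expensive_prefill(prompt: str) -> dict:
    return {"state": f"internal-state-for:{prompt}"}

def decode(state: dict) -> str:
    return f"tokens generated from {state['state']!r}"

class TokenWarehouse:
    """Minimal sketch of a prefill cache keyed by prompt content."""

    def __init__(self) -> None:
        self._store: dict[str, dict] = {}

    def generate(self, prompt: str) -> str:
        key = hashlib.sha256(prompt.encode()).hexdigest()
        if key not in self._store:        # prefill once...
            self._store[key] = expensive_prefill(prompt)
        return decode(self._store[key])   # ...decode many times

warehouse = TokenWarehouse()
for _ in range(4):  # the same question again and again
    warehouse.generate("How do I get to Fort Mason?")
# Only the first call pays the prefill cost; the rest hit the cache.
```

In practice a system like this would also match shared prompt prefixes rather than only exact repeats, but the economics are the same: the first request pays for prefill, and every later hit gets it nearly for free.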
03:12
Why is Memory Optimization Critical for AI?
So at Weka, by avoiding redundant prefill, we’ve seen time to first token improve by up to 75x and energy consumption drop by 200x. And I’m going to say that again to make sure you heard it in the back: by avoiding redundant prefills, we’ve seen an improvement in time to first token of up to 75x and a 200x reduction in energy consumption. That’s not just optimization. In agentic sessions, which are long and memory intensive, it’s essential for scale.
So think about this like a memory palace for AI. (Not Superman’s Fortress of Solitude, which is what I picture when I hear “memory palace,” but close.) Without it, your GPU starts from scratch every time and never gets smarter. With it, your GPUs remember what they already figured out. They move faster, they work cheaper, and they deliver a better experience.
So the more memory constrained your system is, the more you’re paying in latency costs, the fewer simultaneous users you can serve, and the worse your token economics become.
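Here’s some back-of-the-envelope arithmetic for why that is. Every number below is an assumption for illustration (per-token state footprints vary widely by model), but the shape of the math holds: session state competes with model weights for a fixed pool of GPU memory.

```python
# All figures are illustrative assumptions, not measurements.
GPU_MEMORY_GB = 80                # e.g., one 80 GB accelerator
MODEL_WEIGHTS_GB = 50             # memory occupied by the model itself
KV_CACHE_MB_PER_1K_TOKENS = 160   # rough per-session state footprint
CONTEXT_TOKENS_PER_SESSION = 32_000

kv_per_session_gb = (CONTEXT_TOKENS_PER_SESSION / 1000) \
    * KV_CACHE_MB_PER_1K_TOKENS / 1024
free_gb = GPU_MEMORY_GB - MODEL_WEIGHTS_GB
concurrent_sessions = int(free_gb / kv_per_session_gb)

print(f"state per session:   {kv_per_session_gb:.1f} GB")  # 5.0 GB
print(f"concurrent sessions: {concurrent_sessions}")       # 6
# Every GB of session state that doesn't fit on the GPU means fewer
# users per device, more eviction and recomputation, and a worse
# cost per token.
```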
Success in the new AI economy is all about figuring out intelligent ways to optimize and maximize the memory that we have.
04:44
What is GPU Utilization and Why Does it Matter?
So now I want to talk about the most common but least discussed problem in AI infrastructure: GPU utilization. Your GPUs are the most expensive part of your AI stack.
You think they’re running flat out, right? It’s your most expensive resource, so surely you’re getting every dollar out of it. Who here believes your GPUs are running at 100% capacity?
Here’s the thing: I’m going to let you in on a dirty secret. Real-world GPU utilization is about 10 to 40%. And don’t take my word for it; we’ve seen the same numbers from the Outer Post CEO, from Weights & Biases, and from GigaIO. We’re all finding the same thing: your GPUs are not being used to their full extent.
So that means you’re paying full price for part-time work. Millions of dollars of hardware are sitting idle while your teams think they’re maxed out. You’re thinking, that’s just the way this goes. One of our customers, Stability AI, was experiencing the exact same problem. They moved to Weka on AWS and got to 93% GPU utilization on their training pipelines.
That’s not theory. That’s what happens when you remove bottlenecks from the data layer and you architect for velocity.
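To see what utilization does to the bill, here’s one more quick sketch. The hourly rate is a placeholder, not Weka or AWS pricing; the point is that idle capacity multiplies the effective cost of every hour of useful work.

```python
# Assumed hardware cost for illustration only.
GPU_COST_PER_HOUR = 4.00  # $ paid per GPU-hour (placeholder)

def cost_per_useful_hour(utilization: float) -> float:
    """Effective cost of one hour of actual GPU work."""
    return GPU_COST_PER_HOUR / utilization

for util in (0.10, 0.40, 0.93):
    print(f"{util:.0%} utilization -> "
          f"${cost_per_useful_hour(util):.2f} per useful GPU-hour")
# 10% utilization -> $40.00 per useful GPU-hour
# 40% utilization -> $10.00 per useful GPU-hour
# 93% utilization -> $4.30 per useful GPU-hour
```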
Want to learn more? Watch Lauren’s full keynote address, and follow WEKA on LinkedIn to keep up with emerging trends in AI economics.