Token Power: The New Economics of AI
Transcript
What are the new economics of artificial intelligence (AI)?
That leads us to our next headline speaker. Her talk is Token Power: The New Economics of AI. Her name is Lauren Vaccarello. Really interesting lady. She’s CMO at WEKA, responsible for leading the company’s global marketing organization and strategy. She’s a veteran marketing executive, celebrated author, board member, and angel investor with a proven track record of accelerating revenue growth for enterprise software companies.
I think you’re going to really enjoy this. Please give a round of applause for Lauren Vaccarello.
Hello, everybody. How is day two going? Are we excited? Did we have a good night? Amazing. I’m here to talk about one of my all-time favorite topics: the economics of AI. Did you know that your infrastructure can actually be one of your biggest advantages when it comes to making AI profitable? That’s what I’m going to tell you about today.
I love infrastructure. We’ve officially entered the age of reasoning. This is a fundamental shift. We’ve moved past pattern-matching LLMs. Today’s AI systems are expected to think, reason, and act persistently, autonomously, and with memory. That’s powerful, but it’s also really expensive.
They say in tech, every era rewrites its own economics. The PC era gave us Moore’s Law.
The cloud era gave us elasticity. And now in the AI era, we’re dealing with something we’ve never seen before: intelligence that grows more expensive the smarter it gets. I’ve talked to teams spending $50,000 a day in compute costs just to keep inferencing alive. Not one of them has a clear answer on their actual cost per insight.
Can you imagine running any other type of business where you don’t know what your costs are?
Why are AI inference costs unsustainable for many organizations?
That’s a problem. And that’s what we’re going to talk about today. It’s not just about how AI is evolving. A lot of sessions are talking about this, and that’s great. But we’re going to talk about the economics behind it.
And how it’s breaking—and what we can do to fix it. Reasoning models consume far more memory than traditional LLMs. I love saying “traditional LLMs,” by the way. It sounds like it’s been going on for such a long time. Right now, a single complex query can burn through hundreds of thousands of tokens. Suddenly, innovation costs bump up against the limits of cost and infrastructure.
This is mind-boggling. Is anyone here familiar with the ARC Prize? If you don’t know, it’s a benchmark to test reasoning. Super cool. Sam Altman’s team dropped in and previewed o3 against it. It beat the benchmark, but it cost $1 million for 16 hours of inferencing. That’s insane. That’s not even training.
It’s just inferencing.
So now we’re facing a really hard truth: AI, at its current cost, is not commercially viable. People and pundits speculate that AI is dead in the water—that costs will exceed what the market can bear, that we don’t have a future in this. I know no one here actually believes that, but if you look at OpenAI’s recent 80% price reduction, it may be evidence that the market cannot bear these high costs.
How can AI infrastructure design determine ROI and scalability?
So we need to do something about this. What we need is smarter infrastructure. We need smarter economics—and we need it now. In the old world, you trained a model once and deployed it. Inference was predictable. But now, reasoning models are volatile. The cost per query can vary wildly. You’re not just paying for output—you’re paying for depth of thought.
And that is creating budgetary chaos for teams everywhere. This brings up something we like to call the AI economics triad: accuracy, cost, and latency. Accuracy is non-negotiable. You cannot return a half-right answer. If you ask ChatGPT, “Where is Fort Mason?” and it sends you to Ocean Beach, you’re never going to use it again. Then we have latency.
Latency is not just about slowing things down. Latency actually breaks the user experience. Latency ruins the illusion of intelligence. Who here is old enough to remember dial-up? I like to pretend I’m not, but unfortunately I am. In the old days, you’d go online, sit there, and wait. You’d get a glass of water, come back, and finally be connected to the internet. That was normal. That was fine.
Now you jump online with broadband and connect instantly. This is what we’re experiencing right now. Imagine your users: one option is to sit and wait, and the other is to get their response instantly.
Who would you choose? Do you want to go back to dial-up, or do you prefer broadband? I hope you like broadband. Right now, users experiencing delays are starting to lose trust. They sit there wondering, “Is this working?” At least with dial-up, we had that noise to tell us something was happening. With latency, it just sits there. It breaks the experience and ruins the illusion of intelligence.
It’s not just about milliseconds anymore. This is about momentum. In agentic systems, hesitation feels like failure. Users don’t want to wait around until your model eventually catches up. And cost—cost is the lever most of us are trying to pull.
But here’s the thing: cost is just a symptom. So how do we fix all of this?
How does agentic AI change infrastructure requirements, costs, and performance for enterprises?
I want to look at the three major factors shaping the moment we’re in today. First, we have agentic AI—long-running, memory-based systems that carry context across sessions.
I love agentic AI. Who here is building their own models? Who here is secretly building models for their day-to-day work? I’ll let you in on a secret: I trained a model at work that’s basically a virtual version of me. I use it all the time to give me extra bandwidth. Then I think about that from an infrastructure perspective.
How expensive must it be for the LLM I’m using to store a virtual Lauren? It’s incredibly powerful technology, but it’s resource intensive. I don’t think the economics of what they’re charging me actually make sense for them.
Second, we have elastic compute. When DeepSeek came out, it was mind-blowing.
Does anyone remember when DeepSeek came out and totally upended the costs behind LLMs? I thought, “This is insane.” They broke one of the truths we thought was real—that this had to be expensive. DeepSeek’s approach used elastic computing models. What they did was radical.
They combined inference and training on the same infrastructure. During the day, their GPU capacity ran inference. At night, when demand dipped, it switched over to training. Just like that. DeepSeek’s move wasn’t purely innovation—it was economic necessity. That necessity turned into a competitive advantage. Most organizations keep inferencing and training on separate stacks.
By putting inferencing and training on the same stack—flipping from one to the other—DeepSeek proved something many thought was impossible. Sustainable innovation is a design problem, not a funding problem.
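As a rough mental model of that day/night flip, here is a minimal, hypothetical scheduler sketch in Python. The GPUPool class, the 30% demand threshold, and the hour-based demand signal are all illustrative assumptions, not DeepSeek’s actual system.

```python
import time

# Hypothetical sketch: one GPU fleet, two workloads. When live inference
# demand is high, the fleet serves queries; when demand dips (e.g., overnight),
# the same GPUs are handed to training jobs.

DEMAND_THRESHOLD = 0.30  # assumed: below 30% of peak demand, flip to training


class GPUPool:
    def __init__(self, num_gpus):
        self.num_gpus = num_gpus
        self.mode = "inference"

    def switch_to(self, mode):
        if mode != self.mode:
            print(f"Repurposing {self.num_gpus} GPUs: {self.mode} -> {mode}")
            self.mode = mode


def current_inference_demand():
    # Placeholder: in practice this would come from request-queue depth
    # or autoscaler metrics, not the wall clock.
    hour = time.localtime().tm_hour
    return 0.9 if 8 <= hour < 22 else 0.1  # busy by day, quiet at night


def schedule(pool):
    if current_inference_demand() >= DEMAND_THRESHOLD:
        pool.switch_to("inference")
    else:
        pool.switch_to("training")


if __name__ == "__main__":
    schedule(GPUPool(num_gpus=512))  # run on a timer, e.g., every few minutes
```

The real engineering cost sits outside this sketch: draining in-flight requests, checkpointing training state, and warming models back up after each flip.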
Third, we’re seeing a fundamental shift everywhere. We’re moving away from queries per second and replacing it with metrics that actually matter. We’re looking at cost per token and time-to-first-token (TTFT).
Why are we doing this? Because not all queries are created equal. One user prompt might cost $0.05 to process. Another might cost $5. A quick query—“How do I get to Fort Mason?”—is easier, faster, and cheaper to run. But a 30-prompt question asking it to plan my entire trip to South America is a different story.
Right now, we treat these as the exact same thing. But time-to-first-token really matters. It’s the first signal that your system is alive. It builds trust. It builds engagement. If it’s slow or unpredictable, the whole experience breaks down.
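To make cost per token and time-to-first-token concrete, here is a small Python sketch that wraps a streaming generation call, stamps the first token, and totals cost. The stream_tokens stub and the per-1K-token price are stand-ins for whatever model API and pricing you actually use.

```python
import time

PRICE_PER_1K_OUTPUT_TOKENS = 0.002  # assumed illustrative rate, not any vendor's pricing


def stream_tokens(prompt):
    """Stand-in for a real streaming model call; yields tokens one at a time."""
    for token in ["Fort", " Mason", " is", " in", " San", " Francisco", "."]:
        time.sleep(0.05)  # simulate decode latency
        yield token


def measure(prompt):
    start = time.perf_counter()
    ttft = None
    n_tokens = 0
    for token in stream_tokens(prompt):
        if ttft is None:
            ttft = time.perf_counter() - start  # time-to-first-token
        n_tokens += 1
    total = time.perf_counter() - start
    cost = n_tokens / 1000 * PRICE_PER_1K_OUTPUT_TOKENS
    return {"ttft_s": round(ttft, 3), "total_s": round(total, 3),
            "output_tokens": n_tokens, "cost_usd": round(cost, 6)}


print(measure("How do I get to Fort Mason?"))
```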
Here’s a better way to think about it. Imagine you’re running a restaurant. The only thing you care about is how many dishes are coming out. But one dish takes five minutes and costs $10 to make. Another takes 45 minutes and costs $100.
If your only metric is how many meals go out the door, what are you missing? Are your customers happy? Is your business economically viable? That’s how we’re looking at AI right now—through the wrong metrics. We’re tracking things that tell us nothing about value or profitability.
We need better instruments. We need more precise measurements. It’s not just about throughput—it’s about unit economics.
What is token warehousing and why does it matter for AI scalability?
And this leads me to one of the most important ideas gaining traction right now: AI factories with token warehousing. Token warehousing is a new approach to infrastructure that stores and reuses the results of the most expensive part of AI inference, so you stop paying to recompute the same work over and over.
Does everyone know what pre-fill and decode are? Let me explain. Every inference process has two parts: pre-fill and decode. Pre-fill is where the model reads the prompt and forms an internal state. Decode is where it generates the actual tokens—it’s where the value is created.
Here’s the problem: pre-fill is super expensive. It eats up memory, it eats up compute, and it’s often repeated unnecessarily. Imagine your users asking the same question again and again. Reusing the pre-fill means you don’t have to recompute that work every single time. The goal is simple: pre-fill once, decode many times.
That’s where token warehousing comes in. By storing the pre-filled state—the most expensive part of inference—and recalling it instantly, you unlock massive performance and cost gains.
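Here is a toy sketch of “pre-fill once, decode many times”: cache the expensive pre-fill state keyed by the shared prompt prefix and only pay for it on a miss. The expensive_prefill and decode functions below are placeholders for a real inference engine’s KV-cache build and token generation, and the in-memory dict stands in for an actual warehouse tier.

```python
import hashlib

# Toy token warehouse: an in-memory dict mapping a shared prompt prefix to its
# pre-filled state. A real system would persist this in a fast external tier.
prefill_cache = {}


def expensive_prefill(context):
    # Placeholder for the real pre-fill pass that builds the model's internal
    # (KV-cache) state from the prompt. This is the costly step.
    return {"kv_state_for": context}


def decode(state, question):
    # Placeholder for token generation against an existing pre-filled state.
    return f"answer to {question!r} using cached context ({len(state['kv_state_for'])} chars)"


def answer(shared_context, question):
    key = hashlib.sha256(shared_context.encode()).hexdigest()
    state = prefill_cache.get(key)
    if state is None:                      # cache miss: pay for pre-fill once
        state = expensive_prefill(shared_context)
        prefill_cache[key] = state
    return decode(state, question)         # decode many times against it


docs = "...long system prompt plus company knowledge base..."
print(answer(docs, "Where is Fort Mason?"))   # pre-fills on first use
print(answer(docs, "What are your hours?"))   # reuses the warehoused pre-fill
```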
At WEKA, we’ve seen an improvement in time-to-first-token by up to 75x and a 200x reduction in energy consumption just by avoiding redundant pre-fills. Let me say that again to make sure you heard it in the back: by avoiding redundant pre-fills, we’ve improved time-to-first-token by 75x and cut energy consumption 200x.
That’s not just a generic optimization. In long, memory-intensive sessions, it’s essential for scale. Think about it like a memory palace for AI. It’s not the Superman Fortress of Solitude I picture when I think of a memory palace, but it’s close.
Think of it this way: without that memory palace, your GPU starts from scratch every time and never gets smarter. With token warehousing, your GPUs remember what they’ve already figured out. They move faster, work cheaper, and deliver a better experience.
How does GPU utilization affect AI performance and economics?
Now I want to talk about the most common but least discussed problem in AI infrastructure: GPU utilization. GPUs are the most expensive part of your AI stack. You’d think they’re running flat out, right? Your most expensive resource. You should be getting every dollar out. Who here believes your GPUs are running at 100% capacity?
Here’s the thing—I’ll let you in on a dirty secret. Real-world GPU utilization is about 10 to 40%. Don’t just take my word for it. We’ve seen the same information from the Outerbounds CEO, from Weights and Biases, from GigaIO. Everyone’s finding the same thing: GPUs are not being used to their full extent.
That means you’re paying full price for part-time work. Millions of dollars of hardware are sitting idle, and teams think they’re maxed out. They assume, “That’s just the way it goes.” One of our customers, Stability AI, was experiencing this exact problem. They moved to WEKA on AWS and achieved 93% GPU utilization in training pipelines.
That’s not theory—that’s what happens when you remove bottlenecks from the data layer and architect for velocity. Why? Because GPUs are waiting. They’re waiting for data. They’re waiting for pre-fill to finish. They’re waiting for memory to flush. And every time that happens, you’re paying top dollar for your most advanced chip to sit idle.
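If you want to sanity-check the utilization numbers on your own cluster, one simple approach is to sample NVML counters over a window instead of glancing at nvidia-smi once. A minimal sketch, assuming the pynvml bindings and NVIDIA drivers are installed:

```python
import time

import pynvml  # assumes the NVIDIA NVML Python bindings are installed


def average_gpu_utilization(seconds=60, interval=1.0):
    """Average GPU utilization (percent) across all devices over a window."""
    pynvml.nvmlInit()
    try:
        handles = [pynvml.nvmlDeviceGetHandleByIndex(i)
                   for i in range(pynvml.nvmlDeviceGetCount())]
        samples = []
        deadline = time.time() + seconds
        while time.time() < deadline:
            for handle in handles:
                util = pynvml.nvmlDeviceGetUtilizationRates(handle)
                samples.append(util.gpu)  # % of time the GPU was busy recently
            time.sleep(interval)
        return sum(samples) / len(samples) if samples else 0.0
    finally:
        pynvml.nvmlShutdown()


if __name__ == "__main__":
    print(f"Average GPU utilization: {average_gpu_utilization(seconds=30):.1f}%")
```

Note that NVML’s utilization counter only says a kernel was running, so real efficiency is often lower than what this reports.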
Here’s what it looks like to fix that. Imagine you’re running a factory. You’re building agentic AI. You’re handling customs, logistics, planning. You want it to respond to queries. You want it to reason. You want your AI to run the show. With massive context windows and dozens of multi-step interactions, every time an employee types “What’s next?” it’s not just slow—it’s unaffordable.
The system has to stop, recompute, and think. But if you flip this model—store the data and stream it back instantly—you reduce costs, maintain session-level intelligence, and improve throughput.
When you detach state from a single GPU so any available GPU can pick up where it left off, you get the holy grail of inferencing economics. At WEKA, we’re seeing companies achieve over 90% utilization and even oversubscribe their GPUs. They’re no longer forced to let GPUs sit idle.
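One way to picture detaching state from a single GPU is to key session state by session ID in a shared store instead of pinning it to whichever worker served the last turn. In this toy sketch a plain dict plays the role of that shared memory tier; in practice it would be a fast external tier holding real KV-cache state rather than chat history.

```python
# Toy sketch: conversation state lives in a shared store keyed by session ID,
# not on one particular GPU, so any available worker can serve the next turn.
shared_store = {}  # stands in for a fast external memory tier


def handle_turn(session_id, user_msg, worker_id):
    state = shared_store.get(session_id, {"history": []})  # resume from the store
    state["history"].append(user_msg)
    reply = f"[worker {worker_id}] handled turn {len(state['history'])}"
    shared_store[session_id] = state                        # persist for whoever serves the next turn
    return reply


print(handle_turn("sess-42", "What's next?", worker_id=0))
print(handle_turn("sess-42", "And after that?", worker_id=3))  # different worker, same session
```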
I’m here to tell you: it’s possible. Inferencing and training on the same GPU, on the same stack, is possible.
Let’s get to the core issue: memory. GPUs are powerful. Models are massive. But memory has not kept up. High-bandwidth memory is super expensive, limited, and the real constraint in modern inference. Once your system exceeds available memory, you’re forced to recompute or swap to slower tiers.
It kills latency. It hurts utilization. It destroys economics. The more memory-constrained your system is, the more you pay in latency costs, the fewer simultaneous users you can serve, and the worse your economics become. Success in the new AI economy requires maximizing and optimizing memory.
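To see how quickly memory runs out, here is a back-of-envelope KV-cache estimate in Python. The model shape below (an 80-layer, 70B-class model with grouped-query attention: 8 KV heads, head dimension 128, FP16) is an assumption for illustration; plug in your own model’s dimensions.

```python
def kv_cache_bytes(seq_len, layers, kv_heads, head_dim, bytes_per_elem=2, batch=1):
    """Bytes of KV cache: keys + values, for every layer, for every token."""
    return 2 * layers * kv_heads * head_dim * bytes_per_elem * seq_len * batch


# Assumed 70B-class model shape (grouped-query attention):
# 80 layers, 8 KV heads, head dimension 128, FP16 (2 bytes per element).
one_long_session = kv_cache_bytes(seq_len=100_000, layers=80, kv_heads=8, head_dim=128)
print(f"~{one_long_session / 2**30:.1f} GiB of KV cache for one 100k-token session")
# Roughly 30 GiB per long session: a handful of concurrent users can fill an
# 80 GiB GPU before you even count the model weights.
```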
This is why we say the future of AI is memory-first. Solving that problem means rethinking how your infrastructure feeds your models. This is what makes AI profitable and shifts you from survival to scale.
What steps can organizations take to build economically viable AI systems?
So what does it take to win? What should you do with everything I just told you?
Measure time-to-first-token. Pre-fill once, decode often. Track the metrics that matter. Architect for reuse. Think in tokens, not queries. Ultimately, treat memory like your most valuable asset, not an afterthought.
The AI winners of this era won’t be the ones with the biggest budgets or the flashiest models. They’ll be the ones who understand the economics and build accordingly.
It’s not just about making AI intelligent—it’s about making AI viable.
So I want you all to ask yourself: where is your organization in the journey to building smarter AI—AI that makes economic sense to run and also has the potential to change the world for the better?
Thank you, everyone. It has been a pleasure being here with all of you.
Want to learn more? See how WEKA can help you Maximize Token Efficiency. Keep up with emerging AI Economics trends by following WEKA on LinkedIn.
Related Resources
Why Prefill has Become the Bottleneck in Inference—and How Augmented Memory Grid Helps