
AI Token Economics and the Real Cost of Running AI Models

Keith Newman, host of Liftoff With Keith and partner at The GTM Firm, interviews WEKA’s Val Bercovici at AI Infra Summit 2025 to explore the cost of running AI models, memory constraints, and predictions for autonomous agents in 2026.

Transcript

00:00

Why AI Projects Fail: Token Economics and Unit Costs Determine Success

Keith Newman: Fantastic. We have Val Bercovici here from WEKA, the chief AI officer. I love meeting chief AI officers, by the way, because then we get to talk about all these hot topics. Last time I met you, tokenomics was the hot topic. Now we’re talking about memory and inference. Give us an update. How are you seeing things all shaping up and evolving?

Val Bercovici: The market’s moving a mile, a thousand miles a minute, a million miles a minute. So I don’t think that’s news to anyone. What matters, though, is literally what can we afford to run? That’s coming to the fore right now. There’s a lot of capital being spent. There’s a lot of deployments now.
There are two conflicting studies I love to talk about. One is the famous MIT study, where 95% of AI projects failed. And we can talk about why that’s not necessarily a credible study, but I think it’s interesting to say these are experiments at this stage, right? They’re not necessarily production systems. So the 5% that work are really valuable.
Then there’s a contrasting study with 800 responses versus 52 that found that 95% of projects succeeded and only 5% failed.

Keith: What does MIT know anyway, right?
Val: [Laughter] So there’s a lot going on here. But one of the things that actually separates the failures from the winners is the tokenomics. What is the unit cost, the unit economic cost, of running these models? Can you afford it? Just check out Reddit forums, check out Twitter feedback, all sorts of feedback. Can you afford all the tokens you want?

You’re on Claude Code. You’re generating these really cool apps, you’re vibe coding, then you deploy in production and, oh my god, I can’t afford to run this. Just like the early cloud builds: my product is great, but my bill is way too high.

How can I reduce this bill? And that’s why understanding tokenomics in depth, which is very different from any prior cloud computing trend, really, really matters here.
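
To make the unit-economics point concrete, here is a minimal back-of-the-envelope sketch. The token prices, request sizes, and team sizes are illustrative assumptions, not any provider’s actual rates.

```python
# Illustrative only: prices and volumes below are hypothetical placeholders,
# not WEKA's or any model provider's actual rates.

def request_cost(input_tokens: int, output_tokens: int,
                 price_in_per_m: float, price_out_per_m: float) -> float:
    """Cost of one model call, given per-million-token prices for input and output."""
    return (input_tokens / 1e6) * price_in_per_m + (output_tokens / 1e6) * price_out_per_m

# A hypothetical coding-agent workload: a large prompt (code context) in,
# a moderate completion out, repeated many times per developer per day.
per_call = request_cost(input_tokens=40_000, output_tokens=2_000,
                        price_in_per_m=3.00, price_out_per_m=15.00)

calls_per_dev_per_day = 200
developers = 50
monthly_bill = per_call * calls_per_dev_per_day * developers * 22  # ~22 working days

print(f"Cost per call: ${per_call:.4f}")       # $0.1500
print(f"Monthly bill:  ${monthly_bill:,.0f}")  # $33,000
```

Run the same arithmetic with your own traffic and pricing, and the “my product is great, but my bill is way too high” moment Val describes becomes easy to reproduce.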

Keith: So is that your new book, “Tokenomics”? Just like we had “Freakonomics” five years ago? Ten years ago?

Val: Yeah, you’ve given me a great idea.

[Laughter]

02:03

How Memory Impacts AI Performance: The Tradeoff Between FLOPS and Memory in GPU Computing

Keith: Well, how does it all correlate, then? Because I think it is all about finding the right profit and the right balance with things like memory and inference.

Val: So what’s really important here is the role of memory in this equation, in the tokenomics equation. So if we take a little bit of a step back, high-performance computing, especially GPU computing, scientific computing, and so forth, has always had this interesting tradeoff between floating point operations per second, FLOPS, and memory.

And even before AI was a thing, just in big gene sequencing workloads and seismic analysis workloads and nuclear fusion workloads and all that, there was always this tradeoff: how much do I compute versus how much do I store in this limited, really high-performance memory, saving the compute cycles and retrieving it if I’m using it a lot?
But there’s this constant tension. And now AI — whether it’s training, particularly inference, the new reinforcement learning trends — really exacerbates this. And what we’re seeing right now — and I like to pick on the most successful example in this space, which would be Anthropic’s Claude Code service — is that it has created a lot of very addicted developers. They love this product and feature, but they also hate it at the same time because they can’t get the tokens they want at any price. They’re literally saying, I’ll give you $2,000 a month, give me unlimited tokens —

Keith: It’s great for margin, though.

Val: It is. So scarcity is a real thing right now. And there’s this encyclopedia, pretty much, that Anthropic has published around how to optimize what’s called your prompt caching, which is fundamentally how to optimize your token costs, your input and output costs.

Keith: We’re in the optimization phase, too, aren’t we, with the energy costs, with everything else?

Val: Very, very much. For Anthropic, the problem isn’t their token pricing; they dictate that. The problem is access to GPUs, and then, for the providers like Amazon and now Google and others, access to energy to actually power those GPUs. The buck ultimately stops there.

Keith: More scarcity.
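
For readers who want to see what the prompt-caching optimization Val mentions looks like in code, here is a minimal sketch using the Anthropic Python SDK. The model name, file, and prompt are placeholders, and pricing, minimum cacheable sizes, and cache lifetimes vary, so treat this as an illustration and check Anthropic’s documentation for current details.

```python
# Minimal prompt-caching sketch with the Anthropic Python SDK (pip install anthropic).
# Model ID, file name, and prompt are placeholders for illustration.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# A large context that is reused across many requests (e.g., a codebase summary).
shared_context = open("codebase_summary.txt").read()

response = client.messages.create(
    model="claude-sonnet-4-20250514",  # placeholder model ID
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": shared_context,
            # Marking the block as cacheable lets later calls that share this prefix
            # read it from the cache instead of re-running the expensive prefill.
            "cache_control": {"type": "ephemeral"},
        }
    ],
    messages=[{"role": "user", "content": "Summarize the open TODOs in this codebase."}],
)

# usage reports input/output tokens plus how many were written to or read from the cache.
print(response.usage)
```

The unit-economics effect is that repeated calls sharing the cached prefix are billed at a discount on those input tokens, which is exactly the kind of optimization the Anthropic guidance is about.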

03:59

Multi-Vendor AI Infrastructure Strategy: Supporting NVIDIA, AMD, and Hybrid Cloud Deployments

Keith: Gosh, I can take this conversation in a couple of different directions, but I think there’s a comment about competition. You mentioned the LLMs. You mentioned the data centers, you mentioned on a global basis, on a chip basis. How are you placing bets as a company in this area?

Val: We can be Switzerland here, right? Our design from day one was to be a software-only company. We run on all supported hardware, and that support matrix keeps growing and growing. So we support NVIDIA processors and accelerators, AMD processors and accelerators, a growing list of others.
We support cloud deployments. We support on-premises deployments. We support hybrid deployments.

Keith: Have to be loyal to your customers, too.

Val: The customers dictate this. They’re the ones that are saying, “No, I want to use AMD instead of NVIDIA for inference,” for example. Or, “No, I have a heavy training job and the NVIDIA stack is the most mature.”

04:58

GPU Prefill and Why It's the Biggest Bottleneck in AI Inference

Keith: What are your key inference challenges now, today, and looking forward?

Val: So the key inference challenges really boil down to one underlying bottleneck for all of AI. If you peel all the layers away and find out what the real underlying bottleneck is, it’s this thing called GPU prefill.

And it’s such a bottleneck that NVIDIA, for the first time in their history, pre-announced a processor about 18 months in advance. It’s not a general-purpose GPU, not even a GPU that’s just for inference as opposed to training, but one just for prefill — that first phase where you basically take your prompt and vectorize it into 10,000 or 20,000 dimensions, and then read from it in this thing called the KV cache, which is the decode process.

So the real bottleneck is this: Prefill always has to happen before decode, and you run out of memory for decode within minutes, hence the Anthropic recommendations, which are practically an encyclopedia right now. But what if you had this concept of a token warehouse? Instead of prefilling many, many times when you run out of memory — which is all the time with multiple concurrent sessions — what if you prefilled once and just decoded from that forever?

It’s a very groundbreaking notion and it unlocks almost everything in AI.
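
One toy way to picture the “prefill once, decode forever” idea is a keyed store of KV caches, so a given prompt prefix is only ever prefilled the first time it is seen. This is a conceptual sketch, not WEKA’s implementation; compute_kv_cache and decode_step are hypothetical stand-ins for the prefill and decode phases of a real inference engine.

```python
# Conceptual "token warehouse" sketch: persist the KV cache produced by prefill,
# keyed by the prompt prefix, and let later sessions decode from it directly.
# compute_kv_cache() and decode_step() are toy stand-ins, not a real engine.
import hashlib

kv_warehouse: dict[str, dict] = {}  # prompt-prefix key -> saved prefill result

def compute_kv_cache(prompt_tokens: list[int]) -> dict:
    # Stand-in for the real prefill pass (the expensive, compute-bound phase).
    return {"tokens": list(prompt_tokens)}

def decode_step(kv: dict) -> tuple[int, dict]:
    # Stand-in for one decode step that reads from and extends the KV cache
    # (the memory-bandwidth-bound phase).
    next_token = len(kv["tokens"])
    kv["tokens"].append(next_token)
    return next_token, kv

def prefix_key(prompt_tokens: list[int]) -> str:
    """Key a cached prefill by the exact token prefix it was built from."""
    return hashlib.sha256(repr(prompt_tokens).encode()).hexdigest()

def generate(prompt_tokens: list[int], max_new_tokens: int) -> list[int]:
    key = prefix_key(prompt_tokens)
    if key not in kv_warehouse:
        # First (and only) prefill for this prefix.
        kv_warehouse[key] = compute_kv_cache(prompt_tokens)
    # Decode from a copy so the warehoused prefill stays reusable by other sessions.
    kv = {"tokens": list(kv_warehouse[key]["tokens"])}
    out = []
    for _ in range(max_new_tokens):
        token, kv = decode_step(kv)
        out.append(token)
    return out

# Two sessions sharing the same long prompt: only the first pays for prefill.
print(generate([1, 2, 3], max_new_tokens=3))
print(generate([1, 2, 3], max_new_tokens=5))
```

In a real system the warehoused caches are far too large for GPU memory alone, which is why the conversation turns to memory and storage tiering, but the control flow above is the essence of the idea.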

06:10

What Does a Chief AI Officer Do? Managing ROI and Cash Flow in Enterprise AI

Keith: Talk to me about being a chief AI officer in the middle of this race, looking at everything from minerals to real estate. What a wide area. How do you start to evaluate investment?

Val: Well, you’ve got to be a bit of a CFO, you’ve got to wear a CFO hat, right? And you’ve got to understand the net present value of certain things you’re involved in: What’s the internal rate of return? Managing cash flows is key here. I mean, I take a look at what Sarah Friar at OpenAI is trying to do, based on Sam Altman’s public proclamations and so forth. And it’s a real struggle to afford, month to month, the super ambitious goals of some of the very high-profile labs.

The other end of the spectrum would just be the enterprises trying to consume this stuff, let alone even host it.
Keith: That’s a good point.

Val: And so it’s figuring out who my providers are. To your point, there’s competition between GPU providers. It’s a layer cake of GPU providers, model providers, the AI app companies, and then the enterprises consuming it all.

07:09

When Will AGI Arrive? How AI Agents Are Evolving from Supervised Interns to Autonomous Employees

Keith: 2026: What’s the innovation of the year going to be?

Val: I think we’re going to be shocked at where we are. I can’t tell you exactly what’s going to shock us, but if I take a look back just 10 months to the beginning of 2025, I wouldn’t have believed where we are right now. I wouldn’t have believed that coding agents actually work. That was hype 10 months ago. It’s a hard reality today.

Keith: So there will be a new version of vibe coding or something different.

Val: That’s the most pedestrian prediction, I think. I think we can do so much better. There are two ways to evaluate this. There’s the textbook definition of the next major breakthrough we anticipate, AGI (artificial general intelligence): You’ll be able to ask AI to do something economically valuable without a lot of guidance or instruction, just like hiring a smart employee and letting them run.
The personal version of that is: Today, if you’re using agents at all — and if you’re not, you definitely have to start using agents to see their value — you really have to supervise them a lot. It’s kind of like an intern.

Keith: And on that note, you need to make sure people know how to use them correctly, ethically.

Val: All these things you have to do up front? My prediction is that a year from now, it’ll be automatic. The right decisions will be made by the agents even when you didn’t give them instructions that were specific enough, to the point where you’ll trust them blindly.
So today, for example, I have agents generating a bunch of my email responses. So if you’re getting an email answer from me, apologies, but it might be AI-generated. But I don’t let them send —

Keith: Send some money….

Val: [Laughter] But I do have them prepare all the drafts, and I review the drafts, and month over month, I am almost shocked at how much better those draft responses are becoming, to the point where I’m going to start to soon trust them to send those drafts on my behalf.

Keith: It’s fascinating.

08:55

When Will Quantum Computing Be Ready for AI? Timeline and How AI Accelerates Quantum Development

Keith: You didn’t mention quantum.

Val: [Laughter] I don’t think I want to, honestly.

Keith: [Laughter] It’s a tornado, not the hurricane.

Val: It’s such a tangent. It’s a pun: Like many things about quantum, it’s unpredictable when it’s going to be real and valuable; it’s still five to 10 to 15 years out, and no one can get specific. However, it will create the ultimate virtuous cycle once it’s upon us.

Because AI — and Jensen (Huang) is very deliberate about this with CUDA Quantum — is helping simulate the proper quantum computers so that we can improve what they should look like. It’s effectively accelerating the availability of usable quantum technology, and once that’s there, that will accelerate AI training and inference in almost unimaginable ways.

But we’re not there yet. We haven’t lit the actual spark for this yet. We’re building the kindling, we’re building the fire.

Keith: We just started. We’re hearing it more and more. I think we have a lot to go still, with regular AI, if that’s a phrase. And I love listening to WEKA. I love what you guys are doing. You’re on such a great roll. You’re too quiet of a story. More people need to watch and look at what you’re doing. And I wish you lots of success finishing out this year and next year.

Val: Thank you. Can’t wait to come back and see whether I was right or wrong on these predictions.

Keith: You are invited back, for sure, we’ll have you next year.

Val: Look forward to it.

Like This Discussion? There’s More!

Hear additional insights on how to leverage a memory-first architecture to accelerate AI inference from WEKA Chief Technology Officer Shimon Ben-David during his keynote at AI Infra Summit 2025.