AI Token Economics and the Real Cost of Running AI Models
Transcript
Why AI Projects Fail: Token Economics and Unit Costs Determine Success
Keith Newman: Fantastic. We have Val Bercovici here from WEKA, the chief AI officer. I love meeting chief AI officers, by the way, because then we get to talk about all these hot topics. Last time I met you, tokenomics was the hot topic. Now we’re talking about memory and inference. Give us an update. How are you seeing things all shaping up and evolving?
Val Bercovici: The market’s moving a mile a minute, a thousand miles a minute, a million miles a minute. So I don’t think that’s news to anyone. What matters, though, is literally what can we afford to run? That’s coming to the fore right now. There’s a lot of capital being spent. There’s a lot of deployments now.
There’s two conflicting studies I love to talk about. One is the famous MIT study, where 95% of AI projects failed. And we can talk about why that’s not necessarily a credible study, but I think it’s interesting to say these are experiments at this stage, right? They’re not necessarily production systems. So the 5% that work are really valuable.
Then there’s a contrasting study with 800 responses versus 52 that found that 95% of projects succeeded and only 5% failed.
Keith: What does MIT know anyway, right?
Val: [Laughter] So there’s a lot going on here. But one of the things that actually separates the failures from the winners is the tokenomics. What is the unit cost, the unit economic cost, of running these models? Can you afford it? Just check out the Reddit forums, the Twitter feedback, all sorts of feedback at shows like this. Can you afford all the tokens you want?
You’re on Claude Code. You’re generating these really cool apps, you’re vibe coding, then you deploy in production and, oh my god, I can’t afford to run this. Just like the early cloud bills: My product is great, but my bill is way too high.
How can I reduce this bill? And that’s why understanding tokenomics in depth, which is very different from any prior cloud computing trend, really, really matters here.
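To put that unit-cost question in concrete terms, here is a minimal back-of-the-envelope sketch. Every price and usage figure below is a hypothetical placeholder, not any provider's actual rate:

```python
# Back-of-the-envelope token economics for one AI feature.
# All prices and usage figures are hypothetical placeholders.

PRICE_PER_1M_INPUT_TOKENS = 3.00    # USD; output tokens usually cost
PRICE_PER_1M_OUTPUT_TOKENS = 15.00  # several times more than input tokens

def cost_per_request(input_tokens: int, output_tokens: int) -> float:
    """USD cost of a single model call."""
    return (input_tokens / 1e6) * PRICE_PER_1M_INPUT_TOKENS + \
           (output_tokens / 1e6) * PRICE_PER_1M_OUTPUT_TOKENS

# A vibe-coding session sends a big prompt (repo context) with each request.
per_request = cost_per_request(input_tokens=40_000, output_tokens=2_000)

# The bill that surprises you is per-request cost times volume.
monthly_bill = per_request * 50 * 1_000 * 30  # 50 calls/day, 1,000 users, 30 days

print(f"per request: ${per_request:.2f}")    # $0.15
print(f"per month:   ${monthly_bill:,.0f}")  # $225,000
```

A fraction of a cent per token looks harmless; multiplied across users, sessions, and long contexts, it is the "my bill is way too high" moment Val describes.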
Keith: So is that your new book, “Tokenomics”? Just like we had “Freakonomics” five years ago? Ten years ago?
Val: Yeah, you’ve given me a great idea. Absolutely, I will.
[Laughter]
How Memory Impacts AI Performance: The Tradeoff Between FLOPS and Memory in GPU Computing
Keith: Well, how does it all correlate, then? Because I think it is all about finding the right profit and the right balance with things like memory and inference.
Val: So what’s really important here is the role of memory in this equation, in the tokenomics equation. So if we take a little bit of a step back, high-performance computing, especially GPU computing, scientific computing, and so forth, has always had this interesting tradeoff between floating point operations per second, FLOPS, and memory.
And even before AI was a thing, just in big gene sequencing workloads and seismic analysis workloads and nuclear fusion workloads and all that, there’s just a lot of this tradeoff of how much do I compute vs. store in this limited memory, this really high-performance memory, and save the compute cycles and retrieve it if I’m using it a lot.
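That compute-versus-store decision can be sketched in a few lines, with no claim about any particular HPC stack: a capacity-capped cache stands in for the scarce high-performance memory, and a deliberately slow function stands in for the FLOPS.

```python
# Minimal sketch of the compute-vs-memory tradeoff: recompute a result,
# or hold it in scarce fast memory and retrieve it. lru_cache's maxsize
# plays the role of the limited high-performance memory.
from functools import lru_cache
import time

def expensive_kernel(x: int) -> int:
    time.sleep(0.01)  # stand-in for compute cycles we'd rather not repeat
    return x * x

@lru_cache(maxsize=1024)  # limited "memory": least-recently-used eviction
def cached_kernel(x: int) -> int:
    return expensive_kernel(x)

# Hot, repeated inputs reward caching; one-off inputs would only waste
# cache space, and recomputing them is the right call.
for x in (1, 2, 1, 1, 2):
    cached_kernel(x)  # 2 slow computations, 3 cache hits

print(cached_kernel.cache_info())  # hits=3, misses=2
```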
But there’s this constant tension. And now AI — whether it’s training, particularly inference, the new reinforcement learning trends — really exacerbates this. And what we’re seeing right now — and I’d like to pick on the most successful example in this space, Anthropic — is their Claude Code service, which has created a lot of very addicted developers. They love this product and feature, but they also hate it at the same time because they can’t get the tokens they want at any price. They’re literally saying, I’ll give you $2,000 a month, give me unlimited tokens —
Keith: Scarcity is great for margin, though.
Val: It is. So scarcity is a real thing right now. And there’s this encyclopedia, pretty much, that Anthropic has published around how to optimize what’s called your prompt caching, which is fundamentally how to optimize your token costs, your input and output costs.
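The core of that guidance is marking the large, stable prefix of a prompt as cacheable so its prefill is paid for once. A minimal sketch with the Anthropic Python SDK follows; the model ID and file name are placeholders, and current docs should be checked for exact pricing and cache lifetimes:

```python
# Prompt-caching sketch using the Anthropic Python SDK.
# Model ID and file name are placeholders.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

big_stable_context = open("reference_docs.txt").read()  # thousands of tokens

response = client.messages.create(
    model="claude-sonnet-4-20250514",  # placeholder; use a current model ID
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": big_stable_context,
            # The prompt up to and including this block becomes a cache
            # entry; later calls with the same prefix read it at a discount
            # instead of re-running prefill on it.
            "cache_control": {"type": "ephemeral"},
        }
    ],
    # Only the small, changing suffix pays the full prefill price each call.
    messages=[{"role": "user", "content": "Summarize section 3."}],
)

# usage reports cache_creation_input_tokens vs. cache_read_input_tokens,
# which is how you verify the cache is actually being hit.
print(response.usage)
```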
Keith: We’re in the optimization phase, too, aren’t we, with the energy costs, with everything else?
Val: Very, very much. For Anthropic, the problem isn’t their token pricing; they dictate that. The problem is access to GPUs, and then their providers, like Amazon and now Google and others, getting access to energy to actually power those GPUs. The buck ultimately stops there.
Keith: More scarcity.
Multi-Vendor AI Infrastructure Strategy: Supporting NVIDIA, AMD, and Hybrid Cloud Deployments
Keith: Gosh, I can take this conversation in a couple of different directions, but I think there’s a comment about competition. You mentioned the LLMs. You mentioned the data centers, on a global basis, on a chip basis. How are you placing bets as a company in this area?
Val: We can be Switzerland here, right? Our design from day one was to be a software-only company. We run on all supported hardware, and that support matrix keeps growing and growing. So we support NVIDIA processors and accelerators, AMD processors and accelerators, a growing list of others.
We support cloud deployments. We support on-premises deployments. We support hybrid deployments.
Keith: You have to be loyal to your customers, too.
Val: The customers dictate this. They’re the ones that are saying, “No, I want to use AMD instead of NVIDIA for inference,” for example. Or, “No, I have a heavy training job and the NVIDIA stack is the most mature.”
GPU Prefill and Why It's the Biggest Bottleneck in AI Inference
Keith: What are your key inference challenges now, today, and looking forward?
Val: So the key inference challenges really boil down to one underlying bottleneck for all of AI. If you really peel all the layers away and look for the real underlying bottleneck, it’s this thing called GPU prefill.
And it’s such a bottleneck that NVIDIA, for the first time in their history, pre-announced a processor about 18 months in advance. It’s not a general-purpose GPU, not even a GPU that’s just for inference as opposed to training, but one just for prefill — that first phase where you basically take your prompt, vectorize it into 10,000, 20,000 dimensions, and store the result in this thing called the KV cache. Reading from that cache is the decode process.
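A toy sketch of those two phases, shapes only, with random matrices standing in for a real model: prefill processes the whole prompt once and materializes the KV cache, and each decode step adds a single row while reusing everything already there.

```python
# Toy prefill/decode sketch (shapes only; random stand-ins for real weights).
import numpy as np

d_model, prompt_len = 64, 512
W_k = np.random.randn(d_model, d_model)  # stand-in for learned projections
W_v = np.random.randn(d_model, d_model)

def prefill(prompt_embeddings: np.ndarray) -> dict:
    """Compute-heavy phase: all prompt tokens processed at once."""
    return {"K": prompt_embeddings @ W_k,   # (prompt_len, d_model)
            "V": prompt_embeddings @ W_v}

def decode_step(cache: dict, token_embedding: np.ndarray) -> dict:
    """Memory-heavy phase: one new token, cache grows by one row."""
    cache["K"] = np.vstack([cache["K"], token_embedding @ W_k])
    cache["V"] = np.vstack([cache["V"], token_embedding @ W_v])
    return cache

cache = prefill(np.random.randn(prompt_len, d_model))  # pay this once
for _ in range(16):                                    # then decode cheaply
    cache = decode_step(cache, np.random.randn(1, d_model))

print(cache["K"].shape)  # (528, 64): prompt KV plus 16 generated tokens
```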
So the real bottleneck is: Prefill always has to happen before decode, and you run out of memory for decode within minutes, hence the Anthropic recommendations, which are like an encyclopedia long right now. But what if you had this concept of a token warehouse, where instead of prefilling many, many times as you run out of memory — which is all the time for multiple concurrent sessions — you prefilled once and just decoded from that forever?
It’s a very groundbreaking notion and it unlocks almost everything in AI.
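A conceptual sketch of that token-warehouse idea (the concept only, not any vendor's actual implementation): compute the KV cache for a shared prompt prefix once, persist it, and let later sessions load it instead of re-running prefill. Here `np.savez` stands in for a shared, high-throughput storage tier.

```python
# "Token warehouse" concept sketch: prefill once, decode from it forever.
# np.savez stands in for a shared, high-throughput storage tier.
import numpy as np

def warehouse_put(path: str, K: np.ndarray, V: np.ndarray) -> None:
    np.savez(path, K=K, V=V)  # persist the prefill's KV cache once

def warehouse_get(path: str) -> tuple:
    data = np.load(path)
    return data["K"], data["V"]  # sessions start decoding from here

# One expensive prefill (stand-in arrays for a real KV cache)...
K, V = np.random.randn(512, 64), np.random.randn(512, 64)
warehouse_put("shared_prefix.npz", K, V)

# ...and every later concurrent session reads instead of recomputing.
K2, V2 = warehouse_get("shared_prefix.npz")
assert np.allclose(K, K2) and np.allclose(V, V2)
```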
What Does a Chief AI Officer Do? Managing ROI and Cash Flow in Enterprise AI
Keith: Talk to me about being a chief AI officer in the middle of this race, looking at everything from minerals to real estate. What a wide area. How do you start to evaluate investment?
Val: Well, you’ve got to be a bit of a CFO, you’ve got to wear a CFO hat, right? And you’ve got to understand the net present value of certain things you’re involved in: What’s the internal rate of return? Managing cash flow is key here. I mean, I take a look at what Sarah Friar at OpenAI is trying to do, based on Sam Altman’s public proclamations and so forth. And it’s a real struggle to afford, month to month, the super ambitious goals of some of the very high-profile labs.
The other end of the spectrum would just be the enterprises trying to consume this stuff, let alone even host it.
Keith: That’s a good point.
Val: And so figuring out who my providers are, to your point: There’s competition between GPU providers. It’s a layer cake of GPU providers, model providers, the AI app companies, and then the enterprises consuming it all.
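Those CFO-hat calculations are standard discounted-cash-flow math; a minimal sketch with entirely made-up numbers:

```python
# NPV / IRR sketch for a hypothetical AI infrastructure bet: capex up
# front, then growing net cash flow from inference revenue. All figures
# are made up. numpy_financial provides the standard helpers.
import numpy_financial as npf

# Year 0 spend, then five years of net cash flow (USD).
cash_flows = [-10_000_000, 1_500_000, 3_000_000, 4_000_000, 4_500_000, 5_000_000]

hurdle_rate = 0.12  # assumed cost of capital

npv = npf.npv(hurdle_rate, cash_flows)  # value of the bet in today's dollars
irr = npf.irr(cash_flows)               # discount rate at which NPV hits zero

# Rule of thumb: fund the project if NPV > 0, i.e. the IRR clears the hurdle.
print(f"NPV @ {hurdle_rate:.0%}: ${npv:,.0f}")
print(f"IRR: {irr:.1%}")
```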
When Will AGI Arrive? How AI Agents Are Evolving from Supervised Interns to Autonomous Employees
Keith: 2026: What’s the innovation of the year going to be?
Val: I think we’re going to be shocked at where we are. I can’t tell you exactly what’s going to shock us, but if I take a look back just 10 months to the beginning of 2025, I wouldn’t believe where we are right now. I wouldn’t believe that coding agents actually work. That was hype 10 months ago. It’s hard reality today.
Keith: So there will be a new version of vibe coding or something different.
Val: That’s the most pedestrian prediction, I think. I think we can do so much better. There are two ways to evaluate this. There’s the textbook definition of the next major breakthrough we anticipate, AGI (artificial general intelligence): You’ll be able to ask AI to do something economically valuable without a lot of guidance or instruction, just like hiring a smart employee and letting them run.
The personal version of that is: Today, if you’re using agents at all — and if you’re not, you have to be, you definitely have to start using agents to see their value — but you really have to supervise them a lot. It’s kind of like an intern.
Keith: And on that note, you need to make sure people know how to use them correctly, ethically.
Val: All these things you have to do up front. My prediction is that a year from now, it’ll be automatic. The right decisions will be made by the agents even when you didn’t give them instructions that were specific enough, to the point where you’ll trust them blindly.
So today, for example, I have agents generating a bunch of my email responses. So if you’re getting an email answer from me, apologies, but it might be AI-generated. But I don’t let them send —
Keith: Send some money….
Val: [Laughter] But I do have them prepare all the drafts, and I review the drafts, and month over month, I am almost shocked at how much better those draft responses are becoming, to the point where I’m going to start to soon trust them to send those drafts on my behalf.
Keith: It’s fascinating.
When Will Quantum Computing Be Ready for AI? Timeline and How AI Accelerates Quantum Development
Keith: You didn’t mention quantum.
Val: [Laughter] I don’t think I want to, honestly.
Keith: [Laughter] It’s a tornado, not the hurricane.
Val: It’s such a tangent. It’s a pun: Like many things about quantum, it’s unpredictable when it’s going to be real and valuable; it’s still five to 10 to 15 years out, and no one can get specific. However, it will create the ultimate virtuous cycle once it’s upon us.
Because AI — and Jensen (Huang) is very deliberate about this with CUDA Quantum — is helping simulate the proper quantum computers so that we can improve what they should look like. It’s effectively accelerating the availability of usable quantum technology, and once that’s there, it will accelerate AI training and inference in almost unimaginable ways.
But we’re not there yet. We haven’t lit the actual spark for this yet. We’re building the kindling, we’re building the fire.
Keith: We just started. We’re hearing it more and more. I think we have a lot to go still with regular AI, if that’s a phrase. And I love listening to WEKA. I love what you guys are doing. You’re on such a great roll. Yours is too quiet a story; more people need to watch and look at what you’re doing. And I wish you lots of success finishing out this year and next year.
Val: Thank you. Can’t wait to come back and see whether I was right or wrong on these predictions.
Keith: You are invited back, for sure, we’ll have you next year.
Val: Look forward to it.
Like This Discussion? There’s More!
Hear additional insights on how to leverage a memory-first architecture to accelerate AI inference from WEKA Chief Technology Officer Shimon Ben-David during his keynote at AI Infra Summit 2025.