Optimizing AI Inference Economics | A Conversation with SemiAnalysis

WEKA’s Val Bercovici sits down with Wei Zhou, head of AI utility research at SemiAnalysis, during AI Infra Summit 2025 to discuss the evolution of token economics and its impact on costs as AI deployments become more sophisticated. Below is a lightly edited transcript of their conversation.

Transcript

00:00

What Are Token Economics in AI? How Software Shifted from Zero to High Marginal Costs in 2025

Val Bercovici: Tokenomics, you know, the hottest topic that we both love and share a passion for. What’s the latest and greatest in terms of the last few months? Because nothing else matters before that, really.

Wei Zhou: That’s a really good question. Just for background, I joined SemiAnalysis maybe half a year ago now, basically to bring the vision of SemiAnalysis to fruition, from hardware all the way through to software, the models, and the tokens. I think people have talked about this ad nauseam, but we’re basically entering a different age of software, where software is going from zero or minimal marginal costs to relatively high marginal costs.

So the economics of serving tokens, the economics of these models in terms of the cost per token they produce, is extremely important. And there’s a lot of business model considerations when you are serving models as a service.

01:03

Who Pays for AI Tokens? Breaking Down GPU Providers, Model Builders, and Enterprise Users

Val: And the thing I love to dig into with you and colleagues of yours is the layer cake, right? So there’s a GPU provider view of tokenomics. There’s definitely the model provider view of tokenomics, the model builder view. The agent consumer of tokenomics, the end user, whether it’s an enterprise or consumer end user. So different audiences have different interests or priorities. How are you seeing that play out in the industry today?

Wei: I think like any development of a new industry or business model, people fixate or focus on the easiest things first. A lot of businesses bought a bunch of GPUs and rented them out to model runners, people either training models or inferencing the models that they train, who pay by the GPU-hour.

What are they looking for? They’re looking for uptime. They’re looking for service level, service quality. They’re looking for the user experience, what kind of virtualization software you have sitting on top of bare metal, all those things. As you mentioned in the layer cake, we are definitely seeing more and more companies focus on the layer below that, which is: once you do have a model, ideally an open-source model that you’re hosting so you basically don’t have to pay the API tax to anyone, how do you serve it efficiently, and how do you design your infrastructure setup for the workloads you expect?

I think the really interesting thing about tokenomics — and I think you and I have talked about this and understand — is you can almost print any number you want. You want to produce a two-cent token, you can produce a two-cent token; you want to produce a $10 token, you can produce a $10 token. It all really depends on a few things: how fast the user is experiencing the tokens, so that’s tokens per second per user, and how long you’re making your user wait for that first token out, which is time to first token. There’s this great economic tradeoff between how fast your user experience is vs. how many tokens are output by a given accelerator or GPU per second. I think that’s kind of what businesses are trying to figure out today.
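
To make that economic tradeoff concrete, here is a rough back-of-the-envelope sketch in Python; the GPU rental rate, batch sizes, and per-user speeds are illustrative assumptions, not figures from the conversation.

```python
# Sketch of the tradeoff described above: the faster each user's stream,
# the fewer concurrent streams a GPU can batch, so the cost per token rises.
# All numbers are illustrative assumptions, not measured figures.

GPU_HOURLY_RATE = 3.00  # assumed $/GPU-hour rental price


def cost_per_million_tokens(tokens_per_gpu_per_second: float) -> float:
    """Cost to produce 1M tokens at a given aggregate per-GPU throughput."""
    tokens_per_hour = tokens_per_gpu_per_second * 3600
    return GPU_HOURLY_RATE / tokens_per_hour * 1_000_000


# Hypothetical operating points: (tokens/s per user, concurrent users batched)
for tps_per_user, batch in [(200, 4), (60, 64), (20, 400)]:
    aggregate = tps_per_user * batch
    print(f"{tps_per_user:>4} tok/s/user x {batch:>3} users "
          f"= {aggregate:>6} tok/s/GPU -> "
          f"${cost_per_million_tokens(aggregate):.2f} per 1M tokens")
```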

03:06

How Does Prompt Caching Reduce AI Costs? Comparing Anthropic and OpenAI's 2024-2025 Pricing Models

Val: Exactly. I think the ends of that spectrum are interesting. You can have the private jet experience for tokens at a certain cost, or you can have the public transit bus experience at a much lower cost, but definitely different quality of service, different experience.

Now, one of the underlying mechanisms for this is prompt caching. We talked just beforehand about this. Let’s dig into this a little bit. We see, at least on the surface, that the commercial labs, the Anthropics, the OpenAIs, have distinct prompt caching pricing with different levels of complexity, which you talk about. It seems like the open models and the open model providers aren’t implementing or pushing prompt caching as much. What’s your take on that? Because you’re seeing a finer level of granularity on this.

Wei: It’s a great observation. Just to level-set everyone, I think prompt caching as a price discriminator or a discounting mechanism really only appeared maybe in the second half of last year. This is really novel, but just taking two or three steps back:

What is token pricing today? Token pricing today actually looks a lot like the hardware utilization required to produce a specific token. I’m sure if you go on Anthropic or OpenAI or even OpenRouter, you can see there’s a different price for an input token vs. an output token. And then there’s this new token that people are classifying, which is a cached token. From that perspective, what all these prices really reflect is the different kind of compute intensity of each token. And prompt caching is an extremely important and extremely influential economic factor when it comes to the total cost of any given stream of tokens that you’re consuming.

And I’m sure you have a lot to say on this, but it’s basically a way to incentivize users to craft their prompts or craft their workloads to make the most economic use of the hardware.
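
As a rough illustration of how those three token classes add up, here is a minimal cost sketch; the per-million-token prices, cache discount, and request sizes are placeholder assumptions, not any provider’s actual rate card.

```python
# Minimal cost sketch using the three token classes discussed above:
# uncached input, cached input, and output. Prices are placeholders in
# $ per 1M tokens, not any provider's published rates.

PRICE_INPUT = 3.00    # assumed $/1M uncached input tokens
PRICE_CACHED = 0.30   # assumed $/1M cached input tokens (~90% discount)
PRICE_OUTPUT = 15.00  # assumed $/1M output tokens


def request_cost(input_tokens: int, output_tokens: int, cache_hit_rate: float) -> float:
    """Total cost of one request given the share of input tokens served from cache."""
    cached = input_tokens * cache_hit_rate
    uncached = input_tokens - cached
    return (uncached * PRICE_INPUT + cached * PRICE_CACHED
            + output_tokens * PRICE_OUTPUT) / 1_000_000


# The same agentic request (100k tokens in, 1k out) with a cold vs. warm cache
print(f"0% cache hits:  ${request_cost(100_000, 1_000, 0.0):.4f}")
print(f"90% cache hits: ${request_cost(100_000, 1_000, 0.9):.4f}")
```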

05:05

What If Prompt Caching Lasted Days Instead of Hours? Extended Cache Benefits for Agent Swarms

Val: To that point, there’s a lot of, I would almost say ironically, manual cache management in these agentic systems as we’re developing them, especially as we’re deploying them. What are your thoughts? You know our opinions on this around augmented memory and the impact it could have, but what are your thoughts on the practical benefits of taking what seems to be the state-of-the-art of prompt caching, which is maybe paying a premium for an hour’s worth of caching?

What if we were able to offer a day? What if we’re able to offer a week or a month of prompt caching at, pick your cost point, higher or lower than today? What would be the benefits of that for the workloads you’re seeing and maybe ideally, agentic swarm workloads?

Wei: I think it’s a good question. Basically, right now the API pricing and these models kind of grew up in this world of chat. In chat, you can get a few thousand tokens in, maybe a few thousand out, or more with reasoning. Sometimes users want to keep what they were talking about, but mostly people just press the new chat button. You basically refresh everything and repeat.

Val: One or two turns, a few turns.

Wei: Exactly.

When it comes to agentic workloads, or what we describe as swarms, where you have different models operating on different parts of the context and doing different things with it, there’s enormous economic value in not constantly refreshing all these tokens and re-prefilling. Because every token that has to be prefilled costs money. Every token that has to be decoded also costs money — more money, in fact. To the extent you can basically get tokens for free with prefix cache hits, you don’t need to pay for those tokens.

As context windows explode, as the token length per query explodes, there’s actually increasing value to being able to find all those tokens with a prefix hit, rather than with fresh prefill. And what you describe, which is storing the tokens over multiple days and months, I think that’s super helpful for enterprises because then they can basically just pick up where they left off.
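
To show why this compounds as contexts grow, here is a small sketch of cumulative prefill across a multi-turn agent run; the turn count and tokens added per turn are assumptions chosen only for illustration.

```python
# Why prefix caching matters more as agent contexts grow: without a prefix
# cache, every turn re-prefills the entire accumulated context; with one,
# only the newly appended tokens are prefilled. Sizes are assumptions.

TURNS = 20
NEW_TOKENS_PER_TURN = 5_000  # assumed tokens appended per agent step

context = 0
prefill_without_cache = 0
prefill_with_cache = 0
for _ in range(TURNS):
    context += NEW_TOKENS_PER_TURN
    prefill_without_cache += context           # whole context re-prefilled
    prefill_with_cache += NEW_TOKENS_PER_TURN  # only the new suffix prefilled

print(f"Prefill tokens without prefix cache: {prefill_without_cache:,}")
print(f"Prefill tokens with prefix cache:    {prefill_with_cache:,}")
```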

07:13

Why Is KV Cache Hit Rate the #1 Metric for AI Agent Success? 50% vs 90% Performance Impact

Val: Manus, a very popular agent now, published a pretty popular context engineering blog a couple of months ago. They said the No. 1 indicator of success for agents is KV cache hit rate. And what we’ve seen in our models is that taking that incremental difference from 50-60% to 60-70%, and so forth, all the way up to 90%, has an exponential benefit. It’s a classic exponential graph. So what are your thoughts around how systems of the future, agents of the future, are going to depend on these kinds of cache hit rates in agent swarms? And then we’ll table a reinforcement learning question for after.

Wei: Like I said, free tokens, right?

Val: Yeah.

Wei: I think one common misconception people have is that they look at the prices of these tokens and think, “Oh, output tokens are really expensive, like 10x more expensive. That must be the thing that really drives cost.” But if you look at actual workloads for API services and things like that, it’s actually the input workloads that dominate.

Val: Absolutely. The ratios are hundreds of input tokens to one output token. It’s crazy.

Wei: For a lot of these agentic workloads, yeah. I think people are still anchoring on this chat interface, which is 2:1, 3:1, 4:1. And nowadays we see 10:1, 50:1, 100:1, and the total sequence length is very long. So the input token price has a huge influence on the average total price of your tokens. And the more of those that are at a 90% discount, or essentially free if you own your own hardware, the lower your overall token cost will be.

Val: Yeah, that’s a cool point.
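
A quick back-of-the-envelope check of that point, using placeholder per-million-token prices with a 10x output premium; the prices and ratios below are assumptions, not published rates.

```python
# At agentic input:output ratios, input tokens dominate total spend even
# when each output token costs 10x more. Prices are placeholder $/1M
# figures, not a specific provider's rate card.

PRICE_INPUT, PRICE_OUTPUT = 3.00, 30.00  # assumed, with a 10x output premium

for ratio in (3, 10, 50, 100):  # input:output token ratio
    input_cost = ratio * PRICE_INPUT
    output_cost = 1 * PRICE_OUTPUT
    share = input_cost / (input_cost + output_cost)
    print(f"{ratio:>3}:1 -> input tokens are {share:.0%} of the bill")
```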

08:47

How Do Reinforcement Learning Costs Scale in 2025? Memory-Bound Inference vs. Pre-Training

Val: At the keynotes this morning it was really refreshing to see Marco, the PyTorch product manager at AMD, map the reinforcement learning challenge in terms of FLOPS and memory and show how, again, it’s very, very memory-bound. What are your thoughts on that? As we wrap up 2025, the latest trend seems to be that reinforcement learning is the next milestone, the next step function of AI, perhaps even AGI. So given that it’s really inference-bound and decode-bound, what are your thoughts on tokenomics as really advancing the whole field and science of AI?

Wei: Well, I don’t know if we have enough time for this, but the summary is there are three ways of scaling model performance, roughly speaking. There’s pre-training, the classical “more FLOPS, more data.” There’s post-training, with all kinds of techniques: supervised fine-tuning, RL. And these methodologies have actually been a bigger driver of model gains recently, because you can basically mold the model against a set of verifiable tasks. You can verify it either with a grader or with an LLM judging against a rubric, and make the model more practically useful.

And then there’s test-time compute: the more time the model spends at test time, and you can see some of these models shoot out multiple different responses to the same prompt and try to find the best answer. Reinforcement learning requires a lot of forward passes. Basically, you’re producing a lot of traces, a lot of rollouts, and this is very much an inference workload. Inference workloads, as we’ve discussed, are very much memory-bound, because the decode tokens being produced are typically bound by the memory and communication rates between chips.

So I think with better memory-caching techniques, you can drive down the cost of this technique and reduce your overall inference spend for what is essentially training.

Val: I would say accelerate the RL loops as well. And just bring new checkpoints, key checkpoints to market faster, which is really cool.

Wei: Yeah, it’s all about throughput at the end of the day.
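
One way to picture the caching point in an RL setting: many rollouts are typically sampled from the same prompt, so a shared prefix means the prompt is prefilled once rather than once per rollout. The sketch below uses assumed token counts purely for illustration.

```python
# Illustrative sketch: in RL, many rollouts share the same prompt/context,
# so prefix reuse avoids repeating that prefill for every sampled trace.
# All counts are assumptions, not numbers from the conversation.

PROMPT_TOKENS = 20_000    # assumed shared prompt/context per RL task
ROLLOUTS_PER_PROMPT = 16  # assumed number of sampled traces per task
OUTPUT_TOKENS = 2_000     # assumed decode length per rollout

prefill_no_reuse = PROMPT_TOKENS * ROLLOUTS_PER_PROMPT
prefill_with_reuse = PROMPT_TOKENS  # prompt prefilled once, then reused
decode_total = OUTPUT_TOKENS * ROLLOUTS_PER_PROMPT

print(f"Prefill tokens without prefix reuse: {prefill_no_reuse:,}")
print(f"Prefill tokens with prefix reuse:    {prefill_with_reuse:,}")
print(f"Decode tokens (memory-bound either way): {decode_total:,}")
```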

11:03

What Is SemiAnalysis Inference Max? Benchmarking B200, H200, and MI350 GPU Performance

Val: So let’s rap on Inference Max — an amazing product, or service, I’m not sure exactly how you categorize it — which is becoming really widely followed by the industry. It made a great first impression and got a great reception from the industry. We know it’s version one and there’s so much more we can do. Why don’t you just briefly outline what it is, and especially what the challenges are, where we want to take it in version two, and how we can fix some of those right here, right now.

Wei: I don’t know about right here, right now, but Inference Max is really cool. SemiAnalysis, as a firm, started off doing research on semiconductors, but recently we’ve really started focusing on the things that matter for people deploying AI today. Both from a bare-metal rental perspective, and now, with Inference Max, from a model deployment perspective: what hardware setup is the most efficient for serving a given model?

Val: And there’s so many options there, right?

Wei: So many options.

Val: It’s quite a recipe. You have to figure it out.

Wei: We try to pick a few models that we think are representative of the market today. So we have GPT-OSS-120B, which is a 5- or 6-billion active parameter model, then we have DeepSeek-R1, which is a 37-billion active parameter model. So we try to match the more popular model architectures with what we think are the most popular accelerators today.

Today we have B200, GB200, and the MI series all the way up to 350. We have the current system setups, and we’re looking to expand more. So what is the methodology today? We basically have three flavors of input-output sequence length, we have two models, and we have FP4 and FP8, different quantizations.

Then we measure all of these unit economic metrics across a sweep of hardware, across an array of what we call interactivity: tokens per second per user, basically what you as a user experience. And you can see this beautiful economic trade-off in terms of the throughput you can get out of your GPU vs. the throughput a user expects.
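
Roughly, the benchmark matrix being described is a cross-product of models, sequence-length flavors, quantizations, and interactivity targets. The sketch below paraphrases that structure; the specific values and names are assumptions, not the actual Inference Max configuration.

```python
# A minimal sketch of the sweep structure described above: models crossed
# with input/output sequence-length flavors, quantizations, and a range of
# per-user interactivity targets. Values are assumptions, not the real
# Inference Max configuration.

from itertools import product

models = ["gpt-oss-120b", "deepseek-r1"]
sequence_flavors = [(1024, 1024), (8192, 1024), (1024, 8192)]  # assumed (ISL, OSL)
quantizations = ["fp4", "fp8"]
interactivity_targets = [10, 25, 50, 100]  # assumed tokens/s per user to hold

for model, (isl, osl), quant, tps_user in product(
        models, sequence_flavors, quantizations, interactivity_targets):
    # A real harness would launch a serving run here and record aggregate
    # tokens/s per GPU while holding the per-user interactivity target.
    print(f"run: {model} isl={isl} osl={osl} {quant} @ {tps_user} tok/s/user")
```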

13:15

How to Model Real AI Workloads: 8:1 vs 100:1 Input-Output Ratios in Production Systems

Wei: You mentioned shortcomings, or what it could be.

Val: Opportunities to improve.

Wei: Sorry. Opportunities to improve.

Val: Yeah, let’s connect what we just talked about in terms of the input-output token ratios we’re seeing in the real world today to how we want to use this really amazing foundation of Inference Max to represent that.

Wei: When we talk to actual endpoint providers, they say, “I just look at the 8:1 because that’s what we see.” Others say, “Actually, I’m seeing a lot of agentic use cases that are 50:1 or 100:1.” That’s a different discussion. So there are things we can do to improve the different input-output lengths we use.

The other thing right now is we’re using random sequences, so there is no KV cache hit. So we’re not really measuring the real-life performance of these machines.

Val: System prompts and all sorts of things in there.

Wei: System prompts and things like that, but also we’re just not maybe fully capturing what a real use case would look like because we don’t have that yet. But those are all on the roadmap.

Val: And again, those are ways to take Inference Max in this direction. It’s not a commitment, obviously, but it’s very resource-intensive. What are your thoughts around that? I’ve got some proposals that we can just openly talk about here and see if we can start something and recruit some sponsors. But what are your thoughts on the requirements to go from Inference Max 1 to version 2?

Wei: Yeah, I think the first thing is you’ve got to bribe me.

[Laughter]

No, we’ve been fairly open with our compute partners on this, and AMD and NVIDIA have been huge supporters of this, both from the software engineering resources and hardware resources sides. And we utilized some neoclouds as well for compute resources.

I think the biggest thing that can help us is engineering resources on our end. If there are people out there who are very interested in producing what we think will be the definitive benchmark for AI performance, you should reach out to our careers page at SemiAnalysis. We’re always hiring. Beyond that, I think it would be nice to bring in additional hardware partners, because this is live benchmarking done every day. And it’s super transparent, super open, there’s no decision by committee. It’s open source. Everyone can look at the code and make their own judgments.

15:27

Can AI Benchmarking Follow the Kubernetes Foundation Model? $50K/Week Infrastructure Costs

Val: So that’s the cool thing. Again, there are lessons to be learned here, because some of the resource requirements, the GPUs to complement the software, are pretty significant. They’re bordering on $10,000-a-day type requirements, about $50,000 a week. And some of these benchmarks even exceed a day’s worth of runtime.

So one of the things we can consider doing is what worked with Kubernetes in the early days, when I was part of the Cloud Native Computing Foundation and helping form it: When you have a good governance model, with benchmarks that are relevant in the real world, credible to actual end users and customers, and that all the vendors can agree upon through that strong governance model, there’s a way to draw a lot of sponsorship and a lot of funding to afford this kind of Inference Max 2. So that’s something I think is really worth pitching here: Can we make this happen?

Wei: That’s a good point. And while we do appreciate the offer of money being thrown at us, I think for us what’s important is to remain unbiased and to strike a healthy balance between securing the resources that we need to run this benchmark, but also to produce something that the whole industry can agree upon as being a fair, balanced assessment.

Val: And so it’d be a nonprofit foundation people would be throwing money at, and you could be chairing it, etc., and that’s the point.

Wei: So maybe we can convert it into a PBC one day —

Val: Exactly, do the Sam Altman thing.

[Laughter]

Val: That’s a fun note to end it on.

Wei: Cheers.
