Optimizing AI Inference Economics | A Conversation with SemiAnalysis

WEKA’s Val Bercovici sits down with Wei Zhou, head of AI utility research at SemiAnalysis, during AI Infra Summit 2025 to discuss the evolution of token economics and its impact on costs as AI deployments become more sophisticated. Below is a lightly edited transcript of their conversation.

Transcript

00:00

What Are Token Economics in AI? How Software Shifted from Zero to High Marginal Costs in 2025

Val Bercovici: Tokenomics, you know, the hottest topic that we both love and share a passion for. What’s the latest and greatest in terms of the last few months? Because nothing else matters before that, really.

Wei Zhou: That’s a really good question. Just for background, I joined SemiAnalysis maybe half a year ago now, basically to bring the vision of SemiAnalysis to fruition, from hardware all the way through to software, the models, and the tokens. I think people have talked about this ad nauseam, but we’re basically entering a different age of software, where software is going from zero or minimal marginal costs to relatively high marginal costs.

So the economics of serving tokens, the economics of these models in terms of the cost per token they produce, is extremely important. And there’s a lot of business model considerations when you are serving models as a service.

01:03

Who Pays for AI Tokens? Breaking Down GPU Providers, Model Builders, and Enterprise Users

Val: And the thing I love to dig into with you and colleagues of yours is the layer cake, right? So there’s a GPU provider view of tokenomics. There’s definitely the model provider view of tokenomics, the model builder view. The agent consumer of tokenomics, the end user, whether it’s an enterprise or consumer end user. So different audiences have different interests or priorities. How are you seeing that play out in the industry today?

Wei: I think like any development of a new industry or business model, people fixate or focus on the easiest things first. A lot of businesses bought a bunch of GPUs and rented them out to model runners, people either training models or inferencing the models that they train, who pay by the GPU-hour.

What are they looking for? They’re looking for uptime. They’re looking for service level, service quality. They’re looking for the user experience, what kind of virtualization software you have sitting on top of bare metal, all those things. As you mentioned in the layer cake, we are definitely seeing more and more companies focus on the layer below that, which is: once you do have a model, ideally an open-source model that you’re hosting so you basically don’t have to pay the API tax to anyone, how do you serve it efficiently, and how do you design your infrastructure setup for the workloads you expect?

I think the really interesting thing about tokenomics — and I think you and I have talked about this and understand — is you can almost print any number you want. You want to produce a two-cent token, you can produce a two-cent token; you want to produce a $10 token, you can produce a $10 token. It all really depends on a few things: how fast the user is experiencing the tokens, so that’s tokens per second per user, and how long you’re making your user wait for that first token out, which is time to first token. There’s this great economic tradeoff between how fast your user experience is vs. how many tokens are output by a given accelerator or GPU per second. I think that’s kind of what businesses are trying to figure out today.
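
To make that economic tradeoff concrete, here is a rough back-of-the-envelope sketch in Python; the GPU rental rate, batch sizes, and per-user speeds are illustrative assumptions, not figures from the conversation.

```python
# Sketch of the tradeoff described above: the faster each user's stream,
# the fewer concurrent streams a GPU can batch, so the cost per token rises.
# All numbers are illustrative assumptions, not measured figures.

GPU_HOURLY_RATE = 3.00  # assumed $/GPU-hour rental price


def cost_per_million_tokens(tokens_per_gpu_per_second: float) -> float:
    """Cost to produce 1M tokens at a given aggregate per-GPU throughput."""
    tokens_per_hour = tokens_per_gpu_per_second * 3600
    return GPU_HOURLY_RATE / tokens_per_hour * 1_000_000


# Hypothetical operating points: (tokens/s per user, concurrent users batched)
for tps_per_user, batch in [(200, 4), (60, 64), (20, 400)]:
    aggregate = tps_per_user * batch
    print(f"{tps_per_user:>4} tok/s/user x {batch:>3} users "
          f"= {aggregate:>6} tok/s/GPU -> "
          f"${cost_per_million_tokens(aggregate):.2f} per 1M tokens")
```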

03:06

How Does Prompt Caching Reduce AI Costs? Comparing Anthropic and OpenAI's 2024-2025 Pricing Models

Val: Exactly. I think the ends of that spectrum are interesting. You can have the private jet experience for tokens at a certain cost, or you can have the public transit bus experience at a much lower cost, but definitely different quality of service, different experience.

Now, one of the underlying mechanisms for this is prompt caching. We talked just beforehand about this. Let’s dig into this a little bit. We see, at least on the surface, that the commercial labs, the Anthropics, the OpenAIs, have distinct prompt caching pricing with different levels of complexity, which you talk about. It seems like the open models and the open model providers aren’t implementing or pushing prompt caching as much. What’s your take on that? Because you’re seeing a finer level of granularity on this.

Wei: It’s a great observation. Just to level-set everyone, I think prompt caching as a price discriminator or a discounting mechanism really only appeared maybe in the second half of last year. This is really novel, but just taking two or three steps back:

What is token pricing today? Token pricing today actually looks a lot like the hardware utilization required to produce a specific token. I’m sure if you go on Anthropic or OpenAI or even OpenRouter, you can see there’s a different price for an input token vs. an output token. And then there’s this new token that people are classifying, which is a cached token. From that perspective, what all these prices really reflect is the different kind of compute intensity of each token. And prompt caching is an extremely important and extremely influential economic factor when it comes to the total cost of any given stream of tokens that you’re consuming.

And I’m sure you have a lot to say on this, but it’s basically a way to incentivize users to craft their prompts or craft their workloads to make the most economic use of the hardware.
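
As a rough illustration of how those three token classes add up, here is a minimal cost sketch; the per-million-token prices, cache discount, and request sizes are placeholder assumptions, not any provider’s actual rate card.

```python
# Minimal cost sketch using the three token classes discussed above:
# uncached input, cached input, and output. Prices are placeholders in
# $ per 1M tokens, not any provider's published rates.

PRICE_INPUT = 3.00    # assumed $/1M uncached input tokens
PRICE_CACHED = 0.30   # assumed $/1M cached input tokens (~90% discount)
PRICE_OUTPUT = 15.00  # assumed $/1M output tokens


def request_cost(input_tokens: int, output_tokens: int, cache_hit_rate: float) -> float:
    """Total cost of one request given the share of input tokens served from cache."""
    cached = input_tokens * cache_hit_rate
    uncached = input_tokens - cached
    return (uncached * PRICE_INPUT + cached * PRICE_CACHED
            + output_tokens * PRICE_OUTPUT) / 1_000_000


# The same agentic request (100k tokens in, 1k out) with a cold vs. warm cache
print(f"0% cache hits:  ${request_cost(100_000, 1_000, 0.0):.4f}")
print(f"90% cache hits: ${request_cost(100_000, 1_000, 0.9):.4f}")
```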

05:05

What If Prompt Caching Lasted Days Instead of Hours? Extended Cache Benefits for Agent Swarms

Val: To that point, there’s a lot of, I would almost say ironically, manual cache management in these agentic systems as we’re developing them, especially as we’re deploying them. What are your thoughts? You know our opinions on this around augmented memory and the impact it could have, but what are your thoughts on the practical benefits of taking what seems to be the state-of-the-art of prompt caching, which is maybe paying a premium for an hour’s worth of caching?

What if we were able to offer a day? What if we’re able to offer a week or a month of prompt caching at, pick your cost point, higher or lower than today? What would be the benefits of that for the workloads you’re seeing and maybe ideally, agentic swarm workloads?

Wei: I think it’s a good question. Basically, right now the API pricing and these models kind of grew up in this world of chat. In chat, you can get a few thousand tokens in, maybe a few thousand out, or more with reasoning. Sometimes users want to keep what they were talking about, but mostly people just press the new chat button. You basically refresh everything and repeat.

Val: One or two turns, a few turns.

Wei: Exactly.

When it comes to agentic workloads, or what we describe as swarms, where you have different models operating on different parts of the context and doing different things with it, there’s enormous economic value in not constantly refreshing all these tokens and re-prefilling. Because every token that has to be prefilled costs money. Every token that has to be decoded also costs money — more money, in fact. To the extent you can basically get tokens for free with prefix cache hits, you don’t need to pay for those tokens.

As context windows explode, as the token length per query explodes, there’s actually increasing value to being able to find all those tokens with a prefix hit, rather than with fresh prefill. And what you describe, which is storing the tokens over multiple days and months, I think that’s super helpful for enterprises because then they can basically just pick up where they left off.
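
To show why this compounds as contexts grow, here is a small sketch of cumulative prefill across a multi-turn agent run; the turn count and tokens added per turn are assumptions chosen only for illustration.

```python
# Why prefix caching matters more as agent contexts grow: without a prefix
# cache, every turn re-prefills the entire accumulated context; with one,
# only the newly appended tokens are prefilled. Sizes are assumptions.

TURNS = 20
NEW_TOKENS_PER_TURN = 5_000  # assumed tokens appended per agent step

context = 0
prefill_without_cache = 0
prefill_with_cache = 0
for _ in range(TURNS):
    context += NEW_TOKENS_PER_TURN
    prefill_without_cache += context           # whole context re-prefilled
    prefill_with_cache += NEW_TOKENS_PER_TURN  # only the new suffix prefilled

print(f"Prefill tokens without prefix cache: {prefill_without_cache:,}")
print(f"Prefill tokens with prefix cache:    {prefill_with_cache:,}")
```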

07:13

Why Is KV Cache Hit Rate the #1 Metric for AI Agent Success? 50% vs 90% Performance Impact

Val: Manus, a very popular agent now, published a pretty popular context engineering blog a couple of months ago. They said the No. 1 indicator of success for agents is KV cache hit rate. And what we’ve seen in our models is that taking that incremental difference from 50-60% to 60-70%, and so forth, all the way up to 90%, has an exponential benefit. It’s a classic exponential graph. So what are your thoughts around how systems of the future, agents of the future, are going to depend on these kinds of cache hit rates in agent swarms? And then we’ll table a reinforcement learning question for after.

Wei: Like I said, free tokens, right?

Val: Yeah.

Wei: I think one common misconception people have is that they look at the prices of these tokens and think, “Oh, output tokens are really expensive, like 10x more expensive. That must be the thing that really drives cost.” But if you look at actual workloads for API services and things like that, it’s actually the input workloads that dominate.

Val: Absolutely. The ratios are hundreds of input tokens to one output token. It’s crazy.

Wei: For a lot of these agentic workloads, yeah. I think people are still anchoring on this chat interface, which is 2:1, 3:1, 4:1. And nowadays we see 10:1, 50:1, 100:1, and the total sequence length is very long. So the input token price has a huge influence on the average total price of your tokens. And the more of those that are at a 90% discount, or essentially free if you own your own hardware, the lower your overall token cost will be.

Val: Yeah, that’s a cool point.
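
A quick back-of-the-envelope check of that point, using placeholder per-million-token prices with a 10x output premium; the prices and ratios below are assumptions, not published rates.

```python
# At agentic input:output ratios, input tokens dominate total spend even
# when each output token costs 10x more. Prices are placeholder $/1M
# figures, not a specific provider's rate card.

PRICE_INPUT, PRICE_OUTPUT = 3.00, 30.00  # assumed, with a 10x output premium

for ratio in (3, 10, 50, 100):  # input:output token ratio
    input_cost = ratio * PRICE_INPUT
    output_cost = 1 * PRICE_OUTPUT
    share = input_cost / (input_cost + output_cost)
    print(f"{ratio:>3}:1 -> input tokens are {share:.0%} of the bill")
```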

08:47

How Do Reinforcement Learning Costs Scale in 2025? Memory-Bound Inference vs. Pre-Training

Val: At the keynotes this morning it was really refreshing to see Marco, the PyTorch product manager at AMD, map the reinforcement learning challenge in terms of FLOPS and memory and show how, again, it’s very, very memory-bound. What are your thoughts on that? As we wrap up 2025, the latest trend seems to be that reinforcement learning is the next milestone, the next step function of AI, perhaps even AGI. So given that it’s really inference-bound and decode-bound, what are your thoughts on tokenomics as really advancing the whole field and science of AI?

Wei: Well, I don’t know if we have enough time for this, but the summary is there are three ways of scaling model performance, roughly speaking. There’s pre-training, the classical “more FLOPS, more data.” There’s post-training, with all kinds of techniques: supervised fine-tuning, RL. And these methodologies have actually been a bigger driver of model gains recently, because you can basically mold the model against a set of verifiable tasks. You can verify it either with a grader or with an LLM judging against a rubric, and make the model more practically useful.

And then there’s test-time compute: the more time the model spends at test time, and you can see some of these models shoot out multiple different responses to the same prompt and try to find the best answer. Reinforcement learning requires a lot of forward passes. Basically, you’re producing a lot of traces, a lot of rollouts, and this is very much an inference workload. Inference workloads, as we’ve discussed, are very much memory-bound, because the decode tokens being produced are typically bound by the memory and communication rates between chips.

So I think with better memory-caching techniques, you can drive down the cost of this technique and reduce your overall inference spend for what is essentially training.

Val: I would say accelerate the RL loops as well. And just bring new checkpoints, key checkpoints to market faster, which is really cool.

Wei: Yeah, it’s all about throughput at the end of the day.
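
One way to picture the caching point in an RL setting: many rollouts are typically sampled from the same prompt, so a shared prefix means the prompt is prefilled once rather than once per rollout. The sketch below uses assumed token counts purely for illustration.

```python
# Illustrative sketch: in RL, many rollouts share the same prompt/context,
# so prefix reuse avoids repeating that prefill for every sampled trace.
# All counts are assumptions, not numbers from the conversation.

PROMPT_TOKENS = 20_000    # assumed shared prompt/context per RL task
ROLLOUTS_PER_PROMPT = 16  # assumed number of sampled traces per task
OUTPUT_TOKENS = 2_000     # assumed decode length per rollout

prefill_no_reuse = PROMPT_TOKENS * ROLLOUTS_PER_PROMPT
prefill_with_reuse = PROMPT_TOKENS  # prompt prefilled once, then reused
decode_total = OUTPUT_TOKENS * ROLLOUTS_PER_PROMPT

print(f"Prefill tokens without prefix reuse: {prefill_no_reuse:,}")
print(f"Prefill tokens with prefix reuse:    {prefill_with_reuse:,}")
print(f"Decode tokens (memory-bound either way): {decode_total:,}")
```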

11:03

What Is SemiAnalysis Inference Max? Benchmarking B200, H200, and MI350 GPU Performance

Val: So let’s rap on Inference Max — an amazing product, or service, I’m not sure exactly how you categorize it — which is becoming really widely followed by the industry. It made a great first impression and got a great reception from the industry. We know it’s version one and there’s so much more we can do. Why don’t you just briefly outline what it is, and especially what the challenges are, where we want to take it in version two, and how we can fix some of those right here, right now.

Wei: I don’t know about right here, right now, but Inference Max is really cool. SemiAnalysis, as a firm, started off doing research on semiconductors, but recently we’ve really started focusing on the things that matter for people deploying AI today. Both from a bare-metal rental perspective, and now, with Inference Max, from a model deployment perspective: what hardware setup is the most efficient for serving a given model?

Val: And there’s so many options there, right?

Wei: So many options.

Val: It’s quite a recipe. You have to figure it out.

Wei: We try to pick a few models that we think are representative of the market today. So we have GPT-OSS-120B, which is a 5- or 6-billion active parameter model, then we have DeepSeek-R1, which is a 37-billion active parameter model. So we try to match the more popular model architectures with what we think are the most popular accelerators today.

Today we have B200, GB200, and the MI series all the way up to 350. We have the current system setups, and we’re looking to expand more. So what is the methodology today? We basically have three flavors of input-output sequence length, we have two models, and we have FP4 and FP8, different quantizations.

Then we measure all of these unit economic metrics across a sweep of hardware, across an array of what we call interactivity: tokens per second per user, basically what you as a user experience. And you can see this beautiful economic trade-off in terms of the throughput you can get out of your GPU vs. the throughput a user expects.
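
Roughly, the benchmark matrix being described is a cross-product of models, sequence-length flavors, quantizations, and interactivity targets. The sketch below paraphrases that structure; the specific values and names are assumptions, not the actual Inference Max configuration.

```python
# A minimal sketch of the sweep structure described above: models crossed
# with input/output sequence-length flavors, quantizations, and a range of
# per-user interactivity targets. Values are assumptions, not the real
# Inference Max configuration.

from itertools import product

models = ["gpt-oss-120b", "deepseek-r1"]
sequence_flavors = [(1024, 1024), (8192, 1024), (1024, 8192)]  # assumed (ISL, OSL)
quantizations = ["fp4", "fp8"]
interactivity_targets = [10, 25, 50, 100]  # assumed tokens/s per user to hold

for model, (isl, osl), quant, tps_user in product(
        models, sequence_flavors, quantizations, interactivity_targets):
    # A real harness would launch a serving run here and record aggregate
    # tokens/s per GPU while holding the per-user interactivity target.
    print(f"run: {model} isl={isl} osl={osl} {quant} @ {tps_user} tok/s/user")
```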

13:15

How to Model Real AI Workloads: 8:1 vs 100:1 Input-Output Ratios in Production Systems

Wei: You mentioned shortcomings, or what it could be.

Val: Opportunities to improve.

Wei: Sorry. Opportunities to improve.

Val: Yeah, let’s connect what we just talked about in terms of the input-output token ratios we’re seeing in the real world today to how we want to use this really amazing foundation of Inference Max to represent that.

Wei: When we talk to actual endpoint providers, they say, “I just look at the 8:1 because that’s what we see.” Others say, “Actually, I’m seeing a lot of agentic use cases that are 50:1 or 100:1.” That’s a different discussion. So there are things we can do to improve the different input-output lengths we use.

The other thing right now is we’re using random sequences, so there is no KV cache hit. So we’re not really measuring the real-life performance of these machines.

Val: System prompts and all sorts of things in there.

Wei: System prompts and things like that, but also we’re just not maybe fully capturing what a real use case would look like because we don’t have that yet. But those are all on the roadmap.

Val: And again, those are ways to take Inference Max in this direction. It’s not a commitment, obviously, but it’s very resource-intensive. What are your thoughts around that? I’ve got some proposals that we can just openly talk about here and see if we can start something and recruit some sponsors. But what are your thoughts on the requirements to go from Inference Max 1 to version 2?

Wei: Yeah, I think the first thing is you’ve got to bribe me.

[Laughter]

No, we’ve been fairly open with our compute partners on this, and AMD and NVIDIA have been huge supporters of this, both from the software engineering resources and hardware resources sides. And we utilized some neoclouds as well for compute resources.

I think the biggest thing that can help us is engineering resources on our end. If there are people out there who are very interested in producing what we think will be the definitive benchmark for AI performance, you should reach out to our careers page at SemiAnalysis. We’re always hiring. Beyond that, I think it would be nice to bring in additional hardware partners, because this is live benchmarking done every day. And it’s super transparent, super open, there’s no decision by committee. It’s open source. Everyone can look at the code and make their own judgments.

15:27

Can AI Benchmarking Follow the Kubernetes Foundation Model? $50K/Week Infrastructure Costs

Val: So that’s the cool thing. Again, there are lessons to be learned here, because some of the resource requirements, the GPUs to complement the software, are pretty significant. They’re bordering on $10,000-a-day type requirements, about $50,000 a week. And some of these benchmarks even exceed a day’s worth of runtime.

So one of the things we can consider doing is what worked with Kubernetes in the early days, when I was part of the Cloud Native Computing Foundation and helping form it: When you have a good governance model, with benchmarks that are relevant in the real world, credible to actual end users and customers, and that all the vendors can agree upon through that strong governance model, there’s a way to draw a lot of sponsorship and a lot of funding to afford this kind of Inference Max 2. So that’s something I think is really worth pitching here: Can we make this happen?

Wei: That’s a good point. And while we do appreciate the offer of money being thrown at us, I think for us what’s important is to remain unbiased and to strike a healthy balance between securing the resources that we need to run this benchmark, but also to produce something that the whole industry can agree upon as being a fair, balanced assessment.

Val: And so it’d be a nonprofit foundation people would be throwing money at, and you could be chairing it, etc., and that’s the point.

Wei: So maybe we can convert it into a PBC one day —

Val: Exactly, do the Sam Altman thing.

[Laughter]

Val: That’s a fun note to end it on.

Wei: Cheers.
