WEKA. March 18, 2026
TL;DR
- Hidden costs — engineering time, power and cooling, opportunity cost — are just as significant as GPUs and tokens, and far less visible.
- No single metric captures AI performance. The winning approach: measure everything, triangulate, and stay skeptical of any one number.
- Flexibility isn’t a nice-to-have. With new chip releases every six months, multi-tier strategies and multi-cloud options are survival basics.
Picture a room full of people who build and buy AI infrastructure for a living. When WEKA CMO Lauren Vaccarello asked them point-blank — “Do you know how much you are spending on your AI infrastructure?” — nobody had a quick answer.
The silence said everything.
Lauren brought together leaders from Meta, Lambda, and Silicon Data to dig into the question the industry keeps dodging: what does AI infrastructure actually cost, and how do you prove it’s worth it?
Below are takeaways from the conversation.
What Are the Hidden Costs of AI Infrastructure and How Do You Calculate ROI?
Measuring AI infrastructure costs is genuinely complex. Do you measure GPUs? Tokens? Queries? The answer is all of the above, and there are also “hidden” factors most organizations aren’t acknowledging.
Token costs and GPU expenses grab the most attention when an organization tallies its AI spend, but human engineering time, data center operations, and the opportunity cost of delayed projects are just as significant, and they rarely show up in ROI calculations.
Panelist Elisa Chen, a data scientist on Meta’s AI infrastructure team, added: “There’s also a lot of fixed costs in power and cooling, like running data centers. Those are really expensive as well.”
Rebecca “Bink” Naughton, who leads data center capacity strategy at Lambda, put AI cost through an additional lens: “Cost is opportunity cost… When you have a clear value proposition, it becomes much easier to justify what you need and where.” This reframes the conversation: the question isn’t just what capacity costs, but what it costs to not have capacity when you need it.
Due to these factors and the evolving industry landscape, measuring ROI for AI infrastructure remains elusive for many organizations. Carmen Li, founder and CEO of Silicon Data and CEO of Compute Exchange, hit the point directly: “Margin is hard to measure to begin with. The ROI is even more challenging… The whole chain is very murky at this point.”
What Metrics Should You Use to Measure AI Infrastructure Performance?
One of AI’s billion-dollar questions revolves around the metrics that will best measure AI performance. The panel’s answer? Triangulate across multiple metrics while acknowledging each one has its limitations.
Carmen cautioned: “It doesn’t matter what metrics you pick… Any metric has blind spots that can lead you down the wrong path.” Her recommendation? Be a hoarder of measurements: measure everything, then triangulate and see what thesis emerges.
Elisa also noted that cost units vary by workload: tokens are the most common unit, but network, I/O, and storage costs, often measured in petabytes, have to be counted too.
Tips for Building and Measuring Efficient AI Infrastructure
Four leading companies, one consensus: measuring performance and calculating ROI are complex endeavors, and efficient AI requires thoughtful planning. Key recommendations from the panel:
1. Audit and Segment Your Workloads
Identify user behavior and recomputation patterns, then map each model workload to the right hardware tier (a minimal sketch follows this list):
- Foundation model training: dense GPU clusters with long-term reserved capacity
- Fine-tuning: mid-tier hardware (e.g., A100s) with more flexible capacity
- Inference: less powerful hardware with maximum caching benefits, sized for highly elastic demand
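As a rough illustration (not from the panel), here is a minimal Python sketch of this model-to-hardware mapping; the workload names, tier descriptions, and `plan_capacity` helper are all hypothetical:

```python
from dataclasses import dataclass

@dataclass
class HardwareTier:
    name: str          # e.g., "dense GPU cluster"
    commitment: str    # "reserved" vs. "flexible" vs. "on-demand"
    caching: bool      # whether caching meaningfully cuts cost here

# Hypothetical mapping from audited workload type to hardware strategy,
# following the segmentation above.
WORKLOAD_TIERS = {
    "foundation_training": HardwareTier("dense GPU cluster", "reserved", caching=False),
    "fine_tuning": HardwareTier("mid-tier GPUs (e.g., A100)", "flexible", caching=False),
    "inference": HardwareTier("lighter GPUs", "on-demand", caching=True),
}

def plan_capacity(workload: str) -> HardwareTier:
    """Return the hardware strategy for an audited workload type."""
    if workload not in WORKLOAD_TIERS:
        raise ValueError(f"unaudited workload type: {workload!r}")
    return WORKLOAD_TIERS[workload]

print(plan_capacity("inference"))
```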
2. Build Flexibility Into Everything
In a world where a new chip generation arrives every six months and triggers cascading changes, flexibility isn’t optional; it’s the key to survival.
Deploy a multi-tier strategy: reserved contracts for critical workloads, on-demand capacity for variable demand, multi-cloud options, and verification protocols before committing to new capacity (a configuration sketch follows).
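To make those four elements concrete, here is a minimal configuration sketch; the field names and routing rule are invented for illustration and do not correspond to any real provider API:

```python
from dataclasses import dataclass, field

@dataclass
class CapacityPlan:
    """Hypothetical multi-tier capacity plan; all field names are illustrative."""
    reserved: dict = field(default_factory=dict)  # critical workload -> committed contract
    on_demand_burst_pct: float = 0.2              # headroom for variable demand
    clouds: list = field(default_factory=list)    # multi-cloud options
    verify_before_commit: bool = True             # benchmark hardware before signing

plan = CapacityPlan(
    reserved={"foundation_training": "12-month GPU contract"},
    on_demand_burst_pct=0.3,
    clouds=["cloud_a", "cloud_b"],  # placeholders, not endorsements
)

def route(workload: str, plan: CapacityPlan) -> str:
    """Send committed workloads to reserved capacity, everything else on demand."""
    return "reserved" if workload in plan.reserved else "on-demand"

print(route("foundation_training", plan))  # -> reserved
print(route("ad_hoc_eval", plan))          # -> on-demand
```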
3. Measure Comprehensively, Then Optimize
Cast a wide net: track tokens, GPU utilization rates (actual vs. peak), engineering time, power and cooling, and the opportunity costs of delayed projects (a toy roll-up follows below).
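As a toy example of triangulation, the roll-up below prices the same (entirely invented) monthly spend in several different currencies; no single number tells the whole story, which is the point:

```python
# Toy monthly cost roll-up. Every figure and category name is invented;
# the point is pricing the same spend in several "currencies" at once.
costs_usd = {
    "gpu_hours": 120_000,        # reserved + on-demand compute
    "power_cooling": 18_000,     # data center fixed costs
    "storage_network": 9_000,    # I/O, egress, capacity
    "engineering_time": 45_000,  # loaded cost of the humans running the stack
}
total = sum(costs_usd.values())

tokens_generated = 4_800_000_000
queries_served = 12_000_000
gpu_utilization = 0.61  # actual vs. peak

print(f"total spend: ${total:,}")
print(f"$ per 1M tokens:  {total / (tokens_generated / 1e6):.2f}")
print(f"$ per 1K queries: {total / (queries_served / 1e3):.2f}")
# Utilization's blind spot, made visible: what the idle fraction costs.
idle_gpu_cost = costs_usd["gpu_hours"] * (1 - gpu_utilization)
print(f"idle GPU spend:   ${idle_gpu_cost:,.0f}")
```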
But costs aren’t the sole consideration—output is also important. Will your customers pay for it? Prove value first, then optimize for efficiency through caching, workload segmentation, and smart hardware allocation.
What Lessons from Tech History Apply to AI Infrastructure?
Bink offered a useful frame: “History doesn’t repeat itself; it rhymes.”
Caching will reshape AI infrastructure the same way it reshaped the early internet. But AI also presents genuinely new challenges — non-deterministic outputs and what Elisa described as a “counterbalancing act,” where memory efficiency gains are happening simultaneously with long-context models demanding more memory resources than ever.
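As a toy illustration of that caching parallel, here is an exact-match response cache in Python; real inference stacks go further (semantic caches, KV-cache reuse for shared prefixes), and the model call here is a stand-in:

```python
from functools import lru_cache

def _call_model(prompt: str) -> str:
    """Stand-in for a paid inference call; in reality this would hit a model API."""
    print(f"(paying for inference on: {prompt!r})")
    return f"response to {prompt}"

# Exact-match caching: a repeated prompt is served for free, the same
# economics that CDN caching brought to the early internet.
@lru_cache(maxsize=10_000)
def cached_call(prompt: str) -> str:
    return _call_model(prompt)

cached_call("What is our refund policy?")  # cache miss: costs tokens
cached_call("What is our refund policy?")  # cache hit: free
print(cached_call.cache_info())            # hits=1, misses=1
```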
The organizations that win won’t be the ones applying old playbooks wholesale. They’ll be the ones who recognize familiar patterns while staying alert to genuine novelty.
The Bottom Line
We’re not in the middle of the AI era. We’re not even at the beginning. As Lauren put it: “We are in the first minute of the first day of this… We are building the future of AI together.”
That framing should change how you think about infrastructure decisions. The choices you make now — how you measure, how you allocate, how you build in flexibility — aren’t just operational details. They’re the foundation everything else gets built on. The organizations treating infrastructure as a strategic asset today are the ones that will be positioned to move fastest when the next wave arrives.
That future is being built right now. The question is whether you’re building it on solid ground.
For additional insights from the panelists, be sure to watch the full-length video.