How to Rethink AI Infrastructure to Maximize ROI
TL;DR
  • Hidden costs — engineering time, power and cooling, opportunity cost — are just as significant as GPUs and tokens, and far less visible.
  • No single metric captures AI performance. The winning approach: measure everything, triangulate, and stay skeptical of any one number.
  • Flexibility isn’t a nice-to-have. With new chip releases every six months, multi-tier strategies and multi-cloud options are survival basics.

Picture a room full of people who build and buy AI infrastructure for a living. When WEKA CMO Lauren Vaccarello asked them point-blank — “Do you know how much you are spending on your AI infrastructure?” — nobody had a quick answer.

The silence said everything.

Lauren brought together leaders from Meta, Lambda, and Silicon Data to dig into the question the industry keeps dodging: what does AI infrastructure actually cost, and how do you prove it’s worth it?

Below are takeaways from the conversation.

What Are the Hidden Costs of AI Infrastructure and How Do You Calculate ROI?

Measuring AI infrastructure costs is genuinely complex. Do you measure GPUs? Tokens? Queries? The answer is all of the above, and there are also “hidden” factors most organizations aren’t acknowledging.

Token costs and GPU expenses grab the most attention when organizations tally their AI spending. But engineering time, data center operations, and the opportunity cost of future projects are also significant, and they are often left out of ROI measurements.

Panelist Elisa Chen, a data scientist on Meta’s AI infrastructure team, added: “There’s also a lot of fixed costs in power and cooling, like running data centers. Those are really expensive as well.”

Rebecca “Bink” Naughton, who leads data center capacity strategy at Lambda, viewed AI cost through an additional lens. “Cost is opportunity cost… When you have a clear value proposition, it becomes much easier to justify what you need and where.” This shifts the question from how much something costs to what it costs to lack capacity when you need it.

Due to these factors and the evolving industry landscape, measuring ROI for AI infrastructure remains elusive for many organizations. Carmen Li, founder and CEO of Silicon Data and CEO of Compute Exchange, hit this point directly: “Margin is hard to measure to begin with. The ROI is even more challenging… The whole chain is very murky at this point.”
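To make the accounting concrete, here is a minimal sketch of how the hidden categories named above change an ROI figure. The class name, field names, and all dollar amounts are hypothetical placeholders, not figures from the panel, and the opportunity-cost line is an estimate you would have to supply yourself.

```python
from dataclasses import dataclass


@dataclass
class MonthlyCosts:
    """Hypothetical monthly cost categories, all in the same currency."""
    gpu: float                # GPU rental or amortized hardware
    tokens: float             # metered token spend
    engineering_time: float   # loaded cost of engineers running the stack
    power_and_cooling: float  # data center operations
    opportunity_cost: float   # estimated value of projects delayed by missing capacity

    def visible(self) -> float:
        """Costs that usually make it into the budget conversation."""
        return self.gpu + self.tokens

    def hidden(self) -> float:
        """Costs that are often left out of ROI measurements."""
        return self.engineering_time + self.power_and_cooling + self.opportunity_cost

    def total(self) -> float:
        return self.visible() + self.hidden()


def roi(value_delivered: float, total_cost: float) -> float:
    """Simple ROI: net gain divided by cost."""
    if total_cost <= 0:
        raise ValueError("total_cost must be positive")
    return (value_delivered - total_cost) / total_cost


# Illustrative numbers only.
costs = MonthlyCosts(gpu=100_000, tokens=20_000, engineering_time=60_000,
                     power_and_cooling=30_000, opportunity_cost=40_000)
naive_roi = roi(300_000, costs.visible())  # counts only GPUs and tokens
full_roi = roi(300_000, costs.total())     # counts the hidden categories too
```

With these made-up inputs, the ROI computed on visible costs alone is far higher than the ROI computed on the full total, which is the gap the panelists were pointing at.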

What Metrics Should You Use to Measure AI Infrastructure Performance?

One of AI’s billion-dollar questions revolves around the metrics that will best measure AI performance. The panel’s answer? Triangulate across multiple metrics while acknowledging each one has its limitations.

Carmen cautioned: “It doesn’t matter what metrics you pick… Any metrics have blind spots that will lead you to a path that could not be the right path.” Her recommendation? Be a hoarder of measurements: Measure almost everything, and try to triangulate.

Elisa also noted that cost units vary significantly. Tokens are a common unit, but network, I/O, and storage costs also count, and storage might be measured in petabytes.
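One way to picture triangulation is to score the same candidate setups on several metrics and check whether the metrics agree on a winner. The sketch below assumes two hypothetical clusters and three hypothetical metrics; the names and values are invented for illustration.

```python
def winners_by_metric(configs, lower_is_better):
    """Return, for each metric, the name of the config that scores best on it."""
    winners = {}
    for metric, lower in lower_is_better.items():
        pick = min if lower else max
        winners[metric] = pick(configs, key=lambda name: configs[name][metric])
    return winners


# Hypothetical measurements for two setups.
configs = {
    "cluster_a": {"cost_per_million_tokens": 0.40,
                  "gpu_utilization": 0.55,
                  "storage_cost_per_pb": 21_000},
    "cluster_b": {"cost_per_million_tokens": 0.50,
                  "gpu_utilization": 0.80,
                  "storage_cost_per_pb": 18_000},
}

# Direction of "better" for each metric.
lower_is_better = {
    "cost_per_million_tokens": True,
    "gpu_utilization": False,
    "storage_cost_per_pb": True,
}

winners = winners_by_metric(configs, lower_is_better)
# True when no single config wins on every metric.
metrics_disagree = len(set(winners.values())) > 1
```

When the metrics disagree, as they do with these sample values, that is a signal to look closer rather than to trust whichever number was picked first.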

Tips for Building and Measuring Efficient AI Infrastructure

Four leading companies, one consensus: measuring performance and calculating ROI are complex endeavors, and efficient AI requires thoughtful planning. The panel’s key recommendations:

1. Audit and Segment Your Workloads

Identify user behavior and recomputation patterns and apply model-to-hardware mapping:

Foundation model training: Dense GPU clusters with long-term reserved capacity;

Fine-tuning: Mid-tier hardware (such as A100s) with more flexible capacity;

Inference: Less powerful hardware that maximizes caching benefits, sized for highly elastic demand.
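The mapping above can be written down as a small lookup table that a scheduler or planning script consults. This is a minimal sketch: the enum, the `PLACEMENT` table, the `place` helper, and the job names are all hypothetical, and the tier descriptions simply restate the three bullets.

```python
from enum import Enum


class Workload(Enum):
    FOUNDATION_TRAINING = "foundation_training"
    FINE_TUNING = "fine_tuning"
    INFERENCE = "inference"


# Workload type -> hardware tier and capacity style, restating the list above.
PLACEMENT = {
    Workload.FOUNDATION_TRAINING: {"hardware": "dense GPU cluster",
                                   "capacity": "long-term reserved"},
    Workload.FINE_TUNING: {"hardware": "mid-tier GPUs (such as A100s)",
                           "capacity": "flexible"},
    Workload.INFERENCE: {"hardware": "less powerful hardware with caching",
                         "capacity": "elastic"},
}


def place(jobs):
    """Group job names by the capacity style their workload type maps to."""
    plan = {}
    for name, workload in jobs:
        plan.setdefault(PLACEMENT[workload]["capacity"], []).append(name)
    return plan


# Hypothetical job inventory produced by a workload audit.
jobs = [
    ("pretrain-run", Workload.FOUNDATION_TRAINING),
    ("support-bot-tune", Workload.FINE_TUNING),
    ("chat-api", Workload.INFERENCE),
    ("search-api", Workload.INFERENCE),
]
plan = place(jobs)
```

Keeping the mapping in one table makes the audit result explicit and easy to revise when hardware generations change.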

2. Build Flexibility Into Everything

In a world where new chip releases every six months trigger cascading changes, flexibility isn’t optional; it’s the key to survival.

Deploy a multi-tier strategy that uses reserved contracts for critical workloads, maintains on-demand capacity for variable demand, keeps multi-cloud options open, and establishes verification protocols before committing.
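A rough way to reason about the reserved versus on-demand split is to price a demand curve under different reservation levels. The function below is a simplified sketch: the rates, the demand series, and the assumption that reserved GPUs are billed every hour whether used or not are all hypothetical.

```python
def blended_cost(hourly_demand, reserved_gpus, reserved_rate, on_demand_rate):
    """Cost of serving demand with a reserved baseline plus on-demand overflow.

    hourly_demand: GPUs needed in each hour.
    Reserved GPUs are assumed to be paid for every hour, used or not.
    """
    reserved = reserved_gpus * reserved_rate * len(hourly_demand)
    overflow_gpu_hours = sum(max(0, d - reserved_gpus) for d in hourly_demand)
    return reserved + overflow_gpu_hours * on_demand_rate


# Hypothetical demand (GPUs per hour) and per-GPU-hour rates.
demand = [8, 8, 16, 4]
mixed = blended_cost(demand, reserved_gpus=8, reserved_rate=2.0, on_demand_rate=4.0)
all_on_demand = blended_cost(demand, reserved_gpus=0, reserved_rate=2.0, on_demand_rate=4.0)
all_reserved = blended_cost(demand, reserved_gpus=16, reserved_rate=2.0, on_demand_rate=4.0)
```

With these sample inputs the mixed tier is cheaper than either extreme, but the result depends entirely on the demand shape and rates you plug in.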

3. Measure Comprehensively, Then Optimize

Cast a wide net and track the full range of costs: tokens, GPU utilization rates (actual vs. peak), engineering time, power and cooling, and the opportunity cost of delayed projects.
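For the utilization item in particular, the gap between actual and peak usage translates directly into a higher effective price per useful GPU-hour. The sketch below assumes utilization samples expressed as fractions between 0 and 1 and a hypothetical hourly rate; both helper names are made up for illustration.

```python
def utilization_summary(samples):
    """Return (average, peak) GPU utilization from samples in the range 0..1."""
    if not samples:
        raise ValueError("samples must not be empty")
    return sum(samples) / len(samples), max(samples)


def cost_per_busy_gpu_hour(hourly_rate, average_utilization):
    """Effective price of one fully used GPU-hour at a given average utilization."""
    if not 0 < average_utilization <= 1:
        raise ValueError("average_utilization must be in (0, 1]")
    return hourly_rate / average_utilization


# Hypothetical utilization samples and a hypothetical $2.00 per GPU-hour rate.
samples = [0.2, 0.5, 0.9, 0.4]
average, peak = utilization_summary(samples)
effective_rate = cost_per_busy_gpu_hour(2.0, average)
```

Sizing for the peak while running near the average means paying roughly double the list rate for each hour of useful work in this example.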

But costs aren’t the sole consideration—output is also important. Will your customers pay for it? Prove value first, then optimize for efficiency through caching, workload segmentation, and smart hardware allocation.

What Lessons from Tech History Apply to AI Infrastructure?

Bink offered a useful frame: “History doesn’t repeat itself; it rhymes.”

Caching will reshape AI infrastructure the same way it reshaped the early internet. But AI also presents genuinely new challenges — non-deterministic outputs and what Elisa described as a “counterbalancing act,” where memory efficiency gains are happening simultaneously with long-context models demanding more memory resources than ever.

The organizations that win won’t be the ones applying old playbooks wholesale. They’ll be the ones who recognize familiar patterns while staying alert to genuine novelty.

The Bottom Line

We’re not in the middle of the AI era. We’re barely at the beginning. As Lauren put it: “We are in the first minute of the first day of this… We are building the future of AI together.”

That framing should change how you think about infrastructure decisions. The choices you make now — how you measure, how you allocate, how you build in flexibility — aren’t just operational details. They’re the foundation everything else gets built on. The organizations treating infrastructure as a strategic asset today are the ones that will be positioned to move fastest when the next wave arrives.

That future is being built right now. The question is whether you’re building it on solid ground.

For additional insights from the panelists, be sure to watch the full-length video.