WEKA. March 24, 2026
TLDR:
- AI inference is overtaking training as the dominant GPU workload — and it demands a completely different infrastructure approach
- KV cache management is the key lever for controlling inference costs and latency at scale
- Specialized “subscaler” cloud providers are outcompeting hyperscalers for GPU workloads
- Self-healing infrastructure — not GPU specs — is becoming the real competitive differentiator
Val: Well, thanks for joining us, Kyle.

Kyle: Sure.

Val: So, first question right off the bat, for fun: do you like the term “neocloud,” or do you not? What do you prefer?

Kyle: I’m actually one that doesn’t like the term neocloud. It’s too Pokémon for me. The one I got recently, on a talk with someone, was “subscaler.” I hadn’t heard that one before, and I think it has a little more character to it. Neocloud always made me feel a little goofy when I was saying it.

Val: You’re thinking of a Neopet.

Kyle: Yeah, Neopet. Kind of Tamagotchi-ish, really, if you will.

Val: Exactly. I’m glad I asked, because I always think about that as well. So, for a subscaler, before we get to a day in the life: what do you typically do, what do your typical customers look like, and what are they asking you for?

Kyle: It’s a wide range, honestly. Crusoe is focusing on digitally native businesses, AI-native companies, and enterprise companies alike. We’re making a big bet on AI-native and enterprise. More and more companies are looking to diversify their cloud away from a hyperscaler into other subscalers: better access to hardware, more modern features, and something that is a little bit more approachable and easier to manage. That’s where we excel. For us, it’s companies looking for hands-off management of their infrastructure and stuff that just works.

Val: Having experience in cloud with CPU-based workloads, it’s so different in the GPU world and with AI-native companies. What are some of the biggest differences that might not be obvious to other people? Because a cloud is a cloud, really. Or not, in the AI world?

Kyle: Not really, in the AI world. It’s a very different landscape when you’re focusing on GPU workloads versus broader application infrastructure. Recently Crusoe purchased a company called Atero, which is focused on optimizing for memory-bound workloads. That’s the area we see the market heading towards in 2026 and beyond. Training has had its heyday, and people are really focusing on inference; we expect that to really expand. So: intelligent KV cache management, sharing GPUs across the entirety of the cluster, the things that really matter for serving inference at very high scale.

Val: Love that. You’re one of the first to acknowledge that we’re past the inertia of mostly training. For me, 2025 has really been the beginning of the big shift to inference. Agents have been a catalyst for that, and perhaps reinforcement learning now too. Any other catalysts, or are those the main ones you’re seeing? Because it’s a big shift.

Kyle: I think these low-code, vibe-code platforms are making it a lot easier for people to build applications, and because of that, inference demand is going off the charts. There are diminishing returns from tuning the frontier models; you can get some pretty good stuff off the shelf with open-source models today. Whether you’re tuning for a specific use case or not, that’s no longer the barrier to high-quality application development. It’s more about how you optimize agentic flows to be really low on time to first token and highly optimized for cost: right model, right solution. You see this a lot with GPT-5, whose hallmark as a release wasn’t so much the quality of the model as the routing behind it.

Val: Exactly.

Kyle: How quickly can you serve a certain answer, versus how much longer do you need for a more complicated one?

Val: That’s fascinating. You’re sitting in Wei Zhou’s seat; I was talking to him from SemiAnalysis just about an hour ago, and one of the topics we discussed is that we’re seeing an amazing richness, almost an embarrassment of riches, in open-model quality right now. But we tend to see less prompt caching from the open-model providers compared to the commercial providers. So the price-per-token comparison matters, and the volume of input tokens is really material compared to output tokens for coding agents. What are you seeing now? More prompt caching coming for the open models, better and more aggressive KV cache management, or something else?

Kyle: I would expect so. Like I said before, the next scale of inference is not CPU-bound, it’s memory-bound.

Val: Exactly.

Kyle: So it is unavoidable that the focus from the clouds and from model providers and from the frontier companies is all on optimizing for that solution. For Crusoe in particular, the focus was making sure we get ahead of that curve by having the Atero folks join over the past couple of months. They’re an immensely talented group of people with a really sophisticated solution that’s a high-powered injection into our inference, and we’re very excited to get it integrated. Across the board, I expect the investment to be in optimization for memory-bound hardware.

Val: It’s funny, we were just talking to LinkedIn and they were predicting the same thing. They’re obviously very mature as an AI/ML user; they’ve developed a whole suite of fit-for-purpose models. So I think it’s cool that Crusoe is probably the only subcloud I can think of that has actually put its money where its mouth is on KV cache management, prompt caching, and memory optimization. Is that a strategic direction for the company? Is the goal to be leading edge on the material infrastructure for this market, or are you also looking to move up the stack into developer experience?

Kyle: It’s moving quickly and with intention. The hypothesis is that the inference battle is won at the user experience layer and the performance layer; those two go hand in hand. You see a lot of investment in the kinds of software that allow simple, turnkey management of infrastructure. A lot of teams that go multi-cloud don’t have the time or the investment to do hands-on management of their clusters in AWS, in GCP, and in Crusoe, so turnkey management is important. Very simple, easy-to-understand observability layers are important. And then, in addition to that, performance. There are a lot of inference players out there doing some really interesting things, so to stay ahead of the curve, making big bets like this, intentional and well-researched bets, is important for us.

Val: That’s really cool. So user experience was something that was really important. If we decompose that, and I don’t want to lead the question too much: what’s really important? I have my own opinions about what matters in the user experience here.

Kyle: For us, one of the areas we’re spending a lot of time on is self-healing infrastructure. GPUs fail. They should be expected to fail. Hardware fails in general. Having to deal with XID errors coming off and remediating that yourself, shutting off VMs, migrating workloads, dealing with capacity management: it’s all very sophisticated and time-consuming. So Crusoe is investing in infrastructure that does that for you, so you don’t have to worry about your cluster, and we’ll do both passive and active health checking. The ability to say, hey, there’s a problem with this node, take it out of rotation, pull something from your capacity pool and bring it back in, and then start to do deeper integrations with your workloads and your jobs. There are a lot of things there that add up to an “it just works” mentality, and that’s where I see the user experience going. It enables AI-native businesses and enterprise companies to go multi-cloud very easily and just access hardware. And as you get farther and farther away from the actual GPUs themselves, people become less concerned with which GPU is servicing their request; they’re just performance-based, and that’s where we can deliver.

Val: Makes a ton of sense. All right, we’re going to wrap it up here. What do you want to wrap up on?

Kyle: For me, it’s always exciting being at these events and seeing all the stuff people are building. It’s crazy to me how fast the application layer is moving.

Val: When you think about it, just the past 10 months alone is night and day.

Kyle: Exactly. We’re heads down trying to build the most reliable and fault-tolerant infrastructure we can, and seeing the diversity of what’s being built at the application layer is always exciting. For me, it’s getting exposure to these companies out here, seeing all the cool stuff people are doing, and trying to keep up.

Val: And I think one of the leads you’re burying, which I appreciate, is the latency you’re able to offer with these memory and KV cache optimizations.

Kyle: Totally. That’s a big part of the user experience.

Val: We take resiliency for granted, but we really notice the latency.

Kyle: Yeah, you don’t notice it until it starts breaking, right?

Val: Exactly.

Kyle: We want you to not think about it.

Val: Otherwise it’s a fire drill.

Kyle: Yeah, exactly. Loved the chat. Hope we have another one.

Val: All right. Thank you so much.

Kyle: Pleasure.
The AI infrastructure playbook is being rewritten. What got companies through the training era — raw GPU availability, hyperscaler defaults, brute-force compute — isn’t what will carry them through the inference era.
In a recent conversation, Kyle Sosnowski, Vice President of Software Engineering at Crusoe, and WEKA’s Val Bercovici broke down exactly why that gap matters, and what organizations need to do about it. Their core message: inference isn’t just a new workload. It’s a fundamentally different problem that requires rethinking cloud strategy, cost optimization, and infrastructure reliability from the ground up.
Why GPU Workloads Require Different Cloud Infrastructure Than CPU Applications
When asked whether a cloud is simply a cloud in the AI world, Kyle’s response was definitive: “It’s a very different landscape when you’re focusing on GPU workloads versus broader application infrastructure.”
This distinction matters more than most organizations realize. Traditional CPU-focused cloud platforms were built for stateless applications and batch processing. AI inference demands something different altogether: memory-bound optimization, intelligent caching, and millisecond-level latency at scale.
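To make “memory-bound” concrete, consider how fast the KV cache, the per-token attention state a model must keep in GPU memory, grows during inference. Here is a minimal sketch of the arithmetic, using a publicly documented Llama-3-70B-style configuration (80 layers, grouped-query attention with 8 KV heads of dimension 128, FP16); the numbers are illustrative and not specific to Crusoe or WEKA:

```python
def kv_cache_bytes(num_layers: int, num_kv_heads: int, head_dim: int,
                   seq_len: int, batch_size: int, bytes_per_value: int = 2) -> int:
    """Size of the attention KV cache: one K and one V tensor per layer, per token."""
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_value
    return per_token * seq_len * batch_size

# Llama-3-70B-style config: 80 layers, GQA with 8 KV heads of dim 128, FP16.
per_seq = kv_cache_bytes(num_layers=80, num_kv_heads=8, head_dim=128,
                         seq_len=8192, batch_size=1)
print(f"{per_seq / 2**30:.2f} GiB per 8k-token sequence")  # 2.50 GiB

# Serving 32 concurrent 8k-token sequences needs ~80 GiB of HBM for cache
# alone, on top of the model weights. The bottleneck is memory, not FLOPs.
```

At that scale, how the cache is placed, shared, and evicted across the cluster drives both throughput and cost, which is exactly the problem the Atero acquisition targets.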
Crusoe recently put resources behind this thesis by acquiring Atero, a company focused specifically on memory-bound workload optimization. “That’s the area we see the market heading towards in 2026 and beyond,” Kyle explained. “Training has had its heyday and people are really focusing on inference.”
How “Subscalers” Deliver Better Performance for AI Inference Than Hyperscalers
The conversation revealed a critical shift in how enterprises are thinking about cloud strategy. Companies are no longer defaulting to hyperscalers for GPU workloads. Instead, they’re diversifying their cloud mix to include “subscalers,” specialized AI cloud providers that deliver what hyperscalers struggle with: modern features, better hardware access, and truly hands-off infrastructure management.
“More and more companies are looking to diversify their cloud away from a hyperscaler into other subscalers,” Kyle said. The reason? “More modern features … a little bit more approachable and easier to manage.”
This isn’t about abandoning hyperscalers entirely. It’s about recognizing that different workloads have different infrastructure requirements, and some of those requirements reward specialization.
Why KV Cache Management Is Critical for Reducing AI Inference Costs
One of the most technical, and most critical, threads of the conversation centered on KV cache (key-value cache) management.
“The next scale of inference is not CPU-bound, it’s memory-bound,” Kyle emphasized. “It is unavoidable that the focus from the clouds and from model providers and from the frontier companies is all on optimizing for that solution.”
For organizations deploying AI at scale, this translates to real dollar savings. Efficient KV cache management means the static portion of a prompt is computed once and reused rather than reprocessed on every request, lowering cost and time to first token simultaneously. It’s the difference between an AI application that’s economically viable and one that isn’t.
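As a rough illustration of those economics, here is a sketch of the arithmetic for a coding agent that re-sends a large, mostly static prompt prefix on every call. The rates below are hypothetical placeholders; the cached-input discount mirrors the roughly 10x discounts some commercial providers advertise, and actual pricing varies by provider:

```python
# Hypothetical per-million-token rates (placeholders, not any provider's pricing).
INPUT_RATE = 3.00          # $/M uncached input tokens
CACHED_INPUT_RATE = 0.30   # $/M cache-hit input tokens (assumed 10x discount)
OUTPUT_RATE = 15.00        # $/M output tokens

def request_cost(input_tokens: int, output_tokens: int,
                 cache_hit_tokens: int = 0) -> float:
    """Cost of one call, splitting input tokens into cached vs. recomputed."""
    uncached = input_tokens - cache_hit_tokens
    return (uncached * INPUT_RATE
            + cache_hit_tokens * CACHED_INPUT_RATE
            + output_tokens * OUTPUT_RATE) / 1_000_000

# A coding agent: 50k-token context, 1k-token answer, 1,000 calls per day.
calls = 1_000
no_cache = calls * request_cost(50_000, 1_000)
# If 45k of those 50k input tokens are a stable prefix served from KV cache:
with_cache = calls * request_cost(50_000, 1_000, cache_hit_tokens=45_000)
print(f"no cache:   ${no_cache:,.0f}/day")    # $165/day
print(f"with cache: ${with_cache:,.0f}/day")  # ~$44/day, roughly 74% savings
```

The exact rates matter less than the shape of the result: when input tokens dominate output tokens, as they do for agents, cache hit rate becomes the single biggest lever on the bill.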
How Self-Healing Infrastructure Enables Multi-Cloud AI Deployments at Scale
When the conversation turned to user experience, Kyle highlighted an often-overlooked reality: “GPUs fail. They should be expected to fail. Hardware fails in general, so having to deal with XID errors coming off and remediating that yourself, shutting off VMs, migrating workloads, dealing with capacity management, it’s all very sophisticated.”
The question isn’t whether hardware will fail, but whether your infrastructure can handle failures without human intervention. Crusoe’s investment in self-healing infrastructure represents a bet that “the inference battle is won at the user experience layer and the performance layer.”
Here’s what this looks like in practice:
- Active health checking that identifies problematic nodes before they impact workloads;
- Dynamic capacity management that pulls from reserved pools to replace failed hardware;
- Workload-aware migration that moves jobs seamlessly when issues arise.
“You don’t have to worry about your cluster, and we’ll do both passive and active health checking,” Kyle explained. This “it just works” mentality enables AI-native companies and enterprises to go multi-cloud “very easily and just access hardware.”
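Crusoe’s actual system isn’t public, but the control loop Kyle describes (detect a bad node, take it out of rotation, backfill from a capacity pool, migrate the jobs) can be sketched in a few lines. Everything below, including all names, is hypothetical and invented for illustration:

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    name: str
    healthy: bool = True
    workloads: list[str] = field(default_factory=list)

def check_health(node: Node) -> bool:
    """Stand-in for passive and active checks (e.g., watching for GPU XID errors)."""
    return node.healthy

def heal(fleet: list[Node], spares: list[Node]) -> None:
    """One pass: cordon failing nodes and backfill from the spare capacity pool."""
    for node in list(fleet):
        if check_health(node):
            continue
        fleet.remove(node)          # take the bad node out of rotation
        if not spares:
            print(f"ALERT: {node.name} failed and no spare capacity remains")
            continue
        replacement = spares.pop()
        replacement.workloads = node.workloads   # migrate jobs to the new node
        node.workloads = []
        fleet.append(replacement)
        print(f"replaced {node.name} with {replacement.name}; "
              f"{len(replacement.workloads)} workload(s) migrated")

if __name__ == "__main__":
    fleet = [Node("gpu-0", workloads=["train-a"]),
             Node("gpu-1", workloads=["infer-b"])]
    spares = [Node("gpu-spare-0")]
    fleet[1].healthy = False   # simulate an XID-style hardware fault
    heal(fleet, spares)        # in production this would run on a polling loop
```

The production version is far more involved (draining gracefully, checkpointing jobs, reasoning about correlated failures), but the point stands: the remediation loop, not the operator, absorbs the failure.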
Why Performance Metrics Matter More Than GPU Types for AI Infrastructure
“People become less concerned with which GPU is servicing their request; they’re just performance-based,” Kyle said.
This represents the next step in the evolution of the AI infrastructure market. Early adopters obsessed over GPU models and configurations. Today’s sophisticated users care about outcomes: latency, throughput, cost per token, and reliability.
As Val noted in closing, “We take resiliency for granted, but we really notice the latency.” For infrastructure providers, this means the competitive battleground has shifted from GPU availability to holistic performance optimization.
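If buyers are “just performance-based,” the metrics worth instrumenting are the user-visible ones: time to first token (TTFT) and decode throughput. A minimal, provider-agnostic sketch; the fake stream below is an invented stand-in for any real streaming endpoint:

```python
import time

def measure_stream(token_iter):
    """Wrap any streaming token iterator and report TTFT and decode throughput."""
    start = time.perf_counter()
    first = None
    count = 0
    for _ in token_iter:
        count += 1
        if first is None:
            first = time.perf_counter()   # first token arrived
    end = time.perf_counter()
    if first is None:                     # the stream produced no tokens
        return {"ttft_s": None, "decode_tok_per_s": 0.0, "tokens": 0}
    decode_tps = (count - 1) / (end - first) if end > first else 0.0
    return {"ttft_s": first - start, "decode_tok_per_s": decode_tps, "tokens": count}

# Stand-in for a real streaming client; swap in any generator of tokens.
def fake_stream(n=50, delay=0.01):
    for i in range(n):
        time.sleep(delay)
        yield f"tok{i}"

print(measure_stream(fake_stream()))
```

Tracked per request and aggregated into percentiles, these two numbers, plus cost per token, describe the experience a user actually feels, regardless of which GPU served the request.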
What This Means for Your AI Strategy
The insights from this conversation point to three actionable takeaways for organizations scaling AI workloads:
- Audit your inference economics. If you haven’t mapped your cost per token and identified optimization opportunities in KV cache management, you’re likely overpaying significantly.
- Evaluate subscalers for GPU workloads. The hyperscaler-first strategy made sense when training dominated. For inference, specialized providers may offer better economics and performance.
- Demand self-healing capabilities. Manual GPU cluster management doesn’t scale. Infrastructure that handles failures automatically is foundational for production AI systems.
The shift from training to inference isn’t simply a technical evolution; it’s an economic imperative.
The Bottom Line
That shift is already underway, and organizations that treat it as a simple operational change will find themselves overpaying and underperforming. The winners will be those who adapt now: auditing inference costs, evaluating specialized cloud providers, and demanding self-healing systems that operate at scale without constant human intervention.
The infrastructure battle for AI has entered a new phase. The question is whether your stack is ready for it.
If you want to learn how WEKA can help you break through inference barriers, click here to learn about our solutions and request an infrastructure cost analysis.