VIDEO

Why Inference Will Drive AI Infrastructure in 2026 with Crusoe

Kyle Sosnowski from Crusoe sits down with WEKA’s Val Bercovici to discuss the critical shift from training to inference workloads, memory-bound optimization strategies, and how self-healing infrastructure will reshape AI cloud services in 2026.

Speakers:

  • Kyle Sosnowski - VP, Software Engineering - Crusoe
  • Val Bercovici - Chief AI Officer, WEKA

Below is a transcript of the conversation, which has been lightly edited for clarity.

Transcript

00:00

What “Subscalers” — a.k.a. Neoclouds — Offer That Hyperscalers Can't

Val Bercovici: First question right off the bat, for fun. Do you like the term neocloud or do you not like it? What do you prefer?

Kyle Sosnowski: So I’m actually one that doesn’t like the term neocloud. I think it’s too Pokemon for me. Another one I got recently on a talk with someone was subscaler. I hadn’t heard that one before. I think it has a little bit more character to it, but neocloud always made me feel a little bit goofy when I was saying it.

Val: You’re a neopet, is what you’re thinking of.

Kyle: Yeah, neopet, kind of Tamagotchi-ish, really, if you will.

Val: Exactly. No, I’m glad I asked, because I always think about that as well. So for a subscaler, before we get to a day in the life: what do you typically do, what do your typical customers look like, and what are they asking you for?

Kyle: It’s a wide range, honestly. Crusoe is focused on digitally native businesses, AI-native companies, and enterprise companies alike. We’re making a big bet on AI-native and enterprise. More and more companies are looking to diversify their cloud away from a hyperscaler toward other subscalers: better hardware access, more modern features, and platforms that are a little more approachable and easier to manage. That’s where we excel. So for us, it’s companies that are looking for hands-off management of their infrastructure and stuff that just works.

01:24

Why 2025 (and 2026) Will See a Decisive Shift from AI Training to Inference

Val: You have experience in cloud, in CPU-based workloads, and it’s so different in the GPU world and in AI-native. What are some of the biggest differences that are obvious to you but might not be obvious to other people? Is a cloud just a cloud? Or not really, in an AI world?

Kyle: I think it’s not, really, in an AI world. It’s a very different landscape when you’re focusing on GPU workloads vs. broader application infrastructure. Recently Crusoe acquired a company called Atero, which is focused on optimizing those memory-bound workloads, and that’s the area we see the market heading toward in 2026 and beyond. Training had its heyday; people are focusing on inference now, and we expect that to really expand. Things like intelligent KV cache management and sharing GPUs across the entire cluster are what really matter for serving inference at very high scale.
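
To make that concrete: the kind of cross-cluster KV cache sharing Kyle describes can be sketched as a prefix-block index, in the spirit of paged-attention serving engines. Everything below is a hypothetical illustration (names like KVCacheIndex and BLOCK_SIZE are invented), not Crusoe’s or Atero’s actual implementation.

```python
# Hypothetical sketch of prefix-based KV cache sharing. Requests whose prompts
# share a prefix can reuse already-computed attention state instead of
# re-running prefill over the shared tokens.
import hashlib

BLOCK_SIZE = 16  # tokens per KV block; real engines often use 16 or 32

def block_hashes(token_ids: list[int]) -> list[str]:
    """Chained hash per full block, so each hash identifies the whole prefix."""
    hashes, running = [], hashlib.sha256()
    full = len(token_ids) - len(token_ids) % BLOCK_SIZE
    for i in range(0, full, BLOCK_SIZE):
        running.update(str(token_ids[i:i + BLOCK_SIZE]).encode())
        hashes.append(running.hexdigest())
    return hashes

class KVCacheIndex:
    """Maps prefix-block hashes to resident KV blocks (ids stand in for GPU memory)."""
    def __init__(self):
        self.blocks: dict[str, int] = {}

    def longest_cached_prefix(self, token_ids: list[int]) -> int:
        """How many tokens of this prompt already have cached KV state."""
        cached = 0
        for h in block_hashes(token_ids):
            if h not in self.blocks:
                break
            cached += BLOCK_SIZE
        return cached

    def insert(self, token_ids: list[int], block_ids: list[int]) -> None:
        for h, b in zip(block_hashes(token_ids), block_ids):
            self.blocks.setdefault(h, b)
```

Two agent requests that share a long system prompt would then only pay prefill for the tokens past longest_cached_prefix; serving that index cluster-wide rather than per GPU is the “sharing GPUs across the entire cluster” part.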

Val: Love that. You’re one of the first to acknowledge that we’re past this inertia right now of mostly training. For me, 2025 has really been the year of the shift to inference, the beginning of the big shift to inference. Agents have been a catalyst for that, perhaps even reinforcement learning now. Any other catalysts, or are those the main catalysts you’re seeing for this shift? Because it’s a big shift.

Kyle: Yeah, I think these low-code, vibe-coding platforms are making it a lot easier for people to build applications on top of them. Because of that, inference demand is going off the charts. There are diminishing returns from tuning the frontier models; you can get some pretty good stuff off the shelf with open-source models today.

So whether or not you’re tuning for a specific use case, that’s not the barrier to high-quality application development so much as questions like: how can we optimize agentic flows to be really low on time to first token, highly optimized for cost, the right model for the right solution? And you see this a lot, especially with GPT-5, whose hallmark at release wasn’t so much the quality of the model as the routing behind it. How quickly can you serve a simple answer vs. how much longer do you need for a more complicated one?
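
The routing Kyle alludes to can be pictured as a dispatch layer in front of several models. The sketch below is purely illustrative: the model names and the token-count heuristic are invented, and production routers (including whatever sits behind GPT-5) typically use a learned classifier instead.

```python
# Hypothetical model-routing sketch: send easy requests to a fast model for
# low time to first token, hard ones to a slower, more capable model.
from dataclasses import dataclass

@dataclass
class Route:
    model: str
    reason: str

def route(prompt: str, needs_tools: bool) -> Route:
    # Cheap structural signals stand in for a learned difficulty estimate.
    approx_tokens = len(prompt) // 4
    if needs_tools or approx_tokens > 2_000:
        return Route("large-reasoning-model", "complex or tool-using request")
    return Route("small-fast-model", "simple request, optimize latency and cost")

print(route("Summarize this changelog.", needs_tools=False))
```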

03:33

How Memory-Bound Optimization and KV Cache Management Win the Inference Battle

Val: That’s fascinating. I was talking to Wei Zhou from SemiAnalysis just about an hour ago, and one of the cool topics we discussed is that we’re seeing an amazing richness, almost an embarrassment of riches, in open model quality right now. But we tend to see less prompt caching from the open model providers compared to the commercial providers. So that price-per-token comparison matters, and the volume of input tokens is really material compared to output tokens for coding agents. What are you seeing now? Are you seeing more prompt caching coming for the open models, better, more aggressive KV cache management, or something else?
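
Val’s point about input-token volume is easy to see with back-of-envelope numbers. Everything below is invented for illustration; the rates are not any provider’s actual price card.

```python
# A coding agent replays a large, mostly repeated context every turn, so
# input tokens dwarf output tokens. All numbers here are hypothetical.
input_tokens_per_turn = 50_000   # repo context + history, mostly repeated
output_tokens_per_turn = 1_000
turns = 20

full_rate = 3.00 / 1_000_000     # $ per uncached input token
cached_rate = 0.30 / 1_000_000   # $ per input token served from prompt cache
output_rate = 15.00 / 1_000_000  # $ per output token

uncached = turns * (input_tokens_per_turn * full_rate
                    + output_tokens_per_turn * output_rate)
# Assume 90% of each turn's input hits the prompt cache.
cached = turns * (0.1 * input_tokens_per_turn * full_rate
                  + 0.9 * input_tokens_per_turn * cached_rate
                  + output_tokens_per_turn * output_rate)
print(f"uncached: ${uncached:.2f}   with prompt cache: ${cached:.2f}")
# -> uncached: $3.30   with prompt cache: $0.87; input, not output, drives cost.
```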

Kyle: I would expect so. As I said before, the next scale of inference is not compute-bound, it’s memory-bound. So it’s unavoidable that the focus from the clouds, from the model providers, and from the frontier companies is all on optimizing for that. At Crusoe in particular, our focus was on getting ahead of that curve, and we had the Atero folks join over the past couple of months. It’s an immensely talented group of people with a really sophisticated solution that is a high-powered injection into our inference stack, and we’re very excited to get it integrated. Across the board, I expect the investment to be in optimizing for memory-bound hardware.
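
A rough calculation shows why decode is bounded by memory rather than arithmetic: every generated token has to stream the model weights (plus KV cache) through memory. The numbers below are round illustrative figures, not any specific GPU’s spec sheet.

```python
# Back-of-envelope decode ceiling for a single request (batch size 1).
weights_gb = 140            # e.g., a ~70B-parameter model held in fp16
hbm_bandwidth_gb_s = 3_000  # roughly the class of a modern HBM GPU

tokens_per_s = hbm_bandwidth_gb_s / weights_gb
print(f"~{tokens_per_s:.0f} tokens/s ceiling, no matter how many FLOPs you have")
# Batching and KV cache reuse amortize the weight traffic across requests,
# which is why memory-side optimization dominates inference economics.
```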

04:52

Self-Healing GPU Infrastructure and the Future of Cloud User Experience

Val: It’s funny, we were just talking to LinkedIn and they were predicting the same thing going forward. And they’re very mature, obviously, as an AI/ML user; they’ve developed a whole suite of fit-for-purpose models. So I think it’s cool that Crusoe is probably the only subscaler I can think of that’s actually put its money where its mouth is with regard to KV cache management and prompt-caching memory optimization. Is that a strategic direction for the company? Does it mean staying leading-edge on the underlying infrastructure for this market, or are you looking to move up the stack into a different developer experience?

Kyle: I think it’s moving quickly and with intention. The hypothesis is that the inference battle is won at the user experience layer and the performance layer; those two, hand in hand, really matter. You see a lot of investment in the kinds of software that allow simple, turnkey management of the infrastructure. A lot of the teams that go multicloud don’t have the time or the investment for hands-on management of their clusters in AWS, in GCP, and in Crusoe.

So turnkey management is important. Very simple, easy-to-understand observability layers: that’s very important. And then there’s performance. There are a lot of inference players out there doing some really interesting things, so to stay ahead of the curve, making big bets, intentional and well-researched bets, is important for us.

Val: That’s really cool. So yeah, user experience is really important. If we decompose that, and I don’t want to lead the question too much, what’s really important? I’ve got my opinions on what matters most in the user experience here.

Kyle: For us, one of the areas where we’re spending a lot of time is self-healing infrastructure. GPUs fail; they should be expected to fail. Hardware fails in general. So dealing with XID errors coming off the hardware and remediating them yourself, shutting off VMs, migrating workloads, handling capacity management: it’s all very sophisticated and very time-consuming. Crusoe is investing in infrastructure that does that for you so you don’t have to worry about your cluster; we’ll do both passive and active health checking. It’s the ability to say there’s a problem with this node, let’s take it out of rotation, let’s pull something from your capacity pool and bring it back in, and then start to do deeper integrations with your workloads and your jobs.

There are a lot of things there that create an “it just works” mentality, and that’s where I see the user experience going. It enables AI-native businesses and enterprise companies to go multicloud very easily and just access hardware, so you get farther and farther away from the GPUs themselves. People become less concerned with which GPU is servicing their request; it’s all performance-based, and that’s where we can deliver.
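
Kyle’s passive/active health-checking flow maps onto a small control loop. The sketch below is hypothetical through and through: the class names, the spare pool, and the treatment of XID codes are invented to show the pattern, not Crusoe’s actual system.

```python
# Hypothetical self-healing sketch: detect a bad node, cordon it, backfill
# from spare capacity, and queue the bad node for deeper diagnostics.
from dataclasses import dataclass, field

FATAL_XIDS = {48, 79, 94}  # example XID classes often treated as fatal

@dataclass
class Node:
    name: str
    xid_errors: set[int] = field(default_factory=set)  # passive: GPU error logs
    probe_ok: bool = True                              # active: diagnostic job result

@dataclass
class Cluster:
    nodes: list[Node]
    spares: list[Node]
    repair_queue: list[Node] = field(default_factory=list)

    def reconcile(self) -> None:
        """One pass of the loop: unhealthy nodes out of rotation, spares in."""
        for node in list(self.nodes):
            if node.xid_errors & FATAL_XIDS or not node.probe_ok:
                self.nodes.remove(node)                   # cordon + drain, simplified
                self.repair_queue.append(node)            # diagnose offline
                if self.spares:
                    self.nodes.append(self.spares.pop())  # backfill capacity

cluster = Cluster(nodes=[Node("gpu-0", xid_errors={79})], spares=[Node("gpu-spare")])
cluster.reconcile()
print([n.name for n in cluster.nodes], [n.name for n in cluster.repair_queue])
```
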
Val: Makes a ton of sense.

07:36

How Fast Is the AI Application Layer Moving?

Val: We’re going to wrap it up here. What do you want to wrap up on?

Kyle: For me it’s always exciting being at these events and seeing all the stuff people are building, and it’s crazy to me how fast the application layer is moving. When you think about it, the past 10 months alone are night and day. We’re heads-down trying to build the most reliable and fault-tolerant infrastructure we can, and seeing the diversity of what’s being built at the application layer is always exciting to me. So for me, it’s getting exposure to these companies out here, seeing all the cool stuff people are doing, and trying to keep up.

Val: And I think one of the leads you’re burying, which I appreciate, is the latency you’re able to offer with these KV cache memory optimizations. That’s a big part of the user experience. We take resiliency for granted, but we really notice latency.

Kyle: Yeah, you don’t notice it until it starts breaking. We want you to not think about it.

Val: Then it’s a fire drill.

Like This Discussion? There’s More!

Val Bercovici also spoke with Elisa Chen from Meta at the AI Infra Summit 2025 about how to balance rapid AI innovation with hardware procurement cycles, sharing strategies for GPU utilization, workload optimization, and regional capacity planning.