
Inside the AI Capacity Crunch: Solving Latency, Memory Limits, and Multi-Agent Scaling | VentureBeat AI Impact Tour NYC

00:00

Intro & Setup

Matt: Welcome, Val. We did one of these in San Francisco; great to be here in New York City. Thanks for making this event possible. You’re too modest to list your own contributions across Kubernetes, protocols, storage—even Windows—but I’ll just say: there are a lot. I’m curious about your reactions to the earlier comments. Anything unexpected or provocative? You see a lot of customers investing in GPUs—what stood out?

00:45

When will real market rates hit AI inference?

Val: Not provoked—more inspired. Ben and James are representative of the market: foundational model builders enabling so much, and ultra-modern cloud-native application builders. The last question hit me hardest. I’m a technologist, but lately I’m focused on unit economics and gross margins. Think Uber: after Dara Khosrowshahi came in, surge pricing and real market rates appeared.

We don’t have “real market rates” in AI yet—especially in inference. Prices are subsidized to fuel innovation. But with trillions in CapEx and finite energy OpEx, real market rates will show up—possibly next year, certainly by 2027. That will force an even keener focus on efficiency.

02:53

More tokens, more value—at what latency and cost?

Matt: Let’s walk through token growth. There’s no bubble in demand for tokens; the question is the economics to generate them and the real use cases. You’ve said enterprises are moving from “vibe-coding” to agent-based application building. How does token usage scale from here?

Val: Rule one: in this industry, more is more. More tokens → exponentially more business value. The challenge is sustainability. There’s a classic triad: accuracy (non-negotiable), latency, and cost.

  • Accuracy requires lots of high-quality tokens and, with guardrails/security, even more.
  • You can trade between latency and cost. For some consumer cases, you might tolerate latency to save cost.
  • But with agents, nobody wants hours-long responses. Agents typically operate in swarms—hundreds or thousands of multi-turn prompt/response cycles. Compound delay across those turns becomes untenable, so ultra-low latency matters—and today that’s often subsidized. Those costs must come down.
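
To make the compounding concrete, here is a back-of-the-envelope sketch of how per-turn latency adds up across a swarm's sequential turns; every number in it is an illustrative assumption, not a measurement.

```python
# Back-of-the-envelope latency math for a chained agent swarm.
# Every number here is an illustrative assumption, not a measurement.

turns_on_critical_path = 1_000   # sequential prompt/response cycles chained in one agent run
time_to_first_token_s  = 1.5     # assumed prefill + queueing latency per turn
decode_time_s          = 3.0     # assumed generation time per turn

per_turn_s = time_to_first_token_s + decode_time_s
total_hours = turns_on_critical_path * per_turn_s / 3600
print(f"Critical-path wall clock: {total_hours:.2f} hours")   # 1.25 hours at these numbers

# Because the turns are chained, any per-turn saving multiplies across the run:
halved_hours = turns_on_critical_path * (per_turn_s / 2) / 3600
print(f"With per-turn latency halved: {halved_hours:.2f} hours")
```
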
04:34

What is an agent swarm? From orchestrator to committee

Matt: “Agent swarm” isn’t universally understood. Can you unpack it?

Val: A single ChatGPT prompt/response isn’t an agent. The best current example of real agents is in software generation. Take Anthropic’s recent demo with Claude Sonnet 4.5: “Write the entire Slack app.” The agent spent ~30 hours and tens of thousands of dollars, decomposing the goal into a swarm of subtasks.

  • It starts with an orchestrator agent (smartest model) deciding architecture, cloud/on-prem, performance, privacy, and security.
  • Sub-agents execute in parallel—hundreds of sessions from an inference standpoint.
  • Evaluator models check each task against spec. You’ll often get ~10 final candidate builds; an agent committee votes—8 fail, one is slow, one is fast/efficient/correct. That one “wins.”
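
As a rough illustration of that shape (not any vendor's actual framework), here is a minimal sketch of the orchestrator, parallel sub-agents, evaluators, and committee vote; the llm() stub, the task decomposition, and the scoring are all placeholders.

```python
# Minimal sketch of the swarm pattern described above: an orchestrator decomposes
# a goal, sub-agents work subtasks in parallel, evaluators score candidates, and
# a committee vote picks the winner. Everything below is a placeholder sketch.
import random
from concurrent.futures import ThreadPoolExecutor

def llm(prompt: str) -> str:
    """Placeholder for a model call; returns a fake completion."""
    return f"<completion for: {prompt[:40]}...>"

def orchestrate(goal: str) -> list[str]:
    # The "smartest model" plans architecture, deployment, privacy, security, etc.
    _plan = llm(f"Decompose into subtasks: {goal}")
    return [f"{goal} :: subtask {i}" for i in range(8)]  # stand-in for parsing the plan

def run_subtask(task: str) -> str:
    return llm(f"Implement: {task}")

def evaluate(candidate: str, spec: str) -> float:
    # An evaluator model would check the build against the spec; fake a score here.
    _ = llm(f"Score {candidate} against {spec}")
    return random.random()

def committee_vote(candidates: list[str], spec: str, judges: int = 5) -> str:
    # Each judge scores every candidate; the highest total score "wins".
    totals = {c: sum(evaluate(c, spec) for _ in range(judges)) for c in candidates}
    return max(totals, key=totals.get)

goal = "Build a team chat application"
subtasks = orchestrate(goal)
with ThreadPoolExecutor() as pool:            # sub-agents run as parallel sessions
    builds = list(pool.map(run_subtask, subtasks))
winner = committee_vote(builds, spec=goal)
print("Selected build:", winner)
```
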
06:52

The new season: reinforcement learning as the critical path

Matt: Let’s hit reinforcement learning (RL). Enterprises want ways to inject domain knowledge; RL is the new paradigm. How does that affect token economics?

Val: I think in “seasons.” Early this year was the reasoning season—reasoning models became useful and, with open source, affordable. By late spring, the agent season arrived—context windows got large enough; GPUs were available enough that coding agents now generate as much as ~90% of software in some workflows (that wasn’t true 12 months ago).

Right now we’re in the reinforcement learning season. RL blends training + inference into a unified, loop-heavy workflow—thousands of loops. Leading labs (OpenAI, Anthropic, Google) see RL as critical path—the latest scaling law toward AGI. To do RL well, you must apply best practices from both training and inference at speed and scale.
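
A schematic of that loop, with every function a placeholder, shows why RL sits on top of both inference (rollout generation) and training (policy updates); this is a generic RL-fine-tuning sketch, not any lab's pipeline.

```python
# Schematic of why RL fine-tuning interleaves inference and training: each loop
# generates rollouts (an inference workload), scores them with a reward signal
# (human feedback or a verifiable checker), then updates the policy (a training
# workload). All functions below are placeholders.

def generate_rollouts(policy, prompts, samples_per_prompt=4):
    # Inference-heavy: sample several completions per prompt from the current policy.
    return [(p, policy(p)) for p in prompts for _ in range(samples_per_prompt)]

def reward(prompt, completion):
    # Verifiable reward (tests pass, answer matches) or a learned reward model.
    return 1.0 if "expected" in completion else 0.0

def update_policy(policy, scored_rollouts):
    # Training-heavy: gradient step toward higher-reward completions (e.g. PPO-style).
    return policy  # placeholder: return the "updated" policy unchanged

policy = lambda prompt: f"draft answer to {prompt}"
prompts = ["task A", "task B"]

for step in range(1000):  # thousands of loops, each mixing inference and training
    rollouts = generate_rollouts(policy, prompts)
    scored = [(p, c, reward(p, c)) for p, c in rollouts]
    policy = update_policy(policy, scored)
```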

09:05

Pragmatic takeaways: hybrid deployment and transaction-level economics

Matt: One-minute takeaway?

Val: There’s no single answer. All-on-prem can be right for model builders; being cloud-native can be right for agility and innovation. Many will do both.

We’re in a boom—some subsidies are masking real costs—but rising token prices won’t stop usage; they’ll force fine-grained usage. Measure unit economics by transaction more than by raw token price. That’s the lens that matters.
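
A toy calculation makes the transaction lens concrete; the prices, token counts, and value figure below are made up purely for illustration.

```python
# Toy unit-economics comparison: per-transaction cost matters more than raw
# token price. Prices and token counts are made-up illustrative figures.

scenarios = {
    # name: (price per 1M tokens in $, tokens consumed per completed transaction)
    "cheap tokens, chatty agent":      (2.00, 5_000_000),
    "pricier tokens, efficient agent": (6.00, 1_000_000),
}

value_per_transaction = 25.00  # assumed business value of one completed outcome

for name, (price_per_m, tokens) in scenarios.items():
    cost = price_per_m * tokens / 1_000_000
    margin = value_per_transaction - cost
    print(f"{name}: cost ${cost:.2f}/txn, margin ${margin:.2f}/txn")
# The "cheaper" tokens lose on margin ($10 cost vs. $6), which is the point:
# transaction-level economics, not list price per token, is the lens.
```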

10:38

Regulation: enable innovation now, refine rules later

Audience (attendee from the World Economic Forum): The U.S. is moving fast; Europe is more regulated. How do you see ceilings on regulation? Could elections or EU changes slow AI’s growth?

Val: Recent coverage suggests OpenAI traffic has tailed off in Europe; speculation is that heavier regulation limits access to the newest features (software dev, life sciences, personal tools), dampening usage. Personal view: this industry is embryonic; early, heavy regulation is premature. The U.S. (and China, in its way) is doing the right thing by spurring innovation. We’re not at a “Skynet” moment—there’s still an off-switch.

That said, AI progresses exponentially. We’ll need better regulation later—but the how is hard. Laws that try to regulate FLOPs, token counts, or parameter sizes aren’t effective yet.

13:21

Measuring value beyond tokens: toward pay-for-outcome

Audience (attendee from UBS Private Wealth): Not every model is transactional. I don’t care how many tokens an analyst used if outcomes are good. How should we measure marginal utility per token—or something more holistic?

Val: It’s case-by-case. What granularity do you measure? Token level? Output level? I like models that charge pay-for-outcome vs. pay-for-use. Risky for startups today, but essential, because variable token costs at the highest quality aren’t sustainable. Business outcomes beat supply-chain unit costs.

We’re in year 1–2 of what could be a 15-year maturity curve (cf. cloud). We’ll eventually harmonize metrics—but not yet.

15:17

WEKA’s angle: delivering memory performance at storage economics

Audience: Give us the 90–120 second elevator pitch. What’s WEKA’s secret sauce?

Val: WEKA is a storage and memory company for AI/ML. Long before the November 2022 “ChatGPT moment,” WEKA served top self-driving and drug-discovery orgs (e.g., MSK) as a high-performance storage system. I joined after seeing we could apply storage economics to deliver memory-class performance—a radical shift for inference and RL.

AI lives at the intersection of abundance (what AI can do) and scarcity (GPUs, power, especially HBM on GPUs). That high-bandwidth memory is the most valuable “real estate” in the world on a per-unit basis—spot market prices jumped ~10% yesterday. If you can augment GPU memory at storage-level costs, you transform inference economics.

Matt (aside): The GPU has very little HBM; WEKA figured out how to serve the KV cache the model needs—fast. It’s pretty cool technology.
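
Some rough arithmetic shows why the KV cache is the pressure point; the model shape and context length below are generic assumptions (a 70B-class transformer with grouped-query attention), not WEKA's or any vendor's published figures.

```python
# Rough KV-cache sizing, to show why long context eats HBM. The model shape is
# a generic 70B-class transformer with grouped-query attention; all numbers are
# assumptions for illustration only.

layers       = 80
kv_heads     = 8        # grouped-query attention
head_dim     = 128
bytes_fp16   = 2
context_len  = 128_000
concurrent_sequences = 8

kv_bytes_per_token = 2 * layers * kv_heads * head_dim * bytes_fp16  # K and V
per_sequence_gb = kv_bytes_per_token * context_len / 1e9
total_gb = per_sequence_gb * concurrent_sequences

print(f"KV cache per token: {kv_bytes_per_token/1e6:.2f} MB")       # ~0.33 MB
print(f"Per 128k-token sequence: {per_sequence_gb:.1f} GB")         # ~42 GB
print(f"For {concurrent_sequences} concurrent sequences: {total_gb:.0f} GB")
# A single GPU typically ships with well under 200 GB of HBM, so long-context,
# multi-session inference quickly exceeds it; tiering the KV cache onto fast
# external storage is one way to stretch that capacity.
```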

17:50

Will models collapse on synthetic data?

Audience: If models train on synthetic outputs (less human data), do they collapse?

Val: That was my concern months ago. But free tiers are turning us into the product (in the classic internet way): interactions provide real RL data—human feedback or verifiable rewards—feeding the next training cycles. Meanwhile, synthetic datasets are being curated to avoid prior pitfalls. We’re in a phase where our usage—if we’re not paying—helps generate the training data.

19:17

Inference overtakes training: solving the prefill bottleneck

Audience: Two questions. (1) How will cost-per-token and throughput reshape data-platform design? (2) How will capital allocation shift between compute and data infrastructure over 12–24 months?

Val: On (2): CapEx/OpEx poured into training the last few years. In 2025 we’re seeing a turn toward inference. Many analysts (and I agree) think we’re moving from roughly four of every five dollars going to training toward four of every five going to inference.

Implication: NVIDIA pre-announced Rubin CPX, the first dedicated inference processor—and even more specifically, one focused on the prefill phase (the inner loop). Prefill is the fundamental bottleneck in transformer inference. If you reduce required prefill (WEKA helps by augmenting memory), you unlock far more tokens at sustainable power and tokenomics.

Matt: Prefill is when the model processes your prompt and all its context before it can generate anything. Big challenge.

Val: Exactly—the inner loop of AI. Solve it, and you open a whole new era.
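
A toy FLOPs estimate illustrates why long prompts make prefill the time-to-first-token bottleneck; the model size, prompt length, and sustained throughput below are assumptions for illustration only.

```python
# Toy FLOPs model of why prefill dominates time-to-first-token on long prompts.
# FLOPs ~= 2 * parameters * tokens is a common rough estimate for a dense
# transformer forward pass; all hardware and model numbers are assumptions.

params            = 70e9      # generic dense 70B-class model
prompt_tokens     = 100_000   # agent context: code, docs, prior turns
output_tokens     = 1_000
sustained_flops_s = 400e12    # assumed sustained throughput on one accelerator

prefill_flops          = 2 * params * prompt_tokens   # the whole prompt, before token 1
decode_flops_per_token = 2 * params                   # one forward pass per generated token

time_to_first_token_s = prefill_flops / sustained_flops_s
decode_compute_s      = output_tokens * decode_flops_per_token / sustained_flops_s

print(f"Prefill (time to first token): {time_to_first_token_s:.0f} s")   # 35 s here
print(f"Decode compute for {output_tokens} tokens: {decode_compute_s:.2f} s")
# Decode is usually memory-bandwidth bound in practice, so its real time is higher
# than this compute-only figure, but the prefill point stands: reusing a cached
# prefill (e.g. a stored KV cache for shared context) removes most of that
# time-to-first-token cost, which is the "inner loop" win described above.
```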

21:41

Close

Matt: We’re out of time, but we could keep going. Big round of applause for Val.

Val: Thank you!

Like This Discussion? There’s More!

Watch Matt and Val discuss AI inference, agent swarms, and token economics during another VentureBeat AI Impact Tour stop in San Francisco.