Transcript
Introductions
Steve McDowell: I’m Steve McDowell, founder and chief analyst at NAND Research. I’m here today with Val Bercovici, Chief AI Officer at WEKA. If you don’t know WEKA, they’re one of the leading innovators in high-performance, massively scalable data infrastructure. They provide the data backbone for some of the largest media, entertainment, HPC, and AI clusters in the world. One thing to say about WEKA is—they know data.
Which leads us to today’s topic: how storage and data infrastructure must evolve to meet the demands of generative AI. This is critical as AI moves from a few foundation-model players into the enterprise, where it’s becoming business-critical. Thanks for being here, Val.
Val Bercovici: A real pleasure, Steve. I can’t wait to get into this topic.
Steve: Before we nerd out on storage, let’s talk about the economics of AI. At the end of the day, AI is about creating business value, and WEKA was early in thinking about how data infrastructure needs to evolve to deliver that value. You started talking about something called tokenomics. Explain it like I’m five — what is tokenomics?
Understanding Tokenomics and the Cost of AI
Val: One way to explain it would be to delve into the spreadsheet accounting aspects or the various incentives involved, but a better way is to look at recent headlines. Cursor — probably the most popular AI app today by revenue — reportedly has $500 million in annual recurring revenue after less than two years, faster growth than OpenAI itself.
As the killer app of AI, Cursor is now rate-limiting its top-tier users on “unlimited” token plans because its own internal cost for tokens from model providers such as Anthropic (Claude), OpenAI, and Google (Gemini) has become too high at scale. So we’re seeing an end-user revolt as these agents throttle usage.
Anthropic and others are facing similar issues with their coding agents. Even with preferential pricing, their agent swarms — dozens of agents in parallel solving a task — are consuming more tokens than the providers can supply. The question becomes: how can people afford as many tokens as modern AI apps consume?
Steve: Who cares about tokens — who’s actually paying the price?
Val: There are three or four layers. At the base are the GPU holders — hyperscalers and a new category called AI or Neo Clouds that NVIDIA allocates GPUs to. They buy real estate and energy contracts, host GPUs, and rent them hourly. Their primary tenants are model providers like OpenAI and Anthropic, and open-weights hosts such as Together AI.
Pricing appears at several levels — dollars per input token, per output token, and now per cached input token. Developers pay those prices to build apps like Cursor or Manus AI and must make gross margins by turning unit costs into monthly plans ($20 or $200 tiers). Ultimately it’s end users deciding what they’ll pay. Some reports suggest domain-specific agents could cost $20,000 a month for Nobel-level results.
Steve: So tokens are the basic units that flow in and out of generative AI.
Val: Right. A token is roughly three-quarters of a word, though it varies by modality. It’s the common currency of AI. Each token gets converted into a multi-dimensional vector inside the model’s key-value cache (KV cache), which balloons from kilobytes to gigabytes in memory and puts pressure on GPUs that simply don’t have enough memory to serve them all.
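To put rough numbers on that KV-cache growth, here is a back-of-the-envelope sketch. The layer count, head count, head dimension, and context length are illustrative assumptions for a generic 70B-class model, not figures for any specific product.

```python
# Back-of-the-envelope KV cache sizing; model dimensions are illustrative assumptions.
def kv_cache_bytes(n_layers, n_heads, head_dim, n_tokens, bytes_per_elem=2):
    """Keys + values: one vector per layer, per attention head, per token (16-bit precision)."""
    return 2 * n_layers * n_heads * head_dim * n_tokens * bytes_per_elem

# Hypothetical 70B-class model: 80 layers, 64 heads, head_dim 128.
per_token = kv_cache_bytes(80, 64, 128, n_tokens=1)
per_session = kv_cache_bytes(80, 64, 128, n_tokens=32_768)   # one 32k-token context

print(f"KV cache per token:       {per_token / 2**20:.1f} MiB")    # ~2.5 MiB
print(f"KV cache per 32k session: {per_session / 2**30:.0f} GiB")  # ~80 GiB
```

Under these assumptions, a few kilobytes of prompt text really does become tens of gigabytes of GPU-resident state per long-context session, which is where the memory pressure comes from.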
From Tokenomics to Data Infrastructure: Measuring AI Infrastructure Economics
Steve: As a practitioner delivering AI services, what do tokens have to do with storage and data platforms?
Val: It’s somewhat existential. In AI inference, especially enterprise apps, you’re asking questions about documents, code bases, or content libraries using retrieval-augmented generation (RAG). That’s data-intensive, but vector databases keep the storage footprint modest and run well on commodity hardware, and most inference workloads don’t involve RAG at all. The more interesting question is how storage relates to memory: with the right configuration, storage can actually deliver memory value for AI inference.
Steve: Are there metrics for forecasting AI economics at enterprise scale?
Val: Developers see three classes of pricing becoming standard:
- Price per million input tokens.
- Price per million output tokens — up to 10× more expensive.
- Cached input tokens — around a 75% discount.
These rates determine everything from SaaS pricing to rate-limit policies.
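For illustration, those three price classes combine into a simple per-request cost model. The rates below are placeholders of roughly the right order of magnitude, not any provider’s actual price list.

```python
# Per-request cost under the three pricing classes; rates are illustrative (USD per 1M tokens).
PRICE_INPUT, PRICE_CACHED_INPUT, PRICE_OUTPUT = 3.00, 0.75, 15.00   # cached ~75% off, output ~5x

def request_cost(input_tokens, output_tokens, cache_hit_ratio=0.0):
    """Cost of one request, given what fraction of the prompt is served from cache."""
    cached = input_tokens * cache_hit_ratio
    uncached = input_tokens - cached
    return (uncached * PRICE_INPUT
            + cached * PRICE_CACHED_INPUT
            + output_tokens * PRICE_OUTPUT) / 1_000_000

# A coding-agent style request: 50k tokens of repository context, 2k tokens of answer.
print(f"no cache:   ${request_cost(50_000, 2_000):.3f}")
print(f"80% cached: ${request_cost(50_000, 2_000, cache_hit_ratio=0.8):.3f}")
```

Multiply that by thousands of requests per user per day and the pressure on an app like Cursor becomes obvious: cache hit rate is a direct lever on gross margin.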
Steve: What’s “cached” versus “uncached”?
Val: It refers to how inference works. Autoregressive models predict the next token based on previous ones. The larger the context window, the more computation required. That’s why input token cost is a function of content size and query volume. Reasoning tokens in modern large reasoning models (LRMs) add further intermediate processing. These don’t map directly to processor metrics like FLOPS or bandwidth — it’s a new way to think about infrastructure.
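One way to see why input-token cost tracks context size: a common rule of thumb is roughly 2 FLOPs per model parameter per token processed during prefill (attention adds more at very long contexts). The model size and effective GPU throughput below are assumptions for illustration.

```python
# Rough prefill compute as a function of prompt length (rule of thumb: ~2 FLOPs/param/token).
PARAMS = 70e9                       # hypothetical 70B-parameter model
EFFECTIVE_FLOPS = 1e15 * 0.4        # assume ~1 PFLOP/s peak at ~40% utilization on one GPU

for prompt_tokens in (1_000, 32_000, 128_000):
    seconds = 2 * PARAMS * prompt_tokens / EFFECTIVE_FLOPS
    print(f"{prompt_tokens:>7}-token prompt -> ~{seconds:5.1f} s of prefill compute")
```

That recomputation is exactly what a cached prompt skips, which is why providers can discount cached input tokens so heavily.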
Why Traditional Storage Fails AI Workloads
Steve: Why can’t I just buy a traditional array from CDW? Does it fail for AI training and inference?
Val: Not technically fail — you can connect it — but it’s like trying to suck a basketball through a straw. You need storage that understands its environment. CPUs process serially; GPUs are massively parallel. A single GPU has ~17,000 cores, and servers often have eight GPUs, so well over 100,000 cores hungry for data. If you’re not feeding them at wire speed, you’re starving a $3 million asset. That’s doing AI infrastructure wrong.
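The “starving a $3 million asset” point is easy to quantify. The depreciation schedule and idle fractions below are assumptions, but the arithmetic shows what I/O stalls cost.

```python
# Dollar cost of GPUs idling while they wait for data; figures are assumptions.
SERVER_COST = 3_000_000            # 8-GPU system, as cited above
DEPRECIATION_YEARS = 4
HOURS_PER_YEAR = 24 * 365

hourly = SERVER_COST / (DEPRECIATION_YEARS * HOURS_PER_YEAR)
for idle_fraction in (0.10, 0.30, 0.50):    # share of time cores sit waiting on I/O
    wasted = idle_fraction * hourly * HOURS_PER_YEAR
    print(f"{idle_fraction:.0%} idle -> ~${wasted:,.0f} of stranded capital per server per year")
```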
Training vs. Inference Bottlenecks
Steve: Where does storage get involved — what are the bottlenecks for training and inference?
Val: For training, the cycle is driven by epochs and checkpoints. You must persist checkpoints quickly to avoid losing work if GPUs fail. Each checkpoint can represent terabytes of memory. Fast checkpointing means faster time-to-market for models.
For inference, the challenge is scale and efficiency. Load-balancing thousands of LLM sessions across GPUs is non-deterministic and probabilistic, requiring high-performance compute networks (terabytes per second) and KV cache transfers between GPUs. This stress would break traditional storage, so storage must evolve to operate on the compute network.
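To put rough numbers on the checkpointing point: the parameter count, the bytes-per-parameter rule of thumb, and the storage bandwidths below are all assumptions, but they show how directly checkpoint bandwidth translates into training time.

```python
# Checkpoint size and flush time; all figures are illustrative assumptions.
PARAMS = 70e9                # hypothetical 70B-parameter model
BYTES_PER_PARAM = 14         # ~bf16 weights + fp32 master copy + Adam optimizer moments

ckpt_tb = PARAMS * BYTES_PER_PARAM / 1e12
print(f"checkpoint size: ~{ckpt_tb:.1f} TB")

for write_gbps in (5, 50, 500):      # aggregate write bandwidth of the storage tier, GB/s
    minutes = ckpt_tb * 1e3 / write_gbps / 60
    print(f"at {write_gbps:>3} GB/s -> ~{minutes:4.1f} min per synchronous checkpoint")
```

If checkpoints are synchronous, every one of those minutes is the whole training cluster standing still.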
The Memory Wall and KV Cache Bottlenecks
Steve: Storage typically doesn’t get involved in caching AI data, but you’re saying with inference and KV cache it does. Explain.
Val: Inference has two phases: prefill and decode. Prefill takes your prompt, converts it to key-value vectors, and expands it from hundreds of KB to tens of GB in memory. Loading model weights already consumes most GPU memory before processing begins. Memory is the bottleneck for inference, and memory supply isn’t keeping pace with GPU progress.
The solution: use storage as extended memory by attaching it over the compute network (“east-west” ports at 400–800 Gb/s). If storage can saturate that bandwidth, it delivers DRAM-class performance at 10–20× lower cost and 1,000× the density — radically improving tokenomics.
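Here is a sketch of what “storage as extended memory” buys, reusing the session size from the earlier KV-cache example. The HBM headroom, flash capacity, and link speed are assumptions for illustration.

```python
# Sessions resident in HBM vs. parked on a network-attached flash tier; assumed figures.
KV_PER_SESSION_GB = 40      # e.g. a long-context session on a large model
HBM_FREE_GB = 60            # HBM left on a GPU after loading model weights
FLASH_TIER_TB = 500         # network-attached NVMe namespace
LINK_GBPS = 800             # east-west port speed, gigabits per second

print(f"sessions resident in one GPU's HBM: {HBM_FREE_GB // KV_PER_SESSION_GB}")
print(f"sessions parked on the flash tier:  {int(FLASH_TIER_TB * 1e3 // KV_PER_SESSION_GB):,}")
print(f"time to pull one session back:      {KV_PER_SESSION_GB * 8 / LINK_GBPS:.1f} s at line rate")
```

Re-loading a parked KV cache in well under a second is far cheaper than recomputing its prefill from scratch, which is the tokenomics win Val is describing.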
Augmented Memory Grid and GPU Direct Storage
Steve: Does the storage system have to be aware that it’s an extension for KV cache?
Val: With WEKA, you don’t re-engineer storage — you configure it differently. Our Neural Mesh architecture is a mesh of containerized microservices. Configured over the compute network with RDMA and GPUDirect Storage (GDS), it appears as memory to the GPU software. That combination unlocks memory-class performance from NVMe devices at a fraction of the cost.
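For a concrete feel of the GPUDirect Storage path (independent of WEKA or any particular vendor), this is roughly what a direct storage-to-GPU read looks like using NVIDIA’s kvikio binding with CuPy. The mount path and buffer size are hypothetical, and kvikio falls back to a host bounce buffer if GDS isn’t configured on the underlying filesystem.

```python
# Minimal GPUDirect Storage-style read: NVMe -> GPU memory without staging in host DRAM.
# Uses NVIDIA's kvikio (cuFile) binding; path and sizes are hypothetical.
import cupy
import kvikio

# 256 MiB destination buffer allocated directly in GPU memory.
gpu_buf = cupy.empty(256 * 2**20 // 4, dtype=cupy.float32)

f = kvikio.CuFile("/mnt/fastfs/kv_block.bin", "r")   # hypothetical GDS-capable mount
f.read(gpu_buf)     # DMA into the CuPy buffer when GDS is available
f.close()

print("first values:", gpu_buf[:4])
```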
Steve: Every storage company claims GPUDirect support. How does WEKA approach this differently?
Val: Most “disaggregated” systems still use controller-based architectures that bottleneck on the motherboard. WEKA takes a true mesh approach. The network is faster than the motherboard, so we access storage over more PCIe lanes and spread that performance across disaggregated resources. It’s a many-to-many configuration with no inherent bottleneck.
We measure this quantitatively rather than with marketing claims. For optimal inference you need roughly 16× the bandwidth of traditional systems and consistent microsecond-level latency. Only WEKA has published benchmarks showing random I/O at hundreds of microseconds versus milliseconds elsewhere.
The Future of Enterprise AI Infrastructure
Steve: What emerging infrastructure technologies will be game-changers for the next wave of AI?
Val: The network is now faster than the server in an AI factory. With NVIDIA’s GB300 Grace Blackwell systems and 800 Gb/s ports (1.6 Tb/s bidirectional), next-gen infrastructure must fully utilize that bandwidth to avoid idle capital assets. Maximizing compute in phase one of inference and memory in phase two lowers OpEx and boosts gross margins — even against price leaders like DeepSeek or Kimi.
Steve: Enterprise AI workloads aren’t static — data is constantly flowing. Do you design storage for the full data lifecycle?
Val: Absolutely. NVIDIA calls it the data flywheel. During training, data arrives as billions of small files that must be labeled and transformed into epochs. Then you enter inference and fine-tuning, feeding new patterns back to training for efficiency.
We call this the “I/O blender.” You need a system that handles every I/O type — large, small, structured, unstructured — and optimizes globally across exabytes for minimum latency. That’s how you keep the flywheel profitable at massive scale.
Closing Thoughts: Think Differently About AI Infrastructure
Steve: My takeaway: AI forces us to think differently about storage and data infrastructure. It’s about the economics of AI and ROI for enterprise AI efforts.
Val: Exactly. To borrow a cliché — think differently. GPU supercomputing is the most parallel computing we’ve ever done, and it requires fundamentally different infrastructure up and down the stack — from silicon to software. Stay on the leading edge so you stay cash-flow positive and avoid rate limits and throttling. It’s a delicate dance — but a fun one with the right ecosystem.
Steve: Perfect place to end. For more information, visit weka.io for technical blogs and resources, including a paper NAND Research wrote with WEKA on data bottlenecks and enterprise AI. Thanks for joining us, Val.
Val: Thank you, Steve — looking forward to the next one.
Explore more from NAND Research
Steve recently wrote a Research Report called “Storage Impact on the AI Lifecycle”. This paper examines storage requirements across data ingestion, training, inference, and ongoing lifecycle management — with emphasis on inference economics and the “memory wall” now limiting production deployments. Understanding these requirements determines whether your AI infrastructure delivers competitive advantage or becomes a bottleneck.