WEKA Accelerates AI Inference with NVIDIA Dynamo and NVIDIA NIXL


At WEKA, we believe that fast, frictionless data movement is critical to scaling AI workloads, especially in GPU-intensive environments where every microsecond matters. One way we've addressed this challenge is with WEKA's Augmented Memory Grid™, a revolutionary technology that extends GPU memory directly into WEKA's token warehouse™. This approach enables near-memory-speed transfers of key-value (KV) cache, which dramatically reduces time to first token (TTFT), significantly enhancing the user experience, and notably increases overall token throughput.
Continuing our mission to innovate high-performance data pathways between storage and compute, we've deepened our strategic partnership with NVIDIA. Through close collaboration, we aim to bring meaningful advancements in AI inference to life and open them up to a broader ecosystem. In doing so, we've invested heavily in supporting the open-source community, most recently with integrations that add NVIDIA Magnum IO GPUDirect Storage (GDS) support to LMCache and to NVIDIA TensorRT-LLM. Now we've taken this a step further and open-sourced a dedicated plugin for the NVIDIA Inference Transfer Library (NIXL), part of NVIDIA Dynamo. Early testing with NIXL is already showing promising performance.
NVIDIA Dynamo and NVIDIA NIXL
What Is NVIDIA Dynamo? Dynamo is a high-throughput, low-latency, open-source inference serving framework for deploying generative AI and reasoning models in large-scale distributed environments. It supports open-source inference libraries and frameworks, including PyTorch, SGLang, NVIDIA TensorRT-LLM, and vLLM, and it includes powerful features like disaggregated prefill and decode (sketched below) and KV-cache offloading, key requirements for handling large-context, multi-GPU LLM inference efficiently.
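To make disaggregated prefill and decode concrete, here is a minimal, runnable Python sketch; every name in it is a hypothetical stand-in, not Dynamo's actual API. Prefill is compute-bound and runs once over the whole prompt, while decode is memory-bound and streams tokens one at a time, so separating them lets each phase scale independently, with the KV cache handed off in between.

```python
# Toy sketch of disaggregated prefill/decode (hypothetical, not Dynamo's API).
from dataclasses import dataclass, field

@dataclass
class KVCache:
    blocks: list = field(default_factory=list)  # one (keys, values) pair per layer

def prefill(prompt: list[int], num_layers: int = 4) -> KVCache:
    """Stand-in for the compute-bound prefill pass: build a KV entry per layer."""
    cache = KVCache()
    for layer in range(num_layers):
        # Real systems store attention keys/values; we store toy lists.
        cache.blocks.append(([t + layer for t in prompt], list(prompt)))
    return cache  # in a disaggregated setup, this crosses the wire (e.g., via NIXL)

def decode(cache: KVCache, max_new_tokens: int) -> list[int]:
    """Stand-in for the memory-bound decode loop: reuse the cache, never re-prefill."""
    out, token = [], sum(len(keys) for keys, _ in cache.blocks)
    for _ in range(max_new_tokens):
        token = (token * 31 + 7) % 50_000  # toy next-token rule
        out.append(token)
    return out

kv = prefill([101, 2023, 2003, 102])  # would run on a prefill GPU pool
print(decode(kv, max_new_tokens=5))   # would run on a decode GPU pool
```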
What Is NVIDIA NIXL? NIXL (NVIDIA Inference Transfer Library) is a high-throughput, low-latency communication library designed specifically for inference workloads. NIXL extends NVIDIA Dynamo's capabilities by providing a unified API that simplifies and accelerates data movement across GPU memory and storage. These capabilities enable WEKA to stream KV-cache blocks directly from the Augmented Memory Grid at near-memory speeds, drastically reducing TTFT.
NIXL optimizes data transfers among GPU HBM, CPU DRAM, local SSDs, and networked storage (block, file, and object). It supports communication over NVIDIA NVLink, NVLink Switch, NVIDIA Quantum InfiniBand, and NVIDIA Spectrum-X Ethernet. NIXL dynamically selects the optimal backend, such as UCX, GDS, S3, or a custom plugin (like the one WEKA designed and open-sourced), through Dynamo's policy engine.
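Conceptually, backend selection can be pictured as a lookup from (source tier, destination tier) to a transfer plugin. The policy table below is a deliberately simplified, hypothetical illustration; Dynamo's real policy engine is richer and configuration-driven.

```python
# Hypothetical illustration of tier-based backend selection (not the real
# Dynamo policy engine). Given where data lives, pick a NIXL backend plugin.
BACKEND_POLICY = {
    ("GPU_HBM", "GPU_HBM"): "UCX",  # GPU-to-GPU over NVLink/RDMA
    ("GPU_HBM", "FILE"):    "GDS",  # GPU <-> file storage without a host bounce
    ("FILE", "GPU_HBM"):    "GDS",
    ("GPU_HBM", "OBJECT"):  "S3",   # GPU <-> object storage
}

def select_backend(src_tier: str, dst_tier: str) -> str:
    """Return the backend plugin name for a (source, destination) tier pair."""
    try:
        return BACKEND_POLICY[(src_tier, dst_tier)]
    except KeyError:
        raise ValueError(f"no backend registered for {src_tier} -> {dst_tier}")

# Recalling a KV-cache block from a file-backed token warehouse into GPU
# memory resolves to the GDS path (or to a storage plugin such as WEKA's):
print(select_backend("FILE", "GPU_HBM"))  # -> "GDS"
```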
WEKA as an Accelerant
The advanced features in NVIDIA Dynamo—disaggregated prefill and decode, as well as KV‑cache offloading—deliver their full value only when the underlying data path can keep pace. This is where WEKA’s Augmented Memory Grid acts as a catalytic accelerant.
- Token‑warehouse throughput at near-DRAM latencies: Augmented Memory Grid delivers KV cache at up to 252 GB/s per node, so Dynamo’s prefill kernels never stall waiting for data.
- Zero‑copy RDMA pipelines: Our newly open‑sourced GPUDirect Storage layer uses RDMA to move data end‑to‑end without CPU involvement, eliminating latency jitter and redundant copies and freeing host cycles for scheduling (see the sketch after this list).
- Elastic scale‑out capacity: NeuralMesh™ by WEKA® leverages a fully software-defined filesystem that aggregates NVMe across hundreds of nodes, giving Dynamo virtually limitless KV‑cache space without code changes.
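As a concrete illustration of the zero-copy path referenced in the list above, here is a minimal sketch that reads one KV-cache block from a file straight into GPU memory over GPUDirect Storage. It uses the open-source kvikio library purely for illustration (an assumption on our part; it is not WEKA's NIXL plugin), and the mount path is hypothetical.

```python
# Minimal GDS read sketch using kvikio (illustrative only; not WEKA's plugin).
# With GDS, the DMA runs NVMe/NIC -> GPU HBM with no bounce through host DRAM.
import cupy
import kvikio

KV_BLOCK_PATH = "/mnt/weka/kv_cache/block_000.bin"  # hypothetical mount path
BLOCK_BYTES = 8 << 20                               # 8 MB, as in the table below

buf = cupy.empty(BLOCK_BYTES, dtype=cupy.uint8)     # destination buffer in GPU HBM
f = kvikio.CuFile(KV_BLOCK_PATH, "r")
future = f.pread(buf, file_offset=0)                # non-blocking; reads can overlap
nbytes = future.get()                               # wait for DMA completion
f.close()
assert nbytes == BLOCK_BYTES
```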
Together, WEKA and NVIDIA are building optimized plugins that connect WEKA NeuralMesh’s lightning-fast token warehouse to NIXL—bringing persistent data closer to the compute path without introducing latency.
WEKA’s Lab Testing with NIXL
We recently ran performance tests on a WEKApod (8 hosts, 72 drives) with NVIDIA DGX systems, each equipped with 8x H100 GPUs. The WEKA token warehouse was made available over the compute network to simulate production-like conditions.
We tested NIXL using WEKA's open-source plugin:
| Threads / GPUs | I/O Pattern | Read (NIXL) | Write (NIXL) |
|---|---|---|---|
| 128 threads, 8 GPUs | 2 GB @ 2048 × 1 MB I/Os | 199.49 GB/s | 104.3 GB/s |
| 128 threads, 8 GPUs | 4 GB @ 512 × 8 MB I/Os | 260.29 GB/s | 117.28 GB/s |
| 128 threads, 8 GPUs | 4 GB @ 64 × 64 MB I/Os* | 179.86 GB/s | 81.88 GB/s |
| 384 threads, 8 GPUs | 2 GB @ 2048 × 1 MB I/Os | 269.72 GB/s | 123.44 GB/s |
*Larger I/Os typically mean higher transfer throughput; however, WEKA internally splits large I/Os into smaller blocks for efficiency, which is why the 64 MB pattern does not outpace the 8 MB pattern here.
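For intuition about what one of these workloads looks like, here is a rough host-side sketch of the 2 GB @ 2048 × 1 MB read pattern using plain Python threads and buffered file I/O. The published numbers were measured through NIXL with WEKA's open-source plugin, so this simplified version will not reproduce them; the test-file path is hypothetical.

```python
# Rough sketch of the "2 GB @ 2048 x 1 MB" read pattern from the table above
# (plain host-side I/O for intuition only; the real tests went through NIXL).
import os
import time
from concurrent.futures import ThreadPoolExecutor

PATH = "/mnt/weka/testfile"  # hypothetical 2 GB test file
IO_SIZE = 1 << 20            # 1 MB per I/O
NUM_IOS = 2048               # 2048 x 1 MB = 2 GB total
THREADS = 128

def read_chunk(fd: int, index: int) -> int:
    """Issue one positional 1 MB read and return the byte count."""
    return len(os.pread(fd, IO_SIZE, index * IO_SIZE))

fd = os.open(PATH, os.O_RDONLY)
start = time.perf_counter()
with ThreadPoolExecutor(max_workers=THREADS) as pool:
    total = sum(pool.map(lambda i: read_chunk(fd, i), range(NUM_IOS)))
elapsed = time.perf_counter() - start
os.close(fd)
print(f"{total / elapsed / 1e9:.2f} GB/s over {total / 1e9:.1f} GB")
```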
What’s Next
Inference systems are evolving quickly, and WEKA is proud to be part of building the open-source infrastructure needed to keep up. Our contributions to the Dynamo community and our continued work with NVIDIA reflect our shared commitment to creating fast, flexible, and open AI platforms for the future.
The next round of testing will focus on evaluating NVIDIA Dynamo’s KV Block Manager, showcasing end-to-end performance when data is dynamically offloaded and recalled from WEKA’s token warehouse during live inference.
Stay tuned for more updates as NVIDIA and WEKA continue to collaborate on NVIDIA Dynamo. In the meantime, check out our other blogs at weka.io/blog.