We’re constantly innovating to empower AI workloads, and at the core of that today is optimizing inference with WEKA’s Augmented Memory Grid™. Since we announced Augmented Memory Grid, it’s been exciting to see the interest and enthusiasm from the community and our customers, as well as the exceptional performance gains across a variety of workloads.

Today, we’re excited to take an important step forward by open-sourcing the GPUDirect Storage (GDS) integration layer that connects inference servers to WEKA’s Token Warehouse. This integration is one of the foundational components used by Augmented Memory Grid to accelerate inference, and we’re making it available within two leading high-performance inference frameworks.

Not only does this integration allow us to support community-driven innovation, it also gives you an opportunity to benchmark the GDS integration directly and measure critical improvements in token throughput, time-to-first-token (TTFT), and effective GPU utilization. Powered by Augmented Memory Grid, WEKA’s GDS-enabled token warehouse computes each token just once, persistently stores it, and retrieves it at near-memory speeds—eliminating the latency and cost of redundant token recomputation.
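To make the "compute once, store, reuse" flow concrete, here is a minimal, purely illustrative Python sketch of a prefix-keyed KV cache lookup. Every name in it is hypothetical and stands in for what a real connector does inside the inference engine, where KV tensors move over GDS rather than as Python objects.

```python
import hashlib
from typing import Optional

class TokenWarehouse:
    """Stands in for a persistent, NVMe-backed KV cache tier (hypothetical)."""
    def __init__(self) -> None:
        self._store: dict[str, bytes] = {}

    def get(self, key: str) -> Optional[bytes]:
        return self._store.get(key)

    def put(self, key: str, kv_blob: bytes) -> None:
        self._store[key] = kv_blob

def prefix_key(prompt_tokens: list[int]) -> str:
    # Identical prompt prefixes hash to the same key, so their KV cache
    # entries can be shared across requests.
    return hashlib.sha256(repr(prompt_tokens).encode()).hexdigest()

def run_prefill_on_gpu(prompt_tokens: list[int]) -> bytes:
    # Placeholder for the expensive prefill pass that builds the KV cache.
    return repr(prompt_tokens).encode()

def prefill(prompt_tokens: list[int], warehouse: TokenWarehouse) -> bytes:
    key = prefix_key(prompt_tokens)
    cached = warehouse.get(key)
    if cached is not None:
        # Cache hit: reuse the stored KV cache instead of recomputing it.
        return cached
    # Cache miss: compute the KV cache once, then persist it for reuse.
    kv_blob = run_prefill_on_gpu(prompt_tokens)
    warehouse.put(key, kv_blob)
    return kv_blob

warehouse = TokenWarehouse()
prefill([1, 2, 3], warehouse)   # first request pays the prefill cost
prefill([1, 2, 3], warehouse)   # identical prefix is served from the warehouse
```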

Open Source Integration for Leading Inference Frameworks

Building on vLLM, you now have a way to test how GDS acceleration with WEKA’s token warehouse performs by extending KV cache TTL and reducing the impact of the prefill phase.
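For a sense of what that test looks like in practice, here is a hedged offline vLLM sketch that enables an external KV connector via KVTransferConfig. The connector name and config fields shown follow recent vLLM/LMCache releases and are assumptions on our part; the model name and parallelism settings are illustrative, and the gist referenced below is the authoritative source for the WEKA-specific setup.

```python
from vllm import LLM, SamplingParams
from vllm.config import KVTransferConfig

# Assumption: connector name and config fields follow recent vLLM/LMCache
# releases; check the gist and your installed versions for the exact strings.
ktc = KVTransferConfig(kv_connector="LMCacheConnectorV1", kv_role="kv_both")

llm = LLM(
    model="meta-llama/Llama-3.3-70B-Instruct",  # model name is illustrative
    kv_transfer_config=ktc,
    tensor_parallel_size=8,
    gpu_memory_utilization=0.9,
)

# Repeating a long shared prefix is what exercises the external KV cache:
# the second request should reuse the stored prefill instead of recomputing it.
prompt = "<long shared context> Summarize the key changes."
for _ in range(2):
    out = llm.generate([prompt], SamplingParams(max_tokens=64))
    print(out[0].outputs[0].text)
```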

vLLM with LMCache: WEKA also provides a GDS integration for LMCache, giving a seamless path for those already running the vLLM architecture. To set up the GDS integration with Augmented Memory Grid for LMCache, check out this gist.
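LMCache itself is configured separately, typically through LMCACHE_* environment variables or a YAML file. The sketch below is a generic illustration of pointing an LMCache storage tier at a WEKA mount; the key names follow common LMCache options rather than the WEKA GDS-specific backend settings, and the mount path and sizes are assumptions. Defer to the gist for the exact configuration.

```python
import os

# Common LMCache options, set before the vLLM process imports the connector.
# Key names follow LMCache's environment-variable configuration; the
# GDS-specific backend settings used by WEKA's integration are in the gist.
os.environ["LMCACHE_CHUNK_SIZE"] = "256"           # tokens per KV chunk
os.environ["LMCACHE_LOCAL_CPU"] = "True"           # keep a DRAM tier in front
os.environ["LMCACHE_MAX_LOCAL_CPU_SIZE"] = "20"    # DRAM tier size, GB
# Assumption: the WEKA filesystem is mounted at /mnt/weka on the client.
os.environ["LMCACHE_LOCAL_DISK"] = "file:///mnt/weka/lmcache/"
os.environ["LMCACHE_MAX_LOCAL_DISK_SIZE"] = "500"  # warehouse-backed tier, GB

# Launch vLLM with the LMCache connector (see the previous sketch) after
# these variables are set so LMCache picks them up at initialization.
```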

Performance Benchmarks: Fastest Token Warehouse for Inference

Maximizing caching of the prefill phase to accelerate inference creates an extremely bursty, data-intensive workload. When the memory hierarchy is extended to include an NVMe layer, the data persistence layer must handle massive, short-lived spikes in throughput, especially in large-scale deployments.

That’s where WEKA’s token warehouse excels: it delivers industry-leading performance across a wide range of access patterns, block sizes, and concurrency levels. Below are some peak numbers to use as a comparison point:

  • Ultra-fast Token Warehouse to GPU transfers: Achieves single client throughput of 308 GB/s.
  • Ultra-fast GPU to Token Warehouse transfers: Achieves single client throughput of 163 GB/s.

We captured these numbers using the GDSIO benchmark with an 8-way NVIDIA H100 GPU system connected to an 8-host WEKApod based on the R6615 with 72 PCIe Gen 5 NVMe drives, connected over an NDR fabric. If you’d like to test this in your own environment to compare, you can use the LMCache gist.
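If you want a comparable raw GDS measurement of your own, gdsio (shipped with the NVIDIA GDS tools) is the benchmark we used. The wrapper below is a hedged sketch: the mount path, GPU index, worker count, file size, and IO size are assumptions, not the exact parameters behind the published numbers.

```python
import subprocess

# Sketch of a gdsio read test against a WEKA mount. All values below are
# assumptions chosen for illustration; tune them for your environment.
cmd = [
    "gdsio",
    "-D", "/mnt/weka/gdsio",  # directory on the WEKA filesystem (assumed mount)
    "-d", "0",                # GPU index
    "-w", "32",               # worker threads
    "-s", "10G",              # file size per worker
    "-i", "1M",               # IO size
    "-x", "0",                # transfer type 0 = GPUDirect Storage
    "-I", "0",                # IO type 0 = read (storage -> GPU); 1 = write
    "-T", "120",              # run for 120 seconds
]
result = subprocess.run(cmd, capture_output=True, text=True)
print(result.stdout)
```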

Performance Benchmarks: Inference System Efficiencies

It’s one thing to see raw throughput numbers from WEKA’s token warehouse; it’s another to see the impact of that performance on your inference environment. Below are results from our lab for Llama-3.3-70B at FP16 across a wide variety of context lengths within vLLM with LMCache:

Context length (tokens)   Prefill rate (vLLM)   vLLM TTFT
50                        0.084/s               0.046/s
1000                      0.1396/s              0.0602/s
2000                      0.2012/s              0.0557/s
8000                      0.4212/s              0.09656/s
16000                     0.8785/s              0.2588/s
24000                     1.35269/s             0.2444/s
32000                     1.7986/s              0.1944/s
64000                     4.0928/s              0.497/s
96000                     6.97/s                0.3995/s
128000                    10.5378/s             0.51919/s
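If you want to reproduce this kind of comparison yourself, the essential measurement is time-to-first-token with a repeated long prefix: the first request pays the full prefill cost, the second should be served from the cache. Below is a hedged sketch against a vLLM OpenAI-compatible endpoint; the endpoint URL, model name, and context-generation trick are assumptions for illustration.

```python
import time
from openai import OpenAI

# Assumes a vLLM OpenAI-compatible server at this URL; model name is illustrative.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="dummy")

def measure_ttft(prompt: str,
                 model: str = "meta-llama/Llama-3.3-70B-Instruct") -> float:
    start = time.perf_counter()
    stream = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        max_tokens=32,
        stream=True,
    )
    for _ in stream:          # the first streamed chunk marks the first token
        return time.perf_counter() - start
    return float("nan")

long_context = "lorem ipsum " * 8000   # stands in for a long shared prefix

cold = measure_ttft(long_context)      # prefill computed from scratch
warm = measure_ttft(long_context)      # prefill served from the KV cache
print(f"cold TTFT: {cold:.3f}s, warm TTFT: {warm:.3f}s")
```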

Get Involved & Join the Open Source Community

WEKA believes that open source fosters an ecosystem of innovation, allowing everyone—from startups to hyperscalers—to enhance AI inference infrastructures effectively. We invite you to integrate, evaluate, and contribute to WEKA’s GDS integrations today. Try them out, benchmark them, and tell us what you think—we’ve even created a channel in the vLLM Slack for your feedback, #WEKA-GDS-Integration.

Together, we can redefine what’s possible in AI inference, delivering unmatched performance, scalability, and efficiency. This is just the beginning. Stay tuned as we roll out more on WEKA Augmented Memory Grid — and continue delivering the future of scalable, high-performance AI storage.
