We’re constantly innovating to empower AI workloads, and at the core of that today is optimizing inference with WEKA’s Augmented Memory Grid™. Since we announced Augmented Memory Grid, it’s been exciting to see the interest and enthusiasm from the community and our customers, as well as the exceptional performance gains across a variety of workloads.

Today, we’re excited to take an important step forward by open-sourcing the GPUDirect Storage (GDS) integration layer that connects inference servers to WEKA’s Token Warehouse. This integration is one of the foundational components used by Augmented Memory Grid to accelerate inference, and we’re making it available within two leading high-performance inference frameworks.

Not only does this integration allow us to support community-driven innovation, it also gives you an opportunity to benchmark the GDS integration directly and measure critical improvements in token throughput, time-to-first-token (TTFT), and effective GPU utilization. Powered by Augmented Memory Grid, WEKA’s GDS-enabled token warehouse computes each token just once, persistently stores it, and retrieves it at near-memory speeds—eliminating the latency and cost of redundant token recomputation.
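To make the "compute once, store, reuse" flow concrete, here is a minimal, purely illustrative Python sketch of a prefix-keyed KV cache lookup. Every name in it is hypothetical and stands in for what a real connector does inside the inference engine, where KV tensors move over GDS rather than as Python objects.

```python
import hashlib
from typing import Optional

class TokenWarehouse:
    """Stands in for a persistent, NVMe-backed KV cache tier (hypothetical)."""
    def __init__(self) -> None:
        self._store: dict[str, bytes] = {}

    def get(self, key: str) -> Optional[bytes]:
        return self._store.get(key)

    def put(self, key: str, kv_blob: bytes) -> None:
        self._store[key] = kv_blob

def prefix_key(prompt_tokens: list[int]) -> str:
    # Identical prompt prefixes hash to the same key, so their KV cache
    # entries can be shared across requests.
    return hashlib.sha256(repr(prompt_tokens).encode()).hexdigest()

def run_prefill_on_gpu(prompt_tokens: list[int]) -> bytes:
    # Placeholder for the expensive prefill pass that builds the KV cache.
    return repr(prompt_tokens).encode()

def prefill(prompt_tokens: list[int], warehouse: TokenWarehouse) -> bytes:
    key = prefix_key(prompt_tokens)
    cached = warehouse.get(key)
    if cached is not None:
        # Cache hit: reuse the stored KV cache instead of recomputing it.
        return cached
    # Cache miss: compute the KV cache once, then persist it for reuse.
    kv_blob = run_prefill_on_gpu(prompt_tokens)
    warehouse.put(key, kv_blob)
    return kv_blob

warehouse = TokenWarehouse()
prefill([1, 2, 3], warehouse)   # first request pays the prefill cost
prefill([1, 2, 3], warehouse)   # identical prefix is served from the warehouse
```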

Open Source Integration for Leading Inference Frameworks

Building on vLLM, you now have a way to test how GDS acceleration with WEKA’s token warehouse performs by extending KV cache TTL and reducing the impact of the prefill phase.
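For a sense of what that test looks like in practice, here is a hedged offline vLLM sketch that enables an external KV connector via KVTransferConfig. The connector name and config fields shown follow recent vLLM/LMCache releases and are assumptions on our part; the model name and parallelism settings are illustrative, and the gist referenced below is the authoritative source for the WEKA-specific setup.

```python
from vllm import LLM, SamplingParams
from vllm.config import KVTransferConfig

# Assumption: connector name and config fields follow recent vLLM/LMCache
# releases; check the gist and your installed versions for the exact strings.
ktc = KVTransferConfig(kv_connector="LMCacheConnectorV1", kv_role="kv_both")

llm = LLM(
    model="meta-llama/Llama-3.3-70B-Instruct",  # model name is illustrative
    kv_transfer_config=ktc,
    tensor_parallel_size=8,
    gpu_memory_utilization=0.9,
)

# Repeating a long shared prefix is what exercises the external KV cache:
# the second request should reuse the stored prefill instead of recomputing it.
prompt = "<long shared context> Summarize the key changes."
for _ in range(2):
    out = llm.generate([prompt], SamplingParams(max_tokens=64))
    print(out[0].outputs[0].text)
```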

vLLM with LMCache: WEKA also provides a GDS integration for LMCache, giving a seamless path for those already running the vLLM architecture. To set up the GDS integration with Augmented Memory Grid for LMCache, check out this gist.
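LMCache itself is configured separately, typically through LMCACHE_* environment variables or a YAML file. The sketch below is a generic illustration of pointing an LMCache storage tier at a WEKA mount; the key names follow common LMCache options rather than the WEKA GDS-specific backend settings, and the mount path and sizes are assumptions. Defer to the gist for the exact configuration.

```python
import os

# Common LMCache options, set before the vLLM process imports the connector.
# Key names follow LMCache's environment-variable configuration; the
# GDS-specific backend settings used by WEKA's integration are in the gist.
os.environ["LMCACHE_CHUNK_SIZE"] = "256"           # tokens per KV chunk
os.environ["LMCACHE_LOCAL_CPU"] = "True"           # keep a DRAM tier in front
os.environ["LMCACHE_MAX_LOCAL_CPU_SIZE"] = "20"    # DRAM tier size, GB
# Assumption: the WEKA filesystem is mounted at /mnt/weka on the client.
os.environ["LMCACHE_LOCAL_DISK"] = "file:///mnt/weka/lmcache/"
os.environ["LMCACHE_MAX_LOCAL_DISK_SIZE"] = "500"  # warehouse-backed tier, GB

# Launch vLLM with the LMCache connector (see the previous sketch) after
# these variables are set so LMCache picks them up at initialization.
```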

Performance Benchmarks: Fastest Token Warehouse for Inference

Maximizing caching of the prefill phase to accelerate inference creates an extremely bursty, data-intensive workload. When the memory hierarchy is extended to include an NVMe layer, the data persistence layer must handle massive, short-lived spikes in throughput, especially in large-scale deployments.

That’s where WEKA’s token warehouse excels: it delivers industry-leading performance across a wide range of access patterns, block sizes, and concurrency levels. Below are some peak numbers to use as a comparison point:

  • Ultra-fast Token Warehouse to GPU transfers: Achieves single client throughput of 308 GB/s.
  • Ultra-fast GPU to Token Warehouse transfers: Achieves single client throughput of 163 GB/s.

We captured these numbers using the GDSIO benchmark with an 8-way NVIDIA H100 GPU system connected to an 8-host WEKApod based on the R6615 with 72 PCIe Gen 5 NVMe drives, connected over an NDR fabric. If you’d like to test this in your own environment to compare, you can use the LMCache gist.
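If you want a comparable raw GDS measurement of your own, gdsio (shipped with the NVIDIA GDS tools) is the benchmark we used. The wrapper below is a hedged sketch: the mount path, GPU index, worker count, file size, and IO size are assumptions, not the exact parameters behind the published numbers.

```python
import subprocess

# Sketch of a gdsio read test against a WEKA mount. All values below are
# assumptions chosen for illustration; tune them for your environment.
cmd = [
    "gdsio",
    "-D", "/mnt/weka/gdsio",  # directory on the WEKA filesystem (assumed mount)
    "-d", "0",                # GPU index
    "-w", "32",               # worker threads
    "-s", "10G",              # file size per worker
    "-i", "1M",               # IO size
    "-x", "0",                # transfer type 0 = GPUDirect Storage
    "-I", "0",                # IO type 0 = read (storage -> GPU); 1 = write
    "-T", "120",              # run for 120 seconds
]
result = subprocess.run(cmd, capture_output=True, text=True)
print(result.stdout)
```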

Performance Benchmarks: Inference System Efficiencies

It’s one thing to see raw throughput numbers from WEKA’s token warehouse; it’s another to see the impact of that performance on your inference environment. Below are results from our lab for Llama-3.3-70B at FP16 across a wide variety of context lengths within vLLM with LMCache:

Context length (tokens)   Prefill rate (vLLM)   vLLM TTFT
50                        0.084/s               0.046/s
1000                      0.1396/s              0.0602/s
2000                      0.2012/s              0.0557/s
8000                      0.4212/s              0.09656/s
16000                     0.8785/s              0.2588/s
24000                     1.35269/s             0.2444/s
32000                     1.7986/s              0.1944/s
64000                     4.0928/s              0.497/s
96000                     6.97/s                0.3995/s
128000                    10.5378/s             0.51919/s
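If you want to reproduce this kind of comparison yourself, the essential measurement is time-to-first-token with a repeated long prefix: the first request pays the full prefill cost, the second should be served from the cache. Below is a hedged sketch against a vLLM OpenAI-compatible endpoint; the endpoint URL, model name, and context-generation trick are assumptions for illustration.

```python
import time
from openai import OpenAI

# Assumes a vLLM OpenAI-compatible server at this URL; model name is illustrative.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="dummy")

def measure_ttft(prompt: str,
                 model: str = "meta-llama/Llama-3.3-70B-Instruct") -> float:
    start = time.perf_counter()
    stream = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        max_tokens=32,
        stream=True,
    )
    for _ in stream:          # the first streamed chunk marks the first token
        return time.perf_counter() - start
    return float("nan")

long_context = "lorem ipsum " * 8000   # stands in for a long shared prefix

cold = measure_ttft(long_context)      # prefill computed from scratch
warm = measure_ttft(long_context)      # prefill served from the KV cache
print(f"cold TTFT: {cold:.3f}s, warm TTFT: {warm:.3f}s")
```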

Get Involved & Join the Open Source Community

WEKA believes that open source fosters an ecosystem of innovation, allowing everyone—from startups to hyperscalers—to enhance AI inference infrastructures effectively. We invite you to integrate, evaluate, and contribute to WEKA’s GDS integrations today. Try them out, benchmark them, and tell us what you think—we’ve even created a channel in the vLLM Slack for your feedback, #WEKA-GDS-Integration.

Together, we can redefine what’s possible in AI inference, delivering unmatched performance, scalability, and efficiency. This is just the beginning. Stay tuned as we roll out more on WEKA Augmented Memory Grid — and continue delivering the future of scalable, high-performance AI storage.
