WEKA Sets A New Bar With 75X Faster Time to First Token (TTFT)

At WEKA, we’re committed to reimagining what’s possible for AI infrastructure. Today, we’re excited to share a major leap forward in large language model (LLM) inference performance. Recently, we shared our latest open source advancements with LM Cache and our NVIDIA Magnum IO GPUDirect Storage (GDS) integration for Augmented Memory Grid. Building on that work, we’re happy to share that our open source integration now also supports NVIDIA TensorRT-LLM with GDS.
Breaking Through the Memory Barrier
As AI workloads take on more capabilities, including coding agents (and agentic AI more broadly), the Model Context Protocol (MCP), and reasoning, context windows are stretching into the hundreds of thousands of tokens. Memory capacity and inference latency have become critical challenges, and organizations looking to deploy AI at scale are hitting a memory wall as they try to serve these use cases. Traditional approaches quickly run into hardware and efficiency limits, making large-context, high-throughput inference both expensive and complex.
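To put the memory wall in concrete terms, here is a rough back-of-the-envelope estimate of KV cache size for a single long-context request. It assumes the published Llama-3.1-70B architecture (80 layers, 8 grouped-query KV heads, head dimension 128) and an FP16 KV cache; it is an illustration, not a benchmark measurement.

```python
# Back-of-the-envelope KV cache sizing for a single long-context request.
# Assumed Llama-3.1-70B config: 80 layers, 8 KV heads (GQA), head dim 128, FP16.
NUM_LAYERS = 80
NUM_KV_HEADS = 8
HEAD_DIM = 128
BYTES_PER_ELEMENT = 2            # FP16
CONTEXT_TOKENS = 128_000

# Factor of 2 accounts for storing both the K and V tensors at every layer.
bytes_per_token = 2 * NUM_LAYERS * NUM_KV_HEADS * HEAD_DIM * BYTES_PER_ELEMENT
total_bytes = bytes_per_token * CONTEXT_TOKENS

print(f"KV cache per token: {bytes_per_token / 1024:.0f} KiB")                # ~320 KiB
print(f"KV cache at {CONTEXT_TOKENS:,} tokens: {total_bytes / 1e9:.1f} GB")   # ~41.9 GB
```

A single 128K-token context on a model of this class can consume tens of gigabytes of KV cache on its own, before accounting for model weights or concurrent requests. That is exactly the pressure that pushes inference into the memory wall.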
WEKA’s Augmented Memory Grid changes the equation. By seamlessly extending GPU key-value (KV) cache memory into a distributed, NVMe-backed storage layer, Augmented Memory Grid unlocks the ability to serve massive LLM workloads without the bottlenecks that have limited growth and innovation.
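To make the idea concrete, here is a minimal Python sketch of a two-tier KV cache: a hot tier in GPU memory and a warm tier on an NVMe-backed filesystem. All class and method names are hypothetical, and the sketch stages data through host memory via torch.save/torch.load; WEKA’s actual integration moves KV blocks between GPU memory and storage directly over GDS, which this sketch does not attempt to reproduce.

```python
# Minimal sketch of a tiered KV cache: a hot tier in GPU memory and a warm
# tier on an NVMe-backed filesystem. Names are hypothetical, and the I/O here
# goes through host memory; it does not model the direct GDS data path.
from typing import Optional

import torch


class TieredKVCache:
    def __init__(self, nvme_dir: str, device: str = "cuda"):
        self.gpu_cache: dict[str, torch.Tensor] = {}  # hot tier: GPU HBM
        self.nvme_dir = nvme_dir                      # warm tier: NVMe-backed path
        self.device = device

    def get(self, prefix_hash: str) -> Optional[torch.Tensor]:
        # 1) Fast path: KV blocks already resident in GPU memory.
        if prefix_hash in self.gpu_cache:
            return self.gpu_cache[prefix_hash]
        # 2) Otherwise, try the NVMe tier and promote the blocks to GPU memory.
        try:
            kv = torch.load(f"{self.nvme_dir}/{prefix_hash}.pt", map_location=self.device)
        except FileNotFoundError:
            return None  # full cache miss: the prefill must be recomputed
        self.gpu_cache[prefix_hash] = kv
        return kv

    def put(self, prefix_hash: str, kv: torch.Tensor) -> None:
        # Keep a copy on the GPU and persist to the NVMe tier for later reuse.
        self.gpu_cache[prefix_hash] = kv
        torch.save(kv.cpu(), f"{self.nvme_dir}/{prefix_hash}.pt")
```

The point of the sketch is only the lookup order: GPU first, NVMe second, recompute last. Prefix hashing, eviction, and the direct GPU-to-storage transfer are where the real engineering lives.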
Next-Generation Performance—WEKA’s Latest Contribution for the Open Source Community
Augmented Memory Grid delivered breakthrough results in our latest benchmarks, which used the Llama-3.1-70B model running on NVIDIA H100 systems. With Augmented Memory Grid’s integration for NVIDIA TensorRT-LLM with GDS support, we observed dramatic improvements in both speed and scalability, delivering up to 75X faster prefill times for long-context prompts and consistently lower latency across a wide range of token counts and quantization strategies.
To see our open source contribution, check out GPUDirect Storage (GDS) for KV Cache.
For the complete setup guide, see the WEKA Augmented Memory Grid Setup.
For an FP16 test, we saw these results:
| Tokens | TensorRT-LLM Prefill Time (ms) | WEKA GDS Prefill Time (ms) | % Difference |
|---|---|---|---|
| 50 | 27.925 | 25.328 | 10.25 |
| 1000 | 165.260 | 25.538 | 547.12 |
| 2000 | 304.856 | 29.350 | 938.69 |
| 8000 | 1199.311 | 38.787 | 2992.03 |
| 16000 | 2412.058 | 54.519 | 4324.28 |
| 24000 | 3648.305 | 61.598 | 5822.81 |
| 32000 | 4934.670 | 88.661 | 5465.75 |
| 64000 | 10369.098 | 157.749 | 6473.17 |
| 96000 | 16312.039 | 214.237 | 7514.03 |
| 128000 | 22998.583 | 301.481 | 7528.53 |
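For reference, the "% Difference" column is computed relative to the WEKA GDS prefill time. The short snippet below reproduces a few rows of the table and the equivalent speedup factor from the measured values.

```python
# Reproducing the "% Difference" column from the measured prefill times above:
# % difference = (TensorRT-LLM time - WEKA GDS time) / WEKA GDS time * 100.
prefill_ms = {
    # tokens: (TensorRT-LLM baseline, WEKA GDS)
    1_000:   (165.260, 25.538),
    32_000:  (4934.670, 88.661),
    128_000: (22998.583, 301.481),
}

for tokens, (baseline, weka) in prefill_ms.items():
    pct = (baseline - weka) / weka * 100
    print(f"{tokens:>7,} tokens: {pct:8.2f}% difference ({baseline / weka:.1f}x faster)")
```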
We believe in accelerating progress across the entire AI ecosystem. That’s why we’ve contributed our GDS integration for KV cache management to the open source community, giving developers and researchers direct access to these same high-performance capabilities through popular optimization frameworks: LM Cache and now NVIDIA TensorRT-LLM.
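As an illustration of what GDS-style I/O looks like from Python, the sketch below uses RAPIDS KvikIO, a public wrapper around NVIDIA’s cuFile API, to move a GPU buffer to and from a file. KvikIO and the /mnt/weka path are stand-ins chosen for this example; this is not the code path of WEKA’s TensorRT-LLM or LM Cache integration.

```python
# Illustrative GPUDirect Storage-style transfer using RAPIDS KvikIO (cuFile).
# The file path is a hypothetical mount point; substitute your own filesystem.
import cupy
import kvikio

# Stand-in for a KV cache block living in GPU memory.
kv_block = cupy.arange(1_000_000, dtype=cupy.float32)

# Write the GPU buffer directly to storage.
f = kvikio.CuFile("/mnt/weka/kv_block.bin", "w")
f.write(kv_block)
f.close()

# Read it back straight into GPU memory.
restored = cupy.empty_like(kv_block)
f = kvikio.CuFile("/mnt/weka/kv_block.bin", "r")
f.read(restored)
f.close()

assert bool((kv_block == restored).all())
```

When GDS is available, transfers like these can bypass host-memory bounce buffers, which is what keeps reloading cached KV blocks fast enough to beat recomputing the prefill.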
What Does This Mean for You?
- Dramatic Reductions in Latency: Achieve near-instant response times even with large, complex prompts—enabling new user experiences and business applications.
- Scalable, Efficient Infrastructure: Eliminate the need to over-provision GPUs, lower costs, and deploy bigger models, faster.
- Open Innovation: Leverage WEKA’s open source contributions to adopt cutting-edge storage and inference optimization in your own workflows.
Join Us in Shaping the Future
Whether you’re a developer, infrastructure architect, or AI leader, WEKA’s Augmented Memory Grid and our ongoing open source initiatives put industry-leading performance, flexibility, and scale directly in your hands. Together, we’re redefining what’s possible for AI inference and opening new frontiers for LLM-powered applications. Discover how WEKA is continuing to accelerate AI inference at a massive scale.