WEKA Sets A New Bar With 75X Faster Time to First Token (TTFT)

At WEKA, we’re committed to reimagining what’s possible for AI infrastructure. Today, we’re excited to share a major leap forward in large language model (LLM) inference performance. Recently, we shared our latest open source advancements with LM Cache and our NVIDIA Magnum IO GPUDirect Storage (GDS) integration for Augmented Memory Grid. Building on that work, we’re happy to share that our open source integration now also supports NVIDIA TensorRT-LLM with GDS.
Breaking Through the Memory Barrier
As AI workloads take on more capabilities, including coding agents (and agentic AI more broadly), the Model Context Protocol (MCP), and reasoning, context windows are stretching into the hundreds of thousands of tokens. Memory capacity and inference latency have become critical challenges, and organizations looking to deploy AI at scale are hitting a memory wall as they try to serve these use cases. Traditional approaches quickly run into hardware and efficiency limits, making large-context, high-throughput inference both expensive and complex.
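To put the memory wall in concrete terms, here is a rough back-of-the-envelope estimate of KV cache size for a single long-context request. It assumes the published Llama-3.1-70B architecture (80 layers, 8 grouped-query KV heads, head dimension 128) and an FP16 KV cache; it is an illustration, not a benchmark measurement.

```python
# Back-of-the-envelope KV cache sizing for a single long-context request.
# Assumed Llama-3.1-70B config: 80 layers, 8 KV heads (GQA), head dim 128, FP16.
NUM_LAYERS = 80
NUM_KV_HEADS = 8
HEAD_DIM = 128
BYTES_PER_ELEMENT = 2            # FP16
CONTEXT_TOKENS = 128_000

# Factor of 2 accounts for storing both the K and V tensors at every layer.
bytes_per_token = 2 * NUM_LAYERS * NUM_KV_HEADS * HEAD_DIM * BYTES_PER_ELEMENT
total_bytes = bytes_per_token * CONTEXT_TOKENS

print(f"KV cache per token: {bytes_per_token / 1024:.0f} KiB")                # ~320 KiB
print(f"KV cache at {CONTEXT_TOKENS:,} tokens: {total_bytes / 1e9:.1f} GB")   # ~41.9 GB
```

A single 128K-token context on a model of this class can consume tens of gigabytes of KV cache on its own, before accounting for model weights or concurrent requests. That is exactly the pressure that pushes inference into the memory wall.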
WEKA’s Augmented Memory Grid changes the equation. By seamlessly extending GPU key-value (KV) cache memory into a distributed, NVMe-backed storage layer, Augmented Memory Grid unlocks the ability to serve massive LLM workloads without the bottlenecks that have limited growth and innovation.
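To make the idea concrete, here is a minimal Python sketch of a two-tier KV cache: a hot tier in GPU memory and a warm tier on an NVMe-backed filesystem. All class and method names are hypothetical, and the sketch stages data through host memory via torch.save/torch.load; WEKA’s actual integration moves KV blocks between GPU memory and storage directly over GDS, which this sketch does not attempt to reproduce.

```python
# Minimal sketch of a tiered KV cache: a hot tier in GPU memory and a warm
# tier on an NVMe-backed filesystem. Names are hypothetical, and the I/O here
# goes through host memory; it does not model the direct GDS data path.
from typing import Optional

import torch


class TieredKVCache:
    def __init__(self, nvme_dir: str, device: str = "cuda"):
        self.gpu_cache: dict[str, torch.Tensor] = {}  # hot tier: GPU HBM
        self.nvme_dir = nvme_dir                      # warm tier: NVMe-backed path
        self.device = device

    def get(self, prefix_hash: str) -> Optional[torch.Tensor]:
        # 1) Fast path: KV blocks already resident in GPU memory.
        if prefix_hash in self.gpu_cache:
            return self.gpu_cache[prefix_hash]
        # 2) Otherwise, try the NVMe tier and promote the blocks to GPU memory.
        try:
            kv = torch.load(f"{self.nvme_dir}/{prefix_hash}.pt", map_location=self.device)
        except FileNotFoundError:
            return None  # full cache miss: the prefill must be recomputed
        self.gpu_cache[prefix_hash] = kv
        return kv

    def put(self, prefix_hash: str, kv: torch.Tensor) -> None:
        # Keep a copy on the GPU and persist to the NVMe tier for later reuse.
        self.gpu_cache[prefix_hash] = kv
        torch.save(kv.cpu(), f"{self.nvme_dir}/{prefix_hash}.pt")
```

The point of the sketch is only the lookup order: GPU first, NVMe second, recompute last. Prefix hashing, eviction, and the direct GPU-to-storage transfer are where the real engineering lives.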
Next-Generation Performance—WEKA’s Latest Contribution for the Open Source Community
Augmented Memory Grid delivered breakthrough results in our latest benchmarks, which used the Llama-3.1-70B model running on NVIDIA H100 systems. With Augmented Memory Grid’s integration for NVIDIA TensorRT-LLM with GDS support, we observed dramatic improvements in both speed and scalability, delivering up to 75X faster prefill times for long-context prompts and consistently lower latency across a wide range of token counts and quantization strategies.
To see our open source contribution, check out GPUDirect Storage (GDS) for KV Cache.
For the complete setup guide, see the WEKA Augmented Memory Grid Setup.
For an FP16 test, we saw these results:
| Tokens | TensorRT-LLM Prefill Time (ms) | WEKA GDS Prefill Time (ms) | % Difference |
|---|---|---|---|
| 50 | 27.925 | 25.328 | 10.25 |
| 1000 | 165.260 | 25.538 | 547.12 |
| 2000 | 304.856 | 29.350 | 938.69 |
| 8000 | 1199.311 | 38.787 | 2992.03 |
| 16000 | 2412.058 | 54.519 | 4324.28 |
| 24000 | 3648.305 | 61.598 | 5822.81 |
| 32000 | 4934.670 | 88.661 | 5465.75 |
| 64000 | 10369.098 | 157.749 | 6473.17 |
| 96000 | 16312.039 | 214.237 | 7514.03 |
| 128000 | 22998.583 | 301.481 | 7528.53 |
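For reference, the "% Difference" column is computed relative to the WEKA GDS prefill time. The short snippet below reproduces a few rows of the table and the equivalent speedup factor from the measured values.

```python
# Reproducing the "% Difference" column from the measured prefill times above:
# % difference = (TensorRT-LLM time - WEKA GDS time) / WEKA GDS time * 100.
prefill_ms = {
    # tokens: (TensorRT-LLM baseline, WEKA GDS)
    1_000:   (165.260, 25.538),
    32_000:  (4934.670, 88.661),
    128_000: (22998.583, 301.481),
}

for tokens, (baseline, weka) in prefill_ms.items():
    pct = (baseline - weka) / weka * 100
    print(f"{tokens:>7,} tokens: {pct:8.2f}% difference ({baseline / weka:.1f}x faster)")
```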
We believe in accelerating progress across the entire AI ecosystem. That’s why we’ve contributed our GDS integration for KV cache management to the open source community, giving developers and researchers direct access to these same high-performance capabilities through popular optimization frameworks: LM Cache and now NVIDIA TensorRT-LLM.
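As an illustration of what GDS-style I/O looks like from Python, the sketch below uses RAPIDS KvikIO, a public wrapper around NVIDIA’s cuFile API, to move a GPU buffer to and from a file. KvikIO and the /mnt/weka path are stand-ins chosen for this example; this is not the code path of WEKA’s TensorRT-LLM or LM Cache integration.

```python
# Illustrative GPUDirect Storage-style transfer using RAPIDS KvikIO (cuFile).
# The file path is a hypothetical mount point; substitute your own filesystem.
import cupy
import kvikio

# Stand-in for a KV cache block living in GPU memory.
kv_block = cupy.arange(1_000_000, dtype=cupy.float32)

# Write the GPU buffer directly to storage.
f = kvikio.CuFile("/mnt/weka/kv_block.bin", "w")
f.write(kv_block)
f.close()

# Read it back straight into GPU memory.
restored = cupy.empty_like(kv_block)
f = kvikio.CuFile("/mnt/weka/kv_block.bin", "r")
f.read(restored)
f.close()

assert bool((kv_block == restored).all())
```

When GDS is available, transfers like these can bypass host-memory bounce buffers, which is what keeps reloading cached KV blocks fast enough to beat recomputing the prefill.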
What Does This Mean for You?
- Dramatic Reductions in Latency: Achieve near-instant response times even with large, complex prompts—enabling new user experiences and business applications.
- Scalable, Efficient Infrastructure: Eliminate the need to over-provision GPUs, lower costs, and deploy bigger models, faster.
- Open Innovation: Leverage WEKA’s open source contributions to adopt cutting-edge storage and inference optimization in your own workflows.
Join Us in Shaping the Future
Whether you’re a developer, infrastructure architect, or AI leader, WEKA’s Augmented Memory Grid and our ongoing open source initiatives put industry-leading performance, flexibility, and scale directly in your hands. Together, we’re redefining what’s possible for AI inference and opening new frontiers for LLM-powered applications. Discover how WEKA is continuing to accelerate AI inference at a massive scale.