Turn Your AI Factory Into a
Lean, Mean Token Machine
Extend GPU memory into a persistent Token Warehouse™ with 1000x more
memory capacity and radically increase token throughput.
1000x More KV Cache Capacity
Get 1000x more capacity than DRAM for KV cache data that remains persistent across sessions and node failures.
41x Faster Time to First Token
Deliver up to 41x faster time to first token (TTFT), and up to 6x faster once DRAM capacity limits are exceeded, to sustain high inference efficiency.
4x More Tokens per GPU
Achieve over 4x higher throughput per GPU, enabling greater concurrency and lower cost per token for large-context inference.
“The economics of large-scale inference are a major hurdle for enterprises. WEKA’s Augmented Memory Grid directly confronts this challenge. The 20x improvement in time-to-first-token we observed in joint testing on OCI isn’t just a performance metric; it fundamentally reshapes the cost structure of running AI workloads. For our customers, this makes deploying the next generation of AI easier and cheaper.”
Purpose-Built for AI Inference
Scalable, Persistent, Efficient
By offloading and persisting KV-cache data to a token warehouse in NeuralMesh, Augmented Memory Grid expands effective memory capacity by 1000x beyond DRAM. This eliminates redundant prefill computations, sustains high cache-hit rates, and significantly improves GPU efficiency.
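To make the mechanism concrete, here is a minimal sketch of prefix-keyed KV-cache reuse. `TokenWarehouseClient`, `lookup`, `offload`, and `model.prefill` are hypothetical names for illustration only, not WEKA's actual API; the real integration points are the NIXL, TensorRT-LLM, and LMCache plugins described below.

```python
import hashlib

import torch


# Hypothetical client for illustration; stands in for the NeuralMesh-backed store.
class TokenWarehouseClient:
    """Content-addressed store for KV-cache blocks, keyed by prompt prefix."""

    def __init__(self):
        self._store: dict[str, torch.Tensor] = {}

    @staticmethod
    def prefix_key(token_ids: list[int]) -> str:
        # Hash the prompt prefix so identical contexts map to the same entry.
        return hashlib.sha256(str(token_ids).encode("utf-8")).hexdigest()

    def lookup(self, token_ids: list[int]) -> torch.Tensor | None:
        return self._store.get(self.prefix_key(token_ids))

    def offload(self, token_ids: list[int], kv: torch.Tensor) -> None:
        # Persist the computed KV blocks; they outlive sessions and node restarts.
        self._store[self.prefix_key(token_ids)] = kv


def prefill_or_fetch(client, token_ids, model):
    """Skip redundant prefill when the prefix's KV cache is already stored."""
    kv = client.lookup(token_ids)
    if kv is None:
        kv = model.prefill(token_ids)   # expensive recompute on a cache miss
        client.offload(token_ids, kv)   # warm the warehouse for the next request
    return kv                           # on a hit, decode starts immediately
```

On a hit, decode begins from the stored KV blocks rather than recomputing the prefix, which is what eliminates redundant prefill and drives TTFT down.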
Microsecond Data Path with RDMA and GPUDirect Storage
By leveraging NVIDIA Magnum IO GPUDirect Storage (GDS) and RDMA, Augmented Memory Grid enables GPUs to fetch tokens directly from the NVMe-backed token warehouse with microsecond-scale latency and DRAM-class throughput.
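For a feel of what this data path looks like from application code, here is a minimal sketch using NVIDIA's open-source KvikIO bindings for cuFile/GDS. The file path and block size are illustrative assumptions; WEKA's internal data path is not shown here.

```python
import cupy
import kvikio

# Illustrative path; the real token warehouse is an NVMe-backed NeuralMesh store.
KV_BLOCK_PATH = "/mnt/weka/warehouse/kv_block_0001.bin"

# Allocate the destination buffer directly in GPU memory (16 MiB of float32).
kv_block = cupy.empty(16 * 1024 * 1024 // 4, dtype=cupy.float32)

# With GPUDirect Storage, cuFile DMAs the data NVMe -> GPU, bypassing the CPU
# bounce buffer; KvikIO falls back to a POSIX path if GDS is unavailable.
with kvikio.CuFile(KV_BLOCK_PATH, "r") as f:
    future = f.pread(kv_block)   # asynchronous read into GPU memory
    nbytes = future.get()        # block until the transfer completes

print(f"read {nbytes} bytes directly into GPU memory")
```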
Open Source Ecosystem Integrations
Integrates natively with NVIDIA Dynamo and NIXL through WEKA's open-source NIXL plugin, along with our open-source projects supporting TensorRT-LLM and LMCache, for drop-in adoption in existing inference pipelines.
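As one illustrative on-ramp, LMCache is configured through environment variables (or a YAML file) before the inference engine starts. The sketch below sets a few LMCache options from Python; the values and the mount path are assumptions for illustration, not WEKA recommendations.

```python
import os

# Configure LMCache before engine startup. Variable names follow LMCache's
# environment-based configuration; all values here are illustrative.
os.environ["LMCACHE_CHUNK_SIZE"] = "256"          # tokens per KV-cache chunk
os.environ["LMCACHE_LOCAL_CPU"] = "True"          # keep a hot tier in CPU DRAM
os.environ["LMCACHE_MAX_LOCAL_CPU_SIZE"] = "5.0"  # GB of DRAM for the hot tier

# Hypothetical spill target: a GDS-capable mount backed by the token warehouse.
os.environ["LMCACHE_LOCAL_DISK"] = "file:///mnt/weka/lmcache/"
os.environ["LMCACHE_MAX_LOCAL_DISK_SIZE"] = "500"  # GB on the warehouse tier
```

Once DRAM fills, chunks spill to the larger tier instead of being evicted, which is how the cache-hit rate stays high as context lengths and concurrency grow.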
Next-Gen AI Runs on NeuralMesh
Articles and Resources
Break the AI Memory Barrier with Augmented Memory Grid
AI builders can now streamline long-context reasoning and agentic AI workflows, transforming inference workloads into profitable business value.
Delivering High-Performance Inference on Oracle Cloud Infrastructure
Augmented Memory Grid makes long-context, multi-turn, and agentic inference achievable with 20x improved TTFT and GPU efficiency on OCI.
Fuel Your Inference Workloads
See how NeuralMesh with Augmented Memory Grid delivers high-efficiency memory design
that transforms inference scale, cost, and performance.