Augmented Memory Grid™

Turn Your AI Factory Into a Lean, Mean Token Machine

Extend GPU memory into a persistent Token Warehouse™ with 1000x more memory capacity and radically increase token throughput.

1000x More KV Cache Capacity

Get 1000x more capacity than DRAM for KV cache data that remains persistent across sessions and node failures.

6.5x Higher Token Throughput

Sustain up to 6.5x more input-token throughput under concurrent inference load, keeping GPUs fed even as working sets grow far beyond DRAM capacity.

10x More Concurrent Users

Scale to 10x more simultaneous long-context sessions on the same GPU cluster, without adding hardware or sacrificing throughput.

Discover the NeuralMesh Ecosystem

“The economics of large-scale inference are a major hurdle for enterprises. WEKA’s Augmented Memory Grid directly confronts this challenge. The 20x improvement in time-to-first-token we observed in joint testing on OCI isn’t just a performance metric; it fundamentally reshapes the cost structure of running AI workloads. For our customers, this makes deploying the next generation of AI easier and cheaper.”

Nathan Thomas
Vice President, Multicloud, Oracle Cloud Infrastructure

Learn More About Oracle Cloud x NeuralMesh

Purpose-Built for AI Inference

Scalable, Persistent, Efficient

By offloading and persisting KV cache data to a Token Warehouse™ in NeuralMesh, Augmented Memory Grid expands effective memory capacity by 1000x beyond DRAM. This eliminates redundant prefill computations, sustains high cache-hit rates, and significantly improves GPU efficiency.
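To make the idea concrete, here is a minimal sketch of a two-tier prefix cache: a small DRAM tier in front of a large persistent tier. It is an illustration only, not WEKA's implementation. The names `TieredKVCache`, `block_keys`, and `tokens_to_prefill`, the block size, and the in-memory dictionary standing in for the token warehouse are all assumptions made for the example.

```python
import hashlib
from collections import OrderedDict

BLOCK_TOKENS = 4  # tokens per cache block (tiny on purpose; an assumed value)

def block_keys(token_ids):
    """Return one chained hash key per full block of the prompt prefix."""
    keys, h = [], hashlib.sha256()
    full = len(token_ids) - len(token_ids) % BLOCK_TOKENS
    for start in range(0, full, BLOCK_TOKENS):
        block = token_ids[start:start + BLOCK_TOKENS]
        # Chaining the hash makes each key depend on every earlier block.
        h.update(",".join(map(str, block)).encode() + b";")
        keys.append(h.copy().hexdigest())
    return keys

class TieredKVCache:
    """Hypothetical two-tier cache: LRU DRAM tier plus a persistent tier."""

    def __init__(self, dram_blocks):
        self.dram = OrderedDict()  # small, fast tier with LRU eviction
        self.warehouse = {}        # stand-in for the large persistent tier
        self.dram_blocks = dram_blocks

    def _admit(self, key, kv_block):
        self.dram[key] = kv_block
        self.dram.move_to_end(key)
        while len(self.dram) > self.dram_blocks:
            self.dram.popitem(last=False)  # evict least recently used

    def put(self, key, kv_block):
        self.warehouse[key] = kv_block  # write through, so eviction loses nothing
        self._admit(key, kv_block)

    def get(self, key):
        if key in self.dram:
            self.dram.move_to_end(key)
            return self.dram[key]
        if key in self.warehouse:  # DRAM miss, persistent-tier hit
            kv_block = self.warehouse[key]
            self._admit(key, kv_block)
            return kv_block
        return None

def tokens_to_prefill(cache, token_ids):
    """Count tokens that still need prefill after reusing cached prefix blocks."""
    reused = 0
    for key in block_keys(token_ids):
        if cache.get(key) is None:
            break
        reused += BLOCK_TOKENS
    return len(token_ids) - reused

cache = TieredKVCache(dram_blocks=1)  # DRAM holds a single block
prompt = list(range(10))
for i, key in enumerate(block_keys(prompt)):
    cache.put(key, f"kv-block-{i}")   # placeholder for real KV tensors

follow_up = prompt + [99, 100]        # same conversation, two new tokens
print(tokens_to_prefill(cache, follow_up))  # prints 4
```

Even though the DRAM tier holds only one block here, both cached blocks are recovered from the persistent tier, so only 4 of the 12 tokens in the follow-up prompt need prefill. Without the persistent tier, the evicted block would have to be recomputed.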

Microsecond Data Path with RDMA and GPUDirect Storage

By leveraging NVIDIA Magnum IO GPUDirect Storage (GDS) and RDMA, Augmented Memory Grid enables GPUs to fetch tokens directly from the NVMe-backed token warehouse with microsecond-scale latency and DRAM-class throughput.

Open Source Ecosystem Integrations

Augmented Memory Grid integrates natively with NVIDIA Dynamo and NIXL through WEKA’s open-source NIXL plugin, and supports TensorRT-LLM and LMCache through WEKA’s other open-source projects, so it fits into existing inference pipelines.

Next-Gen AI Runs on NeuralMesh

Maximize GPU density and throughput per node. Reduce redundant computations, accelerate TTFT, and increase infrastructure efficiency to deliver more inference capacity per dollar.

Serve 100K+ token contexts and multi-session conversations without recomputation. Maintain persistent memory and stateful agents efficiently—scaling performance and profitability together.

Remove the performance limits of model APIs. Scale multi-agent and long-context applications with up to 100x greater token throughput and persistent memory.
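For a rough sense of why long contexts strain GPU memory and DRAM, the KV cache footprint of a single session can be estimated from the model shape. The sketch below uses the standard per-token formula (one key and one value vector per layer and KV head). The model shape is a hypothetical example chosen for illustration, not a figure from WEKA or from any benchmark cited on this page.

```python
def kv_cache_bytes(tokens, layers, kv_heads, head_dim, bytes_per_value):
    """Estimate KV cache size for one sequence of `tokens` tokens."""
    # Factor of 2: one key vector and one value vector per layer, head, token.
    return 2 * layers * kv_heads * head_dim * bytes_per_value * tokens

# Hypothetical model shape (assumed): 80 layers, 8 KV heads,
# head dimension 128, 16-bit (2-byte) cache values.
shape = dict(layers=80, kv_heads=8, head_dim=128, bytes_per_value=2)

per_token = kv_cache_bytes(1, **shape)
per_session = kv_cache_bytes(100_000, **shape)

print(f"{per_token / 1024:.0f} KiB per token")          # 320 KiB per token
print(f"{per_session / 1e9:.1f} GB per 100K context")   # 32.8 GB per 100K context
```

Under these assumptions, a single 100K-token session carries roughly 33 GB of KV cache, so a handful of concurrent long-context sessions can exceed what GPU memory and host DRAM hold together.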

Articles and Resources

Press Release

Serve 10x More AI Users on the Same GPU Footprint

Production-scale benchmarks on OCI show how WEKA Augmented Memory Grid removes KV cache bottlenecks to unlock higher throughput, persistent context, and better token economics for agentic AI.

Get the Details

Press Release

Break the AI Memory Barrier with Augmented Memory Grid

AI builders can now streamline long-context reasoning and agentic AI workflows, transforming inference workloads into profitable business value.

Get the Details

Blog

Slash Inference Costs by Maximizing Token Efficiency

See the Numbers

Blog

Go Beyond GPU Memory Limits with a Token Warehouse™

Learn How

Blog

4.2x More Tokens per GPU. Zero New Hardware.

See the Proof

Datasheet

Persistent GPU Memory for AI Inference at Scale

Get the Context

Fuel Your Inference Workloads

See how NeuralMesh with Augmented Memory Grid delivers a high-efficiency memory design that transforms inference scale, cost, and performance.

Show Me

