Turn Your AI Factory Into a
Lean, Mean Token Machine
Extend GPU memory into a persistent Token Warehouse™ with 1000x more
memory capacity and radically increase token throughput.
1000x More KV Cache Capacity
Get 1000x more capacity than DRAM for KV cache data that remains persistent across sessions and node failures.
41x Faster Time to First Token
Deliver up to 41x faster time to first token (TTFT), and up to 6x faster once DRAM capacity limits are exceeded, to sustain high inference efficiency.
4x More Tokens per GPU
Achieve over 4x higher throughput per GPU, enabling greater concurrency and lower cost per token for large-context inference.
“The economics of large-scale inference are a major hurdle for enterprises. WEKA’s Augmented Memory Grid directly confronts this challenge. The 20x improvement in time-to-first-token we observed in joint testing on OCI isn’t just a performance metric; it fundamentally reshapes the cost structure of running AI workloads. For our customers, this makes deploying the next generation of AI easier and cheaper.”
Purpose-Built for AI Inference
Scalable, Persistent, Efficient
By offloading and persisting KV-cache data to a token warehouse in NeuralMesh, Augmented Memory Grid expands effective memory capacity by 1000x beyond DRAM. This eliminates redundant prefill computations, sustains high cache-hit rates, and significantly improves GPU efficiency.
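To make the mechanism concrete, here is a minimal sketch of prefix-keyed KV-cache reuse. `TokenWarehouseClient`, `lookup`, `offload`, and `model.prefill` are hypothetical names for illustration only, not WEKA's actual API; the real integration points are the NIXL, TensorRT-LLM, and LMCache plugins described below.

```python
import hashlib

import torch


# Hypothetical client for illustration; stands in for the NeuralMesh-backed store.
class TokenWarehouseClient:
    """Content-addressed store for KV-cache blocks, keyed by prompt prefix."""

    def __init__(self):
        self._store: dict[str, torch.Tensor] = {}

    @staticmethod
    def prefix_key(token_ids: list[int]) -> str:
        # Hash the prompt prefix so identical contexts map to the same entry.
        return hashlib.sha256(str(token_ids).encode("utf-8")).hexdigest()

    def lookup(self, token_ids: list[int]) -> torch.Tensor | None:
        return self._store.get(self.prefix_key(token_ids))

    def offload(self, token_ids: list[int], kv: torch.Tensor) -> None:
        # Persist the computed KV blocks; they outlive sessions and node restarts.
        self._store[self.prefix_key(token_ids)] = kv


def prefill_or_fetch(client, token_ids, model):
    """Skip redundant prefill when the prefix's KV cache is already stored."""
    kv = client.lookup(token_ids)
    if kv is None:
        kv = model.prefill(token_ids)   # expensive recompute on a cache miss
        client.offload(token_ids, kv)   # warm the warehouse for the next request
    return kv                           # on a hit, decode starts immediately
```

On a hit, decode begins from the stored KV blocks rather than recomputing the prefix, which is what eliminates redundant prefill and drives TTFT down.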
Microsecond Data Path with RDMA and GPUDirect Storage
By leveraging NVIDIA Magnum IO GPUDirect Storage (GDS) and RDMA, Augmented Memory Grid enables GPUs to fetch tokens directly from the NVMe-backed token warehouse with microsecond-scale latency and DRAM-class throughput.
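For a feel of what this data path looks like from application code, here is a minimal sketch using NVIDIA's open-source KvikIO bindings for cuFile/GDS. The file path and block size are illustrative assumptions; WEKA's internal data path is not shown here.

```python
import cupy
import kvikio

# Illustrative path; the real token warehouse is an NVMe-backed NeuralMesh store.
KV_BLOCK_PATH = "/mnt/weka/warehouse/kv_block_0001.bin"

# Allocate the destination buffer directly in GPU memory (16 MiB of float32).
kv_block = cupy.empty(16 * 1024 * 1024 // 4, dtype=cupy.float32)

# With GPUDirect Storage, cuFile DMAs the data NVMe -> GPU, bypassing the CPU
# bounce buffer; KvikIO falls back to a POSIX path if GDS is unavailable.
with kvikio.CuFile(KV_BLOCK_PATH, "r") as f:
    future = f.pread(kv_block)   # asynchronous read into GPU memory
    nbytes = future.get()        # block until the transfer completes

print(f"read {nbytes} bytes directly into GPU memory")
```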
Open Source Ecosystem Integrations
Integrates natively with NVIDIA Dynamo and NIXL through WEKA's open-source NIXL plugin, along with our open-source projects supporting TensorRT-LLM and LMCache, for drop-in adoption in existing inference pipelines.
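As one illustrative on-ramp, LMCache is configured through environment variables (or a YAML file) before the inference engine starts. The sketch below sets a few LMCache options from Python; the values and the mount path are assumptions for illustration, not WEKA recommendations.

```python
import os

# Configure LMCache before engine startup. Variable names follow LMCache's
# environment-based configuration; all values here are illustrative.
os.environ["LMCACHE_CHUNK_SIZE"] = "256"          # tokens per KV-cache chunk
os.environ["LMCACHE_LOCAL_CPU"] = "True"          # keep a hot tier in CPU DRAM
os.environ["LMCACHE_MAX_LOCAL_CPU_SIZE"] = "5.0"  # GB of DRAM for the hot tier

# Hypothetical spill target: a GDS-capable mount backed by the token warehouse.
os.environ["LMCACHE_LOCAL_DISK"] = "file:///mnt/weka/lmcache/"
os.environ["LMCACHE_MAX_LOCAL_DISK_SIZE"] = "500"  # GB on the warehouse tier
```

Once DRAM fills, chunks spill to the larger tier instead of being evicted, which is how the cache-hit rate stays high as context lengths and concurrency grow.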
Next-Gen AI Runs on NeuralMesh
Articles and Resources
Break the AI Memory Barrier with Augmented Memory Grid
AI builders can now streamline long-context reasoning and agentic AI workflows, transforming inference workloads into profitable business value.
Delivering High-Performance Inference on Oracle Cloud Infrastructure
Augmented Memory Grid makes long-context, multi-turn, and agentic inference achievable with 20x improved TTFT and GPU efficiency on OCI.
Fuel Your Inference Workloads
See how NeuralMesh with Augmented Memory Grid delivers high-efficiency memory design
that transforms inference scale, cost, and performance.