In The Magic of WEKA Happens at Scale, we explored how our architecture defies traditional infrastructure limits, getting better, not more brittle, as it grows. That blog covered the “what” of what NeuralMesh™ delivers: antifragile performance at massive scale.

Now, it’s time for the “how.”

This post is the magician pulling back the curtain. If the last blog showed you the illusion, this one shows you some of the engineering tricks behind it. Specifically, how NeuralMesh delivers blazingly fast recovery, distributed resilience, and compounding reliability as your cluster scales.

Spoiler: it’s not magic. It just feels like it to our customers.

Traditional storage systems treat resiliency like an afterthought—something to be tolerated, not optimized. NeuralMesh flips that on its head. In fact, it thrives under pressure. The bigger the cluster, the faster it heals and the more resilient it becomes.

Here’s why.

At the heart of NeuralMesh is a distributed architecture built on stripes of 4K blocks spread across all available failure domains (think: nodes, racks, or zones). No two blocks from the same stripe live in the same domain, so the loss of any single domain affects at most one block per stripe, and the system keeps humming along with zero performance impact. The larger the number of failure domains, the lower the probability that two failed nodes share blocks from the same stripe. For example, with a stripe width of 18 (16+2) and a cluster of 20 nodes, there are C(20,18) = 190 possible stripe placements. Adding one more node to bring the cluster to 21 increases that to C(21,18) = 1,330. Grow the cluster to 25 nodes and there are C(25,18) = 480,700 possible placements. The addition of just 5 more nodes has dramatically reduced the probability of any two servers sharing a chunk of the same stripe.
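To make the combinatorics concrete, here’s a minimal Python sketch (our illustration, not WEKA code) that reproduces those counts with math.comb:

```python
import math

STRIPE_WIDTH = 18  # 16 data + 2 parity blocks per stripe

# Count the distinct ways an 18-wide stripe can be placed across N failure domains.
for nodes in (20, 21, 25):
    print(f"{nodes} nodes: {math.comb(nodes, STRIPE_WIDTH):,} possible placements")

# Output:
# 20 nodes: 190 possible placements
# 21 nodes: 1,330 possible placements
# 25 nodes: 480,700 possible placements
```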

Then comes the magic: rebuilds.

All healthy failure domains in the cluster join forces to rebuild the missing blocks using distributed parity calculations, writing the reconstructed data across all remaining healthy domains. Unlike traditional RAID or erasure coding systems, where only a few nodes handle recovery, NeuralMesh enlists every available compute core in the cluster. Even nodes that don’t store data pitch in. The result? Rebuilds that scale with your system.

  • Got 50 nodes and 1 failure? 49 nodes pitch in for a fast rebuild
  • Got 500 nodes and 1 failure? 499 nodes pitch in, and the rebuild completes roughly 10x faster

That’s linear resilience at work.
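Here’s a back-of-the-envelope model of that linear scaling. It’s our simplification, not NeuralMesh internals: it assumes every healthy node contributes a fixed, hypothetical rebuild bandwidth, so rebuild time for a fixed amount of lost data falls in proportion to the number of helpers.

```python
# Toy model of linear rebuild scaling. The bandwidth figure is hypothetical;
# only the shape (time ~ 1 / helpers) reflects the behavior described above.
LOST_DATA_GB = 76_800          # e.g., one failed host with 10 x 7.68 TB drives
PER_NODE_REBUILD_GBPS = 1.0    # assumed per-node rebuild throughput

def rebuild_minutes(total_nodes: int, failures: int = 1) -> float:
    helpers = total_nodes - failures           # every healthy node pitches in
    seconds = LOST_DATA_GB / (helpers * PER_NODE_REBUILD_GBPS)
    return seconds / 60

print(f"50 nodes:  ~{rebuild_minutes(50):.1f} min with 49 helpers")
print(f"500 nodes: ~{rebuild_minutes(500):.1f} min with 499 helpers")
# The 500-node cluster rebuilds ~10x faster: time scales linearly with helper count.
```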

NeuralMesh also prioritizes the data that’s most at risk: specifically, stripes impacted by multiple failures. Rebuilding those first returns the system to a “protected” state sooner, shrinking the vulnerability window and restoring redundancy, often before you even realize there was a problem.
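As a sketch of that risk-first ordering (our illustration of the idea, not WEKA’s actual scheduler), imagine sorting stripes so those missing the most blocks are rebuilt first:

```python
# Risk-first rebuild ordering: stripes missing the most blocks are the
# closest to data unavailability, so they are rebuilt first.
from dataclasses import dataclass

@dataclass
class Stripe:
    stripe_id: int
    missing_blocks: int  # blocks of this stripe that lived on failed domains

def rebuild_order(stripes: list[Stripe]) -> list[Stripe]:
    return sorted(stripes, key=lambda s: s.missing_blocks, reverse=True)

stripes = [Stripe(1, 1), Stripe(2, 3), Stripe(3, 2)]
print([s.stripe_id for s in rebuild_order(stripes)])  # -> [2, 3, 1]
```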

Behind the Curtain: How WEKA Rebuilds Faster at Scale

To demonstrate the “trick” of how NeuralMesh becomes more resilient at scale, we modeled two clusters with identical configurations—except for size. Both used:

  • 16+4 erasure coding (+4 protection level)
  • 7.68TB NVMe drives (10 per host)
  • 30 compute cores per host
  • 70% capacity utilization
  • 4 simultaneous failures
  • Host MTBF of 10 days

The only difference? One cluster had 100 nodes (3,000 cores total), and the other had 50 nodes (1,500 cores total).

As the table below shows, the 100-node system rebuilds data significantly faster; scale directly shortens the time the system spends fully exposed.

After four simultaneous failures in a 16+4 configuration, the system has no remaining redundancy, and a fifth failure would result in data unavailability. That’s why rebuilding even one level of protection, from zero back to +1, is critical: it restores fault tolerance and ensures the data can survive another failure. In the 100-node cluster, that return to protected status happens in just ~1 minute. In the 50-node cluster, it takes ~10 minutes, a far longer risk window (a rough estimate of that exposure follows the table below). This demonstrates how larger clusters achieve faster rebuild times by using all available cluster compute resources.

Protection Level   Rebuild Time, 100 nodes (hh:mm:ss)   Rebuild Time, 50 nodes (hh:mm:ss)
16 + 1             00:01:01                             00:09:44
16 + 2             00:03:28                             00:25:30
16 + 3             00:16:29                             01:05:47
16 + 4             01:22:58                             02:45:27
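To put a rough number on those risk windows, here’s a simple estimate (our approximation, using the modeled 10-day host MTBF) of how many additional host failures you’d expect while each cluster is still fully exposed:

```python
# Expected additional host failures during the fully exposed window,
# treating each surviving host as failing at a constant rate of 1/MTBF.
# The constant-rate assumption is ours; MTBF and times come from the model above.
MTBF_DAYS = 10.0

def expected_failures(surviving_hosts: int, window_minutes: float) -> float:
    return surviving_hosts * (window_minutes / (24 * 60)) / MTBF_DAYS

# Window = time to regain the first level of protection (16 + 1 row above).
print(f"100-node cluster: {expected_failures(96, 1 + 1/60):.4f}")   # ~0.0068
print(f"50-node cluster:  {expected_failures(46, 9 + 44/60):.4f}")  # ~0.0311
# Despite having twice the hosts, the larger cluster's exposure is ~5x smaller.
```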

See the Magic in Action

Want proof it’s not just smoke and mirrors? Watch this short demo of NeuralMesh rebuilds on a similar (though not identical) configuration to see how quickly the system returns to a protected state, even after multiple failures. This is resilience that scales like magic (but runs on math).

Just as a cache can make systems feel faster than they are, NeuralMesh makes recovery feel effortless. Except this time, there’s no illusion. We explored cache as a clever sleight of hand in this blog; here, you’re watching real infrastructure magic: data loss avoided, protection restored, and performance untouched.

More scale. More speed. More safety.
NeuralMesh doesn’t just survive at scale, it gets stronger—no wand required.

Explore all of NeuralMesh’s Powerful Capabilities