Why Storage Architecture Is the New Bottleneck for HPC and AI Teams


TL;DR: At SC25, I was invited to speak in both the Lablup and Hitachi booths about how data infrastructure—not compute—has become the primary bottleneck limiting AI and HPC performance (you can grab a link to the talk at the end of this post). After each session, people came up to me to continue the conversation about the problems they’re running into. The bottom line: organizations building massive GPU clusters are seeing their expensive hardware sit idle because storage can’t keep pace. In this post, I break that down in more detail.
Every year at Supercomputing, the conversations reveal where the industry’s pressure points really are. This year in St. Louis, the pattern was unmistakable: as GPU clusters scale, organizations are hitting a wall that has nothing to do with compute power.
The assumption used to be simple—add more GPUs, get faster results. But when data can’t reach those GPUs fast enough, even the most powerful accelerators underperform. I’ve seen this play out again and again: teams invest millions in building out GPU infrastructure, only to find their expensive compute sitting idle half the time.
The data layer has become the #1 bottleneck limiting training speed, HPC simulations, and analytics throughput.
Why GPU Performance No Longer Equals System Performance
AI and HPC workloads produce chaotic I/O patterns that legacy storage simply wasn’t designed to handle:
- Massively parallel random reads from distributed training jobs hitting storage from hundreds of nodes simultaneously.
- Small-block writes for checkpoints that save model state every few minutes across multi-day training runs.
- Metadata-heavy iterative workloads where millions of small files need to be accessed with microsecond latency.
- Multi-tenant contention for bandwidth as different teams compete for the same storage resources.
Legacy storage architectures force a choice: optimize for throughput or optimize for IOPS. You can’t have both. The result is GPU starvation—compute resources waiting on data that should already be there.
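One quick way to see whether you’re in this situation is to instrument the training loop itself and separate time spent waiting on data from time spent computing. Here’s a minimal, framework-agnostic sketch of that measurement; load_next_batch and train_step are hypothetical stand-ins for your own data pipeline and model code:

```python
import time

def load_next_batch():
    # Stand-in for the real data pipeline: reading samples from storage,
    # decoding, and collating them into a batch.
    time.sleep(0.08)          # simulated I/O + preprocessing latency
    return object()

def train_step(batch):
    # Stand-in for the forward/backward pass on the accelerator.
    time.sleep(0.05)          # simulated GPU compute time

def measure_stall_fraction(num_steps=50):
    data_time = 0.0
    compute_time = 0.0
    for _ in range(num_steps):
        t0 = time.perf_counter()
        batch = load_next_batch()   # the GPU is idle while this runs
        t1 = time.perf_counter()
        train_step(batch)
        t2 = time.perf_counter()
        data_time += t1 - t0
        compute_time += t2 - t1
    stall = data_time / (data_time + compute_time)
    print(f"data wait: {data_time:.1f}s  compute: {compute_time:.1f}s  "
          f"stall fraction: {stall:.0%}")

if __name__ == "__main__":
    measure_stall_fraction()
```

If the data-wait fraction stays high even after adding prefetch workers and faster host CPUs, the limiter is usually the storage layer itself.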
How NeuralMesh Eliminates GPU Wait Time in AI Workloads
At SC25, I walked attendees through four transformative outcomes WEKA is seeing with customers using NeuralMesh™ in production:
90% reduction in deep learning training time. For one of our customers, training runs that previously took 40–80 hours now complete in 4–6 hours. That’s not an incremental improvement—it’s a fundamental shift in how quickly teams can iterate.
5x faster storage throughput. Thanks to optimized NVMe, POSIX clients, and microsecond-latency operations, NeuralMesh delivers the sustained performance AI workloads demand without the traditional trade-offs. Learn how we did this with the Center for AI Safety (CAIS).
6x expanded research capacity. More experiments running concurrently. More datasets accessible simultaneously. More parallel users working without contention. All without growing the physical footprint or power consumption (CAIS).
Up to 41x reduction in time-to-first-token for LLM inference. This is where NeuralMesh with Augmented Memory Grid™ makes its impact felt—extending GPU memory and delivering token throughput that changes what’s possible for inference workloads.
These aren’t theoretical estimates or lab benchmarks. These are measured production outcomes.
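If you want to sanity-check time-to-first-token in your own environment, the measurement itself is simple: record the time between sending a prompt and receiving the first streamed token. A minimal sketch, where generate_stream is a hypothetical stand-in for whatever streaming inference interface you run:

```python
import time

def generate_stream(prompt):
    # Hypothetical stand-in for a streaming inference call; in practice this
    # would wrap your model server's streaming API.
    time.sleep(0.4)           # simulated prefill phase (this is what TTFT captures)
    for tok in ["Storage", " is", " the", " bottleneck", "."]:
        time.sleep(0.02)      # simulated per-token decode latency
        yield tok

def time_to_first_token(prompt):
    start = time.perf_counter()
    stream = generate_stream(prompt)
    first = next(stream)                 # block until the first token arrives
    ttft = time.perf_counter() - start
    rest = "".join(stream)               # drain the remaining tokens
    return ttft, first + rest

ttft, text = time_to_first_token("Why do GPUs sit idle?")
print(f"TTFT: {ttft * 1000:.0f} ms -> {text!r}")
```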
Why Multitenancy and a Single Namespace Matter in AI
One pattern I see repeatedly: organizations build separate infrastructure silos for different workloads. Training gets one cluster. Inference gets another. Analytics gets a third. Each requires its own management, its own data copies, its own optimization.
NeuralMesh takes a different approach:
- POSIX + S3 + NFS + SMB in one namespace. Your applications access data however they need to, without translation layers or performance penalties.
- Scale without architectural redesign. Add capacity and performance independently as needs grow, without rebuilding the stack.
- Performance isolation for different teams. Multiple workloads coexist with guaranteed performance characteristics, no matter what else is running.
- Massive parallelism without fragmentation. Thousands of clients can access the same data simultaneously without hotspots or contention.
This matters because real-world AI environments don’t fit into neat categories. Training workloads need to pull inference results for analysis. Data scientists need to examine training checkpoints. Production systems need to access the same datasets training used. When all of this happens in a single namespace with consistent performance, the friction disappears.
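To make the single-namespace idea concrete, here’s a small sketch of the same piece of data being read through a POSIX mount and through an S3 endpoint. The mount path, endpoint URL, and bucket/key are placeholders for whatever your own deployment exposes; the S3 side uses the standard boto3 client rather than anything vendor-specific.

```python
import boto3

# Placeholder locations -- substitute the mount point, endpoint, and bucket
# that your own filesystem exposes.
POSIX_PATH = "/mnt/shared/datasets/train/sample-000.json"
S3_ENDPOINT = "https://s3.example.internal"
BUCKET, KEY = "datasets", "train/sample-000.json"

# POSIX access: any tool or library that expects a file path works as-is.
with open(POSIX_PATH, "rb") as f:
    posix_bytes = f.read()

# S3 access: the same object, addressed as a bucket/key pair over HTTP.
s3 = boto3.client("s3", endpoint_url=S3_ENDPOINT)
s3_bytes = s3.get_object(Bucket=BUCKET, Key=KEY)["Body"].read()

# With one namespace there is one copy of the data, so both paths
# should return identical bytes -- no copy jobs, no translation layer.
assert posix_bytes == s3_bytes
```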
Let’s pull this all together with some real WEKA customer examples.
Production Results: Storage Performance Impact Across Industries
Genomics at Scale: 10x Performance for 140 Petabytes
Genomics England was on a mission to sequence five million genomes, but faced a critical bottleneck. Their legacy NFS-based NAS had hit its scaling limit, and with projections reaching 140 petabytes, they needed a complete infrastructure transformation. The existing system also lacked disaster recovery, leaving sensitive NHS patient data vulnerable.
By deploying NeuralMesh with multi-tiered architecture, Genomics England achieved:
- 10x improvement in performance: NeuralMesh delivers over 135GB/s from the NVMe tier, enabling researchers to query the entire dataset without constraints.
- 75% reduction in storage cost per genome: Costs dropped from £52 to £13 per genome.
- 140 petabytes managed in a single namespace: 1.3PB of NVMe flash for active research unified with 40PB of object storage for long-term data lakes.
- Geo-distributed disaster recovery: Snap-to-object capabilities protect data across three sites 50 miles apart, ensuring continued access even during major disasters.
When researchers can access the complete dataset simultaneously without infrastructure constraints, faster insights lead to better healthcare outcomes for NHS patients participating in this landmark public health initiative.
16K Immersive Video: Real-Time Rendering for Live Entertainment
The entertainment industry has its own extreme data demands. Dead & Company’s Dead Forever residency at Sphere pushed the boundaries of what’s technically possible for live entertainment. The production team faced unprecedented challenges: managing 1.5 petabytes of immersive 3D video content while preserving the band’s improvisational performance style—no predetermined setlist, no time codes, just organic musical flow requiring instant content transitions.
Using WEKA’s plug-and-play WEKApod™ Appliance, the production team achieved what Rolling Stone called “the most dazzling visual show in Grateful Dead history”:
- 16K x 16K immersive 3D rendering: NeuralMesh enabled the team to render large video files in minutes instead of hours.
- 100Gb network at line-rate speed: NeuralMesh delivered actual 100 gigabit speeds across the network, moving massive files fast enough to support real-time creative decisions and live show delivery.
- 90TB of data per show: WEKApod functioned as the data hub for the entire production pipeline—from rendering and post-production to live playback across Sphere’s massive LED display—managing data at a scale that would have been impossible with traditional storage.
- Rendering cycles reduced from hours to minutes: The creative team iterated rapidly, turning ideas into reality fast enough to support the show’s dynamic, improvisational nature.
As Brandon Kraemer, Technical Director for Dead Forever, put it: “Technology is at its best when it gets out of the way of the creative process, and that is exactly what WEKA did for us at Sphere.”
I highly encourage you to watch this video showing how WEKA helped The Grateful Dead. It’s worth 7 minutes of your day! Or you can read this case study.
Semiconductor Manufacturing Analytics: From 2-Day Data Loads to Same-Day Insights
A leading electronics manufacturer faced a critical bottleneck: daily data loading took over two days, making anomaly detection predictions irrelevant by the time they were produced. Disconnected data silos across Manufacturing Execution Systems and Management Information Systems prevented unified analysis of their production data.
By replacing their Hadoop-based ecosystem with a GPU-accelerated SQream deployment backed by NeuralMesh, they achieved:
- From infrastructure paralysis to real-time operations: What previously required hundreds of thousands of CPUs now runs on just three compute nodes with 12 GPUs, managing over 10 petabytes of data while generating 280 automated daily reports.
- Same-day data processing: Up to 100TB of raw sensor data from manufacturing equipment—previously taking two days to load—now transforms into analytics-ready insights within the same day.
- 90% cost reduction in data operations: The streamlined infrastructure eliminated the complexity and expense of maintaining massive compute clusters and disconnected data silos.
- Production yield improvement from 50% to 90%: Real-time anomaly detection now catches faults at the earliest possible stage, enabling immediate adjustments to production floor machines before defects propagate.
In semiconductor manufacturing, where yield improvements of fractions of a percent translate into millions of dollars, the ability to analyze production data in real time—rather than days later—changes the economics of the entire operation.
Business Impact: How Storage Performance Accelerates AI ROI
A 90% reduction in training time isn’t just a performance metric. It translates into business outcomes:
- Faster iteration means more experiments per week, per month, per quarter.
- More experiments per cycle means better models that solve harder problems.
- Better models mean competitive advantages that compound over time.
- Accelerated time-to-value means AI projects deliver ROI faster.
A 6x expansion in research capacity changes the equation for national-scale programs and enterprise research:
- More concurrent users working without waiting for resources.
- More models in flight as different teams pursue parallel approaches.
- More throughput for programs operating at unprecedented scale.
For organizations working in science, medicine, manufacturing, and AI development, this level of performance improvement is transformational.
The question isn’t whether your infrastructure can keep up with today’s workloads. The question is whether it’s ready for what comes next—and whether it’s holding you back right now without you fully realizing it.
If you have 15 minutes, watch the talk I gave in the Lablup booth, “Accelerating HPC & AI Workloads with WEKA NeuralMesh”. If you want to connect with my team to talk more about the use cases I shared or how NeuralMesh can help optimize your HPC or AI workloads, contact us today.