Your Kubernetes Workloads Aren’t CPU Bound — They’re Waiting on Storage
TL;DR
  • Kubernetes workloads that scale compute without scaling storage become I/O bound — CPU stays low, but performance degrades.
  • The symptoms are measurable: rising I/O wait, GPU idle time, and flat throughput despite adding pods.
  • Traditional centralized storage suffers from controller and metadata bottlenecks, and limited parallel throughput, which compound at scale.
  • Disaggregated, software-defined storage like WEKA’s NeuralMesh™ system distributes both data and metadata across nodes, enabling storage to scale horizontally alongside compute.
  • NeuralMesh’s Kubernetes Operator enables native deployment, lifecycle management, and per-workload quality of service (QoS) — without modifying application code.
  • Real-world deployments have demonstrated 2–3x improvements in training throughput and increases in GPU utilization from ~43% to >90%.

Kubernetes isn’t a developer tool anymore. It’s the operating system of the modern enterprise.

Organizations across financial services, healthcare, software, and telecom have standardized on Kubernetes as mission-critical infrastructure, and 88% expect it to become the dominant platform for AI/ML workloads within two years. JPMorgan Chase runs its payment infrastructure on it. Netflix orchestrates global streaming at scale with it. Molina Healthcare manages HIPAA-compliant health records through it. 

Those workloads run in production on enterprise-grade clusters — Red Hat OpenShift, Amazon EKS, Azure Kubernetes Service, Google Kubernetes Engine — where compute orchestration is handled natively and tuned to scale.

That part works. What doesn’t? Storage.

At some point, almost every Kubernetes platform team runs into the same puzzle — and I have conversations like this with customers every day:

  • You scale a workload from 10 pods to 100.
  • Cluster CPU utilization is low. Memory isn’t under pressure.
  • But performance barely improves.
  • Adding more compute doesn’t help. Increasing pod replicas doesn’t help.
  • In some cases, the workload actually gets slower.

When this happens, the bottleneck isn’t scheduling or container overhead. It’s storage.

The Hidden I/O Bottleneck in Kubernetes

The moment workloads become highly parallel, the underlying storage architecture becomes the limiting factor. Databases like PostgreSQL, analytics frameworks like Apache Spark, observability platforms like Prometheus, and AI training pipelines built on PyTorch all generate enormous amounts of concurrent I/O.

Kubernetes can scale compute across hundreds of nodes. But if the storage layer can’t match that parallelism, those workloads quickly become I/O bound.

According to CNCF research, a large majority of organizations running Kubernetes in production report I/O performance as a top-three infrastructure concern. As clusters grow from tens to hundreds of nodes, the gap between available compute and storage throughput widens faster than most teams anticipate.

The root cause is an architectural mismatch. Kubernetes was designed to distribute compute horizontally. Traditional storage systems were not. They were built for sequential, single-stream workloads on dedicated hardware — not the thousands of simultaneous, parallel I/O streams that modern Kubernetes platforms generate.

How to Tell if Your Workload Is I/O Bound

Before changing architecture, confirm where the bottleneck actually is. The signs of I/O saturation are often visible in standard observability metrics:

  • Low CPU, rising latency: application latency increases under load while CPU utilization stays low; the workload is waiting on external resources. Common with databases like PostgreSQL. A CPU-bound workload shows high utilization; an I/O-bound workload shows the opposite.
  • High I/O wait: Linux exposes I/O wait as part of CPU metrics. Rising percentages (visible via top, sar, or node exporter) indicate the system is blocked waiting for storage operations. Sustained values above 10% point to a storage bottleneck.
  • Storage latency scales with pod count: if latency rises as more pods hit the same storage backend, the storage subsystem is the bottleneck, not compute. Each pod adds I/O pressure; centralized systems have a finite bandwidth ceiling.
  • Idle GPUs: GPU utilization tools like nvidia-smi or DCGM show near-zero compute utilization while the data pipeline struggles to feed them. This is storage starvation.
  • Flat throughput despite more compute: adding nodes or pod replicas produces diminishing or zero returns. This is the diagnostic hallmark of an I/O bottleneck: compute is available but idle, waiting on storage.
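The I/O-wait signal above can be checked without extra tooling. Below is a minimal Python sketch (Linux-only, since it parses /proc/stat) that samples the CPU counters twice and reports the iowait percentage; the field layout follows the standard /proc/stat format.

```python
import os
import time

def read_cpu_times(stat_text):
    # First line of /proc/stat: "cpu  user nice system idle iowait irq softirq ..."
    fields = stat_text.splitlines()[0].split()
    return [int(v) for v in fields[1:]]

def iowait_percent(before, after):
    # iowait is the fifth counter (index 4); report it as a share of all jiffies elapsed.
    delta_total = sum(after) - sum(before)
    delta_iowait = after[4] - before[4]
    return 100.0 * delta_iowait / delta_total if delta_total else 0.0

if __name__ == "__main__" and os.path.exists("/proc/stat"):
    with open("/proc/stat") as f:
        t0 = read_cpu_times(f.read())
    time.sleep(1)
    with open("/proc/stat") as f:
        t1 = read_cpu_times(f.read())
    # Sustained values above ~10% suggest the system is blocked on storage.
    print(f"iowait: {iowait_percent(t0, t1):.1f}%")
```

The same counters feed node exporter's `node_cpu_seconds_total{mode="iowait"}` series, so a Prometheus query can track this fleet-wide.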

Workloads Most Vulnerable to I/O Bottlenecks

Each of these workload types degrades non-linearly when storage access is constrained:

  • Databases: PostgreSQL, MySQL, and MongoDB issue continuous small random I/O, including record lookups, index traversals, and WAL flushes. As connection counts scale, storage metadata operations become the limiting factor.
  • Data processing pipelines: Spark and Flink scan large datasets in parallel. Each executor issues concurrent reads, often across shuffle stages. Centralized storage can't serve these without throughput degradation.
  • Observability platforms: Prometheus and Elasticsearch generate extremely high small-write rates. Metadata operations (file creates, lookups, checkpoints) accumulate rapidly at scale.
  • AI/ML training: PyTorch and TensorFlow pipelines require sustained data throughput to keep GPUs busy. When storage can't keep up, GPUs sit idle waiting for the next batch, turning expensive compute into a waiting room.

Why Traditional Storage Architectures Can’t Keep Up

Traditional storage systems — NFS, legacy SAN/NAS, and most object storage backends — were designed for a fundamentally different workload profile: high throughput in single-stream sequential scenarios, not thousands of simultaneous random I/O operations.

Three failure modes emerge at scale:

Controller Bottlenecks

A single storage controller manages all I/O requests from the cluster. Every read, write, metadata lookup, and lock passes through this single point. As pod count scales, the controller’s CPU, memory, and network bandwidth saturate long before the underlying media is exhausted. The disks themselves may be largely idle while the controller processes a backlog of queued requests.
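The funnel effect is easy to picture with a toy model. The sketch below (plain Python, not WEKA code) routes every request through either one shared lock, standing in for a central controller, or one lock per shard, standing in for distributed data paths. In CPython the GIL masks the timing difference, so this illustrates the structure of the contention rather than measured throughput.

```python
import threading

def run_workers(num_workers, ops_per_worker, locks):
    """Each worker performs ops_per_worker increments, acquiring the lock
    that owns its shard. One lock models a central controller in the I/O
    path; many locks model independent, distributed data paths."""
    counters = [0] * len(locks)

    def worker(wid):
        shard = wid % len(locks)
        for _ in range(ops_per_worker):
            with locks[shard]:          # every op queues behind this lock
                counters[shard] += 1

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sum(counters)

# Centralized: all 8 workers funnel through one lock.
central = run_workers(8, 1000, [threading.Lock()])
# Disaggregated: the same work spreads across 8 independent locks.
sharded = run_workers(8, 1000, [threading.Lock() for _ in range(8)])
assert central == sharded == 8000  # identical work; only the contention point differs
```

The point of the model: the work completed is the same, but in the centralized case every operation waits in one queue, which is exactly where a saturated controller leaves idle disks.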

Metadata Bottlenecks

Every file system operation — create, open, lookup, permission check — requires a metadata transaction. In high-concurrency Kubernetes workloads, metadata operations can represent 60–80% of total I/O. Traditional systems store metadata in a single, centralized namespace managed by a single node. At scale, this becomes a serialization bottleneck: operations that could run in parallel are forced to queue. This is especially severe in AI training, where DataLoaders open, read, and close millions of small files per epoch.
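To see how fast these transactions accumulate, consider a naive per-sample loader. The sketch below (plain Python standing in for a framework DataLoader, with a simplified three-ops-per-file accounting) counts the stat/open/close metadata calls a single epoch issues over a small synthetic dataset.

```python
import os
import tempfile

def epoch_metadata_ops(paths):
    """Count the metadata transactions a naive per-sample loader issues in
    one epoch: a lookup (stat) plus an open and a close for every file."""
    ops = 0
    for p in paths:
        os.stat(p)            # lookup / permission check
        with open(p, "rb"):   # open ... close
            pass
        ops += 3              # stat + open + close
    return ops

# Tiny illustration: 1,000 one-byte "samples" in a temp directory.
tmp = tempfile.mkdtemp()
paths = []
for i in range(1000):
    p = os.path.join(tmp, f"sample_{i}.bin")
    with open(p, "wb") as f:
        f.write(b"x")
    paths.append(p)

print(epoch_metadata_ops(paths))  # 3000 metadata ops for one epoch
```

Scale the same arithmetic to a million-file dataset and every epoch issues roughly three million metadata transactions before a single byte of payload moves; against a single metadata node, those transactions serialize.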

Limited Parallel Throughput

The software architecture serializes requests at the controller layer, so even when physical media has bandwidth to spare, the system can’t sustain the level of parallel access Kubernetes workloads demand.

The impact is measurable. IDC’s 2025 global survey of over 1,300 AI decision-makers found idle GPU time to be among the leading causes of AI budget waste: accelerators waiting on storage that can’t deliver data fast enough to keep compute busy. These aren’t edge cases. They’re the expected outcome when distributed compute is paired with centralized storage.

A Different Approach: Disaggregated Storage That Scales with Kubernetes

The architectural fix is to make storage scale horizontally, as compute does.

WEKA’s NeuralMesh distributes both metadata and data across every node in a cluster. Instead of a centralized bottleneck, performance scales linearly as the storage cluster grows:

  • Distributed metadata services: metadata is sharded and replicated across all nodes, eliminating the single-node namespace bottleneck
  • Parallel data paths: every node serves data directly, with no central controller in the I/O path
  • Ultra-low latency networking: purpose-built for NVMe-over-Fabric and RDMA environments
  • Linear throughput scaling: aggregate bandwidth grows with node count
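The last bullet can be made concrete with a back-of-the-envelope model. In the sketch below the per-node bandwidth and controller ceiling are invented illustrative numbers, not vendor specifications: a centralized design flattens at its controller's ceiling, while a disaggregated design grows with node count.

```python
def aggregate_bandwidth_gbps(nodes, per_node_gbps, controller_cap_gbps=None):
    """Aggregate bandwidth as nodes are added. With a central controller the
    total is clamped at the controller's ceiling; with distributed data
    paths it grows linearly with node count."""
    linear = nodes * per_node_gbps
    return min(linear, controller_cap_gbps) if controller_cap_gbps else linear

# Illustrative numbers only: 5 Gbps per node, a 40 Gbps controller ceiling.
for n in (4, 16, 64):
    central = aggregate_bandwidth_gbps(n, 5, controller_cap_gbps=40)
    disagg = aggregate_bandwidth_gbps(n, 5)
    print(f"{n:>3} nodes: centralized {central} Gbps, disaggregated {disagg} Gbps")
```

Past the crossover point (8 nodes in this toy example), every node added to the centralized system is pure waste from a bandwidth perspective, which is the "flat throughput despite more compute" symptom described earlier.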

How NeuralMesh Works in a Kubernetes Environment

NeuralMesh operates as a set of distributed containerized processes that collectively provide file system services. Rather than a monolithic storage stack, the system decomposes into coordinated service components:

  • Frontend services for client access and protocol handling
  • Compute services for file system logic, metadata processing, and clustering
  • Drive services for managing NVMe devices and physical storage operations
  • Management services for cluster coordination and administration
  • Telemetry services for logging, auditing, and observability

All communication occurs over the network — even between services on the same physical server — so placement is flexible, and services can move or scale without architectural constraint.

This matters for Kubernetes because NeuralMesh was designed from the ground up as a containerized, microservices-based system. Each component runs as an independent container with its own resource boundaries, health checks, and lifecycle:

  • Service failures are isolated: a failing telemetry container doesn’t impact the data path
  • Resource allocation is granular: drive and compute services can be independently scaled and resource-capped
  • Upgrades are non-disruptive: individual containers roll forward without taking the storage cluster offline
  • Kubernetes-native observability: all services expose Prometheus metrics and structured logs compatible with standard pipelines

Through its Kubernetes Operator, NeuralMesh enables secure multitenancy, namespace isolation, and per-workload QoS — all managed through standard Kubernetes RBAC and custom resource definitions (CRDs). The Operator handles full lifecycle management: deployment, configuration, capacity expansion, node replacement, and rolling upgrades.

Getting Started

Deploying NeuralMesh on Kubernetes takes three steps:

  1. Install the WEKA Operator via Helm chart or OperatorHub
  2. Create a WekaCluster custom resource defining the initial cluster topology
  3. Apply StorageClass definitions to provision WEKA-backed PersistentVolumeClaims for your workloads
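Step 3 might look something like the following sketch. Every field name here, including the provisioner string and the filesystemName parameter, is an assumption for illustration; consult docs.weka.io for the authoritative StorageClass parameters and CRD schema.

```yaml
# Illustrative only: field names and the provisioner string are assumptions,
# not the documented WEKA schema.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: weka-fs
provisioner: csi.weka.io          # assumed CSI driver name
parameters:
  filesystemName: default         # hypothetical parameter
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: training-data
spec:
  accessModes: ["ReadWriteMany"]  # shared access across many pods
  storageClassName: weka-fs
  resources:
    requests:
      storage: 1Ti
```

Workloads then mount the claim as an ordinary volume, with no application changes.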

Removing I/O as the Scaling Limit

When storage performance scales alongside compute, the behavior changes dramatically. Multiple pods read and write simultaneously without contending for a single controller. Distributed Spark jobs scale executor counts without saturating storage. AI training pipelines keep GPUs fully utilized. Observability platforms sustain higher ingestion rates as clusters grow.

Instead of storage capping what your cluster can do, applications fully utilize the compute resources available to them. As Kubernetes workloads become increasingly data-intensive, the gap between compute and storage capabilities will only widen — unless storage architecture evolves alongside them.

NeuralMesh provides a path to closing that gap. For many Kubernetes environments, it’s the architectural shift that moves clusters from limited scaling to truly parallel performance.

Learn how NeuralMesh eliminates I/O bottlenecks by visiting our product page.

Full documentation, deployment guides, and reference architectures are available at docs.weka.io.