Your PostgreSQL Is Only As Fast As Its Storage

Your database looks CPU-idle. Your application team is filing tickets. The real problem is 3ms behind every query. It's the storage layer. This post runs the numbers.
This post deploys PostgreSQL 16 on Amazon EKS and uses pgbench to compare two storage backends: gp3 EBS and NeuralMesh™. The infrastructure is fully reproducible: Terraform provisions an EKS cluster with a dedicated WEKA backend, and a single make bench command runs the full test suite and writes results to a Markdown file. All code, manifests, and raw results are at github.com/mbookham7/mb-k8s-databse-perf-blog.
Why this comparison
Most Kubernetes administrators deploying on EKS reach for gp3 EBS by default. The AWS EBS CSI driver provisions it without additional configuration: managed, reliable, and works out of the box. For early-stage workloads, that is the right call.
As adoption grows, the math changes. More concurrent database connections, larger datasets, and heavier write workloads push gp3 toward its provisioned-IOPS ceiling. The pattern is familiar to anyone who has managed Kubernetes at scale: PostgreSQL surfaces as the bottleneck, tuning parameters and node resizing help at the margins, and the database continues to look CPU-idle while latency climbs. The bottleneck is not configuration. It’s the architecture.
This post documents that pattern with reproducible numbers and shows a direct alternative: NeuralMesh, running as a native Kubernetes workload on the same EKS cluster. The comparison is the one most teams face in practice, not a storage bake-off. The question is when the default EKS storage path stops being good enough, and what the simplest architecturally sound path forward looks like.
NeuralMesh deploys via the WEKA Kubernetes Operator, which manages the full storage cluster (backend nodes, client containers, and CSI driver) as native Kubernetes workloads governed by CRDs (WekaCluster). It installs via Helm. There is no separate management console and no parallel operational model. Storage becomes observable through kubectl the same way any other cluster resource is.
Both PostgreSQL instances in this demo run on the same c6i.4xlarge node type with identical configuration. The only variable is the storage backend.
What we're testing and why
PostgreSQL generates several categories of storage I/O that are each sensitive to latency in different ways:
- Write-Ahead Log (WAL) writes — sequential, on the critical path of every transaction commit
- Checkpoint flushes — periodic bulk writes of dirty pages to disk
- Index lookups and random reads — heavily impacted by storage queue depth as concurrency increases
- Autovacuum I/O — background reads and writes that compete with foreground queries
When storage latency increases — even modestly — these operations stack up. Connection pools fill. Query latency climbs. Throughput plateaus. The workload looks CPU-idle while the database is effectively stalled waiting on storage.
pgbench simulates this by running a configurable number of concurrent clients against a PostgreSQL instance, measuring Transactions Per Second (TPS) and latency across different concurrency levels.
A critical note on dataset sizing
One of the most common benchmarking mistakes is using a dataset small enough to fit entirely inside PostgreSQL's shared_buffers. With shared_buffers=4GB, a scale-100 dataset (~1.5 GB) is fully cached in memory — the workload becomes CPU and WAL group-commit bound, and the storage layer is barely touched. In that scenario, you're not measuring storage at all.
To expose a real storage bottleneck, the dataset must exceed both shared_buffers and the OS page cache so that reads actually hit storage. This demo uses scale factor 1000 (~15 GB), which comfortably exceeds shared_buffers and the pod memory limit, forcing both PostgreSQL instances to do real storage I/O.
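The sizing rule above can be checked with a few lines of arithmetic. This is a minimal sketch: the 15 MB per scale unit constant is an assumption back-derived from this post's own figures (scale 100 at ~1.5 GB, scale 1000 at ~15 GB), and the function names are hypothetical. Confirm the real size on your own instance with pg_database_size.

```python
# Assumption: ~15 MB of table and index data per pgbench scale unit,
# back-derived from the figures quoted in this post.
MB_PER_SCALE_UNIT = 15


def estimated_dataset_gb(scale):
    """Rough size of a freshly initialized pgbench dataset, in GB."""
    return scale * MB_PER_SCALE_UNIT / 1000


def reads_hit_storage(scale, shared_buffers_gb, other_cache_gb):
    """True when the dataset is larger than all memory that could cache it.

    other_cache_gb stands for whatever OS page cache the pod can use on top
    of shared_buffers; it is a value you supply, not one taken from the post.
    """
    return estimated_dataset_gb(scale) > shared_buffers_gb + other_cache_gb


if __name__ == "__main__":
    for scale in (100, 1000):
        print(scale, estimated_dataset_gb(scale), "GB")
```

With shared_buffers at 4 GB, reads_hit_storage(100, 4, 0) is False because the whole dataset fits in the buffer pool, while reads_hit_storage(1000, 4, 8) is True even with a hypothetical 8 GB of additional cache.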
Environment
| Component | Detail |
|---|---|
| Cloud / Region | AWS eu-west-1 |
| Kubernetes | EKS 1.33 |
| PostgreSQL | 16 (shared_buffers=4GB, wal_buffers=64MB, max_connections=200, checkpoint_completion_target=0.9, random_page_cost=1.1) |
| Standard block storage | gp3 EBS via ebs.csi.aws.com — 3,000 IOPS, 125 MB/s |
| WEKA | 6× i3en.2xlarge backend (NVMe), mounted via csi.weka.io (dir/v1) |
| DB / client nodes | c6i.4xlarge (16 vCPU / 32 GiB), hyperthreading off |
| pgbench | -j 4 -T 60 -P 10, clients 4 / 16 / 32 / 64 |
| pgbench scale factor | 1,000 (~15 GB dataset) |
Both PostgreSQL instances run on the same c6i.4xlarge instance type. The only variable between the two deployments is the storage backend — same nodes, same PostgreSQL configuration, same pgbench parameters. The WEKA backend uses six i3en.2xlarge nodes, a production-scale deployment rather than a single-node lab setup. The results reflect architecture, not hardware cherry-picking.
Architecture
Step 1: Deploy the infrastructure
The demo uses three Terraform layers, orchestrated via Make. You'll need an AWS account, Terraform >= 1.5, kubectl, Helm 3.x, jq, the AWS CLI, a get.weka.io token, and Quay.io credentials.
# Configure the WEKA backend layer
cp terraform/10-weka-backend/terraform.tfvars.example \
terraform/10-weka-backend/terraform.tfvars
# Edit terraform.tfvars and fill in your get_weka_io_token
# Deploy everything (~25–35 minutes)
export QUAY_USERNAME=...
export QUAY_PASSWORD=...
make infra # vpc → weka backend → eks
The three layers are deliberately separated so that the WEKA backend (~10–15 min to clusterize) and EKS cluster can be provisioned in parallel:
make vpc
make backend & # kick off WEKA backend
make eks # runs concurrently; safe to do while backend forms
Terraform remote state connects subnet IDs across layers, eliminating manual copy-pasting of resource IDs.
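If you would rather script the parallel step than background a job in the shell, the same idea can be sketched in Python. This is an illustration, not part of the repository: run_parallel is a hypothetical helper, and the only thing it assumes about the project is the make backend and make eks targets named above.

```python
import subprocess


def run_parallel(commands):
    """Start every command at once, then wait for all of them.

    Returns the exit codes in the same order as `commands`, so the caller
    can tell which layer failed before moving on to the next step.
    """
    procs = [subprocess.Popen(cmd) for cmd in commands]
    return [proc.wait() for proc in procs]
```

After make vpc has finished, calling run_parallel([["make", "backend"], ["make", "eks"]]) blocks until both layers are done, which avoids starting the next step while the backend is still forming.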
Cost warning: This demo costs approximately $15–25/hr with the defaults (6× i3en.2xlarge WEKA backend + EKS nodes + NAT + ALB). Run make destroy as soon as you're done.
Step 2: Deploy PostgreSQL on both storage backends
Once infrastructure is up:
make apps # deploys WEKA operator + CSI + client, then both PostgreSQL instances
This calls kubectl apply -f manifests/postgres/ which creates a benchmark namespace containing both deployments, both PVCs, and the pgbench runner pod.
What the WEKA Kubernetes Operator is doing
make apps deploys the WEKA Kubernetes Operator via Helm first, then reconciles the full storage cluster into existence. The Operator manages the complete WEKA storage stack (backend containers, client containers, and the CSI driver) as native Kubernetes workloads governed by CRDs. Once deployed, the storage cluster is observable through standard Kubernetes tooling: kubectl get pods, kubectl describe, and Prometheus metrics. There is no separate WEKA management console to log into and no separate runbook to follow when something fails. This is architecturally different from a CSI-only driver, which provisions volumes but leaves the storage cluster itself outside Kubernetes management entirely.
gp3 StorageClass
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: standard-block
provisioner: ebs.csi.aws.com
parameters:
  type: gp3
  iops: "3000"
  throughput: "125"
reclaimPolicy: Delete
volumeBindingMode: WaitForFirstConsumer
allowVolumeExpansion: true
WEKA StorageClass
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: storageclass-wekafs-dir-api
provisioner: csi.weka.io
parameters:
  filesystemName: default
  volumeType: dir/v1
reclaimPolicy: Delete
volumeBindingMode: Immediate
allowVolumeExpansion: true
PostgreSQL deployments
Both deployments are identical in every configuration detail — only the storageClassName in the PVC differs.
# PVC for standard block storage
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: postgres-pvc-standard
  namespace: benchmark
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: standard-block
  resources:
    requests:
      storage: 200Gi
---
# PVC for WEKA
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: postgres-pvc-weka
  namespace: benchmark
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: storageclass-wekafs-dir-api
  resources:
    requests:
      storage: 200Gi
Both PostgreSQL pods use the same configuration:
containers:
  - name: postgres
    image: postgres:16
    args:
      - "-c"
      - "max_connections=200"
      - "-c"
      - "shared_buffers=4GB"
      - "-c"
      - "wal_buffers=64MB"
      - "-c"
      - "checkpoint_completion_target=0.9"
      - "-c"
      - "random_page_cost=1.1"
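The args list is just the PostgreSQL settings from the Environment table, each passed as a -c name=value pair. As a small sketch of that mapping (postgres_args and SETTINGS are hypothetical names, not something the repository defines), keeping the settings in one dictionary makes it easy to confirm that both deployments receive the same values:

```python
# Settings copied from this post's Environment table.
SETTINGS = {
    "max_connections": "200",
    "shared_buffers": "4GB",
    "wal_buffers": "64MB",
    "checkpoint_completion_target": "0.9",
    "random_page_cost": "1.1",
}


def postgres_args(settings):
    """Render settings as the flat ["-c", "name=value", ...] list used in the pod spec."""
    args = []
    for name, value in settings.items():
        args += ["-c", f"{name}={value}"]
    return args


if __name__ == "__main__":
    for arg in postgres_args(SETTINGS):
        print(f'- "{arg}"')
```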
Step 3: Run the benchmark
make bench
# or with explicit parameters:
./scripts/run-benchmark.sh --scale 1000 --duration 60 --clients "4 16 32 64"
The script initializes a scale-1000 (~15 GB) pgbench dataset on both instances, then runs each concurrency level in sequence against both backends, writing a Markdown results table to results/results-<timestamp>.md.
Initialization (from a pgbench-runner pod in the same namespace):
pgbench -h postgres-standard-svc -U postgres -d benchdb -i -s 1000
pgbench -h postgres-weka-svc -U postgres -d benchdb -i -s 1000
Benchmark at each concurrency level:
# Standard block — 32 clients, 60-second run
pgbench -h postgres-standard-svc -U postgres -d benchdb \
-c 32 -j 4 -T 60 -P 10
# WEKA — 32 clients, 60-second run
pgbench -h postgres-weka-svc -U postgres -d benchdb \
-c 32 -j 4 -T 60 -P 10
Flag reference:
- -c: number of concurrent clients
- -j: number of pgbench worker threads
- -T: test duration in seconds
- -P 10: print progress every 10 seconds
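The full test matrix is four concurrency levels against two backends, eight runs in total. The sketch below builds those eight command lines from the flags and service names shown above. It illustrates the loop and is not the contents of scripts/run-benchmark.sh; the function and variable names are hypothetical.

```python
# Service names and client counts as used in this post.
HOSTS = {"gp3": "postgres-standard-svc", "weka": "postgres-weka-svc"}
CLIENTS = [4, 16, 32, 64]


def pgbench_command(host, clients, threads=4, duration=60, progress=10):
    """Return the pgbench argv for one run, using the flags described above."""
    return [
        "pgbench", "-h", host, "-U", "postgres", "-d", "benchdb",
        "-c", str(clients), "-j", str(threads),
        "-T", str(duration), "-P", str(progress),
    ]


def benchmark_plan():
    """One (backend, clients, argv) entry per run: each concurrency level on both backends."""
    return [
        (backend, clients, pgbench_command(host, clients))
        for clients in CLIENTS
        for backend, host in HOSTS.items()
    ]


if __name__ == "__main__":
    for backend, clients, argv in benchmark_plan():
        print(backend, clients, " ".join(argv))
```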
Step 4: Results
Run on 2026-06-09 in eu-west-1. Scale factor 1000 (~15 GB), exceeding shared_buffers and the pod memory limit, ensuring reads hit storage.
Transactions Per Second (TPS)
| Concurrent Clients | gp3 EBS (TPS) | WEKA (TPS) | WEKA improvement |
|---|---|---|---|
| 4 | 1,159 | 2,307 | +99% |
| 16 | 1,216 | 5,328 | +338% |
| 32 | 1,588 | 6,491 | +308% |
| 64 | 1,811 | 7,211 | +298% |
Average Latency (ms)
| Concurrent Clients | gp3 EBS (ms) | WEKA (ms) | WEKA improvement |
|---|---|---|---|
| 4 | 3.44 | 1.73 | -49% |
| 16 | 13.06 | 2.95 | -77% |
| 32 | 19.94 | 4.76 | -76% |
| 64 | 34.95 | 8.39 | -75% |
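The improvement columns can be recomputed from the raw figures. A minimal sketch follows, with hypothetical function names. One detail it surfaces: the published percentages appear to be truncated toward zero rather than rounded, so a change of about 308.75% is listed as +308%.

```python
import math

# Raw figures from the two tables above, keyed by client count.
RESULTS = {
    4: {"gp3_tps": 1159, "weka_tps": 2307, "gp3_ms": 3.44, "weka_ms": 1.73},
    16: {"gp3_tps": 1216, "weka_tps": 5328, "gp3_ms": 13.06, "weka_ms": 2.95},
    32: {"gp3_tps": 1588, "weka_tps": 6491, "gp3_ms": 19.94, "weka_ms": 4.76},
    64: {"gp3_tps": 1811, "weka_tps": 7211, "gp3_ms": 34.95, "weka_ms": 8.39},
}


def percent_change(baseline, value):
    """Relative change of value against baseline, in percent."""
    return (value / baseline - 1) * 100


def as_table_percent(change):
    """Format a change the way the tables do: signed, truncated toward zero."""
    return f"{math.trunc(change):+d}%"


if __name__ == "__main__":
    for clients, row in RESULTS.items():
        tps = percent_change(row["gp3_tps"], row["weka_tps"])
        latency = percent_change(row["gp3_ms"], row["weka_ms"])
        print(clients, as_table_percent(tps), as_table_percent(latency))
```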
What the numbers show
At four concurrent clients, NeuralMesh delivers nearly 2x the TPS at half the latency. NeuralMesh's NVMe-backed distributed architecture keeps latency low even at modest queue depth, where gp3's provisioned-IOPS ceiling is already visible.
NeuralMesh throughput scales with concurrency, from 2,307 TPS at 4 clients to 7,211 TPS at 64 clients, while latency rises modestly, from 1.73 ms to 8.39 ms across the same range. gp3 throughput stays between 1,159 and 1,811 TPS as latency climbs roughly tenfold, from 3.44 ms to 34.95 ms. This is the I/O saturation pattern described above, made visible in real numbers: storage requests queue behind a constrained set of I/O paths, and each additional client deepens the queue.
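The link between the two tables is Little's law. pgbench is a closed-loop benchmark in which each client keeps one transaction in flight, so average latency is approximately clients divided by TPS. The sketch below (function name hypothetical) applies that to the published figures; the predictions land within roughly 6 percent of the measured latencies. The practical consequence is that once throughput plateaus, every added client translates almost directly into added latency.

```python
# (backend, clients, tps, measured latency in ms) from the results tables.
MEASURED = [
    ("gp3", 4, 1159, 3.44),
    ("gp3", 16, 1216, 13.06),
    ("gp3", 32, 1588, 19.94),
    ("gp3", 64, 1811, 34.95),
    ("weka", 4, 2307, 1.73),
    ("weka", 16, 5328, 2.95),
    ("weka", 32, 6491, 4.76),
    ("weka", 64, 7211, 8.39),
]


def littles_law_latency_ms(clients, tps):
    """Average time per transaction when each client keeps one transaction in flight."""
    return clients / tps * 1000


if __name__ == "__main__":
    for backend, clients, tps, measured in MEASURED:
        predicted = littles_law_latency_ms(clients, tps)
        print(f"{backend:5s} c={clients:<3d} predicted {predicted:6.2f} ms, measured {measured:6.2f} ms")
```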
At 64 clients, NeuralMesh delivers approximately 4x the TPS at approximately one-quarter the latency.
What's happening under the hood
The results reflect a fundamental architectural difference between the two storage backends.
gp3 EBS routes I/O through a constrained number of paths. At its provisioned 3,000 IOPS / 125 MB/s ceiling, PostgreSQL's WAL writer, checkpointer, and query backends all compete for the same bandwidth simultaneously. The I/O queue grows, latency increases, and more concurrent clients deepen the problem rather than distributing it.
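A rough way to reason about that ceiling: if a volume is limited to a fixed number of I/O operations per second, transaction throughput cannot exceed that limit divided by the average number of I/Os each transaction causes. The sketch below is back-of-envelope arithmetic only. The I/Os-per-transaction figure is an assumption you would have to measure, the function names are hypothetical, and real volumes are also subject to throughput and latency limits that this ignores.

```python
def max_tps(iops_cap, ios_per_txn):
    """Upper bound on TPS for a workload limited purely by an IOPS cap."""
    return iops_cap / ios_per_txn


def implied_ios_per_txn(iops_cap, observed_tps):
    """I/Os per transaction that an observed TPS would imply if the cap were fully used."""
    return iops_cap / observed_tps


if __name__ == "__main__":
    # 3,000 IOPS is the gp3 setting used in this post; 1,811 is the 64-client gp3 result.
    print(round(implied_ios_per_txn(3000, 1811), 2))
```

Read this as an upper bound rather than a diagnosis: it shows the scale of transaction rate a 3,000 IOPS volume can sustain, not which limit the volume actually reached in this test.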
NeuralMesh distributes both metadata and data operations across the full storage cluster. In this demo, that is six i3en.2xlarge backend nodes with local NVMe. As PostgreSQL generates more concurrent I/O, requests spread across more storage nodes and more parallel data paths. There is no single I/O path to saturate. NeuralMesh is a fully distributed, shared-nothing system: every node runs the same software and handles data, metadata, and I/O directly, with no dedicated metadata servers and no central control node. This is why NeuralMesh latency rises far more slowly than gp3 latency as concurrency increases. The architecture absorbs parallelism rather than queuing it.
Because the WEKA Kubernetes Operator manages the full cluster as Kubernetes workloads, adding backend capacity is an operator-level change. There is no separate storage management operation. The architecture scales with the workload, and the management model scales with it.
Teardown
make destroy # removes apps, then EKS, backend, and VPC
If a layer fails, destroy it directly: terraform -chdir=terraform/<layer> destroy.
What if you just provisioned more IOPS?
The natural response to these results is to provision gp3 at 16,000 IOPS. That is worth examining directly.
gp3 baseline provisioning is 3,000 IOPS at no additional charge. Above 3,000, the cost is $0.006/IOPS-month. Reaching 16,000 IOPS adds approximately $78/month per volume. That is real spend for a ceiling that remains architectural.
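The $78 figure follows directly from the numbers quoted above: 13,000 IOPS beyond the included 3,000, at $0.006 each per month. A minimal sketch of that arithmetic, with a hypothetical function name and the post's price as the default (check current AWS pricing for your region before relying on it):

```python
def extra_iops_cost(target_iops, included_iops=3000, price_per_iops_month=0.006):
    """Monthly cost in USD of provisioning IOPS above the included baseline."""
    billable = max(0, target_iops - included_iops)
    return billable * price_per_iops_month


if __name__ == "__main__":
    print(round(extra_iops_cost(16000), 2))
```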
At high concurrency, the bottleneck is not the IOPS number. It is the number of I/O paths. Provisioning more IOPS on gp3 raises the ceiling on a constrained-path architecture. NeuralMesh spreads I/O across the full backend cluster, so the system absorbs more concurrent requests rather than queuing them. We expect additional IOPS to compress the gap at low concurrency without changing the saturation profile at 32 or 64 clients, but this post does not measure that.
The choice is not between fewer and more IOPS. It is between spending more to approach a different ceiling and changing the architecture. A follow-up post will benchmark gp3 at 16,000 IOPS to show exactly where additional provisioning helps and where it doesn't.
Key takeaways
- Scale factor matters. A pgbench dataset that fits in shared_buffers measures CPU and WAL throughput, not storage. Always size the dataset to exceed available RAM if you want to see the storage layer.
- Low concurrency understates the gap. gp3 performs adequately at 4 clients. The difference becomes dramatic at 16 or more clients, which is the concurrency range many production database workloads operate in.
- The bottleneck is architectural, not a configuration problem. Tuning PostgreSQL parameters or increasing gp3 IOPS provisioning helps at the margins but doesn't remove the throughput ceiling.
- Distributed storage scales with parallelism. NeuralMesh's parallel architecture means that more concurrent clients increase aggregate throughput rather than increasing contention. At 64 clients, average latency is 8.39 ms on NeuralMesh against 34.95 ms on gp3, which is the clearest evidence of this.
- The operator model changes the deployment calculus. Deploying NeuralMesh on Kubernetes is not a storage migration project. The WEKA Kubernetes Operator manages the full storage cluster as native Kubernetes workloads, using the same operational model as everything else on the cluster. The infrastructure in this demo was provisioned from scratch and ready to benchmark in under 40 minutes.
- The same pattern applies beyond databases. Spark executors, Prometheus ingestion, and PyTorch data loaders can show similar saturation behaviour when many parallel workers contend for the same storage path.
Reproduce it yourself
All Terraform, manifests, scripts, and raw results are at github.com/mbookham7/mb-k8s-databse-perf-blog. You'll need an AWS account, get.weka.io credentials, and approximately 40 minutes.
make infra && make apps # ~40 min
make bench # writes results/results-<timestamp>.md
make destroy # don't forget this
gp3 is the right default for early-stage Kubernetes workloads. As concurrency grows and storage becomes the constraint, the architectural ceiling becomes the limiting factor. NeuralMesh, deployed natively on Kubernetes via the WEKA Kubernetes Operator, is the path to a different architecture rather than a higher ceiling on the existing one.