Making and Breaking Records: Do Benchmarks Matter?
Joel Kaufman. January 27, 2022
Recently, there’s been an uptick in blog posts about how hard it is to make sense of performance benchmarks when comparing them to each other. In many cases, it boils down to a few issues endemic to the industry that make it very hard for customers and companies to evaluate the capabilities of various systems:
- Differing and sometimes opaque disclosure rules across benchmarks
- A wide variety of open tools that anyone can run without any disclosure
- Varied system configurations across each individual benchmark, including “lab queens”
In a general sense, testing is useful: customers want to validate that a system meets their acceptance criteria, even if the system under test isn’t exactly what they will deploy. Then, once the system is deployed, they want to measure the effect of improvements and tuning by baselining the system and retesting it on a regular basis.
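The baseline-then-retest loop can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed method: the function names, the 2% noise tolerance, and the throughput samples are all hypothetical and would need to be replaced with values from your own runs.

```python
from statistics import mean


def percent_change(baseline, current):
    """Relative change of current versus baseline, in percent."""
    return (current - baseline) / baseline * 100.0


def compare_to_baseline(baseline_runs, tuned_runs, tolerance_pct=2.0):
    """Compare the mean of repeated runs against the mean of the baseline runs.

    tolerance_pct is an assumed noise band: changes smaller than this
    are treated as run-to-run variation rather than a real difference.
    """
    change = percent_change(mean(baseline_runs), mean(tuned_runs))
    if change > tolerance_pct:
        verdict = "improved"
    elif change < -tolerance_pct:
        verdict = "regressed"
    else:
        verdict = "within noise"
    return change, verdict


# Hypothetical throughput samples in GB/s from repeated runs of the same test.
baseline = [10.0, 10.2, 9.8]
tuned = [11.0, 11.1, 10.9]

change, verdict = compare_to_baseline(baseline, tuned)
print(f"{change:+.1f}% versus baseline: {verdict}")
```

Averaging several runs before comparing keeps a single noisy result from being mistaken for a tuning win or a regression.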
Earlier this year, Samsung Datacenter Technology and Cloud Solutions labs built a storage system around their newest PM9A3 NVMe drive to test how well it could power enterprise storage. After some consideration, they chose WEKA to bring the drives together in a storage cluster and see what performance they could get. The result was a compact, high-density system with 90 PM9A3 NVMe drives in a 12U footprint that set new number 1 records in 4 of the 5 audited SPEC Storage2020 benchmarks, all using an identical configuration across varied datasets and IO types.
Internal results using FIO indicated that the system generated 145 GB/s of throughput with only 8 clients attached, as well as 6.5 million IOPS at 110 µs latency. This translates to roughly 72,000 IOPS per NVMe drive, showing not only the performance of the Samsung drives but also the ability of WEKA to deliver that performance in a consolidated manner across all of the FIO tests.
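The per-drive figure above is simple division, and the same arithmetic gives a couple of other useful sanity checks. The sketch below uses only the numbers quoted in the text. The per-client throughput and the in-flight IO estimate are derived here, not reported by Samsung, and the latter assumes Little’s Law applies to a steady-state run.

```python
# Figures quoted in the text above.
TOTAL_IOPS = 6_500_000          # aggregate IOPS
TOTAL_THROUGHPUT_GB_S = 145     # aggregate throughput in GB/s
DRIVES = 90                     # PM9A3 NVMe drives in the cluster
CLIENTS = 8                     # client servers driving the load
LATENCY_S = 110e-6              # 110 microseconds, in seconds

# Even split of the aggregate across drives and clients.
iops_per_drive = TOTAL_IOPS / DRIVES
throughput_per_client = TOTAL_THROUGHPUT_GB_S / CLIENTS

# Little's Law: average IOs in flight = arrival rate * average latency.
# Assumes a steady-state workload; this is a derived estimate only.
outstanding_ios = TOTAL_IOPS * LATENCY_S

print(f"{iops_per_drive:,.0f} IOPS per drive")
print(f"{throughput_per_client:.1f} GB/s per client")
print(f"about {outstanding_ios:.0f} IOs in flight across the cluster")
```

Checks like these are a quick way to see whether a published number is plausible for the hardware listed in the disclosure.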
Was Samsung’s Benchmark Run a Good Indicator of Realistic Performance?
Much of the teeth-gnashing comes from the different benchmarks that exist. In the storage world, there is a plethora of test tools and benchmarks that can be run to exercise a system. On the basic tools side, you have widely known general IO generators such as FIO, Iometer, IOzone, Vdbench, and more. Then you get to more custom dataset-based tools such as NVIDIA’s MLPerf-ResNet and NCCL all_reduce, and benchmarks such as the IO500. These generate specific IO against a specific dataset to see how your storage performs against that workload. And finally, you have fully audited dataset-based benchmarks such as the Securities Technology Analysis Center (STAC) series of benchmarks for financial analysis workloads, and the Standard Performance Evaluation Corporation (SPEC) benchmarks. All of these tools and benchmarks can be used to test systems and, depending on HOW they are used, can be very good.
The largest issue with using these tools in a public forum is the reporting and publishing rules, which are what separate egregious marketing claims from something that has a basis in reality. The completeness of disclosure on how the system is built and configured, and on how the test must be run, is what creates value for people who use these comparisons to evaluate new technologies for their environment. SPEC and STAC are good because of very strong disclosure rules, including detailed hardware and software configurations. If you purchase the equipment and configure your system to the published settings, you should be able to reproduce the results they show. Other benchmark and tool results are more fluid, and vendor self-published numbers should always be double-checked, both for details like IOPS and throughput and to make sure that enough information about the configuration exists to understand how the results were achieved.
For this reason, while I use tools like FIO to test baselines for specific things and then do iterative comparisons, I tend to look to SPEC and STAC because of the rigor they have in developing the benchmark IO patterns and datasets to match more real-world data flows. In addition, the level of auditing and validating that the benchmark was correctly run *and disclosed* provides confidence in being able to do a system-to-system comparison within the benchmark.
Because they used SPEC, Samsung’s benchmarks can be viewed as a reference architecture: If you buy the same equipment and deploy using the configuration they describe, your results will be repeatable.
Lessons Learned While Benchmarking
A few lessons learned over the years helped guide the configuration used for Samsung’s storage test:
- Use as many default settings as you can to reduce complexity: A few architectural choices had to be made, such as Ethernet vs. InfiniBand for networking, how much CPU to use for storage services, and the number and capacity of NVMe drives. After that, it took only some minor tweaks to the client mount/access settings. This kept the time to iterate and test to a minimum.
- Clients matter: Not all benchmarks will hit the limit of the storage unless you make changes to the clients. Many benchmarks have dataset structures and access patterns that are sensitive to how much CPU is in the client, and in some cases the client OS (Linux, Windows, etc.) may have a limit that forces you to make configuration changes. Those limits may also force you to use more clients to get all the performance out of the storage. Samsung settled on 8 client servers in this case as a balance between getting good performance and reducing infrastructure complexity and sprawl.
- The individual components are important: The number and type of SSDs, NICs, and switches, the amount of RAM, and the speed of the CPU can all have an impact on the testing. Planning ahead to choose appropriate components for the type of testing being done will go a long way toward achieving the testing goals.
- Networking can be a PITA... make sure it’s consistent! Make sure that identical configurations are used, and applied through automation, for every storage host and client. One server with a bad network config can be very hard to identify as the source of a problem, especially at scale.
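The networking lesson lends itself to a simple automated check: collect the same settings from every host and flag any host that differs from the majority. The sketch below is one hypothetical way to do that and is not the tooling Samsung used. The host names, the `mtu` and `speed_gbps` fields, and the idea that the settings have already been gathered into a dictionary are all assumptions for illustration.

```python
from collections import Counter


def find_outliers(configs):
    """Return hosts whose settings differ from the most common configuration.

    configs maps host name -> dict of setting name -> value.
    The result maps each outlier host to {setting: (expected, actual)}.
    """
    # Build an order-independent, hashable fingerprint for each host.
    fingerprints = {
        host: tuple(sorted(cfg.items())) for host, cfg in configs.items()
    }
    # Treat the most common fingerprint as the expected configuration.
    most_common, _count = Counter(fingerprints.values()).most_common(1)[0]
    expected = dict(most_common)

    outliers = {}
    for host, cfg in configs.items():
        diff = {
            key: (expected.get(key), cfg.get(key))
            for key in set(expected) | set(cfg)
            if expected.get(key) != cfg.get(key)
        }
        if diff:
            outliers[host] = diff
    return outliers


# Hypothetical settings gathered from each storage host and client.
configs = {
    "storage01": {"mtu": 9000, "speed_gbps": 100},
    "storage02": {"mtu": 9000, "speed_gbps": 100},
    "client01": {"mtu": 1500, "speed_gbps": 100},
    "client02": {"mtu": 9000, "speed_gbps": 100},
}

outliers = find_outliers(configs)
for host, diff in outliers.items():
    print(host, diff)
```

Using the majority configuration as the reference is a convenience for the sketch. In practice, comparing against an explicit, version-controlled expected configuration is safer, since a majority of hosts can share the same mistake.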
Overall, the results show off several benefits that can come from deploying this Samsung + WEKA configuration. Excellent multi-workload performance, from both the Samsung PM9A3 drives and the WEKA Data Platform for AI, means there is no need to specialize in only one type of workload. A broad set of capabilities and *zero tuning* across the various workloads minimizes the storage administration effort needed to optimize the system. This comes from the WEKA Zero-copy architecture, which is important as customers want to consolidate rather than create discrete silos for each workload in a high-performance data pipeline.