How To Shorten Deep Learning Training Times
Barbara Murphy. July 1, 2018
The AI workflow is the production line for deep learning (DL) model development and deployment. Data is ingested into the system from various sources, including sensors, machines, vehicles, logs or user data. It is then cleaned, tagged and transformed into a data set that is used to “train” the DL model. On completion of the training phase, the model is deployed to production for inference against real “production” data input. Storage is a major consideration during the training portion of the workflow, where storage capacity and I/O performance can have a significant impact on the time to production.
Most storage vendors depict the typical DL training workflow in a simplistic way (shown in figure 1), ignoring the inference phase. Inference is typically considered a process that happens in production and does not require interaction with the infrastructure that was used to train the neural network. This is a significant oversight, because it assumes the model emerges from the training phase as an accurate model on the first pass. In practice, inference is also used to validate the accuracy of a neural network that is being trained, enabling more informed tuning of that network to improve performance or accuracy.
Figure 1: Simplistic model of AI training workflow
Figure 1 is missing a continuous improvement feedback loop to ensure model accuracy. A more representative workflow is outlined in figure 2, which incorporates a model validation phase. If the model fails in the validation stage, it has to be tuned and trained further. This cycle of continuous model improvement is a critical step in model development. It is analogous to the model validation (or simulation) phase in EDA chip design or other software development efforts, and it can add significant wall clock time to the training phase.
Figure 2: Real-world workflow during deep learning training phase
WekaIO has partnered with the Hewlett Packard Enterprise (HPE) Deep Learning Lab to explore the I/O impact on a more representative pipeline for DL model training, focusing on the inference portion of the training cycle. The HPE Deep Learning team conducted a series of industry-standard benchmarks on a single HPE Apollo™ 6500 with NVIDIA Tesla™ V100 GPU processors. The purpose of the benchmarks was to understand the impact of storage I/O during the model validation phase by comparing the performance of local NVMe drives with Matrix-enabled shared storage over a Mellanox™ InfiniBand network.
Local file systems are considered by many to be the gold standard for best-possible I/O performance. The HPE Deep Learning Lab conducted tests across a series of popular benchmarks, including ResNet 152, ResNet 50, GoogleNet and VGG 16, comparing local NVMe drive performance on the Apollo 6500 with WekaIO Matrix on external storage. HPE conducted a suite of tests scaling from 2 GPU processors to 8; a summary of the results can be reviewed in HPE’s Deep Learning Performance Guide, report #10.
Figure 3 compares the inference benchmark measurements on local NVMe drives and WekaIO Matrix.
Figure 3: Inference benchmarks on single node with 8 GPU processors
WekaIO Matrix consistently outperformed the local NVMe drives across every benchmark, peaking at just under 42,000 images per second on GoogleNet, while local NVMe was I/O bound at around 17,000 images per second. The results demonstrate that the Matrix parallel file system outperforms a local file system: the local file system was I/O bound, which starved the Apollo 6500 GPUs of data.
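To put those two figures side by side, the short Python sketch below computes the ratio between them and a rough estimate of the read rate needed to sustain each. The throughput numbers are the approximate values quoted above; the average image size is a purely hypothetical assumption for illustration and does not come from the HPE report.

```python
def speedup(fast_images_per_s, slow_images_per_s):
    """Ratio of two image throughput figures."""
    return fast_images_per_s / slow_images_per_s


def required_read_rate_gbytes(images_per_s, avg_image_kbytes):
    """Storage read rate (GBytes/second) needed to feed a given image rate."""
    return images_per_s * avg_image_kbytes * 1e3 / 1e9


# Approximate figures quoted in the text for GoogleNet on 8 GPUs.
MATRIX_IMAGES_PER_S = 42_000   # "just under 42,000"
LOCAL_NVME_IMAGES_PER_S = 17_000  # "around 17,000"

# ASSUMPTION: hypothetical average image size, not taken from the benchmark.
ASSUMED_IMAGE_KBYTES = 100

ratio = speedup(MATRIX_IMAGES_PER_S, LOCAL_NVME_IMAGES_PER_S)
matrix_rate = required_read_rate_gbytes(MATRIX_IMAGES_PER_S, ASSUMED_IMAGE_KBYTES)
local_rate = required_read_rate_gbytes(LOCAL_NVME_IMAGES_PER_S, ASSUMED_IMAGE_KBYTES)

print(f"Matrix vs local NVMe: {ratio:.2f}x")
print(f"Read rate at Matrix throughput: {matrix_rate:.1f} GBytes/s")
print(f"Read rate at local NVMe throughput: {local_rate:.1f} GBytes/s")
```

The ratio depends only on the quoted throughput figures. The read rates scale linearly with whatever image size is assumed, so they should be read as an order-of-magnitude guide only.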
The perception that local file systems are higher performing than shared file systems has led to the practice of local-file copy in the AI workflow. That perception is accurate for shared storage solutions built on legacy NFS, because the protocol bottlenecks at around 1.5GBytes/second. NFS (Not for Speed) was developed when networks were slow and disk drives could easily match the performance of the network. Today’s InfiniBand and Ethernet networks reach speeds of 100Gbits/second (12.5GBytes/second), and the gap between NFS performance and underlying network capability is significant. The WekaIO Matrix parallel file system leverages this network bandwidth to deliver much higher performance than either local NVMe drives or NFS.
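The unit conversion and the size of that gap are easy to check. The Python sketch below uses only the two figures quoted above (a 100Gbits/second link and an NFS ceiling of about 1.5GBytes/second). The data set size used for the read-time comparison is a hypothetical value picked to make the arithmetic easy to follow.

```python
def gbits_to_gbytes(gbits_per_s):
    """Convert a line rate in Gbits/second to GBytes/second (8 bits per byte)."""
    return gbits_per_s / 8


def seconds_to_read(dataset_gbytes, rate_gbytes_per_s):
    """Time to stream a data set once at a sustained read rate."""
    return dataset_gbytes / rate_gbytes_per_s


LINK_GBITS_PER_S = 100       # network speed quoted in the text
NFS_GBYTES_PER_S = 1.5       # approximate NFS bottleneck quoted in the text

# ASSUMPTION: hypothetical data set size, for illustration only.
ASSUMED_DATASET_GBYTES = 150

link_gbytes_per_s = gbits_to_gbytes(LINK_GBITS_PER_S)
nfs_link_fraction = NFS_GBYTES_PER_S / link_gbytes_per_s

print(f"Link capacity: {link_gbytes_per_s} GBytes/s")
print(f"NFS uses about {nfs_link_fraction:.0%} of the link")
print(f"One pass at NFS rate: {seconds_to_read(ASSUMED_DATASET_GBYTES, NFS_GBYTES_PER_S):.0f} s")
print(f"One pass at line rate: {seconds_to_read(ASSUMED_DATASET_GBYTES, link_gbytes_per_s):.0f} s")
```

The line-rate figure ignores protocol overhead, so it is an upper bound rather than a measured result.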
HPE also benchmarked WekaIO Matrix against DRAM, and the shared parallel file system delivered close to DRAM performance from the external storage system. These numbers are outlined in HPE’s Deep Learning Performance Guide, report #11, and shared below in figure 4.
The work conducted by the HPE Deep Learning Lab highlights the need for a high-performance shared storage system in the DL workflow to ensure the training workflow is not starved of I/O during the validation phase. These benchmarks also support WekaIO’s claim that the Matrix file system delivers faster performance than local drives, nearing DRAM performance. While NFS will be hard pressed to deliver more than 1GByte/second on a single network connection, WekaIO scales to over 10GBytes/second on a single 100Gbit link. Matrix outperforms a local NVMe-based file system because it can parallelize the I/O from many more NVMe drives over a network connection that is faster than what is available to a local NVMe drive.
For more information on deep learning, check out HPE’s cookbook, which contains a set of tools, benchmarks and reference designs for deep learning.