NVIDIA DGX BasePOD: How it Works and More
What is NVIDIA DGX BasePOD?
DGX BasePOD ("DGX" stands for "datacenter GPU accelerator") provides a prescriptive enterprise AI infrastructure designed for scaling. It eliminates the design challenges, lengthy deployment cycles, and management complexity traditionally associated with scaling AI infrastructure.
NVIDIA DGX BasePOD is an integrated, ready-to-deploy offering consisting of hardware and software components, MLOps solutions, and third-party storage. It allows users to leverage validated NVIDIA partner solutions and products, ensuring best practices for an efficient, manageable scale-out AI development platform.
DGX BasePOD reference architecture solution designs support developer needs, simplify IT manageability, and facilitate infrastructure scaling up to dozens of nodes with certified storage platforms. Optional MLOps solutions can integrate with DGX BasePOD to enable a full stack solution; this improves model training efficiency, shortens AI development cycles, and speeds ROI for AI initiatives.
Typically, users must build and tune their own integration points before deploying applications. However, each layer of the NVIDIA DGX platform is an integration point, allowing users to simplify system deployment and optimization using validated, prescriptive AI infrastructure designs.
DGX BasePOD brings together the following components:
- NVIDIA DGX BasePOD networking. InfiniBand and Ethernet technologies provide the networking in DGX BasePOD, ensuring AI workloads do not encounter bottlenecks or performance degradation.
- NVIDIA partner storage appliance. DGX BasePOD rests on a proven storage technology ecosystem. NVIDIA-validated storage partners introduce and qualify new storage technologies with DGX BasePOD to ensure design compatibility and high performance with known workloads.
- NVIDIA DGX software. NVIDIA Base Command powers every DGX BasePOD and offers enterprise-grade orchestration and cluster management, storage and network infrastructure, and an operating system (OS) optimized for AI workloads. Base Command provides integrated cluster management and supports workflow management; Slurm or Kubernetes can be used for optimal scheduling and management of system resources in a multi-user environment (see the Slurm sketch after this list).
- NVIDIA hardware. DGX BasePOD hardware is optimized with acceleration libraries that speed data movement, access, and management across the fabric.
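As a concrete illustration of the scheduling layer, the sketch below composes and submits a hypothetical multi-node Slurm job from Python. The resource counts, script name, and time limit are placeholder assumptions for illustration, not values from a real BasePOD deployment.

```python
import subprocess
import tempfile

# A minimal sketch: submit a 2-node, 8-GPU-per-node training job to Slurm.
# Resource counts, script name, and time limit are hypothetical.
job_script = """#!/bin/bash
#SBATCH --job-name=ddp-train
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=8
#SBATCH --gres=gpu:8
#SBATCH --time=04:00:00

srun python train.py
"""

with tempfile.NamedTemporaryFile("w", suffix=".sbatch", delete=False) as f:
    f.write(job_script)
    script_path = f.name

# sbatch prints "Submitted batch job <id>" on success.
result = subprocess.run(["sbatch", script_path], capture_output=True, text=True, check=True)
print(result.stdout.strip())
```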
What is the difference between NVIDIA HGX and DGX?
HGX targets hyperscale data centers, while DGX supports AI supercomputing capabilities.
The NVIDIA hyperscale GPU accelerator (HGX) architecture supports cloud service providers and large-scale data centers. HGX enables high-performance computing and accelerated AI and ML workloads. Typical use cases include scientific simulations and deep learning training.
What is the difference between NVIDIA DGX and EGX?
EGX targets edge computing environments, and as mentioned above, DGX is focused on AI computing capabilities.
NVIDIA edge computing GPU accelerator (EGX) brings AI and GPU acceleration closer to the data in edge computing environments. This enables real-time processing and reduces latency and bandwidth requirements. Typical use cases include industrial IoT, retail analytics, and smart cities.
DGX BasePOD Architecture Explained
The flexible DGX BasePOD solution works with multiple adaptable, prescriptive architectures that support enterprises as they design, deploy, and manage AI workloads and their evolving demands.
The DGX reference architecture typically has multiple nodes, each with four compute connections for networking. The system uses high-speed networking provided by NVIDIA Mellanox InfiniBand switches. The complete DGX BasePOD architecture has three networks: an Ethernet fabric for system management and storage, an InfiniBand-based compute network, and an out-of-band management network.
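For orientation, the snippet below models that three-fabric layout as a simple Python data structure. The keys and labels are illustrative shorthand, not NVIDIA configuration names.

```python
# Illustrative model of the three DGX BasePOD networks described above.
# Names and labels are for explanation only, not NVIDIA configuration values.
basepod_networks = {
    "compute": {
        "technology": "InfiniBand",
        "purpose": "low-latency GPU-to-GPU traffic for distributed training",
    },
    "management_storage": {
        "technology": "Ethernet",
        "purpose": "system management and storage access",
    },
    "out_of_band": {
        "technology": "Ethernet",
        "purpose": "out-of-band management (BMC access)",
    },
}

for name, fabric in basepod_networks.items():
    print(f"{name:20s} {fabric['technology']:10s} {fabric['purpose']}")
```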
DGX BasePOD designs may be built and deployed with active optical cables, direct attached copper, or transceivers and fiber cables.
The architecture is equipped with multiple NVIDIA DGX systems: GPU-accelerated servers optimized for AI tasks. They serve as the computational engines for running deep learning algorithms.
The comprehensive NVIDIA DGX software stack includes cuDNN for deep neural network acceleration, CUDA for parallel computing, and frameworks like TensorFlow and PyTorch for building and training AI models.
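A quick way to confirm that this stack is wired together, assuming a CUDA build of PyTorch is installed, is a sanity check like the following sketch:

```python
import torch

# Sanity-check the CUDA/cuDNN stack from PyTorch (assumes a CUDA build of PyTorch).
print("CUDA available: ", torch.cuda.is_available())
print("GPU count:      ", torch.cuda.device_count())
print("cuDNN version:  ", torch.backends.cudnn.version())

# Run a small matrix multiply on the first GPU to exercise the CUDA libraries.
x = torch.randn(4096, 4096, device="cuda")
y = x @ x
torch.cuda.synchronize()
print("matmul OK:", y.shape)
```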
The architecture also includes management and orchestration tools for scaling and deploying AI workloads and resources across NVIDIA DGX servers efficiently.
DGX BasePOD Core Components
The compute, HCA, and switch resources form the foundation of the DGX BasePOD. Consider how NVIDIA DGX H100 systems are composed as an example:
The NVIDIA DGX H100 system is designed for compute density, high performance, and flexibility. Key specifications include eight NVIDIA H100 GPUs and 640GB GPU memory. Four single-port HCAs are used for the compute fabric, and dual-port HCAs provide parallel pathways to management and storage fabrics. The out-of-band port provides BMC access.
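On a DGX node, that GPU inventory can be confirmed with nvidia-smi; a small sketch, with the query fields chosen for illustration:

```python
import subprocess

# Query GPU inventory on a DGX node via nvidia-smi (fields chosen for illustration).
out = subprocess.run(
    ["nvidia-smi", "--query-gpu=index,name,memory.total", "--format=csv,noheader"],
    capture_output=True, text=True, check=True,
)
for line in out.stdout.strip().splitlines():
    print(line)  # e.g. "0, NVIDIA H100 80GB HBM3, 81559 MiB"
```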
NVIDIA DGX BasePOD configurations can be equipped with four types of networking switches:
- NVIDIA QM9700 and QM8700 InfiniBand switches power the compute fabric in NDR and HDR BasePOD configurations, respectively. Each NVIDIA DGX system has dual connections to each InfiniBand switch, providing multiple low-latency, high-bandwidth paths between the systems.
- NVIDIA SN5600 Ethernet switches are used for GPU-to-GPU fabrics in Ethernet-based configurations.
- NVIDIA SN4600 switches provide redundant connectivity for in-band management of the DGX BasePOD.
DGX BasePOD Use Cases
Some common use cases for NVIDIA DGX BasePOD include:
- AI research and development. DGX BasePOD provides a powerful infrastructure for AI researchers and data scientists to develop and train deep learning models. It accelerates model training tasks, reducing iteration times and enabling faster experimentation.
- Deep learning training at scale. Organizations with large datasets and complex deep learning models can scale up training capabilities with multiple DGX BasePODs in a unified architecture, distributing training across GPUs and speeding the training process (see the distributed-training sketch after this list).
- AI model deployment. DGX BasePOD supports AI deployments at scale, providing computational power and infrastructure to serve predictions in real-time.
- High-performance computing (HPC). DGX BasePOD can be used for a variety of applications, including scientific simulations, computational fluid dynamics, molecular modeling, and other traditional HPC tasks.
- AI-powered analytics for big data platforms. DGX BasePOD can help extract valuable insights from big data for generative AI applications such as natural language processing, image recognition, recommendation systems, and predictive analytics.
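For the distributed-training case, a minimal PyTorch DistributedDataParallel sketch is shown below. It assumes a CUDA build of PyTorch and launch via torchrun; the model and loop body are placeholders.

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    # NCCL handles GPU-to-GPU communication over the compute fabric.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # Placeholder model; a real job would build its network and data loaders here.
    model = DDP(torch.nn.Linear(1024, 1024).cuda(local_rank), device_ids=[local_rank])
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

    for _ in range(10):
        inputs = torch.randn(64, 1024, device=f"cuda:{local_rank}")
        loss = model(inputs).square().mean()
        optimizer.zero_grad()
        loss.backward()   # gradients are all-reduced across every GPU in the job
        optimizer.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

Launched with, for example, `torchrun --nnodes=2 --nproc_per_node=8 train.py`, the same script scales from a single DGX system to many without code changes.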
WEKA and DGX BasePOD
The WEKA® Data Platform recently passed the suite of NVIDIA tests required to achieve DGX BasePOD™ H100 certification status. WEKA was already BasePOD certified against the NVIDIA DGX A100 platform, and this new round of testing extends that certification to the NVIDIA DGX H100 platform. The testing also provides NVIDIA with initial data to extrapolate how the system will perform with a DGX SuperPOD system.
An entry-level WEKA cluster for BasePOD requires eight nodes for full availability with the ability to survive up to a two-node failure, and the cluster can be easily scaled to hundreds of nodes. Each server has a CPU, NVMe storage, and high-bandwidth networking.
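As a back-of-the-envelope illustration of why eight nodes can tolerate a two-node failure, assume a hypothetical 6+2 data-protection stripe; the actual stripe width in a WEKA cluster is configuration-dependent.

```python
# Hypothetical 6+2 stripe: 6 data shares + 2 parity shares per stripe.
# Any 2 of the 8 nodes can fail and every stripe can still be rebuilt.
data_shares, parity_shares = 6, 2
usable_fraction = data_shares / (data_shares + parity_shares)
print(f"survivable node failures: {parity_shares}")
print(f"usable capacity fraction: {usable_fraction:.0%}")  # 75%
```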
The WEKA Data Platform is ideal for data-intensive AI and HPC workloads, delivering exceptional performance under high-intensity IO. The platform also scales linearly as more servers are added to the storage cluster, allowing the infrastructure to grow with the increasing demands of the business.
In addition to POSIX access, WEKA supports standard file access protocols, including NFS, SMB, and S3, for maximum compatibility and interoperability. WEKA delivers best-of-breed performance from the NVMe flash tier, and the namespace can expand to any S3 object store, on-premises or in the cloud. This optional hybrid storage model, with the ability to extend the global namespace to lower-cost hard disk drives in an object store, delivers a cost-effective data lake without compromising performance.
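Because the namespace is reachable over S3 as well as POSIX, standard tooling works unchanged. Here is a sketch using boto3; the endpoint, bucket, and credentials are placeholder values.

```python
import boto3

# Placeholder endpoint, bucket, and credentials; a real deployment would use
# its own WEKA S3 endpoint and access keys.
s3 = boto3.client(
    "s3",
    endpoint_url="https://weka.example.com:9000",
    aws_access_key_id="ACCESS_KEY",
    aws_secret_access_key="SECRET_KEY",
)

# Upload a checkpoint over S3; the same namespace is also visible via POSIX
# (e.g., under a /mnt/weka mount).
s3.upload_file("model.pt", "checkpoints", "run-42/model.pt")
for obj in s3.list_objects_v2(Bucket="checkpoints", Prefix="run-42/")["Contents"]:
    print(obj["Key"], obj["Size"])
```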
WEKA’s instant, space-efficient snapshots support experiment reproducibility and explainability, and together with end-to-end encryption they ensure that data is always backed up and secure throughout its lifecycle.
WEKA’s Snap-to-Object feature also enables secure data portability from on-premises to the public cloud for organizations that require access to on-demand GPU resources in the public cloud. Using the WEKA Kubernetes CSI plug-in, organizations now have flexibility in how and where they deploy containerized applications. It provides easy data mobility from on-premises to the cloud and back while delivering the best storage performance and latency.
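With the CSI plug-in, containerized jobs claim WEKA-backed volumes through standard Kubernetes objects. Below is a sketch using the official Kubernetes Python client; the storage class name and requested size are assumptions, not values defined by the WEKA CSI plug-in itself.

```python
from kubernetes import client, config

# Assumes kubectl access to the cluster is already configured.
config.load_kube_config()

# Plain-dict manifest, equivalent to a YAML PVC. Storage class name and size
# are placeholders for illustration.
pvc = {
    "apiVersion": "v1",
    "kind": "PersistentVolumeClaim",
    "metadata": {"name": "training-data"},
    "spec": {
        "accessModes": ["ReadWriteMany"],  # shared across pods on many nodes
        "storageClassName": "weka-filesystem",
        "resources": {"requests": {"storage": "10Ti"}},
    },
}

client.CoreV1Api().create_namespaced_persistent_volume_claim(
    namespace="default", body=pvc
)
```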
An NVIDIA DGX BasePOD with a datastore that leverages the WEKA Data Platform will enable us to extend these benefits to our customers at a much larger scale. Key highlights of the solution include:
- The WEKA Data Platform and NVIDIA DGX BasePOD are now directly applicable to mission-critical enterprise AI workflows, including natural language processing and larger-scale workloads for customers in the life sciences, healthcare, and financial services industries, among many others. WEKA can efficiently serve large and small files across various workload types.
- WEKA’s continued innovation and support of Magnum IO GPUDirect Storage technology provides low-latency, direct access between GPU memory and storage. This frees CPU cycles from servicing the I/O operations and delivers higher performance for other workloads.
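From Python, GPUDirect Storage can be exercised through NVIDIA’s kvikio bindings for cuFile. A sketch follows, with the file path as a placeholder and assuming a GDS-capable mount.

```python
import cupy as cp
import kvikio

# Read a dataset shard straight into GPU memory. With GPUDirect Storage the
# transfer is a DMA from storage to the GPU, bypassing a CPU bounce buffer;
# kvikio falls back to a POSIX path when GDS is unavailable.
buf = cp.empty(64 * 1024 * 1024, dtype=cp.uint8)  # 64 MiB GPU buffer
f = kvikio.CuFile("/mnt/weka/dataset.bin", "r")   # placeholder path
try:
    nbytes = f.read(buf)
    print(f"read {nbytes} bytes into GPU memory")
finally:
    f.close()
```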
To achieve BasePOD certification, NVIDIA requires the completion of a suite of tests that measure performance using an automated IO test suite. NVIDIA focuses on the NVIDIA ‘Condor’ test suite, which measures overall storage performance and IO scaling across multiple parameters. BasePOD testing configurations using Condor are designed to exercise the path between the storage and the CPU side of the DGX systems. Condor leverages FIO as the IO engine when running its tests.
NVIDIA values consistency of performance and the ability to scale to meet the needs of a DGX cluster as it grows. The Condor tests start at one client (‘node’) and scale up to 16 nodes during testing. Condor tests combinations of different IO sizes, different thread counts, buffered IO vs. direct IO, random reads, re-reads, and more, resulting in nearly 500 different tests, with each test run six times to verify results and gather the data needed to evaluate the storage system. The data is then audited by NVIDIA engineering to decide whether a storage system passes the BasePOD certification.

Engineers at WEKA and NVIDIA partnered to architect a scalable and robust infrastructure that pushes the boundaries of AI innovation and performance. The validation showed robust linear performance scalability from one to sixteen DGX H100 systems, allowing organizations to start small and grow seamlessly as AI projects ramp. The results demonstrate that scaling GPU infrastructure to accelerate time to insights is well supported by WEKA. The validated WEKA configuration makes it easy for teams to focus on developing new products and gaining faster insights with AI/ML.
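For a sense of what such a parameter matrix looks like in practice, the sketch below sweeps a few fio parameters in the spirit of the description above. The parameter values, file path, and run lengths are illustrative; the real Condor suite, its matrix, and its pass criteria are NVIDIA’s own.

```python
import itertools
import json
import subprocess

# Hypothetical sweep inspired by the Condor-style matrix described above:
# block sizes x thread counts x buffered/direct IO x read patterns.
block_sizes = ["4k", "128k", "1m"]
num_jobs = [1, 8, 32]
direct_flags = [0, 1]          # 0 = buffered IO, 1 = direct IO
patterns = ["read", "randread"]

for bs, jobs, direct, rw in itertools.product(block_sizes, num_jobs, direct_flags, patterns):
    cmd = [
        "fio", "--name=sweep", "--filename=/mnt/weka/testfile",  # placeholder path
        f"--bs={bs}", f"--numjobs={jobs}", f"--direct={direct}", f"--rw={rw}",
        "--ioengine=libaio", "--iodepth=16", "--size=1g",
        "--runtime=30", "--time_based", "--group_reporting",
        "--output-format=json",
    ]
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    bw = json.loads(result.stdout)["jobs"][0]["read"]["bw"]  # aggregate read bandwidth, KiB/s
    print(f"bs={bs} jobs={jobs} direct={direct} rw={rw}: {bw} KiB/s")
```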