Introduction to HPC & What is an HPC Cluster?

July 25, 2020

This is a 3-part series on High-Performance Computing (HPC) Storage:

Part 1: Introduction to HPC & What is an HPC Cluster?
Part 2: What is HPC Storage & HPC Storage Architecture
Part 3: HPC Storage Use Cases

Introduction to High-Performance Computing (HPC)

Organizations in diverse fields such as manufacturing, life sciences, energy exploration and extraction, financial services, government & defense, and scientific research are using the power of High-Performance Computing (HPC) to tackle many of today’s most pressing challenges, whether that means creating more optimized products or understanding the physical world around us. HPC has become interwoven into many organizations’ workflows rather than being a side business that operates independently. With such a wide range of domains using HPC today, designing and implementing an efficient infrastructure requires knowledge of the applications that will be used as well as planning for future growth. Many organizations use HPC in more than one industry, which requires the solution to be flexible enough for different computing, storage, and networking requirements. The market for HPC servers, storage, middleware, applications, and services was estimated at $27B in 2018 and is projected to grow at a 7.2% CAGR to about $39B in 2023 (Reference 1). HPC storage alone is projected to grow from an estimated $5.5B in 2018 to $7.8B in 2023.
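The growth figures above follow the usual compound annual growth rate (CAGR) arithmetic. As a minimal sketch, the snippet below recomputes the rate implied by the quoted storage figures and compounds the quoted $27B at 7.2% for five years. The function names `cagr` and `project` are illustrative and do not come from the cited report, and the inputs are the rounded figures from the text.

```python
def cagr(start_value: float, end_value: float, years: int) -> float:
    """Compound annual growth rate that takes start_value to end_value."""
    return (end_value / start_value) ** (1.0 / years) - 1.0


def project(start_value: float, rate: float, years: int) -> float:
    """Value after compounding start_value at `rate` for `years` years."""
    return start_value * (1.0 + rate) ** years


# Figures quoted in the text, in billions of US dollars (2018 -> 2023).
storage_rate = cagr(5.5, 7.8, 5)
total_2023 = project(27.0, 0.072, 5)

print(f"Implied HPC storage CAGR: {storage_rate:.1%}")
print(f"$27B compounded at 7.2% for 5 years: ${total_2023:.1f}B")
```

With these rounded inputs, the storage figures imply a rate of roughly 7.2%, and the overall market lands a little above $38B. The small gap from the quoted "about $39B" is presumably due to rounding in the published numbers.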

An HPC system’s infrastructure contains several sub-systems that must work together efficiently and scale together. The main sub-systems of an HPC system are:

  • Compute – The compute portion runs the application, takes the data given to it, and generates an answer. In the past 20 years, most algorithms have been parallelized, so that the overall problem is broken up into many parts, each run on a separate computer or core. Periodically, the resulting answers or partial answers must be communicated to the other calculations or stored on a storage device. Modern servers contain two to four sockets (chips), each with up to 64 cores. Since each of these cores may need to store recently computed information, the demands on the storage device may increase as core counts increase. Many modern applications now take advantage of accelerators that are tightly connected to the main CPU and can accelerate certain portions of the application. Usually referred to as GPGPUs, these accelerators can speed up specific applications by 50X and place their own demands on the storage and networking infrastructure.
  • Storage – At the start of a long-running simulation, massive amounts of data are needed to get the simulation running. While the application runs, depending on the algorithm, more data may be required as well. For example, when simulating the interaction of different drug molecules, thousands of molecular descriptions will have to be ingested at once. As the application runs, more of these descriptions may have to be investigated, requiring a low-latency, high-bandwidth storage infrastructure. In large installations, the amount of online hot storage may be in the petabyte range. HPC storage is a key component for the smooth and efficient running of an HPC cluster.
  • Networking – The communications between the servers and storage devices should not limit the overall performance of the entire system. Each core that is performing the computations may need to communicate with thousands of other cores and request information from other nodes. The network needs to be designed to handle this server-to-server communication as well as multiple servers communicating concurrently with the storage system.
  • Application Software – Software that simulates physical processes and runs across many cores is typically sophisticated. The complexity lies not just in the mathematics underlying the simulation, but also in the reliance on highly tuned libraries to manage the networking, work distribution, and input and output to storage systems. The application needs to be architected to keep the overall system busy. Otherwise, the investment in a high-performance infrastructure will not deliver a high return on investment.
  • Orchestration – Setting up a part of a large cluster can be challenging. A massive supercomputer will rarely have the entire system dedicated to a single application. Therefore, software is needed to allow a scientist or engineer to allocate a certain number of servers, GPUs if needed, network bandwidth, and storage capabilities and capacities. All of these sub-systems, as well as the installation of an Operating System (OS) and associated software on the allocated nodes, need to be handled seamlessly and effortlessly. Setting up the software for HPC storage is critical for applications that require fast data access.
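The Compute item above describes breaking a problem into parts, running each part separately, and then combining the partial answers. The following is a minimal single-machine sketch of that pattern. It assumes a toy workload (a sum of squares), and Python threads stand in for the separate nodes or cores. A production HPC code would typically use a message-passing library such as MPI instead, and the function names here are illustrative only.

```python
from concurrent.futures import ThreadPoolExecutor


def partial_sum_of_squares(chunk: range) -> int:
    """Work done by one worker: its share of the overall problem."""
    return sum(i * i for i in chunk)


def split(n: int, parts: int) -> list[range]:
    """Divide 0..n-1 into `parts` contiguous, nearly equal ranges."""
    step, extra = divmod(n, parts)
    chunks, start = [], 0
    for p in range(parts):
        # The first `extra` chunks absorb the remainder, one item each.
        size = step + (1 if p < extra else 0)
        chunks.append(range(start, start + size))
        start += size
    return chunks


def parallel_sum_of_squares(n: int, workers: int = 4) -> int:
    """Scatter the chunks to workers, then combine the partial answers."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(partial_sum_of_squares, split(n, workers)))
    return sum(partials)


result = parallel_sum_of_squares(100_000, workers=4)
print(result)
```

The final `sum(partials)` step is the point where partial answers are communicated and combined. On a real cluster, that exchange travels over the network fabric, which is why the Networking and Storage sub-systems have to scale along with the core count.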

What Defines an HPC Cluster?

An HPC Cluster is a collection of components that together allow applications to be executed. The software typically runs across many nodes and accesses storage for both reading and writing data. All of this, including the traffic between nodes and the storage system, depends on seamless communication. Typically, several different types of nodes make up an HPC Cluster. HPC cluster components include:

  • Head node or login node – This node validates users and may set up specific software on the compute nodes.
  • Compute Nodes – These nodes perform the numerical computations and will probably have the highest clock rates available and affordable, with the maximum number of cores at the given clock rate. The persistent storage on these nodes may be minimal, while the amount of DRAM will be large.
  • Accelerator Nodes – Only some nodes may include one or more accelerators, since not all applications can take advantage of them. Smaller HPC clusters designed for a specific use may be set up so that all nodes contain an accelerator.
  • Storage Nodes or Storage System – An efficient HPC cluster will need to contain a high-performance parallel file system (PFS). A PFS allows all nodes to communicate in parallel with the storage drives. HPC storage allows the compute nodes to operate with minimal wait times.
  • Network Fabric – HPC clusters typically use an InfiniBand or high-performance Ethernet network switch, because low latency and high bandwidth are required.
  • Software – An HPC cluster cannot operate without the software that applications run on and the software that controls the underlying infrastructure. Software for running an HPC cluster efficiently needs to manage the massive amounts of I/O inherent to HPC applications. Reading and writing data in parallel from large numbers of CPUs to the storage system (whether internal or external to the servers) is critical and should not be ignored.
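The parallel reading and writing described above can be illustrated with several writers filling disjoint regions of one shared file. This is a rough sketch under stated assumptions: it runs on a single machine against a local temporary file, threads stand in for compute nodes, and `BLOCK_SIZE`, `write_block`, and `parallel_write` are hypothetical names chosen for illustration. It uses `os.pwrite`, which is available on Unix-like systems. A real parallel file system would additionally spread such writes across multiple storage servers.

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor

BLOCK_SIZE = 1024  # bytes written by each worker (hypothetical value)


def write_block(fd: int, rank: int) -> int:
    """Write this worker's block at its own offset; returns bytes written."""
    data = bytes([rank % 256]) * BLOCK_SIZE
    # pwrite takes an explicit offset, so writers do not share a file position.
    return os.pwrite(fd, data, rank * BLOCK_SIZE)


def parallel_write(path: str, workers: int) -> int:
    """Have `workers` writers fill disjoint regions of one shared file."""
    fd = os.open(path, os.O_CREAT | os.O_WRONLY, 0o600)
    try:
        with ThreadPoolExecutor(max_workers=workers) as pool:
            written = pool.map(lambda rank: write_block(fd, rank), range(workers))
            return sum(written)
    finally:
        os.close(fd)


with tempfile.TemporaryDirectory() as tmp:
    shared_file = os.path.join(tmp, "checkpoint.bin")
    total = parallel_write(shared_file, workers=8)
    with open(shared_file, "rb") as f:
        contents = f.read()
    print(total, len(contents))
```

Because each writer owns a fixed, non-overlapping byte range, no locking is needed between writers. That property is what lets many compute nodes write to the same file at once.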

1 –
