How to Get Your Data Center Ready for AI? Part Two: Cluster Computing
by
GIGABYTE
In part one of GIGABYTE Technology’s Tech Guide on how you can prepare your data center for the era of AI, we explored the advanced cooling solutions that will help you compute faster with a smaller carbon footprint. In part two, we delve into the key role that cluster computing plays in AI data centers. As the datasets used in AI development become more massive and complex, data centers need servers that will not only perform superbly at critical tasks, but also work with one another to be more than the sum of their parts. This is the basis of cluster computing. GIGABYTE can help you leverage it in your AI data center.
Like the advanced cooling solutions that we discussed in the first part of this Tech Guide, cluster computing is not a new invention, but it has gained prominence due to the advent of artificial intelligence (AI). A major driving force is the fact that modern AI development, which has led to the creation of large language models (LLMs) and generative AI, revolves around training models with billions or even trillions of parameters on enormous datasets. AI inference, which is what happens when AI uses its pre-trained model (or models) to provide services to users, can also be very resource-intensive. Clearly, this is not the type of workload that any single computer can handle on its own.
Cluster computing solves this problem by distributing the workload among interconnected servers, workstations, and even personal computers. It is a form of “parallelism” that is comparable to grid computing and parallel computing. The main benefits of cluster computing are high availability, load balancing—and perhaps most pertinent to the topic of AI, high performance computing (HPC). As AI becomes an indelible part of our lives, it should come as no surprise that AI hardware and software providers are incorporating cluster computing technology into their offerings.
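The divide-compute-aggregate pattern at the heart of cluster computing can be illustrated with a minimal sketch. The snippet below splits a dataset into chunks and farms them out to parallel workers on a single machine; a real cluster would distribute the chunks across networked nodes with a framework such as MPI or Slurm, but the principle is the same. All names here are illustrative.

```python
from concurrent.futures import ThreadPoolExecutor

def process_chunk(chunk):
    # Stand-in for a compute-heavy task, e.g. one shard of a training dataset.
    return sum(x * x for x in chunk)

dataset = list(range(100_000))
n_workers = 4  # imagine each worker as one node in the cluster

# Split the dataset into chunks, one per worker ("node").
chunks = [dataset[i::n_workers] for i in range(n_workers)]

with ThreadPoolExecutor(max_workers=n_workers) as pool:
    partial_results = list(pool.map(process_chunk, chunks))

# Combine the partial results, as a cluster's head node would.
total = sum(partial_results)
```

Because each chunk is independent, adding more workers (or nodes) shortens the wall-clock time without changing the final result.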
The most anticipated cluster computing solution in 2024 is probably NVIDIA’s GB200 NVL72, a rack-scale exascale AI supercomputer that runs on 36 NVIDIA Grace CPUs and 72 NVIDIA Blackwell GPUs. The processors are connected through NVIDIA’s proprietary NVLink-C2C (Chip-to-Chip) interconnect, which delivers 900 GB/s of bidirectional bandwidth, while the individual nodes are connected through NVIDIA’s NVLink switch system. It is worth noting that the Grace Blackwell Superchip—and its predecessor, the Grace Hopper Superchip, as well as the AMD Instinct™ MI300A APU—exemplify the revolutionary designs that are meant to tackle the demanding requirements of AI and HPC. By integrating different types of chips into one package and applying their strengths to the disparate aspects of a single task, this new breed of processors can handle supercomputing workloads that are unprecedented in human history.
GIGABYTE Technology, a leading provider of AI server solutions, can help customers set up their own computing clusters. From server-level and rack-level use cases, in which GIGABYTE assists clients in deploying clusters for biomedical studies, semiconductor research, cloud computing, and more; to GIGABYTE’s data center-level multi-rack cluster computing solution, the GIGAPOD, which comprises up to nine server racks interconnected to form a cohesive computing unit—GIGABYTE has the products and experience to make sure customers can benefit from the latest advancements in data center technologies. In the next sections, we will illustrate the different iterations of cluster computing with concrete examples so you can decide which products are most suitable for your AI data center.
Server and Rack-level Cluster Computing: Two Case Studies
GIGABYTE can combine multiple servers into a cluster based on the user’s budget and requirements. The cluster can be managed with the client’s own software or with GIGABYTE Management Console (GMC) and GIGABYTE Server Management (GSM), which are available for all GIGABYTE servers, free of charge. Here are two success cases that may offer an informative glimpse into how GIGABYTE can inject cluster computing into your IT infrastructure.
● Case #1
Rey Juan Carlos University (URJC) in Spain worked with GIGABYTE to build a computing cluster named “Talos” to study cellular aging mechanisms. Researchers employ AI algorithms and machine learning to detect patterns in medical big data and extract new insights. They also utilize spatial-temporal modeling and generative models in their work. Their requirements were threefold: first, state-of-the-art double-precision processors that can deliver results through “explainable AI”; second, the use of parallel computing to expedite the process; and third, scalability in terms of both computing and storage. Based on these needs, the GIGABYTE team assembled the ideal AI cluster for the customer.
A part of the Talos computing cluster that GIGABYTE built for the University of Rey Juan Carlos. Not only does GIGABYTE have suitable products for different nodes in the cluster, but GIGABYTE also provides cluster management software, free of charge.
The result is a cluster made up of two R182-Z91 Rack Servers for computing, four G492-ZD2 GPU Servers to provide acceleration, one S451-3R1 Storage Server for data storage, and another R182-Z91 to serve as the “head” or “control” node of the cluster. Both the R182-Z91 and G492-ZD2 were selected for their dual-socket CPU design, which offers the maximum capacity of CPU cores and threads. The four GPU Servers were outfitted with NVIDIA HGX™ A100 8-GPU modules, which contain eight A100 GPUs with blazing-fast interconnects, putting hundreds of thousands of cores at the researchers’ disposal for double-precision computations and parallel computing. The S451-3R1 brought 36 3.5” SAS/SATA drive bays and six 2.5” hybrid NVMe/SATA/SAS drive bays into the mix for scalable storage, while the head node managed communication between the servers through the NVIDIA Quantum InfiniBand® networking platform. GIGABYTE’s GMC and GSM were installed on the servers alongside a combination of open-source software to present URJC with a complete and cost-effective cluster computing solution.
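As a rough sanity check on the claim of “hundreds of thousands of cores”, here is a back-of-the-envelope calculation; the per-GPU figure is NVIDIA’s published count of FP32 CUDA cores for the A100:

```python
gpu_servers = 4               # G492-ZD2 GPU Servers in the Talos cluster
gpus_per_server = 8           # NVIDIA HGX A100 8-GPU module
cuda_cores_per_a100 = 6912    # FP32 CUDA cores per A100, per NVIDIA's datasheet

total_gpus = gpu_servers * gpus_per_server             # 32 A100 GPUs
total_cuda_cores = total_gpus * cuda_cores_per_a100    # 221,184 cores
```

Thirty-two A100s together contribute well over 200,000 CUDA cores, consistent with the figure quoted above.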
● Case #2
In the case of the Advanced IC Lab at Taiwan’s National Yang Ming Chiao Tung University (NYCU), the academics needed a cutting-edge computing cluster to enable the efficient testing of integrated circuit (IC) designs. The GIGABYTE team put together a cluster that includes six H282-ZC1 High Density Servers for computing and two R282-Z91 Rack Servers for storage. Each of the High Density Servers contains four nodes, and each node supports dual processors, resulting in over 2,000 CPU cores interconnected through PCIe interfaces that deliver 128 GB/s of bandwidth for fast and stable connectivity. The two Rack Servers not only feature hundreds of terabytes of storage, but also 20 GB/s data transmission between the nodes through “bonding” network switches. As the cherry on top, the lab implemented its own server traffic control system for cluster management.
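A quick, hedged calculation shows how the core count adds up; the cores-per-CPU figure is a hypothetical value, since the exact processor SKU is not named here:

```python
servers = 6           # H282-ZC1 High Density Servers
nodes_per_server = 4  # four nodes per chassis
cpus_per_node = 2     # dual-socket design
cores_per_cpu = 48    # hypothetical SKU; the exact CPU model is not specified

total_sockets = servers * nodes_per_server * cpus_per_node  # 48 CPU sockets
total_cores = total_sockets * cores_per_cpu                 # 2,304 cores
```

With any modern high-core-count server CPU in those 48 sockets, the cluster comfortably exceeds the 2,000-core mark.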
The computing cluster that GIGABYTE built for Yang Ming Chiao Tung University allows up to 500 users to compute simultaneously and reduces the time it takes to test IC designs from multiple hours to a matter of minutes.
The upshot is that now 500 users can work simultaneously with the cluster. Intricate IC design testing, which used to take hours to run, can be completed in a matter of minutes. The lab is even looking to implement AI to aid in chip design. All this has been made possible thanks to the computing cluster built by GIGABYTE.
Data Center-level Cluster Computing: Introducing the GIGAPOD
Drawing from its wealth of experience in constructing computing clusters for clients in different industries, and with an eye toward optimizing compute power for data centers in the era of AI, GIGABYTE added the GIGAPOD to its line of AI solutions in 2023. The GIGAPOD is the summation of GIGABYTE’s cluster computing expertise distilled into a single data center-level solution: an AI development engine that combines dozens of servers and hundreds of processors to form a massive supercomputer capable of tackling the most demanding AI workloads, deployable either as an independent, standalone unit or as one of many nodes in a sprawling AI data center.
The GIGAPOD took center stage at COMPUTEX 2024. The nine racks contain 32 GPU Servers housing hundreds of advanced GPUs linked through blazing-fast interconnections, allowing them to compute as a single cohesive unit and tackle the most demanding AI workloads.
At the architectural level, the GIGAPOD is made up of 32 GIGABYTE GPU Servers of the same model type and internal configuration. Each server supports an 8-GPU acceleration module. The servers are usually installed in eight racks, four servers to a rack; however, thanks to GIGABYTE’s proprietary cooling technology, an air-cooled 5U (five rack units) server like the G593-SD1-AAX3 may support the 8-GPU module without any loss in performance, so the 32 servers may fit in just four racks to achieve minimal footprint and unrivaled compute density. One additional rack is used to house the control node for cluster management, as well as the storage nodes. This supporting rack is positioned at the exact center of a five- or nine-rack array to complete what is termed the “spine-leaf” architecture.
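The rack math can be sketched as follows; the eight-servers-per-rack figure in the dense layout is inferred from fitting 32 servers into four racks:

```python
servers = 32
gpus_per_server = 8
total_gpus = servers * gpus_per_server   # 256 GPUs per GIGAPOD

# Standard air-cooled layout: four servers per rack.
standard_racks = servers // 4            # 8 compute racks (+1 spine rack = 9 total)
# Dense layout enabled by GIGABYTE's cooling: eight servers per rack (inferred).
dense_racks = servers // 8               # 4 compute racks (+1 spine rack = 5 total)
```

Either layout delivers the same 256-GPU cluster; the dense configuration simply halves the floor space occupied by compute racks.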
If we look back at the previous examples of computing clusters, we can see that this setup is essentially a streamlined and modularized cluster. The head and storage nodes are concentrated in the central rack that acts as the “spine”, while the heavy-duty compute nodes are distributed among the “leaves” on either side of the spine. Switches at the top of the racks facilitate communication between the servers in the cluster (known as east-west traffic) and between the cluster and the outside world (north-south traffic). Identical GPUs and server models are used in the compute nodes to ensure maximum synergy, empowering the cluster to function as though it were one giant server or accelerator.
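A toy model of the spine-leaf fabric, with hypothetical switch names and counts, shows why east-west latency stays uniform across the cluster:

```python
from itertools import combinations

spines = ["spine-1", "spine-2"]              # switches in the central "spine" rack
leaves = [f"leaf-{i}" for i in range(1, 9)]  # one top-of-rack switch per compute rack

# In a spine-leaf fabric, every leaf uplinks to every spine,
# and leaves never connect to each other directly.
links = {(leaf, spine) for leaf in leaves for spine in spines}

def east_west_hops(a: str, b: str) -> int:
    """Switch hops for traffic between servers under two leaf switches."""
    return 0 if a == b else 2  # leaf -> any spine -> leaf

# Every pair of compute racks is exactly two switch hops apart,
# which keeps east-west latency uniform across the whole cluster.
assert all(east_west_hops(a, b) == 2 for a, b in combinations(leaves, 2))
```

This uniformity is what lets identical compute nodes behave like one giant accelerator: no pair of GPUs is topologically “farther” from any other pair.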
The GIGAPOD presents three additional added-value features for its users: bespoke GPU configurations according to customer needs, the choice of liquid cooling for even better performance and stability, and software suites for management and AI development.
Industry veterans familiar with GIGABYTE’s dedication to high-tech solutions and user experience will not be surprised to learn that GIGABYTE has included additional features in the GIGAPOD for extraordinary customer value. While this list is by no means exhaustive, here are three important benefits that put the GIGAPOD a grade above other cluster computing products on the market:
● Customizable options
Not only does GIGABYTE have a comprehensive portfolio of compute servers, storage servers, and servers for control nodes to choose from when assembling the GIGAPOD, but even the components in the nodes can be selected according to customer needs. To use GPUs as an example, clients may opt for NVIDIA HGX™ H100/H200/B100/B200 modules for their unmatched AI software ecosystem and NVLink interconnect technology, or they may go for the AMD Instinct™ MI300X for its excellent memory capacity and AMD Infinity Fabric™ interconnect, which can improve chip-to-chip transaction speeds. Intel® Gaudi® is a fresh alternative for workloads related to AI inference. Outside of the compute nodes, GIGABYTE works hand-in-hand with multiple suppliers to provide flexible choices for networking, storage, power distribution units (PDUs), and more. This wealth of options, coupled with the second feature we’ll get to in a moment, is why there are multiple configurations available for the GIGAPOD, ensuring that customers will always find their ideal solution.
● Advanced cooling
As already mentioned, GIGABYTE’s proprietary cooling technology allows GPU modules to fit inside air-cooled servers with incredibly compact form factors, resulting in the GIGAPOD’s industry-leading compute density—such as 32 air-cooled servers installed in just four 48U racks, not counting the additional “spine” rack for management and storage. GIGAPOD also supports advanced cooling technologies like direct liquid cooling (DLC), which infuses the servers with the potential for even better performance and stability. GIGABYTE works closely with verified partners to offer a complete solution covering everything from cold plates and leak sensor boards in the servers to manifolds and coolant distribution units (CDUs) at the rack level. The CDUs may be installed inside the rack or as a separate external unit. Rear Door Heat Exchangers (RDHx) may be installed to further improve energy efficiency.
● Management and AI development software
GIGABYTE works with its investee company, MyelinTek Inc., to offer a GIGAPOD management platform, GPM, which is loaded with features to provide an optimized data center solution. The platform includes a dashboard that puts device monitoring, workload allocation, cluster management, and one-click software or firmware upgrades at the operators’ fingertips. It even comes with a GUI that gives users a simulated view of the servers’ physical locations to better manage device health and respond to critical events and activities. MyelinTek also provides MLSteam, an MLOps platform developed to streamline AI development. It may be used with the GIGAPOD for additional functions such as GPU partitioning, flavor (hardware configuration) definition, and more.
Whether you would like to consider a data center-level solution like the GIGAPOD, or you wish to individually choose the servers and workstations that will make up your cluster, GIGABYTE can help you get a start on incorporating cluster computing into your IT infrastructure. The AI trend has ushered in many computing advancements that are here to stay. Learning to leverage them will upgrade your productivity and ensure that you retain your competitive advantage.
Thank you for reading GIGABYTE’s Tech Guide on “How to Get Your Data Center Ready for AI? Part Two: Cluster Computing”. We hope this article has been helpful and informative. For further consultation on how you can incorporate cluster computing in your data center, we welcome you to reach out to our representatives at marketing@gigacomputing.com.