04/14 2025
Produced by Zhineng Zhixin
In the realm of High-Performance Computing (HPC), Google Cloud has unveiled ambitious plans to disrupt the traditional scientific computing market with its latest H4D instances and NVIDIA Blackwell GPU-based A4 instances.
The H4D instances feature AMD's cutting-edge 5th generation 'Turin' Epyc 9005 processors, boasting up to 12 TFLOPS of FP64 performance—a substantial leap from previous generations. These instances also leverage the Titanium Offload Engine and Falcon Transport Layer to achieve a 200 Gb/s low-latency network, significantly enhancing the operational efficiency of HPC workloads.
Meanwhile, the A4 instances deliver 720 PFLOPS of AI computing power at FP8 precision, addressing the needs of both AI and HPC. By offering a flexible cloud-based solution that leverages high-performance CPUs and network technology, Google aims to attract HPC centers with limited budgets, providing an alternative to traditional procurement methods.
In this article, we delve into how Google Cloud is breaking through budget and performance bottlenecks in the HPC market through hardware and network innovations. We analyze the technical architecture, performance advantages, and market positioning of the H4D instances, and discuss their potential impact on the traditional HPC ecosystem.
Part 1
Technical Architecture and Innovations of H4D Instances
● Hardware Core: AMD Turin Epyc 9655 Processor
At the heart of the H4D instances lies AMD's 5th generation 'Turin' Epyc 9005 series processors, specifically the dual-socket Epyc 9655. Each processor boasts 96 Zen 5 cores, totaling 192 physical cores, with Simultaneous Multithreading (SMT) disabled to optimize HPC performance. This design avoids thread contention in compute-intensive tasks, ensuring efficient operation of each core.
◎ Processor Performance: The Epyc 9655 employs the Zen 5 architecture, whose large L3 cache is well-suited for cache-sensitive HPC applications such as fluid dynamics (OpenFOAM), molecular dynamics (GROMACS), and weather simulation (WRF). In the High-Performance LINPACK (HPL) benchmark, the H4D instances achieve 12 TFLOPS of FP64 performance, a 5x improvement over the earlier C2D instances and a 1.8x boost over C3D instances. Single-core performance in the HPL test is about 40% higher than that of Intel's Golden Cove cores, highlighting Zen 5's advantage in floating-point operations.
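As a sanity check, the 12 TFLOPS HPL result can be compared against a theoretical peak. The sketch below assumes Zen 5's full 512-bit datapath (two 512-bit FMA pipes, i.e. 32 FP64 FLOPs per core per cycle) and the Epyc 9655's 2.6 GHz base clock; both are architectural assumptions, not figures from the source.

```python
# Back-of-envelope check of the 12 TFLOPS FP64 HPL figure.
# Assumptions (not Google-published numbers):
#   - 32 FP64 FLOPs per core per cycle (2 x 512-bit FMA pipes)
#   - 2.6 GHz base clock on the Epyc 9655
cores = 192                     # dual-socket Epyc 9655, SMT off
flops_per_core_cycle = 32       # 8 doubles x 2 ops (FMA) x 2 pipes
base_clock_hz = 2.6e9

peak_tflops = cores * flops_per_core_cycle * base_clock_hz / 1e12
hpl_efficiency = 12.0 / peak_tflops
print(f"theoretical peak: {peak_tflops:.1f} TFLOPS, HPL efficiency: {hpl_efficiency:.0%}")
```

Sustaining roughly three quarters of base-clock peak in HPL is in the plausible range for well-tuned dense linear algebra, which makes the quoted 12 TFLOPS internally consistent under these assumptions.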
◎ Memory and Storage Configuration: The H4D offers three configurations: 720 GB of main memory, 1488 GB of main memory, and 1488 GB of memory paired with 3.75 TB of local NVMe flash storage. This high memory capacity supports rapid access to large datasets, making it ideal for memory-intensive tasks like astrophysical simulations or genomic analysis. The local flash storage provides high throughput for temporary data storage, reducing reliance on external storage systems and enhancing I/O efficiency.
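As a rough sizing aid, the three configurations can be captured in a small lookup table; the configuration names and selection rule below are illustrative, not actual Google Cloud machine-type names.

```python
# Illustrative sizing helper for the three H4D configurations described
# above. The "name" strings are hypothetical labels, not real SKUs.
H4D_CONFIGS = [
    {"name": "h4d-720",       "mem_gb": 720,  "local_ssd_tb": 0.0},
    {"name": "h4d-1488",      "mem_gb": 1488, "local_ssd_tb": 0.0},
    {"name": "h4d-1488-lssd", "mem_gb": 1488, "local_ssd_tb": 3.75},
]

def pick_config(working_set_gb: float, needs_scratch: bool):
    """Return the smallest configuration whose memory holds the working set,
    optionally requiring local NVMe scratch space; None if nothing fits."""
    for cfg in H4D_CONFIGS:
        if cfg["mem_gb"] >= working_set_gb and (not needs_scratch or cfg["local_ssd_tb"] > 0):
            return cfg
    return None

print(pick_config(900, needs_scratch=False)["name"])  # a 900 GB dataset needs >720 GB
```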
● Network Technology: Breakthroughs with Titanium and Falcon
The H4D instances introduce Google's Titanium Offload Engine to HPC scenarios for the first time, enabling low-latency communication via a 200 Gb/s Cloud RDMA network, which significantly optimizes the performance of distributed HPC tasks.
◎ Titanium Offload Engine: Titanium employs a two-stage offload architecture that moves network processing off the host CPU, freeing computational resources for core tasks. Compared to traditional RoCE v2, Titanium's Cloud RDMA achieves higher throughput and lower latency through hardware acceleration, making it particularly well suited to HPC applications with frequent inter-node communication.
◎ Falcon Transport Layer: Falcon is Google's newly introduced hardware-assisted transport layer that moves transport functions from software to network card hardware, supporting RDMA and NVM-Express protocols. Falcon is binary-compatible with Ethernet and InfiniBand protocols, allowing traditional HPC applications to run without recompilation, which is crucial for HPC centers relying on protocols like MPI (Message Passing Interface).
In OpenFOAM and STAR-CCM+ tests, Cloud RDMA over Falcon significantly improved communication efficiency between virtual machines, outperforming traditional Ethernet transports.
◎ Network Performance Validation: In the STREAM Triad test, the H4D's memory bandwidth was about 30% higher than the C3D instances, showcasing the Turin chip's advantage in data-intensive tasks. In distributed tasks, the synergy of Cloud RDMA and Falcon reduced cross-node communication latency by about 20%-30%, leading to notable overall performance improvements. For instance, in WRF weather simulations, the H4D's runtime was reduced by about 40% compared to the C3D, demonstrating the practical value of network optimization.
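How much of the overall gain can network latency alone explain? A toy two-component runtime model (compute plus communication) makes this concrete; the 30% communication share assumed below is illustrative, not a measured figure from the source.

```python
# Toy runtime model: total runtime = compute part + communication part.
# Shows what a 20-30% cut in communication time alone contributes to
# overall speedup. The 30% communication share is an assumption.
def speedup(comm_fraction: float, comm_reduction: float) -> float:
    """Overall speedup when only the communication portion gets faster
    (normalized so the original total runtime is 1.0)."""
    new_runtime = (1 - comm_fraction) + comm_fraction * (1 - comm_reduction)
    return 1 / new_runtime

for cut in (0.20, 0.30):
    print(f"{cut:.0%} lower comm latency -> {speedup(0.30, cut):.3f}x overall")
```

Under this assumption, a 20-30% latency cut yields only a roughly 6-10% overall speedup, suggesting that most of the 40% WRF improvement comes from the CPU and memory-bandwidth upgrades, with the network optimizations contributing the remainder.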
The H4D surpasses its predecessors, the C2D and C3D instances, in both performance and efficiency. The C2D, based on AMD's Zen 3 architecture, is limited in memory bandwidth and compute throughput, while the C3D, despite using AMD's 4th generation 'Genoa' Epyc, falls short in core count (88 cores) and single-core performance compared to the H4D.
● The H4D stands out in various HPC workloads:
◎ Molecular Dynamics (GROMACS): Performance improved by about 50% over C3D.
◎ Fluid Dynamics (OpenFOAM): Operational efficiency increased by approximately 45%.
◎ Weather Simulation (WRF): Performance boosted by about 40%.
These enhancements are attributed to the H4D's higher core density, optimized memory bandwidth, and low-latency characteristics of the Titanium/Falcon network.
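Reading "improved by X%" as a (1 + X)x throughput gain, the relative runtimes work out as follows; this is a sketch of one interpretation, since the source does not define the metric precisely.

```python
# Converting the reported per-workload gains over C3D into runtime
# ratios, interpreting "improved by X%" as a (1 + X)x speedup.
gains = {"GROMACS": 0.50, "OpenFOAM": 0.45, "WRF": 0.40}

for app, g in gains.items():
    runtime_ratio = 1 / (1 + g)  # H4D runtime as a fraction of C3D runtime
    print(f"{app}: {1 + g:.2f}x faster, runtime {runtime_ratio:.0%} of C3D")
```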
Part 2
Google Cloud's HPC Strategy and Market Impact
● Targeting HPC Centers with Limited Budgets
Due to budget constraints, HPC centers often prefer building their own x86 clusters, spreading costs over several years. However, cloud services offer instant scalability, which becomes a key advantage for time-sensitive or ultra-large-scale computing tasks.
Google addresses this need with the H4D instances, offering high-performance CPU instances for traditional HPC workloads that have not yet been ported to GPUs.
◎ Cost-Benefit Analysis: Based on previous pricing, we estimate the on-demand price of H4D instances (192 cores, 12 TFLOPS) to be approximately $7.8777 per hour, with an annual rental cost of about $69,056 and a cost per TFLOPS of $5755. While this is higher than GPU instances per unit of computing power, the H4D's 1488 GB of memory and 3.75 TB of local storage make it more suitable for memory-intensive HPC tasks, and it requires no modification of existing code, reducing migration costs.
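The quoted figures can be reproduced with straightforward arithmetic (8,766 hours is one Julian year of continuous operation):

```python
# Reproducing the cost estimate in the text: on-demand hourly price
# scaled to a year of continuous use, then normalized per FP64 TFLOPS.
hourly_usd = 7.8777
hours_per_year = 8766          # 365.25 days x 24 h
fp64_tflops = 12

annual_usd = hourly_usd * hours_per_year
usd_per_tflops = annual_usd / fp64_tflops
print(f"annual: ${annual_usd:,.0f}, per TFLOPS: ${usd_per_tflops:,.0f}")
# -> annual: $69,056, per TFLOPS: $5,755
```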
◎ Flexibility and Applicability: The H4D's three configurations (720 GB, 1488 GB, 1488 GB + 3.75 TB) provide HPC centers with diverse options. Cloud services' on-demand scalability also allows HPC centers to avoid the costs of hardware maintenance and data center operations, making it particularly suitable for temporary or sporadic tasks.
Google has simultaneously launched the A4 and A4X instances, based on the NVIDIA Blackwell B200 GPU, offering 72 PFLOPS (8 GPUs) and 720 PFLOPS (72 GPUs) of FP8 performance, respectively. This dual-track layout enhances Google Cloud's appeal to the HPC market, addressing both traditional CPU workloads with H4D and the convergence of AI and HPC with A4/A4X.
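Dividing the quoted aggregate FP8 figures by GPU count gives the implied per-GPU throughput; note that the per-GPU number works out slightly higher for the 72-GPU A4X. This is plain arithmetic on the article's figures, not an NVIDIA specification.

```python
# Per-GPU FP8 throughput implied by the quoted A4 and A4X figures.
instances = {
    "A4":  {"gpus": 8,  "fp8_pflops": 72},   # 8x Blackwell B200
    "A4X": {"gpus": 72, "fp8_pflops": 720},  # 72-GPU rack-scale configuration
}

per_gpu = {name: inst["fp8_pflops"] / inst["gpus"] for name, inst in instances.items()}
for name, pflops in per_gpu.items():
    print(f"{name}: {pflops:.0f} PFLOPS FP8 per GPU")
```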
For example, molecular dynamics can leverage the FP64 performance of H4D, while machine learning-driven material simulations can benefit from the FP8 computing power of A4.
In the HPC cloud services market, Google Cloud faces competition from AWS and Azure. However, Google's innovations in network technology provide it with a differentiated advantage. The low latency and high compatibility of Titanium and Falcon lower the migration threshold for HPC applications, potentially attracting more academic and research institutions.
While the acceptance of cloud services by HPC centers is constrained by budgets and cultural inertia, Google needs to further reduce barriers through competitive pricing and ecosystem support, such as optimizing open-source HPC toolchains.
The narrowing FP64 performance gap between AMD CPUs and NVIDIA GPUs reflects GPU designs' growing bias toward low-precision AI arithmetic; CPUs' dominance in traditional FP64-heavy HPC is unlikely to be shaken in the short term.
By deploying H4D and A4 in a coordinated manner, Google Cloud not only responds to the real-world needs of HPC centers but also paves the way for the convergence of AI and HPC. In the long run, this strategy may drive more institutions to shift from on-premises clusters to the cloud, accelerating the digital transformation of scientific research.
Summary
With the launch of the H4D and A4 instances, Google Cloud has demonstrated its far-reaching strategy in the HPC market. The H4D, with its AMD Turin Epyc 9655 processors, 1488 GB of memory, and low-latency communication via the Titanium/Falcon network, provides a cost-effective cloud solution for HPC centers with limited budgets. The A4 instances, with 720 PFLOPS of FP8 computing power, cater to both AI and HPC needs, showcasing Google's insight into hybrid computing scenarios. Compared to AWS and Azure, Google's innovations in network optimization and compatibility give it a competitive edge, but it must keep investing in competitive pricing and ecosystem development.