Google Cloud’s Latest Releases Drive Innovation For HPC and AI/ML Workloads; Come Visit Google Cloud at ISC 2024
Google Cloud is rapidly expanding its portfolio of solutions for high performance computing (HPC), highlighted by a flurry of announcements at its latest event, Google Cloud Next ‘24. Conference attendees were introduced to new products and updates across the company’s highly customizable lineup designed specifically for the HPC community.
H3 VMs
Among the updates are several new benchmark findings, including those for Google Cloud’s H3 Virtual Machine Series. The H3 VMs enable HPC workloads for applications like climate modeling, scientific computing, engineering simulation, and more. Built on Intel’s 4th generation Xeon platform, H3 VMs combine compute, networking, and storage into one HPC-optimized platform.
Recent industry-standard benchmark results show H3 VMs offer up to 3x improvement in per-node performance, improved scalability for multi-node workloads, and up to 2x better price performance than the company’s prior-generation C2 instances.
A3 VMs Shine in MLPerf Inference v4.0
Turning to AI performance, Google Cloud’s A3 VMs showed impressive results in the latest MLPerf Inference v4.0 benchmark testing. A3 VMs are designed for training sophisticated AI models like LLMs and combine NVIDIA H100 GPUs with Google’s leading networking technology.
Google submitted 20 results across seven models for MLPerf, including Stable Diffusion XL and Llama 2 (70B) using A3 VMs. All results were within 0-5% of the peak performance demonstrated by NVIDIA’s submissions.
Parallelstore
Part of Google Cloud’s HPC Toolkit, Parallelstore is a storage system based on Intel’s open source DAOS project. Parallelstore optimizes resources for data-intensive AI/ML workloads by eliminating redundant data storage, reducing costs and idle GPU time.
This service is currently in private preview, but the latest benchmarks from Intel show there is reason to anticipate a wider release: in a distributed Google Cloud environment, Parallelstore demonstrated 96 GiB/s read and 60 GiB/s write throughput, with IO latencies as low as 0.28 ms for random reads and 0.36 ms for random writes.
Cloud HPC Toolkit Additions: Blueprints for ML and CAE
There are two intriguing new blueprints within the Cloud HPC Toolkit. The first is a blueprint for ML workloads (including LLM training) that allows users to spin up an HPC system running on A3 VMs with NVIDIA H100 Tensor Core GPUs. Such systems demand attentive management of infrastructure and network configuration, and the Cloud HPC Toolkit ML blueprint handles this through components including the open source Slurm scheduler, a fully managed Filestore, pre-configured user environments, and more.
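To give a sense of what a Toolkit blueprint looks like, the sketch below composes a network, a Filestore mount, and a Slurm partition in the Toolkit's YAML blueprint format. This is an illustrative sketch only, not the published ML blueprint: the module paths and settings shown here are hypothetical and should be checked against the Toolkit documentation.

```yaml
# Illustrative sketch only — module sources and settings are hypothetical,
# not copied from the published ML blueprint.
blueprint_name: ml-cluster-sketch

vars:
  project_id: my-project        # replace with your GCP project
  deployment_name: ml-cluster
  region: us-central1
  zone: us-central1-a

deployment_groups:
- group: primary
  modules:
  - id: network                 # VPC network for the cluster
    source: modules/network/vpc

  - id: homefs                  # fully managed Filestore mounted at /home
    source: modules/file-system/filestore
    use: [network]
    settings:
      local_mount: /home

  - id: a3_partition            # Slurm partition of A3 (H100 GPU) VMs
    source: community/modules/compute/schedmd-slurm-gcp-v5-partition
    use: [network]
    settings:
      partition_name: a3
      machine_type: a3-highgpu-8g
```

A blueprint along these lines would typically be expanded into a Terraform deployment with the Toolkit's `ghpc create` command; the official ML blueprint adds further components such as pre-configured user environments.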
The second new solution is a blueprint for computer-aided engineering (CAE). CAE workloads are compute-intensive applications including structural analysis, fluid dynamics, thermal analysis, and electromagnetic analysis. The innovative CAE Reference Architecture blueprint harnesses the power of the H3 and C3 VM families to deliver robust performance for major CAE software such as Ansys Fluent and Siemens Simcenter STAR-CCM+, ensuring efficient handling of memory-intensive workloads and complex resource management.
Customer Success Story: Stanford University
Stanford’s Doerr School of Sustainability is leveraging Google Cloud’s HPC Toolkit to meet the growing demands of its researchers. The toolkit’s flexible deployment options allow Stanford to seamlessly integrate cloud computing with on-prem resources, providing a consistent and familiar user interface through Chrome Remote Desktop. This approach allows researchers to access interactive nodes remotely while maintaining an experience akin to using on-premises clusters.
As a testament to the unmatched customization the HPC Toolkit offers, the school has developed its own modules for secure and efficient use of Vertex AI instances for code development.
Robert Clapp, a Stanford senior research engineer, explains how HPC Toolkit enables fast, secure, and consistent HPC deployment at scale: “With the Toolkit, we can stand up clusters with different partitions depending on our users’ needs, so that they can take advantage of the latest hardware like NVIDIA GPUs when needed and leverage Google Cloud’s workload-optimized VMs to reach price-performance targets. Dynamic cluster sizes, the ability to use spot VMs when appropriate in cluster partitions, and the ability to quickly get researchers up and running in environments they are used to have all been enhanced by the Toolkit.”
Visit Google Cloud at ISC 2024
The rapid pace of innovation shows this is an exhilarating time for HPC customers. Coming right on the heels of an inspiring Google Cloud Next ‘24 just a few weeks ago is another major event: ISC High Performance 2024 in Hamburg, Germany. The May 12-16 conference and exhibition will highlight the latest advances in HPC, machine learning, data analytics, and quantum computing. The Google Cloud team will be there to connect with the HPC community and demonstrate its continually expanding innovations in HPC. Visit Booth D19 to learn more.